ION08 Conference – Scaling on a Dime Panel

by Darius Kazemi on May 15, 2008

in conferences, infrastructure, ION 2008

These are my notes from the Scaling On A Dime panel at ION 2008. The participants (along with their letter abbreviations) were as follows:

Marty Poulin (M) (moderator)
Joe Ludwig (J)
Victor Jimenez (V)
Brian Hafer (B)
Larry Mellon (L)

Any mistakes or misinterpretations are my own damn fault. Don’t blame these guys. My comments in square brackets.

M: Intros?

L: Been in distributed computing for over 20 years, research programmer for 10 years in the early days of microcomputers and Ethernet, found that dev challenges were hard, so focused on tools, supercomputing, virtual world researcher for DARPA, then joined EA and worked on The Sims Online, bringing military production expertise in. Then cofounded Emergent Game Technologies.

J: Been at Flying Labs for 8.5 years, all the way through to production of Pirates of the Burning Sea.

B: Got into online long before games, worked at startups before working at Emergent on a metrics product, currently at TubeMogul dealing with rapid scaling issues for an online service in the web 2.0 space.

V: Work for Northrop Grumman. Does distributed tech for simulations; not interested in the dime since it's gov't, but TIME is a big factor in implementation.

M: The reason we're talking about scaling on a dime instead of random testing stuff is that there's a new project lifecycle: prototype, beta, iterate till traction, scale or fail! Minimize time and costs, ensure scale, and have good quality of service. Have to scale people as well as infrastructure: devs, tools, pipeline, process. Licensing vs. development: open source (OS) and middleware pros and cons. OS/MW can be faster/cheaper, might be better than what you can build, reduced maintenance, but could also create vendor dependence.

What are the processes you use to pick middleware or OS?

L: Sometimes different pieces of middleware are built with different assumptions in mind, so integration errors become a very big problem as we use more and more pieces of software. Do your scaling during testing as opposed to scaling live with users. For OS, find groups with momentum and not much politics. You want to know where your middleware is going to be not just now but in 5 years when you launch.

J: We had a similar problem. Our graphics engine provider went bankrupt 6 months after we licensed from them. We thought we were close to launch, so we stuck with them; the game turned out to take 4.5 more years, all the while with no vendor in existence! You need to do due diligence with vendors, and also do code escrow so you can get source code if they go out of business.

B: In the OS space, there are new projects every day, you can always find what seems to be a perfect fit niche project, but they won’t necessarily be mature and robust. Look at the reliability record. Public release or beta software? I’d recommend not adopting anything except the more well-backed, tried-and-true projects.

L: I got burned once by not investigating the testing practices of the OS project, turned out that the engineer just ran it a few times on his desktop to test.

V: We find we have very long-term projects, 15-20 years [wow!]. When we evaluate a project, we have to look at ease of use of the package, documentation, etc. Not so obvious is that you have to look at the metaphor behind it: if the metaphor doesn't fit you end up writing a lot of glue code. Find a project that is going in the same direction as you. You may go through lots of evals, but in the long run it really pays off. Make sure it's multithreaded before you license it! Other minor ones: make sure the source you're getting from OS is actually readable so you can debug.

J: Unreadable source is not unique to OS. We licensed a commercial graphics engine that had copy and paste code with one line in each copy that was 1600 characters long!

B: Even in the best case, when the source is good, tracking down bugs is still difficult and a sinkhole of time and money, because you may not understand the architecture. Source code is not a silver bullet.

J: But you don’t usually get source code to change the middleware. You get it to debug or trace: when you get a middleware error it’s usually your fault but you still have to be able to trace the code to learn what you did wrong.

L: This is why you need test rigs for the entire system as well as individual modules so you can very easily track down the source of the problem.

M: It’s not only about whether the code is good, but what about support? Any nightmares?

L: At EA we bought a very expensive build distribution package. 8 builds being sent around to about 120 devs, 200 QA, and a bunch of executives. We were pushing so much data around at such high rates, they’d never anticipated this in the distribution package so it couldn’t handle it. We had to reengineer their system on the fly!

J: Our support has been pretty good, but before we started on PotBS we were evaluating different engines. We asked one middleware provider if we could visit to evaluate their engine. We said, “Okay, let’s go this month.” They replied, “We’re too busy this month shipping Unreal Tournament. You have to visit next month.” Big red flag that they won't support us in a time of need. [Epic has since hired a whole department to support middleware so in theory this doesn't happen anymore, but see the Silicon Knights lawsuit for claims to the contrary.]

M: What about legal issues with licensing?

V: I work for the government. Part of the problem is that the gov't noticed that games are an interesting technology. Serious games, RFIs, RFPs, etc. that deal with game technology. That's a new infusion of money and talent, great. However, the gov't has very interesting ideas about what intellectual property (IP) means. One thing that has caught many people unawares is that if the gov't gives you money and you develop something with their money, it is now the government's IP to do with as they will. So you make your MMO, spend $3M of your money on it, the gov't gives you $200k to put in one feature, you deliver, and the gov't takes the whole thing and distributes it to anybody they want. The gov't says, “Hey, your IP is infected with our rights now. Nyah nyah.” Catches large companies as well as small.

L: Licensing issues in OS are slowing down its adoption. I've had to turn down promising packages because of how the license would infect the rest of our middleware. I've heard of game teams that had to pull OS out of games because lawyers were afraid of launch-time security exploits due to the OS package having exposed source [which is kind of weird, but whatev].

M: What do you like in OS?

L: Simple things. GCC. Anything that's a tool is very safe to use. Lawyers are concerned with software you release to customers, but in-house use is fine. [what about in-house infrastructure that is part of the production environment, such as any messaging systems etc.?]

J: Not so much OS, but we like PathEngine, Rad Game Tools on the middleware side.

M: How do we actually get things to scale? Selecting components is fine but it doesn’t mean they’re going to scale to the levels we’re going to push them to. What role does testing play in the evaluation of this software?

L: I've grown to become a big proponent of testing. Started with automated testing in the 80s on supercomputers. I talked about test rigs for components, but you also need automated testing of the system as a whole. The middleware network transport we used had an individual test that showed it worked with 15k clients under normal loads, but integrated with the whole system it only supported 200 clients!!

J: Testing has worked well for us. We have a test dev who does nothing but test tools and automation. We learned from Larry’s GDC talk [in 2003]. We have lots of test servers identical to production servers but with client bots that hit them with load.
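
[A minimal sketch of what a client-bot load rig like that might look like, in Python; GameClient, login(), and do_random_action() are hypothetical stand-ins for whatever your own client library exposes:]

    # load_bots.py -- spin up N scripted clients against a test cluster.
    # GameClient and its methods are hypothetical stand-ins for your own
    # client library; the point is the shape of the rig, not the calls.
    import threading, random, time

    BOT_COUNT = 500
    TEST_SERVER = "test-cluster-01.example.com"

    def run_bot(bot_id):
        client = GameClient(TEST_SERVER)          # hypothetical client library
        client.login("loadbot%04d" % bot_id)
        end = time.time() + 60 * 30               # 30-minute soak
        while time.time() < end:
            client.do_random_action()             # move, chat, trade, etc.
            time.sleep(random.uniform(0.5, 2.0))  # rough human pacing
        client.logout()

    threads = [threading.Thread(target=run_bot, args=(i,)) for i in range(BOT_COUNT)]
    for t in threads: t.start()
    for t in threads: t.join()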

V: From day one we were an extreme programming shop. We always follow a test-first philosophy. Every single line of code has a pass/fail unit test and is run through the test cases. We have a suite of tests which involves people manually running through stuff in specialized virtual cockpits, and above that we have actual operational equipment to test on. By the time it leaves the lab it's well tested. Too many times we've been bitten when OS or middleware is buggy and won't be fixed by the provider for a while. Don't trust the provider.

L: Test suites are great but they're still code, so there can be bugs in your test suites. Calibrate the accuracy of the test system and the game itself. Measuring the level of nondeterminism in your test system gives you an idea of the accuracy of your tests!! [dude hells yeah]
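
[A sketch of one way to calibrate that, assuming a hypothetical run_scenario() hook into your test rig: replay the same fixed scenario several times and see how far the recorded metrics drift between runs.]

    # Replay one fixed scenario N times and measure how much the results vary.
    # run_scenario() is a hypothetical hook that returns a metric dict,
    # e.g. {"frames": 1800, "rpc_calls": 5231, "final_gold": 250}.
    RUNS = 10
    results = [run_scenario("trade_route_test", seed=42) for _ in range(RUNS)]

    for key in results[0]:
        values = [r[key] for r in results]
        spread = max(values) - min(values)
        print("%-12s min=%s max=%s spread=%s" % (key, min(values), max(values), spread))
    # A large spread on a supposedly deterministic scenario means the test
    # system itself is noisy, and failures smaller than that noise floor
    # can't be trusted.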

M: What are some rules to fix the scaling problems?

J: Avoid single points of failure, obviously. We try to make it so we can bring additional resources to bear on any of the problems we face. The front line that talks to the client is a set of servers that can scale independently, which also increases reliability because they're independent processes.

B: Build a service-oriented architecture rather than a monolithic, tightly coupled system. Be sure to define your interface points and where your components will talk to each other so you can scale across processing power; then the problem becomes data access shared across tons of users (high availability).
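
[A toy illustration of that kind of boundary, with an invented inventory service: callers only ever see the narrow interface, so the implementation behind it can later move to its own tier without touching callers.]

    # Callers depend only on this narrow interface, never on the database or
    # process behind it, so the service can be split out and scaled on its own.
    # All names here are invented for illustration.
    class InventoryService(object):
        def get_items(self, character_id):
            raise NotImplementedError
        def add_item(self, character_id, item_id, count):
            raise NotImplementedError

    class LocalInventoryService(InventoryService):
        # In-process implementation: fine for a single-server prototype.
        def __init__(self, db):
            self.db = db
        def get_items(self, character_id):
            return self.db.query_items(character_id)
        def add_item(self, character_id, item_id, count):
            self.db.insert_item(character_id, item_id, count)

    class RemoteInventoryService(InventoryService):
        # Same interface, but forwards calls to a dedicated inventory tier.
        def __init__(self, rpc_client):
            self.rpc = rpc_client
        def get_items(self, character_id):
            return self.rpc.call("inventory.get_items", character_id)
        def add_item(self, character_id, item_id, count):
            self.rpc.call("inventory.add_item", character_id, item_id, count)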

L: Metrics can help find where that bottleneck is and who's using the data, how often, at what frequency, to see whether it's a time-dependent bottleneck (which is harder) or not. You can architect around these bottlenecks if you understand them.
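
[Even a crude counter gets you most of the way here; a sketch of the sort of per-resource access log he means, with all names invented for illustration:]

    # Count who touches a shared resource, how often, and when, so you can
    # tell a steady bottleneck from a time-dependent spike. Names invented.
    import time
    from collections import defaultdict

    access_counts = defaultdict(int)     # (caller, resource) -> total hits
    access_by_minute = defaultdict(int)  # (resource, minute bucket) -> hits

    def record_access(caller, resource):
        access_counts[(caller, resource)] += 1
        access_by_minute[(resource, int(time.time() // 60))] += 1

Dumping those two tables after a load test shows which system is hammering which piece of data, and whether the load is constant or clustered at particular times (the harder, time-dependent case).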

J: We have things on PotBS that are a little more monolithic than they should be. Limits our ability to scale. We have to have more clusters than we might otherwise need because some systems need to be just so. Very hard to do this stuff after you’re up and running.

M: Coupling is a lesson learned by experienced engineers.

B: Verify your engineers actually follow your architecture plans!!

L: We forced our engineers to avoid coupling. They broke the architecture because they went through DLL boundaries.

J: Have a test mode where nothing is ever on the same server, so that every call that could be a remote call is forced to be a remote call; that way you have to architect it right.
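
[A sketch of that kind of switch, assuming a hypothetical dispatch layer: in the test configuration every inter-service call goes over the wire even if both services happen to live in the same process, so hidden same-machine assumptions break early.]

    # In normal mode a co-located service may be called in-process; in
    # "force remote" test mode every call goes over the wire, so any code
    # that secretly assumes shared memory or same-host latency fails fast.
    # lookup_local() and send_over_network() are hypothetical stand-ins.
    FORCE_REMOTE = True   # flip on for the test cluster

    def call_service(service_name, method, *args):
        service = lookup_local(service_name)            # hypothetical registry
        if service is not None and not FORCE_REMOTE:
            return getattr(service, method)(*args)      # cheap in-process call
        return send_over_network(service_name, method, args)  # real RPC path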

M: Up to this point, when we're putting together server architectures for MMOs, in order to figure out our infrastructure we have to take an educated guess. We look at the competition, the market size, the interest that's out there, and we make a guess. We hope that when people arrive, we're right. If we're wrong, we either spent too much money and go out of business, or we didn't scale enough, our services don't work, and we're out of business.

Animoto, a video service, was fine with 50 server instances and then overnight they shot up to 3000 server instances! Any rules of thumb to figure out what you’re shooting for?

L: Run extended load test derived from actual beta test numbers. But you can’t really guess how many users you’re going to get.

J: We can take a pretty good guess as to how many users a cluster can support but we can’t predict the number of users.

M: We’ve all had experience with the epic fail. From my standpoint, when we released we didn’t even have the right tested bandwidth for WW2 Online.

L: We had the opposite failure. We hit 100k players in 2 or 3 months but we’d projected 1M users and bought hardware for 1M users in 4 months.

V: I know ahead of time how many users I will have. I know where they're located, etc. But I never really know what they're going to do. These are virtual cockpits for new airplanes and we literally don't know what they're going to do. The gov't wants it to work perfectly, but I don't know what the bandwidth or processing requirements are, so I have to guess from the plane's features what the requirements will be.

M: With no good user profiling you can’t guess that.

L: You can actually change the game to tune the user behavior to lower server requirements.

J: Make the game less fun so it performs better? An alternate example: EQ1 got a feature where you could keep your character online as a vendor, but that meant there was WAY MORE CONCURRENCY without resulting in more play time.

M: A new thing on the horizon is the idea of scaling via cloud computing. Google, MS, Joyent; Amazon Web Services EC2 is the premier option right now. But there are lots of tradeoffs. Great because you can scale by throwing money at the problem, initially more services for less money, easy to set up. But you have less control, sometimes QOS issues, and higher costs at scale. What makes a cloud computing system suitable for a particular business?

J: One of the things people talk about is that you depend on your provider to be up all the time. For certain businesses this matters: Amazon just had many hours of downtime on S3 [yikes!], so people are concerned about uptime. The fact is that the uptime of Amazon or Google will completely blow away the uptime of random game studio X, so for a small studio it's not an issue, but for EA it might matter since EA can have very good uptime on its own.

B: Size and resources drive the decision. If you are small or moderate-sized, you can get great bang for the buck and better reliability. But if you're huge like EA, Amazon is charging a markup; you'd do better with your own economy of scale and get more control over your operations. We use Amazon at my current company to support scaling. It's very reliable, but there is no legal QOS guarantee. There have been incidents, such as with S3, that took even Amazon by surprise. They were unprepared to deal with it, and they were not prepared to communicate what was going on. But you can't beat it on the cost model.

L: Unless something goes wrong. Some guy got hacked, his BW bill went up–

B: This was a friend of mine who built a service that had an exploit, so someone was using his instances to run rogue processes; he got saddled with a $60k bill from Amazon at the end of the month!

L: When using these things, put your own triggers and alarms on service usage.
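
[Even a cron job can do this; a sketch assuming hypothetical get_month_to_date_cost() and send_alert() helpers that wrap your provider's usage report and your paging system:]

    # Run hourly from cron: page someone before the surprise bill arrives.
    # get_month_to_date_cost() and send_alert() are hypothetical helpers.
    SOFT_LIMIT = 2000.0    # dollars: warn
    HARD_LIMIT = 5000.0    # dollars: wake somebody up

    def check_usage():
        cost = get_month_to_date_cost()
        if cost >= HARD_LIMIT:
            send_alert("billing", "Month-to-date spend $%.2f exceeds hard limit" % cost)
        elif cost >= SOFT_LIMIT:
            send_alert("ops", "Month-to-date spend $%.2f past soft limit" % cost)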

M: Ensure your monetization model supports scale. More users may just put you out of business.

J: On the other hand, with Amazon in particular, you or a piece of code you control has to ask for extra resources. If you’re willing to let QOS suffer or have users in queues to keep costs down, you can do that if you want.

M: Typically with MMOs, rather than have bad QOS we put people in a queue. It says to customers not “we are bad” but “we are not available”.

J: If there are flash crowds or certain server resources are being taxed, we will queue people to get around that.

L: If you are trying to scale on a dime, maybe avoid hard problems like a seamless virtual world from the very start.

M: Each service has a different API, a different model, so to a certain extent there's vendor lock-in.

L: We use presentation layers where I expose what I need, write glue code behind that, and can evaluate new vendors.

B: With Google you’re tied into their API, but with Amazon it’s really more of a virtual machine you’re running. VM options are very flexible.

J: But there are tradeoffs. If you're using VMs with Amazon, you have to have an operations staff and you own the installs for the operating systems. With Google App Engine it's a higher layer: you're writing Python and it's running on their infrastructure, which you don't have to care about. You have to deploy builds but not operate hundreds of VMs.

M: This brings up the complexity issue. If we have 3000 VMs, is it any easier to go with a service than managing real or virtual servers yourself?

J: It’s a lot easier to go with Amazon because you can get a new VM set up in 5 minutes on a web form, as opposed to physically installing a new machine at the data center and all the associated overhead.

B: Services like Amazon force you to have horizontally scalable architecture right from the beginning. If you’re not architected right, more VMs won’t help you at all.

[Audience questions follow]

Q: Regarding scalability and quality: have any of you built a solution in the cloud that overcomes issues where servers die, and you spin up a new one without users noticing?

B: We use Amazon EC2, and you have to deal with instance failure. Minimize what you're storing on an instance. Use instances for computing power rather than data storage. S3 is permanent long-term storage, and they're launching something that's more like a local hard disk in the future. If you're storing data on the instance's disk, say running a DB server on EC2, you have serious problems.
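
[In practice that means treating the instance's disk as scratch space and pushing anything you care about to durable storage as soon as it's produced; a sketch with hypothetical expensive_computation() and put_durable() stand-ins:]

    # Instances are disposable: compute locally, persist anything you care
    # about somewhere durable (S3, a database off the instance) right away.
    # expensive_computation() and put_durable() are hypothetical stand-ins.
    import json

    def process_job(job):
        result = expensive_computation(job)
        put_durable("results/%s.json" % job.id, json.dumps(result))
        # If the instance dies after this point, nothing of value is lost.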

Q: What if one part of system is self hosted, one is in cloud? What about latency?

B: We’re not seeing a lot of latency problems between us and the cloud. We run data locally and processing in the cloud. You can put data in the cloud but you have to architect more robustly.

Q: You said infrastructure design is hard to do up front. What about heuristics on the best things to do to prepare for scaling? How do you plan for user peak?

J: More of a marketing question that they can’t answer very well. Preorder programs can’t really help.

M: In the end, nobody knows.

J: For traditional MMOs, you look at preorders and multiply by 2.6 for your initial launch-week population. But your preorder numbers can be too high or too low for a wide variety of reasons. We had difficulty getting preorder boxes on shelves, so our preorder program was delayed by a month and we couldn't really tell. If you have to go one way or the other: underbuy resources. Lines out the door are better than a big empty building.
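
[So, for example, 50,000 preorders would suggest roughly 130,000 launch-week players; a back-of-envelope sketch where the concurrency ratio and per-cluster capacity are made-up illustrative numbers:]

    # Back-of-envelope launch sizing from the 2.6x rule of thumb above.
    # The concurrency ratio and per-cluster capacity are invented examples.
    import math

    preorders = 50000
    launch_week_players = preorders * 2.6          # ~130,000 players
    peak_concurrency = launch_week_players * 0.15  # assume ~15% online at peak (a guess)
    players_per_cluster = 5000                     # whatever your load tests say

    clusters_needed = math.ceil(peak_concurrency / players_per_cluster)
    print(clusters_needed)                         # -> 4 clusters in this example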

M: One last point: as you scale with Amazon, they're competitive at lower scale, but as you scale up Amazon gets more expensive and you'll be better off doing it yourself.

J: Brian, do you have cost numbers for EC2?

B: Check out their website, it’s up there.
