Rule #1: Downtime is No Fun
In less than two years, wooga (world of gaming) has become the biggest developer of social games in Europe. Their engineering is guided by the philosophy that social games are a service – not products – and a keen focus on emotional characters, excellent usability, and effective localization in seven languages. By utilizing the social graph in game-play design elements, constantly testing and improving games, and releasing new updates every week, wooga has created some of the most popular games on Facebook. Monster World, with more than 1.1 million daily active users (DAU), is a Ruby on Rails application with a MySQL/Redis backend that’s hosted on Amazon EC2. Brain Buddies (200,000+ DAU) and Bubble Island (another 1 million DAU) are both run on PHP with MySQL at Slicehost. Its newest game, Happy Hospital (already over 200,000 DAU), is a RoR app with a Redis backend hosted on dedicated machines at Hetzner.
Brian Doll checks in with Jesper Richter-Reichhelm, Head of Engineering at wooga:
The games’ incredible popularity and the diversity of their environments pose several challenges for the engineering team. With so many users from around the world looking for a much-needed diversion, wooga can’t afford downtime. Yet, the sheer volume of traffic – 5,000 requests per second and over 200 million daily requests for one game – makes reliable performance both exacting and essential. Because the majority of requests involve updating states in the database, an uncommon 1:1 read/write ratio makes standard caching techniques largely ineffective. As head of engineering Jesper Richter-Reichhelm explains, “The load on our backends makes it impossible to use standard profilers on the live application. It’s also hard to predict live traffic, and usage profiles change constantly as a result of weekly software updates. Plus, each game team only has 2-3 engineers working on the server, and they’re doing it all: architecture, development, and operation.” Rigorous user experience requirements, heavy yet volatile traffic, variable usage dynamics, and resource constraints all place intense demands on an application performance management (APM) solution. It has to enable monitoring of live applications without significantly impacting performance, while also providing the greatest amount of insight with the least manual effort.
The Key to Victory
Early on, wooga monitored performance by using Ganglia to collect some hard numbers and their own software to follow key performance indicators. But they weren’t able to measure overall application performance, so they started looking for a complete solution that was effective for Ruby on Rails. It quickly became clear that only New Relic RPM offered all the capabilities they required. After a fast, easy implementation, wooga was finding relief from some of their pain points. Identifying the source of performance bottlenecks is no longer a problem. And, with the custom view and custom trace features available in their Gold edition of RPM, wooga can easily see important tracking data in real time. “One of our web transactions has an average response time of 250ms,” says Richter-Reichhelm. “But RPM revealed that some calls took longer than 10 seconds for the same code. By following the trace, we pinpointed the root cause. It was clearly visible in the performance breakdown. We were even able to identify the affected users. It turned out to result from an extreme usage pattern we weren’t aware of before. So not only did we detect and solve a serious performance problem, but we gained great insights into how the social graph of our application worked as well.”
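To see why a healthy-looking average can hide calls like these, here is a minimal Python sketch. The sample timings are made up for illustration (they are not wooga's data), and `summarize` is a hypothetical helper, not part of RPM.

```python
from statistics import mean, median

def summarize(response_times_ms):
    """Return the mean, median, and slowest call for a list of timings (ms)."""
    return {
        "mean": mean(response_times_ms),
        "median": median(response_times_ms),
        "max": max(response_times_ms),
    }

# Illustrative data only: 990 fast calls plus 10 very slow ones.
samples = [150] * 990 + [10_150] * 10
summary = summarize(samples)
print(summary)
```

The mean alone looks unremarkable, while the maximum exposes the handful of calls that take longer than 10 seconds. That is the kind of gap a per-transaction trace makes visible.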
Jesper’s team can also see exactly how applications are behaving in live traffic – not just how often a transaction is called, but how long it spends on a database query. In fact, visibility into SQL query parameters for slow transactions helped them resolve some hard-to-detect problems that only affected a few users, but had tremendous consequences for those impacted. The weekly releases don’t cause headaches anymore, either. Comparing new software against prior versions is simple because RPM’s performance breakdown shows if an update has changed throughput or response times for the entire application, individual transactions, or database queries. “In a new release just a few weeks ago,” recalls Richter-Reichhelm, “there was one transaction with a query on a newly created table that wasn’t using an index. But we didn’t notice it at first because the query was still quite fast. As more data was added to the table, however, the application became much slower. We eventually saw the average MySQL response time had more than doubled. By comparing web transaction and database response times to the previous week, we quickly spotted the culprit and added the missing index. Then we used RPM to circle back and verify that the query was executing properly.”
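The missing-index problem described above can be reproduced in miniature. The sketch below uses Python's built-in SQLite rather than MySQL, purely so it runs without a server. The `gifts` table, its columns, and the index name are hypothetical and do not come from wooga's schema.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gifts (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT)"
)
conn.executemany(
    "INSERT INTO gifts (user_id, item) VALUES (?, ?)",
    [(i % 1000, "seed") for i in range(10_000)],
)

def query_plan(connection, sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)

lookup = "SELECT item FROM gifts WHERE user_id = ?"

plan_before = query_plan(conn, lookup, (42,))  # full table scan
conn.execute("CREATE INDEX idx_gifts_user_id ON gifts (user_id)")
plan_after = query_plan(conn, lookup, (42,))   # index lookup

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists, the plan reports a scan of the whole table, so the cost grows with every row added. Afterwards it reports a search using the index. On MySQL, the equivalent check would be running `EXPLAIN` on the slow query.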
The benefits aren’t limited to wooga’s Rails apps; they can now monitor all their different application environments using a single toolset. They’ve even begun comparing the various setups by using metrics from one app to investigate the others. In one case, that led the engineers to notice response times for Redis calls in Happy Hospital’s dedicated server environment were 8 times lower than the same calls in Monster World’s EC2 environment. This head-to-head appraisal allowed them to directly compare the two hosting solutions with live traffic instead of artificial tests. “RPM is the only tool we use with every one of our applications, which are very dissimilar. It’s truly invaluable to us.”
Advancing to the Next Level
With an ability to consistently identify bottlenecks and better understand the behavior of both applications and users, wooga has achieved results that translate into business growth. Resolving performance issues has boosted the overall capacity of their cluster by as much as 20% in some cases, and Richter-Reichhelm estimates they’ve increased overall throughput “by at least 30%, with equivalent reductions in hosting costs.” Scalability improvements propelled Monster World from 300,000 DAU to 1.1 million. “We could not have gotten Monster World to over 1 million daily users without New Relic RPM,” he says, “and the number of daily users correlates directly to the income it generates. It’s the same story with Happy Hospital: the solution enabled us to spot problems easily – and early on – once again.” Moving forward, wooga plans to keep its game teams small and agile. But with such large-scale applications, they need to be able to get the most out of their software and hardware. New Relic RPM allows them to do just that because “it’s simple to use but still very powerful. It’s already a standard tool for our company, and we’ll likely use it for all new applications whenever possible. For operating and tuning applications, it’s a real game changer. I don’t want to run another project without using RPM.”
Get in the Game
Join the social fun by finding wooga’s games on Facebook. Want to explore the fast-growing world of gaming in depth? Think you’ve got the high scores to become part of Europe’s leading social games developer? Check them out at www.wooga.com to power up your curiosity or career.
If you’re an AWS customer like wooga, you get New Relic Standard for free. Go to http://www.newrelic.com/aws to sign up!