Ruby is not known for being fast. That said, MRI (Matz’s Ruby Implementation, the original and most popular implementation of the Ruby language) has recently made great progress in this area, thanks to the dedicated work of many contributors.
To give you a quick sense of the scope of these improvements, we recently got the test suite for our main Rails application passing on Ruby 2.1, after having been on 1.9.3 for quite a while. As a result of this change, the time to run the full test suite for this app dropped almost in half (from 11:45 to 6:20).
One of the biggest recent improvements has been the addition of a partially-generational garbage collector in MRI 2.1, thanks to Koichi Sasada (more on this later). There’s also been good progress recently on exposing more detailed performance measurements from the garbage collector and the rest of the Ruby VM via the GC.stat and RubyVM.stat methods.
I’m pleased to announce that the latest version of the newrelic_rpm gem (3.8.0) takes advantage of these recent introspection improvements to give our Ruby customers some neat insights into how the Ruby VM is working for their applications. To demonstrate how these measurements can be useful, I’d like to show you how we used them to help understand the performance tradeoffs of various GC strategies using an example app.
Before we get into the example, though, let’s take a brief detour to give some context about garbage collection. I won’t go into too much detail here, because other folks including Aman Gupta, Koichi Sasada, and Sam Saffron have already done a great job, but a brief summary will help set the context. If you’re looking for more details, Aman Gupta’s post in particular gives a great historical overview of the GC’s history in Ruby (and has lots of pretty pictures to boot).
The garbage collector in Ruby is responsible for finding objects that were previously allocated, but are no longer reachable given the current state of a Ruby process, and reclaiming the memory that was used to store those objects.
At a high level, this work is comprised of two phases, usually called mark and sweep. During the mark phase, the garbage collector starts from a known set of ‘root’ objects (global and thread-local variables, constants, etc.) and follows all chains of reference out from there, ‘marking’ each object it encounters along the way to indicate that it should be preserved. If you’ve ever implemented a graph traversal algorithm, you’ve got the basic idea. During the sweep phase, objects that were not marked during the mark phase are destroyed, and the memory that was used to store them is made available again to the Ruby VM for new objects.
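The traversal described above can be sketched in a few lines of Ruby. This is purely illustrative (MRI’s real collector is written in C and operates on its internal object slots, not on Ruby-level objects like this):

```ruby
# Toy mark-and-sweep: a graph traversal from the roots, then a pass
# that keeps only what was marked. Names here are hypothetical.
require 'set'

ToyObject = Struct.new(:name, :references)

def mark(roots)
  marked = Set.new
  stack = roots.dup
  until stack.empty?
    obj = stack.pop
    next if marked.include?(obj)
    marked << obj                 # 'mark' the object as reachable
    stack.concat(obj.references)  # follow its outgoing references
  end
  marked
end

def sweep(heap, marked)
  # Anything never marked is unreachable; keep only the marked objects.
  heap.select { |obj| marked.include?(obj) }
end

a      = ToyObject.new("a", [])
b      = ToyObject.new("b", [a])
orphan = ToyObject.new("orphan", [])
heap   = [a, b, orphan]

live = sweep(heap, mark([b]))  # 'b' is our only root; it keeps 'a' alive
live.map(&:name)               # => ["a", "b"] -- orphan is reclaimed
```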
Because the mark phase must visit every live object in the Ruby process, its cost scales with the number of live objects in the process. To make matters worse, the garbage collector must pause the entire Ruby process while running the mark phase, in order to ensure that the object graph is not modified while being traversed.
In practice, most objects tend to fall into one of two categories:
- Long-lived objects allocated early in the process life-cycle that persist for the entire lifetime of the process; or
- Transient objects that are allocated and then quickly destroyed (for example, objects created during the servicing of a single web request)
This means that traversing the entire Ruby object graph on every GC mark phase turns out to be wasteful: many of the objects that are examined during each mark phase have existed since process startup, and will persist until the process exits.
Ruby 2.1’s garbage collector addresses this inefficiency by bucketing objects into two categories, or generations, often referred to as the ‘old generation’ and ‘young generation’, based on how long they’ve been alive. These two generations can then be examined on different schedules. Specifically, young generation objects are examined for GC much more frequently than old generation objects, since they’re more likely to be short-lived.
A GC mark phase that examines only the young generation objects is known as a minor GC, whereas one that examines all objects is known as a major GC. As you might expect, minor GC runs are significantly cheaper than major GC runs.
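You can see this split for yourself on Ruby 2.1+, where `GC.stat` breaks the collection count down by generation (exact key names have shifted slightly across Ruby versions):

```ruby
# GC.stat returns a hash of VM-internal counters; on 2.1+ the total
# GC count is broken down into minor and major runs.
stats = GC.stat
puts "minor GC runs: #{stats[:minor_gc_count]}"
puts "major GC runs: #{stats[:major_gc_count]}"
puts "total GC runs: #{stats[:count]}"
```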
Out-of-band GC (‘OOBGC’) is a well-known technique that’s long been used to optimize GC for web applications. The basic idea is to encourage the garbage collector to run in between requests rather than during requests, so that the overall latency experienced by users is lower.
The New Relic application has used this technique for a while, and in a previous blog post, we discussed the bumps we hit along the way when enabling it. Aman Gupta also has a more recent post where he discusses how GitHub improved the performance of their app servers by using a slightly different OOBGC strategy better suited to Ruby 2.1.
The key to getting out-of-band GC right is to get GC to run frequently enough to keep the size of the Ruby heap under control, but not more often than it needs to, and not at times when it’s unlikely to be able to reclaim very many objects. This turns out to be tricky, so it’s critical to have measurement infrastructure in place to be able to get feedback from your production system as you adjust it. Let’s see how New Relic’s Ruby agent can help with this.
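The basic shape of the idea can be sketched as Rack middleware (illustrative only; real implementations like `Unicorn::OobGC` and gctools hook into the server so that GC runs after the response has actually been written to the client, and are smarter about when to trigger):

```ruby
# A stripped-down sketch of out-of-band GC as Rack middleware:
# every `interval` requests, trigger a collection after the app
# has produced its response. Class and parameter names are made up.
class SimpleOobGC
  def initialize(app, interval = 10)
    @app = app
    @interval = interval
    @requests = 0
  end

  def call(env)
    response = @app.call(env)
    @requests += 1
    if @requests >= @interval
      @requests = 0
      GC.start  # collect between requests, not in the middle of one
    end
    response
  end
end
```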
With that background out of the way, let’s dig into the new measurements that are captured by the Ruby agent. In order to make it easier to see what’s going on, we’re going to be using an example application whose sole purpose in life is to allocate lots of Ruby objects during each request. We’re going to be running all of our tests under Ruby 2.1.1 in order to be able to see the effects of the generational garbage collector.
This is a contrived test case. While there are many things we can learn from contrived test cases, ‘what will work best for your particular application in production’ is not one of them. There’s no substitute for actual measurements from your production environment.
Let’s first take a look at the behavior of our test application running with a relatively stock setup. For this test and all of the following ones, I’ve set up the test Sinatra application using 4 unicorn workers, with an nginx frontend that’s routing incoming requests and inserting the X-Request-Start header, so that we can measure queue time in addition to Sinatra request processing time.
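The relevant nginx directive looks something like this (a sketch; `${msec}` produces a timestamp in the `t=` format that New Relic’s agent understands on nginx 1.2.6 and later):

```nginx
# In the location block that proxies to the unicorn workers:
proxy_set_header X-Request-Start "t=${msec}";
```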
We’ll be using siege to drive load against our test application. For the purposes of these tests, I’ll be simulating 3 concurrent users with no delay between requests from each user (using the -c 3 and -b flags), so our 4 unicorn workers will have headroom.
For this test, we collected data in four different configurations:
- Stock setup, with no adjustments to GC
- Unicorn::OobGC with GC disabled during requests, running GC manually after every request
- Unicorn::OobGC with GC disabled during requests, running GC manually after every 25 requests
- GC::OOB::UnicornMiddleware from Aman Gupta’s gctools gem
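For reference, the Unicorn::OobGC configurations amount to a couple of lines in config.ru (a sketch; the interval argument was 1 and 25 respectively in our second and third configurations):

```ruby
# config.ru (sketch)
require 'unicorn/oob_gc'
use Unicorn::OobGC, 25  # trigger GC.start after every 25 requests
run Sinatra::Application
```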
Let’s first take a look at the overview graph for these four different configurations (in the order listed above):
The four configurations described above are shown left-to-right on the graph. On the far left, you can see the stock configuration, with no changes to GC configuration. This is the only configuration in which a significant amount of GC time happens during requests (seen as the brown band on the overview graph).
The next section to the right shows what happens when we turn off GC during requests, and re-enable it and manually trigger a GC run after each request using Unicorn::OobGC. Three interesting things to note here:
- The GC band on the overview graph disappears. This is because the overview graph is only showing us GC that happens during requests, not between requests.
- While the GC band disappears, the request queueing band expands in size dramatically. This is because GC runs are now happening in the time in between requests, meaning that incoming requests have to wait longer for a unicorn worker to be available to service them.
- Perhaps most importantly, the overall response time gets dramatically worse with this approach.
Moving on to the third section from the left, we adjusted the Unicorn::OobGC configuration to run after every 25 requests, rather than after every single request. This yields an overall improvement in response time from the stock configuration, though it comes at a price that we’ll see in a moment.
The final section of the graph shows the impact of using Aman Gupta’s new gctools gem to do out-of-band GC, without disabling GC during requests (specifically, we’re using GC::OOB::UnicornMiddleware). This gem makes use of new hooks exposed by the MRI 2.1+ VM to proactively trigger GC runs in between requests, but only when they’re likely to be useful. This approach yields us a decent response time improvement from our original stock configuration.
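Setting this up is similarly brief (a sketch following the gctools README; requires MRI 2.1+):

```ruby
# config.ru (sketch)
require 'gctools/oobgc'
use GC::OOB::UnicornMiddleware
run Sinatra::Application
```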
Now, let’s use the Ruby VMs tab to understand what’s going on in more detail here.
The first thing to note is that the number of objects being allocated per request remains constant throughout the entire test. You can see this in the graph of object allocations per request – it’s a flat line:
It’s important to realize that changes in garbage collector behavior can result from either configuration changes (e.g. switching to a new OOBGC strategy) or from alterations to code that significantly affect the number or type of objects being created. The flat line in the object allocations graph confirms that we didn’t accidentally affect how many objects were being created during our tests in addition to changing the configuration.
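If you want to spot-check allocation counts yourself, `GC.stat` exposes a monotonically increasing allocation counter (the `:total_allocated_objects` key is the Ruby 2.2+ spelling; on 2.1 the key was `:total_allocated_object`):

```ruby
# Roughly count how many objects a block allocates, by diffing the
# VM's running allocation counter before and after. Helper name is
# made up; key name is the Ruby 2.2+ spelling.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

allocations_during { 100.times { Object.new } }  # roughly 100
```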
Next, let’s look at the memory usage graph:
You can see that the average memory usage per process went down slightly when we started running GC after every single request. This makes sense, since we were forcing GC to run and reclaim unused memory more aggressively.
This graph also demonstrates the hidden cost of our third approach (disabling GC during requests, and only running it every 25 requests): memory usage more than quadruples, since we’re not reclaiming unused objects as aggressively.
This graph of the Ruby heap size shows a similar pattern:
The amber band represents a count of the number of live Ruby objects, and the blue band shows the number of available heap slots. All of our OOBGC configurations leave us with more free heap slots on average than the stock configuration.
Let’s see how frequently GC is running in each scenario:
In the stock configuration and with gctools, we have very few major GC runs – almost all of our GCs are minor GC runs (that is, they only need to examine a subset of the heap). The cases where we need to examine the entire heap are so rare that the band for major GC runs is barely visible. When using Unicorn::OobGC to run GC after every request, all of the runs we see become major GC runs. This is because Unicorn::OobGC uses GC.start to trigger GC runs, and on Ruby 2.1, this defaults to triggering a major (rather than minor) GC run. Once we switch over to gctools, we’re running a minor GC slightly more often than in the stock config – after every request (you can see this in that the frequency of minor GC runs bumps up to 100 per 100 requests).
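For comparison, Ruby 2.1 does let you request a minor collection explicitly when triggering GC by hand (keyword arguments to GC.start were added in 2.1):

```ruby
GC.start                    # defaults to a full-mark, major GC run
GC.start(full_mark: false)  # minor run: mark only the young generation
```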
The fact that we’re hitting minor GC runs more often might at first seem like a disadvantage, but it’s important to also consider when these minor GC runs happen. In the stock configuration, they might happen between requests or during a request. In the gctools config, we’re pre-emptively triggering GC between requests under certain circumstances, meaning that GC is much less likely to trigger during the middle of a request, making our overall response time better.
Finally, let’s consider the following graph, showing the average amount of wall clock time spent in GC by each Ruby process serving our app:
The most obvious feature here is the dramatic increase in time spent in GC when we’re running a full major GC after every request: from < 10% to almost 60%. It’s also interesting to note that using gctools in this test increased the wall clock time spent in GC versus a stock configuration. This ended up being OK because our unicorn workers weren’t fully saturated with requests (recall that we were only simulating 3 concurrent users, but we had 4 unicorn workers). If our workers were saturated with requests, this higher percentage of wall clock time in GC would be more problematic.
If you’re using Ruby 2.1, and you have headroom in your worker capacity (that is, you have enough workers that they aren’t always saturated with requests), you should consider using gctools to do out-of-band GC. It appears to offer a decent improvement in response time without bloating your memory usage.
As mentioned previously, though, this is just a contrived test application. The behavior of these strategies in your application under your workload on your servers may be completely different. The only sure way to determine what’s best for your application is to actually try several approaches and compare the results. We hope that the latest version of the Ruby agent simplifies doing these kinds of experiments.
To get access to the new Ruby VMs page, just update to version 3.8.0 or later of the newrelic_rpm gem (you’ll need to be using Ruby 1.9 or later to get data on this page). You can find more documentation about the feature on our docs site, and on our docs page about garbage collection.
Finally, do let us know what you think of the Ruby VM data that the latest agent collects. Have you found it useful? What more do you wish it did? We’d love to hear from you in our community forum.