There’s been a lot of talk about the improvements in Ruby 2.0’s Garbage Collection performance. That’s a great thing, because GC in older versions of Ruby wasn’t very performant.
Most of the conversation about generational GC (or other strategies) revolves around making Ruby GC more efficient. This has a great impact on your users, because less GC time means that their pages return faster. But what if you could avoid it altogether?
With Unicorn’s out-of-band GC, you can. As they say:
“We’ll call GC after each request is written out to the socket, so the client never sees the extra GC hit it.”
It looks easy enough, too. All you have to do is add this to your ‘config.ru’:
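A minimal sketch of such a ‘config.ru’, assuming the unicorn gem is installed. The one-line lambda is only a placeholder for your real Rack app, and the interval is left at its default:

```ruby
# config.ru
require 'unicorn/oob_gc'

# Force a GC sweep out of band, after responses have been written out.
use Unicorn::OobGC

# Placeholder Rack app (an assumption for illustration, not from the original post).
run ->(env) { [200, { 'content-type' => 'text/plain' }, ["ok\n"]] }
```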
There are a couple of tricks to it, though. There’s an optional second parameter which sets the frequency of GC sweeps. It defaults to five, meaning a GC sweep will be executed by the Unicorn worker after it returns every fifth request to the user. This sounds fantastic, right? The user no longer sees any GC, because you’re waiting for the request to return to them before it happens.
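To make the interval concrete, here is a small sketch of the countdown. It is an illustration only, not Unicorn's code; `sweep_points` is a hypothetical helper that lists the requests after which a sweep would run.

```ruby
# Returns the request numbers after which an out-of-band sweep would run,
# given a countdown that starts at `interval` and resets after each sweep.
def sweep_points(total_requests, interval = 5)
  remaining = interval
  (1..total_requests).select do
    next false unless (remaining -= 1) <= 0
    remaining = interval # reset the countdown after a sweep
    true
  end
end

p sweep_points(12)     # => [5, 10]
p sweep_points(12, 40) # => []
```

With the default of five, a worker sweeps after its 5th and 10th requests. With a larger interval, a worker can go a long time without sweeping at all.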
We thought to ourselves, “Cool! Ship it!” So we did.
And Then Our Servers Fell Over
The next morning, our on-call DevOps engineer got a 3 a.m. wake-up call. All of our servers were burning up: they were pegged near 100% CPU and our queue time was wa-a-a-ay up. After he reverted the changes, everything went back to normal.
But there was something fishy about this whole thing. During this entire time, we were still seeing GC time in our breakdown chart.
What’s Really Going On?
I took another look at the code. It turns out that this was all the OobGC’s doing:
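[code language="ruby"]
super(client) # Unicorn::HttpServer#process_client
if OOBGC_PATH =~ OOBGC_ENV[PATH_INFO] && ((@@nr -= 1) <= 0)
  @@nr = OOBGC_INTERVAL  # reset the countdown
  disabled = GC.enable   # true if GC had been disabled
  GC.start               # run the out-of-band sweep
  GC.disable if disabled # restore the previous state
end
[/code]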
Let’s unpack that a bit. ‘@@nr’ is just a counter that tracks how many requests remain until the next sweep. When it reaches zero (on the fifth request, by default), we reset it, turn on GC and run a GC sweep. If GC was off to begin with, we turn it back off again. That’s pretty simple. It only looks funky because ‘GC.enable’ returns ‘true’ if GC had previously been disabled.
And then it hit me — we never disabled GC!
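The return value of ‘GC.enable’ is the crux, so here is a small standalone sketch of the same enable/sweep/restore dance. `oob_sweep` is a hypothetical helper written for this illustration, not part of Unicorn.

```ruby
# Run one forced sweep, then put GC back in whatever state it was in.
# Returns true if GC had been disabled before the call.
def oob_sweep
  was_disabled = GC.enable # true only if GC had been disabled
  GC.start                 # the forced sweep
  GC.disable if was_disabled
  was_disabled
end

GC.disable
puts oob_sweep  # true: GC was off, so it is switched off again afterwards
puts GC.enable  # true: confirms GC was left disabled (and turns it back on)
puts oob_sweep  # false: GC was already on, so it simply stays on
```

In the last case nothing is restored, so the forced sweep comes on top of the collections Ruby is already running on its own.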
We Were Doing It Wrong
Because we had never disabled GC, we were running it as normal, plus forcing a GC sweep every five requests. This was what killed our servers. They were burning all their CPU running GC sweeps across dozens of Unicorn workers. The workers were so busy doing GC that they were unavailable to answer requests, which spiked our queue time.
Sigh. Lesson learned.
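A rough way to see the effect is to count GC runs with and without forced sweeps while GC is left enabled. This is a toy simulation under assumed workloads (the allocation loop stands in for a request), so the exact numbers will vary by machine and Ruby version.

```ruby
# Simulate `requests` requests with GC enabled. If `interval` is given, also
# force a sweep every `interval` requests. Returns how many GC runs occurred.
def gc_runs_during(requests:, interval: nil)
  before = GC.count
  requests.times do |i|
    Array.new(10_000) { |n| "row #{n}" } # stand-in for per-request garbage
    GC.start if interval && ((i + 1) % interval).zero?
  end
  GC.count - before
end

automatic_only = gc_runs_during(requests: 50)
with_forced    = gc_runs_during(requests: 50, interval: 5)
puts "automatic only: #{automatic_only}, automatic plus forced: #{with_forced}"
```

Every forced `GC.start` is a full collection, so the second figure includes at least ten of them in addition to whatever Ruby triggers automatically.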
Doing It Right
We tried this instead:
[code language="ruby"]
GC_FREQUENCY = 40
GC.disable # Don't run GC during requests
use Unicorn::OobGC, GC_FREQUENCY # Only GC once every GC_FREQUENCY requests
[/code]
And it worked much, much better.
Look at that! When we deploy this config, the brown-colored GC just disappears. It’s a thing of beauty. But I bet you’ve got a few questions.
How Do You Prevent the Workers from Getting Huge?
Nobody likes a big bloated Unicorn. Let’s say one worker has 40 really large requests in a row. It could get to several GB in size before any GC happens. That could take down our servers by pushing us into swap. Ideally, there would be some way to monitor the workers and kill them off if they get too big.
You could use monit for that, but ew. Plus monit would just kill them off mid-request, which would be terrible for the user.
Instead, we use Unicorn Slayer. It’s less of a Slayer and more of a Unicorn suicide module. Each worker checks its RAM usage after every nth request and, if it’s over the limit, sends itself a SIGQUIT. The check frequency is configurable. Here’s ours:
use(UnicornSlayer::Oom, ((1_024 + Random.rand(512)) * 1_024), 1)
So now we have a safety valve. If any of our workers get too big, we’ll shut them down to prevent them from taking up too much RAM. We check every request, because the check is very, very cheap.
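The idea can be sketched in a few lines of plain Ruby. This is not the gem's implementation: `MemoryGuard` and `jittered_limit_kb` are hypothetical names, the limit is assumed to be in kilobytes, and the memory reading and the "quit" action are injected so the sketch runs without killing anything.

```ruby
# Hypothetical stand-in for an out-of-memory check; not the gem's real code.
class MemoryGuard
  def initialize(limit_kb:, check_every:, rss_reader:, on_exceed:)
    @limit_kb    = limit_kb
    @check_every = check_every
    @rss_reader  = rss_reader # callable returning resident memory in KB
    @on_exceed   = on_exceed  # callable; a real worker might send itself SIGQUIT
    @served      = 0
  end

  # Call once after each response is written. Returns true if the limit tripped.
  def after_request
    @served += 1
    return false unless (@served % @check_every).zero?
    return false if @rss_reader.call <= @limit_kb

    @on_exceed.call
    true
  end
end

# Same arithmetic as our config: 1 GB plus up to 512 MB of jitter (assumed KB).
def jittered_limit_kb
  (1_024 + Random.rand(512)) * 1_024
end

sizes  = [900 * 1_024, 1_600 * 1_024].each # simulated worker sizes in KB
killed = []
guard  = MemoryGuard.new(
  limit_kb: jittered_limit_kb, check_every: 1,
  rss_reader: -> { sizes.next },
  on_exceed:  -> { killed << :quit } # real code might call Process.kill(:QUIT, Process.pid)
)
p [guard.after_request, guard.after_request] # => [false, true]
```

Because the signal is sent after the response has gone out, the worker can finish cleanly instead of dying mid-request. The random jitter presumably keeps workers from all hitting the limit at the same moment, though that is our reading of the config rather than something documented here.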
How Do We Monitor the Slayer?
It turns out that having a Unicorn slayer is great, but it can cause other problems. If we’re slaying too often, we’ll have a lot of restarts. When a Unicorn restarts, that takes time — time that the worker is unavailable. That can increase your request queuing time.
It’s a good thing we have good performance monitoring — the capacity analysis can help us keep tabs on this:
This chart shows how often the workers are restarting. The big spike is from the deploy, which makes sense. All of the workers restart with the new code. But there are some blips after that. Every few minutes we’re restarting a Unicorn or two. That’s a bit more than I might like, but for now I can live with it.
In the end, we just have to balance RAM consumption on the servers and restart frequency. If you use more RAM, you restart less often. If you use less RAM, you restart more often. Tuning this isn’t particularly painful; it just takes time and is highly dependent on your server configuration. In our case, it’s a bit complicated because our seven app servers have three different hardware configurations (but that’s a story for another day).
In the End, Our Users Win
Now you know the ‘gotchas’ with Unicorn::OobGC. Remember to turn off GC completely and make sure you have the Unicorn Slayer configured. With that, you should be able to get rid of the GC time from your requests and regain anything from 10 to 100 ms, on average, from your response time. That’s how your users win. We definitely saw the benefit in our in-browser response times.
Do you have any experiences with Garbage Collection in Ruby 2.0 or Unicorn Slayer? Tell us about them in the comments below.