
Eating the 1.9 Elephant

A few weeks ago, we switched the New Relic website to run on Ruby 1.9.3. This was an enormous project spanning many months and requiring the effort of nearly every engineer in the company. But the results were excellent – improved speed, reduced memory usage and an infrastructure ready for future Ruby versions.

Moving such a large and long-lived Rails 2.3 application as New Relic required a very careful and thorough approach. Nearly every aspect of how our site is tested, deployed and run was affected. We learned a lot during the process and want to share the most important lessons.

It’s Not Just Debt
On a codebase as large as New Relic, the engineering time required to upgrade to 1.9 was large enough to treat as a serious feature. As we later learned, the performance improvements were significant enough to take it very seriously indeed.

Work to get the New Relic code ready for 1.9 began years ago, before I joined the company. Taking it from exploratory to production ready took eight months of calendar time, distributed among several engineers for different upgrade tasks.

When you get started with such a migration, it’s important to give at least one engineer the task of chasing down every dependency and weird bug until it’s shipped.

Use a Ruby Version Management Tool
Using a Ruby version management tool helped us in all stages of the upgrade. Both RVM and rbenv are excellent tools for managing and installing Ruby versions all the way from the laptop to the server. We went with rbenv for our servers for the simple installation and used the puppet-rbenv module to set it up. It’s worth deciding on which tool to use early on in the process and getting everyone on the team comfortable with it.

Once you’ve picked one, you can use it to do cross version testing on laptops, set up multi-version build configurations on your CI server, and easily change patchlevels or major versions on your production servers.
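
As a rough illustration of cross version testing with rbenv, the sketch below runs the same test task once per installed Ruby. It relies on rbenv's RBENV_VERSION environment variable and the `rbenv exec` command; the version names, the `rake test` task and the helper names are assumptions made for the example, not our actual setup.

```ruby
# Sketch only: these version names are assumptions; use whatever
# `rbenv versions` lists on your machine.
RUBY_VERSIONS = ["1.8.7-p370", "1.9.3-p286"]

# rbenv reads the RBENV_VERSION environment variable to choose a Ruby.
def rbenv_env(version)
  { "RBENV_VERSION" => version }
end

# Runs `rake test` once per version and returns { version => passed? }.
# Passing an env hash to Kernel#system needs Ruby 1.9 or later, so this
# driver script itself should run under 1.9.
def run_suite_under(versions)
  results = {}
  versions.each do |version|
    passed = system(rbenv_env(version), "rbenv", "exec", "rake", "test")
    results[version] = passed ? true : false
  end
  results
end
```

Calling `run_suite_under(RUBY_VERSIONS)` from the application root would then report which interpreters pass, which is essentially what a multi-version CI configuration does for you.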

During our upgrade process, there was a long season where the codebase had to be cross compatible between 1.8 and 1.9. By making version switching easy, we ensured that it wasn’t too huge a burden.
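
To give a flavor of what cross compatible code can look like, here is a minimal sketch of a compatibility helper. The `Compat` module and its method names are hypothetical, not taken from our codebase; the two behaviors it papers over (string indexing and encodings) are well-known differences between 1.8 and 1.9.

```ruby
# Hypothetical helper module for code that must run on both 1.8 and 1.9.
module Compat
  # Detect the feature rather than comparing RUBY_VERSION strings.
  HAS_ENCODINGS = String.method_defined?(:force_encoding)

  # "abc"[0] is an Integer on 1.8 and a one-character String on 1.9,
  # but "abc"[0, 1] is a String on both.
  def self.first_char(str)
    str[0, 1]
  end

  # Tag a copy of the string as UTF-8 where encodings exist; on 1.8,
  # which has no String#force_encoding, return the string unchanged.
  def self.utf8(str)
    HAS_ENCODINGS ? str.dup.force_encoding("UTF-8") : str
  end
end
```

Keeping such branches in one place makes them easy to delete once the older Ruby is retired.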

Make Your Test Server Do Most of the Work
We use Jenkins to build our code every time a push occurs. So to get started, we simply added a build that ran our tests in Ruby 1.9. With the first build we had over 200 failing tests. But that gave us a target. We held some bug bashes to help drive down the failures, then made fixing the ones that remained a sustaining task like any other. This wasn’t a fast process, but it was sustainable. And it allowed the upgrade to fit into our existing bug tracking and test process.

But Tests Aren’t Everything
It was a very exciting day when our 1.9 test job went green. The first thing I did was switch my laptop to 1.9 and try to run the site. It didn’t work at all. Whoops!

Turns out there’s a lot more to running the code than just the tests. We had lots of development-mode only code that set everything up to run on a laptop, none of which was tested by our CI tasks. This meant several more days of chasing down errors we had no idea existed.

Partial, Reversible Deploys Are Essential
We had several preproduction environments in which to test our 1.9 performance. But none of them received even a fraction of the traffic our production site does, nor did they have even a fraction of its dataset to work with. So when the time came to deploy the upgrade, we decided to do one server at a time to see how each fared.

We quickly discovered two things. First, 1.9 was performing about 80% slower than 1.8. And second, our load balancers didn’t think this was a problem and gave the 1.9 server just as much traffic as the other servers. Then things started to get ugly.

[Chart: App Instance Busy]

We scrambled to fix the load balancer by switching from round robin to least-connections as our balance strategy. This reduced the load on the now poorly performing server so we could troubleshoot the performance problem.
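
To see why the balance strategy matters, consider this toy simulation. It is a simplified model with made-up server names and service times, not a description of our load balancer: one request arrives per tick, and each server needs a fixed number of ticks to finish a request.

```ruby
# Round robin hands out requests in turn, ignoring how busy a server is.
class RoundRobin
  def initialize(servers)
    @servers = servers
    @next = 0
  end

  def pick(_active)
    server = @servers[@next % @servers.size]
    @next += 1
    server
  end
end

# Least-connections picks the server with the fewest requests in flight.
class LeastConnections
  def initialize(servers)
    @servers = servers
  end

  def pick(active)
    @servers.min_by { |server| active[server] }
  end
end

# service_ticks maps server name => ticks needed per request (assumed values).
# Returns a hash of server name => number of requests it was given.
def simulate(balancer, service_ticks, total_requests)
  servers = service_ticks.keys
  in_flight = Hash.new { |hash, key| hash[key] = [] } # server => finish ticks
  handled = Hash.new(0)

  total_requests.times do |tick|
    active = {}
    servers.each do |server|
      in_flight[server].reject! { |finish| finish <= tick } # retire finished work
      active[server] = in_flight[server].size
    end
    chosen = balancer.pick(active)
    in_flight[chosen] << tick + service_ticks[chosen]
    handled[chosen] += 1
  end
  handled
end

ticks = { "fast-1" => 3, "fast-2" => 3, "slow" => 10 }
p simulate(RoundRobin.new(ticks.keys), ticks, 300)
p simulate(LeastConnections.new(ticks.keys), ticks, 300)
```

In this model round robin gives the slow server a full third of the requests, so its backlog keeps growing, while least-connections sends it a request only when it has fewer in flight than the fast servers.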

After many hurried code changes, we discovered that our own Ruby agent had a poorly performing garbage collection instrumentation strategy under 1.9, which we patched right away and later released as version 3.5. With the patched agent, 1.9 went from 80% slower to 30% faster. High fives were had all around.
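
For readers curious about measuring garbage collection time themselves, Ruby 1.9 ships with GC::Profiler in the standard runtime. The sketch below is only an illustration of that API with a hypothetical helper name; it is not the instrumentation strategy used in the agent before or after the patch.

```ruby
# Hypothetical helper: runs a block and reports how much time Ruby spent
# in garbage collection while it ran, using the built-in GC::Profiler.
def with_gc_timing
  GC::Profiler.enable
  GC::Profiler.clear                    # drop samples from earlier runs
  result = yield
  gc_seconds = GC::Profiler.total_time  # Float, in seconds
  [result, gc_seconds]
ensure
  GC::Profiler.disable
end

count, gc_seconds = with_gc_timing do
  Array.new(100_000) { |i| "row #{i}" }.size
end
puts "built #{count} strings, #{gc_seconds} seconds spent in GC"
```

The time reported depends on the Ruby version and heap state, so compare runs on the same machine rather than reading the number as an absolute.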

[Chart: Average Response Time]

We would have got that 30% improvement much more quickly if we had actually done the fire drill of taking a server out of rotation before introducing the Ruby version change.

It Really Works
We have a bias for measurement here, and when you make a big change such as this one, having some charts on your side can be a tremendous help. Especially when you’re trying to make the project about more than just debt, load charts that go down and throughput charts that go up are a tremendous asset. In our case, 1.9 was so much faster that it was like getting a free web server.

[Chart: Top 5 CPU Consumers]

This machine is delivering more traffic with less CPU:

[Chart: Bandwidth]

We can look back now and see that this was an upgrade that delivered real user happiness. We can reliably serve pages to our users in less than two seconds, any time of the day.

[Chart: Browser Page Load Times]

Closing Comments
The switch to Ruby 1.9.3 represented a major feature upgrade to New Relic. While it was an enormous project, it has improved every aspect of how our code is run and managed. We hope our lessons learned help you achieve the results you’re looking for.


Comments



  1. What kind of big hassles did you run into during the migration? Are there any classes of changes which were big pains? (The one that comes to mind for our app is date parsing has changed subtly).


    Jonathan Owens Reply:

    The biggest pain was outdated, unmaintained gems. Lots of gems have been abandoned in the jump to 1.9 or replaced by other ones. We made heavy use of Bundler’s :platform directive to build a cross-compatible gemset.

    Language changes that hurt us in the application itself tended to be areas where we were monkeypatching odd classes, using procs and lambdas in unusual ways, or generally going off the beaten path.


    Posted: 25 October 2012 at 8:55 am by Chris Schneider

  2. How many man hours/months did it take you guys?

    Why are you guys still running rails 2.3?

    We did these upgrades last February, soooooo painful!!!!!


    Jonathan Owens Reply:

    About 8 months of on-and-off work – this was not a full-court press but it was a lane of development. The last two months were primarily operations work to support switching Ruby versions.

    We wanted to get to 1.9 before attempting Rails 3, so now that we’re there, Rails 3 work has begun in earnest.


    Posted: 25 October 2012 at 10:22 am by Michael Economy

  3. Hi,

    Great article. 10x for share.

    I’ve two and a half questions regarding this upgrade. :)

    1) Which Linux distribution are you guys using (if it is Linux)?

    2) So using Puppet to install rbenv, can I assume you have used the system Ruby to install Puppet in your servers? If not, can you write some lines about this please?

    Once again many thanks :)

    Cheers,
    Francisco


    Jonathan Owens Reply:

    Thanks!

    1. CentOS 5, but we’re in the process of upgrading to 6.

    2. Yes, Puppet uses the (ancient) system Ruby, but we’re going to have to cross that bridge soon because Puppet 3 has dropped support for 1.8.5. You can use some PATH tricks to get a different Ruby install prioritized, but it’s not something we’ve had to chase all the way down yet.


    Posted: 30 October 2012 at 7:27 am by Francisco