Drinking our own Champagne: using RPM to optimize…RPM.

After the dust settled at New Relic on our complete site redesign (what we call V2), we figured it was time to put RPM to good use and tune our site. The first thing we did was set a goal, which, for this exercise, was to cut the response time of our data collection tier in half and buy back as much CPU as possible. The business driver for this initiative was that our current hardware was nearly at capacity, and we wanted to see if we could tune our way out of having to purchase more equipment.

Here was the SLA report for our data collection tier:

Notice the growth in requests processed from 09/29 through 11/10. After 11/10, however, the number of requests processed starts to drop, the Apdex score gets worse, and response time climbs.

Here is a six-hour breakdown of where time was being spent in our data collection tier. As you can see, most of it is spent in the database.

Our data collection tier was originally written entirely in Ruby. A while ago, though, we realized we could compress our incoming data stream, so we introduced a new component called the aggregator, which sits between the Ruby data collector and the MySQL database. We had been running with about half of our accounts using the aggregator for a few weeks to make sure everything worked properly. So, before doing any database tuning, we decided to flip on the aggregator for the entire account base. In theory, this should have reduced database load.
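
At a high level, the hand-off works something like the sketch below. This is illustrative only, not our production code; the class-free helper and the endpoint are made up, and at this point we were still sending JSON (more on that later):

    require 'net/http'
    require 'uri'
    require 'json'

    # Illustrative sketch only -- not our production code.
    # The Ruby collector batches incoming metric data and hands it to the
    # aggregator over HTTP; the aggregator then writes to MySQL in bulk.
    AGGREGATOR_URI = URI.parse('http://aggregator.internal:8080/metric_data') # hypothetical endpoint

    def forward_to_aggregator(metric_payload)
      Net::HTTP.start(AGGREGATOR_URI.host, AGGREGATOR_URI.port) do |http|
        http.post(AGGREGATOR_URI.path, metric_payload.to_json,
                  'Content-Type' => 'application/json')
      end
    end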

Unfortunately, as a result of the change, our Ruby collector response time took a hit. Here’s a breakdown graph showing what happened after flipping on the aggregator for all accounts:

Response time was about 30% worse than before, and the amount of time spent in the database and in Ruby increased! Plus, our throughput, which is normally fairly constant, dropped by a third, which meant we were saturated. Digging around a bit, we came up with a theory: our database load profile had changed, and inserts from the aggregator were much larger and therefore apt to prevent other inserts and updates from executing in a timely manner. Evidence of this can be seen in the following breakdown graph of DB save and find time. Notice the spike in save time starting at 2pm:

While we consider ourselves experts when it comes to building web application monitoring software, admittedly, our knowledge of MySQL only goes so far. Fortunately, we keep the MySQL experts at Percona on retainer, and we set up a conference call with one of their engineers. He strongly encouraged us to upgrade the version of MySQL we were using so that we could have more control over concurrent database write threads. Sure enough, look what happened to save time at 4pm when we upgraded our database:

On this next graph, you can see what happens to our top 5 database operations. AgentRun#save time, the area in yellow, drops dramatically:

But we still had one more database optimization to do. RPM’s “SQL – SELECT” metric tracks custom SQL statements. For our data collectors, this is primarily a command we issue to check for the existence of a database table. After reviewing the code, we decided we could use memcache and only check for the existence of the table once. We deployed that change to production at 8:30am and saw the SQL – SELECT time drop. We also saw a big improvement in other database operations; apparently those select statements were gumming up MySQL:
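
The change itself was small. Roughly, it looked like the sketch below; the memcache client, key naming, and existence check shown here are illustrative, not our actual code:

    require 'memcache'   # memcache-client gem; assumed here for illustration

    CACHE = MemCache.new('localhost:11211')   # hypothetical memcached server

    # Check for the table once, then remember the answer in memcache instead of
    # issuing the same "SQL - SELECT" against MySQL on every request.
    # Assumes ActiveRecord provides the database connection.
    def table_exists?(table_name)
      cache_key = "table_exists:#{table_name}"
      cached = CACHE.get(cache_key)
      return cached unless cached.nil?

      sql = "SHOW TABLES LIKE '#{table_name}'"
      exists = !ActiveRecord::Base.connection.select_value(sql).nil?
      CACHE.set(cache_key, exists)
      exists
    end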

Now that the databases were running smoothly again, we went back to the data collector to see how things looked. Response time was back to around 96ms; however, that was where we started before we switched all of the accounts over to the aggregator. The key difference now was the breakdown. Almost none of the overall response time was in the database after the 8:30am change:

Ruby time appears to have increased; however, we’re actually handling more throughput after the 8:30am change than before. Nevertheless, to tune our data collectors any further, we’d need to look at the Ruby code. Here’s a breakdown of the #1 web transaction in the data collector:

About 35% of the time was spent preparing and sending metric data to the aggregator. We had been using JSON to send data to the aggregator, but we wrote a simple benchmark and discovered we could put our data into arrays and use Ruby's pack() method to create a byte string that the aggregator could easily read. (The aggregator is written in Java, so we wanted a format that could be read from a DataInputStream.) We pushed out the new communications protocol and were amazed at the result:
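
For the curious, the benchmark boiled down to something like the sketch below. The payload and format string are simplified for illustration; the real metric data is more involved:

    require 'json'
    require 'benchmark'

    # Simplified sketch of the two wire formats we compared.
    metrics = Array.new(10_000) { rand * 1_000 }   # stand-in metric values

    Benchmark.bm(6) do |bm|
      # Old protocol: serialize the metric values as JSON text.
      bm.report('JSON') { metrics.to_json }

      # New protocol: pack the values as big-endian doubles into a byte string,
      # which the Java aggregator can read from a DataInputStream via readDouble().
      bm.report('pack') { metrics.pack('G*') }
    end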

Not only did our overall response time drop to 48ms, but we also saw a substantial drop in “External Services” time, which is the HTTP post to the aggregator. After digging for a bit, we discovered what had happened: the aggregator had been spending most of its time deserializing JSON, and when we switched to a native byte format it got substantially faster. We’re also sending 1/4 the amount of data using a byte string vs. JSON, which makes the HTTP post a little faster as well.

Here is the SLA report showing how much progress was made:

Thanks to RPM and to a fantastic effort by the New Relic team, for the week ending 12/15 we saw a significant rise in requests processed, a drop in average response time, and a big improvement in Apdex; we had definitely reached our goal. Overall response time for our data collectors was cut in half, and we reduced CPU burn by about 30%. We are now delivering a better customer experience and have saved ourselves tens of thousands of dollars in new equipment purchases.

Visit our web site to learn more about RPM.
