This is the third in a series of three guest posts demonstrating the basic principles of performance tuning from Flood IO co-founder and CTO Tim Koopmans.
In the first post of this series, we introduced the basic concepts of performance tuning and demonstrated how you can simulate load using Flood IO and analyze performance using New Relic. The second post used slow transaction and database traces to help identify and tune obvious problems in our application under test.
In this final post, we use New Relic to fix the remaining problems and confirm with Flood IO that they are resolved.
New Relic includes an API you can use to collect additional metrics about your application. If you see large “Application Code” segments in transaction trace details, custom metrics can give you a more complete picture of what is going on in your application.
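As a runnable illustration of the idea, a custom timing metric amounts to recording elapsed time under a metric name. This is a stand-in, not the agent's implementation; with the newrelic_rpm gem you would use calls such as NewRelic::Agent.record_metric or a method tracer instead:

```ruby
require 'benchmark'

# Minimal stand-in for a custom metric recorder: time a block and keep
# per-metric samples so we can report count/min/max -- the kind of data
# a custom dashboard would chart.
class MetricRecorder
  def initialize
    @samples = Hash.new { |h, k| h[k] = [] }
  end

  # Time the block and file the elapsed seconds under the metric name.
  def record(name)
    result = nil
    @samples[name] << Benchmark.realtime { result = yield }
    result
  end

  def stats(name)
    s = @samples[name]
    { count: s.size, min: s.min, max: s.max }
  end
end

recorder = MetricRecorder.new
3.times { recorder.record('Custom/Caching/fetch') { sleep 0.01 } }
stats = recorder.stats('Custom/Caching/fetch')
# A hard floor on the minimum sample, no matter the call count, is
# exactly the pattern described below.
```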
In our test application, Flood IO was still showing problems around caching. Looking at the source code revealed an existing tracer method in the CachingController.
This let us create custom dashboards within New Relic to present the data. It’s evident that no matter how many times this method is called, the minimum response time is always at least 30ms.
Looking at the code, we can see this method is trying to make use of Rails.cache. Closer inspection, however, reveals that the key name being read differs from the key name being written, so the cache is never read from.
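A small runnable sketch of the bug (the key names are hypothetical, and a plain Hash stands in for Rails.cache): when the read key and write key differ, every request is a cache miss.

```ruby
cache = {}
lookups = 0
expensive_lookup = -> { lookups += 1; 'fresh result' }

read_key  = 'users/recent'   # key the code reads
write_key = 'user/recent'    # key the code writes -- the mismatch

# Buggy version: the value is stored under a key that is never read.
fetch_buggy = -> { cache[read_key] || (cache[write_key] = expensive_lookup.call) }
3.times { fetch_buggy.call }
misses_with_bug = lookups          # => 3 : every call pays full price

# Fixed version: one key for both the read and the write.
cache.clear
lookups = 0
fetch_fixed = -> { cache[read_key] ||= expensive_lookup.call }
3.times { fetch_fixed.call }
misses_after_fix = lookups         # => 1 : only the first call is expensive
```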
We quickly deployed a fix and confirmed success manually with a single browser session.
The tweets transaction was also slow, and further investigation showed that the majority of time was spent in a call to an external service: Net::HTTP[twitter.com]: GET.
Outbound calls to Twitter from the TweetsController are going to be expensive under concurrent load.
By caching at the page level, we avoid executing the controller code for every request, thereby limiting the number of outbound calls made to the external service.
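A sketch of why caching in front of the outbound call helps (the names here are illustrative, not from the app). Rails.cache.fetch, or page caching sitting above the controller, follows the same pattern: execute the block only on a miss.

```ruby
cache = {}
external_calls = 0

fetch_tweets = lambda do
  cache.fetch('tweets') do                 # block runs only on a cache miss
    external_calls += 1                    # stands in for Net::HTTP GET to twitter.com
    cache['tweets'] = ['tweet one', 'tweet two']
  end
end

100.times { fetch_tweets.call }
external_calls    # => 1 : 99 of 100 requests never touch the external service
```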
Last but not least, we wanted to track down error events. New Relic’s event monitor makes this easy. We can get an idea of the error rate and when they are occurring under load:
We can also get a breakdown of the types of errors that occurred:
The stack trace pinpointed exactly where in our application code things were going wrong:
These are simple, innocuous functional errors, but the cost of serializing stack traces and handling those errors in a production environment can still be high. So it makes sense to resolve the division-by-zero error being reported.
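A hedged sketch of the class of fix (the app's actual method isn't shown here): guard the denominator rather than letting ZeroDivisionError propagate and be rescued after the fact.

```ruby
# Guard the denominator: Ruby raises ZeroDivisionError on integer
# division by zero, and fdiv would return NaN/Infinity, so check first.
def safe_rate(numerator, denominator)
  return 0.0 if denominator.zero?
  numerator.fdiv(denominator)
end

safe_rate(3, 4)   # => 0.75
safe_rate(5, 0)   # => 0.0 instead of raising ZeroDivisionError
```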
The final part of a performance tuning test effort is to confirm that all the iterative changes hang together.
Once we’d whipped the application under test into shape, it was time to start load testing. We chose an arbitrary concurrency of 1,000 users with a response time target of less than 4s. We scaled out with 6x Heroku dynos and 3x grid nodes in Flood IO across the East and West coasts of the U.S. as well as Australia. We also contributed a Flood IO plugin to the New Relic Platform (see more information on the integration in this Flood IO blog post).
Our last baseline showed much better response time averages across the board, easily satisfying the 4s target. We also eliminated all errors under load.
Flood IO and New Relic clearly make a powerful performance-tuning team. The great thing about the combination is that they put all the information in one place.
Of course, you can always keep performance tuning. Now that we’ve ‘fixed’ the initial round of performance defects, it will be easier to identify any new problems under sustained load. For example, it looks like request queuing is happening on the Heroku dynos, but that’s a story for another day…