One of my favorite features in New Relic is the Scalability Charts, and I’m not just saying that because I implemented them. They show an app’s performance profile using dimensions not typically captured in traditional performance monitoring software, which is all about time series data. Time series data is invaluable for understanding how things change over time: you can see when things go bad, and for how long.
But often what you want to know is how the site responds under load. At what point is a bottleneck resource saturated? What is the bottleneck resource, for that matter: CPU or database? For many sites you can see response times increase when the load increases, but are they increasing linearly?
The Scalability Charts show you server response times, database time and CPU time all plotted against load. Instead of a line or bar chart you get a scatter plot, along with a third dimension, time of day, represented by the color of the dot.
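To make the change of axes concrete, here is a minimal sketch of how time series samples could be turned into scatter points. It is not how New Relic builds the charts; the sample data, `to_scatter_points`, and `hour_to_color` are all made up for illustration.

```python
from datetime import datetime

# Hypothetical per-minute samples: (timestamp, requests per minute, avg response ms).
# These numbers are invented for illustration, not real New Relic data.
samples = [
    (datetime(2024, 1, 1, 3, 0), 120, 340.0),
    (datetime(2024, 1, 1, 9, 0), 900, 352.0),
    (datetime(2024, 1, 1, 13, 0), 2400, 361.0),
    (datetime(2024, 1, 1, 22, 0), 300, 345.0),
]


def to_scatter_points(samples):
    """Turn time series samples into (load, response_ms, hour_of_day) points."""
    return [(rpm, ms, ts.hour) for ts, rpm, ms in samples]


def hour_to_color(hour):
    """Map an hour (0-23) to a hue in degrees so nearby hours get similar colors."""
    return round(hour / 24 * 360)


# Sort by load, which becomes the X axis; time only survives as the dot color.
points = sorted(to_scatter_points(samples))
for load, ms, hour in points:
    print(f"load={load:5d} rpm  response={ms:6.1f} ms  hue={hour_to_color(hour)}")
```

The timestamp no longer orders the data. It is reduced to a color, which is why samples from the same part of the day show up as clusters.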
Here’s an example plot of the performance of the New Relic site over a 24-hour period:
This is a fairly typical chart for response time. It shows our app tier response time holding consistently around 350 ms even as we hit our peak loads, which occur in the early afternoon, West Coast time. You can see how the load varies over time because of the clustering by color. The time on the Y axis represents the total time spent servicing the request, so it includes both database and CPU time. The data is pretty scattered, which is common in the 24-hour view.
If you look at the CPU and Database graphs, you’ll see pretty flat lines as well, which makes sense because they are just components of the overall response time.
The CPU graph is a little more interesting in that it has less noise in it. This is very typical. CPU graphs usually render with a level moving average because the CPU demand per request generally won’t change with the load: it’s the same amount of work regardless of how many other requests are being serviced concurrently. On the other hand, the graph showing the database time per request is much noisier. In our case this is because of variation on the front end in the size of the data, as well as the effect of background jobs running outside the web application.
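If you export the points behind these graphs, one rough way to put numbers on "flat" and "noisy" is to fit a straight line of time per request against load, then measure how far the points stray from it. The sketch below does that with ordinary least squares. The data and the function names are hypothetical, and the charts themselves draw a moving average, not a fitted line.

```python
from statistics import mean, pstdev


def fit_line(loads, times):
    """Ordinary least-squares fit of time = intercept + slope * load."""
    mx, my = mean(loads), mean(times)
    sxx = sum((x - mx) ** 2 for x in loads)
    sxy = sum((x - mx) * (y - my) for x, y in zip(loads, times))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_spread(loads, times):
    """Standard deviation of the points around the fitted line (the noise)."""
    slope, intercept = fit_line(loads, times)
    residuals = [y - (intercept + slope * x) for x, y in zip(loads, times)]
    return pstdev(residuals)


# Made-up samples: requests per minute, with CPU and database ms per request.
loads = [100, 200, 300, 400, 500]
cpu_ms = [80, 81, 79, 80, 80]       # flat and tight
db_ms = [150, 240, 120, 260, 180]   # scattered

cpu_slope, _ = fit_line(loads, cpu_ms)
print(f"CPU slope: {cpu_slope:.4f} ms per extra request/min")
print(f"CPU noise: {residual_spread(loads, cpu_ms):.1f} ms")
print(f"DB noise:  {residual_spread(loads, db_ms):.1f} ms")
```

A slope near zero means the cost per request does not depend on load. A clearly positive slope is the first hint that something is saturating.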
Much of this information is available in New Relic’s app overview charts. The scalability graphs are quite interesting, though, because they make it easier to see patterns that can reveal more about the way your application works, or show it behaving in ways you didn’t expect.
Here’s an example of a site whose response time degrades with higher load.
Most often you’ll find this reflected in the database graph because the DB tends to be a bottleneck, but in this case the CPU graph shows increasing CPU demand at higher load.
What would cause an increase in CPU time required to process requests when the load is higher? The answer could be in caching. A higher load could cause a lower hit rate on a cache as entries are evicted sooner, meaning more work per request. But since the CPU time also correlates to time periods, it could be that at certain times of the day key pages have to do more work.
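The caching theory can be illustrated with a toy simulation. This is only a sketch under a strong assumption that is mine, not something measured on the site above: that higher load means more distinct keys in play (more active users, say) competing for a fixed-size LRU cache. The function name and parameters are hypothetical.

```python
import random
from collections import OrderedDict


def lru_hit_rate(active_keys, capacity, requests=20_000, seed=1):
    """Simulate an LRU cache and return the fraction of requests served from it.

    Assumption: keys are requested uniformly at random from `active_keys`
    distinct keys, and a busier site has more active keys.
    """
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(requests):
        key = rng.randrange(active_keys)
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / requests


low_load = lru_hit_rate(active_keys=500, capacity=1000)
high_load = lru_hit_rate(active_keys=5000, capacity=1000)
print(f"hit rate at low load:  {low_load:.2f}")
print(f"hit rate at high load: {high_load:.2f}")
```

Every miss means redoing the work the cache was meant to save, so under this model a falling hit rate would show up as more CPU time per request at higher load.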
In any case, it would be hard to see these kinds of patterns just looking at the time series data, which for the same period looks like this:
I’ve seen many other interesting patterns in the scalability charts. If you have a Professional subscription or a trial, be sure to check out your app’s scalability graphs. Are they what you’d expect? See any odd patterns you can’t easily explain? Let me know and I’ll see if I can come up with a theory for you.
By the way, although I implemented the original scalability charts, they were based on Lew’s prototype, so let me just say once again: great idea, Boss!