How well is your web application servicing your customers right now?
That’s a common starting point when using an APM tool like New Relic. To put it more succinctly:
How are you doing?
There’s plenty of evidence and opinion to suggest that watching your average response time won’t give you anything near a complete answer to this question, which leaves many customers choosing to rely on the Apdex measurement instead. So what does Apdex really represent? Let’s take a closer look.
Start by considering what you really want to know from looking at a chart when you ask the “how am I doing?” question. Let’s assume you are focused on the user experience, and in particular on page load times as seen by the user. So you could start with a chart showing the page load time of every request processed by your server over some representative sample period. Here is a scatter plot of response times for one of our key transactions in New Relic:
This does give us all the information we want: every page request for a 20-minute period. Clearly the typical experience is maybe 1.0 – 1.2 seconds, where the center of mass is. But there are a lot of outliers, and probably many more that are off the chart and that we are missing completely. Is it really helpful? How are you doing?
Let’s try a histogram. This will show us a count of page load times within discrete buckets, giving us a better sense of how many outliers there are relative to the center of mass. Here’s the same data as above:
This feels a little more digestible. You can see the range of response times starting at around 100 ms and going up to 8 seconds. You can see that 3.6% of requests are completely off the chart. Does it really tell you how you are doing? Seems like you are getting there, but let’s try to add a little more information. Below is the same histogram but with markers to indicate the mean response time (green line), 95th percentile response time (dashed line) and the median or 50th percentile (red line). Also indicated are the middle two quartiles, the red region, representing the response times that fall between the 25th and 75th percentiles.
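The markers described above are ordinary summary statistics. As a minimal sketch, assuming you have the raw page load times in a list, Python's standard library can compute them. The sample values are made up for illustration and are not the data behind the chart; the name `load_times` is likewise hypothetical.

```python
import statistics

# Hypothetical page load times in seconds (NOT the data from the chart).
load_times = [0.4, 0.9, 1.0, 1.1, 1.2, 1.3, 1.8, 2.4, 7.5, 12.0]

mean = statistics.mean(load_times)
median = statistics.median(load_times)

# quantiles(n=100) returns the 99 cut points for percentiles 1 through 99,
# so index 24 is the 25th percentile, index 74 the 75th, index 94 the 95th.
cuts = statistics.quantiles(load_times, n=100, method="inclusive")
p25, p75, p95 = cuts[24], cuts[74], cuts[94]

print(f"mean={mean:.2f}s  median={median:.2f}s")
print(f"middle two quartiles: {p25:.2f}s to {p75:.2f}s  p95={p95:.2f}s")
```

With a long tail like this one, a couple of slow requests pull the mean well above the median, which is why the mean on its own says little about the typical experience.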
So now we’re getting somewhere. There’s a little more shape to our data. We have a much better sense of how significant those slowest loading pages are, and how prevalent they are. Most of our users are in the 1.2 – 2.4 second range, even though our average response time is 2.5 seconds. About 25% of our users have completely intolerable response times, more than 7 seconds.
How are you doing? Not very well.
At least we have a much more informed position from the data on this chart than we would if we erased everything but the green line. In fact, we could probably use even more information. To understand our user experience we’d probably want to know if this distribution is consistent across all our users, or if there are vast differences among different browser versions or regions. Unfortunately the data we have at hand is often limited in the number of annotated attributes, so it’s hard to correlate.
Worse yet, we may not even have access to the event-level data needed to build charts like the ones above. Histograms and scatter plots require storing a lot of data. Even just a couple of percentile measurements, like the median and the 95th percentile, can be tricky when you have to aggregate and resample data. So what you are often left with is just the average response time.
Here’s where Apdex can fill the gap. Apdex is like a histogram with just three buckets: Satisfy, Tolerate and Fail. The buckets represent requests that Satisfy the user, requests that users merely Tolerate, and requests that Fail to meet the users’ expectations completely. The bucket boundaries are 0, T, and 4T, where T is a parameter you choose in advance: Satisfy covers 0 to T, Tolerate covers T to 4T, and Fail is everything slower than 4T. Here’s what that looks like if you shade the Apdex buckets in our histogram:
I chose 1500 ms as the T value. The yellow region covers the pages under 1500 ms. The red region goes from 1500 ms to 4T, or 6 seconds. The black region covers the failed pages, those that took longer than 6 seconds. The Apdex score is computed from the count of pages falling into each of these three regions: take the number of satisfying requests plus half the number of requests in the middle (tolerated) region, and divide by the total count. For this T value, our score is 0.7.
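The calculation just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not New Relic's implementation: the function name `apdex` and the sample data are hypothetical, and a request taking exactly T is assumed to count as satisfying (exactly 4T as tolerated).

```python
def apdex(load_times, t):
    """Apdex score for a list of response times (seconds) and a threshold t."""
    if not load_times:
        raise ValueError("need at least one sample")
    # Assumption: boundaries are inclusive on the faster bucket.
    satisfied = sum(1 for x in load_times if x <= t)
    tolerating = sum(1 for x in load_times if t < x <= 4 * t)
    # Failed requests (slower than 4T) contribute nothing to the numerator.
    return (satisfied + tolerating / 2) / len(load_times)


# Hypothetical page load times in seconds (NOT the data from the chart).
samples = [0.4, 0.9, 1.0, 1.1, 1.2, 1.3, 1.8, 2.4, 7.5, 12.0]
score = apdex(samples, t=1.5)
print(f"Apdex(T=1.5s) = {score:.2f}")
```

For this made-up sample, six requests satisfy, two are tolerated and two fail, so the score is (6 + 2/2) / 10.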
If you move the T value up or down, the regions adjust and so does your score. If I move the T value down to 1 second, the regions all shift left and my Apdex score goes from 0.7 to 0.49.
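To see that sensitivity to T, here is a small sketch that scores the same sample against several thresholds. The data and the `apdex` helper are hypothetical, with the same inclusive-boundary assumption as any simple implementation of the three buckets; the numbers will not match the 0.7 and 0.49 from the real chart data.

```python
def apdex(load_times, t):
    """Apdex score: (satisfied + tolerating / 2) / total."""
    satisfied = sum(1 for x in load_times if x <= t)
    tolerating = sum(1 for x in load_times if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(load_times)


# Hypothetical page load times in seconds (NOT the data from the chart).
samples = [0.4, 0.9, 1.0, 1.1, 1.2, 1.3, 1.8, 2.4, 7.5, 12.0]

# Score the same requests against progressively more lenient thresholds.
thresholds = [0.5, 1.0, 1.5, 2.0]
scores = [apdex(samples, t) for t in thresholds]

for t, s in zip(thresholds, scores):
    print(f"T={t:.1f}s  Apdex={s:.2f}")
```

Lowering T can only move requests into a worse bucket, so the score never rises as T shrinks.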
With T at 1 second, we are basically treating every request above 4 seconds as a failure. It doesn’t matter if it’s 4 seconds or 40 seconds: the result is the same, we failed. We are graded harshly for requests longer than 1 second, and anything under 1 second is viewed as serving our users. The difference between 800 ms and 1000 ms is not important to us because it’s probably not noticeable by our users.
So when you ask how well you are serving your customers right now, this isn’t a bad place to start. Pick a T value that characterizes your expectations for your site. We are pretty comfortable with a description based on T = 1 second, so we’ll use that. Now we can use the score to answer the question of how well we are serving our customers on a scale from 0 to 100. You’ll be in a much better place than you would be with an answer that simply states the average response time is 2 seconds.
How are we doing? We are at a 49 on a scale from 0 to 100. Not very good at all.
The other nice thing about an Apdex score is that it allows you to answer the question quickly for an array of web transactions or applications. You can set a T value individually for each key transaction or application, see a list of scores, and quickly identify where you need to focus for improvement. Looking at a column of response times isn’t really going to help, since those times might mean different things for each individual transaction or application. Is a response time of 2 seconds good or bad? It probably depends on a number of different assumptions. Many of those assumptions can be encapsulated in a well-considered T value.
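As a rough sketch of that idea, the snippet below gives each transaction its own T and ranks the transactions from worst to best score. The transaction names, thresholds and timings are all invented for illustration.

```python
def apdex(load_times, t):
    """Apdex score: (satisfied + tolerating / 2) / total."""
    satisfied = sum(1 for x in load_times if x <= t)
    tolerating = sum(1 for x in load_times if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(load_times)


# Hypothetical transactions: name -> (T in seconds, observed load times).
transactions = {
    "checkout": (2.0, [1.5, 1.9, 2.5, 3.0, 9.0]),
    "search": (0.5, [0.2, 0.3, 0.4, 0.6, 0.7]),
    "home": (1.0, [0.5, 0.8, 5.0, 6.0, 7.0]),
}

# Rank by score, lowest first, so the transactions needing attention lead.
ranked = sorted(
    ((name, apdex(times, t)) for name, (t, times) in transactions.items()),
    key=lambda pair: pair[1],
)

for name, score in ranked:
    print(f"{name:10s} Apdex={score:.2f}")
```

Because each score is already normalized against that transaction's own expectation, the scores are comparable in a way the raw response times are not.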
But the most important thing about an Apdex score is that unlike histograms and percentiles it can easily and inexpensively be collected and re-sampled.
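One way to see why: the score depends only on three counters, and counters from separate time slices can simply be added together. The sketch below assumes a hypothetical `ApdexCounts` record kept per minute; it illustrates the idea and does not describe how any particular APM product stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApdexCounts:
    """Per-interval counters; enough to recompute Apdex for any window."""
    satisfied: int = 0
    tolerating: int = 0
    total: int = 0

    @classmethod
    def from_times(cls, load_times, t):
        satisfied = sum(1 for x in load_times if x <= t)
        tolerating = sum(1 for x in load_times if t < x <= 4 * t)
        return cls(satisfied, tolerating, len(load_times))

    def __add__(self, other):
        # Merging two intervals is plain addition of the counters.
        return ApdexCounts(
            self.satisfied + other.satisfied,
            self.tolerating + other.tolerating,
            self.total + other.total,
        )

    @property
    def score(self):
        if self.total == 0:
            raise ValueError("no requests recorded")
        return (self.satisfied + self.tolerating / 2) / self.total


# Hypothetical load times (seconds) for two consecutive minutes.
minute_1 = [0.4, 0.9, 1.0, 1.1]
minute_2 = [1.2, 1.3, 1.8, 2.4, 7.5, 12.0]
T = 1.0

per_minute = [ApdexCounts.from_times(m, T) for m in (minute_1, minute_2)]
merged = sum(per_minute, ApdexCounts())

# The merged counters give the same score as the raw events would.
from_raw = ApdexCounts.from_times(minute_1 + minute_2, T)
print(merged.score, from_raw.score)
```

A median or 95th percentile has no equivalent shortcut: two per-minute medians cannot be combined into the exact median of the full window without going back to the underlying events.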
Note: The charts and data used in this post are available for browsing using Marlowe, a tool I developed for experimenting with different visualizations, available on GitHub.
New Relic users can check their own Apdex score here.
The Problem with Averages, David Heinemeier Hansson
What the Mean Really Means, Brendan Gregg
What Do You Mean: Revisiting Statistics for Web Response Time Measurements, David M Ciemiewicz (2001)
Marlowe data exploration tool