A couple of months ago, we saw a significant drop in the Apdex (application performance index) score for New Relic’s Key Transactions index page. This was concerning, since the page receives a good amount of traffic and shows summary information for our users’ most important transactions. By using New Relic Insights, we were able to pinpoint exactly what caused the drop (spoiler alert: it wasn’t anything specific that we had done!) and get a fix in place ASAP. Here’s the play-by-play:
Spotting the Initial Problem in New Relic APM
It all started when we noticed this drop in Apdex score. The corresponding increase in response times seemed to begin early in the morning of the previous Friday:
When we first started investigating this issue, we tried a few different avenues, all focused on changes that we may have made on our side:
- Reviewed the commits that had gone out in the past few production deploys
- Looked for any changes to services this page depended on
- Asked our Site Engineering team whether they had made any infrastructure changes around that time
- Compared transaction traces before/after this point in time
None of these avenues yielded particularly interesting leads, until we took a look at the histogram of response times for this page:
Typically, we’d expect to see some outliers, but that entire second bump on the right side of the histogram was pretty strange! To drill down into just that segment of response times, we turned to New Relic Insights.
Digging Deeper with New Relic Insights
To start off easy and get familiar with writing queries for the data we wanted, we reconstructed the histogram in Insights and verified that we were looking at the same two-bump chart:
To focus on that second bump, we modified the query to limit it to only those responses that were longer than 1.5 seconds. Then, just for fun, we thought we’d try faceting the data by account:
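As a rough illustration of that filter-and-facet step, here is a minimal Python sketch. It is not the Insights query itself: the function name, the account labels, and the sample data are all hypothetical, and only the 1.5-second cutoff comes from the investigation described here.

```python
from collections import Counter


def facet_slow_requests(requests, threshold_seconds=1.5):
    """Count requests slower than the threshold, grouped ("faceted") by account.

    `requests` is an iterable of (account, response_time_seconds) pairs.
    Returns {account: (slow_count, share_of_all_slow_requests)}.
    """
    counts = Counter(
        account for account, duration in requests if duration > threshold_seconds
    )
    total_slow = sum(counts.values())
    if total_slow == 0:
        return {}
    return {
        account: (count, count / total_slow)
        for account, count in counts.most_common()
    }


# Hypothetical sample data, for illustration only.
sample_requests = [
    ("acct_1", 0.3),
    ("acct_2", 1.8),
    ("acct_2", 2.4),
    ("acct_2", 3.1),
    ("acct_3", 0.6),
    ("acct_2", 1.9),
    ("acct_1", 1.6),
]

for account, (count, share) in facet_slow_requests(sample_requests).items():
    print(f"{account}: {count} slow requests ({share:.0%} of slow traffic)")
```

An account whose share of slow requests towers over the rest, as `acct_2` does in this made-up sample, is exactly the kind of signal that faceting surfaces.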
Over 96 percent of these slow responses came from one customer! We could then focus on just this one customer’s usage of the Key Transactions index page:
The increase in their requests to this page corresponded exactly to the timing of the drop in the Apdex score for the Key Transactions index page.
Here’s that Apdex chart with the same data that we saw in APM:
To confirm that the drop in Apdex score did indeed stem from the sudden increase in requests by this customer, we took a look at the Apdex score for this page with this customer excluded, which showed that the Apdex score for other customers had actually stayed pretty steady:
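To make the comparison concrete, here is a small Python sketch of the standard Apdex calculation, computed with and without one account. The function names and the sample data are hypothetical, and the 0.5-second threshold T is an assumption, since the threshold configured for this page isn't stated here.

```python
def apdex(durations, t=0.5):
    """Standard Apdex: (satisfied + tolerating / 2) / total.

    A request is "satisfied" if it takes at most t seconds and
    "tolerating" if it takes more than t but at most 4 * t seconds.
    Anything slower counts as "frustrated". t=0.5 is an assumed threshold.
    """
    durations = list(durations)
    if not durations:
        return None
    satisfied = sum(1 for d in durations if d <= t)
    tolerating = sum(1 for d in durations if t < d <= 4 * t)
    return (satisfied + tolerating / 2) / len(durations)


def apdex_excluding(requests, excluded_account, t=0.5):
    """Apdex over (account, duration) pairs, leaving out one account."""
    return apdex((d for account, d in requests if account != excluded_account), t)


# Hypothetical sample: account "b" is responsible for all the slow responses.
sample = [("a", 0.2), ("a", 0.4), ("b", 3.0), ("b", 2.5), ("a", 1.0), ("b", 2.2)]

print("All accounts:", apdex(d for _, d in sample))
print("Excluding 'b':", apdex_excluding(sample, "b"))
```

If the score without the suspect account stays steady while the overall score drops, the slowdown is concentrated in that account's traffic rather than spread across every user of the page.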
Combined, these charts formed a dashboard that told the story behind the data we’d seen in APM:
When we see issues with our site’s performance in New Relic APM, our first instinct is to look for a recent change on our side that might have caused the issue, so that we can reverse it. By using New Relic Insights, we were able to discover greater depth behind the numbers. We found that a sharp increase in requests from a customer who had particularly slow responses on this page had dragged down the overall Apdex score. So rather than fixating on finding a recent commit to revert, we could focus on improving the experience for those high-volume customers.
Since faceting by account was the key to solving this puzzle, one interesting use of New Relic Insights dashboards would be to set up a generic dashboard template, with widgets already configured to facet on different aspects of your data. Then, whenever you notice an issue with a key transaction in New Relic APM, you can quickly jump into Insights and filter that dashboard for that particular key transaction. This way, you can diagnose the problem at a glance by looking for any charts that stand out, freeing up your time to start on a fix ASAP.
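One way to picture such a template is a small Python helper that generates one faceted query string per attribute for a given transaction. This is only a sketch under assumptions: the attribute names in `FACET_ATTRIBUTES`, the example transaction name, and the helper itself are hypothetical, and you would substitute the attributes your own events actually carry.

```python
# Hypothetical attribute names; replace them with attributes present on your events.
FACET_ATTRIBUTES = ["account", "host", "userAgentName"]


def build_facet_queries(transaction_name, facets=FACET_ATTRIBUTES, since="1 week ago"):
    """Return {facet: query_string}, one query per attribute to facet on.

    The transaction name is assumed to be trusted input with no quote
    characters; this sketch does no escaping.
    """
    return {
        facet: (
            "SELECT count(*) FROM Transaction "
            f"WHERE name = '{transaction_name}' "
            f"FACET {facet} SINCE {since}"
        )
        for facet in facets
    }


# Example with a made-up transaction name.
for facet, query in build_facet_queries("Controller/key_transactions/index").items():
    print(f"{facet}: {query}")
```

Each generated string would back one widget, so switching the whole dashboard to a different key transaction comes down to changing a single argument.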
Want to learn how other New Relic teams are using Insights?
- Check out this post from our Marketing team about using New Relic Insights with Marketo Webhooks.