New Relic User Group (NRUG) meetings provide opportunities for our community of users to come together and share best practices from their own experiences with New Relic products. As New Relic’s field and community marketing manager, I have had the privilege of personally attending more than 10 NRUGs around the world, and have witnessed community members sharing a wide variety of helpful tips and tricks.
This post is designed to let an even wider group of New Relic users take advantage of the community’s expertise, focusing on how to use New Relic Alerts in the real world. We’ll start with insights about what you should set alerts on, talk about what criteria to use to trigger alerts, and then walk you through how to configure your Alert Policies.
What should I measure?
If you are in operations, you’re likely interested in measuring the uptime and response time of your application. Holding third-party assets accountable for their service level agreements (SLAs) is also important. Alerts are a great way to measure adherence to your SLAs and receive warnings of things you should be aware of—imagine that!
Here’s a short list of possible items to monitor and receive alerts about:
- Average response time
- Average request volume / minute
- Error rate
- AWS instance overage
- Basket size minimum
- Unreachable third-party assets
While far from exhaustive, this list should be enough to get you started with possible Alert Policies. Over time, you might decide that there are other metrics that you need to keep an eye on. As you identify opportunities for monitoring, you can expand the scope of what your team covers. (If you’re doing something cool with Alerts, let me know about it at firstname.lastname@example.org and maybe we’ll feature it in a future blog post!)
How do you establish the criteria for your newest alert policy? Well, everything starts with understanding your goals. Initially, you should create a baseline for each of your metrics and alerting criteria. Let’s say you’re aiming for a 4-second average response time for your application, yet currently your application is barely responding within 7 seconds.
It’s best to start with limits based on your current performance levels rather than on your ideal. So if your current page load times are about 4 seconds, you might want to set limits at 5 seconds. You can adjust your alert policy closer to your ideal as your team fixes performance issues and revisits what is considered “acceptable.” It is entirely possible that you won’t refactor certain parts of your application for months (or even years, yikes!), but having visibility into applications performing outside of their norms is exactly why monitoring is critical.
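To make the "start near your baseline, then tighten" idea concrete, here is a minimal sketch. The function names, the 25% headroom, the half-second step, and the 3-second target are all assumptions chosen for illustration; they are not New Relic settings.

```python
from statistics import mean


def starting_limit(baseline_seconds, headroom=0.25):
    """Return an initial alert limit a bit above the observed average.

    `headroom` is a hypothetical margin (25% by default), not a New Relic value.
    """
    return mean(baseline_seconds) * (1 + headroom)


def tighten(current_limit, target, step=0.5):
    """Move the limit one step toward the target without going past it."""
    return max(target, current_limit - step)


# Made-up samples that average 4 seconds, as in the example above.
recent_load_times = [3.5, 4.0, 4.5]

limit = starting_limit(recent_load_times)
print(limit)  # 5.0

# After the team ships a performance fix, pull the limit toward a
# hypothetical 3-second goal.
print(tighten(limit, target=3.0))  # 4.5
```

Revisiting the limit in small steps like this keeps the alert meaningful while the application improves, instead of alerting constantly against a goal you have not reached yet.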
Not all applications will perform the same, and most alerting decisions should be made on a case-by-case basis. The goal is to evaluate your current state, then create alerts that are useful without being fatiguing (see 10 Ways to Find Your Alerting Sweet Spot with New Relic). You want to make sure you get alerted for real problems, but not bothered about stuff that doesn’t really matter. (For more information about limiting alert fatigue, check out our blog post on the topic, Analyze Alert Fatigue With PagerDuty and Insights.)
How do I configure my Alert Policy?
Enough with the theoreticals. Let’s get into the nitty-gritty and look at how to go about creating an alert for a particular metric: Apdex. We’ll cover the basic usage of Apdex, and how to set up an alert to notify you when Apdex falls below a certain figure. As a best practice, try to set your Apdex threshold (the T value) close to your average response time. Ideally, your threshold should allow your application’s Apdex to generally fall between 0.85 and 0.94. (If you’d like to learn more, check out our Apdex documentation.)
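If it helps to see where an Apdex figure comes from, the standard Apdex formula counts requests at or under the threshold T as satisfied, requests between T and 4T as tolerating (worth half), and everything slower as frustrated. The sketch below applies that formula to made-up response times; the function name and sample data are only illustrative.

```python
def apdex(response_times, t):
    """Compute an Apdex score from response times (seconds) and threshold t.

    satisfied:  response time <= t
    tolerating: t < response time <= 4 * t (counts as half)
    frustrated: anything slower (counts as zero)
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)


# Made-up sample: two satisfied, one tolerating, one frustrated request.
print(apdex([0.2, 0.4, 1.0, 3.0], t=0.5))  # 0.625
```

Raising T moves more requests into the satisfied bucket and pushes the score up, which is why the choice of threshold and the choice of alert level need to be made together.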
As an example, let’s set up two alert policies around Apdex: one to calmly alert the team on Slack when minor problems arise, and one that sounds all of the alarms and emails the entire team when things go seriously wrong.
Alert policies typically specify a few common elements:
- Policy name
- Incident preference
- Notification channels
- Alert conditions
Policy names should be standardized by your company. In this example, the name includes the group and briefly describes the alert policy condition and purpose. One common naming convention is [team].[service].[environment].[condition].[priority]. So Operations.Storefront.Production.ApdexLow.Warning is what we’ll name this alert policy.
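If your team creates policies often, a tiny helper can keep names consistent. This is only a sketch: the function and its argument names are hypothetical, and it simply mirrors the shape of the example name above (team, service, environment, condition, priority).

```python
def policy_name(team, service, environment, condition, priority):
    """Join the name parts with dots, rejecting empty or dotted parts."""
    parts = [team, service, environment, condition, priority]
    for part in parts:
        if not part or "." in part:
            raise ValueError(f"invalid name part: {part!r}")
    return ".".join(parts)


print(policy_name("Operations", "Storefront", "Production", "ApdexLow", "Warning"))
# Operations.Storefront.Production.ApdexLow.Warning
```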
Incident preference should be dictated by your team (you can read about the various choices in the documentation). In this case, we’re going to select By policy, which sends a single alert when the policy is violated. The incident has to be resolved before you are notified of any future issues with that policy. If another alert is triggered immediately after the issue has been resolved, you should consider raising your thresholds, especially if everything is performing adequately.
Since this first alert is designed to be a warning policy, we’ll start with a Slack notification. So we’re going to select the Slack Notification Channel that we previously added.
Next, let’s add the alert conditions needed to trigger this warning. Remember, this first policy that we’re creating is a warning, not a full-on alert. Let’s set our warning to go out when Apdex falls below 0.85, and our critical alert will kick in on an Apdex of 0.70 or below.
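The two thresholds can be read as a simple decision rule. The sketch below is not how New Relic evaluates conditions internally; it is just an illustration of the boundaries described above, with hypothetical constant and function names.

```python
WARNING_BELOW = 0.85         # warning fires when Apdex falls below this
CRITICAL_AT_OR_BELOW = 0.70  # critical fires at or below this


def classify_apdex(score):
    """Map an Apdex score to 'ok', 'warning', or 'critical'."""
    if score <= CRITICAL_AT_OR_BELOW:
        return "critical"
    if score < WARNING_BELOW:
        return "warning"
    return "ok"


for score in (0.90, 0.85, 0.80, 0.70):
    print(score, classify_apdex(score))
```

Note the edge cases: an Apdex of exactly 0.85 is still fine, while exactly 0.70 is already critical.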
Now that we’ve set up our warning alert policy, let’s set up our critical alert policy. We’ll follow the same steps, naming the policy according to our internal conventions (Operations.Storefront.Production.ApdexLow.Critical) and adding the alert condition of Apdex at or below 0.70. Incident preference will stay the same. Crucially, we’re going to tie this alert to a webhook that triggers a Raspberry Pi in the corner, prompting the lights to flash and sirens to wail near our team members’ desks. If you have someone on call via PagerDuty, you’d probably want to add that Alert Channel as well.
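For the curious, here is a rough sketch of what the receiving end on the Raspberry Pi might look like, using only Python’s standard library. Everything specific here is an assumption: the payload field names (severity, current_state) depend on how your webhook payload is configured, the port is arbitrary, and sound_alarm is a placeholder where your own light and siren code would go.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def should_sound_alarm(payload):
    """Return True for an open, critical notification.

    The field names are assumptions; match them to your webhook payload.
    """
    severity = str(payload.get("severity", "")).upper()
    state = str(payload.get("current_state", "")).lower()
    return severity == "CRITICAL" and state == "open"


def sound_alarm():
    """Placeholder: drive the lights and siren here (for example, via GPIO)."""
    print("ALARM: critical Apdex alert")


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body sent by the webhook.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        if isinstance(payload, dict) and should_sound_alarm(payload):
            sound_alarm()
        self.send_response(204)
        self.end_headers()


def serve(port=8080):
    """Listen for webhook calls until interrupted (port is arbitrary)."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling serve() on the Pi starts the listener. Keeping the decision in should_sound_alarm, separate from the HTTP handling, makes it easy to test the rule without any hardware attached.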
Finishing up, and diving deeper…
That’s it! This is just a small sample of what you can do with New Relic Alerts, culled from the practical experiences of real New Relic users. Ideally, it will get your mental wheels turning to identify other things you’d like to monitor.