I deal with alert fatigue every single day, and not always as part of my site reliability engineering (SRE) role at New Relic.
I’m a type 1 diabetic. That means my body doesn’t produce enough insulin. And so, with the help of an application on my phone that connects to a sensor on my abdomen, I monitor my blood glucose level throughout the day. This is the metric by which I live my life.
I’ve set my continuous glucose monitor with two thresholds: 150 and 70. If I go below 70, I’m in a state of hypoglycemia, and I really need to drink some juice or stuff my face with Skittles. If the reading gets too low, I may require medical assistance. So I pay a lot of attention to when I go low.
I also have a glucose meter, which is a more accurate device that takes a measurement via a prick of my finger. Whenever I get an alarm on my monitor, I verify the value with this meter.
So the other night, at 3:21 a.m., my monitor alerts me that the level is at 66—too low. When I take a reading with my meter, however, it’s at 83—above the threshold. I go back to sleep.
But what if over time I learned to ignore these unreliable alerts, especially at 3 in the morning? I might get more sleep, but things could go really bad if one turned out to be a real emergency and I required medical assistance.
This is my personal alert fatigue.
Now imagine you’re an engineer on call, responsible for maintaining the health of a complex set of interdependent microservices with strict service level objectives (SLOs). You need to know if the alert waking you up at 3:20 a.m. is real and actionable.
Notifications are like walls of hate
When engineers are on call, they don’t want to deal with what I call “walls of hate.” A wall of hate is created when a small incident occurs in one of your environments and then blows out your chat room with a wall of PagerDuty (or similar) notifications. When teams can’t get over this wall, they soon fatigue.
At New Relic, the Reliability Fitness team (a collection of SREs) partners with engineering teams to help them create automation and software that reduces toil and technical debt. As a lead software engineer on that team, my role is to make sure that all of our engineering teams are aware of their reliability and are proactively avoiding walls of hate and alert fatigue.
In this post—adapted from a talk I gave at FutureStack18 San Francisco, titled “Combatting Alert Fatigue”—I identify six approaches to combating alert fatigue that, when applied together, should greatly reduce alert-related stress, anxiety, and fatigue across your teams:
- Measure the frequency of pages on your teams
- Use New Relic Alerts’ incident rollup strategies
- Maintain policy hygiene
- Leverage custom instrumentation
- Use baseline conditions
- Create runbooks for your conditions
What is alert fatigue?
Dirk Stanley, MD, MPH provides two useful definitions of the symptoms and conditions that lead to alert fatigue:
- Alert overload: When the number of low-risk alerts vastly exceeds the number of high-risk alerts. (When PagerDuty pings you at lunch and you snooze the alert.)
- Alert loss: When the number of ignored alerts exceeds the number of valid alerts. (When you snoozed that one alert, but it turned out to be the precursor to a Severity 1 incident that caused major customer outages.)
When the system gives you too much information, you may miss the important stuff. It’s more than just the signal-to-noise ratio—it’s a cognitive weight sitting on your brain. It makes life difficult. It’s painful. It makes it so you don’t want to be on call.
With that in mind, here’s my working definition of alert fatigue: When you’re overloaded, you overlook real issues.
It’s my contention that alert fatigue leads to increases in response times and in mean time to resolution (MTTR), all of which extends the time your customers are impacted and, ultimately, the amount of time that your business is impacted.
So what can we do about this? How can we proactively prevent alert fatigue?
1. Measure the frequency of pages on your teams
You can measure the health of your teams based on the number of pages they receive. At New Relic, we’re serious about this metric. We believe that if an on-call engineer is paged more than once or twice in an on-call period (typically one to two weeks), that’s too much. We expose this information at the engineer level with an application called PagerStats, and then we aggregate that information so managers can see how their teams are faring. Here’s an example of a week that I spent on call, aggregated with the rest of my manager Elisa Binette’s team.
We gather this data, and we act on it. For example, if an individual engineer has a really tough on-call week, we might rotate them off-duty so they get a break from that particular system. We also recognize that we need to address when a team receives too many alerts. If they’re receiving too many false positives, we spend time adjusting the thresholds for their alert conditions. If the alerts come from legitimate problems, we need to provide more resources so the team can resolve the root causes of the alerts.
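As a rough illustration of the measurement itself, here is a minimal Python sketch that counts pages per engineer for one on-call period and flags anyone over the "once or twice" budget. It is not how PagerStats is implemented; the page log, the engineer names, and the `pages_per_engineer` and `over_budget` helpers are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical page log: (engineer, time the page fired).
PAGES = [
    ("alice", datetime(2018, 9, 3, 3, 21)),
    ("alice", datetime(2018, 9, 4, 14, 5)),
    ("alice", datetime(2018, 9, 6, 23, 40)),
    ("bob", datetime(2018, 9, 5, 10, 15)),
]

MAX_PAGES_PER_PERIOD = 2  # assumption: "once or twice" per on-call period


def pages_per_engineer(pages, period_start, period_length=timedelta(weeks=1)):
    """Count the pages each engineer received during one on-call period."""
    period_end = period_start + period_length
    return Counter(
        engineer
        for engineer, fired_at in pages
        if period_start <= fired_at < period_end
    )


def over_budget(counts, limit=MAX_PAGES_PER_PERIOD):
    """Return engineers paged more often than the limit, busiest first."""
    return [name for name, n in counts.most_common() if n > limit]


counts = pages_per_engineer(PAGES, period_start=datetime(2018, 9, 3))
print(dict(counts))
print(over_budget(counts))
```

Summing the same counts by team gives the aggregated view a manager would look at.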
2. Use New Relic Alerts’ incident rollup strategies
In many systems, when an alert condition is violated, you’ll get a message for every violation unless you filter and aggregate them through another service.
However, with New Relic Alerts, you can create an incident policy for how you want to filter and aggregate your messages. You have three options for rolling up alerts related to that policy:
- By policy: All violations within a policy are grouped into a single incident; you’ll have only one open incident at a time for this alert policy.
- By condition: All violations for a condition in this policy are grouped into a single incident; you’ll have only one open incident at a time for this alert condition.
- By condition and entity: An incident will open every time an entity violates a condition in this policy.
When you create an alert policy, you’re creating a mechanism to group all conditions that might be related to a particular portion of your environment. The following example shows a policy designed around the frontend user experience; we’ve created multiple conditions that represent different aspects of how our customers are using the frontend of our application. If one of these conditions fires, it creates an incident an engineer will respond to.
Once these incidents are rolled up at the policy level, if a subsequent violation occurs while an engineer is already troubleshooting, New Relic Alerts won’t send another notification. Instead, it will attach that second violation to the original incident, so the engineer can view it when they need to.
With incident rollups, you avoid overloading engineers with duplicate messages and provide them with the context they need to resolve an issue.
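To make the difference between the three rollup options concrete, here is a small Python sketch of the grouping logic. It is a simplified model, not New Relic Alerts’ implementation: it assumes every violation arrives while the earlier incidents are still open, and the `Violation` type, the strategy names, and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Violation(NamedTuple):
    policy: str
    condition: str
    entity: str


def rollup_key(violation, strategy):
    """Return the key that decides which incident a violation joins."""
    if strategy == "policy":
        return (violation.policy,)
    if strategy == "condition":
        return (violation.policy, violation.condition)
    if strategy == "condition_and_entity":
        return (violation.policy, violation.condition, violation.entity)
    raise ValueError(f"unknown strategy: {strategy}")


def group_into_incidents(violations, strategy):
    """Group violations into incidents; each incident notifies once."""
    incidents = defaultdict(list)
    for violation in violations:
        incidents[rollup_key(violation, strategy)].append(violation)
    return dict(incidents)


violations = [
    Violation("frontend", "page load time", "web-1"),
    Violation("frontend", "page load time", "web-2"),
    Violation("frontend", "js error rate", "web-1"),
]

for strategy in ("policy", "condition", "condition_and_entity"):
    print(strategy, len(group_into_incidents(violations, strategy)))
```

With these three sample violations, rolling up by policy yields one incident, by condition two, and by condition and entity three, which is the trade-off between fewer notifications and finer-grained incidents.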
3. Maintain policy hygiene
A key way to be proactive about reducing alert fatigue is by maintaining policy hygiene. At New Relic, we spend a lot of time ensuring that our alert policies are “clean.” Specifically, we will:
- Use a consistent naming structure. When an engineer is paged, they need to identify clearly the policy issuing the alert, and they must also identify the actual metric upon which the alert condition is set.
- Empower engineers to adjust alert thresholds as needed. Your engineers work within your system daily. They’re the ones waking up in the middle of the night. If they find the thresholds aren’t correct, they should have the autonomy to adjust them as needed.
Maintaining policy hygiene protects against long-running open incidents. When engineers are troubleshooting and triaging an incident, they look to see where alerts are happening in other parts of the system. Are other teams being affected by this incident? Are their upstream dependencies affected? A lot of noise in the system, a lot of open incidents, muddies the water and makes it difficult for troubleshooters to tell whether a problem is localized or if everything is on fire.
When an engineer gets an alert, they should be able to: 1) Triage and resolve the incident; or 2) Adjust the alert thresholds so a particular alert doesn’t continue to trigger notifications.
At New Relic, we also manage thresholds with risk-matrix planning: SREs work with engineering teams to assess known or potential risks to their systems and services. With risk-matrix planning, teams can pre-validate alert accuracy and find new areas for new alerts.
4. Leverage custom instrumentation to alert on critical KPIs
New Relic language agents gather a wide variety of metrics about how your applications are running. In some cases, though, the default metrics—for example, error rate and throughput—may not be quite what you’re looking for to track the KPIs that matter to your business. For such cases, each agent provides additional API functionality in which you can record custom metrics and custom events for New Relic Insights that match your KPIs.
However, it’s easy to get carried away with the power of custom metrics and events. If you set an alert for every single KPI, you’ll make alert fatigue even worse.
When using custom instrumentation to track your KPIs, set alerts only on those conditions that demonstrate the greatest potential to impact your customers and your business. Make sure that you’re alerting on the correct metrics and not on all the things.
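For example, with the New Relic Python agent, a single business KPI could be recorded along these lines. This is a sketch under assumptions: `record_custom_event` and `record_custom_metric` are calls from the Python agent API, but the `record_checkout` function, the `CheckoutCompleted` event type, the attribute names, and the metric name are hypothetical. The import is wrapped so the snippet still runs where the agent is not installed.

```python
try:
    import newrelic.agent as nr_agent  # New Relic Python agent, if installed
except ImportError:
    nr_agent = None  # lets the sketch run without the agent


def record_checkout(order_value_usd, item_count):
    """Record one checkout as a custom event plus a custom metric.

    The event type, attribute names, and metric name are hypothetical.
    """
    attributes = {"orderValueUsd": order_value_usd, "itemCount": item_count}
    if nr_agent is not None:
        # Custom event: queryable with NRQL by its attributes.
        nr_agent.record_custom_event("CheckoutCompleted", attributes)
        # Custom metric: a single numeric value under a Custom/ name.
        nr_agent.record_custom_metric("Custom/Checkout/OrderValue", order_value_usd)
    return attributes


record_checkout(42.50, 3)
```

The point of the sketch is restraint: one event type for one KPI that matters to customers, rather than an alert condition for every number the application can emit.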
Tip: Visit New Relic Developers to learn more about using custom events and New Relic APIs to get your data in and out of New Relic.
5. Use baseline conditions
Earlier in this post, I talked about the importance of adjusting thresholds on your alerts. That’s another way of saying that you can create baseline conditions for your alerts. Baseline conditions in New Relic Alerts can be set on any applications you’re monitoring with New Relic APM or New Relic Browser. You can also set baseline conditions against New Relic Query Language (NRQL) queries.
After you choose the APM metric or NRQL query for which you want to create an alert condition, Alerts uses previous values for that data to dynamically predict the data’s upcoming behavior—this is the baseline. From there you set your thresholds—including critical or warning thresholds. If your data escapes the predicted “normal” behavior based on these options, your team will receive an alert. You’re in charge of when and how you’ll be alerted, so hopefully you know where the lines of fatigue are drawn.
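The general idea of a baseline can be sketched in a few lines of Python. This post does not describe how New Relic Alerts actually predicts the baseline, so the example below substitutes a deliberately simple stand-in: a band of the mean plus or minus a chosen number of standard deviations over previous values. The function names, the sample response times, and the default of three deviations are all assumptions.

```python
from statistics import mean, stdev


def baseline_band(history, num_deviations=3.0):
    """Predict a "normal" band from previous values (mean +/- k * stdev)."""
    center = mean(history)
    spread = stdev(history)
    return center - num_deviations * spread, center + num_deviations * spread


def violates_baseline(history, new_value, num_deviations=3.0):
    """Return True when new_value escapes the band predicted from history."""
    low, high = baseline_band(history, num_deviations)
    return not (low <= new_value <= high)


# Hypothetical recent response times in milliseconds.
response_times_ms = [120, 125, 118, 130, 122, 127, 121, 124]

print(violates_baseline(response_times_ms, 126))  # within the usual range
print(violates_baseline(response_times_ms, 210))  # well outside it
```

Lowering `num_deviations` makes the condition more sensitive and raising it makes it quieter, which is the same trade-off you make when you choose critical and warning thresholds.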
6. Create runbooks for your conditions
The purpose of any alert condition you create is to tell the on-call engineer that there’s a problem they need to investigate. The on-call engineer can become very frustrated, however, if they’re paged at 3 a.m. and have no idea why that particular alert condition was created, or what they can do about it.
This is where runbooks come into play. Runbooks for alerts should include:
- A description of why the alert was created
- What the alert is monitoring
- What that alert indicates about the state of your system
- Initial steps for an on-call engineer to begin triage
In New Relic Alerts, when you create an alert condition, you can also specify a URL to your runbook, which you should keep in GitHub or some other documentation repository.
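One way to keep runbooks honest is to treat the four items above as required fields and check them before a condition goes live. The following Python sketch assumes a hypothetical `Runbook` record and `missing_sections` helper; the sample text and the example.com URL are placeholders, not a real runbook.

```python
from dataclasses import dataclass, fields


@dataclass
class Runbook:
    """The four items a runbook should cover, plus where it lives."""
    why_created: str
    what_is_monitored: str
    what_it_indicates: str
    first_triage_steps: str
    url: str


def missing_sections(runbook):
    """Return the names of runbook fields that are empty."""
    return [
        f.name for f in fields(runbook)
        if not getattr(runbook, f.name).strip()
    ]


runbook = Runbook(
    why_created="Checkout latency has been an early sign of trouble.",
    what_is_monitored="Response time of the checkout service.",
    what_it_indicates="Customers are waiting too long to complete purchases.",
    first_triage_steps="",  # left empty on purpose to show the check
    url="https://example.com/runbooks/checkout-latency",
)

print(missing_sections(runbook))
```

A check like this could run in the same repository that stores the runbooks, so an alert condition never ships pointing at a half-written page.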
Fatigue no more
Most engineers and operations folks would agree that alert fatigue is real and harmful to team health. Alert overload leads to loss—loss of time, loss of spirit, and, potentially, loss of revenue.
What matters is how you combat these risks. The six strategies listed here can help your engineers learn to trust the alerts and notifications they set up. And trust, in turn, creates confidence that every alert they receive will lead to action—not to fatigue.
To learn more about design principles and strategies for alerts, check out our tutorial on Effective Alerting in Practice: An Introductory Guide to Creating an Alerting Strategy for DevOps Teams.