5 Common Sources of Alert Fatigue for SRE and DevOps Teams


Published Jan 23, 2020 3 min read

If you’ve ever been an on-call SRE, you’re familiar with alert fatigue: the burned-out feeling that creeps in after responding to alert after alert from tons of services and tools across your stack. Not only is this phenomenon exhausting, but constant pages also limit your ability to focus on other work, even if you’re simply clicking “acknowledge” (“acking”). Research has shown that brief context switches can cost people up to 40% of their productive time. Many of the alerts behind these never-ending streams of pages are neither urgent nor important, and don’t require any human action.

So, where are they coming from?

Here are five sources of noise that can create alert fatigue and distract your on-call DevOps or SRE team from the real issues that need attention in your production system.

Irrelevant alerts

Unused services, decommissioned projects, and issues that other teams are already handling are common sources of noise. They’re prevalent enough to be annoying, but turning the alerts off at their source isn’t always worth the legwork. These notifications come from all kinds of tools in your production system and tend to get acked quickly but largely ignored, since there usually isn’t an underlying actionable issue.
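As a rough illustration of filtering this kind of noise before it pages anyone, here is a minimal Python sketch. The alert shape (a dict with a "service" key), the function name, and both sets of service names are assumptions made up for the example, not part of any particular tool.

```python
def drop_irrelevant(alerts, active_services, handled_elsewhere=frozenset()):
    """Keep only alerts for services that are still live and owned by this team.

    alerts: iterable of dicts with a "service" key (hypothetical shape).
    active_services: names of services that are still in use.
    handled_elsewhere: names of services another team is already handling.
    """
    kept = []
    for alert in alerts:
        service = alert["service"]
        if service not in active_services:
            continue  # unused service or decommissioned project
        if service in handled_elsewhere:
            continue  # another team is already on it
        kept.append(alert)
    return kept
```

A filter like this only hides the noise downstream; turning the alert off at its source is still the cleaner fix when it is worth the effort.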

Low-priority alerts

Some noisemakers indicate problems that may eventually need to be addressed, but are low on the current priority list. Keeping these alerts configured can be a useful reminder to investigate or address the root cause eventually, but in the short term, they’re probably not adding value.
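One way to keep these reminders without paging anyone is to route them to a ticket queue instead. The sketch below assumes each alert carries a "severity" string; the severity labels and the "page" and "ticket" destinations are hypothetical.

```python
# Hypothetical severity labels that should never wake someone up.
LOW_PRIORITY = {"low", "info"}


def route(alert):
    """Return "ticket" for low-priority alerts and "page" for everything else.

    Alerts with a missing or unrecognized severity are paged, so that a
    mislabeled alert is not silently downgraded.
    """
    severity = str(alert.get("severity", "")).lower()
    return "ticket" if severity in LOW_PRIORITY else "page"
```

For example, route({"severity": "low"}) sends the alert to the ticket queue, while an alert with no severity at all still pages.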

Flapping alerts

Acking flapping issues can feel like playing whack-a-mole. These alerts are a good indicator of a growing problem in your system but can be a source of distraction when you’re trying to problem-solve, sometimes prompting SREs to silence pages or blindly ack incoming issues. Unrelated issues can sometimes get lost in piles of flapping notifications, which can be a risk to your team’s ability to notice important problems.
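Flapping can be detected by counting how often an alert changes state within a look-back window. The following is a minimal sketch under assumed values: the class name, the 10-minute window, and the threshold of four transitions are all illustrative, and timestamps are plain seconds.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlapDetector:
    """Flags an alert as flapping when it changes state too often in a window."""

    window_seconds: float = 600.0  # hypothetical look-back window
    max_transitions: int = 4       # hypothetical threshold
    _last_state: dict = field(default_factory=dict)
    _transitions: dict = field(default_factory=dict)

    def observe(self, alert_key, state, timestamp):
        """Record a state report; return True if the alert is currently flapping."""
        history = self._transitions.setdefault(alert_key, deque())
        previous = self._last_state.get(alert_key)
        if previous is not None and previous != state:
            history.append(timestamp)  # a state change, e.g. open -> closed
        self._last_state[alert_key] = state
        # Forget transitions that have fallen out of the look-back window.
        while history and timestamp - history[0] > self.window_seconds:
            history.popleft()
        return len(history) >= self.max_transitions
```

Once an alert is flagged, its repeated pages could be collapsed into a single "flapping" notification so that unrelated issues stay visible.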

Duplicate alerts

Similar to flapping alerts, but more a symptom of redundant monitoring configuration than an underlying production issue, duplicate alerts can be another source of pager fatigue. You’re aware of the problem after the first notification, so additional alerts letting you know that it’s still there can add frustration.
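A common way to suppress duplicates is to fingerprint each alert by the fields that identify the underlying problem and page only once per fingerprint within a suppression window. This sketch assumes alerts are dicts with "service", "check", and "timestamp" keys (in seconds); the field names and the five-minute window are hypothetical.

```python
import hashlib


def fingerprint(alert):
    """Build a stable key from the fields that identify the problem.

    The reporting tool is deliberately left out, so the same problem seen
    by two monitors produces the same fingerprint.
    """
    raw = "|".join((alert["service"], alert["check"]))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(alerts, suppress_seconds=300.0):
    """Return the alerts that should page, dropping repeats inside the window."""
    last_paged = {}  # fingerprint -> timestamp of the last page sent
    to_page = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        key = fingerprint(alert)
        previous = last_paged.get(key)
        if previous is None or alert["timestamp"] - previous >= suppress_seconds:
            to_page.append(alert)
            last_paged[key] = alert["timestamp"]
    return to_page
```

The window is measured from the last page actually sent, so a problem that persists still produces a periodic reminder rather than going silent.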

Correlated alerts

Correlated alerts, which come from different parts of your stack but stem from the same underlying issue, are the toughest but possibly most important sources of noise to identify. Getting to the root cause is much faster when you have full context about an issue’s impact across your stack, and missing that context can lead you down rabbit holes of investigation and troubleshooting that aren’t worth your time.
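To make the idea of correlation concrete, here is a deliberately simple sketch that groups alerts into candidate incidents purely by how close together they fire. It is an assumption-laden illustration rather than how any real correlation engine works: the "timestamp" field (in seconds) and the two-minute gap are hypothetical, and time proximity alone will sometimes group unrelated alerts.

```python
def correlate_by_time(alerts, max_gap_seconds=120.0):
    """Group alerts into candidate incidents by time proximity.

    An alert joins the current group if it fired within max_gap_seconds
    of the previous alert; otherwise it starts a new group.
    """
    incidents = []
    for alert in sorted(alerts, key=lambda a: a["timestamp"]):
        if incidents and alert["timestamp"] - incidents[-1][-1]["timestamp"] <= max_gap_seconds:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents
```

Each resulting group can then be reviewed as one incident, with all of its alerts shown side by side, instead of as a series of separate pages.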

Take a quick scroll through your team’s pages from the past day or week and think about each one. How many fit into one of these categories? Noisy pages like these create distractions, build frustration, and hide real problems, and as the complexity of modern production systems continues to grow, the volume will only increase.

Cure alert fatigue with the right solution

Implementing an AIOps platform, like New Relic AI, can help you tackle alert noise across your stack and create a continuously improving, streamlined system for correlating and prioritizing incidents. New Relic AI is powered by many layers of machine learning-driven filters and logic. A correlation engine looks for all of these sources of noise. It also adapts over time to provide more relevant alerts, reducing pager fatigue and empowering your team to stay focused on important issues. Learn more about New Relic AI (currently in private beta) today.

By Guy Fighel

Guy Fighel is the General Manager of Applied Intelligence and Group Vice President of product engineering at New Relic. He leads New Relic’s AIOps product and engineering, and is responsible for the company’s overall artificial intelligence and machine learning strategy. Guy was the co-founder and chief technology officer of SignifAI, an event-intelligence company, which was acquired by New Relic in 2019.

The views expressed on this blog are those of the author and do not necessarily reflect the views of New Relic. Any solutions offered by the author are environment-specific and not part of the commercial solutions or support offered by New Relic. Please join us exclusively at the Explorers Hub (discuss.newrelic.com) for questions and support related to this blog post. This blog may contain links to content on third-party sites. By providing such links, New Relic does not adopt, guarantee, approve or endorse the information, views or products available on such sites.
