I’ve been on a particular crusade to eliminate alert fatigue wherever we see it in New Relic’s engineering environment. We all know it’s no fun to wake up bleary-eyed at 2am and be forced to solve a customer-facing outage. It’s particularly un-fun to be awoken at 2am only to find that the automated notification isn’t actionable, and there’s nothing to do but silence the alert and go back to bed. It may be impractical to truly eliminate all spurious alerts, but the difficulty shouldn’t stop us from coming up with new ways to find those false positives so we can all get some uninterrupted sleep.
It was with this mindset that I attended the awesome Monitorama PDX conference in my home town of Portland. The conference attracted a diverse group of Ops and Dev folks looking to discover together how we can improve the Monitoring tools and processes we use to make great teams and great products. Hats off to Jason Dixon and all his volunteers, who did a great job of organizing it.
There was a range of great talks around monitoring and DevOps, from getting started with data collection to using math for better anomaly analysis and detection. But I was most interested in one subject in particular: how others identify and reduce alert fatigue. Two of the talks dove deeply into that topic, and below I’ll give you a run-down of the most insightful points they raised.
Trimming the Fat
In particular, I really appreciated the examples that Dan Slimmon of Exosite gave in his talk “Car Alarms and Smoke Alarms”. He focused on the accuracy of our status checks and probes, and how it impacts our ability to rely on them as a clear indication that we need to act.
I took two large focus points away from it:
- You may think a check with a 90% chance of detecting a problem is good. It isn’t necessarily. What you should look at is the check’s Positive Predictive Value (PPV).
- The more available your service is, the higher your checks’ Sensitivity and Specificity need to be.
The Positive Predictive Value is the probability that something is ACTUALLY wrong when the check alerts. It is determined by Sensitivity and Specificity:
- Sensitivity is the percentage of actual failures in your service that the check correctly identifies.
- Specificity is the percentage of the time the check correctly identifies your service as working.
He provided some cool calculations showing that even when you have a service or instance with 99.9% uptime, and your check has a Sensitivity of 99% and a Specificity of 99%, your check is actually terrible. Why is this so terrible? His results show that if such a check pages you in the middle of the night, there’s only about a 1 in 10 chance that you actually need to take action. Looking back on my own experience, this doesn’t seem far from the reality facing most people who spend a significant amount of time on call.
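To make the arithmetic concrete, here’s a minimal sketch of the PPV calculation via Bayes’ theorem. This is my own illustration of the math, not code from the talk; the function name and numbers are just for demonstration:

```python
def ppv(uptime, sensitivity, specificity):
    """Probability that an alert reflects a real failure (Positive Predictive Value)."""
    p_down = 1.0 - uptime
    true_positives = sensitivity * p_down            # check fires and the service really is down
    false_positives = (1.0 - specificity) * uptime   # check fires but the service is fine
    return true_positives / (true_positives + false_positives)

# A 99.9%-uptime service with a 99%-sensitive, 99%-specific check:
print(round(ppv(0.999, 0.99, 0.99), 2))  # ≈ 0.09 — about a 1-in-10 chance the page is real
```

The intuition: because the service is almost always up, even a tiny false-positive rate (1% of a huge amount of healthy time) swamps the true positives (99% of a tiny amount of downtime).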
Outage Lifecycles and Monitoring Stacks
Dan’s talk really lined up with the great presentations given by Scott Sanders of GitHub (The Lifecycle of an Outage) and Daniel Schauenberg of Etsy (A Whirlwind Tour of Etsy’s Monitoring Stack). Each of these presentations showed how their organizations equip their engineers to reduce the burden of dealing with complex infrastructure. While the toolchains they employ are very similar to what many Ops teams already use, we can glean many ways to improve our own usage of them. They shared one commonality that I plan on implementing immediately: the auto-generated on-call report.
GitHub generates theirs via an awesome chat-ops interface, and Etsy has its Ops Weekly. Both let engineers generate a report of all the alerts that paged them and easily annotate those alerts. Most places I’ve worked place importance on sharing status information about the key outages and impacts in their environments. GitHub and Etsy have built on this idea by annotating when an alert required no action and why none was taken. This can grow into a strong body of knowledge that is essential for tuning alerts and ensuring they are actionable.
Having attended a number of DevOpsDays conferences, which tend to cover many broad DevOps topics including tools and culture, it was refreshing to see how narrowing in on monitoring and the analysis of app data can really shape how we embrace DevOps as a culture.