Editorial note: This is an update of a post that originally ran in April 2019.
For many DevOps, site reliability engineering (SRE), and operations teams, it still takes too much time to detect potential problems before they turn into incidents. Teams often work reactively, firefighting incidents while never finding time to implement processes that allow them to identify issues before they cause outages.
Every minute teams spend responding to incidents is a minute that negatively impacts their service-level objectives (SLOs), their companies’ reputations, and their teams’ bottom lines.
Even way back in 2014, Gartner estimated the average cost of a minute of downtime at $5,600; in 2020, the impact on large organizations at critical “moments of truth” could be much, much larger. Indeed, simple stats like this underscore the importance of responding quickly and effectively to any incidents that affect the availability or performance of your site.
What makes an “incident”?
Basically, an incident occurs any time a service is not available or does not perform in the way it has been defined to—typically through a formal service-level agreement (SLA). Incidents can be caused by a variety of factors: network outages, application bugs, hardware failures and, increasingly, in today’s complex and multilayered infrastructures, configuration errors.
Incident response refers to the collective processes that help detect, identify, troubleshoot, and resolve such incidents. Strongly influenced by the IT Infrastructure Library (ITIL), developed by the British government in the 1980s, incident response has evolved over the years to include many frameworks and approaches. They all share a common goal, however: giving stakeholders the tools they need to get misbehaving systems up and running again ASAP, while also making those systems more robust and reliable.
But despite its long history, incident response is still shrouded in myths and hobbled by misperceptions that prevent companies from resolving incidents as quickly and effectively as they could—and perhaps more importantly, from learning how to reduce the occurrence of incidents.
That’s why we asked incident response experts at New Relic and around the industry to identify common incident response myths and mistakes, and share their insights on best practices for optimal incident response.
Myth #1: Speed is everything
Also known as the “any-fix-is-a-good-fix” myth. Rapidly resolving issues is obviously important, especially for systems that directly touch customers. But it’s not the only thing to worry about. A bad or incomplete fix, or a temporary fix, or a fix that breaks something else downstream, can be dangerous to implement in the name of speed.
“A lot of lip service is paid to the need for quality and customer satisfaction in incident response, but when you look at a lot of the metrics for measuring incident response success, they actually mostly focus on efficiency: how fast an issue is resolved,” says Christoph Goldenstern, vice president of innovation and service excellence at Kepner-Tregoe, a training and consulting firm specializing in incident response.
Instead, businesses should focus on the effectiveness of the end result as well as the speed. “Are we ultimately giving the customer resolution in the long term?” Goldenstern asks. “Are we preventing the same thing from happening again? Those are the questions to ask.”
He adds that focusing on “lagging indicators,” or looking backwards to measure how something was done, is not terribly effective. Rather, he says, businesses should concentrate on improving behaviors that drive better and long-lasting results, and create metrics around those.
One metric that Kepner-Tregoe encourages clients to use is the time it takes to get to a good statement of the problem at hand. “We know from our research that the quality of the problem statement is a direct driver of lower resolution time and higher customer satisfaction,” Goldenstern says. “Training your people to create clear, concise, and precise problem statements as quickly as possible will serve you better than simply putting a fix into place.”
Myth #2: Once you’ve put out the fire, you’re done
This myth is, happily, slowly being eradicated. These days, it’s fairly standard to have some kind of postmortem or internal retrospective after resolving an incident. The point is to proactively learn from the incident to make your systems more robust and stable, and to avoid similar incidents in the future. The relevant phrase here is, “proactively learn.”
“It’s really important to incentivize measures for prevention as opposed to just resolving incidents in reactive mode,” says Adam Serediuk, director of operations for xMatters, a maker of DevOps incident-management tools. If you don’t dictate that your incident lifecycle doesn’t end until that postmortem is completed and its findings are accepted or rejected, “you’re effectively saying, ‘we’re not really interested in preventing future incidents,’” says Serediuk. There’s a difference, he adds, between reacting and responding. You could react to an incident, for example, by throwing some of your rock stars at it, and fixing it right away. “But that process can’t be easily repeated,” he notes, “and it can’t scale.”
It’s important to think of incident response as an end-to-end process in which the response is measured, iterative, repeatable, and scalable, agrees Branimir Valentic, a Croatian ITIL and ISO 20000 specialist at Advisera.com, an international ITSM consultancy. “The point of incident response is not just to resolve, but to go much deeper, and to learn,” he says.
One risk is that over time the postmortem can turn into a rote exercise, just a box to be checked by jaded engineers. Don’t let the postmortem become busywork. Learning from incidents is incredibly valuable but also challenging, and it requires you to constantly tune and adapt your approach to figure out how to learn effectively.
Myth #3: Report only major issues that customers complain about, to avoid making IT look bad
Another prevalent myth holds that you shouldn’t be overly communicative about your incidents. If you report every incident, the reasoning goes, IT can look as though it’s failing. It’s better to keep your head down and acknowledge and communicate only the serious incidents that customers have noticed and reported.
That’s the theory, anyway, but it’s a really bad idea. Customers—and internal stakeholders—want to feel that you’re being honest and transparent, and that they can trust you to detect and acknowledge incidents that could impact them. Hiding incidents—even minor ones—can destroy that trust.
You shouldn’t view it as a black mark against your IT organization when things break. Having incidents is just part of the game. The key is what you do about them.
Be proactive about communicating, both internally and to customers. A lot of companies are paranoid about sharing any information unless they’re basically forced to, but that’s a mistake. Be transparent.
Myth #4: Only customer-impacting incidents matter
A related myth is that only incidents that impact external customers are relevant. In fact, some organizations even define incidents solely as “customer-impacting disruptions.” But believing that myth will reduce your overall incident response effectiveness. Again, the idea is that incident response should be a learning experience—and that you should take proactive actions based on that learning.
“There’s a lot to learn from internal misses and internal-only incidents. They might even be some of your best learning experiences because it’s a chance to hone your response process and learn without pressure,” says xMatters’ Serediuk. “It’s hard to instill true organizational change when things are on fire.”
Say your internal ticketing system goes down or your internal wiki blows up. What type of oversight or lack of control allowed that to happen? In relatively minor internal situations like these, “you can learn under less pressure and perhaps avoid production incidents later on,” says Serediuk. With lower pressure you may be able to focus a little more purposefully on why you had a particular problem, as well as how to prevent it from popping up again.
Myth #5: Systems will always alert you when they’re in pain
Operations folks tend to monitor what they believe to be important. But they’re not always right. When that happens, a system could be in trouble, and your team could be blissfully ignorant. Historically, ops teams looked at such metrics as disk utilization, CPU usage, and network throughput. “But the issue is really, is the service healthy?” says Serediuk.
This comes down to the difference between macro and micro monitoring. In micro monitoring, you’re looking at individual components such as CPU, memory, and disk. With macro monitoring, you’re looking at the bigger picture: how the behavior of the system as a whole affects its users.
“This is where service level objectives [SLOs] and service level indicators [SLIs] come into play,” says Serediuk. “You’re judging things by the user experience.” For example, if all of a sudden your web requests per second drop to zero, you know you have a problem. If you were merely doing micro monitoring, such as keeping tabs on memory utilization, you could have missed it. “By looking at the metric that mattered—whether users are engaging with my system,” he notes, “I catch something that I might not otherwise have noticed.”
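To make the macro/micro distinction concrete, here is a minimal sketch in Python. The `Sample` fields, the thresholds, and the function names are all hypothetical, chosen only for illustration; a real monitoring setup would take these values from its own metrics and SLOs.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical monitoring snapshot for a web service."""
    cpu_pct: float           # micro signal: CPU utilization
    mem_pct: float           # micro signal: memory utilization
    requests_per_sec: float  # macro signal: are users reaching the service?


def micro_alert(s: Sample, cpu_limit: float = 90.0, mem_limit: float = 90.0) -> bool:
    """Fire only when an individual component looks overloaded."""
    return s.cpu_pct > cpu_limit or s.mem_pct > mem_limit


def macro_alert(s: Sample, min_rps: float = 1.0) -> bool:
    """Fire when user-facing traffic falls below an assumed floor."""
    return s.requests_per_sec < min_rps


# Traffic has dropped to zero, so the machine itself looks idle and "healthy."
outage = Sample(cpu_pct=3.0, mem_pct=41.0, requests_per_sec=0.0)

print(micro_alert(outage))  # False: component metrics see nothing wrong
print(macro_alert(outage))  # True: the user-facing indicator catches it
```

In this made-up snapshot, the component-level check stays quiet precisely because no requests are arriving, while the user-facing check flags the problem.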
Myth #6: You can tell how well your incident management (IM) processes are working by your mean time to resolution (MTTR)
The MTTR is just what it says: the mean (average) time it takes to resolve an incident. But problems abound with depending on this metric as your barometer for incident response success. For starters, not all incidents are created equal. Simple, easy-to-resolve incidents should not be judged with the same metric as more complicated ones.
“How do you compare an enterprise-wide email service going down with an application that has only a handful of users, that maybe suffers from one easily resolved incident every other month?” asks Randy Steinberg, a solutions architect with IT consulting firm Concurrency. “Incidents are so varied, it’s not a good barometer of how well you’re doing.”
Also, measuring MTTR is itself an art, not a science. For example, when does the clock start ticking? Is it when an application starts slowing down? When you get your first alert? When a customer notices? The boundaries of complex systems are so fluid, this is a difficult metric to capture consistently. MTTR can be useful if your incident response time is so poor that you’re trying to get it down to an acceptable number; otherwise, it can be very misleading.
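A small worked example shows both problems at once. The incident timestamps below are invented for illustration, as are the variable and function names; the sketch simply computes the mean under two different definitions of when the clock starts.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incidents: (degradation started, first alert, resolved).
incidents = [
    (datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 9, 5), datetime(2020, 1, 6, 9, 20)),
    (datetime(2020, 1, 14, 13, 0), datetime(2020, 1, 14, 13, 10), datetime(2020, 1, 14, 13, 30)),
    (datetime(2020, 2, 3, 2, 0), datetime(2020, 2, 3, 3, 0), datetime(2020, 2, 3, 10, 0)),
]


def minutes(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


# Clock starts when the service first degraded.
from_degradation = [minutes(started, resolved) for started, _, resolved in incidents]
# Clock starts when the first alert fired.
from_alert = [minutes(alerted, resolved) for _, alerted, resolved in incidents]

print(from_degradation)  # [20.0, 30.0, 480.0]
print(f"MTTR from degradation: {mean(from_degradation):.0f} min")  # 177 min
print(f"MTTR from first alert: {mean(from_alert):.0f} min")        # 152 min
print(f"Median from degradation: {median(from_degradation):.0f} min")  # 30 min
```

Two of the three incidents were resolved in half an hour or less, yet a single long outage pulls the mean to nearly three hours, and simply moving the start of the clock shifts the result by 25 minutes.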
Myth #7: We’re getting better at IM because we’re detecting issues faster and earlier
Thanks to the increased efficacy and granularity of automated monitoring and alerting tools like New Relic, businesses are getting much better at detecting incidents than was previously possible. But that doesn’t mean we’re getting better at incident response. Detecting an incident is only half the equation. Resolving it is the other half.
“What’s interesting is that if you look at the overall process, we’re not getting better at responding to incidents in general,” claims Vincent Geffray, senior director of product marketing at Everbridge, a critical-event management company. Why? Because all the gains that we get in the first phase of the process—detecting incidents sooner—are wasted in the second phase of the process, which involves finding the right people to resolve the issue. “It can take a few minutes to detect an issue and then an hour just to get the right people to the table to begin figuring out a solution,” he says.
The remedy? AI and machine learning can help by assessing historic data related to incidents and suggesting possible responders based on previous incidents. Additionally, take the time to study the steps in the incident response process, with an eye toward making them more efficient. That’s where the biggest gains have yet to be achieved.
“What happens in real life,” Geffray says, “after a tool like New Relic has identified a problem with an application, is that a ticket is created in your ticketing system, and then you have to find the right people, get them together, and give them the information they need so they can start investigating.” In most cases, it’s not going to be one person. “Studies show that most IT incidents require a minimum of five people to be resolved,” he notes. “And as you can imagine, the higher the number of mission-critical applications, the larger and more distributed the organization, the more time it takes.”
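The idea of suggesting responders from historic incident data can be illustrated with a deliberately simple sketch. This is a plain frequency count, not machine learning, and it is not how any particular product works; the service names, people, and the `suggest_responders` function are hypothetical.

```python
from collections import Counter

# Hypothetical history: (affected service, people who helped resolve it).
history = [
    ("checkout", ["ana", "raj"]),
    ("checkout", ["ana", "li"]),
    ("search", ["li", "tom"]),
    ("checkout", ["raj", "ana"]),
]


def suggest_responders(service, past_incidents, top_n=2):
    """Rank people by how often they helped resolve incidents on this service."""
    counts = Counter()
    for past_service, responders in past_incidents:
        if past_service == service:
            counts.update(responders)
    return [name for name, _ in counts.most_common(top_n)]


print(suggest_responders("checkout", history))  # ['ana', 'raj']
```

Even a crude ranking like this gives whoever opens the ticket a starting list of names; a production system would presumably also weigh factors such as on-call schedules and how recent the past incidents were.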
Myth #8: A “blameless culture” means no accountability for incidents
This is an important myth to dispel, given the (overwhelmingly positive) movement in the IT industry toward a blameless culture.
On the plus side, a blameless culture removes fear from the incident response equation: People are much more likely to be candid and transparent when they know they are not going to be fired for making a mistake. But that doesn’t mean no accountability. You should still identify which mistakes were made, and by whom, so as to learn from them.
There’s a big difference between accountability and blame. Blame typically misunderstands the nature of complex systems, in which a particular mistake is likely to be more of a triggering event that tips over the dominoes of latent failures. A blameless culture actually enables true accountability, because individuals and teams feel safe enough to be transparent about missteps so that the organization can improve the overall system.
Myth #9: You need a dedicated IM team
While some companies choose to have a separate, dedicated incident response team, others prefer to rotate people through regular IT engineering jobs. In fact, there are many reasons why you would want incident response skills to be distributed throughout your IT organization.
In a DevOps approach, any engineer should be able to respond to any incident in any role. Responses to day-to-day incidents should be distributed across the whole organization.
Empowering any engineer who has the necessary information to make tough calls during an incident is crucial. Responders need to know that they can do their best and make the call.
Of course, all this requires intense, in-depth, and continuous training, as well as repeatable, iterative processes. You want to have the best possible resources in place to address the biggest incidents, which requires proper organization and well-honed processes. Every engineer on call should have sufficient training and enough experience to make good calls, and they should also have support should a call go sideways.
Check out Accelerate Incident Response with AIOps to learn more about how New Relic Applied Intelligence can help you improve your incident response process and bust these myths.