No matter what you do, sometimes your software fails. Maybe a key API server has lost its configuration. Maybe a database has lost connection with the authentication service on your frontend. Maybe finance forgot to pay the cloud hosting bill.
And when it does, your company’s Twitter account blows up, and the support phones ring off the hook. Red lights flash everywhere. As engineers scramble through the building, your customers get more and more irritated. It’s all hands on deck. You gather all your engineers into a huddle and say, “Ok, what’s our plan?”
With the amount of data we process per minute at New Relic, we can’t afford to figure out a plan during an incident. It’s critical to our business model that we act quickly, and with the utmost efficiency. We have to have a clear plan ready to go.
In part one of this two-part series—Managing On-Call Rotations the New Relic Way—we looked at our on-call practices at New Relic. Part two offers an overview of our incident-response process, which has evolved over the past several years as we’ve discovered what works best for our systems and our people. Our reliability engineering team updates the process and documentation regularly and makes sure everyone who goes on call has the resources they need to respond effectively to outages.
What qualifies as an incident at New Relic?
When our system behaves in an unexpected way that might negatively impact our customers, that’s an incident. Incidents are ranked on an internal severity scale. We include events that aren’t currently impacting customers but could turn into something worse if we aren’t careful, such as manual production changes. Customer-impacting incidents range from a bug in a minor feature (one of the lowest severities) to brief, full product outages (the highest and most rare severity). We classify the most severe incidents as emergencies, and these typically require elevated responses from Legal, Support, and Leadership teams.
During incidents, it’s always critical that we think about the incident in the context of the customer experience, the level of alarm we need to raise, and the assistance and urgency we need to resolve the problem. Often it’s a case of assessing the difference between a bug that can be repaired and deployed versus an actual service interruption.
Our incident severities are based on a scale of 1 to 5, where 1 is the most severe, and are clearly defined in our internal documentation. An incident with severity level 5 should never have customer impact, and may be declared simply to raise awareness of something such as a risky deployment of one of our services. Level 4 incidents involve minor bugs or minor data lags that affect but don’t hinder customers. Level 3 is declared for incidents such as major data lags or unavailable features. The most severe incident levels, 2 and 1, are raised for incidents like the Kafkapocalypse from several years ago.
In the course of shaping our incident-response processes, we discovered that it’s important to make clear that during an incident, severities are used to determine how much support we need; whereas after an incident, severities are used to identify customer impact. We encourage engineers to escalate quickly during an incident so that they can get the support needed to resolve the problem. After the incident is over, we assess the actual impact and can downgrade the severity if it turns out the impact wasn’t as bad as initially feared.
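As a rough illustration of how severity serves two purposes, the sketch below keeps one value for the level declared during the incident and another for the impact assessed afterwards. The names `Severity`, `SeverityRecord`, `escalate_to`, and `final` are hypothetical and are not taken from New Relic's tooling.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Severity scale described above: 1 is most severe, 5 is least."""
    SEV1 = 1  # most severe
    SEV2 = 2  # most severe
    SEV3 = 3  # major data lags or unavailable features
    SEV4 = 4  # minor bugs or minor data lags
    SEV5 = 5  # no customer impact, declared to raise awareness


@dataclass
class SeverityRecord:
    # Set during the incident to get enough support.
    declared: Severity
    # Set after the incident from the actual customer impact (assumption: optional).
    assessed: Optional[Severity] = None

    def escalate_to(self, level: Severity) -> None:
        """Escalate quickly: keep whichever level is more severe (lower number)."""
        self.declared = min(self.declared, level)

    def final(self) -> Severity:
        """Severity of record: the assessed impact if known, else the declared level."""
        return self.assessed if self.assessed is not None else self.declared
```

For example, an incident declared at `SEV4`, escalated to `SEV2` for extra support, and later assessed at `SEV3` would report `SEV3` from `final()`.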
How do we find out about incidents?
As an organization, our goal is to ensure we never discover an incident because an irritated customer is tweeting about it—that is the worst-case scenario. We’d also like to make sure we don’t have angry customers calling support, as that’s not an ideal scenario either.
At New Relic, we like to say we drink our own champagne (that’s our nicer version of “eating our own dog food”). Engineering teams have free rein to choose the technologies they want to use for a service, with one condition: the service must be instrumented. That means it must have monitoring and alerting (and we use our own products except in rare cases when a team’s use case isn’t already solved by an existing New Relic product). All engineering teams have on-call rotations for the services they manage. A good monitoring setup means an engineer will be paged as soon as a problem is detected—hopefully before it’s discovered by a customer.
Proactive incident reporting is critical at New Relic and helps ensure we’re able to respond to and resolve the incident as quickly as possible.
Responder roles for typical incidents
Here’s a look at how we’ve defined the various incident-responder roles:
|Role||Description||Organization|
|Incident Commander (IC)||Drives resolution of site incident. Keeps CL informed of the incident’s impact and resolution status. Stays alert for new complications. The IC does not perform any technical diagnoses of the incident.||Engineering|
|Tech Lead (TL)||Performs technical diagnosis and fix for incident. Keeps IC informed on technical progress.||Engineering|
|Communications Lead (CL)||Keeps the IC informed on customer impact reports during an incident. Keeps customers and the business informed about an incident. Decides which communication channels to use.||Support|
|Communications Manager (CM)||Coordinates emergency communication strategy across teams: customer success, marketing, legal, etc.||Support|
|Incident Liaison (IL)||For severity 1 incidents only. Keeps Support and the business informed so IC can focus on resolution.||Engineering|
|Emergency Commander (EC)||Optional for severity 1 incidents. Acts as “IC of ICs” if multiple products are down.||Engineering|
|Engineering Manager (EM)||Manages post-incident process for affected teams depending on root cause and outcome of the incident.||Engineering|
The game plan: what happens during an incident response
Let’s consider an example incident. Suppose an engineer on a product team gets paged. The New Relic Synthetics minion that’s monitoring the health check for one of her team’s services is letting her know that the health check is failing. She checks the New Relic Insights dashboard for the service and sees that, indeed, the health check is failing—throughput is dropping, and she’s betting customers are going to be suffering as a result. What happens now? What should she do?
First, she declares an incident in our designated Slack channel. A bot called Nrrdbot (a modified clone of GitHub’s Hubot) helps guide her through the process. Since she’s decided to take the Incident Commander role, she types 911 ic me. This updates the Slack channel header and creates a new, open incident in Upboard (our internal, home-grown incident tracker); Nrrdbot then direct messages (DMs) the engineer with next steps.
The IC should now do three things:
- Set a severity (how bad is it?).
- Set a title (summary of what’s going wrong) and a status (summary of what’s in progress right now) for the incident.
- Find one or more Tech Leads to debug the problem. If the IC is the best person to be Tech Lead, they will find someone else to take over the IC role, as the IC does not perform any technical diagnoses of the incident.
When the IC sets the severity (or changes it during the course of the incident), that determines who gets brought in to help with the response. For incidents at severity level 3 or more severe (levels 1 to 3), a team member from support automatically joins the incident as Communications Lead. The CL’s job is to coordinate communication with customers; they’ll relay any customer complaints related to the incident and communicate proactively with customers based on what engineers are finding.
At this point, the IC opens a crowd-sourced coordination document to be shared among everyone who’s participating in the response. She’s responsible for managing the flow of communication between all parties involved in the response. She’s also pulling in support when needed, updating the status (every 10 minutes, or as Nrrdbot reminds her), and updating the severity as things get better or worse.
If the issue hasn’t been resolved in 60-90 minutes, she’ll hand her IC role off to someone else, as it’s an exhausting responsibility, especially at 3 a.m. when awoken from a sound sleep.
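The two timing rules above (a status update every 10 minutes and an IC handoff after 60 to 90 minutes) could be checked with something like the following minimal sketch. The function and constant names are hypothetical, and the 60-minute threshold is simply the lower bound of the range given above.

```python
from datetime import datetime, timedelta

# Status is updated every 10 minutes.
STATUS_INTERVAL = timedelta(minutes=10)
# Handoff is suggested after 60-90 minutes; this sketch assumes the lower bound.
HANDOFF_AFTER = timedelta(minutes=60)


def status_update_due(last_update: datetime, now: datetime) -> bool:
    """True once at least 10 minutes have passed since the last status update."""
    return now - last_update >= STATUS_INTERVAL


def handoff_due(ic_started: datetime, now: datetime) -> bool:
    """True once the current IC has held the role for at least 60 minutes."""
    return now - ic_started >= HANDOFF_AFTER
```

A reminder bot could call both checks on a timer and nudge the IC whenever either returns `True`.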
Once the issue is completely resolved, and all leads have confirmed their satisfaction, the IC ends the incident by entering 911 over in Slack. This closes the incident.
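The open-and-close lifecycle driven by the two chat commands mentioned in this walkthrough (911 ic me and 911 over) can be sketched as a tiny state transition function. This is not Nrrdbot's actual implementation: the `Incident` class, the `handle_message` function, the case-insensitive matching, and the rule that a second 911 ic me reassigns the IC are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Incident:
    commander: str        # user currently holding the IC role
    is_open: bool = True


def handle_message(user: str, text: str,
                   current: Optional[Incident]) -> Optional[Incident]:
    """Return the incident state after a chat message; other messages change nothing."""
    # Assumption: commands are matched ignoring case and extra whitespace.
    command = " ".join(text.lower().split())
    if command == "911 ic me":
        if current is None or not current.is_open:
            return Incident(commander=user)       # declare a new incident
        return replace(current, commander=user)   # assumed: take over the IC role
    if command == "911 over" and current is not None and current.is_open:
        return replace(current, is_open=False)    # close the incident
    return current
```

Keeping the state immutable and returning a new value for each message makes the transitions easy to test in isolation.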
Finally, she can:
- Collect final details into the coordination document, including:
  - Incident duration
  - Customer impact
  - Any emergency fixes that need to be rolled back
  - Any important issues that arose during the incident
  - Notes about who should be involved in the post-incident retrospective
- Confirm who should be invited to the blameless retrospective
- Choose a team to own the incident (in our example, the Synthetics team) so the engineering manager of that team can schedule the post-incident retrospective
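The closing checklist above maps naturally onto a small record. The following is a minimal sketch of what such a record might hold; the `IncidentSummary` class and its field names are hypothetical and do not describe the coordination document or Upboard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentSummary:
    started: datetime
    ended: datetime
    customer_impact: str
    owning_team: str  # team whose engineering manager schedules the retrospective
    fixes_to_roll_back: List[str] = field(default_factory=list)
    notable_issues: List[str] = field(default_factory=list)
    retrospective_invitees: List[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        """Incident duration, derived from the start and end times."""
        return self.ended - self.started

    def ready_for_retrospective(self) -> bool:
        """True when impact, owning team, and at least one invitee are recorded."""
        return bool(self.customer_impact and self.owning_team
                    and self.retrospective_invitees)
```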
The New Relic Emergency Response Force
Incidents with a severity level of 1 or 2 are extremely rare, but when one is declared, it automatically triggers a background process that pages a member of the New Relic Emergency Response Force (NERF) and an on-call engineering executive. NERF team members are highly experienced New Relic employees with deep understanding of our systems and architecture, as well as our incident-management processes. They are adept at handling high-severity incidents, especially when those incidents require coordinating multiple teams.
Executives are brought in alongside NERFs to provide three critical functions: inform executive leadership; coordinate with our legal, support, and security teams; and make hard decisions.
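Putting the severity rules from this article together, the mapping from severity to the people who get involved might look like the sketch below. The function name `responders_for` and the role strings are hypothetical, and the optional Emergency Commander is left out because the article describes that role as optional.

```python
from typing import List


def responders_for(severity: int) -> List[str]:
    """Roles involved at a given severity (1 = most severe, 5 = least)."""
    if severity not in range(1, 6):
        raise ValueError("severity must be between 1 and 5")
    # Every incident has an Incident Commander and at least one Tech Lead.
    roles = ["Incident Commander", "Tech Lead"]
    if severity <= 3:
        # Support automatically joins as Communications Lead.
        roles.append("Communications Lead")
    if severity <= 2:
        # Severity 1 or 2 pages a NERF member and an on-call engineering executive.
        roles += ["NERF member", "On-call engineering executive"]
    if severity == 1:
        # The Incident Liaison role applies to severity 1 incidents only.
        roles.append("Incident Liaison")
    return roles
```

Calling the function again whenever the IC changes the severity would show which additional responders need to be brought in.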
Incidents happen; that’s just the reality of our industry. On the plus side, incidents can tell you a lot about your systems and your engineering process, but you waste that valuable information if you don’t systematically learn from outages.
After incidents end, we require teams to conduct a retrospective within one or two business days. Retrospectives are transparent, open to anyone who wants to participate, and the resulting documentation is openly available to anyone in the company.
During retrospectives, we finalize the details of the incident: Which team should own the incident? What was the actual impact to customers, and how long did it last? What was the nature of the problem?
We emphasize holding “blameless” retrospectives, meaning that we’re focused on uncovering the true root causes of a problem, not finding a person or team to take the fall. If you’re not familiar with the blameless approach to incident retrospectives, see Etsy’s classic Debriefing Facilitation Guide: Leading Groups at Etsy to Learn from Accidents, which unpacks both the philosophy and the mechanics of the blameless retrospective. We invest a lot of time and energy in maintaining this blameless culture, in retrospectives and beyond, so that our engineers are empowered to take risks and learn from their mistakes without fear.
Don’t repeat incidents policy: paying down technical debt
In the software world, where moving fast is a primary directive, we often forget to take the time to properly diagnose and address the root causes of serious problems that affect our customers. Sometimes it’s too easy to skip incident follow-up.
At New Relic, if a service incident impacts our customers, the Don’t Repeat Incidents (DRI) policy tells us that we stop any new work on that service until we’ve fixed or mitigated the root cause of the incident.
“The DRI process is a huge part of what makes engineering successful,” says Kevin Corcoran, a senior software engineer on the metrics pipeline team. “It’s an obvious opportunity to identify and pay down technical debt, which is work that usually doesn’t get prioritized through other means. It’s great to have buy-in for the prioritization of DRI work by upper management and executives.”
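The DRI rule can be read as a simple gate on new work. The sketch below is one way to express it under stated assumptions: the `PastIncident` record and the `new_work_allowed` function are hypothetical names, and how New Relic actually tracks whether a root cause is fixed or mitigated is not described in this article.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PastIncident:
    service: str
    customer_impacting: bool
    root_cause_addressed: bool  # True once the root cause is fixed or mitigated


def new_work_allowed(service: str, incidents: Iterable[PastIncident]) -> bool:
    """New work is blocked while any customer-impacting incident on the
    service still has an unaddressed root cause."""
    return not any(
        i.service == service
        and i.customer_impacting
        and not i.root_cause_addressed
        for i in incidents
    )
```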
The challenge isn’t to completely eliminate incidents—that’s not realistic—it’s to respond to the incidents that do occur more effectively. Our approach works for us, but other tools and practices might work better for your company. Regardless of your unique situation, though, be sure you can answer these questions:
- Do you have clear guidelines for responding to incidents?
- Where is the worst friction in incident response and resolution? How can you best reduce that friction?
- Are you continually fostering a blameless culture that treats incidents as learning opportunities instead of occasions for blame and shame?
Because, seriously, if you don’t have incidents, you probably aren’t paying attention.
Don’t miss part one of this two-part series: Managing On-Call Rotations the New Relic Way