(Editor’s note: This post is adapted from a pair of posts originally published on February 13, 2018.)
Far too many companies continue to use on-call rotations and incident response processes that leave team members feeling stressed out, anxious, and generally miserable. Notably, plenty of good engineers are turning down jobs specifically for that reason.
It doesn’t have to be this way. At New Relic, our DevOps practice has allowed us to create on-call and incident response processes that support rapid growth and maximize the reliability of our systems—while also protecting developers from drama and stress. We hope that by sharing our experiences and best practices for building and managing our on-call rotation and incident response systems, we can help other firms solve similar challenges—and make life easier for their own developers and other practitioners.
On-call policies in action: learning from New Relic
New Relic’s product organization currently comprises more than 50 engineering teams, with more than 400 engineers and managers, supporting some 200 individual services. Each team is its own autonomous unit; teams choose the technology they use, write and maintain their own services, and manage their own deployments, runbooks, and on-call rotations.
Our engineering teams are made up of software engineers, site reliability engineers (SREs), and engineering managers. Most teams bear primary responsibility for at least three services. And every engineer and engineering manager in the organization joins an on-call rotation, usually beginning within the first two to three months of their employment.
We do this, first and foremost, because it’s necessary. New Relic serves thousands of customers around the globe with critical monitoring, alerting, and business intelligence for their applications and infrastructures. When one of our customers has a problem, it’s not an option to let the issue wait until the next day. While we do have engineers across the United States and Europe, most of our teams work from our global engineering headquarters in Portland, Oregon. That means we can’t run “follow the sun” rotations like Google does, in which engineers in one part of the world hand off their on-call responsibilities to peers across the globe at the end of their work days.
Best practice: Adopt and embrace DevOps practices
Before the emergence of DevOps as an application development methodology, on-call duties typically rested on a subset of engineers and other IT personnel, such as a centralized site reliability or operations team.
These staffers, not the developers who actually built the software, responded to incidents involving the services they watched. Feedback from the site reliability team, however, rarely reached the developers. In addition, product owners often chose to move on to the next new feature, instead of urging their teams to pay down technical debt and make their products and services as reliable as possible.
One reason for the emergence of DevOps was to tear down these organizational silos. In a modern application architecture such as the one New Relic uses, services fit together to form a large, interconnected product platform dependent on a complex system of cloud services, database maintainers, and intricate networking layers—just to name a few parts of the system. While a particular incident response may start with one team, the root cause may involve a service further down the stack.
DevOps supports the idea that no team is an island, and that teams must be able to interact and have clear, documented on-call processes to keep these complex systems running smoothly. Additionally, in a strong DevOps practice, developers make better decisions about the services they build because they must also support them—they can’t throw a service over the wall for someone else to worry about.
Best practice: Balance autonomy and accountability
Most teams at New Relic use some form of a one-week on-call rotation, with one engineer as the primary responder and another as the secondary. So, if a team has six engineers, then each engineer will be the primary person on call every six weeks.
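To make the arithmetic concrete, here is a minimal sketch of a one-week rotation. The engineer names, the start date, and the rule that next week's primary serves as this week's secondary are illustrative assumptions, not a description of any New Relic team's tooling.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Return (week_start, primary, secondary) tuples for a one-week rotation."""
    schedule = []
    n = len(engineers)
    for week in range(weeks):
        primary = engineers[week % n]
        # Assumption: next week's primary acts as this week's secondary.
        secondary = engineers[(week + 1) % n]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


team = ["ana", "ben", "chi", "dev", "eli", "fay"]  # hypothetical six-person team
schedule = weekly_rotation(team, date(2024, 1, 1), 12)
for week_start, primary, secondary in schedule:
    print(week_start, primary, secondary)
```

With six engineers, the same person is primary again in the seventh week, which is the every-six-weeks cadence described above.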
A successful on-call process, however, really depends on the composition of the team, the services they manage, and the team’s collective knowledge of the services. This is where team autonomy comes into play—at New Relic, each team creates its own on-call system, which reflects its needs and capabilities.
Here are two examples of how this approach plays out in practice:
The New Relic metrics pipeline team has structured its rotation so that there is always a “primary” and a “non-primary” on-call contact. Using a script that runs in Jenkins, the team randomly rotates the on-call order of the non-primaries. When an incident occurs, if the primary contact is unavailable or doesn’t respond to a page, the non-primaries are paged, one at a time, in a random order, until someone responds.
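The escalation pattern described for the metrics pipeline team might look something like the sketch below. This is not the team's Jenkins script; the function names, the team list, and the `responds` callback are hypothetical stand-ins for a real paging system.

```python
import random


def paging_order(primary, team, rng=None):
    """Primary first, then the remaining team members in random order."""
    rng = rng or random.Random()
    non_primaries = [member for member in team if member != primary]
    rng.shuffle(non_primaries)
    return [primary] + non_primaries


def page_until_ack(order, responds):
    """Page each person in turn; `responds` is a hypothetical callback
    that returns True when that person acknowledges the page."""
    for person in order:
        if responds(person):
            return person
    return None  # nobody acknowledged


team = ["ana", "ben", "chi", "dev"]  # hypothetical names
order = paging_order("ana", team, random.Random(0))
responder = page_until_ack(order, lambda person: person != "ana")
print(order, responder)
```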
The New Relic Browser team uses a configurable custom application that rotates a new team member into the “primary” role once every minute. If a team member gets paged and doesn’t respond right away, the system rotates to the next person, and so on, until someone acknowledges the alert. This approach actually takes pressure off team members: If a problem does occur and they’re not available or don’t feel prepared to address the problem, another team member is just a one-minute rotation away.
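The source does not describe how the Browser team's application works internally, so the following is only a rough sketch of a minute-by-minute rotation. The names and the `available` set are assumptions made for illustration.

```python
def current_primary(team, minutes_since_start):
    """The primary role moves to the next team member every minute."""
    return team[minutes_since_start % len(team)]


def first_responder(team, start_minute, available):
    """Walk the rotation one minute at a time until an available member
    holds the primary role. Returns (member, minutes_waited)."""
    for offset in range(len(team)):
        candidate = current_primary(team, start_minute + offset)
        if candidate in available:
            return candidate, offset
    return None, None  # nobody on the team is available


team = ["ana", "ben", "chi"]  # hypothetical names
print(first_responder(team, 4, {"ana"}))
```

In this example the page arrives while "ben" is primary, and the rotation reaches "ana" two minutes later.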
Best practice: Track and measure on-call performance
New Relic tracks several on-call metrics at the individual engineer, team, and group levels:
- The total number of pages per engineer
- The number of hours in which an engineer was paged
- The number of off-hours pages (those that occur outside of normal business hours) received
These metrics, and how you respond to them, are critical to maintaining a structure and organization that allows teams to thrive in their on-call practices. At New Relic, for example, alerting data from PagerDuty allows managers and executives to see how many times a team was paged during a given timeframe, and how many of those alerts occurred off hours.
Tracking off-hours pages helps call attention to teams struggling with unmanageable on-call loads. What’s an unmanageable load? At New Relic, if a team averages more than one off-hours page per week, that team is considered to have a high on-call burden.
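As a rough illustration, the three metrics and the one-page-per-week threshold could be computed from a list of page timestamps along these lines. The business-hours window (Monday to Friday, 09:00 to 17:00), the input format, and all names are assumptions; this does not use or imitate the PagerDuty API.

```python
from collections import Counter
from datetime import datetime


def is_off_hours(ts, start_hour=9, end_hour=17):
    # Assumption: business hours are Mon-Fri, 09:00-17:00 local time.
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def on_call_metrics(pages, weeks):
    """pages: list of (engineer, datetime) tuples covering `weeks` weeks."""
    pages_per_engineer = Counter(engineer for engineer, _ in pages)
    # Count each distinct clock hour in which an engineer was paged once.
    distinct_hours = {
        (engineer, ts.replace(minute=0, second=0, microsecond=0))
        for engineer, ts in pages
    }
    paged_hours = Counter(engineer for engineer, _ in distinct_hours)
    off_hours_pages = sum(1 for _, ts in pages if is_off_hours(ts))
    return {
        "pages_per_engineer": dict(pages_per_engineer),
        "paged_hours_per_engineer": dict(paged_hours),
        "off_hours_pages": off_hours_pages,
        # More than one off-hours page per week counts as a high burden.
        "high_burden": off_hours_pages / weeks > 1,
    }


pages = [  # hypothetical data for a two-week window
    ("ana", datetime(2024, 1, 2, 3, 15)),
    ("ana", datetime(2024, 1, 2, 3, 40)),
    ("ben", datetime(2024, 1, 3, 14, 0)),
    ("ana", datetime(2024, 1, 6, 11, 0)),
]
metrics = on_call_metrics(pages, weeks=2)
print(metrics)
```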
If a team’s burden is too high, consider allowing the team to focus on paying down technical debt or automating away toil until their on-call burden drops. Or, like New Relic, you may provide support in the form of senior site reliability engineers (SREs) who can help the team improve their services.
Questions to consider when choosing an on-call model
An on-call model doesn’t have to be complex, but it must ensure that a designated engineer is always available to respond to a page and deal with incidents involving their sphere of responsibility. Some questions that an on-call model should answer include:
- How will the model select team members for each on-call rotation?
- How long will a rotation last?
- What happens when an on-call engineer fails to answer a page?
- What options are available if an engineer doesn’t feel up to the task of handling an on-call page?
- How many engineers will be on call at any given time?
- How will multiple on-call engineers divide their duties?
- How will the team handle unscheduled rotations and other unforeseen events?
For larger organizations with multiple teams, the answer will also depend upon the degree of team autonomy. DevOps organizations generally favor a high level of team autonomy, but some take the concept further than others.
Incident response: What happens when the pager goes off
An organization’s on-call process is one key aspect of its software quality and reliability practices. Another, closely related aspect is its incident response procedures.
Incident response covers events that run the gamut from mundane to terrifying; some are impossible to notice without the help of specialized monitoring tools, while others could impact millions of users and make national headlines.
New Relic defines an “incident” as any case where a system behaves in an unexpected way that might negatively impact its customers.
New Relic, like many software companies, can’t afford to wait until an incident occurs to figure out a plan. We need to act quickly and efficiently. We have to have a clear plan in place and ready to go.
Best practice: Discover incidents before your customers do
The goal for a successful incident response system is simple: Discover the incident—and, ideally, fix it—before customers are affected by it.
As an organization, our goal is to ensure we never discover an incident because an irritated customer is tweeting about it—that is the worst-case scenario. We’d also like to make sure that we don’t have angry customers calling support, as that’s not an ideal scenario, either.
At New Relic, we like to say that we “drink our own champagne” (it’s nicer than “eating our own dog food”). Engineering teams are free to choose the technologies they use to build services, with one condition: The service must be instrumented. That means it must have monitoring and alerting. (We use our own products except in rare cases.)
Of course, as discussed above, engineering teams also have on-call rotations for the services they manage. A good monitoring setup, with proactive incident reporting, means an engineer will be paged as soon as a problem is detected—preferably before a customer notices it.
Best practice: Develop a system to assess incident severity
Effective incident response begins with a system to rank incidents based on their severity—usually measured in terms of customer impact. New Relic’s internal incident-severity scale makes an excellent starting point for an organization to build its own incident response process; it is based on rankings from 1-5, with clearly documented criteria for each level:
- A Level 5 incident should never have customer impact, and it may be declared simply to raise awareness of something such as a risky service deployment.
- Level 4 incidents involve minor bugs or minor data lags that affect but don’t hinder customers.
- Level 3 incidents involve major data lags or unavailable features.
- Level 1 and 2 incidents are reserved for cases involving brief, full product outages or those that pose a direct threat to the business. At New Relic, the “Kafkapocalypse” from several years ago was an example of this type of incident.
Each incident level involves a specific protocol for calling up internal resources, managing the response, whether and how to communicate with customers, and other tasks. New Relic classifies its most severe incidents as emergencies; these typically require elevated responses, and in some cases, direct involvement, from our legal, support, and leadership teams.
It is important to consider how an incident may affect customers and their experience, and to think about the resources a response team will need to diagnose, contain, and resolve the problem.
At New Relic, we assign a severity level during an incident to determine how much support we need. Then, after an incident, we reassess the assigned severity level based on actual customer impact. This reflects a key incident response principle at New Relic: We encourage engineers to escalate quickly during an incident so that they can get the support they need to resolve the problem. After the incident is over, we assess the actual impact and downgrade the severity if it turns out the impact wasn’t as bad as initially feared.
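One way to model this two-step assessment is to keep the severity assigned during the incident separate from the severity recorded afterward. The sketch below assumes a 1-5 scale where lower numbers are more severe, as described above; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Lower numbers are more severe, matching the 1-5 scale above."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass
class SeverityRecord:
    assigned: Severity                 # set during the incident to size the response
    final: Optional[Severity] = None   # set afterward from actual customer impact

    def reassess(self, actual_impact: Severity) -> bool:
        """Record the post-incident severity; return True if it was downgraded."""
        self.final = actual_impact
        return actual_impact > self.assigned  # larger number means less severe


record = SeverityRecord(assigned=Severity.SEV2)
downgraded = record.reassess(Severity.SEV4)
print(record, downgraded)
```

Keeping both values makes it possible to escalate freely during the incident without distorting the record of actual customer impact.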
Best practice: Define and assign response team roles
The following table provides an overview of the roles that New Relic uses to staff its incident response teams. Many of these roles enter the picture at specific severity levels. In other cases the responsibilities assigned to a role may change depending on the severity of an incident:
|Role||Description||Organization|
|Incident Commander (IC)||Drives resolution of site incident. Keeps CL informed of the incident’s impact and resolution status. Stays alert for new complications. The IC does not perform technical diagnoses of the incident.||Engineering|
|Tech Lead (TL)||Performs technical diagnosis and fix for incident. Keeps IC informed on technical progress.||Engineering|
|Communications Lead (CL)||Keeps IC informed on customer impact reports during an incident. Keeps customers and the business informed about an incident. Decides which communication channels to use.||Support|
|Communications Manager (CM)||Coordinates emergency communication strategy across teams: customer success, marketing, legal, etc.||Support|
|Incident Liaison (IL)||For severity 1 incidents only. Keeps Support and the business informed so IC can focus on resolution.||Engineering|
|Emergency Commander (EC)||Optional for severity 1 incidents. Acts as “IC of ICs” if multiple products are down.||Engineering|
|Engineering Manager (EM)||Manages post-incident process for affected teams depending on root cause and outcome of the incident.||Engineering|
Best practice: Set up an incident response scenario
Most organizations can’t fully simulate an actual incident response—especially a high-severity incident. But even limited simulations can give you a sense of what will happen during an incident, how to set priorities and escalation procedures, how to coordinate team roles, and other key insights.
Let’s look at an example, involving a hypothetical incident at New Relic:
Our simulation begins with an on-call engineer on a New Relic product team getting a page. The New Relic Synthetics minion that’s monitoring the health check for one of the engineer’s services is letting her know that the health check is failing. She checks the New Relic Insights dashboard for the service and sees that, indeed, the health check is failing—throughput is dropping, and she’s worried customers will suffer as a result. What happens now? What should she do?
First, she declares an incident in our designated Slack channel. A bot called Nrrdbot (a modified clone of GitHub’s Hubot) helps guide her through the process. Since she’s decided to take the Incident Commander role, she types 911 ic me. This updates the Slack channel header and creates a new, open incident in Upboard (our internal home-grown incident tracker); Nrrdbot direct messages (DMs) the engineer with next steps.
The IC should now do three things:
- Set a severity (how bad is it?).
- Set a title (summary of what’s going wrong) and a status (summary of what’s in progress right now) for the incident.
- Find one or more Tech Leads to debug the problem. If the IC is the best person to be Tech Lead, they will find someone else to take over the IC role, as the IC does not perform technical diagnoses of the incident.
When the IC sets the severity (or changes it during the course of the incident), that determines who gets brought in to help with the response. For incidents at severity level 3 or more severe (levels 1 through 3), a team member from support automatically joins the incident as Communications Lead. The CL’s job is to coordinate communication with customers; they’ll relay any customer complaints related to the incident and communicate proactively with customers based on what engineers are finding.
At this point, the IC opens a crowd-sourced coordination document to be shared among everyone who’s participating in the response. She’s responsible for managing the flow of communication between all parties involved in the response. She’s also pulling in support when needed, updating the status (every 10 minutes, or as Nrrdbot reminds her), and updating the severity as things get better or worse.
If the issue hasn’t been resolved in 60-90 minutes, she’ll hand her IC role off to someone else, as it’s an exhausting responsibility, especially at 3 a.m. when awoken from a sound sleep.
Once the issue is completely resolved, and all leads have confirmed their satisfaction, the IC ends the incident by entering 911 over in Slack. This closes the incident.
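The walkthrough above can be sketched as a small state machine. This is not Nrrdbot or Upboard: only the two chat commands (911 ic me and 911 over), the 10-minute status cadence, and the 60-minute handoff threshold come from the text. Every class, method, and reply string is hypothetical, as is the assumption that repeating 911 ic me during an open incident hands off the IC role.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

STATUS_INTERVAL = timedelta(minutes=10)  # status refresh cadence from the walkthrough
HANDOFF_AFTER = timedelta(minutes=60)    # lower bound of the 60-90 minute window


@dataclass
class Incident:
    commander: str
    ic_since: datetime
    last_status_at: datetime
    severity: Optional[int] = None
    title: str = ""
    status: str = ""
    tech_leads: List[str] = field(default_factory=list)
    is_open: bool = True

    def set_status(self, text: str, now: datetime) -> None:
        self.status = text
        self.last_status_at = now

    def due_reminders(self, now: datetime) -> List[str]:
        """Return the nudges a bot might send the IC at time `now`."""
        due = []
        if self.severity is None:
            due.append("set a severity")
        if not self.title:
            due.append("set a title")
        if not self.tech_leads:
            due.append("find a Tech Lead")
        if now - self.last_status_at >= STATUS_INTERVAL:
            due.append("update the status")
        if now - self.ic_since >= HANDOFF_AFTER:
            due.append("consider handing off the IC role")
        return due


class IncidentChannel:
    """Tracks at most one open incident, driven by chat messages."""

    def __init__(self) -> None:
        self.current: Optional[Incident] = None

    def handle(self, user: str, message: str, now: datetime) -> str:
        command = " ".join(message.lower().split())
        if command == "911 ic me":
            if self.current is None:
                self.current = Incident(commander=user, ic_since=now, last_status_at=now)
                return f"Incident opened; {user} is Incident Commander."
            # Assumption: repeating the command mid-incident hands off the role.
            self.current.commander = user
            self.current.ic_since = now
            return f"{user} has taken over as Incident Commander."
        if command == "911 over":
            if self.current is None:
                return "No open incident."
            self.current.is_open = False
            self.current = None
            return "Incident closed."
        return "Unrecognized command."
```

A caller would feed each chat message to `handle` and poll `due_reminders` on a timer to decide what to send the IC next.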
Best practice: Hope for the best, but plan for the worst
The example above simulates a significant incident at New Relic, but one that never rises to the level of a true emergency. Emergency events are extremely rare (or they should be, anyway), but they pose a far higher level of risk to a business. In fact, during a true worst-case scenario, an incident could turn into an existential threat if it escalates out of control.
At New Relic, an incident set with a severity level 1 or 2 automatically triggers a background process that pages a member of the New Relic Emergency Response Force (NERF), and an on-call engineering executive. NERF team members are highly experienced New Relic employees with deep understanding of our systems and architecture, as well as our incident-management processes. They are adept at handling high-severity incidents, especially when those incidents require coordinating multiple teams.
Executives join an incident response team alongside NERFs to provide three critical functions: inform executive leadership; coordinate with our legal, support, and security teams; and make hard decisions.
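Pulling together the severity rules mentioned in this post, a minimal sketch of "who joins at which severity" might look like this. It encodes only what the text states: a Communications Lead joins at level 3 or more severe, a NERF member and an on-call engineering executive are paged at levels 1 and 2, and the Incident Liaison applies to level 1 only. Treating the Incident Commander and Tech Lead as always present is an assumption drawn from the walkthrough. The Emergency Commander and Communications Manager are left out because the text gives no automatic trigger for them.

```python
from typing import List


def responders_for(severity: int) -> List[str]:
    """Return the roles brought into an incident at the given severity (1-5)."""
    if severity not in range(1, 6):
        raise ValueError("severity must be between 1 and 5")
    # Assumption: every declared incident has an IC and at least one Tech Lead.
    roles = ["Incident Commander", "Tech Lead"]
    if severity <= 3:
        roles.append("Communications Lead")
    if severity <= 2:
        roles += ["NERF member", "On-call engineering executive"]
    if severity == 1:
        roles.append("Incident Liaison")
    return roles


for level in range(5, 0, -1):
    print(level, responders_for(level))
```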
Best practice: Use incidents to learn, improve, and grow
As a first step toward capturing knowledge and learning from an incident, the New Relic IC in our example would also perform several post-incident tasks:
- Collect final details into the coordination document, including:
  - Incident duration
  - Customer impact
  - Any emergency fixes that need to be rolled back
  - Any important issues that arose during the incident
  - Notes about who should be involved in the post-incident retrospective
- Confirm who should be invited to the blameless retrospective
- Choose a team to own the incident (in the example above, the Synthetics team) so the engineering manager of that team can schedule the post-incident retrospective
We also require teams to conduct a retrospective within one or two business days after an incident. New Relic organizes “blameless” retrospectives designed to uncover the root causes of a problem—not find a scapegoat. Learn more here about how New Relic structures and uses blameless retrospectives as part of its broader commitment to DevOps best practices.
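The post-incident details and the retrospective deadline could be captured in a simple record like the one below. The field names are hypothetical, and the deadline calculation assumes the upper bound of two business days and ignores public holidays.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


def add_business_days(start: date, days: int) -> date:
    """Advance `start` by `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


@dataclass
class PostIncidentSummary:
    ended_on: date
    duration_minutes: int
    customer_impact: str
    owning_team: str
    fixes_to_roll_back: List[str] = field(default_factory=list)
    notable_issues: List[str] = field(default_factory=list)
    retrospective_invitees: List[str] = field(default_factory=list)

    def retrospective_due(self) -> date:
        # Assumption: the deadline is two business days; holidays are ignored.
        return add_business_days(self.ended_on, 2)


summary = PostIncidentSummary(
    ended_on=date(2024, 1, 5),  # a Friday
    duration_minutes=45,
    customer_impact="Health checks failed for one service",
    owning_team="Synthetics",
)
print(summary.retrospective_due())
```

For an incident that ends on a Friday, the two-business-day window closes the following Tuesday.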
Best practice: Implement a Don’t Repeat Incidents (DRI) policy
At New Relic, if a service incident impacts our customers, we have a Don’t Repeat Incidents (DRI) policy that compels us to stop any new work on that service until we fix or mitigate the root cause of the incident. The DRI process plays a big role in the success of New Relic’s engineering teams—ensuring that they identify and pay down technical debt, which is work that often doesn’t get prioritized through other means.
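Expressed as a rule, the DRI policy amounts to a gate on new work. The sketch below is one possible way to encode it; the record type and field names are hypothetical and not taken from New Relic's tooling.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class IncidentRecord:
    service: str
    customer_impact: bool
    root_cause_addressed: bool  # True once the root cause is fixed or mitigated


def new_work_allowed(service: str, incidents: Iterable[IncidentRecord]) -> bool:
    """New work on a service is blocked while any customer-impacting
    incident on that service still has an unaddressed root cause."""
    return not any(
        record.service == service
        and record.customer_impact
        and not record.root_cause_addressed
        for record in incidents
    )


history = [  # hypothetical incident history
    IncidentRecord("ingest", customer_impact=True, root_cause_addressed=False),
    IncidentRecord("billing", customer_impact=False, root_cause_addressed=False),
    IncidentRecord("alerts", customer_impact=True, root_cause_addressed=True),
]
print(new_work_allowed("ingest", history), new_work_allowed("billing", history))
```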
It’s important to remember that the goal isn’t to completely eliminate incidents—that’s simply not realistic. Instead, New Relic wants its teams to respond more effectively to future incidents that do occur.
Now it’s your turn: questions to guide incident response planning
We’ve covered a lot of ground discussing how New Relic handles its on-call and incident response processes and suggesting best practices that you can take away from our experiences. We encourage you to create clear guidelines so your teams know what to expect; to identify and reduce the worst friction in your incident response and resolution processes; and to decide exactly how to structure your on-call and incident response processes.
Addressing the following questions can help you to perform all of these tasks more efficiently.
- Size: How big is the engineering organization? How big are individual teams? What kind of rotation can your teams handle?
- Growth: How fast is the engineering organization growing? What’s the turnover rate?
- Geography: Is your organization geographically centralized or widely distributed? Do you have the size and distribution to institute “follow the sun” rotations, or do engineers need to cope with off-hours pages?
- Organization: How is the engineering organization structured? Have you adopted a modern DevOps culture in which teams own the full lifecycle of a service from development to operations, or are development and operations siloed? Do you have a centralized SRE group, or are SREs embedded on engineering teams throughout the organization?
- Complexity: How are your applications structured? Do your engineers support well-defined services that are plugged into a larger architecture, or is your product a monolithic application supported by different teams? How many services does each team support? How stable are those services?
- Dependencies: How many customers (internal or external) depend on your services? If a service fails, how big is the blast radius?
- Tooling: How sophisticated are your incident response process and tools? How thorough and current are your team’s runbooks and monitoring? Do engineers have adequate tooling and organizational support when they respond to a page? Do engineers get automatic, actionable notifications of problems?
- Expectations: Is being on call the norm in your engineering culture? Is it seen as a valuable and essential part of the job, or as an extraneous burden?
- Culture: Does your company have a blameless culture that focuses on true root cause and addressing systemic issues, or do you have a “blame and shame” culture where people are punished when something goes wrong?