At New Relic, we treat each of our 49 engineering teams as an autonomous entity, with full end-to-end responsibility for the services they own. It’s our philosophy that to build great software, teams need to be responsible for how well it operates. As part of that philosophy, we put every engineer on call.
This might sound scary and terrible, but we’ve set a lot of careful policies and practices in place to ensure it’s not. Engineers, managers, and other stakeholders expected to join on-call rotations receive incident response training that covers on-call preparation, expectations of the Incident Commander (IC), and an overview of the templates and tools we use for managing incidents.
Birth of the NERF
Initially we wanted every engineer to manage any incident on which they were the first responder. However, we soon realized that although most of our engineers were well equipped to handle incidents impacting services owned by their own teams, successfully managing large, complex incidents crossing multiple systems requires an advanced level of incident-management skills. So we created the New Relic Emergency Response Force, or NERF. The NERF program is a rotation of on-call volunteers who assist during long-running and high-impact emergencies.
To kick off the program, we recruited a group of individuals based on a set of criteria including a proven ability to handle high-severity incidents (especially incidents requiring coordination across multiple teams); a deep working knowledge of the New Relic platform; and strong communication, facilitation, and collaboration skills.
Call a NERF when you need one
There’s no single factor that contributes to long-running incidents, but looking back over a year or so of data collected with our internal incident management tooling, we saw that incidents tended to drag on when they didn’t immediately receive focused attention from the full set of teams that could help resolve them. Worst-case scenario: a single engineer trying to act as both Incident Commander and Tech Lead in the middle of the night, and executing neither role well.
We’ve since tweaked our incident response training to strongly discourage responders from going it alone, and to encourage them instead to quickly enlist help when needed, especially from a NERF.
The NERF on call can be paged directly from our Emergency Room Slack channel using Nrrdbot, our custom Slackbot. We’ve also implemented a time-based escalation tool that automatically pages a NERF for high-severity incidents.
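To make the idea of time-based escalation concrete, here is a minimal sketch of the kind of check such a tool might run. It is not our actual tooling: the `Incident` type, the `should_page_nerf` function, the severity numbering, and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds: how long an incident of a given severity may stay
# open before the on-call NERF is paged automatically. Severities that are
# not listed here are never auto-escalated.
ESCALATION_AFTER = {
    1: timedelta(minutes=15),
    2: timedelta(minutes=45),
}


@dataclass
class Incident:
    severity: int            # 1 is the most severe (assumed numbering)
    opened_at: datetime
    nerf_paged: bool = False


def should_page_nerf(incident: Incident, now: datetime) -> bool:
    """Return True if the incident has been open long enough to page a NERF."""
    if incident.nerf_paged:
        return False  # a NERF is already involved; do not page twice
    threshold = ESCALATION_AFTER.get(incident.severity)
    if threshold is None:
        return False  # low-severity incidents are left to the owning team
    return now - incident.opened_at >= threshold
```

A scheduler could call `should_page_nerf` for every open incident once a minute and send the page when it returns `True`. Keeping the check as a pure function of the incident and the current time makes the thresholds easy to test and tune.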
When a NERF joins an incident response team, they take over as Incident Commander. Their experience and expertise are critical in high-severity situations. Because the handoff is standard process, it also eliminates any judgment about whether the current IC is doing a “good job” of IC-ing. It takes a lot of pressure off the original Incident Commander, who can then focus on the technical problem rather than on coordinating the work. As an added bonus, once the incident is resolved, the NERF also facilitates the “blameless retrospective.” Having a consistent group serve as facilitators helps us ask the right questions and keeps us focused on uncovering the true root causes of a problem rather than finding a person or team to take the fall. This type of retro ensures that we follow a common process and learn as much as possible about the incident during the discussion.
How NERFs help engineering teams
The NERF program allows engineers and engineering teams to focus on the problem, rather than the mechanics of incident management. It also brings consistency and best practices to our incident-management process. Additionally, when a particular team’s services are at the root of an incident, NERFs are in a better position to identify common problems or remediation needed across multiple teams. Essentially, a NERF can step into an incident and not lose sight of the forest for the trees: having a full picture of the incident, and how to manage it, is just as important as determining the root cause.
Why become a NERF?
We encourage a variety of stakeholders from around the engineering organization to join the NERF program. It’s a fantastic way for them to share their knowledge, skills, and leadership with each other and the rest of the organization during and after incidents. They’re also able to help us continually improve our incident-response process.
NERFs also help us improve key performance indicators (KPIs) such as Mean Time To Repair (MTTR), which measures how long it takes to repair a software failure. In addition, they help us experiment with ways to instrument Mean Time To Detect (MTTD), which tracks the time from the onset of an event to the moment a responder starts the response process.
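As a rough illustration of how these two KPIs can be computed from incident timestamps, consider the sketch below. The record fields (`onset`, `response_started`, `resolved`) and the function names are hypothetical, not taken from our tooling. The sketch also assumes that repair time is measured from the onset of the event to its resolution; some teams measure it from the start of the response instead.

```python
from datetime import datetime, timedelta


def mean_duration(pairs):
    """Average the (start, end) intervals; return None if there are none."""
    pairs = list(pairs)
    if not pairs:
        return None
    total = sum((end - start for start, end in pairs), timedelta())
    return total / len(pairs)


def mttd(incidents):
    # Mean Time To Detect: onset of the event until a responder starts working.
    return mean_duration((i["onset"], i["response_started"]) for i in incidents)


def mttr(incidents):
    # Mean Time To Repair: onset until resolution (assumed definition).
    return mean_duration((i["onset"], i["resolved"]) for i in incidents)


# Two made-up incidents, for illustration only.
t0 = datetime(2020, 1, 1, 3, 0)
sample_incidents = [
    {"onset": t0,
     "response_started": t0 + timedelta(minutes=10),
     "resolved": t0 + timedelta(minutes=60)},
    {"onset": t0,
     "response_started": t0 + timedelta(minutes=20),
     "resolved": t0 + timedelta(minutes=120)},
]

print(mttd(sample_incidents))  # 0:15:00
print(mttr(sample_incidents))  # 1:30:00
```

Tracking both numbers separately helps show whether an improvement came from noticing incidents sooner or from resolving them faster once the response was under way.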
No engineer loves to be on call. More important, no engineer wants to handle an incident alone. We’ve worked hard to ensure all incident responders have the tools and support they need to get through at least the initial stages of a response. Our NERF program is designed to help us continue to improve our approach to managing and resolving incidents as swiftly and painlessly as possible, while learning as much as we can from each one.
At New Relic, we take reliability seriously, and our NERFs volunteer because they know they can have a significant impact across the organization. They’re highly motivated to make the tools and process better for themselves, their peers, and our customers.