If you were offered a job that required you to be part of an on-call rotation, would you accept it or turn it down? I’ve known plenty of smart, ambitious engineers who have turned down great jobs because they couldn’t bear the thought of having to be on call. They made that decision because at far too many companies, being on call is exhausting and frustrating—if not downright toxic.
A company’s on-call policy can be a lens through which you can see the health of the entire organization. Healthy on-call rotations require a supportable service architecture, well-balanced team size and composition, and a culture that values the entire lifecycle of services, from design to deployment to maintenance.
Part one of this two-part blog series looks at what it takes to develop a good on-call practice, and shares some of the lessons we’ve learned at New Relic. In part two—Managing Incident Response at New Relic—we take a deep dive into our incident response process.
On-call practices require a structured system and organization
New Relic’s Product organization comprises 57 engineering teams, with more than 400 engineers and managers, supporting some 200 individual services. Every engineer and engineering manager in the Product organization joins an on-call rotation, usually within the first two to three months of their employment.
Our engineering teams are made up of software engineers, site reliability engineers (SREs), and engineering managers. Each team is its own autonomous unit; teams choose the technology they use, write and maintain their own services, and manage their own deployments, runbooks, and on-call rotations. Each team, on average, bears primary responsibility for at least three services. All team members go on call for their team’s services.
New Relic serves thousands of customers around the globe, providing critical monitoring, alerting, and business intelligence for our customers’ applications and infrastructures. When one of our customers has a problem, it’s not an option to let that problem wait until the next day. While we do have engineers across the United States and Europe, the majority of our teams work from our Engineering HQ in Portland, Oregon. That means we can’t run “follow the sun” rotations like Google does, in which engineers in one part of the world hand off their on-call responsibilities to peers across the globe at the end of their work days.
There are great benefits to having autonomous teams whose engineers all join an on-call rotation for the services their teams build. But these benefits are recognized only if the broader culture supports the good practices that make being on call a manageable process and not a nightmare.
At New Relic, we’ve worked hard to structure both our systems and our organization to make it easier for us to meet these challenges.
It’s all about the DevOps mindset
Before the growing popularity of DevOps within engineering organizations helped to break down barriers between development and operations, on-call duties typically rested on a subset of engineers, such as a centralized site reliability or operations team. In a non-DevOps world, these engineers were responsible for resolving any issues in the services they watched, but they often had no way to feed post-incident lessons back to the people who designed and developed those services. And even when communication channels existed for them to make recommendations, developers who didn’t have to wake up at 3 a.m. to troubleshoot their broken services often lacked the same sense of urgency. Product owners would often feel it was more important to move on to the next new feature than to urge their teams to pay down technical debt.
When you build a service and you’re on call for that service, you make different decisions about it than when you can throw it over the wall for someone else to support. On any given team at New Relic, each member must understand the full lifespan of their service, from framework to packaging to deployment.
This DevOps mindset, of course, means no one team is an island. Our services all fit together to form a large, interconnected product platform dependent on a complex system of cloud services, database maintainers, and intricate networking layers—just to name a few parts of the system. It’s not uncommon for an incident to start with one team while the root cause actually lies in a service further down the stack. Because of our product’s architecture, teams must be able to interact and have clear, documented on-call processes.
On-call rotations at New Relic
Most teams at New Relic use some variation of a one-week on-call rotation, with one engineer as the primary responder and another as the secondary. So, if a team has six engineers, then each engineer will be the primary person on call every six weeks. That’s not so bad.
The sustainability and burden of any approach, though, really depends on the composition of the team, the services they manage, and the team’s collective knowledge of the services. Again, this is where team autonomy comes into play. As Kevin Corcoran, a senior software engineer on the metrics pipeline team points out, the most important part of how New Relic does on call “is that each team is allowed to design and implement their own on-call system.”
“Shortly after joining,” Corcoran says, “I realized that other teams within engineering had different systems that were more flexible and/or more aligned with their on-call responsibilities. This sparked a series of discussions within my team that led to us creating and implementing our own on-call system, which really just combines the best parts of a few other systems that were already in use.”
Corcoran’s team has structured its rotation so that there is always a primary and a “non-primary.” Using a script that runs in Jenkins, the team randomly rotates the on-call order of the non-primaries. When an incident occurs, if the primary is unavailable or doesn’t respond to a page, the non-primaries are paged one at a time, in a random order, until someone responds.
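The core of that escalation logic is simple to sketch. Here’s a minimal illustration in Python—the function and engineer names are hypothetical, and the team’s actual implementation is a script that runs in Jenkins:

```python
import random

def build_escalation_order(team, primary):
    """Return the paging order for an incident: the primary first,
    then the remaining engineers (the non-primaries) in random order."""
    non_primaries = [engineer for engineer in team if engineer != primary]
    random.shuffle(non_primaries)
    return [primary] + non_primaries

# Example: a six-person team with "alice" as this week's primary.
team = ["alice", "bob", "carol", "dan", "erin", "frank"]
order = build_escalation_order(team, primary="alice")
print(order[0])  # "alice" is always paged first; the rest follow randomly
```

Randomizing the non-primary order spreads the off-hours burden evenly across the team instead of always paging the same second responder.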
“That level of freedom and autonomy is not something I was accustomed to from previous jobs,” Corcoran says, “and it’s been really refreshing.”
The New Relic Browser team, on the other hand, takes an entirely different approach. They’ve created a configurable custom application that rotates a new team member into the “primary” role once every minute. If a team member gets paged and doesn’t respond right away, the system rotates to the next person, and so on, until someone acknowledges the alert.
Some team members were skeptical of this approach. After all, there’s a certain comfort in knowing you’re on call for a particular week; you have your laptop and phone with you all the time, and when you go off call, you can relax. “Last October we discussed as a team whether to continue with this structure,” says Honey Darling, the team’s engineering manager. “Some of the newer team members were uncomfortable with the idea. They didn’t like the idea that, while they would have a lower level of pressure, they would always have to be ‘on.’ But it’s worked out. Now the team likes this approach and doesn’t want to change.”
The team has an unusually low pager burden, which has contributed to the success of this model. “If we had more incidents, it might not work as well,” Honey points out. But fewer incidents is a mixed blessing: the team is less practiced at incident response when something does go wrong. Still, the Browser team members have embraced this unique approach to on call, enjoying the knowledge that if a problem does occur and they’re not available or don’t feel prepared to address it, another engineer is just a rotation away.
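A scheduler like the Browser team’s can be sketched as a deterministic function of the clock: given a roster, an agreed-upon starting point, and a rotation interval, anyone can compute who is primary right now. This is an illustrative Python sketch, not the team’s actual application, and the interval and names are assumptions:

```python
from datetime import datetime, timezone

def current_primary(roster, epoch, now, interval_seconds=60):
    """Rotate the primary role through the roster at a fixed interval,
    measured from an agreed-upon epoch."""
    elapsed = (now - epoch).total_seconds()
    index = int(elapsed // interval_seconds) % len(roster)
    return roster[index]

roster = ["alice", "bob", "carol"]
epoch = datetime(2018, 1, 1, tzinfo=timezone.utc)
now = datetime(2018, 1, 1, 0, 2, 30, tzinfo=timezone.utc)  # 150 seconds in
print(current_primary(roster, epoch, now))  # "carol" (third one-minute slot)
```

Because the function is pure, the paging system and every engineer’s dashboard will agree on who is primary without any shared mutable state; escalation on an unacknowledged page just means re-evaluating the function for the next slot.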
Monitoring incidents? You bet your metrics
Helping customers instrument their code is the lifeblood of our company, but we also instrument our engineering organization. We track metrics that include the total number of pages per engineer, the number of hours in which an engineer was paged, and the number of off-hours pages received (those that occur outside of normal business hours). We track the same metrics at the team and group levels. For example, for off-hours pages we monitor our alerting data from PagerDuty, and can show managers and executives how many times a team was paged during a given timeframe, and how many of those alerts occurred off hours.
Tracking metrics like off-hours pages helps call attention to teams that are struggling under unmanageable on-call loads. What’s an unmanageable load? At New Relic, if a team averages more than one off-hours page per week, that team is considered to have a high on-call burden. It’s important for us to stay on top of this because problems that occur outside of business hours need immediate attention and can cause our engineers a lot of fatigue and unhappiness.
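Computing that burden signal from paging data is straightforward. The sketch below assumes page timestamps have already been pulled from an alerting system such as PagerDuty; the function names and the 9-to-5 definition of business hours are illustrative, not New Relic’s exact definitions:

```python
from datetime import datetime

def is_off_hours(page_time, workday_start=9, workday_end=17):
    """Treat weekends and anything outside 9 a.m.-5 p.m. local time as off hours."""
    if page_time.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (workday_start <= page_time.hour < workday_end)

def weekly_off_hours_average(pages, weeks):
    """Average number of off-hours pages per week over a reporting period."""
    off_hours = sum(1 for page in pages if is_off_hours(page))
    return off_hours / weeks

pages = [
    datetime(2018, 6, 4, 3, 15),   # Monday 3:15 a.m.  -> off hours
    datetime(2018, 6, 5, 14, 0),   # Tuesday 2:00 p.m. -> business hours
    datetime(2018, 6, 9, 11, 30),  # Saturday          -> off hours
]
avg = weekly_off_hours_average(pages, weeks=1)
print(avg)  # 2.0 -> above the one-per-week threshold, a high burden
```

A team whose average crosses the one-per-week threshold would show up in this report as a candidate for shifted priorities or SRE support.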
When a team’s burden is too high, we may shift priorities, allowing them to focus on paying down technical debt or automating away toil until their on-call burden drops. Or we may provide them support in the form of senior SREs, who can help the team improve their services.
Gathering these metrics and forming the right responses is a critical part of ensuring that we maintain a structure and organization that allows our teams to thrive in their on-call practices.
Toward a better on-call practice
In thinking about planning for or revising an on-call practice, I’ve found that it’s useful to address the following issues:
- Size: How big is the engineering organization? How big are individual teams? What kind of rotation can teams handle?
- Growth: How fast is the engineering organization growing? What’s the turnover rate?
- Geography: Is your organization geographically centralized or widely distributed? Do you have the size and distribution to institute “follow the sun” rotations, or do engineers need to cope with off-hours pages?
- Organization: How is the engineering organization structured? Have you adopted a modern DevOps culture in which teams own the full lifecycle of a service from development to operations, or are development and operations siloed? Do you have a centralized SRE group, or are SREs embedded on engineering teams throughout the organization?
- Complexity: How are your applications structured? Do your engineers support well-defined services that are plugged into a larger architecture, or is your product a monolithic application supported by different teams? How many services does each team support? How stable are those services?
- Dependencies: How many customers (internal or external) depend on your services? If a service fails, how big is the blast radius?
- Tooling: How sophisticated are your incident response process and tools? How thorough and current are your team’s runbooks and monitoring? Do engineers have adequate tooling and organizational support when they respond to a page? Do engineers get automatic, actionable notifications of problems?
- Expectations: Is being on call the norm in your engineering culture? Is it seen as a valuable and essential part of the job, or as an extraneous burden?
- Culture: Does your company have a blameless culture that focuses on true root cause and addressing systemic issues, or do you have a “blame and shame” culture where people are punished when something goes wrong?
What happens when the pager finally goes off?
All of these great on-call practices aren’t worth much if we don’t have a good incident response process in place. Part two of this series—Managing Incident Response at New Relic—covers the incident response process developed by our reliability engineering team.