New Relic has experienced remarkable growth in size and complexity over the past few years. Each minute, New Relic’s systems handle more than 30 million HTTP requests, ingest 600 million new data points, and query over 50 billion events. We have more than 200 unique services operated by more than 40 engineering teams, dealing with more than 4 petabytes of stored data.
Scaling our reliability practices to match this growth has proved a challenge. “Reliability” can be defined many ways, but broadly speaking, a reliable system is one that is stable, predictable, and highly available. Evolving effective reliability practices requires iteration and adaptation as an organization collectively figures out what works and what doesn’t.
Back in October 2014, New Relic experienced a high-severity, long-running incident that showed us how much our company growth had outstripped our existing processes. We had a hard time resolving the incident, communicating internally and externally, and applying lessons learned afterwards. It was something of a wake-up call for us.
Just one year later, the company experienced another high-severity incident, but our response was very different. Our customers knew what was happening, our response was calm and organized, and we were able to resolve the incident quickly. Now we could have a high-severity incident and be proud of our response as an organization.
So what changed in those twelve months?
Last month, I had the opportunity to tell our reliability story on stage alongside New Relic Engineering VP Matthew Flaming at our FutureStack: London event, and to share seven lessons we learned the hard way about the importance of reliability in large-scale systems.
Reliability in seven (not so) easy lessons
Lesson 1: Don’t wait for perfect answers.
Immediately after the October 2014 incident, we implemented new systems to help bring order to chaos. Within a month we rolled out email distribution lists, a Change Acceptance Board (CAB), and a new incident management tool.
If duct tape is what you have, start with duct tape and iterate from there.
Lesson 2: Start manually, then automate.
Some of those fast turnaround processes didn’t quite work. Our CAB proved unpopular with our engineers and ended up increasing friction for releases, leading to bigger, less frequent deployments. Ironically, that increased the risk that a release would cause an incident.
So we automated the CAB process for low-risk releases, which led to happier engineers and better results. We also created Gatekeeper, a “pre-flight checker” that automated the release process even further. Rather than creating arbitrary rules, we shifted to encouraging engineers to think wisely about risk.
Focus on clarifying the actual problem you’re trying to solve, then remove as much friction as possible with progressive automation.
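To make "progressive automation" concrete, here is a minimal sketch of a pre-flight check in the spirit of Gatekeeper. The fields and thresholds are hypothetical illustrations, not New Relic's actual criteria: low-risk releases are approved automatically, while anything that raises a concern is routed to a human review instead of through a blanket approval board.

```python
from dataclasses import dataclass

# Hypothetical release metadata; the real Gatekeeper's checks differ.
@dataclass
class Release:
    has_rollback_plan: bool
    touches_data_store: bool
    lines_changed: int

def preflight(release: Release) -> tuple[bool, list[str]]:
    """Auto-approve low-risk releases; flag the rest for human review."""
    concerns = []
    if not release.has_rollback_plan:
        concerns.append("no rollback plan")
    if release.touches_data_store:
        concerns.append("touches a data store")
    if release.lines_changed > 500:
        concerns.append("large diff")  # threshold is illustrative
    return (len(concerns) == 0, concerns)
```

The point of the design is that the check encodes *why* a release is risky, so engineers get a reason rather than an arbitrary rule.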
Lesson 3: MTTR is (mostly) about process and people.
The first round of tools we rolled out helped give us more visibility into incidents, but there was no middle ground between going it alone and alerting the entire organization. So we introduced a process for declaring a severity level, to differentiate between “we’re experiencing some lag” and “the entire site is down!” We also added our senior staff to the on-call rotation alongside engineers. Now we could be more selective about how and when we sounded the alarm.
Then we created the New Relic Emergency Response Force to step in during the worst incidents and make sure things run smoothly.
Reeling in your Mean Time to Resolution (MTTR) requires refining your processes so that incidents become manageable, and having the right people in place when things go sideways and your runbooks can’t keep up.
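A severity ladder plus selective escalation can be sketched in a few lines. The levels, criteria, and role names below are illustrative assumptions, not New Relic's actual definitions:

```python
from enum import IntEnum

# Illustrative severity ladder; real levels and criteria will vary.
class Severity(IntEnum):
    SEV4 = 4  # minor degradation: owning team handles it
    SEV3 = 3  # customer-visible lag: page the on-call engineer
    SEV2 = 2  # partial outage: pull in senior on-call staff
    SEV1 = 1  # site down: activate the emergency response force

def who_to_page(sev: Severity) -> list[str]:
    """Escalate selectively instead of alerting the whole organization."""
    escalation = ["owning-team"]
    if sev <= Severity.SEV3:
        escalation.append("on-call-engineer")
    if sev <= Severity.SEV2:
        escalation.append("senior-staff-on-call")
    if sev == Severity.SEV1:
        escalation.append("emergency-response-force")
    return escalation
```

The key property is that a SEV4 wakes nobody up, while a SEV1 mechanically pulls in everyone who is needed, with no judgment calls made mid-incident.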
Lesson 4: Define realistic, concrete metrics of reliability.
For a while we tried aiming for 100% reliability. That didn’t go so well.
We figured out that “How good is our reliability?” is the wrong question to ask, because “reliability” means different things to different people. Instead, you need a common empirical language: What constitutes availability for your services? What level of availability for different services satisfies your customers? Define concrete metrics, and, as always, iterate!
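One common empirical language is request-based availability: the fraction of requests served successfully, measured against an agreed target. A minimal sketch, with an illustrative "three nines" objective rather than any actual New Relic SLO:

```python
def availability(successful: int, total: int) -> float:
    """Request-based availability: fraction of requests served successfully."""
    return successful / total if total else 1.0

def meets_slo(successful: int, total: int, target: float = 0.999) -> bool:
    # target is an illustrative "three nines" objective, not a real SLO
    return availability(successful, total) >= target
```

Once the metric and target are explicit, "How good is our reliability?" becomes an answerable question per service rather than a matter of opinion.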
Lesson 5: Make the right answer easy every single time.
At first we focused on resolving incidents faster, and that worked. Then we realized we were hitting the same issues over and over again.
That was the genesis of our “Don’t Repeat Incidents” process, which unlocked a new era of stability (and well-rested engineers):
For all incidents that cause an SLA violation, all merges to master for that team are halted, except changes directly related to fixing the root cause of the incident.
Of course, common sense applies: in cases where a fix is infeasible, reasonable steps to reduce the likelihood of a repeat incident, or the severity of impact if the issue does recur, will suffice.
It’s not enough to get the right answer once. Create processes so that the right answer becomes inevitable.
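The merge-halt rule above can be expressed as a simple gate, which is one way to make the right answer inevitable rather than a matter of discipline. This is a hypothetical sketch of the policy, not New Relic's actual tooling:

```python
# Hypothetical sketch of the "Don't Repeat Incidents" merge gate:
# while a team has an open SLA-violating incident, only changes tagged
# as fixing one of those incidents may merge to master.

def may_merge(open_incidents: set[str], change_fixes: set[str]) -> bool:
    """Allow a merge if the team has no open SLA-violating incident,
    or the change is directly related to fixing one."""
    return not open_incidents or bool(open_incidents & change_fixes)
```

Encoding the rule in CI means the team never has to argue about whether a given change should wait; the gate answers for them.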
Lesson 6: Don’t wait for incidents to reveal your problems.
We introduced a service maturity model that gave us a framework for approaching reliability:
- Identify known risks in a risk matrix, assigning likelihood and severity to each risk.
- Monitor and alert, starting with high-likelihood/high-severity risks and working down from there.
- Mitigate known risks, moving items as far away from high-likelihood/high-severity as possible.
Once you’ve mitigated known risks, conducting game-day and chaos testing will help uncover new issues before they surprise you.
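The risk-matrix triage described above can be sketched as a simple scoring function. The 1-5 scales and example risks here are assumptions for illustration:

```python
# A minimal risk matrix sketch, assuming 1-5 scales for both axes.
def risk_score(likelihood: int, severity: int) -> int:
    """Score a risk by likelihood x severity."""
    return likelihood * severity

def triage(risks: dict[str, tuple[int, int]]) -> list[str]:
    """Order risks so high-likelihood/high-severity items come first."""
    return sorted(risks, key=lambda name: risk_score(*risks[name]), reverse=True)

# Hypothetical example: an untested failover outranks a slow-filling disk.
backlog = triage({
    "db failover untested": (4, 5),
    "log disk fills": (3, 2),
})
```

Mitigation then means driving each item's score down and re-running the triage, so the backlog always reflects the current highest-risk work.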
Lesson 7: Autonomy ≠ going it alone.
- Our Site Reliability Engineers (SREs) are embedded in teams, working on the daily challenges of reliability for their services.
- Our Reliability Engineering team optimizes reliability tools and processes, and evangelizes best practices throughout the organization.
- Site Reliability Champions work on reliability issues that cross team boundaries.
Our culture prizes autonomy, but that doesn’t mean we can leave teams to tackle reliability on their own. Teams (and individual SREs) need organizational support and investment to thrive.
We’ve tried a variety of things that worked poorly or not at all, and at each step we’ve learned from our mistakes. Figuring out reliability in complex systems isn’t a graceful, orderly process but an iterative, evolutionary one.
We’ve changed a lot since October 2014 and expect to change a whole lot more as we continue to grow and evolve.
Want to hear more? Join us at one of our upcoming FutureStack events—Berlin is next on June 23, with New York City following that on September 13 and 14. We’ll be sharing our reliability story in more detail and we would love to hear yours.
The London event sold out and we are seeing similar demand in Berlin, so be sure to register today to save your spot.
Note: Event dates, speakers, and schedules are subject to change without notice.