(Editor’s note: This post previously appeared on The New Stack.)
Here’s a statement any engineering manager can agree with: Reliability is complicated.
Modern software teams face no shortage of edge cases and variations across the service categories and tiers of their ever-evolving architectures. In the midst of leading a team through the day-to-day firefighting, it can be difficult to see the forest for the trees. But as managers, we know our teams face similar trials: defects and regressions, capacity problems, operational debt, and dangerous workloads affect all of us.
And then there is the complexity of scale, something New Relic knows about first hand. The New Relic platform includes more than 300 unique services and petabytes of SSD storage that handle at least 40 million HTTP requests, write 1.5 billion new data points, and process trillions of events … every minute. The platform is maintained by more than 50 agile teams performing multiple production releases a week. To cope with serious scale like this, engineering teams must be nimble and fast moving. Their managers must also ensure that their teams adhere to reliability processes that support this kind of complexity and scale. So how do we do it at New Relic?
Engineering managers at New Relic use the following seven questions to determine if their teams (or services) are meeting the essentials of our reliability best practices. Take a look and ask yourself, how do your teams stack up?
Also on the New Relic blog: Best Practices for Setting SLOs and SLIs For Modern, Complex Systems
Question 1. Are your deploys and rollbacks bulletproof?
Regardless of whether you deploy once a month or once an hour, robust and speedy deploy and rollback tooling is critical to running reliable software. Would you be comfortable deploying every day for a week, and rolling back at least half of those deploys to test your machinery—and having any member of your team do it?
If your answer is “no,” either because you’re not confident in your tooling or because you have too much toil, it’s time to invest more team resources in rehearsing deploys or optimizing your tooling.
Question 2. Have you run a game-day lately?
Your team’s effectiveness at speedily resolving incidents is all about preparation. Game-days—in which you introduce harmful issues into your system to see how your team resolves them—are the best way to test that team’s processes for operating a service, and to make sure their alerts and dashboards are properly configured. Game-days are especially useful when your team adds a new service, but they’re also a great way to ramp up new team members and train them on how your team works together.
If you haven’t run a game-day since adding a new service or team member, you’re overdue. Ironically, this is particularly true if your team doesn’t encounter critical issues very often—the fewer incidents you have, the easier it is to get out of practice!
Also on the New Relic blog: How to Run an Adversarial Game Day
Question 3. Are you reliably catching regressions before full production rollout?
Is your team equipped to reliably catch regressions before customers see them, or before they impact more than 10% of your customers?
Your team’s pre-production environment should have tools in place to catch major defects, configuration errors, and significant performance regressions before new code or configuration changes reach production. But in reality, pre-production tests can never catch everything, particularly at scale.
To limit the customer impact of any defects that make it to production, modern software teams often use canaries or feature flags in their deploy processes. Canaries or feature flags also make it easier to respond quickly if things go wrong, and partial rollbacks tend to be faster and safer than full redeploys. Partial rollbacks also carry less risk of thundering-herd restart issues.
Have your team research the technical or process changes they’d need to adopt these types of deploys, and reduce the friction of that adoption however you can.
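A percentage-based feature flag is one common way to get this kind of partial rollout. The sketch below is a minimal, hypothetical illustration (the flag name and account IDs are invented, and most teams would use an existing feature-flag service rather than rolling their own). Hashing the flag and customer together keeps each customer’s experience stable while the rollout percentage ramps up:

```python
import hashlib

def flag_enabled(flag_name: str, customer_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a customer into a percentage rollout.

    The same (flag, customer) pair always lands in the same bucket,
    so a customer's experience doesn't flicker between requests as
    the flag ramps from 1% toward 100%.
    """
    digest = hashlib.sha256(f"{flag_name}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0..99
    return bucket < rollout_percent

# Ramp a hypothetical "new-query-path" flag to 10% of customers:
enabled = [cid for cid in ("acct-1", "acct-2", "acct-3")
           if flag_enabled("new-query-path", cid, 10)]
```

Because the bucketing is deterministic, rolling back is just lowering the percentage: the same customers drop out, with no redeploy required.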
Question 4. When was the last time you updated your risk matrix? Do you even have one?
A risk matrix can help give your team a clear understanding of the areas in your software where incidents could occur and their potential severity. Essentially, they’re a key method of identifying high-likelihood/high-impact scenarios that you should prioritize addressing before they become incidents. Risk matrices can also help your team identify where you need alerting and incident runbooks in place, particularly for incidents that have the potential of medium impact or higher.
Since your systems are always evolving, it’s important to revisit your risk matrices to make sure they keep up; you should update them at least once every eight months, but ideally whenever you add a new service. Use the risk matrix to verify that you have alerts and runbooks for all risk matrix entries that are medium impact or above.
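To make the prioritization concrete, a risk matrix can be as simple as a scored list. This is an illustrative sketch, not a prescribed format; the entries and the three-level likelihood/impact scales are invented for the example:

```python
LIKELIHOOD = {"low": 1, "medium": 2, "high": 3}
IMPACT = {"low": 1, "medium": 2, "high": 3}

def needs_runbook(impact: str) -> bool:
    # Per the guideline above: alerts and runbooks for entries
    # that are medium impact or above.
    return IMPACT[impact] >= IMPACT["medium"]

def prioritize(risks):
    """Rank risk-matrix entries by likelihood x impact, highest first."""
    return sorted(risks,
                  key=lambda r: LIKELIHOOD[r["likelihood"]] * IMPACT[r["impact"]],
                  reverse=True)

# Hypothetical entries for a single service:
risks = [
    {"name": "Kafka consumer lag", "likelihood": "high", "impact": "medium"},
    {"name": "Cert expiry", "likelihood": "low", "impact": "high"},
    {"name": "Dashboard typo", "likelihood": "medium", "impact": "low"},
]
ranked = prioritize(risks)
```

Even a spreadsheet version of this gives the team a shared, reviewable artifact to revisit when a new service is added.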
Question 5. How much free capacity do you have in each of your service tiers?
Capacity bottlenecks are a major cause of service disruptions, and running without enough free capacity makes your team’s systems more vulnerable to workload or latency changes, or even small—but noticeable—performance regressions.
Further, does your team have a safe margin of free capacity online? At New Relic, our reliability experts recommend that teams generally maintain 30% free capacity, or enough to cover 90 days of workload growth, for every service they own. We also make sure we’ve deployed at least N+2 instances of each service for redundancy, even if they’re not needed to support workload growth. It’s important to note that the less accurately you can measure your capacity, the more conservative you need to be about your estimates. Remember to measure free capacity based on the hotspots in your system, not on averages.
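Those two guidelines (30% free capacity, and enough headroom for 90 days of growth) reduce to simple arithmetic. Here’s a sketch, assuming you can measure peak utilization at your hotspot and estimate an approximate daily growth rate; the example numbers are invented:

```python
def capacity_ok(peak_utilization: float, daily_growth_rate: float,
                target_free: float = 0.30, horizon_days: int = 90) -> bool:
    """Check two headroom guidelines: >=30% free capacity today,
    and room to absorb ~90 days of compounding workload growth.

    peak_utilization: utilization at the busiest hotspot (0..1) --
        measured at the hotspot, not as a fleet-wide average.
    daily_growth_rate: fractional daily growth, e.g. 0.002 = 0.2%/day.
    """
    free_now = 1.0 - peak_utilization
    projected_peak = peak_utilization * (1 + daily_growth_rate) ** horizon_days
    return free_now >= target_free and projected_peak < 1.0

# A tier at 65% peak utilization growing 0.2% per day passes:
# free_now = 0.35, projected peak ~= 0.65 * 1.002**90 ~= 0.78
```

The less accurate your measurements, the larger `target_free` should be; the 30% figure assumes you trust your utilization numbers.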
Question 6. Can you defensively rate limit?
Is your team prepared to keep one bad actor from taking down your entire system? If a few clients start generating too many queries, POSTs, or API calls, it’s critical that you have the ability to selectively drop or limit the workload so your service doesn’t go offline.
If your team isn’t adequately prepared, you may want to invest in automatic overload protection. At a minimum, though, your team should set in place runbooks and manual controls for shedding dangerous workloads or data that you can’t handle during an incident.
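A per-client token bucket is one common building block for this kind of selective limiting. The sketch below is illustrative rather than production-ready (a real system would also need eviction of idle clients, distributed coordination, and metrics), but it shows the core idea: one noisy client exhausts only its own bucket:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: each client gets `rate` requests/sec
    in steady state, with bursts of up to `burst` requests. A client
    that floods the service drains only its own bucket, so it can't
    starve everyone else.
    """
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: burst)   # start each client full
        self.last = defaultdict(time.monotonic)    # last-seen timestamp

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_id]
        self.last[client_id] = now
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens[client_id] = min(self.burst,
                                     self.tokens[client_id] + elapsed * self.rate)
        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False  # shed this request rather than fall over
```

Even if you never automate the limiting, having a switch like this behind a manual runbook control is far better than discovering mid-incident that you have no way to shed load.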
Question 7. Can your systems scale without meaningful architectural changes for the next 12 months?
In today’s modern software architectures, scaling inflection points are some of the biggest overall reliability risks. Hitting one often requires significant work, and coordination across several teams, to address.
Work with your team to assess the next 12 months—if you don’t think you can get through the year without architectural or process changes, make sure all your stakeholders know changes are coming, and make plans to deliver the “necessary work” before it’s too late. But don’t embark on that “necessary work” without carefully defining it.
So, how’d you do?
If you’ve answered “no” to any of these questions, DON’T PANIC. But don’t shrug and move on, either.
Here’s the thing about reliability work: It’s never done—for anyone. If we’re not actively updating our systems as our platforms scale, or refreshing our mental models of our systems’ architectures, we’re falling behind. And once we fall behind, it’s hard to catch up.
It’s probably unrealistic to expect your team to be able to properly answer all seven questions in a single quarter, or even two. But that’s no reason to give up. One-step-at-a-time improvements, delivered continually, can add up to big results.