One of the most important artifacts to come out of our site reliability work at New Relic is something we call the risk matrix. Essentially, a risk matrix is a list of things teams identify that could go wrong with their services and components. Teams then categorize those risks according to their likelihood and impact.
In the STELLA report, the research group SNAFUcatchers determined that team members who handle the same parts of a system often have different mental models of what the system looks like. How each member of a team thinks their system works, as opposed to how it actually works, can quickly get out of sync as members commit various changes to the system’s code base. All too often, this is a normal state of affairs.
Aligning on the inherent risks in our services and components can help sync our teams on the actual shape of our ecosystem. Achieving this alignment also makes it easier to prioritize reliability work—and actually get it done. Even more important, that clarity translates into business value to the rest of the organization.
So, creating and reviewing risk matrices helps our teams:
- Synchronize our drifting mental models
- Onboard new team members
- Recognize and address hotspots or unexpected gotchas in our systems
- See new risks when things change
- Prioritize reliability work to avoid future disasters
New Relic publishes an internal “risk matrix how-to guide” designed to help teams create and maintain their own matrices. The following four steps for creating a risk matrix were adapted from that guide.
Step 1: Modeling the risk matrix
We organize our risk matrices using the Threat Assessment and Remediation Analysis (TARA) methodology, originally designed to help engineering teams “identify, prioritize, and respond to cyber threats through the application of countermeasures that reduce susceptibility to cyber attack.” More colloquially, TARA is short for “Transfer, Avoid, Reduce, and Accept.” While we don’t use the Transfer category at New Relic, our risk matrices are designed so teams identify how they will take steps to avoid, reduce, or accept a risk.
Avoid: Risks classified here are those a team can fix completely. For example, if a team uses libraries or other dependencies in its service, the team may need to schedule regular maintenance windows during which it can upgrade those libraries or dependencies.
Reduce: These are risks a team cannot completely avoid, but the team can reduce the impact or likelihood of such risks with additional engineering effort. For example, network outages are a risk for all teams, but we can reduce that risk by working to ensure our services gracefully recover after an outage.
Accept: In some cases a team may accept a risk; these are risks a team knows about, but there is little it can do to prevent them. For example, a team has to worry about network downtime, but if its system recovers gracefully after a network outage, the team can categorize that risk as one it accepts. (But the risk of the system not being able to recover gracefully after an outage would belong in the “Reduce” category.)
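As a rough illustration of the three categories above, the decision can be reduced to two yes/no questions. This is only a sketch: the `Category` enum and the `categorize` helper are hypothetical names introduced here, not part of New Relic's guide or tooling.

```python
from enum import Enum


class Category(Enum):
    """The three TARA categories used on the board (Transfer is not used)."""
    AVOID = "Avoid"
    REDUCE = "Reduce"
    ACCEPT = "Accept"


def categorize(can_fix_completely: bool, can_reduce: bool) -> Category:
    """Pick a category from two yes/no answers (hypothetical helper)."""
    if can_fix_completely:
        return Category.AVOID
    if can_reduce:
        return Category.REDUCE
    # The team knows about the risk but can do little to prevent it.
    return Category.ACCEPT


# Outdated libraries: fixable with scheduled upgrades.
print(categorize(can_fix_completely=True, can_reduce=True).value)    # Avoid
# The service may not recover after a network outage: engineering work helps.
print(categorize(can_fix_completely=False, can_reduce=True).value)   # Reduce
# Network downtime itself, once the service recovers gracefully.
print(categorize(can_fix_completely=False, can_reduce=False).value)  # Accept
```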
Step 2: Building the risk matrix
There are many ways to build a risk matrix. At New Relic, we use a kanban board in Jira. The board has three columns for our Avoid, Reduce, and Accept categories. We categorize risks along two axes, Impact and Likelihood, with three levels for each axis: High, Medium, and Low (more on these below).
To build an effective risk matrix, teams need to concentrate on their systems’ capabilities and what their customers (internal or external) expect from those capabilities. Asking what can degrade or interrupt the delivery of those capabilities helps teams focus on the risks that matter instead of trying to “boil the ocean” by addressing every possible risk.
If a team has fewer than 10 risks, it might want to think more deeply about its system. Conversely, if it has identified 50 or more risks, it might need to reduce scope and focus more strictly on risks that threaten its ability to deliver customer-facing capabilities. In most cases, we advise teams to toss aside risks that are too general.
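The rule of thumb on risk counts can be expressed as a simple check. The sketch below takes the two thresholds (10 and 50) from the text; the function name and the returned wording are made up for illustration.

```python
def scope_advice(risk_count: int) -> str:
    """Return a hint based on how many risks a team has identified."""
    if risk_count < 0:
        raise ValueError("risk_count cannot be negative")
    if risk_count < 10:
        return "think more deeply about the system"
    if risk_count >= 50:
        return "reduce scope to customer-facing capabilities"
    return "scope looks reasonable"


print(scope_advice(7))
print(scope_advice(25))
print(scope_advice(60))
```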
Finally, we ask teams to identify risks inherent in their upstream and downstream dependencies, and in the links among their services. A team should also categorize these risks and add them to the kanban board.
When a team is ready to build out its board/risk matrix, we ask the team to schedule at least an hour to do so—and the team must have its engineering manager and product manager present for the exercise. The questions in the table below are designed to guide these discussions:
|Area|Questions|
|---|---|
|Incidents|● How many follow-up incident tickets do you have in your backlog? ● What services have had the most incidents and why? ● Do services need to be restarted after dependent services are restored? ● Is there commonality between team incidents?|
|Process|● Does the team have any error-prone manual toil? ● Can the team deploy daily to both staging and production? ● What is the state of the team’s runbooks and documentation?|
|Code|● Does the team own any legacy code, and, if so, is it understood? ● How many inherited services/systems does the team own and are they understood? ● How many languages are represented in the codebase?|
|Testing|● Is testing coverage comprehensive enough? ● Are regression tests automated and kept up to date? ● What is the number of false positives? ● Are failure modes meaningful and actionable?|
|Dependencies|● What are the upstream dependencies and are they well supported? ● Are upcoming changes API-compatible, or are there End-of-Life plans for any of the dependencies? ● What are the downstream dependencies? ● Are libraries up to date and supported long-term? ● Is there reliance on third-party tooling? Is that tooling still supported?|
|Unexpected user interactions|● Are requests rate limited, and are the limits easily configurable, documented, and appropriate? ● Are incoming requests validated to make sure they are well-formed and to prevent abuse? ● Is there monitoring and alerting on system throughput? ● Are requests authenticated?|
|Capacity and scaling|● Have systems been designed with capacity and scaling in mind? ● Is alerting set up to sound an alarm when throughput is consistently at 75% of anticipated capacity? ● Does the team participate in quarterly capacity-planning exercises?|
|Monitoring and alerting|● What is the ratio of false positives to real alerts? ● How often does the team learn about incidents from support or end users instead of being alerted by pages? ● Are the same monitoring alerts set up for staging and production? ● Has the team recently verified that its monitoring alerts are still active? ● Is there enough visibility into the system’s failure modes and are there alerts for those modes?|
We want teams to generate real work from these risks; the risk-reduction workload needs to be manageable, so that a team actually stands a chance of completing it.
For each risk, teams create a “Risk”-type Jira ticket, and assign it an Impact and Likelihood rating, each on a scale from 1 (high) to 3 (low).
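To make the shape of a risk ticket concrete, here is a minimal in-memory sketch of a board. It does not use the Jira API; the `Risk` dataclass, the `Rating` enum, and `build_board` are hypothetical names, and only the three columns and the 1 (high) to 3 (low) scale come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum

COLUMNS = ("Avoid", "Reduce", "Accept")


class Rating(IntEnum):
    """Shared scale for Impact and Likelihood: 1 is high, 3 is low."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


@dataclass(frozen=True)
class Risk:
    summary: str
    column: str
    impact: int
    likelihood: int

    def __post_init__(self):
        if self.column not in COLUMNS:
            raise ValueError(f"unknown column: {self.column!r}")
        # Rating(...) raises ValueError for anything outside 1..3.
        Rating(self.impact)
        Rating(self.likelihood)


def build_board(risks):
    """Group risks into the three kanban columns."""
    board = {column: [] for column in COLUMNS}
    for risk in risks:
        board[risk.column].append(risk)
    return board


board = build_board([
    Risk("Library upgrades are overdue", "Avoid", Rating.MEDIUM, Rating.HIGH),
    Risk("Service does not recover after network outage", "Reduce",
         Rating.HIGH, Rating.MEDIUM),
])
for column, risks in board.items():
    print(column, [r.summary for r in risks])
```

Validating the ratings when a ticket is created keeps out-of-range values from reaching the board.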
Let’s take a closer look at how we define Likelihood and Impact.
Defining risk Likelihood
To avoid subjective assessments of likelihood, we ask teams to write concrete definitions of low, medium, and high likelihood that are appropriate for the team, and to apply those definitions consistently. For example:
- High likelihood: The event that created the risk happened in the last six months.
- Medium likelihood: The event that created the risk happened in the last year.
- Low likelihood: A particular event that could cause a risk hasn’t happened yet, but the team predicts the event could happen.
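Definitions like these are concrete enough to automate. The sketch below derives a likelihood level from the date an event last occurred. Two assumptions are mine rather than the source's: six months is approximated as 183 days, and an event that last happened more than a year ago is treated as low likelihood, a case the definitions above do not cover.

```python
from datetime import date, timedelta
from typing import Optional

SIX_MONTHS = timedelta(days=183)  # assumption: approximate six months
ONE_YEAR = timedelta(days=365)


def likelihood(last_occurrence: Optional[date], today: date) -> str:
    """Map the last time an event happened to High, Medium, or Low."""
    if last_occurrence is None:
        # The event has not happened yet, but the team predicts it could.
        return "Low"
    age = today - last_occurrence
    if age <= SIX_MONTHS:
        return "High"
    if age <= ONE_YEAR:
        return "Medium"
    # Assumption: events older than a year fall back to Low.
    return "Low"


today = date(2024, 6, 30)
print(likelihood(date(2024, 5, 1), today))   # High
print(likelihood(date(2023, 9, 1), today))   # Medium
print(likelihood(None, today))               # Low
```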
Defining risk Impact
Defining impact can also be slippery and subjective. In this case, we refer teams to New Relic’s incident severity levels: If a risk were to materialize, how severe would the resulting incident be? Teams also factor in potential incident duration, for example by assessing the service’s ability to recover when a dependency fails and comes back into service, or whether the team’s monitoring and alerting is comprehensive enough.
This table shows how we might apply incident severity to impact:
|Incident severity|Impact|
|---|---|
|4 or 5: Involves minor bugs or minor data lags that affect but don’t hinder customers, or declared simply to raise awareness of something such as a risky service deployment|Low|
|3: Involves data lags or unavailable features|Medium|
|1 or 2: Reserved for cases involving brief, full product outages|High|
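The severity-to-impact mapping in the table can be captured as a small lookup. The levels and their impact values are taken from the table; the names `SEVERITY_TO_IMPACT` and `impact_for_severity` are illustrative only.

```python
# Incident severity level -> risk impact, as listed in the table.
SEVERITY_TO_IMPACT = {
    1: "High",
    2: "High",
    3: "Medium",
    4: "Low",
    5: "Low",
}


def impact_for_severity(severity: int) -> str:
    """Translate an expected incident severity into a risk impact level."""
    try:
        return SEVERITY_TO_IMPACT[severity]
    except KeyError:
        raise ValueError(f"unknown severity level: {severity}") from None


print(impact_for_severity(2))  # High
print(impact_for_severity(3))  # Medium
print(impact_for_severity(5))  # Low
```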
Step 3: Working through the risks
After a team has built a kanban board/risk matrix, it works with its product and engineering managers to prioritize the work in the Avoid and Reduce columns. Any risks marked as high likelihood/high impact should get the highest priority. We expect every team to dedicate one sprint per quarter to reducing the risk count in the Avoid column. (We define a sprint as work a team can do in one to three weeks that stands on its own, delivers durable business value, and could be shipped to its intended audience when it’s complete.) During the sprint, the team should include a gameday to test assumptions about the risks it’s resolving and to make sure the team’s mental models are aligned with the reality of its systems.
It’s important to note that this scheduled work may not be able to completely eliminate all of the most urgent risks, but the work should dial down the risks’ probability or impact.
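One possible way to order the work described in this step is sketched below. The text only states that high likelihood/high impact risks come first; ranking the remaining risks by the sum of the two ratings, with impact as a tie-breaker, is an assumption made for this example. The `Risk` tuple and `prioritize` function are hypothetical names.

```python
from typing import NamedTuple


class Risk(NamedTuple):
    summary: str
    column: str      # "Avoid", "Reduce", or "Accept"
    impact: int      # 1 (high) to 3 (low)
    likelihood: int  # 1 (high) to 3 (low)


def prioritize(risks):
    """Return Avoid/Reduce risks, most urgent first."""
    actionable = [r for r in risks if r.column in ("Avoid", "Reduce")]
    # Lower numbers mean higher ratings, so an ascending sort puts
    # high impact/high likelihood (1 + 1) at the front.
    return sorted(actionable, key=lambda r: (r.impact + r.likelihood, r.impact))


backlog = [
    Risk("Flaky deploy script", "Avoid", impact=2, likelihood=2),
    Risk("No failover for queue", "Reduce", impact=1, likelihood=3),
    Risk("Unsupported library in use", "Avoid", impact=1, likelihood=1),
    Risk("Upstream network downtime", "Accept", impact=1, likelihood=1),
]
for risk in prioritize(backlog):
    print(risk.summary)
```

Risks in the Accept column are left out because no work is scheduled against them.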
Step 4: Reviewing the risk matrix
New Relic expects each team to review its risk matrix at least every eight months, or whenever it onboards a new team member or releases a new service or product.
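The review triggers above can be checked mechanically. This sketch uses only the Python standard library; the eight-month interval comes from the text, while `add_months`, `review_due`, and their parameters are hypothetical names chosen for the example.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def review_due(last_review: date, today: date,
               new_team_member: bool = False,
               new_release: bool = False,
               interval_months: int = 8) -> bool:
    """True if the matrix should be reviewed now."""
    if new_team_member or new_release:
        return True
    return today >= add_months(last_review, interval_months)


print(review_due(date(2024, 1, 15), date(2024, 9, 15)))  # True
print(review_due(date(2024, 1, 15), date(2024, 9, 14)))  # False
print(review_due(date(2024, 1, 15), date(2024, 2, 1), new_team_member=True))  # True
```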
If a team has existing risks, and it’s not sure if those risks are still relevant, the team should work through the following rules:
- Is the risk too vague? If yes, delete it.
- Does this risk actually belong to a different team? If yes, transfer it.
- Is this risk a security issue rather than an availability issue? If yes, transfer it to the Security team.
- Does the team have no control over this risk, or has it done as much work as possible to mitigate it? If yes, review the risk’s impact and likelihood, and put it in the Accept column.
- If the team can’t fix the risk, does it have a plan to monitor the risk so the team doesn’t make it worse? If yes, review the risk’s impact and likelihood, and put it in the Reduce column.
- Can the team schedule work to mitigate this risk? If yes, review the risk’s impact and likelihood, and put it in the Avoid column.
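The review rules read naturally as an ordered checklist. The sketch below assumes the first question answered "yes" decides the outcome, and that a risk matching none of the questions stays where it is; neither assumption is stated in the source. The answer keys and action strings are hypothetical.

```python
# (answer key, action) pairs in the order the questions are listed.
REVIEW_RULES = [
    ("too_vague", "delete"),
    ("belongs_to_other_team", "transfer to the owning team"),
    ("security_issue", "transfer to the Security team"),
    ("mitigation_exhausted", "re-rate and move to Accept"),
    ("monitoring_plan_only", "re-rate and move to Reduce"),
    ("work_can_be_scheduled", "re-rate and move to Avoid"),
]


def review(answers):
    """Return the action for the first question answered 'yes'."""
    for key, action in REVIEW_RULES:
        if answers.get(key, False):
            return action
    # Assumption: with no 'yes' answers, the risk is left unchanged.
    return "keep as is"


print(review({"too_vague": True}))
print(review({"security_issue": True, "work_can_be_scheduled": True}))
print(review({"work_can_be_scheduled": True}))
```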
Reviewing risk matrices on a regular basis keeps teams honest as they work to maintain the reliability of their systems.
Reliability is a feature
The risk matrix elevates the visibility of risks, and the matrix exposes those risks to the entire organization so that we can make better decisions as we prioritize work. We’ve tried to make this process as simple as possible—and, ideally, teams should see fewer items in the Reduce and Avoid columns every quarter.
If that improvement isn’t happening, the team may be struggling to prioritize its reliability work, which should prompt a conversation with the team’s Site Reliability Champion. Being able to prioritize work that reduces documented risk shows that we recognize that reliability delivers business value.