Recently, I met with a friend who is heading up a product group at a new startup. As he told me about his day-to-day work of building out a young organization, my reliability sixth sense started tingling. I asked, “Do you know what your uptime is?” “99.998%,” he replied, a figure that allows only about 10 minutes of downtime a year and basically meant the company never experienced incidents of any kind. “Surely you have incidents,” I said, “everyone does. We all do.”
“I don’t know,” he said. “We don’t really keep track.”
As vice president of site reliability at New Relic, I’ve heard from more than one organization that they don’t have a good way to keep track of incidents and their resolutions. My friend was taking an incredible risk, and he didn’t even realize it. I see two big consequences of this: 1) Such organizations end up with no coherent view on how their product or platform is performing; and 2) they end up with no coherent method for how to focus reliability work. They find themselves making decisions based on recency bias, opinions, and internal politics.
At New Relic, we generally define an incident as something that occurs when our system behaves in an unexpected way that might negatively impact our customers. It’s our belief that we should always be collecting data about our incidents, including details about their resolutions and our plans to make sure they don’t happen again. We do this with a tool we call Upboard, a custom application for tracking the full histories of incidents.
Tracking incident data with Upboard
Once an incident is underway, anytime the incident commander (IC) changes the status or adds an event in the incident’s Slack channel, Nrrdbot, our Slack bot, logs the status or event to Upboard. Nrrdbot also sends reminder alerts to the current IC to make sure they update the status at least every 10 minutes.
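The reminder rule can be sketched in a few lines. This is only an illustration of the 10-minute check described above, not how Nrrdbot is actually built; the function name `reminder_due` and the constant `UPDATE_INTERVAL` are hypothetical.

```python
from datetime import datetime, timedelta

# From the text: the IC should post a status update at least every 10 minutes.
UPDATE_INTERVAL = timedelta(minutes=10)


def reminder_due(last_update: datetime, now: datetime,
                 interval: timedelta = UPDATE_INTERVAL) -> bool:
    """Return True when the last status update is at least `interval` old."""
    return now - last_update >= interval


# Hypothetical timestamps for illustration.
last_update = datetime(2024, 1, 1, 12, 0)
print(reminder_due(last_update, datetime(2024, 1, 1, 12, 4)))   # False
print(reminder_due(last_update, datetime(2024, 1, 1, 12, 11)))  # True
```

A bot would run a check like this on a timer and ping the current IC whenever it returns True.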
As responders work to resolve the incident, ICs are required to complete a number of fields about the incident, including specifying:
- The team that will take responsibility for the incident
- The impact time of the incident
- The technical details and root cause of the incident
- The triggering event of the incident (for example, capacity limit reached, config change/error, code defect/error, third-party dependency—failure or change, or hardware failure)
We also use Upboard to track follow-up actions (with direct links to Jira where applicable) and link to any retrospectives about the incident. When applicable, we also record any messaging to customers regarding the incident.
Finally, one of Upboard’s most crucial features is that responders can’t “close” an incident until the full report has been filled out.
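To make the idea concrete, here is a minimal sketch of an incident record with the fields listed above and a close step that refuses to run until the report is complete. It is not Upboard's actual data model: the class, field, and function names are all hypothetical, and modeling "impact time" as a start and end timestamp is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional


class Trigger(Enum):
    """Triggering events, taken from the examples in the list above."""
    CAPACITY_LIMIT = "capacity limit reached"
    CONFIG_CHANGE = "config change/error"
    CODE_DEFECT = "code defect/error"
    THIRD_PARTY = "third-party dependency failure or change"
    HARDWARE_FAILURE = "hardware failure"


# Fields that must be filled in before an incident can be closed (assumed set).
REQUIRED_FIELDS = ("owning_team", "impact_start", "impact_end",
                   "root_cause", "trigger")


@dataclass
class Incident:
    title: str
    owning_team: Optional[str] = None
    impact_start: Optional[datetime] = None
    impact_end: Optional[datetime] = None
    root_cause: Optional[str] = None
    trigger: Optional[Trigger] = None
    follow_ups: List[str] = field(default_factory=list)  # e.g. ticket links
    closed: bool = False

    def missing_fields(self) -> List[str]:
        """Names of required fields that are still empty."""
        return [name for name in REQUIRED_FIELDS
                if getattr(self, name) in (None, "")]

    def close(self) -> None:
        """Mark the incident closed, but only once the report is complete."""
        missing = self.missing_fields()
        if missing:
            raise ValueError("cannot close incident; missing: "
                             + ", ".join(missing))
        self.closed = True


incident = Incident(title="Elevated API error rate")  # hypothetical incident
print(incident.missing_fields())
```

Putting the check inside `close()` means the report cannot be skipped under time pressure, which is the point of the rule.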
It may seem overkill to capture so much data about every incident, but we’re not just gathering data for the sake of it. What we’re collecting are the “metafacts” about our platform—what really causes incidents? Where are the hotspots in our system? Are there general themes driving incidents (like configuration changes vs. manual operations errors vs. unexpected workloads vs. code defects or regressions)? Are some teams suffering from untenable pager load? Where can we most effectively direct our “reliability work” budget to improve stability and quality? What new processes or training might help?
I can’t imagine trying to answer such questions without supporting data.
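Once the data exists, questions like these become simple counting exercises. The sketch below tallies incidents by triggering event and by team; the records, team names, and the helper `count_by` are made up for illustration and do not reflect real New Relic data.

```python
from collections import Counter

# Hypothetical incident records; the teams and counts are invented.
incidents = [
    {"team": "ingest", "trigger": "config change/error"},
    {"team": "ingest", "trigger": "capacity limit reached"},
    {"team": "alerts", "trigger": "config change/error"},
    {"team": "ingest", "trigger": "code defect/error"},
]


def count_by(records, key):
    """Count how many incident records share each value of `key`."""
    return Counter(record[key] for record in records)


by_trigger = count_by(incidents, "trigger")
by_team = count_by(incidents, "team")

print(by_trigger.most_common(1))  # [('config change/error', 2)]
print(by_team.most_common(1))     # [('ingest', 3)]
```

The same grouping works for any field captured in the report, which is one reason to insist that every field is filled in.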
How Upboard helps
The data we collect with Upboard helps us in a number of ways. First, it helps us ensure we’re asking the right questions about an incident, from the point it starts until well after its resolution. Our goal is to uncover the true root causes of a problem—and develop a plan for future prevention—rather than finding a person or team to take the fall.
Second, the data we collect with Upboard helps us understand how many incidents a team has had. It’s critical that we identify teams with high risks and find ways to help them prevent future incidents, whether by supporting them to reduce toil or by paying down technical debt where needed.
Finally, our Upboard data gives us invaluable insights into broader organizational patterns and reliability measurements, such as mean time between failures (MTBF) and mean time to repair (MTTR). MTBF measures the average time that elapses between one failure and the next, and MTTR measures how long, on average, it takes to repair a failure once it has occurred. These metrics are especially useful when your engineering teams are required to adhere to specific service level objectives (SLOs), which we also track with Upboard.
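As a rough sketch of how these two numbers fall out of incident timestamps, the example below uses the standard definitions: MTTR is the average time from the start of a failure to its repair, and MTBF is the average time between failures. Here MTBF is computed as the mean gap between one incident's resolution and the next incident's start, which is one common convention and an assumption on my part. The function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (impact start, resolved)


def mttr(windows: List[Window]) -> timedelta:
    """Mean time to repair: average duration from start to resolution."""
    if not windows:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in windows), timedelta())
    return total / len(windows)


def mtbf(windows: List[Window]) -> timedelta:
    """Mean time between failures: average gap between one incident's
    resolution and the next incident's start."""
    if len(windows) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(windows)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incident windows for illustration.
windows = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 3, 10, 30), datetime(2024, 1, 3, 11, 30)),
    (datetime(2024, 1, 7, 11, 30), datetime(2024, 1, 7, 12, 0)),
]

print(mttr(windows))  # 0:40:00
print(mtbf(windows))  # 3 days, 0:00:00
```

Neither number is meaningful unless the impact start and end times are recorded consistently for every incident.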
Do it for your business
New or old, anyone who hosts complex software systems eventually has to answer the question, “Why did this happen?” Our maturity is based on how well we’re able to answer that with clean historical data and facts.
You may not yet be ready to build (or even purchase) a fully integrated custom tool to track incidents. When you’re building a young organization, a simple spreadsheet may be all you need for the moment: a low-cost investment for a high reward. Building a rich dataset and a robust understanding of how your systems, teams, and organization respond to, and recover from, incidents is the best way to drive operational awareness and excellence in your business.
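If a spreadsheet is your starting point, even a small script that appends one row per incident to a CSV file keeps the data consistent. The sketch below is one possible shape, not a prescribed format; the column names, the `append_incident` helper, and the sample row are all assumptions.

```python
import csv
import io

# Assumed columns, loosely based on the report fields described earlier.
COLUMNS = ["started", "resolved", "team", "trigger", "root_cause", "follow_up"]


def append_incident(stream, row: dict, write_header: bool = False) -> None:
    """Write one incident as a CSV row; optionally write the header first."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    if write_header:
        writer.writeheader()
    writer.writerow(row)


# An in-memory buffer stands in for a real file opened in append mode.
log = io.StringIO()
append_incident(log, {
    "started": "2024-01-01T10:00",
    "resolved": "2024-01-01T10:30",
    "team": "ingest",
    "trigger": "config change/error",
    "root_cause": "bad flag",
    "follow_up": "TICKET-1",
}, write_header=True)
print(log.getvalue())
```

With a real file, you would open it with `open(path, "a", newline="")` and write the header only when the file is first created.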
Beth Long, a software engineer with New Relic’s Reliability Engineering team, contributed to this post.