Matthew Flaming, vice president of site reliability at New Relic, contributed to this post.
This post is adapted from a talk given at FutureStack18: San Francisco and elsewhere titled, “SLOs and SLIs In The Real World: A Deep Dive.”
At New Relic, defining and setting Service Level Indicators (SLIs) and Service Level Objectives (SLOs) is an increasingly important aspect of our site reliability engineering (SRE) practice. It’s not news that SLIs and SLOs are an important part of high-functioning reliability practices, but planning how to apply them within the context of a real-world, complex modern software architecture can be challenging, especially figuring out what to measure and how to measure it.
In this post, we’ll use a highly simplified version of New Relic’s architecture to walk you through some concrete, practical examples of how we define and measure SLIs and SLOs for our own modern software platform.
How we define SLI and SLO
It’s easy to get lost in a fog of acronyms, so before we dig in, here is a quick and easy definition:
When we apply this definition to availability, for example, SLIs are the key measurements of the availability of a system; SLOs are goals we set for how much availability we expect out of a system; and Service Level Agreements (SLAs) are the legal contracts that explain what happens if our system doesn’t meet its SLO.
SLIs exist to help engineering teams make better decisions. Your SLO performance is critical information to have when you’re making decisions about how hard and fast you can push your systems. SLOs are also important data points for other engineers when they’re making assumptions about their dependencies on your service or system. Lastly, your larger organization should use your SLIs and SLOs to make informed decisions about investment levels and about balancing reliability work against engineering velocity.
Set SLIs and SLOs against system boundaries
When we look at the internals of a modern software platform, the level of complexity can be daunting (to say the least). Platforms often comprise hundreds, if not thousands, of unique components, including databases, service nodes, load balancers, message queues, and so on. Trying to establish customer-facing SLIs and SLOs for each component may not be feasible.
That’s why we recommend focusing on SLIs and SLOs at system boundaries, rather than for individual components. Platforms tend to have far fewer system boundaries than individual components, and SLI/SLO data taken from system boundaries is also more valuable. This data is useful to the engineers maintaining the system, to the customers of the system, and to business decision makers.
A system boundary is the point at which one or more components expose capabilities to external customers. For example, in the New Relic platform, we have our login service, which represents the capability for a user to authenticate a set of credentials using an API.
It’s likely the login service has several internal components—service nodes, a database, and a read-only database replica. But these internal components don’t represent system boundaries because we’re not exposing them to the customer. Instead, this group of components acts in concert to expose the capabilities of the login service.
Using this idea of system boundaries, we can think of our simplified New Relic example as a set of logical units (or tiers)—a UI system, a service tier (which includes the login service), two separate data systems, and an ingest system—rather than as a tangle of individual components. And, of course, we have one more system boundary, which is the boundary between all of these services as a whole and our external customers.
Focusing on system boundary SLIs lets us capture the value of these critical system measurements, while significantly simplifying the measurements we need to implement.
SLI + SLO, a simple recipe
We can apply the concepts of SLI, SLO, and system boundaries to the different components that make up our modern platform. And although the specifics of how to apply those concepts will vary based on the type of component, we use the same general recipe in each case:
- Identify the system boundaries within our platform.
- Identify the customer-facing capabilities that exist at each system boundary.
- Articulate a plain-language definition of what it means for each capability to be available.
- Define one or more SLIs for that definition.
- Start measuring to get a baseline.
- Define an SLO for each capability, and track how we perform against it.
- Iterate and refine our system, and fine-tune the SLOs over time.
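One way to keep the outputs of this recipe together is a small record per capability. The following is a minimal sketch, not our actual tooling: the `Capability` class, its field names, the wording of the availability definition, and the 0.999 target are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Capability:
    """One customer-facing capability exposed at a system boundary."""
    boundary: str                       # which system boundary exposes it
    name: str                           # the capability itself
    availability_definition: str        # plain-language definition
    slis: List[str] = field(default_factory=list)  # what we measure
    slo_target: Optional[float] = None  # set only after a baseline exists

    def meets_slo(self, measured: float) -> bool:
        """Compare a measured SLI value (0.0 to 1.0) against the SLO target."""
        if self.slo_target is None:
            raise ValueError("no SLO defined yet; measure a baseline first")
        return measured >= self.slo_target


# Hypothetical entry for the login service described earlier.
login = Capability(
    boundary="service tier",
    name="authenticate a set of credentials",
    availability_definition="If I submit valid credentials, I am logged in.",
    slis=["share of valid login requests that succeed"],
)
login.slo_target = 0.999  # illustrative value, picked after a baseline
print(login.meets_slo(0.9995))
```

Leaving `slo_target` unset until a baseline exists mirrors the order of the steps: the definition and the SLI come first, and the number comes later.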
Each system boundary has a unique set of functionality and dependencies to consider. Let’s take a closer look at how these considerations shape how we define our SLIs and SLOs for each tier.
Capabilities drive SLIs
Part of the availability definition for our platform means that it can ingest data from our customers and route it to the right place so that other systems can consume it. We’re dealing here with two distinct capabilities—ingest and routing—so we need an SLO and SLI for each.
It’s critical that we start with plain-language definitions of what “availability” for each of these capabilities means to customers using this system. In the case of the ingest capability, the customers in question are the end users of our system—the folks sending us their data. In this case, the definition of availability might look like, “If I send my data to New Relic in the right format, it’ll be accepted and processed.”
We can now use that plain-language definition to determine which metric best corresponds to how it defines availability. The best metric here is probably the number of HTTP POST requests containing incoming data that are accepted with 200 OK status responses. Phrasing this in terms of an SLO, we might say that “99.9% of well-formed payloads get 200 OK status responses.”
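As a rough illustration, the ingest SLI can be computed as the share of well-formed payloads that received a 200 OK. This is a sketch under assumptions: the `(well_formed, status_code)` record shape, the `ingest_availability` helper, and the sample numbers are hypothetical, and malformed payloads are excluded because the definition only promises to accept data sent in the right format.

```python
SLO_TARGET = 0.999  # "99.9% of well-formed payloads get 200 OK"


def ingest_availability(responses):
    """Return the fraction of well-formed payloads answered with 200 OK.

    responses: iterable of (well_formed: bool, status_code: int) pairs.
    Returns None when no well-formed payloads were seen.
    """
    statuses = [status for well_formed, status in responses if well_formed]
    if not statuses:
        return None
    accepted = sum(1 for status in statuses if status == 200)
    return accepted / len(statuses)


# Made-up sample: 9,995 accepted, 5 server errors, 50 malformed payloads.
sample = [(True, 200)] * 9995 + [(True, 503)] * 5 + [(False, 400)] * 50
sli = ingest_availability(sample)
print(f"SLI = {sli:.4f}, SLO met: {sli >= SLO_TARGET}")
```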
A plain-language definition for the data routing capability might look like, “Incoming messages are available for other systems to consume off our message bus without delay.” With that definition, we might define the SLI and SLO as, “99.xx% of incoming messages are available for other systems to consume off of our message bus within 500 milliseconds.” To measure this SLO (99.95%, for example), we can compare the ingest timestamp on each message to the timestamp of when that message became available on the message bus.
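The timestamp comparison might be sketched as follows. The pair-of-timestamps record shape and the `routing_sli` helper are assumptions made for illustration; only the 500-millisecond threshold comes from the definition above.

```python
def routing_sli(messages, threshold_ms=500):
    """Return the fraction of messages available on the bus within threshold_ms.

    messages: iterable of (ingest_ts_ms, available_ts_ms) pairs.
    Returns None when there are no messages to evaluate.
    """
    total = 0
    on_time = 0
    for ingest_ts_ms, available_ts_ms in messages:
        total += 1
        if available_ts_ms - ingest_ts_ms <= threshold_ms:
            on_time += 1
    return on_time / total if total else None


# Made-up sample: three messages arrive in time, one takes 900 ms.
sample_messages = [(0, 120), (1000, 1499), (2000, 2500), (3000, 3900)]
routing_value = routing_sli(sample_messages)
print(routing_value)
```

The resulting fraction is the SLI; comparing it with the chosen target (99.95%, for instance) tells us whether the SLO holds for that window.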
OK, great! We now have an SLO for each capability. In practice, though, we worry less about the SLO than we do about the SLI because SLO numbers are easy to adjust. In fact, we might want to adjust an SLO number for various business reasons. For example, we might start out with a lower SLO for a less mature system and increase the SLO over time as the system matures. That’s why we say it’s important for the capability to drive the SLI.
SLIs are broad proxies for availability
The data ingested by our platform is stored in one of our main data tiers; at New Relic this is NRDB, our proprietary database cluster. In plain-language terms, NRDB is working properly if we can rapidly insert data into the system, and customers can query their data back out.
Under the hood, NRDB is a massive distributed system with thousands of nodes and different worker types, and we monitor it to track metrics like memory usage, garbage collection time, data durability and availability, and events scanned per second. But at the system boundary level, we can just look at insert latency and query response times as proxies for those classes of underlying errors.
When we set an SLI for query response times, we’re not going to look at averages, because averages lie. But we also don’t want to look at the 99.9th percentile because those are probably going to be the weird worst-case scenario queries. Instead, we focus on the 95th or 99th percentile, since that gives us insight into the experience of the vast majority of our customers without focusing too much on the outliers.
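A small made-up dataset shows why the choice of statistic matters. The sketch below uses the nearest-rank percentile method, which is one of several common definitions and an assumption here; the latency values are invented.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: 90 fast queries, 9 slower ones, 1 pathological query.
latencies_ms = [100] * 90 + [400] * 9 + [30000]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
p999_ms = percentile(latencies_ms, 99.9)
print(mean_ms, p95_ms, p99_ms, p999_ms)
```

In this sample the single 30-second query pulls the mean above what 99 of the 100 queries experienced, and it is the only thing the 99.9th percentile reports, while the 95th and 99th percentiles both land on the slowest typical query.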
At this point, we can configure an alert condition to trigger if we miss our query response time SLI. That lets us track how often the alert fires, which in turn tells us how often we satisfy our SLI; in other words, how much of the time we are available. We definitely don’t want to use this alert to wake people up in the middle of the night (that threshold should be higher), but it’s an easy way to track our performance for SLO bookkeeping.
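The bookkeeping itself can be as simple as counting evaluation windows. This sketch assumes one p99 reading per one-minute window and a hypothetical 1,000 ms threshold; neither value comes from our real alert configuration.

```python
def slo_attainment(window_p99_ms, threshold_ms):
    """Return the fraction of evaluation windows that stayed within threshold.

    window_p99_ms: iterable with one p99 latency reading per window.
    Returns None when there are no windows to evaluate.
    """
    windows = list(window_p99_ms)
    if not windows:
        return None
    good = sum(1 for p99 in windows if p99 <= threshold_ms)
    return good / len(windows)


# Invented day of one-minute windows: 1,437 healthy, 3 in violation.
day_of_windows = [350] * 1437 + [2200] * 3
attainment = slo_attainment(day_of_windows, threshold_ms=1000)
print(f"{attainment:.4%} of windows met the SLI")
```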
We articulate these SLIs and SLOs for our data tier, so our customers know what to expect when they query their data. In fact, we can combine several SLOs into a single, customer-friendly measure of reliability:
Each logical unit gets its own SLO
In addition to our new NRDB data tier, we also have some legacy database systems in our platform that are sharded, meaning that we’ve partitioned one large database into smaller, faster, more easily managed parts called data shards.
To measure our legacy databases, we create SLIs and SLOs for each shard. That’s because, due to the sharding, users’ workloads aren’t distributed between shards.
In a horizontally scaled service with three nodes, if you lose one node, you can reasonably expect that your database can still service two-thirds of your customers’ requests. So you measure the SLO as a logical unit. In a sharded system with three shards, losing a single node also creates a 33% error rate overall, but a third of your customers are actually seeing a 100% error rate while two-thirds see nothing wrong at all.
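The gap between the aggregate view and the per-shard view is easy to see with a small calculation. This sketch assumes three shards with equal traffic; the shard names and request counts are invented.

```python
def error_rate(errors, total):
    """Return errors / total, or 0.0 when there was no traffic."""
    return errors / total if total else 0.0


# Invented counts per shard: (failed requests, total requests).
shards = {
    "shard-a": (0, 1000),
    "shard-b": (1000, 1000),  # this shard's node is down
    "shard-c": (0, 1000),
}

aggregate = error_rate(
    sum(errors for errors, _ in shards.values()),
    sum(total for _, total in shards.values()),
)
per_shard = {name: error_rate(e, t) for name, (e, t) in shards.items()}

print(f"aggregate: {aggregate:.1%}")
for name, rate in per_shard.items():
    print(f"{name}: {rate:.1%}")
```

The aggregate number hides the fact that every customer on the failed shard is fully down, which is what the per-shard measurement exposes.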
So for this legacy data tier, we measure SLIs and SLOs separately for each logical instance—in this case each database shard. Like the NRDB tier, the single capability of the legacy tier is query performance, which we measure with SLIs and SLOs for latency, error rate, and response time. We also use the same methodology for separate regions; for example, the EU region of New Relic gets its own set of SLO measurements for each system because it’s a logically separate instance (although we generally reuse the same SLIs).
Measure customer experience to understand SLO/SLIs for UIs
We assign one capability to our UI tier: we expect it to be fast and error-free. To measure UI performance, though, we have to change our perspective. Until now, our reliability concerns have been server-centric, but with the UI tier, we want to measure customer experience and how it’s impacted by the frontend. We have to set multiple SLIs for the UI.
For page load time, for example, we’ll use the 95th or 99th percentile load time rather than the average. Additionally, we’ll set different SLIs for different geographies. But for modern web applications, page-load time is only one SLI to consider.
Hard dependencies require higher SLOs
So far, we’ve explained how we define SLIs and SLOs for different services in our platform, but now we’re going to address a critical part of our core infrastructure, the networking tier. These are our most important SLIs and SLOs because they set the foundation for our entire platform. The networking tier is a hard dependency for all of our services.
For this tier, we’ve defined three capabilities: We need connectivity between availability zones (AZs), connectivity between racks within an AZ, and load-balanced endpoints that expose services both internally and externally. We need a higher SLO for these capabilities.
With these layers of dependencies come potential failure scenarios:
- If something goes wrong in the UI tier, it’ll be an isolated failure that should be easy for us to recover from.
- If our service tier goes down, the UI will be impacted—but we can implement some UI caching to reduce that impact.
- If the data tier goes down, the service tier and UI tier also go down, and the UI can’t recover until both the data tier and service tier come back online.
- If the network tier goes down, everything goes down, and we’ll need recovery time before the system is back online. And since systems don’t come back the instant a dependency recovers, our mean time to recovery (MTTR) increases per tier.
In general, we assume that we’ll lose roughly an order of magnitude in uptime for each tier. If we expect an SLO of 99.9% availability for services running on the networking tier, we set an SLO of 99.99% availability for the network itself.
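To make the order-of-magnitude difference concrete, an availability target can be converted into the downtime it allows. The 30-day period below is an assumption chosen for illustration; the two targets are the ones mentioned above.

```python
def allowed_downtime_minutes(slo, period_days=30):
    """Return the minutes of downtime an availability SLO permits per period."""
    return (1 - slo) * period_days * 24 * 60


service_budget = allowed_downtime_minutes(0.999)   # services on the network
network_budget = allowed_downtime_minutes(0.9999)  # the network itself

print(f"99.9%:  {service_budget:.2f} minutes per 30 days")
print(f"99.99%: {network_budget:.2f} minutes per 30 days")
```

Over 30 days, 99.9% leaves roughly 43 minutes of downtime, while 99.99% leaves a little over 4 minutes, so the tier underneath has to be about ten times more reliable than the services that depend on it.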
It’s difficult to implement graceful degradation scenarios against hard infrastructure outages, so we invest in reliability at these infrastructure layers and set higher SLOs. This practice is one of the most important things we can do for the overall health of our platform.
One last overall check
Now that we’ve defined SLIs and SLOs for the services that deliver our overall platform, we have a clear picture of where our reliability hotspots are, and our engineering teams have a solid basis for understanding, prioritizing, and defending their reliability decisions. However, we still need to implement one last SLI and SLO check: We need to measure our end-to-end customer experience.
To do this, we run a New Relic Synthetics script that represents a simple end-to-end customer workflow. It sends a piece of data to our platform and then logs in and queries for that specific data. If we detect any significant discrepancy between the performance of this script and the expectations set in our SLOs, we know we need to revisit our SLI methodology.
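The shape of such a workflow can be sketched in a few lines. This is not the Synthetics script itself and does not use any New Relic API: `run_end_to_end_check` and its `send`, `login`, and `query` parameters are hypothetical callables that a real check would back with actual HTTP calls, and the timeout and polling interval are illustrative.

```python
import time
import uuid


def run_end_to_end_check(send, login, query, timeout_s=30.0, poll_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Send a uniquely marked event, log in, and poll until it is queryable.

    send(marker), login() -> session, and query(session, marker) -> bool
    are supplied by the caller; clock and sleep are injectable for testing.
    """
    marker = str(uuid.uuid4())  # unique value so we find exactly this event
    start = clock()
    send(marker)
    session = login()
    while clock() - start <= timeout_s:
        if query(session, marker):
            return {"ok": True, "elapsed_s": clock() - start}
        sleep(poll_s)
    return {"ok": False, "elapsed_s": clock() - start}


# In-memory stand-ins for the platform, used only to exercise the sketch.
store = set()
result = run_end_to_end_check(
    send=store.add,
    login=lambda: "session-token",
    query=lambda session, marker: marker in store,
    sleep=lambda seconds: None,
)
print(result["ok"])
```

The elapsed time from a check like this can then be compared with the latency expectations set in the individual SLOs.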
Six things to remember
In closing, we encourage you to remember the following six points when it comes to setting SLIs and SLOs:
- Define SLIs and SLOs for specific capabilities at system boundaries.
- Each logical instance of a system (for example, a database shard) gets its own SLO.
- Combine SLIs for a given capability into a single SLO for that capability.
- Document and share your SLI/SLO contracts.
- Assume that both your SLOs and SLIs will evolve over time.
- Stay engaged—SLOs represent an ongoing commitment.
It takes a while to build a good reliability practice, but no matter how much time and effort you invest, we strongly believe that you can’t build resilient and reliable software architectures without clear definitions of the demands and availability you’re setting for your systems.