At New Relic, engineers work in small, autonomous teams, usually 4-8 people, with full ownership of the services or features they produce. That means they collaborate with designers, product managers, and other stakeholders to build and deploy the thing and to maintain it through successive release cycles. Dealing with incidents is no different; teams respond to pages for their own systems and manage incidents collaboratively instead of delegating response to a centralized operations group.
Notably, though, software teams at New Relic also include Site Reliability Engineers. SREs are software engineers with specialized knowledge, responsible for maintaining a holistic view of the health of a system and for spreading reliability practices—pervasive automation, constant refinement of instrumentation, etc.—throughout teams.
Our SREs leverage the New Relic platform as an essential tool to proactively achieve these goals in three key ways: operationalizing services through instrumentation, experimenting on services through instrumented load testing, and leading teams through incident responses.
1. Operationalizing services
From the SRE point of view, monitoring is about looking at the entire system holistically and understanding how to get visibility at multiple scales and from multiple perspectives. Operationalizing services, then, is about helping teams set up proactive monitoring of their systems—key for baselining performance, helping engineers iteratively and intelligently improve the system, and speeding response to inevitable failures.
For example, let’s look at how we monitor the New Relic Database (NRDB), our proprietary database technology that powers many of our product features.
NRDB is a complex, massively scaled distributed system that consists of three tiers of worker nodes, which ingest, store, and query data. Each tier comprises hundreds of physical hosts and often thousands of processes running at any given time. The tiers communicate via a publish-subscribe (pub/sub) pattern and streaming systems and are supported by a set of ancillary services that are themselves distributed systems.
Just 18 months ago, NRDB was writing 300 million new data points a minute; it’s now up to 1.5 billion data points per minute, and it scans trillions of stored events and metrics to deliver results when customers query their data.
So, how have our SREs helped the NRDB team proactively monitor this beast?
For a start, we use New Relic APM throughout the NRDB cluster for code-level visibility and debugging. But with a distributed system like this, we can’t measure the “health” of the system from a single node—what we care about are emergent behaviors arising from different nodes working together. So we use New Relic Insights for cluster-wide visibility, tracking metrics like disk I/O and CPU usage across the entire cluster. To test the entire system end-to-end, we rely on New Relic Synthetics. We send an API request to insert data and then send another request to query that data, which helps us ensure everything is working normally.
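The insert-then-query check described above can be sketched in a few lines. This is only an illustration of the round-trip idea, not the actual Synthetics script: the `round_trip_check` function, the `InMemoryStore` class, and the `SyntheticCheck` event type are hypothetical names, and the in-memory store stands in for the real insert and query APIs.

```python
import time
import uuid
from typing import Callable, Dict, List


def round_trip_check(
    insert: Callable[[Dict[str, str]], None],
    query: Callable[[str], List[Dict[str, str]]],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.1,
) -> bool:
    """Insert a uniquely tagged event, then poll until a query returns it."""
    marker = uuid.uuid4().hex  # unique value so we only match our own event
    insert({"eventType": "SyntheticCheck", "marker": marker})

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if any(row.get("marker") == marker for row in query(marker)):
            return True  # the data made it through ingest, storage, and query
        time.sleep(poll_interval_s)
    return False  # the event never became queryable within the timeout


class InMemoryStore:
    """Hypothetical stand-in for the real insert and query endpoints."""

    def __init__(self) -> None:
        self.rows: List[Dict[str, str]] = []

    def insert(self, event: Dict[str, str]) -> None:
        self.rows.append(event)

    def query(self, marker: str) -> List[Dict[str, str]]:
        return [row for row in self.rows if row.get("marker") == marker]


store = InMemoryStore()
healthy = round_trip_check(store.insert, store.query)
print("end-to-end check passed" if healthy else "end-to-end check FAILED")
```

Using a fresh marker on every run means the check can only pass if the event written in this run is the one that comes back, so stale data cannot mask a broken pipeline.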
Of course, if things aren’t working right, we want to know before our customers do. To stay ahead of the game, we’ve set up New Relic Query Language (NRQL) alerts to pinpoint and alert on behaviors that are leading indicators of potential problems. For example, even the smallest increases in processing time for critical queries can slow our customers down, so we need to discover and eliminate the source of those increases as quickly as possible.
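To make the idea of a leading indicator concrete, here is a minimal sketch of the comparison such an alert condition might encode: fire when recent processing time creeps above a baseline by more than a small tolerance. The function name, the sample numbers, and the 5% threshold are assumptions for illustration; in practice the condition would be expressed as an NRQL alert rather than in application code.

```python
from statistics import mean
from typing import Sequence


def latency_regression(
    baseline_ms: Sequence[float],
    recent_ms: Sequence[float],
    max_increase: float = 0.05,  # assumed tolerance: 5% above baseline
) -> bool:
    """Return True when the recent mean exceeds the baseline mean by more than max_increase."""
    if not baseline_ms or not recent_ms:
        raise ValueError("both windows need at least one sample")
    baseline = mean(baseline_ms)
    recent = mean(recent_ms)
    return recent > baseline * (1.0 + max_increase)


# Hypothetical processing times (in milliseconds) for a critical query.
baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
steady = [101.0, 99.0, 100.0]
creeping = [106.0, 108.0, 107.0]

print(latency_regression(baseline, steady))    # within tolerance
print(latency_regression(baseline, creeping))  # small but real increase
```

A tight tolerance catches small regressions early, at the cost of more noise; the right value depends on how stable the baseline is.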
Finally, a key source of operational issues is configuration changes, or inconsistent configurations, in the cluster. These hosts are definitely cattle, not pets, but we still need visibility into our herd, so we use New Relic Infrastructure to alert on changes in the cluster’s configuration.
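One way to think about inconsistent configurations is as a deviation from the cluster majority. The sketch below illustrates that idea only; it is not how New Relic Infrastructure works internally, and the host names, configuration keys, and `find_config_drift` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

HostConfigs = Dict[str, Dict[str, str]]


def find_config_drift(configs: HostConfigs) -> List[Tuple[str, str, str, str]]:
    """Return (host, key, actual, expected) for each value that differs from the cluster majority."""
    keys = sorted({key for cfg in configs.values() for key in cfg})
    drift = []
    for key in keys:
        # The most common value for this key is treated as the expected one.
        values = Counter(cfg.get(key, "<missing>") for cfg in configs.values())
        expected, _ = values.most_common(1)[0]
        for host in sorted(configs):
            actual = configs[host].get(key, "<missing>")
            if actual != expected:
                drift.append((host, key, actual, expected))
    return drift


# Hypothetical inventory: one host has a smaller heap than its peers.
cluster = {
    "host-a": {"jvm_heap": "8g", "kernel": "4.4"},
    "host-b": {"jvm_heap": "8g", "kernel": "4.4"},
    "host-c": {"jvm_heap": "4g", "kernel": "4.4"},
}

for host, key, actual, expected in find_config_drift(cluster):
    print(f"{host}: {key}={actual} (cluster majority is {expected})")
```

Treating the majority as the expected value fits a herd of interchangeable hosts, where nobody maintains a per-host specification by hand.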
As you can see, our SREs use the entire New Relic platform to improve visibility and maintain the health of a radically complex system like NRDB.
2. Experimenting on services through instrumented load testing
New Relic SREs help their teams push and stress their services with load testing, which they track with New Relic dashboards. Successful experiments require strong instrumentation to compare the before and after states, and contextual dashboards help teams understand what’s happening in real time.
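A before-and-after comparison can be as simple as lining up the same percentiles from both states. The following is a rough sketch under assumed data: the latency samples are invented, and `percentile` and `compare_runs` are hypothetical helpers, not part of any New Relic product.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def compare_runs(
    before: Sequence[float], during: Sequence[float]
) -> Dict[str, Dict[str, float]]:
    """Report p50 and p95 for both states, plus the change between them."""
    report = {}
    for label, p in (("p50", 50), ("p95", 95)):
        b = percentile(before, p)
        d = percentile(during, p)
        report[label] = {"before": b, "during": d, "change": d - b}
    return report


# Hypothetical response times (ms) sampled before and during a load test.
before = [10, 12, 11, 13, 12, 11, 10, 14, 12, 11]
during = [12, 15, 14, 30, 13, 16, 14, 45, 15, 14]

report = compare_runs(before, during)
for label, row in report.items():
    print(label, row)
```

Looking at the tail (p95) alongside the median matters here: load often leaves the typical request nearly unchanged while a minority of requests slow down sharply.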
When the New Relic Browser team does load testing, for example, a critical piece of the exercise involves creating a single dashboard from which they can quickly see how the entire system is behaving.
Metrics like Throughput by Host, Throughput by App, or Throughput by DataType are all monitored elsewhere, but where and how the team visualizes that information during their load testing trials creates important context. The Browser team needs to have the right information in one place, so they can properly monitor their tests.
- The filter at the top of the dashboard lets the team filter the entire dashboard by different environments; they can choose to look at the behavior of the system as a whole or focus on one particular area. This can help disambiguate load-testing issues related to differing infrastructure and configuration rather than to the software itself.
- The dashboard displays the same information in multiple ways, such as looking at processing time by host versus by partition; this helps the team see nuances in how data is moving through the system. To do this, the team either uses chart faceting within single charts or simply displays slightly different charts next to each other.
- The team augments any known errors with a custom attribute, which allows them to differentiate between known errors and new errors they haven’t yet instrumented.
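The last two techniques in the list, viewing one metric through different facets and tagging known errors with a custom attribute, can be sketched as follows. The event fields, the `knownError` attribute name, and the set of known errors are assumptions made for this example; the Browser team's actual attributes are not given here.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]

# Hypothetical set of error classes the team has already instrumented.
KNOWN_ERRORS = {"TimeoutError", "QueueFull"}


def tag_known_errors(events: Iterable[Event]) -> List[Event]:
    """Copy each event, adding a knownError attribute to those that carry an error."""
    tagged = []
    for event in events:
        copy = dict(event)
        if "error" in copy:
            copy["knownError"] = copy["error"] in KNOWN_ERRORS
        tagged.append(copy)
    return tagged


def facet_mean(events: Iterable[Event], facet: str, field: str) -> Dict[str, float]:
    """Average a numeric field, grouped by the value of a facet attribute."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for event in events:
        if facet in event and field in event:
            groups[str(event[facet])].append(float(event[field]))
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


events = tag_known_errors([
    {"host": "h1", "partition": "p1", "processing_ms": 10.0},
    {"host": "h1", "partition": "p2", "processing_ms": 30.0},
    {"host": "h2", "partition": "p1", "processing_ms": 20.0, "error": "TimeoutError"},
    {"host": "h2", "partition": "p2", "processing_ms": 40.0, "error": "SchemaMismatch"},
])

by_host = facet_mean(events, "host", "processing_ms")
by_partition = facet_mean(events, "partition", "processing_ms")
new_errors = [e["error"] for e in events if e.get("knownError") is False]

print(by_host)
print(by_partition)
print(new_errors)
```

The same four events tell two different stories depending on the facet, which is why showing both views side by side helps during a load test.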
Instrumentation and visualization of load testing is just one way SREs help their teams experiment on their services. Game-days—in which you carefully introduce harmful issues into your system to see how your team resolves them—are another way SREs encourage teams to push the boundaries of their services. Game-days are a great way to determine if a team’s processes for operating a service are effective. More importantly, they help teams ensure that their New Relic alerts and dashboards are properly configured and effective.
3. Leading teams through incident responses
During New Relic’s incident response process, SREs use monitoring to create a shared understanding of the issue so stakeholders can converge on the root cause as quickly as possible.
Further, as reliability experts in their parts of the system, SREs are empowered to dig around in the system as they work to resolve an issue. With an intuitive understanding of the system, they know what obscure things to investigate when the system is behaving strangely. SREs understand how the system is “supposed” to behave (how it was engineered) as well as the ways it actually behaves (the operational reality), which helps them form sophisticated theories based on those mental models.
For example, during an incident, experienced SREs will use New Relic to:
- Back up assertions: “We’re seeing elevated error rates—here’s the proof.”
- Share context: “Here’s a chart that shows normal behavior; as you can see, what’s happening now is totally abnormal.”
- Form hypotheses: “I see some strange behavior in system X, but not upstream in system Y. I wonder if system Z is the problem?”
- Test hypotheses: “I think if we do X, we’ll get Y, but let’s watch the charts and see if that’s true.”
New Relic’s incident response process has many moving parts, and our SREs are often first responders. Proper instrumentation and visualization of the New Relic platform—and with the New Relic platform—is critical to their success in resolving incidents as quickly as possible.
Don’t just instrument and walk away
The role of SRE varies from organization to organization, and even from practitioner to practitioner. At New Relic, SREs tackle operational problems with a New Relic toolbox that helps them think about the system holistically and proactively, monitoring everything from service level indicators (SLIs) and service level objectives (SLOs) to JVM garbage collection to alert fatigue.
New Relic SREs help other software engineers connect the dots between how they design and implement new features and how the features affect the production environment, breaking down those pesky ops/dev silos. Monitoring and instrumentation allow SREs to create fast feedback loops for their teams as they improve and maintain their systems. The relationship is more dynamic than just “instrument it for us, and walk away”; it’s an incremental and permanent cycle.