For those of us who spend a lot of time thinking about what a great modern SRE practice should look like in a DevOps world, the Site Reliability Engineering book serves as a fantastic point of reference. Written by members of Google’s SRE team, the book shares a compelling glimpse of how they scale and operate their cloud platform and SaaS products.

But what about SRE practices at companies that aren’t the size of Google? For all that’s been written about reliability practices, it’s surprisingly hard to find specific, detailed descriptions of the day-to-day role SREs play in other engineering organizations. Most descriptions on the internet contain relatively vague phrases, like “SREs combine software engineering and operational skillsets,” and “SREs automate all the things.”

Of course, some companies have great, robust internal descriptions of how their SREs support teams in their engineering organizations—but even there, “SRE” is often used as a catchall term meaning “operations engineers, or engineers who support infrastructure components, or who write code but also spend a lot of time doing other things.” For a long time, this was how we used the term at New Relic. We knew roughly what our SREs were supposed to be doing, but disagreement about the specifics—like how much manual operational toil it was acceptable for an SRE to take on, or how SREs should engage with teams in architecture discussions—sometimes made it challenging for our SREs to prioritize the most high-leverage, high-value work.

Working toward clarity and consensus

The process of creating our own SRE role description took time and involved the input of a variety of stakeholders—from individual SREs to executive leadership. This was a worthwhile investment: the exercise helped us clarify and shape a shared understanding of

  • Why we have SREs at New Relic.
  • The vision for our SRE team.
  • How SREs can most effectively contribute to the future of our platform.

This clarity also gave our SREs and their managers tools for calibrating expectations, identifying failures, and targeting success.

Our experience suggests that engineering organizations can benefit by creating clarity and consensus around what they expect from their SREs. To support that effort, we want to share our internal definition of the SRE role at New Relic, pulled from our engineering organization’s process documentation.

SREs at New Relic operate in two different contexts:

  1. Some are part of “pure” SRE teams that work to build and support our core internal platform, such as our container fabric clusters (our in-house container orchestration and runtime platform) and networking systems.
  2. Others partner with product engineering teams as domain experts in reliability, tooling, and scaling areas.

In both cases, the same fundamental role description applies. Similarly, the same description applies to all title levels of our SRE practice, although the focus and scope of work naturally change as our SREs increase in seniority.

So, here is our internal SRE role description:

The SRE role at New Relic

SREs at New Relic are engineers who focus on, and are recognized primarily for, improving the reliability of systems in the New Relic platform. From a business perspective, the goal of the work that SREs do is to build and maintain our customers’ trust, and to allow the business to scale by steadily decreasing the per-service and per-host operational overhead of our global platform.

At a high level, SREs make this happen by

  • Championing reliability best practices.
  • Guiding designs and processes with an eye toward resilience and low toil.
  • Reducing technical complexity and sprawl.
  • Driving the usage of tooling and common components.
  • Implementing software and tooling to improve resilience and automate operations.

In some cases, SREs perform manual operations work (toil), but this kind of work is a tax on SREs that detracts from their core mission; it is not the reason we have SREs. Necessary toil should be shared by an entire team rather than handed off to an SRE, and it should be a trigger for the team to automate that work.

Types of work, with examples and notes:
Learn and enhance New Relic operational and reliability best practices (e.g., HA, capacity planning, SLOs, incident response) and work with teams to adopt those practices.
  Examples:
  • Work with teams to update their risk matrices; audit for missing or outdated runbooks; influence teams to prioritize the most important reliability work.
  • Work with teams to hold “game days” to test the resilience of their systems against injected fault conditions.
  Notes:
  • We expect this to be a particular focus of new SREs at New Relic and of SREs working with new teams.
  • We expect all SREs to stay current on platform tooling and SRE community best practices.
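To make the “game day” idea a little more concrete, here is a minimal Python sketch of injecting faults into a dependency call and checking whether the calling code survives. It is an illustration only, not New Relic tooling: `FaultInjector`, `call_with_retries`, `fetch_account`, and `run_game_day` are all hypothetical names, and a real game day would inject faults into live or staging systems rather than an in-process function.

```python
class InjectedFault(Exception):
    """Raised by the injector in place of a real dependency failure."""


class FaultInjector:
    """Wraps a callable and fails a fixed number of calls before passing through."""

    def __init__(self, func, failures_to_inject):
        self.func = func
        self.remaining_failures = failures_to_inject

    def __call__(self, *args, **kwargs):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise InjectedFault("injected fault for game day")
        return self.func(*args, **kwargs)


def call_with_retries(func, max_attempts):
    """Hypothetical resilience logic under test: a simple bounded retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error


def fetch_account():
    # Stand-in for a real dependency call (assumption for this sketch).
    return {"account_id": 42, "status": "active"}


def run_game_day(failures_to_inject, max_attempts):
    """Return True if the client survived the injected faults, else False."""
    flaky_dependency = FaultInjector(fetch_account, failures_to_inject)
    try:
        call_with_retries(flaky_dependency, max_attempts)
        return True
    except InjectedFault:
        return False
```

In this sketch, `run_game_day(2, 3)` reports that the client survives two consecutive failures with three attempts, while `run_game_day(3, 3)` reports that it does not. That kind of finding is what a team would feed back into its risk matrix.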
Stay current with the overall New Relic platform architecture and with the current state of, and top risks in, their teams’ “neighborhood” in production.
  Examples:
  • Meet with architects and SREs on other teams to discuss concerns and changes.
  • Use state-of-production knowledge to guide team risk matrices, operational processes, and priorities.
  Notes:
  • We expect all SREs to be familiar with the dependencies and underlying infrastructure of the systems they work with.
Build, or help teams adopt, core shared internal platform components.
  Examples:
  • Work with teams to migrate systems into a new version of our shared deployment pipeline.
  • Contribute code or tools to our container runtime platform.
  • Limit technical sprawl by guiding teams to select appropriate existing tools rather than building new ones.
  Notes:
  • We expect SREs to lean heavily toward using existing tools rather than introducing new tools or systems.
  • We want our SREs to “look left and look right” at what others are doing as a starting point.
Improve the monitoring and observability of the New Relic platform.
  Examples:
  • Work with teams to clean up noisy, unused alerts and ensure that important problems are alerted on.
  • Build a New Relic Infrastructure or a New Relic Insights integration to create new visibility into our platform.
  Notes:
  • We encourage SREs to actively use and extend existing New Relic products whenever it’s possible and effective to do so, and to influence Product Management to implement necessary features when it’s not.
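As one way to approach the alert cleanup mentioned above, the following Python sketch flags alerts that fire often but are rarely acted on. Everything here is an assumption for illustration: the `(alert_name, was_acted_on)` event format, the `find_noisy_alerts` function, and the threshold defaults are hypothetical and do not reflect any New Relic API or policy.

```python
from collections import defaultdict


def find_noisy_alerts(events, min_firings=10, max_action_rate=0.1):
    """Return sorted names of alerts that fire often but are rarely acted on.

    events: iterable of (alert_name, was_acted_on) tuples, one per firing.
    min_firings and max_action_rate are illustrative thresholds, not standards.
    """
    firings = defaultdict(int)
    actions = defaultdict(int)
    for alert_name, was_acted_on in events:
        firings[alert_name] += 1
        if was_acted_on:
            actions[alert_name] += 1

    noisy = []
    for alert_name, count in firings.items():
        action_rate = actions[alert_name] / count
        if count >= min_firings and action_rate <= max_action_rate:
            noisy.append(alert_name)
    return sorted(noisy)


# Hypothetical firing history for three alerts.
example_events = (
    [("disk_80_percent", False)] * 20   # fires constantly, nobody responds
    + [("api_error_rate", True)] * 12   # fires often, always acted on
    + [("cpu_spike", False)] * 3        # ignored, but too rare to call noisy
)
noisy_alerts = find_noisy_alerts(example_events)
```

A list like `noisy_alerts` is only a starting point for a conversation with the owning team; an alert that is rarely acted on may still guard against an important but infrequent problem.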
Work with teams to design and implement automation, tooling, and application code to improve reliability and reduce toil.
  Examples:
  • Identify a commonly used manual runbook and automate it with software.
  • Identify a common failure pattern for new deployments and implement a system to automatically detect and roll back that type of failed deploy.
  • Work with teams on the design of new services to ensure those services will be scalable and robust, and will integrate well with the rest of the platform. New services should leverage our best practices and share common components.
  • Update an application’s DB connection pool to use a more reliable library.
  Notes:
  • We expect SREs to actively participate in the design phase of new systems and features to help them be born reliable and operationally sane.
  • We expect SREs to drive systems toward requiring increasingly less human intervention: manual operations should become automated operations, which would then become automatic operations, requiring no human intervention.
  • We recognize that in some cases, there’s no distinction between SRE work and other application software engineering, apart from area of focus.
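The “detect and roll back a failed deploy” example above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a description of New Relic’s deployment system: `DeployHealth`, `should_roll_back`, `evaluate_deploy`, and the threshold defaults are hypothetical, and the failure pattern chosen here (an error-rate spike relative to the previous version) is just one of many a team might target.

```python
from dataclasses import dataclass


@dataclass
class DeployHealth:
    """Request and error counts observed for one version over a time window."""
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(baseline, candidate, max_ratio=2.0, min_requests=100, floor=0.01):
    """Decide whether the new deploy looks like a failed one.

    Rolls back when the candidate's error rate exceeds both an absolute floor
    and max_ratio times the baseline rate. All defaults are illustrative.
    """
    if candidate.requests < min_requests:
        return False  # Not enough traffic yet to judge the new version.
    threshold = max(baseline.error_rate * max_ratio, floor)
    return candidate.error_rate > threshold


def evaluate_deploy(baseline, candidate, roll_back):
    """Call roll_back() automatically when the failure pattern is detected."""
    if should_roll_back(baseline, candidate):
        roll_back()
        return "rolled_back"
    return "kept"
```

Passing the rollback action in as a callable keeps the detection logic independent of any particular deployment tool, which also makes it easy to exercise in tests before trusting it with production.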
Mentor less senior SREs and grow the SRE community and practice at New Relic.
  Examples:
  • Have a meeting, or lunch, once a week with a less senior SRE to discuss work challenges and solutions.
  • Pair with other SREs experiencing problems you’ve previously encountered or solved.
  • Document and share novel solutions and other effective strategies.
  Notes:
  • We believe that all SREs should have an SRE mentor or mentee. Mentor/mentee relationships are not team dependent.
  • We also encourage SREs to have non-SRE mentors.
Perform task-based operational work (toil) required to unblock teams with operational needs where automated or self-service solutions do not yet exist for those teams.
  Examples:
  • Track down hardware defects on servers.
  • Provision new network endpoints.
  • Run Ansible playbooks.
  Notes:
  • We believe that this is the lowest-value type of work, and SREs should not spend more than 40% of their time on this category of work (30% for senior level or higher SREs). Ongoing issues in such areas should be escalated through the appropriate management channels.
  • We believe that strong SREs should proactively look to reduce toil through automation whenever possible.
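The 40% and 30% toil limits above are simple to check if time is tracked by category. Here is a small Python sketch of that arithmetic; the category names, the `"standard"` and `"senior"` level labels, and the function names are assumptions made for illustration, and only the two percentages come from the role description.

```python
# Maximum share of time on toil, from the role description above.
# The level labels are hypothetical names for this sketch.
TOIL_LIMITS = {"standard": 0.40, "senior": 0.30}


def toil_fraction(hours_by_category):
    """Return the share of tracked hours spent in the 'toil' category."""
    total_hours = sum(hours_by_category.values())
    if total_hours == 0:
        return 0.0
    return hours_by_category.get("toil", 0) / total_hours


def exceeds_toil_limit(hours_by_category, level="standard"):
    """Return True if toil exceeds the limit and should be escalated."""
    return toil_fraction(hours_by_category) > TOIL_LIMITS[level]


# Hypothetical week: 14 of 40 tracked hours (35%) went to toil.
example_week = {"toil": 14, "automation": 16, "design": 10}
```

For `example_week`, 35% is within the 40% limit but over the 30% limit for senior SREs, so the same week would be acceptable at one level and a reason to escalate at the other.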

Set your SREs up for success

Although this SRE role description works well for us at New Relic, it may not be right for other engineering organizations. Regardless, we hope it provides a useful example and helps clarify the tremendous value a great SRE practice can bring to your organization. More importantly, by developing guidelines like these, companies can set their SREs up for success and advance the collective understanding of the key role the SRE practice will play as it matures to support the ever-increasing complexity of our computing platforms.

Matthew Flaming started doing DevOps before it had a name, writing distributed Java systems and racking the servers that hosted them at startups in the late ’90s. He’s been involved with architecting and implementing SaaS software ranging from IoT cloud runtimes to massively scaled data platforms. Currently he’s VP of Site Reliability at New Relic, focusing on the SRE practice and the technical, operational, and cultural aspects of scaling and reliability.
