In a conversation at FutureStack: New York 2017, Google Site Reliability Engineer Liz Fong-Jones noted the crucial importance of standardization—on both processes and tools—for enabling relatively small site reliability engineering (SRE) teams to support much larger organizations.
As Liz told Matthew Flaming, New Relic vice president of software engineering, “One SRE team is going to have a really difficult time supporting 50 different software engineering teams if they’re each doing their own separate thing, and they’re each using separate tooling.”
Standardization of SRE tools
It makes absolute sense—standardization is key to a successful SRE practice and for proper implementation of DevOps principles. But what tools should SREs standardize on? Each team needs to decide what’s best for them. The good news: they definitely have choices.
Just as there’s not a universal job description for SREs in every organization, there’s not a standard toolset for the SRE role either. Jason Qualman, site reliability engineer at New Relic, says it may be more helpful to think in terms of architecture style rather than tooling.
Containers and microservices play a significant role at New Relic, for instance, so Docker and container orchestration are integral parts of our SRE toolset. “I think the biggest tool that SREs are using today is an orchestrator like Kubernetes or Mesosphere, where you basically have this huge machine you can just throw boxes at, and it decides where to put them—and if they go away, it puts them back,” Jason says. “It’s a giant system that’s making sure your service is there at all times.”
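The behavior Jason describes (hand the system a workload, let it choose a placement, and have it restore anything that disappears) can be illustrated with a toy reconciliation loop. This is a minimal sketch, not how Kubernetes or any real orchestrator is implemented. The `Node` class, the `reconcile` function, and the "most free capacity" placement policy are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical machine that can hold a fixed number of containers."""
    name: str
    capacity: int
    containers: list = field(default_factory=list)

    def free_slots(self) -> int:
        return self.capacity - len(self.containers)


def reconcile(nodes, service, desired_replicas):
    """Start replicas of `service` until the desired count is running.

    Returns the names of the nodes that received a new replica.
    """
    running = sum(node.containers.count(service) for node in nodes)
    placed = []
    for _ in range(desired_replicas - running):
        # Naive placement policy (an assumption): pick the emptiest node.
        target = max(nodes, key=Node.free_slots)
        if target.free_slots() <= 0:
            break  # no room left; a real scheduler would surface this
        target.containers.append(service)
        placed.append(target.name)
    return placed


nodes = [Node("node-a", capacity=2), Node("node-b", capacity=2)]
first = reconcile(nodes, "web", desired_replicas=3)

# Simulate one node losing its containers, then reconcile again.
nodes[0].containers.clear()
second = reconcile(nodes, "web", desired_replicas=3)
```

The second call places only the replicas that went missing, which is the "if they go away, it puts them back" idea in miniature.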
The SRE toolchain at New Relic also includes both external and homegrown SRE tools. “We built a lot of our own infrastructure for managing the building and packaging as well as the deployment of applications,” says Henry Shapiro, New Relic vice president and general manager of New Relic Infrastructure. For example, New Relic SREs and other team members rely on an internal system called Grand Central for releasing and servicing the lifecycle of their applications. Another tool, called GateKeeper, functions as sort of a “pre-flight checker for deployments.”
Best SRE tools for each stage of DevOps
It should come as no surprise that the SRE toolchain looks a lot like various iterations of the DevOps toolchain, especially if you see the role of SRE as being, as Matthew puts it, “maybe the purest distillation of DevOps principles into a single role.”
Henry Shapiro notes that the DevOps toolchain can help teams choose the tools they need to Plan, Create, Verify, Package, Release, Configure, and Monitor the software they build.
At each stage of the loop, there are tools a DevOps team uses to do their jobs, and an SRE toolset could look very much the same, depending on how the role is defined in a particular organization. For example, at New Relic SREs play an increasingly important role that combines responsibilities once siloed in traditional dev and ops teams. As a result, the difference between a “DevOps toolchain” and an “SRE toolchain” becomes fuzzy in our organization.
SRE tools for each stage include:
Create. Integrated development environments (IDEs), text editors, and shared libraries and components—“the building blocks that you use to actually build applications,” as Henry says. Even here, SREs have a role to play, such as encouraging development teams to avoid building everything from scratch in favor of reusing reliable code or third-party libraries.
Source-control tools like GitHub and Subversion erase boundaries between dev and ops roles, and enjoy significant popularity among SREs tasked with managing deployment environments and processes.
Package. Tools to manage the packaging, release staging, and approval process, such as JFrog.
Release. Tools to manage releases and the lifecycle of an application, like New Relic’s homegrown Grand Central.
Configure. Tools like Terraform and Ansible fit the “automate, automate, automate” SRE philosophy, and enable teams to automate and manage configurations across infrastructure and applications. SREs are playing an increasing role in determining what those configurations should look like from a health and reliability perspective, as well as automating away much of the manual work formerly needed to implement those rules and processes.
Both Henry and Jason note that the increasing use of containers may ultimately reduce the need for these tools in many organizations. Because containerized applications bundle all of their dependencies and configurations into immutable images, container platforms like Docker and orchestration tools like Kubernetes are becoming indispensable to SREs.
Monitor. Monitoring can mean a lot of things to a lot of people, but Henry notes that this stage includes tools like New Relic that collect metrics from applications and infrastructure along with some form of log or analytic data, display that data on dashboards, and alert on it.
Health Map and New Relic Insights: New Relic tools for SREs
Henry sees two particular New Relic tools as especially strong fits for the SRE toolchain, primarily in the monitoring space but also intersecting with verification.
DevOps, containers, and cloud platforms blur the lines between applications and infrastructure. Containers, in particular, package an application and all of its dependencies in an abstracted layer that requires a combined view of infrastructure and applications. “Worlds are colliding in terms of application monitoring and infrastructure monitoring,” Henry says—creating a new area where SREs toil and need tools.
That “collision” was the genesis of New Relic’s Health Map feature, Henry says. Released earlier this year, Health Map is “a high-density view of all of the instances that are running for a given application,” Henry explains. “It gives the status of all of the instances, as well as the containers running inside them, and the status of the application as it relates to that infrastructure.”
SREs need to understand how to provision and manage infrastructure to support the applications they work with, Henry says. “Health Maps is a great way for them to get that insight. It’s about having this combined view of application health and infrastructure health.”
New Relic Insights, meanwhile, is becoming a go-to analytics tool for SREs, Henry says. SREs help build reliability into development practices, but putting out fires is also part of the job. Nothing is failsafe, but having real-time analytics data can help solve issues quickly and minimize their impacts.
Insights can be particularly useful to SREs in two ways. First, New Relic Query Language (NRQL) enables New Relic customers to create ad-hoc queries to home in on specific aspects of a problem. Aggregate or higher-level metrics might tell you what the problem is, but not always why it exists. NRQL enables SREs and other operations pros to triangulate the “why” of a particular issue with raw event data.
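The difference between an aggregate metric and raw event data can be shown with a small example. This sketch uses plain Python rather than NRQL, and the event fields (`host`, `error`) and sample values are invented for illustration. It shows only the general idea: an overall rate says that something is wrong, while grouping the raw events by an attribute points to where.

```python
from collections import defaultdict

# Hypothetical raw events, one per request.
events = [
    {"host": "host-1", "error": False},
    {"host": "host-1", "error": False},
    {"host": "host-2", "error": True},
    {"host": "host-2", "error": True},
    {"host": "host-3", "error": False},
    {"host": "host-3", "error": False},
]


def error_rate(rows):
    """Fraction of rows flagged as errors."""
    return sum(1 for row in rows if row["error"]) / len(rows)


def error_rate_by(rows, attribute):
    """Error rate computed separately for each value of `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return {key: error_rate(group) for key, group in groups.items()}


overall = error_rate(events)             # the "what": errors are elevated
by_host = error_rate_by(events, "host")  # the "why": one host accounts for them
```

In this invented data set, the overall rate shows that a third of requests fail, while the per-host breakdown shows that every failure comes from a single host.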
New Relic Insights also helps SREs create and monitor custom data sets. For example, if an SRE is unable to put a New Relic agent on a particular type of host, they can still ingest logs from that host into Insights and build alerts off that data for ad-hoc analysis or other uses.
For the SRE, nothing is written in stone
Many companies are working to define their expectations for the SRE role, and the SRE toolchain, like the role itself, continues to evolve. The tools SREs use at any given time will depend on where an organization is in its SRE journey. Less mature organizations will tend to use more specialized operations tools while more mature organizations will see more convergence between SRE and software engineering toolchains. So while it’s certain that there’s no “one-size-fits-all” set of tools, SREs will experiment with and adapt the right tools as they seek new, efficient ways to bring greater reliability to everything they do.