In a recent post, we examined the rise of the Site Reliability Engineer in modern software organizations. But it’s one thing just to be called an SRE; we also wanted to know what it takes to become a great one.
So we decided to look at some of the characteristics and habits common to highly successful SREs. As in most development and operations roles, first-class technical chops are obviously critical. For SREs, those specific skills might depend on how a particular organization defines or approaches the role: the Google approach to Site Reliability Engineering might require more software engineering and coding experience, whereas another organization might place a higher value on ops or QA skills. But as we found when we looked at what makes dev and ops practitioners successful, what sets the “great” apart from the “good enough” is often a combination of habits and traits that complement technical expertise.
The seven habits outlined below were derived from extensive interviews with New Relic Software Engineer Beth Long and Site Reliability Engineer Jason Qualman. Let’s dive in:
Habit 1: You analyze every change in the context of the (much) bigger picture
Successful software developers understand how their code helps drive the overall business. SREs have their own version of this trait. “You’re looking for someone who is really thinking about the bigger picture outside of the day-to-day,” Jason says. “A successful SRE is someone who can understand and interpret things at a higher level than that.” At New Relic, we describe it internally as “someone who is constantly analyzing every change for its risk and what its impact could be down the road, not just today. And what does that mean for the larger infrastructure?”
Habit 2: You’re pragmatic and forward-thinking about that analysis
The best SREs take a pragmatic approach and consider how their work is going to affect the rest of a particular system or team. There’s little upside in a siloed approach that throws a change over the wall with no concern for how it might affect the person sitting on the other side.
“We are making decisions very low in the stack,” Jason says of the SRE’s role. “Sometimes that can affect people all the way up. You need someone who can understand how their solution to a particular issue is going to affect someone else way down the road.”
Habit 3: You are willing to move on when something isn’t actually helping
For an SRE, part of being pragmatic means being willing to dump processes and procedures that may be well intentioned but don’t turn out to actually be productive. Beth recalls an internal example of this when New Relic was evolving its reliability practices.
“A few years ago we were going through a phase of rapid growth, and to deal with any associated instability we implemented a ‘Change Acceptance Board’ (CAB) process. It was intended to help us evaluate releases before they went into production in order to prevent breaking changes from causing further incidents. The irony was that by slowing down our release cycle, we began to accumulate bigger and bigger changes, which had the exact opposite of the intended effect. These larger changes actually increased the risk associated with each release.”
Eventually, the CAB process was scrapped in favor of more frequent but smaller releases, which yielded far better results.
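To see why batching changes can raise the risk of each release, here is a minimal sketch in Python. It rests on a simplifying assumption that is ours, not New Relic’s: every change has the same small, independent chance of breaking something. The function name and the 2% figure are hypothetical and chosen only for illustration.

```python
def release_failure_probability(changes_per_release: int, p_change_breaks: float) -> float:
    """Chance that a release breaks something, assuming each of its
    changes independently breaks with probability p_change_breaks."""
    return 1.0 - (1.0 - p_change_breaks) ** changes_per_release


# Hypothetical per-change risk, used only for illustration.
P_BREAK = 0.02

for batch_size in (1, 5, 20, 50):
    risk = release_failure_probability(batch_size, P_BREAK)
    print(f"{batch_size:>2} changes per release: {risk:.1%} chance something breaks")
```

Under these assumptions, a release carrying 20 changes breaks something roughly one time in three, while a single-change release does so about one time in fifty. The same total number of changes ships either way, but when a small release does fail, there are far fewer suspects to investigate.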
Habit 4: You embrace every opportunity to automate
Top-notch SREs successfully cope with a key challenge: how to increase the reliability of everything they touch without slowing the company’s ability to ship software quickly. The solution is almost always automation. Great SREs proactively look for painful manual tasks, recurring bugs, and similar friction, and then find ways to automate the process or the fix.
“A lot of this role is thinking about inefficient and time-consuming things people are doing and putting a stop to them as soon as possible,” Jason explains. “Instead of kicking a can down the road on manual work, you’re saying, ‘I’m going to take the time to automate this right now and stop anyone else from having to do this painful thing.’”
This obsessive focus on automation isn’t unique to New Relic. The DevOps Handbook has a chapter that discusses the counterintuitive effects of manual acceptance processes, for example. And “automation” and its variants seem to appear more often than any other word in SRE job descriptions. A recent opening in Los Angeles at Procore Technologies, which makes construction-management software, lists this as the second bullet point in its SRE job description: “Automate, automate, automate and then … automate!”
Habit 5: You can persuade organizations to do what needs to be done
The confidence to advocate for a particular automation task or SRE initiative is another attribute that sets apart A-team SREs. You need to be willing to go to bat for why it’s critical to automate a particular process or other piece of work. And that can be problematic, because it can appear to clash with the culture and pace of many traditional software organizations.
Great SREs live their own engineering-centric version of the self-help classic How to Win Friends and Influence People. Part of the job, simply put, involves convincing other people to do things they initially might not want to do; for example, working with a software engineer more focused on product features than, say, problems that might occur as the product scales over the next several years.
Great SREs have to be effective salespeople, able to sell their colleagues on the long-term benefits of automating a particular process or project, even if that might appear to involve some near-term pain. Bottom line? “You need to be able to dig in and say ‘stop’ and ‘no, we really need to do this thing now,’ which can be difficult to do in some engineering organizations,” Beth explains.
Habit 6: You expand your existing skill set to include new tools and approaches
Since the SRE concept is still new-ish, many SREs have worked in other jobs prior to assuming the role. Some SREs might have a developer background, while others may come from a traditional operations background. Jason and Beth note that, in general, hiring managers are best served by not pigeonholing the SRE role to one particular background. A traditional QA engineer might have a good makeup for the SRE position, for example.
No matter your background, there’s a decent chance the SRE role will challenge you to move out of your comfort zone and develop new skills. An ops practitioner might benefit from learning a programming language or three, for instance; someone with a dev background will need to be willing and able to think much more deeply about operational processes and challenges than they probably did in the past. The best SREs embrace that kind of learning and skill development.
Habit 7: You trust the process
If there’s a guiding philosophy for the highly successful SRE, it might be expressed this way: you’re not actually chasing a holy grail of preventing anything from ever breaking. That seldom works. Instead, you work tirelessly to see the big picture, incorporate automation, encourage healthy patterns, learn new skills and tools, and improve reliability in everything that you do. Perfection can never be attained, but constantly striving to do things better is the way to get as close as possible.