Here at New Relic, we’re just beginning to formally define our Operations team, or as we like to call it, Site Engineering. But that’s not to say we haven’t spent considerable time and effort establishing the type of team we want it to be. While collecting 51 billion metrics per day has given us some interesting challenges along the way, we haven’t been sidetracked into building a complex infrastructure to support such a massive undertaking. In fact, it’s our dedication to simplicity in operations that has given us the opportunity to focus on what really matters: our users.
Below is a summary of our story so far and the lessons we’ve learned. You can also view the accompanying presentation.
“Dedication to Efficiency, Getting Along on Little Power”
Operations, and the technology choices we make, shouldn’t be a burden or a distraction from our primary focus: delivering happiness to our users. Using a well-established and stable technology stack, we’re able to maintain our infrastructure with little effort and few resources. By making technology choices that are efficient and uneventful, we can focus on what really matters. This helps us avoid spending time fighting fires or establishing a complex system to prevent them.
Coupling Our Business to Technology Choices
We want to get our application into our users’ hands as fast as possible and get their feedback just as quickly. This means we can’t have our business choices predicated on our technology choices. We need to be able to shift as we learn what works for our users and what doesn’t. The same loose-coupling principles we use in software development also apply to our operations and technology stack. If you’ve implemented a particular technology (programming language, datastore, etc.) that requires significant reworking as you make the inevitable shifts, then you’ve coupled too tightly.
Engineer, Don’t Administer
We want to solve our infrastructure problems by engineering solutions, not by administering them. When hiring, New Relic looks for individuals who know how to build tools that remove pain points rather than individuals who are experts in a particular piece of technology. Companies and products shift over time. We want people who can ease the friction from those shifts. That means we need generalists who take the best from disparate areas. As we continue down the path to DevOps, we need our operations staff to think like software developers because, over time, our infrastructure will begin to look more and more like software.
Operations Should Be Interesting, Not Exciting
Firefighting may look exciting in the movies. But as an operations team, that’s not what we want to be known for. The operations world is filled with brilliant and passionate people who are stuck fighting fires at 3 am, instead of spending their days working on hard and interesting problems. We want to fix that and get to a point where a failure at 3 am doesn’t wake anyone up. If we’re spending time fighting fires, we’re not spending our time creating a great product. An operations team can be building self-healing and self-organizing infrastructures. Or it can be resolving database replication problems in the middle of the night. Which would you rather be working on?
We are deliberate when making choices about our operations processes and technologies. In the same way that ancient tools found in a buried city tell us something about that culture, our tool choices tell us something about ours. We look for tools that are mature and have a healthy ecosystem around them. We want to understand our tools intimately. We’re going to push them to their limits, and we want to know how to improve them as they start to show their stress points. We want tools that we can easily integrate into our world and swap out when needed. We care a great deal about our culture and choose our technology with the same level of care. What do your tools and processes say about you?
Optimize for Discovery
Our technology and process choices have aligned us to be optimized for discovery. We want to get our products out to our users quickly, so the barriers standing in the way of that are the first ones we break down. Implementing a continuous delivery pipeline wasn’t simple or easy, but now our users see improvements and features rapidly. Feature flags allow us to roll out features incrementally, A/B test them, and quickly iterate on ideas. As we work on complex systems, we know something will inevitably break. We want a resilient infrastructure where our mean time to recovery (MTTR) is small and mean time between failures (MTBF) isn’t even a consideration.
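To make the idea of an incremental rollout concrete, here is a minimal sketch of a percentage-based feature flag in Python. It is an illustration only, not a description of how New Relic implements its flags: the function names `bucket` and `is_enabled`, the flag name "new-dashboard", and the hashing scheme are all assumptions made for the example.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99.

    Hypothetical helper: hashing the flag name together with the user id
    keeps each user's bucket stable for one flag, while different flags
    get independent buckets.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if the flag is on for this user at the given rollout level.

    A user whose bucket is below the rollout percentage sees the feature,
    so raising the percentage only ever adds users; nobody loses a feature
    they already had.
    """
    return bucket(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    # Assumed flag name and user ids, purely for illustration.
    users = [f"user-{i}" for i in range(1000)]
    for percent in (0, 10, 50, 100):
        enabled = sum(is_enabled("new-dashboard", u, percent) for u in users)
        print(f"{percent:3d}% rollout -> {enabled} of {len(users)} users")
```

Because the bucket is derived from a hash rather than stored, the same user gets the same answer on every request, which is what makes it practical to compare the two groups in an A/B test.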
Every employee at New Relic is behind our build, measure, iterate cycle, and operations is at the center of this process. Without careful cultivation of our DevOps culture, we wouldn’t be optimized for discovery.