We hear a lot about how DevOps is transforming modern software development, but it can be a double-edged sword. When done well, DevOps enables software teams to confidently ship software faster than ever before; when done poorly, teams see it as just another process failure. Managers of software teams work diligently to prevent the latter.
Rather than managing a team focused on one area (development) that struggles to maintain a relationship with a team in another area (operations), managers in DevOps environments provide guidance to a single team. They work to ensure that these traditionally siloed roles collaborate in a low-stress yet high-production capacity with customer satisfaction as the driving factor. The idea is for the team to be as frictionless as possible.
To find out more about how this works in the real world, we asked Jason Poole, a senior software engineering manager at New Relic, about how he gives the New Relic Mobile monitoring team the guidance and support they need to successfully continue their DevOps journey.
On team composition, work, and communication
New Relic: What does your team build and what is the composition of your team?
Jason Poole: My team builds New Relic’s Mobile APM product. We provide the tools and UIs for mobile application developers to monitor their Android and iOS apps.
We have seven people on our team, all of whom are experts in their particular areas. We have site reliability engineers (SREs), specialists in the Android and iOS agents we support, and UI developers. That said, our goal is to be as “T-shaped” as possible and have everyone know how to work in every part of our stack. We don’t want the team to stress, or work to grind to a halt, if someone takes a vacation or some sick time.
New Relic: How does your team balance feature, reliability, and customer-focused work?
Jason: We work with our product manager constantly to decide the most important thing to work on. We work in weekly sprints and keep our MMFs (minimum marketable features) small, so if the most important thing to work on changes, we can shift quickly to meet that new need. As the engineering manager, I work with the product manager to balance reliability alongside feature work, on a weekly basis, to make sure both sides are accounted for.
New Relic: What kind of reliability work does your team do? How does regular reliability work enhance your DevOps culture?
Jason: Our reliability work is based on the idea that we don’t want to have incidents. What can we do to proactively avoid incidents? We think a lot about capacity planning—how do we scale our hosts or databases? We compare where we are now to where we want to be in the future. We also do a lot of work to keep our dependencies and libraries up to date. Same thing for security; we make sure we keep our hosts and services up to date with all the latest patches.
Our goal is to stay on top of things so we don’t get bitten later. And this enhances our DevOps culture because we’re all responsible for the reliability work we do to keep our part of the product healthy.
New Relic: How does communication work within the team? How does the team communicate with other teams?
Jason: Communication within our team, I’d say, is excellent. Because we’re able to focus on one MMF at a time, there’s minimal context switching and most team members are in the same mind-set at the same time.
In terms of interacting with other teams, having T-shaped engineers means anyone from the team is generally able to talk about our stack with any other team. In the rare case they can’t answer a question about our stack, they at least know who the expert on the team is.
New Relic: Does your team ever have turf wars?
Jason: [Laughs] Luckily, no. We’re pretty much all on the same page. The team has been extremely fluid working in this DevOps model.
New Relic: Did you have to help and encourage them to adopt or try new processes or ways to work?
Jason: When we first got started as a team after Project Upscale, where we let our engineers self-select onto the teams they wanted to join, it took a while to sell the team on swarming, which is all seven of us working on the same thing instead of seven different things. Most of the discomfort they expressed was about having to work with tools or a part of the stack they were unfamiliar with. This was essentially the start of their DevOps journey, and once I helped them iterate and improve on the process, they were off to the races.
New Relic: How do you manage incident postmortems? Do you have a blameless culture?
Jason: We have a unique approach to our postmortems. We don’t have a certain threshold, so much as a “gut feel” for when we need to hold one. It could be because an incident was highly severe, or it could be that we just need to come together to talk things through.
We do have a “blameless” culture, but don’t necessarily label it as such; instead, we try to focus on team trust. We want to make sure we hold each other accountable for the work we’re doing.
New Relic: What are the team’s rules of engagement?
Jason: In truth, we try to have as little process as possible, but we do hold retros every other week where we can decide to change things if needed. We dedicate one person per sprint to triage incoming requests from support, and they also flag trouble tickets that my team will review in their daily stand-ups. Once the issue is triaged, I work with our PM to decide how it gets prioritized. I advocate for reliability work, but we have to set priorities for what the team spends time on, whether it’s ops related or not. When requests come in from other engineering teams, I work with our PM to triage those requests. This makes it easier for the team to work without interruptions.
As an engineering manager, I can see this is a “DevOps way” to work, but it’s also about simply being a high-functioning team.
On DevOps principles in practice
New Relic: What do you do to foster close collaboration between dev and ops team members?
Jason: We try to never have anyone working on their own at any time. Our dev or ops experts always have someone working with them; you always have at least one partner. And everyone works to share info, so we never have to worry about having a single point of failure on the team. You know, we all want to eliminate the bus factor. [Laughs] If someone gets hit by a bus, we can carry on without them.
New Relic: What are the stages in your development cycle? How long is a cycle?
Jason: As I said, we keep our process light. We have a planning stage with the product manager and the designer dedicated to our team. Then we have an MMF kickoff to break down and estimate the work and that leads straight into development. We keep our MMF dev cycles two to four weeks long, shorter being better for agility, of course.
Our industry changes quickly and we need to be able to react just as fast. If the technologies our customers are using change, we need to change, too. If a critical feature pops up, we want to be able to work on it as soon as possible. Short cycles mean we can change and pivot quickly. Since our sprints last only a week, the longest we’d ever have to wait to pivot to some other work is seven days. This applies to our MMFs as well: we keep them small so we can complete them before we’re asked to change course.
New Relic: How often do you commit code/deploy to production? To dev or staging?
Jason: We deploy to production as needed. That could be zero times on one day and five times on the next. It really depends on what we’re working on. Our application is made up of several services, so we deploy canaries of those services to production more frequently than that, maybe three times as often. And we deploy to staging even more often than we deploy canaries.
We deploy our agents far less often, however. These agents go inside customer apps and must be very stable. We probably release them to production once a month.
New Relic: How long does a typical deploy take? And do you ever have to roll them back?
Jason: It varies a bit with each service. From the time we decide to deploy to the time it’s complete is usually less than 15 minutes. For some tricky services, it might take an hour for it to roll out fully and finish rebalancing the load. We’re updating our services, though, so these times are always improving.
Thanks to our canary deploys, we don’t have to roll back deploys too often. Most of the time we keep rolling forward, and if there’s something that needs to change, we can change it and deploy a new version. Always rolling forward encourages us to make incremental changes as opposed to big ones. This is how we’re able to keep trying new things—if we hit a problem, it’s easy to roll back.
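The interview does not describe how the team decides whether a canary is healthy, so the following is only a minimal sketch of the general idea, written in Python. The `Metrics` type, the `canary_decision` function, and the thresholds (`max_ratio`, `min_requests`, and the 0.1% error-rate floor) are all hypothetical and are not part of New Relic's tooling. It compares the canary's error rate with the current production baseline and either waits for more traffic, promotes the canary, or flags it for a fix and a fresh deploy.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when no traffic has arrived yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Metrics, canary: Metrics,
                    max_ratio: float = 1.5, min_requests: int = 500) -> str:
    """Return 'wait', 'promote', or 'fix-forward' for a canary.

    All thresholds are illustrative assumptions, not values from the interview.
    """
    if canary.requests < min_requests:
        return "wait"  # too little traffic to judge the canary
    # Allow the canary some headroom over the baseline, with a small floor
    # so a baseline with zero errors does not make every error fatal.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return "promote" if canary.error_rate <= allowed else "fix-forward"


if __name__ == "__main__":
    production = Metrics(requests=10_000, errors=10)
    print(canary_decision(production, Metrics(requests=1_000, errors=1)))
    print(canary_decision(production, Metrics(requests=1_000, errors=20)))
```

Because only a small slice of traffic reaches the canary, a "fix-forward" result affects few users, which is what makes small incremental changes cheap to try.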
New Relic: What automation tools does the team use?
Jason: Our build and deploy pipeline is all managed by internal tooling called Grand Central, which ties into our internal container orchestration platform called Container Fabric, which hosts a large portion of the New Relic ecosystem. We do run Jenkins to build our agents. We run automated tests against our UI with Sauce Labs, and we use New Relic Synthetics to constantly verify our endpoints.
New Relic: Do you have service-level objectives, agreements, and indicators in place?
Jason: We do have an SLA. It’s been defined for us by the larger product organization—four nines. 99.99% uptime.
Of course, this also depends on other teams and parts of the New Relic ecosystem. But we do use New Relic Alerts proactively to look for trends that could lead to failures before they become actual incidents. Waiting to catch a failure as it’s happening means you’re already too late.
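To put the four-nines figure in perspective, here is a short Python sketch that converts an availability target into a downtime budget. The function name `downtime_budget_minutes` and the choice of measurement windows are illustrative assumptions; the interview does not say over what period the SLA is measured.

```python
def downtime_budget_minutes(availability_pct: float, window_days: float) -> float:
    """Return the minutes of downtime an availability target permits in a window."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # The windows below are assumptions chosen for illustration.
    for label, days in (("7-day week", 7), ("30-day month", 30), ("365-day year", 365)):
        minutes = downtime_budget_minutes(99.99, days)
        print(f"{label}: {minutes:.2f} minutes of allowed downtime")
```

At 99.99%, the budget works out to roughly 4.3 minutes in a 30-day month and about 52.6 minutes in a 365-day year, which is why catching trends before they become incidents matters so much.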
New Relic: How do you monitor the performance of your applications?
Jason: That’s a huge question!
We measure the performance of our UI components with New Relic Browser, with a particular focus on how long it takes the entire page to load. For the services that make up our application, we measure their performance with New Relic APM, keeping a careful eye on how long services take to respond to requests as well as watching for anything that could be slowing those requests. We also check response times of our API endpoints with Synthetics. Our services are constantly evolving as we work to break them into more discrete units. And because of this we continue to change our monitoring strategies.
Finally, for our agents, we measure CPU and memory usage to ensure we have minimal impact on our customers’ applications.
New Relic: What other monitoring do you use to help you meet your DevOps objectives?
Jason: As I said, we do use proactive alerting, but we also have several New Relic Insights dashboards that keep us aware of what’s happening in our systems. Primarily, we focus on monitoring CPU usage, server health, and the data flows in our Kafka queues. We also keep track of user accounts to see when and how our customers are interacting with our services.
New Relic: Any other thoughts about DevOps team health from your perspective as an engineering manager?
Jason: This entire paradigm shift to DevOps has been fundamental to our success at New Relic and has changed the way we think about creating software. Our teams are healthier because they work toward the same goals together, solve trickier problems together, and celebrate wins together.
Rather than managing siloed teams of dev and ops engineers, who only think about each other when they want something or when something has gone wrong, I’m managing one team that is focused directly on delivering customer value. Managing a team like this is a joy—we all share the same vision and are all empowered to make decisions about not only our team, but the products we love to build.
Ready to measure your DevOps journey?
Using New Relic and our Guide to Measuring DevOps Success to prepare, activate, and optimize your DevOps journey will help you track every step, be deliberate about every decision, and increase the odds of your success.