In this episode of the New Relic Modern Software Podcast, we talk with Rob Gindes, manager of site reliability engineering at media giant Gannett, publisher of USA Today and many other popular media brands.
My co-host, New Relic Developer Evangelist Tori Wieldt, and I ask Rob about many of the key concerns facing modern cloud-based IT organizations, from cloud engineering to DevOps culture to containerization, and what it’s like rolling out Kubernetes on election night—one of the biggest days of the year for news organizations. He even shares hard-won takeaways from Gannett’s cloud migration and his reliance on New Relic to help the process run smoothly.
You can listen to the episode below, subscribe to the New Relic Modern Software Podcast on iTunes, or read on for a transcript of our conversation, edited for clarity:
New Relic was the host of the attached forum presented in the embedded podcast. However, the content and views expressed are those of the participants and do not necessarily reflect the views of New Relic. By hosting the podcast, New Relic does not necessarily adopt, guarantee, approve or endorse the information, views or products referenced therein.
Fredric Paul: Today we’re going to talk about Gannett’s SRE and DevOps efforts, and your migration to the cloud. We’re especially interested in the cultural and organizational issues involved in those efforts. But for listeners who may not be totally familiar with Gannett, can you give us a quick description of the company? I think folks will recognize many of your brands.
Rob Gindes: Beyond USA Today, we manage a whole bunch of other digital properties: over 100 news sites, including some of the bigger ones: Cincinnati Enquirer, Indianapolis Star, Detroit Free Press, Arizona Republic, AZCentral.com.
Fredric: Your title is Manager of Site Reliability Engineering. Can you describe your role exactly, and what the SRE function means at Gannett? It means a lot of different things in different places.
Rob: This is what I always refer to as buzzword bingo. I manage SRE for a PaaS, right? Literally you name the hot buzzword in technology right now and I touch it. So it’s cloud engineering, DevOps culture, SRE, Platform-as-a-Service. We talk about Kubernetes, containerization, Docker. We’re looking at managed Kubernetes services out of Amazon and Google as well.
It’s funny because it’s the same annoying buzzword-laden jargon that I secretly hate, but it’s also my actual job, so it’s hard.
I think everybody who uses SRE means something a little different by it. For us, we are really sitting at the intersection of a lot of different things. It’s not just between the team, the larger Platform-as-a-Service team.
We find ourselves in this interesting position where you’ll hear me use the word “alignment” a lot. We have a lot of different groups, from all the infrastructure operational teams to the development teams, the full stack engineering teams all the way up to ad revenue, ops, and line of business … editorial even. What we realize is, the more aligned all those groups are with one another, the more successful we really are.
Our team has this goal to try to align the vision as much as possible. We do that in a lot of different ways, and New Relic actually comes into play a lot when we talk about that. Our whole vision with New Relic is being able to build visualizations within it that everybody in our company can use. It might mean something different to different people, but we’re all working off of the same playbook, this thing that tells us how good what we’re delivering is. A big goal for us is building that kind of stuff and making sure everybody is on the same page.
Fredric: So the SRE team is responsible for establishing that common truth and spreading it throughout the company?
Rob: We’re in the very early stages of it. So it’s more of a responsibility we’ve taken on ourselves. A lot of it is just uncovering around the company what people’s different goals are, and then being able to understand that through the lens of the teams we support, and then finally through our own lens. So there’s a lot of different stuff that comes with that.
To give you an example, our team brought us this big managed Kubernetes cluster. And that Kubernetes cluster runs one application: the thing that runs behind our web tier for everything we have that’s not USA Today. Beyond USA Today we’ve got all these different properties; by my count it’s 131, but anybody you talk to at Gannett will give you a different number. I don’t know why.
Tori Wieldt: Reality.
Rob: I’m not sure why we don’t have that number pinned down, but to the best of my understanding it’s 131 news properties besides USA Today that all run on the same application framework. And we run all of that out of one giant Kubernetes cluster.
There are a lot of interesting challenges that came with putting that into Kubernetes, making sure that it runs and performs the way that the people who consume it expect it to perform. Then a lot of it turns into a cost optimization game now that we have this thing running. We understand that it works and that when people go to IndyStar.com, they get something back.
Then we ask, Can we start taking servers away? Can we start running lighter and start running cheaper and faster and continuously improve from all different angles?
That’s the goal we have. The knowledge that we’re gaining, we try to turn around and give to the different teams that manage their own stuff. That cluster is really the only thing we directly manage; everything else is more of a consultation type of role, all the way through all the different sorts of web applications. For us it’s a big support role.
I’d hesitate to say we exert any authority by being able to tell anybody what to do; it’s just a big fact-finding mission around all these different things that we manage. We’re a big microservice type of ideology—“What’s the expectation out of this thing?”—and just helping people get to the point where they’re comfortable saying, “Is my service performing or not performing? If it is performing, can I run it cheaper, lighter? If it’s not performing, what do I need to do to get it to that objective?”
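Editor's note: The questions Rob lists here amount to a small decision procedure, which can be sketched roughly as below. This is only an illustration, not Gannett's tooling. The names `ServiceStats` and `next_step`, the latency-based objective, and the 50% CPU headroom threshold are all hypothetical, chosen to make the three outcomes concrete.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    """Hypothetical snapshot of one service; every field is an assumption."""
    p95_latency_ms: float   # observed 95th-percentile response time
    objective_ms: float     # the latency the owning team has agreed to deliver
    cpu_utilization: float  # average fraction of provisioned CPU in use (0.0 to 1.0)


def next_step(stats: ServiceStats, headroom_threshold: float = 0.5) -> str:
    """Answer: is the service performing, and if so, can it run lighter?"""
    if stats.p95_latency_ms > stats.objective_ms:
        # Missing the objective: the priority is getting back to it.
        return "not performing: find what it takes to reach the objective"
    if stats.cpu_utilization < headroom_threshold:
        # Meeting the objective with spare capacity: a candidate for cost cuts.
        return "performing with headroom: try running cheaper and lighter"
    return "performing: leave as is"


if __name__ == "__main__":
    print(next_step(ServiceStats(p95_latency_ms=320, objective_ms=500, cpu_utilization=0.2)))
```

In practice the inputs would come from whatever monitoring the team already trusts; the point of the sketch is only that "performing" is defined against an explicit expectation before anyone asks about cost.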
Fredric: It’s my understanding that all of this came to a head in a recent epic cloud migration that you guys performed.
Tori: I heard a rumor that you rolled out Kubernetes to USA Today on election night. Is that true?
Rob: It was the first time that we had rolled out Kubernetes in production. We were really kind of challenging ourselves.
We’d been looking into Docker for a while; we had done proofs of concept around a lot of different schedulers. I remember sitting in a meeting going, “We can’t leave this meeting and not know what we’re going with, or we’re not going to hit the elections, we’re not going to hit our deadlines that we want to hit.” We all were sitting there going, “We have to walk out of this 100% confident that we know what we’re going with.”
So Kubernetes became the decision. We spent a few months supporting it. We have a web development team, a lot of really talented web developers who we’ve also charged with learning DevOps, and they’ve done a great job getting up to speed on that. We worked closely with them leading up to the elections, and then for election night we said, “Let’s kick it in and let’s go.” Obviously, we had a bunch of servers running the old way as a fail-safe. We’re not that crazy! But we were like, “Let’s run Kubernetes in production. Let’s see how it goes.”
I think they ran 300 deployments … 1,000 deployments. We had a lot of success with that and we really showed the power of what containerization gives you, which is the deployment speed. We were using Amazon servers that we’d spin up and configure. And that would take 10, 15 minutes every time we needed to do a deployment. Now we’re doing those deployments in two minutes, one minute.
That completely changed the paradigm of what we could do. We wouldn’t have been able to do deployments that fast if we weren’t in Docker. And then we talk about the resource capability, the fact that you’re using resources more efficiently and cheaply. That’s the thing that our bosses really appreciate, when we start to reduce that cost.
Fredric: You were doing this on election night. Were you tracking the results of this with New Relic? And were there ever any moments during the evening where you started to wonder how things were going to work?
Rob: We were confident enough … this is where I really start to shill for New Relic because this is a real thing. This sounds like a customer testimonial, but this is just the reality. We had gone basically all in with New Relic before that. We had all the monitoring set up, we had done our load testing and our performance testing, and we had really put this Dockerized version of the application through its paces. New Relic was the thing that we used to look at it and say thumbs up or thumbs down.
Time and time again when we have to make these decisions, we go to New Relic. That’s our all-encompassing tool from the application performance layer, up to the browser interaction layer, down to the infrastructure that’s running it, and all the way down to the Linux system that’s running underneath a Kubernetes cluster setup.
It’s vital to us to understand that we’re performing okay at all those different levels and be able to create a visualization so lots of different people can look at the same visualization, go to the same dashboard. That’s a really important thing to us and it wouldn’t necessarily work without that.
Fredric: Can we talk about your overall goals for the migration and moving to containers and Kubernetes? Did you meet those expectations?
Rob: We have an optimization team that does great work, and I want to make sure I’m giving them their due props because the stuff that team does is incredible.
That said, we as a platform team have a charge that we should always be worrying about helping teams improve performance, reduce cost, and also be able to do things efficiently and have a good quality of life when they’re at work. I think those are the basic pillars of what our platform is all about.
I don’t know that we’ll ever necessarily feel like we’ve met those goals. I don’t know that we’ll ever be able to say that every system we touch has been made 100% efficient, everything is running as cheaply as it possibly can, and, on top of that, we’re deploying as fast as possible and the resource usage is exactly where it needs to be. There’s always going to be more to do.
On election night I think that we made a lot of great strides forward. Before that, to run everything that’s not USA Today, we had something like 3,500 servers. It was a huge cost every day.
The process to deploy new versions of that app was super streamlined and some really smart people had worked really hard to get that down to two hours. When we moved that to Kubernetes, we chopped that number of servers down. It’s half the cost every day, which is a huge number for us. The deployment time went down to 25 minutes, and we think we can get it down even further. We’re able to fail back and forth between regions and we’re able to scale. We’re able to scale up, scale down, and use resources more properly. Those types of huge strides we’ve made with containers are what we’re all about.
It goes back to New Relic, which gave us the confidence to ask, “Can we take all these servers running this app and just move them to Kubernetes? Yes? No? I don’t know?” We looked at New Relic and we ran load tests, we did the math, and we understood what the end user performance looked like and what the experience was like.
That gave us the confidence to give it a thumbs up.
We want to make strides that hit one of three things: “Is it cheaper? Is it faster? Does it use resources better?” I said three things, and now I’m listing a million things. “Does it move us forward in some way?” Either by reducing the cost financially or in the SRE term called “toil”—the repetitive, mundane tasks that people have to do to keep systems going. They don’t require a lot of brainpower but require a lot of people to kill it with humans.
Fredric: Excuse me, did you just say, “Kill it with humans”? I love that phrase; I haven’t heard that before.
Tori: Also known as automate all the things.
Fredric: Based on all of this, are there any takeaways that you would share for other companies that are moving toward cloud services, toward containers, toward DevOps and SRE functions?
Rob: Everything has one major common thread, which is cultural alignment. You can’t start down the road of DevOps, self-service, full-stack engineering, Platform-as-a-Service, Infrastructure-as-a-Service without cultural buy-in. You need people in place who say, “I understand this vision, I agree with this vision, I get the value, I get that it’s a long-term thing, I get that it’s an investment.” All those things, because it’s a thing you can’t really do halfway.
We started down this road when we started playing around with cloud tools more than three years ago. We were using cool cloud tools and spinning people’s servers up in Amazon.
People who managed applications used to have to go to somebody and ask for a server, and then they’d have to go through meetings and tons of red tape, and a couple weeks later maybe you’d have a server or maybe you wouldn’t. It was a long, drawn-out process, very opaque. Pretty quickly we were able to make that process more transparent, we were able to make it a lot faster, we were able to start going down the road of self-service and all this awesome stuff. But we weren’t really able to make strides until we had people who stood up and said, “I understand the value of this.”
Put yourself in the shoes of the developer. If I’m a developer at Gannett in the year 2013 and this PaaS team comes to me (at the time we were called the DevOps team) … If I’m a developer, it’s like, “Okay, I develop in Node.js” or “I develop in Python.” That’s my thing, I’m good at it. I build this app and I give it to somebody else, and then something happens and I go home.
And then our team came in and said, you know, we can make this process a lot better, but we’re going to put some more of the burden on you as the developer. Now you don’t just hand that off, now you have to get more into the nitty-gritty, you have to learn the operational side, you have to learn these DevOps concepts, you have to learn cloud engineering a little bit. And it’s the same now with Docker: you have to learn how to write a Dockerfile.
We had to be sensitive to the fact that we were asking more work of people, and that we had to go to teams and demonstrate the value to make it clear that it’s not just that we’re shoveling more work on you, it’s that this thing will help everybody move forward and achieve this vision, which is everything running cheap and everything running disposable and everything being modular and easy and fast.
We specifically wanted to be the ones that managed this giant Kubernetes cluster because we hadn’t done anything at that scale before. And it’s broken a lot of times. Our deadline was around September 1, and we got it up and running a week early.
That was my first big project as a manager. I said, “Yes! This is awesome.” I sent this email to our CTO: “Hey, I just want to let you know, look at this great stuff that we’re doing. Look how awesome my team is. Big pat on the back for me, I’m awesome, wonderful.” I went to sleep, woke up the next morning … [to learn we’d had a] hard down for four hours.
Those pains happen, they’re very real. Everybody in technology likes to think that we’re doing stuff that’s safe and we’re smart and we built a resilient system. But sometimes stuff falls over, right? The great thing about Gannett is, I was embarrassed by that but I wasn’t worried that I was going to lose my job.
Tori: Blameless postmortem, right? I just wanted to ask you, are you using New Relic Infrastructure? And how did that help with your cloud migration?
Rob: New Relic Infrastructure is the tool we use for the operational side of things. For a long time we have used all the New Relic application-side stuff. (I always lump together New Relic Synthetics, New Relic Browser, and New Relic APM as application stuff.) Infrastructure gave us the opportunity to match that on the other side of things.
If you look at that Insights dashboard, it covers every level. There’s a little bit on the CDN level that gets down to Browser and Synthetics, down to APM, that goes right to Infrastructure. That shows you what’s going on with this application from a resource usage perspective. It’s such an important thing to us and honestly it’s one of the biggest things we’ve turned back to New Relic on—the ability to build even stronger and more intricate connections between the different products within New Relic.
That’s the really powerful thing for us now. We want to be able to say something like, “Here’s a browser interaction. I want to drill all the way down through that.” That’s the really cool thing and that’s the power of New Relic: that’s all there. Our challenge now to New Relic is to say, “Let me make more of those logical connections because they’re so powerful.” When somebody goes to IndyStar.com and they have a bad experience, I can go to New Relic and eventually I can figure out why.
I think that that is so cool because you were never able to do that. Rewind 10 or even 5 or 3 years and I don’t know that people could really say something as simple as, “Oh, this was loading slowly. Is it an app issue or is it an infrastructure issue?” Now the same team is trying to answer the question, “Is it an upstream dependency? Is it a downstream dependency?” And then if you can figure out, “Oh, you know what, actually it is, it’s an infrastructure problem. Okay. Well, is it Kubernetes or is it the system running underneath Kubernetes? Or is it the system running underneath that?”
We have all these different layers of abstraction, this microservice architecture with all these different APIs and things that are talking to each other. To me this is the core problem of SRE: things are getting more and more complex. Everything we deal with, we ask more and more out of it. And yet we have to somehow turn around and make that easier to consume.
It’s the same thing that we challenge New Relic to do. We want more and more out of New Relic, but then we need to turn around and be able to consume that really easily and understand problems really quickly. Same thing with us: we push more and more on teams, and there are more and more parts of the stack that teams have to touch and understand, more and more of the environment. They don’t just get to work in a little monolith; they have to understand so much more than they’ve ever had to understand before. We have to make that feasible, and New Relic was the tool we used to do that.
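Editor's note: The drill-down Rob describes, from a slow page through the application and on down to Kubernetes and the host beneath it, can be pictured as a walk over an ordered list of layers. The sketch below is a simplified illustration under stated assumptions: the layer names in `LAYERS`, the function `deepest_unhealthy_layer`, and the hand-supplied health flags are all hypothetical, and nothing here calls a New Relic API. It also assumes the lowest unhealthy layer is the best place to start looking, which will not always hold.

```python
from typing import Dict, Optional

# Hypothetical stack ordering, from what the reader sees down to the machine.
LAYERS = ["cdn", "browser", "application", "kubernetes", "host"]


def deepest_unhealthy_layer(health: Dict[str, bool]) -> Optional[str]:
    """Walk the stack top-down and return the lowest layer reporting a problem.

    `health` maps a layer name to True (healthy) or False (unhealthy).
    Layers missing from the mapping are treated as healthy.
    """
    culprit = None
    for layer in LAYERS:
        if not health.get(layer, True):
            culprit = layer  # keep descending; a lower layer may also be failing
    return culprit


if __name__ == "__main__":
    # A slow page where the browser, the app, and the cluster all look bad:
    observed = {"browser": False, "application": False, "kubernetes": False}
    print(deepest_unhealthy_layer(observed))
```

With the example flags, the walk ends at the Kubernetes layer, matching the "is it Kubernetes or the system underneath it?" question in the conversation above.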
Note: The intro music for the Modern Software Podcast is courtesy of Audionautix.