In this episode, Matt Stratton, Transformation Specialist at Red Hat’s NAPS Transformation Office, talks about his career in DevOps, developer advocacy and relations, why monitoring matters, discoverability, resilience engineering, and systems thinking.
Should you find a burning need to share your thoughts or rants about the show please spray them at email@example.com. While you’re going to all the trouble of shipping us some bytes, please consider taking a moment to let us know what you’d like to hear on the show in the future. Despite the all-caps flaming you will receive in response, please know that we are sincerely interested in your feedback; we aim to appease. Follow us on the Twitters: @ObservyMcObserv.
Jonan Scheffler: Hello and welcome back to Observy McObservface, the observability podcast we let the internet name and got exactly what we deserve. My name is Jonan. I’m on the developer relations team here at New Relic, and I will be back every week with a new guest and the latest in observability news and trends. If you have an idea for a topic you’d like to hear me cover on this show, or perhaps a guest you would like to hear from, maybe you would like to appear as a guest yourself, please reach out. My email address is firstname.lastname@example.org. You can also find me on Twitter @thejonanshow. We are here to give the people what they want. This is the people’s observability podcast. Thank you so much for joining us. Enjoy the show.
Jonan: I am joined today by my friend, Matt Stratton. How are you?
Matt Stratton: Good. It’s a Friday, I can’t complain.
Jonan: I need to stop asking that question. I know better. I will say this for both of us. It’s 2020, and we are still alive today. So we’ve got that going for us.
Matt: Exactly, you showed up.
Jonan: Yeah. Right. I’m alive. I am on Earth. That’s how I’m doing. Matt, you have been in this DevOps space for a long time, and I’m so excited to have you on the show today, because I want to hear all of your thoughts on observability. I understand developers and DevOps people alike, although I kind of actually looped them both into the same category. I want to talk to you about that too, but I understand you might have some opinions about things like observability and we’re here to hear them.
Matt: I’ve been known to have opinions.
Jonan: Would you catch us up on what you’ve been up to during your career and recently?
Matt: Yeah, absolutely. I’ve been involved in the DevOps community for a long time. It was kind of funny. There’s been that meme going around, the “How it started; How it’s going” one. I found one of my first tweets ever about DevOps from 2012 or whatever, that was me basically being like, “Well, who patches the servers in DevOps?” And now I’m the global chair of DevOps Days. So things have changed. Maybe I’ve come around on it. I spent most of my career—a couple of decades—working in Ops and was always very interested in a lot of the release engineering stuff.
And you know what I’ve been doing over the last, however long, years I’ve been part of this. So I’ve been very involved in this podcast called Arrested DevOps, so I’m very familiar with podcasting. Like Bryan Berry once said, “The dirty little secret about tech podcasting is it gets you the ability to talk to somebody for like an hour that you would normally never get.” And so just by doing the show, I got really dialed into a lot of things people were doing. I spent a bunch of time working at Chef, which we won’t talk about right now, but what’s going on with Chef. But that’s really where I got a chance to work with a lot of really large enterprises who are going through DevOps transformations.
In the last few years, I was at PagerDuty doing DevOps advocacy and really focusing on thinking about how we are learning from incidents, how our incident response is connected to this DevOps thing. And since March, I’ve actually been at Red Hat and I’m a Transformation Specialist in our, we call it the NAPS Transformation Office, which I love because it makes it sound like my job is helping people go to sleep better.
Jonan: I like that better.
Matt: It is, it is. I transform NAPS, but it’s the North American Public Sector. So I’ve been focusing a lot on government and public agencies here in the U.S. who are trying to do this kind of thing. They’re trying to get better about how they deliver and operate software for their constituents. And the fun thing is, it’s not all that different than the time I spent in the commercial space. Everybody has the same problems and the same challenges. It’s just that some people are using them differently, I suppose is the way to say that.
I will say it’s really kind of funny that we talked about being at PagerDuty and coming into the public sector—nobody in the public sector has ever heard of PagerDuty, because PagerDuty just doesn’t operate there. Well, they’re a SaaS and they’re not, like, in GovCloud and stuff. So government agencies can’t really use a lot of these SaaS tools.
Matt: So people that have lived in the public sector, they just don’t know. They’re just not aware of it. On the commercial side you say, “I was at PagerDuty,” and people know PagerDuty. And in the public sector, they’re like, “Wait, what is that?”
Jonan: To someone who doesn’t know of PagerDuty, it’s hilarious to even say, “Yeah, I got paged.” You sound like you’re carrying a physical pager.
Matt: Well, and the funny thing about that is the term paging predates beepers. So it’s actually about—if you think about it—announcing things. So you’d be paging people through those kinds of tubes—like in a store, where there would be the little announcer tube or whatever that would then come through in different parts of the—that’s where paging comes from.
Jonan: Right. We used to say, “Paging Jonan to the thing.”
Matt: Yeah, exactly. So it is kind of funny where it’s like, “Oh, it’s about that outdated technology.” And it’s like, the technology—pagers—that is true. Pagers are old.
Jonan: Pagers are way old. Did you actually own a pager?
Matt: Oh yeah. I had lots of pagers. I actually had one when I was in college, and nobody ever paged me on it—but I had it. Being on-call through the ’90s and the early ’00s, it was those two-way pagers or the text pagers, and that’s how you get paged that this particular server’s down or whatever.
Jonan: And people were used to sending each other coded numbers—“I’ll send you a page with this thing.” And that means, “Call me as soon as you can.”
Matt: Exactly. So text speak is pager speak.
Jonan: Awesome. So I want to talk about that question. It occurred to me early on here. I’ve had a lot of conversations lately. We’re both involved in advocacy and dev relations, the kind of work where people try to convince others that DevOps is not distinct from developers. When I talk about developers, from my perspective—again, I have nowhere near the experience on the SRE side of the house that you have. But from my perspective, DevOps is about that specific thing. Hey, look, we are all developers. We all own this process. Let’s all work together. And we’re getting rid of the world where there is an application developer who builds a thing as fast as they can and throws it over a fence to QA. And then it’s not their job to care about it continuing to work. And then QA says, “Yeah, this is good enough.” And then it’s Matt’s job to keep that thing online in production. That’s a ridiculous world. And all the way along I have been saying we’re all developers. I asked the internet in a Twitter poll once, and 82% of people agreed with me, but I’m very suspicious about the remaining 18% who feel like DevOps is not the same as being a developer. What are your thoughts?
Matt: I mean, it really comes down to—and I’ve said before—that I think DevOps is unfortunately named. If we start talking about DevSecOps, it’s been DevSecOps the whole time. Right? Every time we shoehorn another syllable into the portmanteau, I’m like, “No, it’s always been that.” And the story is it’s called DevOps because agile system administration was too long of a name for a conference. When Patrick Debois and Andrew Clay Shafer wanted to create DevOps Days, the first conversation was about agile system administration—but that’s too long: agile system administration days. It doesn’t roll off the tongue. So I get it. I’ve seen a lot of different definitions of DevOps and have a lot of opinions about them. But one I love is really kind of the oldest.
So John Lusis once said, “DevOps means never saying, ‘That’s not my job.’” And it doesn’t mean everybody can do everything, but it’s really just about being cross-functional. You still have people that have expertise and skill. And again, if we want to kind of simplify it to development and software engineers and operations folks, and extend this to all the different pillars if you want, it’s: where can we learn from those other practices? And that’s one of the things I think that’s really great when we think about DevOps: software engineers have spent a long time getting really good at doing a bunch of stuff.
And a lot of that is stuff that those of us from the Ops side of the house can learn a lot from. A lot of really great practices.
And then likewise, there’s a lot of stuff that in Ops we care about that traditionally, a software engineer wouldn’t. And so it’s really about collaboration. And the reason we always stress empathy is because when you have those kinds of siloed teams, you just don’t know what all those other things are, and it makes it hard to do your job in the context of the rest of the needs.
So in DevOps, someone who’s writing code, who’s shipping features and doing that stuff, needs to have an understanding of what it means to operate this stuff. It doesn’t mean they have to do it—NoOps is not a thing.
We’re not saying that DevOps means everybody has root, and developers do everything and everything like that.
Jonan: I want root though. It’s like my favorite user.
Matt: Yeah. I’ve been in some conversations where Ops would get very protective of that. They’re like, “Well, we can’t give root to developers, because they’ll screw it up.” And I’m like, “But every single Ops person that tells you that, you get them over a couple of pints and they’ll tell you about every time they completely jacked production.” Because they made a mistake too.
Matt: So there’s nothing mystical about having an Ops hat that means you don’t, like, fat-finger commands. And I believe it’s really about working together, and we don’t have to get into the whole “Is it a title?” because I’ve come around on that. Here’s my thing: First of all, if your title goes from Sysadmin to DevOps Engineer, you’re looking at about a 30 to 40% pay bump. So I’m like, “Yeah, go get paid, man. That’s cool.”
The other thing is that it’s been sort of brought up—Ian Coldwater has said this a lot, and I agree with them—that it gets a little gatekeep-y, because a lot of times the people that have the title DevOps engineer didn’t give it to themselves. So when we’re out there crowing on Twitter, “Oh, I hate people being called DevOps engineer,” if you’re a DevOps engineer, that makes you feel like you’re less than, and it wasn’t even up to you.
So the way I always look at it, if your organization has a DevOps team, I treat it like a code smell. It doesn’t mean it’s wrong, but it makes my ears perk up and I start asking more questions.
Jonan: And that’s what we should do. Dig into it a little bit and understand why they ended up with that title and whether or not they’re being served by it. I mean, they’re definitely being served in the broader community. We have a similar argument about developer evangelism and developer advocacy and where all of these pockets fit. I feel like they are all very distinct roles. I have an idea in my head of how that is, but the industry as a whole, we have not all agreed on that. So to get out there and say, “Developer advocacy is what I do. I don’t evangelize to people. I’m an advocate for them,” is really crappy. I mean, they didn’t choose that job title. They’re doing similar work a lot of the time, but I agree with you a hundred percent.
So the thing that we care about as developers, my background is mostly in application development. But you had mentioned that occasionally there are things that we care about differently. I feel like a lot of the technologies that are evolving now around DevOps these days are things like Kubernetes. That forces me when I deploy my application to care about the resources that I need. I have to actually think as the developer of the application, putting together my Dockerfile or whatever other assorted YAML manifests I am creating.
These are the resources I need. This application requires this much memory to run at scale in a single node.
So what are some examples? I’m trying to imagine the things that you would care about in today’s software environment, on the Ops side of the house that maybe hadn’t occurred to me to worry about. I mean, besides resource management, there are a lot of things that if we’re sitting down to have this meeting and I’m going to ship my app, you’re going to think of that I’m not going to think of. I just kind of want to highlight it for application developers who might be listening.
Matt: So there are a couple of places that come to mind. And I know we’re kind of talking observability on this show, so I’m going to say the “M” word and I’m going to say monitoring—but monitoring still matters. And it’s about understanding. And even if we take it to another level and be a little more modern in our talking, thinking about things like service level objectives. Like, how do we understand? Because that’s the thing: If I’m operating the software, I want to understand what it means to be good. How do I know things are going right? And thinking about now, I’ll use some more dated examples, because hopefully people are better about this, but honestly, maybe not so much as even thinking about things. And again, oversimplifying to things like logging.
So I remember—this was a good 10, 15 years ago—we had a web service in an e-commerce company that I ran TechOps for, and it was throwing 14,000 error messages a second and they were false positives. It was just a buggy thing.
And the problem was, it was actually really challenging to get that fixed, because from the developer side, you’re just like, “OK, well, that sucks, but people are still able to do what they need to do.” But on the Ops side, you’re kind of like, “It’s a huge amount of noise.” It’s a normalization of deviance that comes up. It’s that same problem you get when your tests are red and you’re like, “Oh yeah, that test always fails.”
Well, what happens when it really fails? You ignore it. So thinking about that matters, and that also comes into that understanding. Because once we were able to show the feature team what having all that stuff meant to us in Ops, they went, “Oh, wow. That sucks.”
So thinking about, “How do you know what good is, and also your traceability, your understanding of what are the things that we can do?” And this goes into observability, but that’s sort of a next step of, “OK, we know there’s something to look at.” I know it’s not always just about when you have a problem, but that’s great that I can have observability to help me answer questions I didn’t know to ask, but at some point somebody had to have gotten paged. Right?
Matt: You can’t set an SLO at that level to know how do we get in front of this before our users become unhappy, so thinking about that kind of thing.
And then also just knowing what it means, again, like you sort of talked about like, what does it mean to deploy? What does it mean and how does this interact with other components of our infrastructure? What does it mean to actually spin up these instances?
Because sometimes, if you’re not exposed to any type of capacity or things like that, everything just seems like it’s free. Funny story: I remember when I was at Apartments.com, there was a group that would go out and take pictures of the apartments, and all this stuff. And so these were very large image files that we would store. And we went to sort of the head of the photography department. I was like, “Y’all are using a ton of space on the SAN and I need you to get rid of stuff that you’re not using.” And the manager of that group said, “Oh, I thought if it was on the SAN, it was free.” And I was like, “Well, number one, it is not free. Number two, that is actually the most expensive place you can put your stuff.”
And the cloud has exacerbated this because it’s that elastic, on-demand perception of always available and always expandable.
You know, we aren’t necessarily thinking about the kind of stuff that the Duckbill Group is going to help you figure out your billing with, and Ops seems to know that. We’re a little more connected to it, whether it’s an actual dollar cost or even just a resource cost.
Jonan: Yeah. That makes a lot of sense to me. Cause I’m entirely disconnected from a lot of that billing in the work that I’m doing. And I think actually our industry is almost designed to create those situations for people. That’s what SaaS is.
You put your credit card in here and then kind of forget about us till you get a $14,000 bill for one month of whatever it is—that’s across the whole industry. Everyone knows that’s happened to them or their friend on AWS.
But that’s what we do, this usage-based billing. And it comes very easy for us on either side or as developers generally to just say, “I need to think through each of the steps that goes into planning to put this out into production,” it becomes harder to look at it from that perspective when we’re able to easily deploy applications and not really be concerned about their memory constraints anymore.
By any standard 20 years ago, Rails was heavyweight. And today resources are much cheaper, and I don’t care as much on that level about running things in production. I mean, yeah, that would be awesome if I wrote something very performant in C, I built my whole web application from scratch. It could be all the frame, but it doesn’t really matter as much anymore. What matters is shipping, because—let’s be honest—five years from now, the odds are pretty good that your company today is not going to be a company anymore, unless you can get over those hurdles and get out there in production.
So you said a couple of things there that I just want to clear up for our listeners who are maybe less familiar. What is O11Y?
Matt: Oh, so that’s me shortening observability. It’s not an initialism. I can’t remember the term, but it’s the same way we call Kubernetes “K8s,” or accessibility is “a11y.” Observability is “o11y.” So it’s like you’re taking those middle letters and counting them. So it’s the letters between O and Y in observability, and there’s 11 of them.
Jonan: The first one I ever heard of was i18n, internationalization shortened to that.
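Editor's aside: the abbreviation style being discussed (first letter, count of the letters in between, last letter) is commonly called a numeronym. A minimal sketch in Python, where the function name `numeronym` is made up for illustration and not taken from the conversation:

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + number of middle letters + last letter."""
    # Words of three letters or fewer have nothing worth abbreviating.
    if len(word) <= 3:
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


if __name__ == "__main__":
    for word in ("observability", "accessibility", "internationalization", "Kubernetes"):
        print(f"{word} -> {numeronym(word)}")
```

Run against the words mentioned here, this gives `o11y`, `a11y`, `i18n`, and `K8s`.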
Jonan: When I first heard of observability, I had some pretty strong objections. I have been starting to look into this a little bit. It struck me as a marketing buzzword in the beginning. I think to some degree, it still is a little bit of that, which is, I guess, maybe not a great perspective for someone who runs a podcast about observability called ‘Observy McObservface.’ But that’s part of why I named it that: the ethos is actually more involved than just, “Here’s a new thing that we are calling MELT.” You know, like this is all-encompassing. Sure. But it’s more of a philosophy. I wonder if I could get your take on what the difference is?
Matt: Yeah. When you look at what observability can do for you, abstracting away from platforms and tools and implementations, monitoring and things like that are always about answering a question you already knew to ask. Like, this is a thing I know I care about. And I’ll give a historical example. I had a CTO once where this would happen: Something would go squirrely with the application and we’d kind of respond and firefight and do whatever. And she would always come to me and say, “Well, Matt, why aren’t we monitoring for that?” And I said, “Because, Pat, until last night we didn’t even know that could happen.” So observability, it’s not going to be predictive and tell you that something terrible is going to happen that you never thought of.
It’s sort of like logging. Like if it’s just straight-up logging, you’ll sit there. And you’re like, “Wow, I wish I had a time machine, so I could go back six hours when this incident [occurred], and start logging this particular counter so I could see what had happened.” So observability is giving us the ability to ask questions that we didn’t in advance know that we might want to ask.
Matt: So it’s giving us that discoverability to say, “OK, how can we kind of piece this apart?” And especially as our systems are so distributed, now it’s not your LAMP stack where the chain is three things, and it’s easier to trace what all happened. There are all these pieces, and you don’t always know what’s going to happen.
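Editor's aside: one way to picture "asking questions you didn't know to ask" is to record wide, structured events up front and filter them later along whatever dimension turns out to matter. The sketch below is a toy in-memory version under that assumption; the field names, the sample data, and the `record`/`query` helpers are all hypothetical and do not describe any particular observability product.

```python
from typing import Any, Callable

# Toy in-memory event store; a real system would ship these to a backend.
events: list[dict[str, Any]] = []


def record(**fields: Any) -> None:
    """Store one wide event with whatever context is available at the time."""
    events.append(fields)


def query(predicate: Callable[[dict[str, Any]], bool]) -> list[dict[str, Any]]:
    """Return every stored event matching an arbitrary, after-the-fact question."""
    return [event for event in events if predicate(event)]


# Hypothetical request events, captured before anyone knew what would matter.
record(route="/checkout", status=200, duration_ms=120, cart_items=2, region="us-east")
record(route="/checkout", status=500, duration_ms=840, cart_items=42, region="us-east")
record(route="/checkout", status=500, duration_ms=910, cart_items=57, region="eu-west")
record(route="/search", status=200, duration_ms=95, region="eu-west")

# A question nobody planned for: do the failures have anything in common?
failures = query(lambda e: e["status"] >= 500)
failed_cart_sizes = [e.get("cart_items") for e in failures]
print(failed_cart_sizes)
```

Nothing was set up in advance to watch cart size, but because the events carried that field, the question can still be asked after the incident.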
Like I said, you didn’t until last night—you didn’t even know that a user could decide to act this way. And then you want to be able to sort of go back in time and watch them so you can see like, “Oh, well this particular event happened and these were all the things connected to it.” So it really helps satisfy a curiosity, maybe is a way to think about it. There’s a lot of cognitive distortions we do in this industry, especially one around fortune telling—we think if we only had enough information, we could predict the future, and you cannot predict the future.
So we always have to have that ability to say, “OK, unfortunately, we don’t have a time machine. So I can’t go back in time and do this particular thing. And also I can’t predict every single thing that’s going to happen.” That’s analysis paralysis. And you’re going to spend a lot of time trying to predict every failure mode, and you’re going to miss some anyway. So you better make sure you’ve got a way to answer those unknown unknowns. And that’s where resilience comes in. It’s our ability to have an adaptive capacity to something that’s not a well-modeled disturbance.
Jonan: I want to ask you about resilience engineering specifically. But I just want to highlight that, from my perspective, it has long been the case that the most important tension in the software world is between software leaders and business leaders. In the business world, you can absolutely do carefully calculated assessments of potential risks. But on that list of risks, software, and all of the technology, is always at the top.
The business leadership comes to software teams and says, “Well, how long is it going to take? And how much is it going to cost? And how do we know it’s going to continue working?”
“We don’t know. We don’t know. We don’t know.” It’s hard to predict these kinds of things in software because we are trying to predict unpredictable things. When we’re going around and putting out individual fires, then we’re running an unwinnable race, really.
But if we are able to, in a modern observability world, collect all of the data in case we need it and hold it for some period of time. And we’re very, very close to achieving a world where that’s financially feasible to just hoard this data. And you store it carefully and you keep your timelines. But that is certainly an approach that people take. It reminds me very much of your statement about being unable to predict the future and not having a time machine to go back to the past. Reminds me of this idea of a Zen presence—being here in this moment. And that’s what we’re trying to do. We’ve got to be right here right now and do what we can in the moment to solve our problems and prepare ourselves for the future. So tell me about resilience engineering and how that plays into that story.
Matt: Resilience engineering is a relatively old practice. It’s 50, 60 years old. And it’s usually a lot more connected outside of tech, but like everything else in tech, we discovered something and decided we invented it. But it’s a very mature practice. And we usually think about it, a lot of the times when you’re thinking about things like Sidney Dekker and the Safety-II Model and stuff like that, and we’re only now just starting to discover this in tech, where resilience engineering comes in. And the thing about RE and just resilience in general is we also— I’m not always super-pedantic about words, but there’s a couple I do get pedantic about and resilience is one of them. So people will say, “I want to build a resilient infrastructure.” And the answer is you actually can’t do that, because resilience is the ability to flex outside of well-modeled disturbances.
And so resilience only comes from people.
Now you can have something that’s robust, you can have something that’s reliable. When we think about high availability and fail over and stuff, that’s all about being robust. That’s a well-modeled disturbance. And that disturbance is just a thing that failed.
But resilience is how we can flex to then rebound, and a great example we’ve had is COVID, and that’s what I was going to go back to. You said in the business, they’re so good at modeling. It can catch you unawares. So when we think about resilience, it’s about remembering that the systems—we’re talking about our sociotechnical systems. So the systems that we’re managing and using to provide service—a reliable service to our customers and our users—are made up of technology. They’re made up of technical systems, but there’s also the people part of that, the “socio” part of the technical system, and that’s where the resilience gets expressed. And so David Woods says, “Resilience is a verb,” which means it’s a thing that you’re doing. It’s not a thing that you have.
Amusingly, one of my 10-year-old sons hears me give talks a lot. And the other day, he said, “Daddy, I heard you say that resilience is a verb, but I looked it up and it’s a noun.” And then my 10-year-old got treated to a 10-minute lecture about what true resilience really is.
Jonan: I’m sure he was thrilled.
Matt: Yeah. He was kind of interested. So when we’re thinking about that, it’s about how we respond. So a lot of the things around expressing this resilience come into things like incident response. And then also, how are we learning from incidents so that we can then have a graceful extensibility, which is how we take something we learned from something that happened that we didn’t expect and not ask, “How do we prevent it?” I’m very big on not using the word prevent when it comes to incidents. When you have some type of outage or something, the first thing that your leadership says is, “How are you going to make sure this never ever happens again?” And the answer is, “I can’t make sure it will never ever happen again. But what we want to do is we want to mitigate the impact if something similar happens again.”
And that’s all coming from the people. Are we able to get the right people involved as quickly as possible so that we can respond? Are we able to model our service levels well enough so that we start to see the hints that things are going south before users are really upset? So, Dr. Jennifer Petoff—who’s the editor of The SRE Book and a Program Manager for Site Reliability Engineering at Google—and I accidentally came up with a term on a podcast that we called ‘the hadness point.’ And it was a portmanteau, a combination of happiness and sadness, but it’s that exact point where your users are about to go from being happy to being sad. And that’s where you want to set your objectives, because you don’t want to be responding when everybody’s happy all the time. If you set it too tight, you’re going to be waking people up in the middle of the night to respond to something that’s not a problem. But if you set it too loose, you’re going to have a lot of unhappy people. And your monitoring is going to be actually sentiment analysis on Twitter, which is not how you want to monitor your service level.
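Editor's aside: the "too tight versus too loose" trade-off can be made concrete with an availability objective and its error budget. This is a minimal sketch under assumed numbers; the 99.9% target, the 25% paging threshold, the request counts, and the function names are all illustrative and not values from the conversation.

```python
def error_budget_remaining(good: int, total: int, target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 or less is spent."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - target) * total  # failures the objective tolerates
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def should_page(good: int, total: int, target: float, page_below: float = 0.25) -> bool:
    """Page a human only once most of the budget is gone (threshold is an assumption)."""
    return error_budget_remaining(good, total, target) < page_below


if __name__ == "__main__":
    # Hypothetical window: one million requests against a 99.9% objective.
    print(round(error_budget_remaining(999_750, 1_000_000, 0.999), 3))
    print(should_page(999_750, 1_000_000, 0.999))  # plenty of budget left
    print(should_page(999_100, 1_000_000, 0.999))  # budget nearly spent
```

Raising `target` or `page_below` makes the alert tighter (more wake-ups for non-problems); lowering them makes it looser (users notice before the team does).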
Jonan: Yeah. Now we’re into my world. That’s the kind of stuff I have to deal with. I’m taking that, for sure. You’re talking about preventing versus mitigation. And I think that’s really an interesting way to look at it. When I talk about building microservices architectures, and that is a whole other ball of wax, right now especially, you talk about using patterns like the circuit breaker pattern. And that’s what I think about with mitigation—where I have a service that goes down, and it’s part of a network of 40 different things that all need to talk to each other. And when that service goes down, all the other 39 of them start pounding the heck out of it, being like, “Wait, are you there? Are you there? Are you there?” And it can never stand up again.
And so we implement things like the circuit breaker pattern—“Hey, maybe if you don’t hear back from that service for a while, just give it a break, see if it will ever come back online,” because then Ops is over there furiously trying to stand up this thing and it’s just getting, like, DoSed in-house, and you can’t ever get the server back online. We think about those things sometimes from a software perspective. But the part we are trying to prevent, I think, is that business impulse—and you are absolutely right, in that no one is actually able to see the future of business or tech, but the tension that exists there, I find very interesting, in the same way as this socio-technical term that you used. It reminds me that these are systems that are made by people, for people. Hopefully not of people. This is a very important part: they are not made of us. They chew people up as we go. Like it’s easy to build.
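Editor's aside: the circuit breaker pattern described here can be sketched in a few lines. This is a simplified, single-threaded illustration rather than a production implementation; the class name, the default thresholds, and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to leave the service alone
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None     # None means the circuit is closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of pounding the struggling service.
                raise CircuitOpenError("circuit open; skipping call")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Each of the 39 callers would wrap its requests to the failing service in `breaker.call(...)`, so that after a few failures they back off for `reset_timeout` seconds and give the service room to come back.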
Matt: People are a part of them though. Think about the people that are operating this, the people that are responding, the people that are managing—that’s part of the system. So when we think about systems thinking, these are complex systems that are, yes, made up of your Kubernetes clusters and your pods and your containers and your load balancers. But they’re also made up of that SRE, who’s carrying a pager, who’s responding, they’re made up of the developers that are being pulled into an incident. They’re made up of the business people that are making these decisions.
So systems thinking is really hard, because as humans we’re wired to go for a simple answer. This is the problem with the term root cause. It’s that one thing: if we could ask enough “whys” to get down to that one thing, and we pull that lever—that metaphorical lever—then we’ll solve racism or something. You’re like, “No, it doesn’t work that way. It’s a complex system.” And the same thing is true with the systems that we build to provide service—those systems are made up of not just the technical components, but they’re made up of your marketing people, your salespeople, your support folks, and also your users are part of that system, because they’re impacting it. Everybody’s actions are connected into that distributed system.
Jonan: But also mostly made of meat. Yeah. This root cause analysis thing. I know that a lot of these terms and phrases that have been used in traditional systems administration are falling out of favor. I want to know about your thoughts on retrospective analysis. Like, do you think it is useful to have a retro after an incident? Maybe there’s a thing that you all do very well. I think it’s really valuable.
Matt: It’s super helpful. Super important. Because as we think about these complex systems, the idea of learning from incidents is absolutely categorically required in the world we’re talking about—if you want to be able to be resilient, you have to learn from these incidents. And a learning is not just, “We created a bunch of JIRA tickets to do some stuff.” It’s not about action items. So we talk a lot about, like, what do you call them? And a lot of people don’t like you using the term “post-mortem,” for all sorts of reasons. And basically in my time at PagerDuty, the reason we continued to use that is there’s not necessarily a more industry-accepted standard; there’s a bunch of other terms. So when I’m using the term “post-mortem” here, you can call that an “after-action report,” you can call it a “learning report.” You can call it an “incident retro,” a “PIR,” whatever those things are. The only one that I’ll get upset about if people call it that is a “root cause analysis,” because that’s not about learning, that’s about finding something, whether it’s a person or a thing, to blame.
Jonan: And usually a person.
Matt: Yeah, usually a person. And also because our systems are actually built really well. So Dr. Richard Cook has a great paper called “How Complex Systems Fail.” It’ll take you maybe 10 minutes to read.
It’s a short paper, but one of the things he talks about is that we guard our systems against failure so well that a true failure is never going to be one particular thing. It’s a combination of smaller failures together that bring us to catastrophe. So when we think about identifying the root cause, there actually really isn’t one. So we instead think about the contributing factors. What were the contributing factors that led to this kind of unique order of events or combination of smaller failures, all these contributing factors that combined to form something terrible? And that’s sort of the danger with some of these more legacy, if you will, modalities like the five whys. Because the idea behind the five whys is that if I ask why enough times, I’ll get down to a root cause. And people will be like, “We say root causes, and roots doesn’t mean one. They could be the roots of a tree.” And I’m like, “That may be true, but most people hear root cause and they think it’s finding one thing.” So I always try to reframe to the term “contributing factors,” because it’s multiple. There are multiple factors that contributed to it, and the focus has to be on learning. Because a lot of times when you do a post-mortem, what you’re really trying to do there is provide an explanation, probably to management or to your customers. And that explanation has a business value of some kind, but it doesn’t actually help your organization and your team learn to do things differently.
John Allspaw has a great paper called “The Infinite Hows.” And so when you’re thinking about doing retros, the only reason why I kind of sometimes shy from the word “retro” is that it’s overloaded with things we do in development, because it’s so connected to a project, a sprint or something, where we’re looking back. And usually you’re retroing more around your process than around something you learned about your technology or your system.
But call it a retro or whatever you want to call it, it’s fine. When you’re doing that, you don’t want to ask why questions, because why questions tend to lead you towards blame.
Why did this happen? Well, because, you know, Jordan didn’t know what he was doing or whatever. Instead, it’s “how”: How did this happen?
One of the ideas that Sidney Dekker points out in “The Field Guide to Understanding Human Error” and “Drift into Failure: From Hunting Broken Components to Understanding Complex Systems” is this thing called “Safety-II.” So the older thoughts around safety were more focused on bad actors. It was someone doing something malicious or they weren’t qualified—so it’s “find the bad apple, the bad actor,” and in Safety-II, it’s not to say that those things don’t happen, but more often than not, it’s not coming from a place of malice.
And it could be that you don’t know how to do a thing, OK. Well, just pointing that out doesn’t fix the system. So it’s like, “OK, well, the reason this happened was because the software engineer dropped this table in prod.” Well, OK. So I know that, and if I stop there, how does that help? What we want to know is: How were you even able to do that?
Jonan: Exactly. How did you get in there to flush the Redis cache and not know what that command did in the first place? And why were you on the system?
Matt: Yep. And what guardrails can we put in place? Or what checks, what are the things we can do? Because human error is going to happen. And human error is never the root cause. Because getting to that doesn’t get you very far. I like to say, “You can’t fire your way to reliability.”
Jonan: You can do the opposite. Yeah.
Matt: People think you can. That is the problem with Safety-I: the idea is that someone made a mistake. And here’s the trick of that. If people are afraid of being punished for making mistakes, they don’t make fewer mistakes. They become subject-matter experts in hiding their mistakes.
And now you’re really screwed because you don’t know what’s going on. There’s a great documentary called “From The Earth To The Moon.”
It was a Tom Hanks-produced show about the Apollo missions. And in the episode about how they built the lunar lander, there’s a part where one of the engineers basically made a mistake and it puts them back in time by a bunch—it is very, very impactful. And he goes to the boss to fess up to this, 100% expecting to be fired.
And he got a dressing down, of course he did. But they were like, “No, keep going, because you have knowledge.” And that’s the thing. We make this mistake all the time—when somebody is the person who made a mistake or something like that, our first instinct is to remove them as far as possible from this. And the problem is, they’re the ones who understand what went wrong the absolute best.
So do you remember a few years ago, there was an issue where there was a false alert about a missile launch in Hawaii—everybody got alerted on their iPhones—and the whole resilience community was watching very carefully to be like, “How do people react?” And they reacted exactly the way you expect, which was, “Oh, the person who accidentally pushed that button? He’s not part of that group anymore.” And we all went, “Oh my God, that’s the person who could tell you the best about what happened and figure out how we can mitigate this.”
Jonan: Yeah. Hire that person to lead the department now. Otherwise you’re taking away the information that you have, in the same way that you’re describing that loss of psychological safety that comes on a team when people are afraid—you are not allowing people to do their best work. First of all, nothing stifles creative thinking, or self-actualization, better than living in absolute terror. This is not a way to run a company. You do not want fear-based people working in this space. But then you take the person who got in trouble—Jonan—who got in there and dropped the production database because he didn’t know what he was doing and shouldn’t have had production credentials in this case. Now we’ve fired Jonan, but I’m the best person to write the educational materials for that onboarding process and adjust the actual system so that it prevents that from happening at all.
So if we were to just look ahead a little bit, given that you like predicting the future so much, I want to hear what you think the next year brings. If I have you back in a year, what are we talking about? Can you just guess? There are a couple of things that are growing rapidly. I think the CNCF will be bigger and have more logos on its page. That’s a prediction.
Matt: We will have more of a road map. It’s funny. This is a very weird time to ask for any future prediction. I could postulate. I’m like, “Don’t even ask me what’s going to happen in two weeks.” I think we’re going to see more of this idea of moving towards platform and commoditization, because it’s been happening for a while, but I think it’s becoming more and more powerful and more and more necessary.
So think about the mapping stuff that Simon Wardley likes to do: “OK, what are the things that are providing you with your business value? What are the things you should be buying or consuming, because they’re commodity, versus inventing?” And so when you think about platforms and you think about something like OpenShift or whatever, the idea there is to focus on how you’re building and deploying and running your applications.
I tell a lot of my customers—and it’s funny, because now I’m talking in the public sector—“Are you the US Department of Pipelines?” So if you’re not, why are you building a custom pipeline tool? And this doesn’t happen as much in the public sector as it does in private.
I had a lot of customers who’d be like, “Oh, well, we threw out PagerDuty because we’ve decided to build our own.” And I’m like, “You sell shoes. Why are you doing anything that isn’t serving your business?”
And when I give talks, I’ll often say, “Do you know how your company makes money? If you don’t, go find out. I’ll wait.” Because if you don’t understand the business outcomes that are connected to what you do, how do you know that the things you’re doing are serving that? That doesn’t have to be about making money; that’s a simplistic one. But I tell people in the public sector, “What is your agency’s mission?” And the mission is not Kubernetes. Your goal is not to install a whole bunch of Ansible. That might be my goal to get you to do that. But your goal is to provide unemployment benefits to the people in your state.
All of these things are enablers. They’re not the mission. And I think we’re starting to get that. And I think we’re seeing those abstractions towards more platform—we kind of wanted to do that with PaaS, but PaaS is a little too simplistic for the distributed world we live in now. But that’s really the same model. It’s all a Heroku at the end of the day. Like why did we want to do Heroku?
Because we just wanted to ship some stuff, and it’s not that everything’s going to be serverless or whatever, because there’s still value—massive value—in operational knowledge.
Jonan: Of course.
Matt: But it’s not value in writing shell scripts to ship files around. It’s about understanding what it means for this system to have this resilience and this flex. So I think what we’re going to see is more adoption of SRE practices. We’re starting to see that every practice around SRE fundamentally requires having good service level objectives and an understanding of them, and that’s the thing that people tend to not do very well.
And I’ve seen some reports coming from Google and some of the research stuff that they’ve done, where we’re starting to see people being a little more mature with that.
So I think we’re going to see more of an understanding of thinking about things as services—not “as-a-Service,” but as a lowercase “service.” And then that helps us understand where things get moved around into our platform—where is our focus? So I think we’re going to see more of that. We’re going to see a lot more of people wanting to think they can rub some AI on everything.
Jonan: Let’s do a little on top.
Matt: And again it goes back to that cognitive distortion that if I only collect enough data, I can build a self-healing system. But when you look at those weird failure modes in complex systems, they always kind of require human creativity to make an intuitive leap. To sit there and say, “Oh, this and that.” And that’s not AI. That’s not machine learning. Those are all tools that can help us get there faster. But I think we’re going to see more of that. I think we’re going to see people making a lot of mistakes, thinking that they can automate everything, that they can throw a bunch of logic loops in it, and that that will build the equivalent of an experienced SRE or Ops dev who can think through this stuff.
Jonan: Yeah. We just have to add more AI on top until it works. Although I do think that AI has tremendous applications in the space. All of these technologies are so new. It’s going to take time to get there. But I think we’re making quick progress.
Matt: I think we always need to look at it as something that is in service of the people, part of the socio-technical system—they are not replacing the people, they’re letting the people get to where they need to be faster. A year or two ago, I was talking to Damon Edwards at Rundeck about how Rundeck’s automation could work in an incident with PagerDuty, and instead of it being like, “OK, you get an alert and it will automatically do all this remediation,” it’s thinking about all the things that you, as a responder, want to have available to you. This is an oversimplified example: So I get paged in the middle of the night. It’s going to take me a few minutes before I log into the terminal. So during that time, could the automation predict things I might want to know, like queries to run, reports to gather, things like that, all the things we know? How can it make the people better? Not how can it replace the people.
Jonan: Right. I totally understand. People are very concerned that AI is going to take over the world.
First of all, I’m happy to inform you, it will be hundreds—if not thousands—of years before we’re even close. AI is bad, and it will always be bad; it will always, always be worse than AI-plus-human. It’s an assistive technology. It makes us better, but it does not replace us. And it never can.
Thank you so very much for coming on the show today, Matt. I want to make sure people can find you online. Where can we look you up on the internet?
Matt: Probably the best place to find me is on Twitter at @mattstratton. I mostly tweet about DevOps and talk about that, but also just random fun. I guess fun is a subjective thing—but, yeah, follow me on Twitter. Let’s argue about this stuff. It’ll be a good time. Speaking of fun things, I run an online game show once a month called DevOps Party Games, where we run games with DevOps and cloud-themed content. So you can watch the live stream, and you can participate in everything. We have all sorts of fun people come on the show.
And if you’re interested in seeing when I kind of flap my gums about all this stuff, my speaking portfolio is at speaking.mattstratton.com. You can see past talks. I’m not giving as many talks as I used to. And now that I’m not as much in evangelism and advocacy, I’m not paid to talk about it all the time. Now if you want to hear me talk, you probably have to be a government agency, but I give talks every now and again. You can find some fun ones there. Also, as I said, I have a podcast, one of the longer-running DevOps podcasts, Arrested DevOps. You can find us anywhere that fine and less-fine podcasts are available.
Jonan: Fine and less-fine—I am going to change my intro now. Yes. Thank you again so much for coming on. It’s been an absolute pleasure talking to you, and I look forward to many more conversations in the near future.
Matt: Thanks for having me.
Jonan: Have a lovely day.
Thank you so much for joining us for another episode of Observy McObservface. This podcast is available on Spotify and iTunes and wherever fine podcasts are sold. Please remember to subscribe so you don’t miss an episode. If you have an idea for a topic or a guest you would like to hear on the show, please reach out to me. My email address is email@example.com. You can also find me on Twitter as @thejonanshow. The show notes for today’s episode along with many other lovely nerdy things are available on developer.newrelic.com. Stop by and check it out. Thank you so much. Have a great day.