In this episode, Kelsey Hightower talks about the new mesh revolution in Kubernetes, while teasing his upcoming book, Mesh The Hard Way. The idea is that we can write applications that talk to other applications, and that there’s a lot of stuff that happens in the middle of those two things.
Kelsey talks about what observability means and notes that, as it stands, developers have a tendency to collect metrics without really being sure why they’re collecting them. He says that there is a balance between collecting data and being able to create something actionable, and that the power comes from what you do with the data.
Challenges come and go, and you can get depressed about them and complain about them, or you can look at challenges as an opportunity to overcome things. You may ask yourself how you can keep up with all of the observability tools out there, but most of the tools we see today are no different than most of the tools that existed 10 or 15 years ago. They may have better UIs and workflows, but fundamentally they are all roughly the same.
Kelsey urges developers to take comfort in that, be patient, learn what you can, and go as deep as you can. More than likely, what you’re learning now will be applicable in the future. Understand that you have control over the pace of information you let in, and the pace of things you choose to adopt, and you can also take a break and it will all be OK.
Should you find a burning need to share your thoughts or rants about the show, please spray them at email@example.com. While you’re going to all the trouble of shipping us some bytes, please consider taking a moment to let us know what you’d like to hear on the show in the future. Despite the all-caps flaming you will receive in response, please know that we are sincerely interested in your feedback; we aim to appease. Follow us on the Twitters: @ObservyMcObserv.
Jonan Scheffler: Hello and welcome back to Observy McObservface, the observability podcast we let the internet name and got exactly what we deserve. My name is Jonan. I’m on the developer relations team here at New Relic, and I will be back every week with a new guest and the latest in observability news and trends. If you have an idea for a topic that you’d like to hear me cover on this show, or perhaps a guest you would like to hear from—maybe you would like to appear as a guest yourself—please reach out. My email address is (firstname.lastname@example.org). You can also find me on Twitter as @thejonanshow. We are here to give the people what they want. This is the people’s observability podcast. Thank you so much for joining us. Enjoy the show. I am joined today by Kelsey Hightower. How are you, Kelsey?
Kelsey Hightower: I’m doing great.
Jonan: Yes. I should learn not to ask that question at this very moment in time [chuckles]. I know better, but it just keeps coming out of my mouth. Let’s come up with an alternative. What can we ask each other instead of, “How are you?”
Kelsey: You know what? I think it’s an OK question. I mean, you know, you want to check in, and most people know it is a bit of a rhetorical question. If I’m not doing great, I would just say, “I’m doing fine,” right? I’m alive and breathing. And for a lot of people that may not be the case. So I do appreciate the very baseline of those things and things could be better.
Jonan: I really appreciate that grateful attitude actually. And I will do my best to embrace it going forward, that’s a wise take. So we are here today to talk about observability, and I understand you might have some thoughts on observability. You’ve been around, I want to hear a little bit of your background for our listeners. If you don’t mind terribly, catch people up on what you’ve been doing.
Kelsey: If you’re thinking about most recently, people may know me from my work in Kubernetes; before that, you may know of my work in the container space at CoreOS. Before that, maybe some of the work I did in configuration management during my time at Puppet Labs. But before being a contributor to a lot of these projects and philosophies people are adopting, I held a lot of these roles as jobs, too. I’ve been a system administrator. I’ve been a developer trying to build apps and make sure that they run well in production.
And when you think about what I’m doing, in the last two or three months, I’ve been working on a thing called Mesh the Hard Way, so it may become a book. The goal is that I want people to understand. When I wrote Kubernetes The Hard Way, the thinking was that Kubernetes is becoming this popular container orchestration platform. It’s this huge distributed system. But what makes people uncomfortable about it is not the fact that it’s new. It’s the fact that they don’t know how it works. They don’t know how all the components fit together. So recently I’ve taken to the task of learning all the components of a service mesh.
When you think about a service mesh, we’re talking about things like Istio or Linkerd. The idea here is that I can write an application. Maybe it talks to other applications, and there’s a lot of stuff that happens in the middle. Things like security, mutual TLS auth, authorization, authentication. And, like the topic we’re talking about today, there’s a bit of observability, whether that’s structured logs, HTTP traces, or metrics that you gather with something like Prometheus. So right now, I’m just going to try to learn everything about those individual components and distill it in a way that other people can too.
Jonan: This is awesome. I’m really excited to hear it, because I came into software later than you. I’ve only been around for about 10 years. And I really only got into systems work after the DevOps revolution was already well kicked-off. So I’ve never been tasked exclusively with systems administration. My understanding of these concepts certainly is not as deep as yours. But from an outsider’s perspective, learning about the Kubernetes ecosystem, it already seems quite complex. And specifically this new mesh revolution, a lot of people are talking about this thing, but I have yet to find a concise definition of what it is, and especially how I should be using it effectively. So I’m really looking forward to it. Is it going to be a book, or a course, called Mesh the Hard Way?
Kelsey: Yeah. I typically like to use multiple outlets. There are lots of people who appreciate something they can hold in their hands, with a little bit more commentary, a few more diagrams, and a little bit of monologue from me. And that might take the form of a book, like the one I wrote with my co-authors Joe Beda and Brendan Burns, Kubernetes: Up & Running. So that had a little bit more commentary, a little bit more visuals and graphics, and then I also have a GitHub repository where, under Creative Commons, I kind of share a lower-level look at Kubernetes and building it from the ground up.
So I think Mesh the Hard Way deserves the same treatment: the form of a book. There’s probably going to be keynotes around it. And of course, that raw GitHub tutorial that you can just pop open and take your time and work through all the low-level components and walk away and say, “OK, now I understand how the moving pieces fit together.”
Jonan: This is exactly what makes you such an impressive developer advocate to me: your approach to content. And it is clear to me and the entire community, I think, that you are definitely putting the users’ interests first. It’s a very important thing for me in developer relations that we are first educators. Our role is to give back to the community, and you do an excellent job with all the content you create focusing on that. So I really appreciate it. So tell me more about your take on observability, given that you’ve been around since before we even started calling this thing observability. I worked at New Relic about five years ago, and I’m now back at New Relic and building a DevRel team for them. But since my return, this observability term has come into popular use, and it is designed to talk about monitoring and APM and errors and traces and logs and all of these things together, but in a very specific way. Maybe you can give us your quick summary of what observability even is.
Kelsey: It kind of speaks to the power of words, right? Now that we call it observability, we have a place to anchor our thoughts in this area. So what does this mean? If I were to explain this to someone that doesn’t write software for a living, or isn’t a system administrator: if you’re driving your car, that panel is a form of observability. How much gas do I have in the car? How fast am I going? Now, I could blast you with your current tire pressure. I could also tell you what the height of your seat adjustment is, but those aren’t really that important at a given moment. Maybe I’ll only show those things when you’re not driving, or when your fuel economy has dropped and your tire pressure may be contributing to it. And that will just give you an alert so that you go and check your tire pressure.
So I think in many disciplines in the world, we already have observability. You go to the doctor and they take your blood pressure and you jump on the scale. Our human bodies have a way of emitting metrics that can be measured to tell us things. In the software world, we always talk about our world being young and immature in many ways because we don’t necessarily have a clear set of golden signals.
Now I know people talk about that kind of thing and say, “Hey, here’s a set of golden signals you should know about your applications.” Maybe we should tie those golden signals to SLAs, but there’s this huge disconnect, because we don’t know what everyone’s service level agreements are. We don’t know what promises they’ve made to their customers. Is this a hobby project? Is this a highly reliable financial application? And since you have this kind of vagueness and disconnect, we’re not quite sure what metrics we need.
So for example, we go back to the car analogy. Well, there’s a posted speed limit. So that’s kind of the contract. If you go faster than the speed limit, then there’s a chance you’re going to get pulled over and get a ticket. And there’s going to be some accountability, like a fine to pay or points against your license, and your insurance may go up. So then when we talk about a speedometer, its goal is to show you how fast you’re driving. So that way you can actually make a connection between the contract and how fast you’re going. And you can adjust behavior by slowing down or, in some cases, speeding up to match the flow of traffic. And in software, we don’t tend to always have a clear picture of what the speed limit is. So we tend to collect a lot of these metrics and stuff, but we’re not sure why we have them, so no one tends to observe. It just gets collected into the ether. And maybe we bring it back up when there’s something on fire.
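To make the speed-limit analogy concrete, here is a minimal Python sketch, not something from the episode. It assumes a hypothetical latency objective (the "posted speed limit") and compares measured request latencies (the "speedometer" readings) against it. The class name, the 300 ms threshold, and the 90% target are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class LatencyObjective:
    """The 'posted speed limit': a hypothetical latency target."""
    threshold_ms: float   # a request slower than this counts as "bad"
    target_ratio: float   # fraction of requests that must meet the threshold


def good_ratio(latencies_ms, threshold_ms):
    """Fraction of requests at or under the threshold (the 'speedometer')."""
    if not latencies_ms:
        return 1.0  # no traffic means nothing violated the objective
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def within_objective(latencies_ms, objective):
    """True when the measured ratio satisfies the contract."""
    return good_ratio(latencies_ms, objective.threshold_ms) >= objective.target_ratio


# Illustrative numbers only: 8 of these 10 requests finish within 300 ms.
objective = LatencyObjective(threshold_ms=300.0, target_ratio=0.9)
samples = [120.0, 180.0, 250.0, 90.0, 310.0, 140.0, 200.0, 160.0, 700.0, 110.0]

print(good_ratio(samples, objective.threshold_ms))   # 0.8
print(within_objective(samples, objective))          # False
```

The point of the sketch is the pairing: the latency samples only become something to act on once there is a stated objective to compare them with.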
Jonan: That is a really concise explanation of this problem. Basically, given that we work in the software world, it’s actually not terribly complex for us to measure things, certainly with the availability of the tools that we have today, it’s getting to a point where you can measure most anything that you want to measure within your application or your infrastructure. But the real question is how to extract value from that. And that’s where observability comes in.
Kelsey: Yeah. And that’s why it’s not called collectability. It’s called observability. And I think that also helps people understand that you can actually make a trade-off; you might start by collecting the thing you can observe. When I was coming up, before we started to call it observability, I remember it was like collecting data on all your servers and just collecting all the metrics that the kernel will spew at you and putting the metrics somewhere. And this important metric is now blinking at me, telling me that the hard drive is full from collecting all of these metrics, so we’re unable to collect any more metrics. That’s when you start to over-index on collecting metrics versus why you’re collecting metrics. And I think that’s what the term observability does for people: it lets them prioritize when and where to collect things, because hopefully you’re going to keep an eye on it.
Jonan: Yeah, absolutely. And I think deriving the value from it when you have so much coming in, there is definitely a balance between collecting the data and being able to create something actionable. When you collect too much data, you end up with this overflow of information, and then someone gets a buzz, they’re on call, and they show up. And it’s very difficult to identify the problem in that sea of information. So rather than collecting less data, which I’m pretty sure people are not super-excited about, it’s nice to at least have a historical record of where things were with the system, even down to the height of your seat or your tire pressure, for diagnostic purposes when things go wrong. But what kinds of technologies do you see adding the value without trimming back the noise? Maybe we keep the noise, but we push it into the background, where it belongs. We use tech like AI and ML to pull out real actionable information.
Kelsey: So this is more about skill. One thing we’ve kind of proven to ourselves is that we can collect the data. And there are many industries that collect data about things. It’s what you do with the data; that’s where the power comes from. So in troubleshooting, having the data readily available, and then knowing what to do with it, can help you possibly resolve an issue faster. Oh, this error message is only coming from this IP address. It’s that additional metadata that points me in the right direction. And that’s great. I think a lot of software teams are using data to make these kinds of decisions.
Some people scope observability to that. So then you can start to do things like predicting when a hard drive will fail, because given the signals that I’ve been getting, I can look at these historical values, and you can brute-force this. Run a big SQL query and hope it finishes on time, depending on how much data you have, or as you’ve mentioned, we have these ML and AI ways of saying, “I can model these bits of data and predict when the hard drive might fail.” And it’s not going to be 100% accurate. But the goal might be, “Look, we know on average that when we see these things, they indicate this hard drive is going to fail within 90 days.” And we have enough data over large datasets.
So not just my data center, but multiple manufacturers under similar conditions, so that we have enough precedent to say that I can now build a model that will predict failure, because maybe replacing a hard drive that still has a little bit of life left in it is more valuable than waiting for it to actually crash and then having to do data recovery. So this is where I think observability, when you start to combine it with data science, lets you make these kinds of decisions proactively, and that might be the right engineering trade-off based on using data to inform decisions.
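As a rough illustration of using historical values to act before a failure, here is a deliberately simple Python sketch. It is a stand-in for the kind of model described above, not a real failure predictor: it fits a straight line to a hypothetical daily drive-health counter and estimates how many days remain until the counter crosses a hypothetical replacement threshold. The counter, the threshold of 100, and the function names are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_threshold(days, values, threshold):
    """Estimate days from the last observation until the trend hits threshold.

    Returns None when the counter is flat or falling, because a linear
    trend then never reaches the threshold.
    """
    slope, intercept = fit_line(days, values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Illustrative data: a counter growing by 2 per day, starting at 10.
days = [0, 1, 2, 3, 4]
counter = [10, 12, 14, 16, 18]

print(days_until_threshold(days, counter, threshold=100))  # 41.0
```

A production model would be trained across many drives and signals and would report a probability rather than a single date, but the trade-off is the same one: replace a drive with some life left, or wait and risk data recovery.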
Jonan: That’s an interesting perspective to have going into this. It seems to me that, a lot of the time today, every company webpage you go to says artificial intelligence on it; no matter what service the company is offering, it says machine learning and artificial intelligence. I really feel like in the observability space, we have an opportunity to take advantage of that and to use it in really intelligent ways, building these models. When you have the massive data stores that companies like New Relic have, we are able to actually do this kind of predictive modeling really well for our customers.
When we get to a point, say five years from now, when that becomes standard, maybe we end up with software products that ship with that capability; that might be something that is integrated with Prometheus, for example. That leads us to my next question, where I see this explosion of tooling, and technology has grown exponentially complex in recent history. While there are platforms that offer observability right now, in the near future, if we continue on this path, I think things are just going to be too complex for even an intelligent system to keep up with. What do you see happening over the next several years in the space that’s going to prevent that world?
Kelsey: So to make sure we’re clear, we’re talking about artificial intelligence. So this is not a replacement for human intelligence. You will be able to have a system that sees enough patterns and trends to say, “Hey, I predict that your next request is going to be slow,” but it won’t necessarily be able to say that it matters. Maybe it doesn’t matter. Maybe no one cares that the next request is going to be slow. And that’s why these are just tools to help people make an educated guess. If I know that I’ve moved my server to another country, then that changes a lot of the variables.
Of course, the latency is going to be high. So if your model was built based on where it was before, then this will look like an anomaly. It may throw things off. And this is why we have to update these models. We have to adjust a few assumptions, and we have to have a baseline: what are we talking about? What’s the actual end goal? If we try to do this via just libraries and tools and dashboards, and we try to decouple the human part of it, then we kind of get in this weird spot. There’ve been cases where people tried to roll out AI to predict the best candidate to hire from a stack of a million résumés, and that just kind of gets away from what makes humans special. We have the ability to use more than data to make decisions. We can use intuition. We can use situational awareness. We actually have a context that is not backed by data.
Sometimes you’re just in a room, and there is no data to tell you what’s going on in that room, but you have to use your eyes. And maybe you can translate the images you see into data and map it to experiences. But we can do this at an amazing rate that is currently not rivaled by machines. Machines are really good at looking at large sums of data and making some type of prediction: what’s the next chess move? What’s the next Go move? But whether an anomaly is a big problem in a larger system, a human still needs to make that final judgment call in some ways, so this is why we blink the lights. This is why the graphs draw those lines. So you can look at it as a human to try to make a decision based on what the system is telling you.
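The point about baselines can be sketched in a few lines of Python. This is an illustrative assumption, not a method named in the episode: a simple z-score check flags a latency reading as anomalous relative to a baseline. After a hypothetical move to a farther region, the same reading is "anomalous" against the old baseline and perfectly normal against a refreshed one. All numbers are made up.

```python
from statistics import mean, pstdev


def is_anomaly(value, baseline, z_threshold=3.0):
    """Flag value when it sits more than z_threshold deviations from baseline."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return value != mu  # a perfectly flat baseline: any change stands out
    return abs(value - mu) / sigma > z_threshold


# Hypothetical latencies in milliseconds before and after moving the server.
old_region_baseline = [48.0, 52.0, 50.0, 49.0, 51.0]
new_region_baseline = [198.0, 202.0, 200.0, 199.0, 201.0]

reading = 200.0
print(is_anomaly(reading, old_region_baseline))  # True: the model is stale
print(is_anomaly(reading, new_region_baseline))  # False: baseline was updated
```

The check itself cannot tell whether the jump matters; it only reports that the reading differs from what it was told to expect. Knowing that the server moved is the context a person brings.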
Jonan: I agree with you that the human element is irreplaceable. It comes up a lot with developers when we end up chatting about the future: 100 years from now, it’s very plausible that we have computer parts inside of our bodies. If Elon Musk has his way, it’ll be much sooner than that.
And we talk about this future where maybe we do have artificial intelligence that improves to a point where it is capable of replacing the human mind. And my view has always been that that’s a less powerful system than that AI plus the human mind. No matter where we are in that journey, the human element is irreplaceable. It’s always an enhancement, I think, for these kinds of intelligence systems that we’re building.
So I want to pull back a little bit here and talk about a couple of other things that are going on in software, like the microservices move that was very popular for a while. I’ve seen kind of a snapback recently where people are saying, “Hey, maybe splitting our monolith up into a thousand different microservices was adding unnecessary complexity.” And now people are advising that you stay with your monolith until you have a very good reason to pull pieces off and make them into microservices. What is your take on the trend in microservices versus the monoliths?
Kelsey: The size of a deployable thing you’re building is slightly a different argument from, “We are permanently in the services world, that’s permanent now.” You’re going to call out to PayPal. You’re going to call out to New Relic’s APIs, GitHub APIs, DNS.
Most people, even in your monolithic world, probably have a DNS server somewhere, domain name service. So you already have services. I think we’re already past the idea that everything will be in a single binary. Our database tends to live outside of our application binary, DNS lives outside of it, etc. So we’re already past that. With the internet and all this, we’re in the services domain. So now when you’re writing your software, let’s say you’re writing application logic. Because we’ve already admitted that the database is outside, even your storage system is outside, whether it’s a file system being served by the kernel. Those are a set of storage services that are not in your application.
Now the app that you’re about to sit down and write: when I’m writing things by myself, there’s going to be no other contributor around. I can deal with that model. Because there’s only one developer in there writing good or bad code, and that’s me. I can understand all the contracts, and I can reinforce contracts and so forth. So monoliths tend to be the right trade-off. I can go and refactor, I can hold it all in my head, and I have no one to conflict with. In that sense, the monolith works well for the model of a solo developer. Now, when we start to add two different people working on the codebase, let’s say they agree on how to write code and style, and maybe they’re even equally skilled. In that world, maybe they have workflows that can be very compatible. “Hey, you work on this part of the codebase, I work on this part of the codebase. And when we have conflicts, we’ll resolve them together.” And they do a good job of it. Again, a monolith will be OK.
Now once you start to have thousands of people trying to build a system, let’s say you start to introduce an authorization service into this mix. And you start to try to borrow code from my implementation, like my user database. And then you say, “Well, we should change the user database or the user object, so that I can authenticate the user better.” I would say, “Well, if you do that, you’re going to break my internal reference to that user. And once that starts to hit, we get this mess. We should figure out a way to either have multiple versions of the user object, or maybe don’t use my user object.” We get into ownership. What part of the codebase should own other parts of the codebase? And we have rules around encapsulation and so forth. Now, this is really hard to do in a single codebase. So what do we do? Some people will move into libraries. I’m going to create a user library and we can have multiple versions of that user library. I import version one, you import version two, and we’ll have some commonality between the two versions, so that name, user, and address can be compatible. Great. And then you get to a point where, when we go to deploy this stuff though, maybe we do introduce some friction: “Ah! I made a few mistakes, sorry about that.” I updated user two, but not user one. And now the whole system is broken. And the way we’re storing things in a database is causing corruption.
So from an organizational standpoint, you may make the argument that the auth service shouldn’t depend so much on the particular big, all-in-one thing that we have. You may get the idea that we should just split it out, and move the auth service so it can actually run independently. And we’ll just have a very basic contract in terms of what a user is. But our representation of the user in our database could be very different than your representation of a user in your database. And this might actually work. But the biggest problem that I think a lot of people overlook here is that, if my service has its own definition of a user, and I try to authenticate that user in your system, there’s going to have to be some type of contract to say that my user ID matches your user ID. Then we have to figure out how to secure communication between my service and your service.
It might turn into a case where I can’t even develop my service without your service. When that scenario becomes true, you end up in a place where you’ve just built a distributed monolith, and you’ve introduced a bunch of complexity with very few of the benefits—but it may be worth it, if it allows your organization to run independently, and some people need that.
Jonan: That makes a lot of sense to me. I think this is what happened early on, especially with these microservice changes that people were making as organizations. I remember distinctly when I was working at LivingSocial back in the day, needing to spin up 14 or 15 separate applications on my laptop in order to work on one component of that system. And then we were in a place where we were building mock versions of those applications or doing a lot of work around stubbing the edges of these services to make sure that we could give back a fake response that would make it plausible, that we could continue development on the individual applications.
But I agree with you that that is exactly the piece that most people are missing—understanding the difference between a true microservices architecture and a distributed monolith, and whether or not it is worth it in any given moment is certainly up to the organization to determine. So with regard to observability in this world, we know what we’re trying to achieve with observability. We want to surface the things that matter. And the world of containers is complicating things quickly. There are a lot of different components that are operating in the Kubernetes ecosystem. If you look at that page of the CNCF projects that exist out there, there are potentially over 100 now, I think. I’m wondering if you know how many there are—maybe 150 different projects under the CNCF umbrella?
Kelsey: I mean, at this point it went from maybe an art piece hanging in a museum to a graffiti wall. Anyone’s free to come add their tag or whatever on top of it. So I think it’s one of these things where it’s trying to highlight and celebrate all of these approaches, even the competing approaches, and just how big and expansive this ecosystem is. If you look at it, it can be very intimidating to know that you could possibly make all these choices.
Jonan: And many of those choices overlap in their functionalities. So when you are looking at evaluating a new technology, what are you looking for personally? I assume maturity is something that you factor in there, but beyond that, what sorts of things do you look at when you are trying to choose a distributed tracing solution or a logging solution when you’re building out a system?
Kelsey: It’s funny. People already do this today. I do a lot of the grocery shopping. When I go to the grocery store, one of the meals I like is building my own Jersey Mike’s-style sandwich. So I’m walking through the grocery store, and I have tons of bread options. There’s, I don’t know, 50 different brands of bread. Individual slices and they also have the Italian loaf that was baked this morning. And it’s only good for three days. I have these choices to make. Do I go with this or do I slice off my bread and cut it, like I would get from a deli? But one thing is, once I’ve tasted one of them, I’ll try this one. I’ll try that one. But once I like one, I have brand identity. So maybe I like buying things that come from Amazon. Maybe I like buying things that come from Datadog, or maybe I like buying things that come from New Relic.
So a lot of times we will start to negotiate a bit of brand affinity toward various products. Because a lot of these things tend to have overlapping things. Now when there’s some unique thing in the market—I’m a vegetarian. And I know if I want something that tastes like a hamburger, then there’s not a lot of choices for me other than maybe Impossible Burger or Beyond, but these are going to be new things.
So I think Prometheus was a very new thing to people. So as you’re saying, which one do I choose? Well, again, remember, in the grocery store analogy, I knew I wanted to make a sandwich: bread, lettuce, some cheese, some bell peppers, etc. I had a goal in mind. So if you tell me, “I want to be able to collect metrics from my application,” then great. Maybe you have some fundamental things in place, so that should help you narrow things down. And then the next question you may ask is, “What kind of metrics do I want to collect? Do I want a lot of flexibility? Or do I want things that are like metrics from the kernel? I don’t really care about custom metrics.”
So if I’m really thinking about machine-level metrics, that’s usually covered by some kind of off-the-shelf agent; I install it, it picks things up and it will ship it somewhere like crate. But when it comes to custom metrics, that starts to scope down my choices a little further. So when I think about custom metrics, how would I actually write code to produce these custom metrics? I have to make sure that the ecosystem for doing this is there, because I don’t want to create my own object types by hand. Ideally, I’m going to see if there’s a great ecosystem. So this is where Prometheus starts to shine. They have great libraries. They even have framework integration.
Whether it’s Spring Boot or some other web framework that you’re using, they’ll tend to have some baseline, either OpenTelemetry integration or something like Prometheus integration, where, when I import that library, it’s going to give me some baseline things from the runtime. And it’s going to have some helpers to let me create custom metrics, and then, bonus points, it may even have an opinion about how those metrics should be collected. And this is always a big debate, push versus pull. I’ve come to know that I like the flexibility of having both. So for those that are not familiar with Prometheus, it comes with the concept of /metrics. If you pull in the library and you register the handler, then you’re going to get a /metrics endpoint that’s going to typically give you free metrics about your runtime. If you’re a New Relic user (I think this is the thing that blew me away about New Relic back in the day, when I was at Puppet), you import one library and you go to that New Relic dashboard and the whole world lights up for you. With Prometheus, you get these runtime metrics, like how many goroutines you have running or how much memory is being used, but then it also gives you all these helper libraries for your custom metrics, an opinionated way to expose them, and a pull model of going by and scraping those metrics at regular intervals.
So when I look at that whole ecosystem as a developer needing to produce custom metrics, looking at the libraries that are available, looking at the knowledge base that’s available, then I start to say, you know what? Prometheus is the right tool for me based on the fundamentals, based on the ecosystem. And again, all that equates to sustainability in my mind.
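For readers who have not seen the pull model, here is a minimal Python sketch of the idea using only the standard library. It is not the official Prometheus client library, which is what you would use in practice; the `Counter` class, the `orders_processed_total` metric, and the `serve` helper are hypothetical names invented for this example. The sketch keeps a custom counter in memory and renders it in the Prometheus text exposition format at `/metrics`, where a scraper could collect it on its own schedule.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A tiny stand-in for a client-library counter with labels."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self.values = {}  # maps a sorted tuple of (label, value) pairs to a total

    def inc(self, amount=1.0, **labels):
        key = tuple(sorted(labels.items()))
        self.values[key] = self.values.get(key, 0.0) + amount

    def render(self):
        """Render the counter in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self.values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            suffix = "{" + label_text + "}" if label_text else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


# A hypothetical custom metric owned by the application.
ORDERS = Counter("orders_processed_total", "Orders processed by this app.")


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values whenever a scraper asks for them."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = ORDERS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; call this from your own entry point."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Application code increments the counter as work happens.
ORDERS.inc(status="ok")
ORDERS.inc(status="ok")
ORDERS.inc(status="failed")
print(ORDERS.render())
```

The application never pushes anything in this sketch: it only answers when asked. The collection interval is a decision made on the scraper side, which is the opinionated part of the pull model described above.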
Jonan: Yeah. This is exactly what drew me to New Relic back in the day as well. And I’m excited to see solutions like that now implementing the ability to incorporate both sides of that equation. Much like you said, many of the services that are out there have this agent model where you install the agent in your application and then the metrics appear in the dashboard. They’re also able to integrate with systems like Prometheus, so you can have the best of both worlds. You mentioned OpenTelemetry there a little bit, and I want to catch people up on the progress that’s happening in that space, because that’s really pretty new, right? But there were two projects that merged into OpenTelemetry maybe a year ago.
Kelsey: Yeah, that’s correct. Kind of like the Cloud Native Community, Google had OpenCensus, which was like when we think about having a framework model for multiple languages to collect and then ship metrics to different backends. And then we also had the OpenTracing community; they were doing their thing. Maybe they’re going to a different name. And we’re looking at the goals of both projects and wondering if it really makes sense to have multiple languages. Maybe one was focused on more attributes, like logging, tracing, and metrics. Maybe the other one was more focused on one or two of those dimensions. Those communities got together and said, “We can probably unify this stuff, there’s not that many differences between what we’re doing.”
Now we’re at this place where maybe OpenTelemetry is a way forward to capture many of the goals from this library-based approach. And the reason I’m emphasizing library-based is that I need to do something in my codebase to get some value. And there’s another movement that I think a lot of people have been paying attention to. I’m actually an advisor for a company called Pixie Labs. And they’ve taken the approach that now, with modern kernels, you can leverage things like eBPF in a way that the kernel now has a lot more insights and hooks into language runtimes and the networking model. I can actually start grabbing telemetry and data about HTTP directly from the Linux kernel.
So, guess what? You now can be in a world where you think about this kind of auto-telemetry mindset of no libraries (maybe agentless isn’t quite the right word), but ideally, now you can hook in underneath these applications. So that’s a little bit different than the “compile-this-library-into-your-app” approach. Now it’s more about pulling things from underneath the runtime. And, right, in the Java community, you’ve always had the ability to go to the runtime and pull things out. But imagine now saying, “I can now go underneath that and pull from the kernel a lot of the stuff that was only available on some of the runtimes.”
Jonan: Yeah, it certainly simplifies implementation when you’re trying to get this observability into your large-scale systems. I imagine at a company like Google, there are plenty of people who are on a team trying to maintain hundreds of thousands of different containers and instances (they’re no longer servers, certainly). But 100,000 applications running out there, trying to achieve the same goal. And they’re all working as part of a microservices architecture. It’s actually a pretty heavy lift after a while to get all of those agents into place. Being able to move at least some portion of that burden for observability lower in the stack really simplifies things, I imagine.
Kelsey: Yeah. And I think this is where it’s super key to try to have a philosophy of making it easy to do the right thing. So if you can hook in at the kernel layer and pull some stuff out, great. If you can hook in at the runtime layer and pull some stuff out, do that. But when it comes to custom metrics, the developer may be required to give a little bit more context. Even if you could observe the runtime, if it has all the symbols compiled in, and pull out what methods are being called, what’s typically missing in a lot of this stuff is metadata about the context. Why am I connecting to this database? Am I going to retrieve a user? Am I going to go retrieve, you know, a new price for something? So that context again is the thing that starts to allow us to get a little bit of value from what this data point is trying to tell us. This speedometer has the context of the speed limit. I know that this gas light tells me how much fuel I have to keep driving. You need a little bit of that context in order to make any of this work at scale.
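One way to picture the context point is a measurement that carries a label saying why the call was made. The following is a minimal sketch, not any particular library’s API: `timed`, `summary`, `get_user`, and `get_price` are hypothetical names, and the database calls are stand-ins that return fixed values.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# (operation, purpose) -> list of observed durations in seconds.
_durations = defaultdict(list)


@contextmanager
def timed(operation: str, purpose: str):
    """Time the enclosed block and file the result under its purpose label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        _durations[(operation, purpose)].append(time.perf_counter() - start)


def summary() -> dict:
    """Return {(operation, purpose): (call_count, total_seconds)}."""
    return {key: (len(vals), sum(vals)) for key, vals in _durations.items()}


def get_user(user_id: int) -> dict:
    # Same kind of database call, but the label records the intent.
    with timed("db_query", purpose="retrieve_user"):
        return {"id": user_id}  # stand-in for a real query


def get_price(sku: str) -> float:
    with timed("db_query", purpose="retrieve_price"):
        return 9.99  # stand-in for a real query
```

Observed from underneath, both functions would look like "a database query". The `purpose` label is the part only the developer can supply, and it lets the data be split by intent afterward.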
Jonan: I think you’re absolutely right. And that’s a huge component of observability—that’s why we talk about it that way in the first place, because context is the piece that matters. You have a lot of flashing lights, but which ones are flashing and why they’re relevant, and why the original developer even made them flash in the first place, is the sort of information we’re really after. So I have another question that I wanted to ask, it’s a little bit unrelated to where we are right now, but I hear a lot of people having a lot of strong opinions about multi-cloud, and you are someone who I think has very well-formed opinions. Would you tell us how you feel about multi-cloud and whether or not that’s important for companies to consider as they grow?
Kelsey: When I hear multi-cloud, people are articulating the friction of adopting separate technology islands. So let’s go into a single data center and we’ll flip the light on. We’re walking around this data center and looking at this blade chassis that’s mounted into this 42-inch-wide rack. And the HP blade servers are built by HP, they’re phenomenal pieces of equipment. And on the back of them, they have this standard power connector that goes into this APC power supply. On top of that rack, at the very height of it, you see this 48- or 42-port switch made by Juniper. Then you trace the cable down to the NetApp storage device made by a totally different vendor. At no point in my career do I ever remember someone calling that multi-vendor. No one says, “Oh, I got to have a multi-vendor strategy.” No, you buy the best product for your need. And why does this work in data centers? Because most of those devices I talked about are connected by either a Cat 5 cable or maybe a Fibre Channel cable, and they’re so standardized, you almost know what to do.
And the latency is so low between those things. Before we get to multi-cloud, I’m going to introduce some friction. Say I’m going to create a replica of this data center and move it to the other side of the globe, and all the techniques that you used before no longer work the same way. I can’t just run a piece of fiber. I mean, I guess you could in many ways with these undersea cables, but just go with the analogy for a second. You can’t just go to Best Buy, buy a cable, and drop it over the side of the world. You’re going to have to start to deal with some real challenges of the stuff in the middle, things like latency, or even change the way you structure your data. Right? I can’t rely on low-latency access to fast storage. I’m going to have to account for the fact that I’m going to replicate to the other side of the world.
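To put a rough number on that friction, here is a back-of-the-envelope sketch of the lowest possible round-trip time over optical fiber. The figures are assumptions for illustration: a refractive index of about 1.47 for fiber, roughly 20,000 km to the far side of the globe, and a 2 m cable inside a rack. Real paths add routing and equipment delays on top of this floor.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458      # speed of light in a vacuum
FIBER_REFRACTIVE_INDEX = 1.47          # assumed typical value for optical fiber
HALF_EARTH_CIRCUMFERENCE_KM = 20_000   # rough distance to the far side of the globe
IN_RACK_CABLE_KM = 0.002               # assumed 2 m cable between devices in a rack


def min_round_trip_ms(distance_km: float,
                      refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """Lower bound on round-trip time in milliseconds over fiber.

    Ignores switching, routing, and queuing; it is only the physical floor.
    """
    speed_km_s = SPEED_OF_LIGHT_KM_S / refractive_index
    return 2 * distance_km / speed_km_s * 1000


across_globe_ms = min_round_trip_ms(HALF_EARTH_CIRCUMFERENCE_KM)
in_rack_ms = min_round_trip_ms(IN_RACK_CABLE_KM)
```

Under these assumptions the round trip to the other side of the world is just under 200 ms, while the in-rack hop is on the order of tens of nanoseconds. No amount of tooling removes that gap, which is why the data model has to change.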
In the case of Google, we have atomic clocks, so we can actually build this distributed system to compensate for the speed of light and the latency that’s introduced when you’re trying to have a database and keep it in sync with some form of consensus. Given all of that, that level of friction is ever so present in the cloud. So for example, let’s imagine a world where all cloud providers shared one piece of network fabric. Meaning all regions and all zones between all cloud providers have the same amount of latency between each other. We wouldn’t be thinking about multi-cloud in terms of networking the way we do now. We would say, “Oh, I would just put a load balancer and point to the IP in each of the cloud providers, just like I do in the data center.” The other friction point when you hear multi-cloud involves authenticating to the different services. Again, you already have this problem on-prem.
When you’re talking about an Oracle database or Postgres, we typically need a set of credentials, maybe a TLS certificate to connect securely and a username and password. Then when you go to connect to a different service, maybe you need a token or a session token, depending. We already have a variety of authentication protocols. We just understand them: I put this in a text file and I put this certificate over here, and the different clients I have will grab their credentials and authenticate to a thing. We know how to do that. But in the cloud, it’s implemented one way in Amazon and Google does it another way. They may have a metadata API that you can call to fetch those credentials. Most people say, “Oh, that’s just so different. There’s no way I can ever do this. Like, how can I ever resolve the two?” The thing is, you’ve never resolved this. There was no place where you tried to normalize it. And we can do this with PAM modules, where you teach PostgreSQL how to use the UNIX user database on your local server. Then you can authenticate with a similar set of tokens or user credentials. The cloud can be treated the same way.
So again, when I hear multi-cloud, what I’m hearing is there’s a new set of friction that I can’t resolve. If I made the network disappear and the latency that goes with it, then it would be very easy for someone to go to Google Cloud, spin up a bunch of containers on GKE, Google’s Kubernetes offering, and then point to, let’s say, DynamoDB in Amazon, and use that as the backend database. It would just be so obvious because it’s just an IP address that requires a set of credentials. So that is what multi-cloud means to me in terms of when people talk about it. I think that’s what they’re describing.
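The authentication point above can be sketched as one small interface with different sources behind it. This is an illustrative sketch only: `Credentials`, `CredentialSource`, `FileSource`, `MetadataSource`, and `connect` are hypothetical names, the `identity:secret` file layout is an assumption, and the metadata lookup is an injected callable rather than any real provider’s API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Protocol


@dataclass(frozen=True)
class Credentials:
    identity: str
    secret: str


class CredentialSource(Protocol):
    """Anything that can hand back credentials, wherever they live."""

    def fetch(self) -> Credentials: ...


class FileSource:
    """Reads 'identity:secret' from a local text file (the on-prem habit)."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)

    def fetch(self) -> Credentials:
        identity, _, secret = self._path.read_text().strip().partition(":")
        return Credentials(identity, secret)


class MetadataSource:
    """Wraps a callable that asks a provider's metadata service for a token.

    The callable is injected by the caller; no real provider API is modeled.
    """

    def __init__(self, identity: str, fetch_token: Callable[[], str]) -> None:
        self._identity = identity
        self._fetch_token = fetch_token

    def fetch(self) -> Credentials:
        return Credentials(self._identity, self._fetch_token())


def connect(service: str, source: CredentialSource) -> str:
    """Stand-in for a client: it only needs *a* source, not a specific cloud."""
    creds = source.fetch()
    return f"{service}: authenticated as {creds.identity}"
```

The client code calls `connect` the same way whether the secret came from a file on disk or from a metadata lookup, so the per-provider differences are pushed behind one seam instead of being treated as unresolvable.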
Jonan: Yeah. I think very often people use terms like this when they’re trying to wrap their heads around the problem that they actually have. But I agree with you that we’re looking at this in the wrong kind of way: you wouldn’t describe a server rack as being multi-vendor, so you shouldn’t describe your platform as being multi-cloud. If there are technology choices that you’ve made along the way that are leading to friction, then address the underlying choices instead of trying to approach them top-down. Am I approximately summarizing your perspective?
Kelsey: Yeah. So right now, most people look at cloud providers, including their on-prem infrastructure, as these little separate technology islands. What we really need to be talking about is what bridges need to be built across them, and those bridges really look like a form of networking. I think there’s a bit of a fantasy that, “Wow, I should just have this multi-regional replicating data storage and databases.” That’s actually hard [laugh]. You can do it, but you may not necessarily get all providers on the same page about what data storage to support, or get them to agree to allow replication over the same method that you would choose.
Jonan: Again, as predicted you have fantastic opinions, Kelsey. Thank you so much. I want to maybe have you on the show again in a year and have something to talk about. I wonder if you would predict the future for us—what do you think is going to change in the observability world over the next year? What sorts of things are we going to see popping up?
Kelsey: You know, I think a year is probably not enough time for anything, because I think most people are having to settle on the term, whether they disagree with the term or not. I think once they settle on the term as an anchor for thought, tools are maturing at a great clip. You have a lot of open source tools that are doing a good job of collecting. We’re still in that collection phase with the tools. And now we’re starting to see some of the other tools that are coming out that are focused on workloads. What does this data mean? A way to annotate that data and have it get you closer to a decision. So great.
I think we’re starting to move in that direction. But then you’re also starting to have a bit from the human side: you might have to become a data scientist to get full value out of what we’re trying to do with observability. So even though we’ve tried to say observability means a thing, honestly, it’s data. This data is just like all other forms of data, and there are techniques that you can leverage by understanding some of the data sciences. As we start to really approach it from that regard, I think a lot of the vendors are doing this right. A lot of them employ data scientists to say, “Hey, how do we get the most value out of this data so that users can have a dropdown or a widget that they can use?” But I think a lot of times, as we start to mature, that capability will also find its way to the common developer or the common operations person.
So I think that’s what we’re going to continue to see over the year. The next evolution will be when we stop calling it observability. That’s how you know that we’ve moved on to the next thing. I don’t know what that word is, but I know that we tend to adopt words until they’re no longer useful. We create new words that then move the torch forward.
Jonan: I think that that is very likely to be the future for us. And I don’t know if it’s going to happen in a year, but I am inviting you back now a year from now because, at the very least, I want to hear about your book. That is, your project that you’re working on right now, called Mesh The Hard Way. Let’s give our listeners some of the resources in the Kelsey Hightower ecosystem so they can learn some more about your work.
Kelsey: My work these days (and I have lots of code I’ve written on GitHub): there are projects where I was the maintainer, and there are projects where I was a contributor. I’m currently contributing to Open Policy Agent, which is a great open-source framework for doing authorization. So once you’re authenticated, what can people do? But these days, Kelsey’s about learning in public. I don’t claim to know everything, but I do claim to be a person who’s very pragmatic, willing to do the hard work. So it takes me several months to learn a piece of technology. What I promise to people is that I’m going to go super deep, talk to the people who have invented it, maybe even come up with my own ideas, but then I just have this habit of learning in public. That means if you follow me on Twitter, then you’ll probably get a daily dose or biweekly dose of what I’m thinking about and how I’m approaching problems. Then you’re also going to find stuff on my GitHub: “Hey, here is a tutorial on how to do this thing I just learned, and I want to save you the last 25 hours I spent on this.”
So maybe you can learn it in an hour. And then, of course, there’s going to be things like books, like Kubernetes: Up and Running, and Kubernetes the Hard Way, which is this Creative Commons tutorial on how to build Kubernetes from scratch, and I’m always going to be leveraging these things. Sometimes it’s going to come in the form of a podcast. There’s going to be some keynote that I’ve given that will be on YouTube. So I encourage people to follow Kelsey, the whole person. Beyond the technologist, I have a lot of philosophies on how humans should interact and behave with each other. And you can translate that into your professional lives to also probably become a better technologist.
Jonan: Thank you so much for coming on the show, Kelsey, it has been an absolute pleasure. Do you have any parting words for our listeners who are out there struggling every day in this complex world?
Kelsey: I would just say challenges come and go. A lot of times when looking at a challenge, you can definitely get depressed about it, complain about why it’s so bad. But given the fact that life is considered too short for so many, I look at these challenges as an opportunity to overcome something. So in your professional life, maybe you see this blitz of technology: how can I keep up with all these observability tools, all these patterns, all of these words? And the truth is, at a fundamental level, most of the stuff is the same as it always was. Most of the tools I’m seeing now are no different than most of the tools I saw 10 or 15 years ago. Granted, they may have better UIs. They may have better workflows, but fundamentally it’s roughly the same. And I think you can take comfort in that, meaning: be patient, learn what you can, go as deep as you can. And more than likely, what you’re learning now will be applicable in the future, and we just call that experience. So I think a lot of people should just make sure that they understand they have control over the pace of information they let in. You have control over the pace of things you choose to adopt, and you can also take a break and it will be OK.
Jonan: I really appreciate that perspective. Thank you again for coming on the show. It has been an absolute pleasure, Kelsey. You have a wonderful day.
Kelsey: Awesome. Thanks for having me.
Jonan: Thank you.
Thank you so much for joining us for another episode of Observy McObservface. This podcast is available on Spotify and iTunes and wherever fine podcasts are sold. Please remember to subscribe so you don’t miss an episode. If you have an idea for a topic or a guest you would like to hear on the show, please reach out to me. My email address is email@example.com. You can also find me on Twitter as @thejonanshow. The show notes for today’s episode along with many other lovely nerdy things are available on developer.newrelic.com. Stop by and check it out. Thank you so much. Have a great day.