We are joined today by Sally Lehman. Sally has spent the past eleven years as a Site Reliability Engineer. She joined Oracle as part of the AddThis team and managed more than 10,000 nodes at the same time.

Sally talks about what she likes about using Kubernetes as opposed to YAML and Salt; where she thinks Kubernetes is currently, and where she sees it going in the future; Prometheus, an open-source monitoring solution; and Thanos, a CNCF incubating project.

Additionally, Sally, who now works primarily in Python, compares this community with other programming language communities around the Web, such as Ruby and Go.

Should you find a burning need to share your thoughts or rants about the show, please spray them at devrel@newrelic.com. While you’re going to all the trouble of shipping us some bytes, please consider taking a moment to let us know what you’d like to hear on the show in the future. Despite the all-caps flaming you will receive in response, please know that we are sincerely interested in your feedback; we aim to appease.

Jonan:

Hello and welcome back to Observy McObservface, the observability podcast we let the internet name and got exactly what we deserve. My name is Jonan. I'm on the developer relations team here at New Relic, and I will be back every week with a new guest and the latest in observability news and trends. If you have an idea for a topic you'd like to hear me cover on this show, or perhaps a guest you would like to hear from, or maybe you would like to appear as a guest yourself, please reach out. My email address is jonan@newrelic.com. You can also find me on Twitter as @thejonanshow. We are here to give the people what they want. This is the people's observability podcast. Thank you so much for joining us. Enjoy the show.

I am joined today by my guest, Sally Lehman. How are you this morning, Sally?

Sally:

I'm great. Thanks. How are you?

Jonan:

I'm doing pretty well. I am excited to talk about some nerd things, but I have to admit to our listeners why we started a little late this morning. I just went to start recording, reached for my mouse to push the record button, and dumped my entire cup of coffee over my keyboard and mouse and chair and everything. Sally very graciously agreed to wait for a moment. So, coffee errors aside, let's talk about you, Sally. You started out in software some 20 years ago or something, right?

Sally:

20 years. Well, I mean, it seems like that. I was doing JavaScript with my dad when I was 15, but I didn't do software development until I graduated as a failed electrical engineer in 2008 and couldn't find what I wanted. So I found a software job, and I really loved it, so I kept doing it.

Jonan:

And what was your first software gig?

Sally:

And that was 20, yeah, that was 20 years ago. Being 15 was about 20 years ago. [chuckles]. My first software job was at Mad Mimi, which was an email marketing company that eventually got acquired by GoDaddy. I was hired originally as technical support, but it was really, really difficult for me not to get into the technical stuff. So I weaseled my way into eventually being the manager and deployer of all the email systems for mailing there.

Jonan:

And you eventually ended up at another company with a very memorable cat logo, doing email guru work, right? GitHub.

Sally:

Yeah. That's right. So I worked on notifications there and lots of things with regards to DMARC, DKIM, SPF, routing. It was in the earlier days.

Jonan:

SPF routing is probably a thing that I should know but I have no idea what that is.

Sally:

[chuckles]. SPF and routing, DKIM and SPF, DMARC: these are all things that help validate that the email you're getting is a real email, that it is from the person you expected to get it from, and that it has the content they meant to send you.

Jonan:

So it's not been intercepted and had something injected. Is this kind of like a hash of the email, to make sure that it got to its destination from the right origin? It's not signed, right? Because emails were unencrypted.

Sally:

So SPF is a DNS record that you compare against the server the message is actually coming from, and DKIM is a signing of the message itself.

Jonan:

Okay.

Sally:

And it also has a component in DNS but you're just trying to compare the email with what the originating mail server says should be there. And DMARC is a combination of SPF and DKIM, [chuckles] there's a lot there.
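To make the SPF half of this concrete, here is a minimal sketch in Python, assuming a simplified record that uses only ip4: mechanisms. The receiving server looks up the sending domain's TXT record (the DNS lookup itself is omitted here) and checks whether the connecting IP falls inside one of the listed ranges. The record and addresses are hypothetical examples from the documentation ranges; real SPF evaluation, as specified in RFC 7208, also handles include:, a, mx, qualifiers, and more.

```python
import ipaddress


def spf_ip4_allows(spf_record: str, sender_ip: str) -> bool:
    """Return True if sender_ip matches an ip4: mechanism in a simplified SPF record.

    Simplification: only "ip4:" terms are evaluated. Qualifiers (+ip4:, -ip4:),
    include:, a, mx, redirect= and the final "all" policy are ignored here.
    """
    terms = spf_record.split()
    if not terms or terms[0] != "v=spf1":
        raise ValueError("not an SPF record")
    ip = ipaddress.ip_address(sender_ip)
    for term in terms[1:]:
        if term.startswith("ip4:"):
            # ip_network accepts a single address or a CIDR range; strict=False tolerates host bits.
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return True
    return False


if __name__ == "__main__":
    record = "v=spf1 ip4:192.0.2.0/24 -all"  # hypothetical record for a hypothetical domain
    print(spf_ip4_allows(record, "192.0.2.10"))
    print(spf_ip4_allows(record, "198.51.100.7"))
```

DKIM works differently: instead of checking where the message came from, the receiver verifies a cryptographic signature in the message headers against a public key published in DNS.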

Jonan:

There is a lot there. There is a lot more to sending emails than I previously thought.

Sally:

Yeah.

Jonan:

I understand that it's difficult. I remember trying to write HTML emails back in the day and being mystified that there's this entire subset of HTML that you can use in emails. Then I went down that rabbit hole at my first company: I was at LivingSocial when I first got out of [inaudible] school, LivingSocial had a lot of email going on, and the people we had working on the email problem were intense. There's a lot to getting that deliverability number high, making sure that your email is getting to people's inboxes. So you did email for a while, and you were always interested in the techie things, but specifically infrastructure, it seems like?

Sally:

Yeah, at GitHub, that's the first time that I really started getting into configuration management. I had done it a little bit at Mad Mimi, but it was more Ruby scripting and hacking the MVC arc there where it made sense. But at GitHub, I was put on an operations team; I wasn't there doing things all by myself. It was when Puppet was newer, but they had put a lot into it, so I got to see it working fairly well and some ways that it didn't work as well. That's when I got into the configuration management arc. So now I've been through Puppet, Chef, Ansible, and Salt, and I'm moving on to Kubernetes as the next step.

Jonan:

I've heard of this Kubernetes stuff.

Sally:

[chuckles].

Jonan:

I hear that everyone needs Kubernetes all the time. It's the answer to all things. I actually quite like Kubernetes. As for the steps before it: Puppet, I remember, was very popular with the Ruby community back in the day, because it was essentially Ruby, and then over time it branched off into its own DSL.

Sally:

Right.

Jonan:

It got a little more complicated every day. And then we saw projects like Ansible show up, and then Salt. I've not heard of Salt.

Sally:

Salt is Python-based. It is YAML-happy everywhere. You kind of have to know how Salt works on the backend to understand where, like you are pelted by this huge directory of YAML files, you're like, how does this go [chuckles] together? But it definitely can be faster than Puppet or Chef, the way the messaging system works. I've used it a lot. I'm used to it now. I like it.

Jonan:

So we have Puppet, Chef, Ansible, Salt, and Kubernetes as the potential options, all of which are still used by many, many people. Many of them have YAML at their heart; the Yet Another Markup Language thing is pretty popular. Since Kubernetes exploded, I feel like Kubernetes is YAML on YAML on YAML, but at the same time, they keep adding these layers of abstraction to write all of the YAML for you.

Sally:

Yeah.

Jonan:

So people like to joke about it, but they don't actually have to hand-roll their YAML files that much, do they?

Sally:

In Kubernetes?

Jonan:

Yeah, in Kubernetes.

Sally:

Oh, yeah. So what I like about Kubernetes, as opposed to the YAML in Salt, is that the files do have labels that you can recognize and some structure that is not entirely customized to your situation. You're able to define your deployment and your metadata; there are certain things that people expect in your YAML. And yeah, the levels of abstraction that Kubernetes is taking on with Helm, and then any orchestration you do on top of that with your system, that's definitely something that I'm getting used to, [chuckles], but I do appreciate that there is some labeling system in Kubernetes.
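As a rough illustration of the predictable structure Sally describes, here is a sketch that builds a minimal Deployment manifest as a Python dict. Every manifest carries the same top-level fields (apiVersion, kind, metadata, spec), and the Kubernetes documentation recommends a shared set of labels prefixed with app.kubernetes.io/. The app name, image, and version used below are hypothetical placeholders.

```python
import json


def make_deployment(name: str, image: str, version: str, replicas: int = 2) -> dict:
    """Build a minimal apps/v1 Deployment manifest as a plain dict."""
    # Two of the recommended labels, applied to both the Deployment and its pods.
    labels = {
        "app.kubernetes.io/name": name,
        "app.kubernetes.io/version": version,
    }
    # The selector must match the pod template labels; use a stable subset
    # (no version) so a version bump does not change the selector.
    selector = {"app.kubernetes.io/name": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": selector},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical app "web" running a hypothetical image.
    print(json.dumps(make_deployment("web", "example/web:1.0", "1.0"), indent=2))
```

Because kubectl accepts JSON manifests as well as YAML, the printed output could be piped to kubectl apply -f - without writing any YAML by hand.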

Jonan:

Yeah, the labels confuse me. I was talking to a guest the other day, Bobby Tables, he works at a company called FireHydrant and...

Sally:

[laughs] Bobby Tables huh?

Jonan:

Yes, Bobby Tables. His real name is Robert Ross, but the community has known him as Bobby Tables for a long time because of the xkcd comic. I will explain for those listeners who maybe haven't seen this comic; you should look it up, it's funny. A mother names her child after a SQL injection attack and then gets a phone call from the school: Bobby Tables is in trouble again because he's dropped the whole database. So I was talking to him about the labels thing, and we were discussing the fact that the labels seem kind of out of control right now. There's no standard really for how things get labeled in Kubernetes; you have app.kubernetes.io labels in some cases, maybe you're not using Kubernetes, and there are a bunch of alternate engines that do similar things. And I'm not someone who is actually that great on the ops side of the house. When I worked at Heroku, I owned a lot of that stuff, but I've never been titled as just DevOps or SRE. So I'm like, oh, here's a bucket of labels, and I treat it like a tag cloud and just start throwing things in there. I'm like, maybe I'll care about this later.

Sally:

[laughs].

Jonan:

Would you think that there is a future where there's some kind of standard around the labeling stuff in Kubernetes? That was what Bobby was suggesting might come to pass or hoping, I guess.

Sally:

I think it would take a concerted effort, and that is not the pattern that tends to happen in the operations world; you tend to have more entropy rather than less. So, yes, I hope so too, but my whole career has been a series of recognizing patterns, ones that are somewhat related to others that I've seen before. If a standard comes out and I can be a part of that too, that's great. But [chuckles] the reality is not what we would hope.

Jonan:

We are now in a time with Kubernetes where it's at this place on the hype curve: it was very, very popular, the KubeCon conference has doubled in size three or four years running, and suddenly every company in tech is walking up to their ops teams and saying, you know, we need to get some Kubernetes on everything, if you could just slap a little on top.

Sally:

[chuckles].

Jonan:

And we are in this place now where the ecosystem is moving so quickly that it's very easy to let the ground slide out from under you, and you don't build robust ecosystems so much as you build duplicated technologies. Do you think that's going to calm down? There are people who suggest that on the hype curve, something comes out, it's the new hot thing, and then it drops down, and when it comes out of that valley, it becomes a healthy and stable technology. Where do you think Kubernetes is? And where do you think it is going?

Sally:

Well, given that I tried just about every other configuration management language before I put my feet into Kubernetes [chuckles], I was doubtful at first. It has taken me a while, because it seemed to be the new hotness for developers who didn't want to do infrastructure. But as the operator has matured, as large companies have been running lots of it in production for a very long time, and as I see how it glues the components of infrastructure together in a structure that everyone can adopt, follow, and understand, I believe that it is not going to drop off quite that sharply. I believe that it will keep going. I think that there will be other systems that may gain energy beside it, but I think it is contributing something to operations and infrastructure that wasn't there before.

Jonan:

I would agree with you. I think it is drawing the two worlds together, where developers are more able to understand what's going on with their infrastructure and to request the specific resources that they need for their applications. It forces me as a developer, when I write my app, to think about those things: how much memory am I going to need? How are we going to handle this process? Instead of, and many people still live in this world, though I've actually never worked this way, a world where you build your application and then chuck it over the fence to the ops team, who's like, whoa, we've got to keep this nonsense up in production forever?

Sally:

[chuckles].

Jonan:

That doesn't seem fair. Do you think that there's some degree of contention? I don't want to say animosity, because I don't feel like there's a lot of animosity. It's not like the Ruby versus Python fights that I have been acquainted with recently when I go to Python conferences. I don't feel like Ruby has any beef with Python, but Python sometimes has beef with Ruby. I've had some people just turn and walk away from me when I said, I used to write Ruby and now I write Python. I know, it's heartbreaking. But the developer versus DevOps thing is a bit of a thing. Ops people are in a situation where they have to maintain this [inaudible] code that some developer threw over the fence, and of course they're going to be resentful sometimes.

Sally:

I think Kubernetes, if designed right and implemented right, can be a solution for that, because ultimately that is not the framework that we can be in anymore. The framework that you described is from 10, 15 years ago. Now my job is to provide a platform as a service, and a lot of those minutiae decisions are actually going to be decided by the developer when they commit their configuration in Kubernetes on my platform. If you want to scale a company, those sorts of things have to be moved out to a degree. Ultimately, I know how the infrastructure works, and I'm going to help you be there. But my job is to give developers the chance to not have to go through me to get what they want.

Jonan:

Yes. A world where you have to go and submit a request to the DevOps team two weeks before you get your server back and you forgot to have something installed and then you go back through the process again, right?

Sally:

[chuckles].

Jonan:

That's the thing that I see very commonly with companies who are late to the cloud: we have a bunch of servers in our colos around the country, and then, oh, this cloud thing is catching on, let's get on that bandwagon. But it's similar to putting an agile software development process on top of an existing waterfall structure, where you end up with architecture review boards, where you as a programmer take your code before the eight-person committee that usually sits at a desk kind of elevated up there, and you say, please may I submit this? That kind of approach to building software has, I think, proved pretty ineffective. And you and I have both been fortunate, I think, to work with companies that have fairly modern processes. You worked at GoDaddy, and then you went to Oracle as a site reliability engineer. I want to hear a little bit about what that's like inside Oracle. What is SRE like at Oracle?

Sally:

Well, it's such a large company. There are 120,000 employees, I think.

Jonan:

Oh my God, Larry is going to be able to buy so many swords.

Sally:

[Laughs] So I joined as a part of AddThis, which was acquired by Oracle. The main selling point there for me was the scale; I hadn't gone up to 10,000-plus nodes yet. And they were starting OCI (Oracle Cloud Infrastructure), which is a competitor to AWS; they'd hired a bunch of AWS folks and built their own thing. AddThis was acquired along with three other companies, so I didn't actually get to see the internal workings of Oracle as the old company it is, with all of its older enterprise products. I was sort of in a cluster of startups inside Oracle, which was its own bird [chuckles]. I learned a lot, and I got the scale experience that I wanted, so that was good.

Jonan:

Ten thousand nodes you had to manage at the same time, that's a lot of computer-ing. In this context, because I'm kind of a newbie to this space: when we talk about a node, this is very rarely an actual machine; it's a VM on a machine. And of course, the number of those things that we run is dependent on what we have available for resources. But this is still a significant amount of computing power. If you were to express that in CPUs and RAM, do you have a guess at how much was part of this thing?

Sally:

[chuckles] Header and above? A lot. [laughs]

Jonan:

A lot. Massive. Like you could put a couple of GPT-3s on there for sure. Have you heard of this thing? The AI? There's an AI that wrote a paper to the scientists who were criticizing it about why they were wrong. Like people have been using this AI too.

Sally:

[Laughs].

Jonan:

Very interesting. Okay. So let's talk about Prometheus and Thanos. I originally invited you on here to talk about these two projects, because I feel like they're pretty central to the stuff that's going on with the CNCF right now. Kubernetes, being the very popular thing that it is and being picked up by every marketing team in the country, suddenly has a lot of oomph behind it. And then we have companies like Prometheus and Thanos. Prometheus is the metrics getter: the application that goes out and polls your individual apps, gathers their metrics, stores them internally in a data store that they themselves advise you not to use, and then passes them off to some backend where you can put Grafana or whatever visualization layer you want on top. This is my layman's summary of Prometheus. What did I miss?

Sally:

I feel like you got pretty close to what [chuckles] it is there. I do want to say, they're not companies. There are related companies, but they're open source projects that are part of CNCF. So there's no Prometheus company and there's no Thanos company.

Jonan:

That is a really good distinction to make, and something I want to talk about specifically, because I feel a lot of feels. As a developer, I have a lot of opinions, and I'm sure that you do too, about this open source thing, where we are right now, as we have kind of always been since Red Hat, looking for a way for open source to make money. A very common model these days seems to be to produce a popular open source project, then build a company alongside that thing and sell enterprise services, occasionally with an open core type model. I want to hear your thoughts on that. What is your favorite way for a company to make money with open source? And which one is going to last? Do you think the open core model is going to be successful?

Sally:

I think ultimately yes, because there are always going to be people who do have time to spend on things they are not directly getting paid for, and a few core people who are able to make money off the consulting fees. Everyone being able to see the same code means that it grows and develops in a way that would never happen inside a company, because there's too much turnover everywhere for projects to develop, most of the time, to the degree that they could. So I think it's going to stick around. Whether I think it's a healthy model? No, because it tends not to be super great for people who don't have free time to do their own projects. It often lacks some of the protections that people might have within a company that they need to succeed. But it is what is working right now. I hope there are better iterations in the future, but this is what we have right now.

Jonan:

This is the best that we've come up with so far. And I do like your point about the projects that exist in the ecosystem, projects like Prometheus and Thanos, for a single company to have ever built that, is a pretty impossible dream when I think about it. The systems that we have that are that scale, like these huge bits of software written inside of companies, Kubernetes being one of them was originally written inside of Google. When it came out though, Kubernetes was a very different beast than what it is today. A lot of rough edges. And now it is this polished and functional and very useful software project. The part where people are trying to make money off of it often goes terribly wrong.

Gatsby JS has been the buzz on Twitter this week; there are a lot of people really frustrated with some of the things that have gone on inside that company, quite reasonably, in my opinion. One of the problems Gatsby now faces is that if the community of software developers at large abandons your company and your project because they don't like the way you do business or the choices you make, then you're out of business. There's a risk for business owners: if Gatsby can't get the community to continue writing code that they're taking and then selling for money, their company could die. I feel like that creates a situation where developers maybe have more power than they realize right now; we get to steer the ship to some degree. So within the context of the CNCF ecosystem, with Prometheus and Thanos being projects that you are well familiar with, do you feel like those are being steered well, or are they experiencing undue influence from corporate interests?

Sally:

No, I think they have really good, really committed leadership. There are several members on the directing board who are very interested in inclusiveness, and there are others who have driven the architecture of the project in smart, dedicated, insightful ways. So corporate is where it's getting its adoption, but that's just because it's a good product. I don't see terrible ads during the conferences; there are no talks where I'm getting frustrated because they said nothing...

Jonan:

That it was a marketing message, right? They got up there and did an advertisement. I work in developer advocacy, and I am very commonly assumed to be that person. I have a pretty hard-and-fast rule with companies I work for: I'm not going to talk about your product. If I'm accepted to speak at a conference, if I put a CFP into a conference, that is not for you to send marketing through me. I'll put your logo on my slide; that's your hype.

Sally:

Yeah.

Jonan:

So the projects Prometheus and Thanos are well led and have a good technical direction. Prometheus is maybe well known for a specific issue where they go off the beaten path a little bit around the YAML files and the injection of variables inside a YAML file. In my Prometheus YAML file, I don't have any kind of templating or interpolation syntax I can use, when normally it's just like the bash thing: you put a dollar sign and your little curly braces and you get the value. And there's a long issue that I encountered when I was trying to solve this problem. I was learning Docker; I worked at Heroku for a long time, so I don't have a lot of Docker knowledge. I learned it pretty recently, in the last year, and I put together my Docker Compose file, and I have this huge, crazy system, because I only build ridiculous things that are over-complicated. One of the parts of that is: okay, put the IP addresses of these instances or these nodes into this file, or put the names, the app names, but I wanted them to be flexible so I could add new ones and change them. I wanted to inject those into my Prometheus YAML file, and there's this long thread on this issue that says, we're not going to do that, because sometimes people will then put secrets into environment variables. That's what I'm talking about: I'm trying to inject environment variables. And because someone might put an API key in a variable, we're not going to let them use variables in a YAML file. We voted. The issue is closed; there is no appeal process.
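One common workaround for the limitation Jonan describes is to render the config file before Prometheus ever reads it. Below is a minimal sketch, assuming a hypothetical SCRAPE_TARGETS variable and a hypothetical template kept alongside a Docker Compose setup; it uses Python's standard string.Template, not any Prometheus feature. Relabeling replacements such as $1 must be written as $$1 in the template, because string.Template treats $ as its own placeholder marker.

```python
import os
import string
from typing import Mapping, Optional


def render_config(template_text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Substitute $VAR / ${VAR} placeholders before Prometheus sees the file."""
    values = dict(os.environ if env is None else env)
    # substitute() raises KeyError for a missing variable, so a typo fails loudly
    # instead of producing a silently broken prometheus.yml.
    return string.Template(template_text).substitute(values)


# Hypothetical template; "$$1" would be emitted as a literal "$1".
TEMPLATE = """\
scrape_configs:
  - job_name: app
    static_configs:
      - targets: [${SCRAPE_TARGETS}]
"""

if __name__ == "__main__":
    print(render_config(TEMPLATE, {"SCRAPE_TARGETS": "'app1:8080', 'app2:8080'"}))
```

The trade-off the Prometheus maintainers raised still applies: anything rendered into the file, secrets included, ends up in plain text on disk.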

Sally:

[Laughs].

Jonan:

Closed thread.

Sally:

Yeah. So you probably encountered one of the people who are more technical rather than community-minded there, but there are both. Prometheus has a really interesting learning curve, and it's a smart learning curve too. The barrier to getting Prometheus running is low, low like Python, but when you get into it, when you have to actually configure it for scaled systems, when you have to write your alerts and your recording rules, then you get into the labels. It's meant to be flexible; it's meant to be very well thought out. But I think the documentation is a bit wanting, because it assumes a level of knowledge of the underlying architecture and assumes that people will go look at the code when they want to understand something, which there are lots of people capable of doing. I think that once you really get into using it at scale, in production, there's some room to improve.

Jonan:

Yes. I think that is a fair statement. I was particularly frustrated with Prometheus in that moment, and I said mean things on Twitter about it, as I'm wont to do. Well, actually, I try not to be mean on Twitter, but I was really frustrated that every other project I'm working with can do this thing. I understand the mentality behind that choice, but I don't understand the lack of an appeal process for some of the community things around it. I do appreciate very much that Prometheus is able to operate at an extreme scale because they make decisions, maybe not like that one, but similar decisions. They have an opinion, and it allows companies to run these monstrous installations of Prometheus because they've already defined where you're going with the thing. The documentation problem is a real thing. I feel like the CNCF projects in general are all moving too quickly to keep up with their own documentation.

I have been looking at Prometheus now for many months and don't know PromQL; I rip it off from other people's dashboards and things.

Sally:

[chuckles].

Jonan:

The fact that we have three hand-rolled components inside Prometheus: you have the PromQL piece, then you have the text-based metric format, and then there's a separate protocol format between Prometheus and third-party adapters on their way to data stores. If I put, for example, New Relic as my backend data store behind Prometheus, then New Relic writes an adapter for the Prometheus wire format, which is actually, it's not protobuf, but it's some kind of compressed something. You end up with these projects that have all of these different kinds of, I don't want to say home-rolled, because it makes them sound amateur; they're good protocols and things. They're just new. It's, we're going to start over and do this thing. It's sort of like inventing JSON or YAML again in some cases. Do you think that Prometheus as a project is going to head towards the standards that exist? I know that they're looking at using the protobuf thing, but OpenTelemetry, for example: are they going to try and move towards that?
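For a sense of what the text-based metric format looks like on the wire, here is a sketch that renders one metric family in the Prometheus text exposition format: a # HELP line, a # TYPE line, then one sample per label set, with label values escaped for backslashes, double quotes, and newlines. The metric name and labels are hypothetical; in practice an official client library such as prometheus_client produces this output for you.

```python
from typing import Dict, List, Tuple, Union

Sample = Tuple[Dict[str, str], Union[int, float]]


def escape_label_value(value: str) -> str:
    # Label values must escape backslash, double quote, and line feed.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def escape_help(text: str) -> str:
    # HELP text escapes only backslash and line feed.
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def render_metric(name: str, metric_type: str, help_text: str, samples: List[Sample]) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {escape_help(help_text)}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{key}="{escape_label_value(val)}"' for key, val in labels.items())
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The format is line-oriented, and the last line must end with a line feed.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "http_requests_total", "counter", "Total HTTP requests.",
        [({"method": "get", "code": "200"}, 1027), ({"method": "post", "code": "500"}, 3)],
    ), end="")
```

This plain-text format is what Prometheus scrapes from each app's metrics endpoint; the remote-write path to external backends is a separate protocol.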

Sally:

Well, given that they're part of CNCF, I think so. I don't know where the current state is, but I would assume so. I see enough collaboration between the developers and leaders of Prometheus and CNCF in general that I think that is where it's going. Where they can have collaboration and integration, I think they will do that.

Jonan:

And on the collaboration and integration component, you have described the CNCF ecosystem as similar to the Python community. Am I mistaken, or have you written Ruby and Python?

Sally:

Yes.

Jonan:

And you are now working primarily with Python.

Sally:

Go, but Python.

Jonan:

With Go. Earlier in your career, you were using Ruby, then you were working with Python a lot, and now you're working mostly with Go, which is the language of these projects. Most of the infrastructure people I know are working with Go these days. Maybe give me a comparison of the communities. I feel like the Ruby community, for me, has always been the most welcoming and open, and Python has been as well, unless I wear a Ruby shirt, and then Go has been hard for me to relate to sometimes. When I go to a Go conference, it's a different vibe, and it may be just that I don't understand it because I'm not deep in the community.

Sally:

Well, my experience in the Ruby community was incredibly warm and incredibly friendly and I just loved that language because of the syntactical sugar and just how open people were, but I did hear that other people had problems. Python, of the three, I think it has the largest, most welcoming community, the language I believe is the most accessible of all of them, and the logic of why it does things is set out very well.

Jonan:

That explicit over implicit promise that Python makes. Python is very attractive to a certain set of programmers, and I like that idea very much. The part where a bit of software that I have written myself is doing things that I didn't tell it to do can be frustrating and confusing. With Ruby, I am inclined to say, when you feel like Ruby is magic, you should just read the source code; the part that you're saying is magic is the code that you decided not to read. But Python and Go maybe have in common that they're very explicit about what is happening where. In Go, for example, you have this "if err != nil" check that peppers your code, and, speaking of syntactic sugar, I feel like there could be one of these special characters that we use on our keyboards for it. Just put a tilde there and say a tilde means "if err != nil". I want some syntactic sugar for those. Most people use IDEs to do that. The ethos of Python, that explicit over implicit, do you feel like that's Go's approach as well?

Sally:

Well, your first question was about community. And in general, I do not see the Go community as being as friendly as the others. And I don't...

Jonan:

Let's go back and answer that...

Sally:

[laughs].

Jonan:

... You were talking about the community and Go,

Sally:

Aha.

Jonan:

I feel like in Go, I have not found that immediate welcome, and it may be that it's a newer community, it may be that I am not a 10-year career veteran of the DevOps field. What is your perception of the Go community versus Python versus Ruby?

Sally:

So the Go community, I think its problem is that it is based in thinking about architectural decisions, and the accessibility to outside parties came later and they stuck it on. Especially in the beginnings of Go, the communities that were interested in it were people who had been doing this for a very long time. So it's starting to open up; I mean, it has a cute logo and all, but.

[laughs- Jonan and Sally].

Sally:

I don't think that the Go community has the same warmth as the others. I mean, CNCF is, I think, one of the best examples of where it's doing that better, but it still has a ways to go, because there's definitely an element of RTFM, read the fucking manual, in there. [chuckles]

Jonan:

I agree with you, and I think that it might be a function of where people encounter Go in their career paths, because I find Go very useful for writing command-line utilities, where I'm trying to pipe I/O from some process into something else and monkey with what's going over the wire. But that by necessity implies a lot of background knowledge about the system. I know how to operate in a Unix environment, okay; I'm sure that you out-hack me on the command line, [chuckles]

Sally:

[laughs].

Jonan:

But I do all right and I understand how these systems like Kubernetes work at least in abstract. So then Go becomes more valuable to me as I acquire that knowledge. If I'm just starting out in software and I'm pretty sure you might agree with me, I would not recommend Go as a first language for anybody.

Sally:

Unless you have a lot of emotional intelligence and patience, no I wouldn't.

Jonan:

Yeah. And it's not that Go is a bad language. It's fantastic, it's very fast, it's very effective at the things that it does, it's in a lot of really useful projects, and you should definitely learn it as a second language, after Python maybe, or Ruby.

Sally:

Yeah, yeah.

Jonan:

Yeah. I have to throw Ruby in there. I do know that the Python community is significantly larger, but the bit that Ruby got right, in my opinion, was its opinionated approach to development. We talked about Prometheus having opinions as a project. I like that; I appreciate very much when the people who own a project go deep. If you were, for example, to write some software to manage the DKIM pieces of email, and whatever else exists now, I don't even know, you taught me all of them, or to work with email in any way, I would in a heartbeat take Sally's opinions to be the correct approach and not investigate myself. In the same way, as a developer, if you recommended a set of headphones to me, I would just implicitly trust that recommendation. More than any marketing message I hear from any company, I believe Sally has the right ideas and I would trust your opinion. I love that about software projects. I like that people take opinions and they take work off my plate. I don't have to dive deep into all of the things that are involved in these spheres. I let the experts do the work. That opinionated approach has maybe spawned projects like Thanos, that choice of "this is the way that this piece should be done." Do you want to tell me what Thanos does? Because I feel like I have a pretty poor understanding of what Thanos is.

Sally:

It sounds like you kind of do, in that you were saying, here's the remote storage system that they want you to use [laughs], and they don't want you to use [laughs] the one that's built into Prometheus. So that's a good summary there. Thanos is a sidecar that uses the backend of your choice, such as S3, and allows you to have long-term and remote storage with Prometheus. In the beginning, when Prometheus decided that it needed some sort of backend, there were two options: you had Cortex and you had Thanos. There was this war [laughs] between the two. And I think Thanos won out because running Thanos as a sidecar on every single Prometheus instance was a more distributed, better approach than Cortex, which had... I mean, it's not like Cortex has lost, and really what they're trying to do now is integrate the two. The two ways of doing storage are different in how they store the information that is sent to them.

Jonan:

The backend. So the S3 part of the Thanos equation is that I've got this flat-file store, but Thanos is not designed to replace my database, for example. Like, when I load up my dashboard in Grafana, I'm not expected to be reading directly from Thanos, or am I? Is it long-term storage, like a backup on Amazon Glacier? Or is it a thing that I'm reading from and writing to constantly because it's that fast?

Sally:

I am not sure what the ideal architecture is there, I'm still working that out myself. So when I figure that out, I will tell you. [laughs].

Jonan:

I'm pretty sure no one actually knows at this point. Before this, I looked up diagrams of how people have integrated Thanos, and it's kind of all over the place. There is evidence to suggest, in my opinion, that this sidecar approach, rather than having a single point of failure, is the way to go. That horizontal scalability is what makes projects like Cortex and Thanos popular: the ability to go wide instead of being trapped in a single instance. As you grow into the cloud, staying on a single instance is a surefire way to make your DevOps team quit the company. I'm thinking of some of the Prometheus setups I've seen. I'm trying to think of the name of the company, it was some dating app. I met someone at a conference and they said, well, we've got these three Prometheus instances, and we just run them on bigger machines with more RAM when they get to the top of their capabilities. But there's no reason for that. I mean, it was a huge app, I think it may have been Tinder, and they've got these three monster Prometheus instances back there working in tandem, and their failover plan is like, hey, maybe all three won't go down at the same time. Why would you build it that way? Doesn't it make sense to get [inaudible] [laughs] as many of these as you can? Right?

Sally:

Right, right. And have Thanos as the backend. In version one of Prometheus, the way of scaling was federation, where you have a lower tier that is scraping at the node level, and then a series of rules that allow the higher levels to scrape, in a way to sample up, just the labels you wanted to look at. That brought the lower-level Prometheus instances together. But now there's really no reason to be doing that most of the time, and you can use Thanos as that aggregator.

Jonan:

And so Thanos is this aggregator that lets me store all of my data in S3 or some other flat-file store in the background. But do I also use something like the Telemetry Data Platform from New Relic, or one of the other databases? Uber has a project called M3. These are the things where I'm putting all of my data and then giving the entire company access when they need it. I can put all my logs and metrics and everything in one place, and this is what supplants the need for a data lake or a data ocean. If anyone out there hears an executive at your company say, "I think we need a data ocean," you need to look real hard at the CNCF ecosystem before you make that choice. [chuckles].

Sally:

[laughs] That is definitely true, and I feel really bad for the people with three Prometheus instances, because of how Prometheus grabs information. Like, it is so easy to integrate more nodes and change your architecture, because most of the time you're not going to screw up the rest of the system if you add to it or change things. So I, I...

Jonan:

That's the whole point. That's why they built it the way they did, so you could just keep adding pieces.

Sally:

Right. So I hope they change that soon.

Jonan:

Since we only have five minutes left, I want to make sure that if there was anything you wanted to talk about that I didn't get to, I give you a chance to throw it into the conversation.

Sally:

I guess the thing I would say is: don't be afraid of monitoring, and do more of it. But [laughs] overall I don't think I have a stump speech that I want to give.

Jonan:

All right, well, I'm going to come at it from this angle, because there are a lot of people out there... I need to transition real fast. We were talking about Prometheus, and you had said you can scale laterally, and I had said you should scale horizontally. The scaling issues that exist in infrastructure are often mirrored in the communities that those infrastructure projects serve. I think that a community growing too quickly or, in the case of the Go community as we discussed earlier, growing from too experienced a base, so that you don't have that healthy transition, can cause problems for projects later in their life cycles that you don't see coming.

Sally:

Yeah.

Jonan:

I think it's really important right now for us to get new people into this ecosystem, into the DevOps space. In that vein, what advice would you have for people who are just starting out? Maybe someone out there is learning Go as their first language, and I applaud you, Go programmer; continue learning. What can we do to support those people? And what advice would you have for them?

Sally:

The first thing is to go to meetups and go to GitHub and look for repositories: enter a word that is interesting to you and search, and you'll probably find some repositories that are interesting to you. Look around at the community, look at a lot of different people in different communities, and engage with those who are kind to you and accepting of you perhaps being a bit slower to contribute than others might be. And accept that there are going to be some communities that are not welcoming, but there are plenty that are. So you just have to look, and eventually you'll find your niche and your people, and you'll be good.

Jonan:

You find your people and stick with them, and tell the others to take a hike. I think once you get committed to a project, it's sometimes hard to get the perspective to decide, if people are mean, "I could just leave and go on to the next project." They need you as a developer more than you need the headache of them being jerks in pull requests.

Sally:

Usually, yeah. [laughs].

Jonan:

Usually, yeah. [chuckles]. I mean, it's a big problem in the community at large: Ruby and Python and now Go. I want to applaud Go for making progress here; I think they're doing a good job. It's also just a huge community. When you walk into a Go conference, it's unreasonable to expect that you're going to make 20 friends on your first day, because you'll have a lot of people talking about things that they're excited about, and you've got to inject yourself into that conversation. Open source is a really good way to do that. Of the things you are working with right now, Thanos and Prometheus and other components in Kubernetes, what are you excited about? What's up and coming for you, or what are you excited to get involved with now?

Sally:

I am so excited about the tooling for logging, tracing, metrics, and alerting coming together in integrated systems. The CNCF stack is where I see that being most accessible for me right now, but I hope there are going to be more stacks that are similar. I hope that Loki, Jaeger, Prometheus, Alertmanager, all those things, and I hope I mentioned Grafana, mature to the point where you are able to seamlessly transition between them, so that from the user's perspective it seems almost as if it were one app. Because as a person waking up in the middle of the night, a lot of my energy goes into trying to recognize patterns between disparate systems. So where I want to see observability going is assisting you a lot more with time correlation, with pattern recognition, with intelligent sampling, because once you integrate all those systems, you have a lot of information that you have to put together. That's where I see things going, and I'm really hopeful and really excited about what that can look like in a couple of years.

Jonan:

I don't think I've heard someone express it from that perspective, where you said to make it appear as though you're working with a single application. So you have a service-oriented architecture or a microservices architecture with a hundred instances of some Rails app running, and from a debugging perspective it looks like one thing, which it should. That is a beautiful dream and I'm really looking forward to that as well. I think there are some pieces there that have a ways to go, but I think we're close. And once we have it, you're absolutely right, we get to start building intelligence on top of that, and then I will no longer get woken up by a system that's not actually failing, it's just running a little hot. And it won't take me an hour and a half to figure out that it is just running a little hot.

Sally:

[laughs] Right. Yeah, yeah.

Jonan:

You have a beautiful dream there.

Sally:

[laughs] There are products like Honeycomb that I think are on the edge of that, but there isn't an open-source component there that I know of. So that's why I say I just hope there are more products; I hope there are just more people thinking about it in that way.

Jonan:

I agree. I have run out of time with you. I've really enjoyed our conversation, though, Sally. Thank you so much for coming on the show. I want to give you one opportunity to make a prediction for the future. Do you think that there is going to be an open-source applied intelligence application, maybe? You had hinted at one of these, but make a prediction that I can hold against you a year from now, when you come back on the show and I can say, "Sally, you were..." Well, I have a feeling somehow that you're going to be right, but give us a hint: make a prediction.

Sally:

I don't think that artificial intelligence is going to be mature by that time [laughs], but I do see more integrations between the CNCF telemetry products. And I think that in the next couple of years you will at least be able to transition between all of them, through links and such.

Jonan:

I like that very much. I will probably have a hard time proving you wrong on that one, because I agree a hundred percent. That's the best part about the CNCF projects growing closer together. If people wanted to find you on the internet to listen to your brilliant thoughts about computer-y things, where would they find you?

Sally:

Thanks. My username is usually “rothsa”, R-O-T-H-S-A. So rothsa is who I am on GitHub and on Twitter, I'm @SllyLhmn. Sally Lehman without vowels.

Jonan:

Thank you very much for coming on the show. I appreciate your time very much. It was a pleasure. Have a wonderful day.

Sally:

You too.

Jonan:

Thank you so much for joining us for another episode of Observy McObservface. This podcast is available on Spotify and iTunes and wherever fine podcasts are sold. Please remember to subscribe so you don't miss an episode. If you have an idea for a topic or a guest you would like to hear on the show, please reach out to me. My email address is jonan@newrelic.com. You can also find me on Twitter as @thejonanshow. The show notes for today's episode along with many other lovely nerdy things are available on developer.newrelic.com. Stop by and check it out. Follow us on the Twitters: @ObservyMcObserv. Thank you so much. Have a great day.