In this episode of the New Relic Modern Software Podcast, we talk with Stephen Weinberg, director of site reliability engineering at investment research firm Morningstar.
Stephen joins me and my co-host, New Relic Developer Evangelist Tori Wieldt, in a wide-ranging discussion that covers the extreme care required to bring SRE to the financial services environment, how to grab some big wins, and hard-won lessons learned.
You can listen to the episode below, get all the episodes by subscribing to the New Relic Modern Software Podcast on iTunes, or read on for a transcript of our conversation, edited for clarity:
New Relic was the host of the attached forum presented in the embedded podcast. However, the content and views expressed are those of the participants and do not necessarily reflect the views of New Relic. By hosting the podcast, New Relic does not necessarily adopt, guarantee, approve or endorse the information, views or products referenced therein.
Fredric Paul: Stephen, for listeners who may not be familiar with Morningstar, can you give us a quick description of the company?
Stephen Weinberg: I like to say Morningstar is a 33-year-old startup. We were founded out of Joe Mansueto’s apartment here in Chicago. He saw a need for transparency in the financial services industry, especially around mutual funds. Since then, we’ve grown, put out lots of different products, used software and technology in many different ways. But at the heart of it we’re still an investment research company, and the biggest thing that we want is to put investors first. The idea is that no matter what size investor, whether it’s just in your 401(k) plan for work or whether you’re a larger investor or even an institution serving investors, you should have clear information and unbiased research.
Fredric: The idea of Morningstar as a 33-year-old startup plays into the whole idea of site reliability engineering, I think, in that it’s a relatively new concept. Can you talk about how Morningstar defines site reliability engineering?
Stephen: I have to confess that I went to Wikipedia to look up the standard definitions for these things. I’ve heard them used so often, and sometimes it makes absolute sense and sometimes I’m very confused how people are talking about it.
So, citing Wikipedia, DevOps was defined as a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production while ensuring high quality. So really, DevOps was about speed.
SRE, to me, is a framework for how to do DevOps, but the focus with SRE is on the quality portion of that. The idea is that if you have quality and if you plan for scalability, if you plan for efficiency and resiliency, then speed is a natural byproduct of that. Even Facebook has moved away from its old slogan of “move fast and break things.” That idea of quality has become a key component in a lot of different areas.
For us, the biggest piece in the SRE puzzle, philosophically, has been the idea of creating feedback loops. How do people know that the product is working? How do people know that they’re improving the product? How do we know we’re moving in the right direction? And keeping that as simple as possible.
Fredric: Given all that, what does the SRE practice look like at Morningstar? Do you move fast and not break things?
Stephen: We’ve been focusing on transparency, just like when the company was founded the idea was transparency around financial investments. How do you know you’re investing in the right thing? How do you know your investment will do well? Is it going in the right direction?
Now it’s applying that in terms of software, the idea of how do we know that our software (as we add features) is not becoming more chatty, is not making unnecessary calls? That we’re not trying to call the database for static data? That we’re really taking the steps to make it so that success will not be a danger for our systems? That when we scale from 10 users to 10,000 users, it is an exciting event as opposed to everyone freaking out and trying the best they can to keep the system up?
So again, that idea of transparency, building in the monitoring, building in systems observability, figuring out how to understand how our systems are talking to each other, and how we talk to our customers about our systems.
Fredric: That’s a fairly advanced way to look at things. How did you get there?
Stephen: I came to the operation side of software development fairly recently. I cut my teeth as a database administrator and database developer. To some degree, I view the database side as one of the least developed operational models. A lot of the ideas around unit testing and mocking that have been really heavily implemented in the software engineering field lagged behind on the database side. So, as I moved on from an individual contributor role on the database side to leading software development teams, I noticed that the database was a real pain point.
As that transition was happening, we were talking a lot about DevOps here at Morningstar. It was around 2015 and Ben Treynor’s video for SRE that he had done in 2014 for SREcon was out on YouTube, and a bunch of us had watched it and started discussing it.
I was really impressed with the idea of SLIs and SLOs—service level indicators and service level objectives—the idea of an error budget, and describing a system’s health in objective ways that weren’t judgmental.
It didn’t mean that we had made bad software, it just clearly defined what we wanted and whether or not we were achieving it. The requirements in order to successfully implement these ideas around transparency of the system, designing tools, and an operations mentality that allowed us to measure accurately and consistently were really exciting.
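The error-budget idea Stephen describes can be sketched in a few lines. This is a minimal illustration with hypothetical numbers and function names, not anything specific to Morningstar's implementation: an availability SLO implies a fixed allowance of downtime per period, and each incident spends part of that allowance.

```python
# Minimal sketch of an SLO error-budget calculation.
# Numbers and names are illustrative, not from the podcast.

def error_budget(slo_target: float, period_minutes: int) -> float:
    """Minutes of allowed unavailability for a given SLO over a period."""
    return period_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, period_minutes: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative = SLO blown)."""
    budget = error_budget(slo_target, period_minutes)
    return (budget - downtime_minutes) / budget

# A 99.9% availability SLO over a 30-day month allows ~43.2 minutes of downtime.
monthly_minutes = 30 * 24 * 60  # 43,200 minutes
budget = error_budget(0.999, monthly_minutes)

# After a 10-minute incident, roughly 77% of the month's budget remains.
remaining = budget_remaining(0.999, monthly_minutes, 10)
```

The appeal, as Stephen notes, is that the numbers are objective rather than judgmental: the budget either has room left or it doesn't, with no argument about whether the software is "bad."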
Tori Wieldt: That’s cool! Tell us about some of the results you’ve gotten from that.
Stephen: The biggest thing I’ve come to realize is that incidents are my friends. When our systems go down, that’s the moment we get to learn the most. So we’ve really been looking at our incident management and problem management process and have gotten the best results there. We’ve reduced the duration of incidents by about 40% between 2016 and 2017, and we’ve reduced the frequency of incidents by about 25%.
Tori: Is this just in your organization or company-wide? Do you have plans to take it from your org out to the larger organization?
Stephen: The results around incidents are actually company-wide. It’s for our top products and all of their dependencies. Some areas obviously have more legacy software and are more difficult to implement these changes on, but this is a broad-based initiative from the company perspective around incident and problem management. We are evangelizing within the company and we are setting up guilds, different groups working together and meeting on a regular basis, sharing ideas, discussing what’s worked, what hasn’t, and what challenges we’re facing.
Whether or not we’ve all been labeled as SREs, we’ve self-identified as the ones in a technical operations role and are meeting together to describe what challenges we’re facing, and using the ideas of SRE as a launchpad for where we want to go and what we want to do.
Tori: So, can you share some of the successes and obstacles you discuss in your guilds?
Stephen: A big part is how do we get recognition for our teams and the work that they do? Historically, operations was not necessarily the first thought when developing products; the first thought tended to be feature development. With operations being a second or third thought, the teams that are working on it are often fighting the push for more and better features as opposed to being able to really advocate from the users’ perspective that the most important feature is uptime.
A lot of what I’ve been working on with my team is how do we get into the middle of that so that we develop tools that are directly usable by the customer support staff? That we get involved in the disaster recovery and business continuity efforts, in incident management, and in the problem management and root cause analysis (RCA) process itself, and hopefully we then build out tools for the developers as well.
I’ve stolen some of Netflix’s ideas around arguing from first principles and having a philosophical approach to this. Looking at the principle of reliability from the standpoint that one server going down should never have a customer-facing impact, alerts should never be ambiguous, client issues and questions should be resolved within the phone call for the client, and development best practices should be easier to implement than bad practices. We all have different ways of getting there, but we can easily agree that each of these is a good idea.
Fredric: Given that approach, what have been your biggest wins?
Stephen: To me, the biggest wins are around incident frequency and incident duration, because that’s what’s most meaningful to customers.
Additionally, for one team, we built out a tool that supports multi-factor authentication for login. It’s challenging for a lot of customers to get used to that, and they were often confused whether or not they had received a token, were they having problems with the login itself … what was happening? It was generating a fair number of support calls. The problem was that the support staff didn’t have the tools or the visibility into the system to do more than log a ticket for the developers to dig into the Splunk logs and actually figure out what had gone wrong and why.
So, we built out a tool that used the APIs to give access to specific parts of the log to the customer support staff, so that while on the phone they could handle a lot of these questions. The goal was to reduce the number of support tickets for developers by 80%. That’s an example of how we empower both customer support to do their jobs and developers to focus on their jobs.
Fredric: So, have you been successful at reducing those customer tickets?
Fredric: By 80%?
Stephen: I don’t have the exact facts, but it has been successful enough that additional teams have asked to roll out the same software. It’s catching on.
Tori: That sounds like a more holistic view, which is great, right?
Fredric: Are there any other statistics you could quote in terms of successes out of this?
Stephen: I don’t have any other statistics, but anecdotally there have been a number of small incidents, the type that often nag for a long time. They eat up a lot of the developers’ time to troubleshoot. Areas where it’s an intermittent bug that QA can’t quite reproduce. Our SRE team, to give a plug to New Relic, has been able to dig in and figure out where the issue is happening and figure out enough of the details around it to help the developers figure out where the code needs to be changed—for either a hotfix if it’s really important, or just for the next release.
Tori: Thanks for the plug for monitoring. For some reason we’re big fans of that. Tell me some of your learnings, your best practices advice for other companies because I know there’s a real hunger for that.
Stephen: I think there are dangers of getting too specific with best practices because DevOps and SRE are so widely defined. But from a broad standpoint, I think about this in terms of feedback loops.
Our company has the goal of hiring good people, and we have good people working here, and they have the best of intentions. There will be times where their job is challenging and they won’t know if they have made the right decision or not. But if we have an SRE team that develops tools that helps them to know quickly when a decision is suboptimal, and can help point them in the direction of what the best result is and how to get there, it’ll really improve the whole process. It will improve the speed of software development, improve the speed of skill acquisition, and improve communication across teams.
From a development standpoint, the architecture and road that we want developers to go down should be much easier than the road we don’t want them to go down. Often, there’s so many gateways and hoops for people to jump through to get the right sort of software or box or permissions spun up, the right firewall rules, that they start creating their own processes. If we can make the road we want them to use simple and easy—and the road we don’t want them to use really difficult and challenging—people will go the way we want them to go. Ideally, that is the road that will actually be good for the company as well.
Tori: I think a lot of companies are based on that whole idea, right?
Fredric: Well, it makes a lot of sense. Stephen, what’s the role of monitoring in making these things happen? How do you leverage your monitoring tools, and obviously New Relic as well, to get to that place?
Stephen: Monitoring is the information about the health of the system. It is all of the data. And the alerts are there to let us know when something has happened within that data that needs an action.
The browser-based tools give us insight into the client experience, and can help us spot gaps between how we expect clients to be using the system, what environment we expect them to be using, and what they actually are using—and whether we’re having bad outcomes because of that.
Then for APM it’s really how is our system using it, how is our system working? We use the different release flags so we can compare release to release to see if response times are increasing. We’re using them to see if the application is becoming more chatty. We’re using a bunch of these different features to both understand our development practices and the directionality of the system as well as when there’s an immediate need for us to intervene.
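The release-to-release comparison Stephen describes can be illustrated with a small sketch. The field names here (`release`, `duration_ms`, `external_calls`) are hypothetical stand-ins for exported transaction data, not a real New Relic schema; the point is simply that rising external calls per transaction flags a "chattier" release.

```python
# Hypothetical sketch: comparing response time and call volume between
# two releases from exported transaction samples. Field names and data
# are illustrative only.
from statistics import mean

samples = [
    {"release": "v1.4", "duration_ms": 120, "external_calls": 3},
    {"release": "v1.4", "duration_ms": 140, "external_calls": 3},
    {"release": "v1.5", "duration_ms": 180, "external_calls": 5},
    {"release": "v1.5", "duration_ms": 170, "external_calls": 6},
]

def summarize(release: str) -> dict:
    """Average duration and external-call count for one release tag."""
    rows = [s for s in samples if s["release"] == release]
    return {
        "avg_duration_ms": mean(r["duration_ms"] for r in rows),
        "avg_external_calls": mean(r["external_calls"] for r in rows),
    }

before, after = summarize("v1.4"), summarize("v1.5")
# More external calls per transaction in the new release suggests the
# application is becoming chattier, even before users notice slowness.
chattier = after["avg_external_calls"] > before["avg_external_calls"]
```

The same comparison tells you about directionality over time (is each release slower than the last?) as well as when an immediate intervention is needed.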
Fredric: Finally, I can’t let you go, Stephen, without asking one more question. You guys are in the financial services industry, which is heavily regulated, and yet think of yourselves as a 33-year-old startup—and you’re using things like SRE in this environment. How does that all come together?
Stephen: Very carefully.
That’s the pat answer. There are a lot of times when what we want to do and what we can do don’t line up. There is a lot of nervousness about private financial data being stored in the public cloud, so our speed to move to cloud providers is something that we need to be very careful with. That then limits the ease of using a lot of things, such as chaos engineering, which is a big topic that we talk about in our standups. We want to play more with it, because it’s exciting and fun for us, but it is something we need to be very patient with as opposed to just being able to jump in and go full bore into it.
Additionally, there are browser requirements that our clients often have. There are some clients where the institution they work for mandates that they have to use IE. We have to make sure that our clients’ needs come first and that we really think from our users’ perspective.
Fredric: I think that’s true for everyone but in a much more regimented way, probably, for you.
Stephen: Yes. There are more restrictions on our clients, and because we’ve been around long enough and because some of our products have been relatively successful, that success creates a set of guardrails for where we can go and what we can do without causing too much disruption to ourselves.
Note: The intro music for the Modern Software Podcast is courtesy of Audionautix.