Site Reliability Engineer, or SRE, is a hot job role in many technology companies these days, and Google is known for its laudable reliability. In an effort to share some of the secrets behind that success, we are thrilled to welcome Liz Fong-Jones, Google SRE Manager, to join our own Matthew Flaming, vice president of engineering at New Relic, in a discussion of best practices for site reliability at FutureStack: New York, on September 13-14.
To give you a taste of what’s in store during this informative session dubbed “Between Two SREs: An Inside View of Google’s Reliability Practices,” we asked Liz about how she became an SRE and the themes she plans to address in her fireside chat on stage with Matthew (which will be held on Day 2 of the conference, Sept. 14).
New Relic: What’s your professional background? How did you become an engineering leader at Google?
Liz Fong-Jones: I cut my teeth as a systems engineer managing dozens of physical machines for an educational nonprofit, my college’s student-run compute cluster, and an indie game studio. Upon becoming a Google SRE in 2008, I rapidly had to adapt to being responsible for the scalability, maintainability, and performance of hundreds of compute jobs running across thousands of containers (or more!).
I learned how to engineer for observability at scale, how to write robust automation, and how to lead incidents when systems go haywire. I’ve worked on eight teams over the course of my nearly 10 years at Google, taking on progressively more responsibility and technical leadership on each team.
New Relic: What does your current role entail?
Liz Fong-Jones: Today, I’m a Staff Site Reliability Engineer on the Customer Reliability Engineering team for Google Cloud Platform. My role is to educate current and future GCP customers about how to build their teams and applications for reliability. I work directly with large GCP customers to ensure that the products they entrust to our platform follow best practices for reliability.
In an ideal world, if a customer’s application that we’ve reviewed for best practices is having trouble, rather than having to wait for the customer’s alerts to fire and for them to file a support ticket, the right engineering team at GCP is paged automatically based on the symptoms observed from the customer’s side. But it’s a lot of work to get to that point, and our goal is to meet people where they are and help them along the SRE journey as far as they’re willing to go.
New Relic: Can you tell us something about what you plan to cover on stage at FutureStack: New York?
Liz Fong-Jones: In our fireside chat, I’m hoping to share some lessons I’ve learned from teams I’ve been on, and how good design, operational, and organizational practices can make your life better and your service more stable.
New Relic: FutureStack will have hundreds of technical practitioners and business leaders in the audience. What are the big takeaways you want them to walk away with from your session?
Liz Fong-Jones: It’s hard to capture all of SRE culture into a single talk, but I want to give people at least a taste of the basics: (1) write meaningful SLOs/error budgets; (2) alert on symptoms of user pain, not potential causes; and (3) keep limits on your operational load.
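To make the first two takeaways concrete, here is a minimal Python sketch of error-budget arithmetic for a request-based SLO. It is not Google's or New Relic's implementation. The class name `SloWindow`, its fields, and the paging threshold are hypothetical names chosen for illustration. The sketch assumes the SLO is defined as the fraction of successful requests in a measurement window.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Hypothetical container for one SLO measurement window (illustrative only)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served during the window
    failed_requests: int  # requests that users saw fail (a symptom, not a cause)

    def error_budget(self) -> float:
        """Number of failed requests the SLO tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    def budget_consumed(self) -> float:
        """Fraction of the error budget already spent (1.0 means fully spent)."""
        budget = self.error_budget()
        if budget == 0:
            # A 100% target or an empty window leaves no budget to spend.
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / budget

    def should_page(self, threshold: float = 1.0) -> bool:
        """Page a human once user-visible failures spend `threshold` of the budget."""
        return self.budget_consumed() >= threshold


# Example: a 99.9% target over one million requests tolerates about 1,000 failures.
healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
burning = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
print(healthy.should_page(), burning.should_page())
```

The alert here is driven only by failures that users actually experienced, which is one simple way to alert on symptoms rather than on potential causes such as CPU load or disk usage.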
New Relic: What technologies and big ideas are you focusing on going forward?
Liz Fong-Jones: I’m particularly excited about helping teams navigate the transition from on-premises to cloud-native applications, and building both the skills and technology needed for engineers to effectively troubleshoot problems in our glorious new cloud and microservice-y world.
New Relic: What are you personally most looking forward to at FutureStack: New York?
Liz Fong-Jones: I’m looking forward to talking to people who are excited about observability and about maturing the operational practices of their teams.
Join us in New York!
Note: Event dates, speakers, and schedules are subject to change without notice.