At New Relic, the Engineering Operations team (EOPS for short) is tasked with aligning our engineering efforts, supplying mission-critical data by instrumenting our development process, and providing internal tools to keep everything running smoothly in the New Relic Product organization. EOPS is a small team (just two developers and a manager), so we use our limited development resources as wisely as possible to satisfy the needs of a company that continually strives to be more data-driven.
The decisions we make about whether to build, buy, or customize tooling require a deep understanding of the problems we’re trying to solve for the engineering teams we support. In this post, I’ll explain how we make such decisions, and I’ll share examples of some decisions we’ve made about buying, building, or customizing existing tools.
Build vs. buy: It’s not binary
We like to think of the decision to build or buy software as a sliding scale. In some companies, EOPS teams exclusively purchase publicly available services and applications and use them exactly as intended; these teams are clearly on the “buy” end of the scale. An internal tools team that develops all of its business solutions in house by creating new services for every need is, obviously, on the “build” end.
We exist on the scale somewhere in between those points.
We’ve found that a balanced approach can provide many benefits, and we carefully examine the pros and cons of solving problems at various points on the scale. This flexibility allows us to build custom services (or rely on existing engineering infrastructure) for esoteric needs and to purchase third-party services for more ubiquitous needs, the latter of which we’ll often customize.
Buy and customize: Perfecting our Jira integration
One of our first tasks after Project Upscale realigned our engineering organization was to bring New Relic’s engineering teams onto one platform for planning and tracking work. Some teams had been tracking their tasks, stories, and bugs with Pivotal Tracker; some used GitHub Issues; and others relied on Jira. Since Jira has a great API for incorporating data from other platforms, we standardized the entire Product organization on it. In this case, we realized that customizing an existing (and already widely used) service such as Jira to meet our new needs was a much better fit than attempting to build something new. We added some custom fields in the existing Jira ticket template and created a custom workflow to pull the data from tools other teams were using into the appropriate Jira project. With all planning and work-tracking data now in one tool, management and other Product organization leadership had much better visibility into their teams.
Customizing existing tools: Building the Andon Slackbot
Previously, New Relic engineering teams used the Andon system for tracking the health of production pipelines. Based on part of the Japanese Kanban manufacturing process, the Andon notification system uses red, yellow, or green lights to indicate if a given part of the production line is running smoothly. If a section’s light changes from green to yellow or to red, the appropriate people notice. Extending this manufacturing concept to software production was a great idea, but the original implementation did not meet our needs.
Critical limitations included:
- Limited visibility and alerting on Andon status changes. Since the information was buried in the depths of a wiki and there were almost 50 teams deploying services to a production pipeline, stakeholders had to watch 50 pages to be alerted of changes.
- The combined history of these changes was buried in the individual histories of each wiki page, so tracking down historical issues was quite a daunting process.
We needed an interface that would give our users faster visibility into status changes in the production pipeline. Since everyone at New Relic uses Slack, we built a Slackbot, backed by Node.js. The payoff was immediate. After we launched our prototype, users piled into the Andon Slack channel, where they could easily see status changes in real time. We also refined the visual design to make it more useful for red-green colorblind users and added a lightweight !andon command for easy access to lists and statuses.
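To make the idea concrete, here is a minimal sketch of how a command handler like this might look, written in TypeScript since the bot runs on Node.js. The subcommands (list, set), the text markers, and every name in the code are assumptions for illustration; this is not the actual Andon Slackbot, and the Slack connection itself is left out.

```typescript
// Hypothetical sketch: the subcommands, labels, and class names are
// assumptions for illustration, not the actual Andon Slackbot.
type AndonStatus = "green" | "yellow" | "red";

interface TeamStatus {
  team: string;
  status: AndonStatus;
  note: string;
}

// Pair every color with a distinct text marker so that reading a status
// never depends on telling red from green.
const STATUS_LABEL: Record<AndonStatus, string> = {
  green: "[OK] green",
  yellow: "[!] yellow",
  red: "[X] red",
};

function isStatus(value: string | undefined): value is AndonStatus {
  return value === "green" || value === "yellow" || value === "red";
}

class AndonBoard {
  private entries: TeamStatus[] = [];

  // Record (or overwrite) the current status for one team.
  set(team: string, status: AndonStatus, note: string): TeamStatus {
    const entry: TeamStatus = { team, status, note };
    this.entries = this.entries.filter((e) => e.team !== team).concat(entry);
    return entry;
  }

  // All known teams, sorted by name for a stable listing.
  list(): TeamStatus[] {
    return this.entries
      .slice()
      .sort((a, b) => (a.team < b.team ? -1 : a.team > b.team ? 1 : 0));
  }
}

// Returns the bot's reply, or null when the message is not an !andon command.
function handleCommand(board: AndonBoard, text: string): string | null {
  const parts = text.trim().split(/\s+/);
  if (parts[0] !== "!andon") return null;

  const sub = parts[1] ?? "list"; // a bare "!andon" lists every team
  if (sub === "list") {
    const rows = board.list();
    if (rows.length === 0) return "No teams have reported a status yet.";
    return rows.map((r) => `${r.team}: ${STATUS_LABEL[r.status]}`).join("\n");
  }
  if (sub === "set") {
    const team = parts[2];
    const status = parts[3];
    if (team === undefined || !isStatus(status)) {
      return "Usage: !andon set <team> <green|yellow|red> [note]";
    }
    const entry = board.set(team, status, parts.slice(4).join(" "));
    return `${entry.team} is now ${STATUS_LABEL[entry.status]}`;
  }
  return "Unknown subcommand. Try: !andon list";
}
```

Keeping the handler a pure function of the board and the message text means it can be exercised without a Slack workspace; a real bot would call it from whatever message callback its Slack client library provides.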
Because of this quick visibility, our engineers are now more eager to help investigate or solve pipeline issues, and they make sure to keep information about their teams up-to-date.
To help us combat the limitations of the original implementation of this tool, we made two key integrations with New Relic Insights:
- When users interact with the Andon Slackbot, events are sent to Insights. This helps the EOPS team debug potential interaction problems and analyze usage patterns.
- Engineers can easily access historical data and view trends for any issues in their purview—no more wading through the history of a wiki page.
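As a rough illustration of the first integration, the sketch below turns a bot interaction into a flat custom event with an eventType field, which is the general shape Insights expects for custom events. The event type name, the attribute names, and the BotInteraction type are all assumptions, and the HTTP call that would deliver the payload is deliberately omitted.

```typescript
// Hypothetical sketch: the event type and attribute names are assumptions.
interface InsightsEvent {
  eventType: string;
  timestamp: number; // Unix epoch seconds
  [attribute: string]: string | number; // custom attributes stay flat
}

interface BotInteraction {
  user: string;
  command: string;
  team: string;
  succeeded: boolean;
  occurredAt: Date;
}

// Flatten one interaction into a single custom event.
function toInsightsEvent(interaction: BotInteraction): InsightsEvent {
  return {
    eventType: "AndonBotInteraction",
    timestamp: Math.floor(interaction.occurredAt.getTime() / 1000),
    user: interaction.user,
    command: interaction.command,
    team: interaction.team,
    outcome: interaction.succeeded ? "ok" : "error",
  };
}

// Serialize a batch of interactions as a JSON array of events.
function toInsightsPayload(interactions: BotInteraction[]): string {
  return JSON.stringify(interactions.map(toInsightsEvent));
}
```

Recording an outcome attribute on every event is what would let a team chart failed interactions separately from successful ones when debugging.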
Leveraging existing engineering infrastructure to build our services
When we need to build a new service from scratch, we leverage the outstanding work done by other teams at New Relic. Using existing engineering infrastructure, EOPS can spin up new, resilient services in a fraction of the time it used to take, and we’re able to follow our own engineering best practices for UX and security while doing it.
When we need to deploy a new service, we use our in-house deployment system, which makes it easy to get multiple load-balanced instances for our application container. We’re able to specify how much memory and CPU we need for each instance, and deploying a build for various environments is as simple as building and pushing up a new Docker container for the system to deploy.
When we need to build a web interface, we use the common shared React-based UI components that make up much of the New Relic interface. We want the services we build for our internal users to look as if they were another part of the New Relic product.
When we need a database for a new service, we rely on the DB Engineering team to provide us a MySQL or PostgreSQL instance. The DB team handles the provisioning and maintenance of the database, and we don’t have to worry about the replication, recovery process, or other concerns of running a database instance ourselves.
Using all this available infrastructure, we’ve built an API and UI for a service we call Team Store.
Team Store makes it possible for an internal service such as the Andon Slackbot to know what engineering teams exist in the organization, as well as which GitHub repos they own and which Jira projects belong to them. Management, leadership, and other stakeholders should never have to guess what each team owns or what their project statuses are.
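The kind of lookup this enables can be sketched in a few lines. The data shape and method names below are assumptions made for illustration; the real Team Store API is not described in this post.

```typescript
// Hypothetical sketch: the Team shape and method names are assumptions,
// not the real Team Store API.
interface Team {
  name: string;
  githubRepos: string[];
  jiraProjects: string[];
}

class TeamStore {
  constructor(private readonly teams: Team[]) {}

  listTeams(): string[] {
    return this.teams.map((t) => t.name);
  }

  // Which team owns a given GitHub repo?
  ownerOfRepo(repo: string): Team | undefined {
    for (const team of this.teams) {
      if (team.githubRepos.indexOf(repo) !== -1) return team;
    }
    return undefined;
  }

  // Which team owns a given Jira project key?
  ownerOfJiraProject(key: string): Team | undefined {
    for (const team of this.teams) {
      if (team.jiraProjects.indexOf(key) !== -1) return team;
    }
    return undefined;
  }
}
```

A consumer such as a chat bot could then resolve a repo or a Jira project key to its owning team with a single call instead of keeping its own copy of the mapping.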
We also recently used New Relic engineering infrastructure to deploy an application called Job Runner (built with a job processing tool called Sidekiq) that sends data to New Relic Insights for reporting and analytics and to Amazon Simple Storage Service (S3) for backup and recovery. For example, we recently began running a job against the Team Store service that collects the current number of open Jira bugs for all teams.
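The aggregation step of a job like that might look roughly like the sketch below. Job Runner itself is built on Sidekiq, which is a Ruby tool, so this TypeScript version only illustrates the counting logic. The issue fields, the assumption that "Done" is the only closed status, and the event type name are all hypothetical.

```typescript
// Hypothetical sketch of the counting logic only; the real job runs on
// Sidekiq (Ruby), and these field names are assumptions.
interface TeamProjects {
  name: string;
  jiraProjects: string[];
}

interface JiraIssue {
  project: string;
  type: string;
  status: string;
}

interface BugCountEvent {
  eventType: string;
  team: string;
  openBugs: number;
}

// Produce one event per team with its current number of open bugs.
// Assumes "Done" is the only status that means a bug is closed.
function openBugCounts(teams: TeamProjects[], issues: JiraIssue[]): BugCountEvent[] {
  return teams.map((team) => {
    const openBugs = issues.filter(
      (issue) =>
        issue.type === "Bug" &&
        issue.status !== "Done" &&
        team.jiraProjects.indexOf(issue.project) !== -1
    ).length;
    return { eventType: "TeamOpenBugs", team: team.name, openBugs };
  });
}
```

Emitting an event for every team, including teams with zero open bugs, keeps a trend chart continuous from one run of the job to the next.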
Finally, our Webhook Service handles events from webhooks and then performs a specified action. For example, if a developer creates a new bug ticket in Jira, we send that event data to our internal email, Slack, and Insights for the proper report handling and analytics.
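One way to picture that fan-out is a small router that maps an incoming event to every action registered for it. This is a sketch under assumptions: the event fields, the "issue_created" kind, and the router API are invented for illustration, and each action only returns a description of what it would do rather than calling email, Slack, or Insights.

```typescript
// Hypothetical sketch: event fields, kinds, and the router API are assumptions.
interface WebhookEvent {
  source: string; // e.g. "jira"
  kind: string; // e.g. "issue_created"
  ticketKey: string;
  summary: string;
}

// An action returns a description of what it did, which keeps the sketch
// free of real network calls.
type Action = (event: WebhookEvent) => string;

interface Route {
  source: string;
  kind: string;
  action: Action;
}

class WebhookRouter {
  private routes: Route[] = [];

  // Register an action for one source/kind pair; several may share a pair.
  on(source: string, kind: string, action: Action): WebhookRouter {
    this.routes.push({ source, kind, action });
    return this;
  }

  // Run every matching action in registration order and collect the results.
  dispatch(event: WebhookEvent): string[] {
    return this.routes
      .filter((r) => r.source === event.source && r.kind === event.kind)
      .map((r) => r.action(event));
  }
}

// Wire one Jira event to three destinations.
function buildBugRouter(): WebhookRouter {
  return new WebhookRouter()
    .on("jira", "issue_created", (e) => `email: new bug ${e.ticketKey}`)
    .on("jira", "issue_created", (e) => `slack: ${e.ticketKey} ${e.summary}`)
    .on("jira", "issue_created", (e) => `insights: BugCreated ${e.ticketKey}`);
}
```

Registering destinations independently means a new one can be added without touching the code that receives the webhook.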
Improving efficiency whether we build or buy
When we take the time to talk with our stakeholders and are deliberate about our build vs. buy decisions, the outcomes are positive and generally well received by the teams we support. We’re deeply averse to adding friction to our users’ workloads, but we sometimes have to introduce a new process as we work to collect and expose data that the Product organization (and New Relic as a whole) finds useful.
The New Relic EOPS team may be small (for now), but we handle and maintain a diverse ecosystem of services—either built from scratch or purchased and customized—that shapes process and empowers our Product organization to gather all the data about how we build and ship software. We think this balanced approach, applied on a case-by-case basis, is the best way to get the most powerful and appropriate set of tools to make us as efficient as possible.