How New Relic Used Docker to Solve Thorny Deployment Issues

Posted in New Relic News, Tech Topics, 12 August 2014

Karl Matthias, who manages New Relic’s Site Services team, also contributed to this post.

With all the press Docker is getting these days, there’s been a lot of discussion about theoretical use cases. But New Relic has already been enjoying real-world success with Docker in production for months, using it to solve a single, thorny problem: deployment. This post, based on a presentation we gave at DockerCon 2014 called Docker Deployments: Powerful for Developers, Painless for Ops, explains the deployment problems we faced; how we addressed them using Docker, our open source Centurion deployment tool, and other tools and best practices; and how the simple strategy we used has improved the lives of developers and ops people alike at New Relic.

We’ve seen other companies build a Platform-as-a-Service on top of Docker right up front. But we didn’t set out to build a PaaS right away; we set out to improve the lives of both developers and ops teams. Along the way we followed the New Relic philosophy (courtesy of Ward Cunningham) of doing “the simplest thing that could possibly work.” Here’s what we did and how we got there, along with some explanation of the tools we built to direct our fleet.

Our story really starts with preparations for the March launch of the open beta for New Relic Insights. Initial investigations by the Insights team showed that Docker offered potentially huge benefits for development teams like them. So after promising early tests, we made a hard push to get Docker into production.

This turned out to be a remarkably quick ride: We went from no Docker in production in December to launching our biggest new product ever on it in February. We are now launching several new services into it every month.

Our situation

We started with a fairly standardized app environment with about seven production apps in two languages. These were large, monolithic apps where major changes affecting the environment very rarely happened. The system was maintainable with tools we had already built and configurations that had slowly evolved over six years of production life. We ran almost everything on hardware or hefty virtual machines–often running more than one app or service per host. We used centralized load balancing on a tier of F5s. It’s a classic production environment, much like you might find at many other companies of our size and maturity.

A concerted effort had enabled development teams to handle their own operations, which was great, but we couldn’t give these DevOps folks access to all the things they needed. Additionally, when changes were needed from the Site Engineering team (us), we found that asking developers to submit pull requests against our massive Puppet codebase was awkward and often detrimental for both parties. Worse, because they needed access to a lot of the configuration to debug their applications, many developers had potential access to all of the database secrets.

Even for those mature codebases, when the environment needed to change, deployments would often break. It wasn’t clear who was responsible without Site Engineering spending a lot of time debugging the issue. Fixing anything took a lot longer than it should have, and systemic problems were very difficult to isolate because so many apps were sharing resources–and many apps on the same servers had conflicting requirements.

Things began to evolve rapidly toward the end of 2013. We began breaking our main app into a service-oriented architecture (SOA) with many services, all of which needed new environments. Additionally, our apps were becoming increasingly heterogeneous. We had at least four Ruby versions, three JVMs, five data stores, a large number of new database instances, and a whole slew of other variations. Different apps updated dependencies at different times.

Left to deal with all the issues were the one or two people on each team who knew the magic of their deployment configuration. Worse, the deployment experts on the existing teams didn’t stretch to all the new teams. Some teams found themselves without anyone who knew anything about deployment. Add the fact that launching Insights would boost our server count by 50% and something had to change.

What we didn’t do

At this point there were a lot of paths we could have taken. But we wanted to do the simplest thing that would deliver the biggest win for the least pain–for Site Engineering and for our developers.

So we didn’t try to create a full PaaS framework all at once. Though this may be our eventual goal, it wouldn’t have solved the immediate deployment problem.

We did not begin Dockerizing our applications by starting with those that have the highest data volume. Rather, we started with our simplest internal Web apps, particularly stateless ones that could scale horizontally. Our early testing showed that high-throughput apps are not a good choice for your first Docker deployment, because of overhead in the Docker network stack. Release 1.0 of Docker brought reliable host-based networking and that may have changed this situation–we’re investigating that now. Applications where each instance must stay up for a long time are not a good first choice, either.

We didn’t implement dynamic scaling or service discovery. We decided to assign ports statically to each application, using a port registry. This let us pre-configure load balancers for an application across a pool of servers and let health checks determine where it is being served. It also let us easily preconfigure our monitoring applications.
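The static approach can be pictured as a small lookup table. The following is only a minimal sketch in Python, not our actual registry: the app names, port numbers, and host names are hypothetical, and the real registry may take a different form entirely. It shows how a fixed port per application lets load balancer pool members be computed up front for every host, leaving health checks to decide which members are live.

```python
# Hypothetical static port registry: app name -> fixed host port.
# Names and numbers are illustrative assumptions, not real values.
PORT_REGISTRY = {
    "insights-api": 8101,
    "internal-dashboard": 8102,
}

# Hypothetical pool of Docker hosts behind the load balancer.
DOCKER_HOSTS = ["docker01", "docker02", "docker03"]


def validate_registry(registry):
    """Raise ValueError if two apps were assigned the same port."""
    seen = {}
    for app, port in registry.items():
        if port in seen:
            raise ValueError(
                f"port {port} assigned to both {seen[port]} and {app}"
            )
        seen[port] = app


def pool_members(app, registry=PORT_REGISTRY, hosts=DOCKER_HOSTS):
    """Return every host:port the load balancer may route to for an app.

    The app need not be running on every host; health checks on the
    load balancer decide which members actually receive traffic.
    """
    port = registry[app]
    return [f"{host}:{port}" for host in hosts]
```

Because the port never changes, the same table can feed monitoring configuration as well as the load balancer.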

Implementation: What we did

The early projects to ship on Docker at New Relic built their own Dockerfiles. The Insights team laid down some best practices that other teams followed and that we’ve now encoded in our base images. The whole exploration process was facilitated by the Insights team’s Unix knowledge and general determination.

Our Site Engineering team, meanwhile, focused on the deployment part of the story, developing a tool that would let us guarantee runtime configurations for containers and do automated deployments using Docker: Centurion. We’ll talk more about that below.

We used Docker to separate builds from deployments, a strategy that aligns well with Docker’s architecture: Build jobs ship things to a Docker registry, and deployments pull images from a registry and execute them on servers. Docker is a powerful piece of technology, and as such there are plenty of things to learn when starting to use it. We didn’t want to require teams to have a deep understanding of building and shipping images with Docker in order to deploy. The idea was to lower the bar, and we felt that such a requirement alone might be a major obstacle. But we also wanted to let teams retain the full power of Docker if they needed it. We wrote some tools and developed a strategy for handling that.

Lowering the bar: Think high jump, not Limbo

Through the early process, we discovered some best practices for images, and were able to take advantage of Docker’s layers to create some easy-to-use base layers for teams to build on. This was the first step toward making things easier on development teams. Follow the standards and it’s easy to package your app in Docker. It also makes upgrading software easier: We update the base image and teams can pick it up on the next deployment. Changing OS dependencies synchronously with deployment is a huge win.

After we got things deploying correctly using Centurion, we built another tool for automated Dockerfile configuration. Called Shipright, the tool lets us get new projects up and running in hours, not weeks. Teams don’t necessarily need to know how to write a Dockerfile; they can simply specify configuration items on the command line and the base images work with Shipright to build a container image. Often teams don’t need to do anything to build an image other than check out their own repository and run Shipright. Shipright and Centurion let us completely separate developer and ops concerns, clearly delineating responsibility and getting teams up and running very quickly.

We recently had an opportunity to measure our success when a team breaking out legacy code into a new service had their app running in hours. This would have been a real challenge using previous mechanisms.

Getting to know Centurion

Back to our tools! Centurion is a command-line application built around Rake that lets you ship Docker containers to whole fleets of machines, with a repeatable configuration. It’s also environment aware, meaning you can set all of your application configs in one place for every environment.

Centurion handles port mapping, volume mapping, and runtime environment variables (like database secrets or environment specific settings). It has a place to choose the servers you are deploying to, and which environment they use.
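To make the idea of one declarative config per environment concrete, here is a rough sketch in Python. It is not Centurion’s code or its configuration format (Centurion is built around Rake); the `ENVIRONMENTS` table, host names, image name, and the `docker_run_args` helper are all hypothetical. It only illustrates how port mappings, volume mappings, and environment variables translate into the standard `docker run` flags `-p`, `-v`, and `-e`.

```python
# Hypothetical per-environment settings; every value here is an assumption
# made for illustration, not taken from a real deployment.
ENVIRONMENTS = {
    "staging": {
        "hosts": ["docker-stg01"],
        "env": {"RACK_ENV": "staging"},
    },
    "production": {
        "hosts": ["docker01", "docker02"],
        "env": {"RACK_ENV": "production"},
    },
}


def docker_run_args(image, ports=None, volumes=None, env=None):
    """Build an argv list for `docker run` from a declarative config.

    ports:   {host_port: container_port}
    volumes: {host_path: container_path}
    env:     {NAME: value}
    """
    args = ["docker", "run", "-d"]
    for host_port, container_port in (ports or {}).items():
        args += ["-p", f"{host_port}:{container_port}"]
    for host_path, container_path in (volumes or {}).items():
        args += ["-v", f"{host_path}:{container_path}"]
    for name, value in (env or {}).items():
        args += ["-e", f"{name}={value}"]
    args.append(image)
    return args


def commands_for(environment, image, ports=None, volumes=None):
    """Return (host, argv) pairs: the same container config for each host."""
    settings = ENVIRONMENTS[environment]
    argv = docker_run_args(image, ports, volumes, settings["env"])
    return [(host, argv) for host in settings["hosts"]]
```

Keeping the container settings in one structure and only swapping the environment block is what makes the runtime configuration repeatable from a laptop through to production.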

Another big win is that it supports rolling deployment of Web applications out of the box. It takes advantage of a health-check URL for each app (we use /check/status) to know definitively when a new container has successfully started and is ready for service.
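The shape of a health-checked rolling deployment can be sketched in a few lines of Python. This is an illustration of the general technique under stated assumptions, not Centurion’s implementation: the function names are hypothetical, and stopping, starting, and health checking are passed in as callables so the loop itself stays independent of how containers are actually managed.

```python
import time


def wait_until_healthy(check, attempts=20, delay=1.0, sleep=time.sleep):
    """Poll check() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        sleep(delay)
    return False


def rolling_deploy(hosts, stop_old, start_new, is_healthy, **wait_opts):
    """Replace the container on one host at a time.

    Halts at the first host whose new container never passes its health
    check, so the remaining hosts keep serving the old version.
    Returns the list of hosts that were successfully updated.
    """
    deployed = []
    for host in hosts:
        stop_old(host)
        start_new(host)
        if not wait_until_healthy(lambda: is_healthy(host), **wait_opts):
            raise RuntimeError(f"{host} failed health check; halting deploy")
        deployed.append(host)
    return deployed
```

In practice, `is_healthy` would typically issue an HTTP GET against the app’s health-check URL (such as /check/status) on that host and treat a successful response as ready. Updating one host at a time means the load balancer always has healthy members to route to.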

We’re already using Centurion for deploying real applications to production. But its flexible design lets it work with environments as small as boot2docker on a laptop, so devs can ensure that everything is the same from laptop through staging and into production.

Other New Relic tools

If it isn’t monitored it isn’t in production. We monitor everything at New Relic, and Docker is no exception. Of course we monitor our applications and servers with New Relic APM, but we also use Nagios for more systems-level health checking. Our check_docker Nagios check for making sure Docker is running well is also open source and available.

Of course, New Relic APM works with Docker out of the box because it doesn’t need to know where it is running, just how well it is running.

Operating our environment

Our deployment configuration currently sits in a Git repository, shared by all the applications. We’re working on etcd support as well.

Because it is just a command-line tool, Centurion was able to drop into our Jenkins environment and immediately allow continuous deployments. We have two kinds of jobs: Shipright-based jobs that build images, and Centurion-based jobs that deploy images to servers.

Getting a new application up and running in Docker is pretty easy. But we built some additional support for debugging. Centurion can give you a running console with all the same configuration items and the same base image. Runtime debugging is possible through SSH running in each container, via centralized logging (e.g., via Papertrail), and via the Docker ‘logs’ command. We are moving toward pushing all logs through syslog into a log router (Mozilla Heka) and on to ElasticSearch/Kibana for consumption.

In this new environment, teams can iterate on their dependencies whenever they like. One team upgraded its Ruby and Rails versions without having to inform us. Security patches are also smoother sailing. They can be easily applied in the next container build, tested thoroughly, and then pushed to production without requiring any work by development teams to synchronize dependency changes with deployment: It’s synchronous by design.

It’s easy to add capacity by adding Docker servers. To do that, we just network boot machines from Cobbler and, when the new OS comes up, Puppet installs the Docker daemon.

Happy developers and a bright future

Our development teams are hugely enthusiastic about Docker. Being able to iterate on their own dependencies, launch a local integration environment easily, and never touch Capistrano are all big wins.

We have vastly lowered the bar for new Site Engineering projects. We simply provision a database with Puppet, slap an entry into the load balancers and DNS, and let the team add its data to Centurion (we still tell them which hosts to use). Site Engineering spends a lot less time digging around in Puppet trying to figure out how to support conflicting dependencies.

There is a lot of work left to do. But we have a clear vision of where we want to go and the next steps to get there. We expect to support etcd for both service discovery and service status soon. We expect dynamic configuration of load balancers and Nagios from etcd as well. Eventually, we’ll circle back and begin configuring high-volume apps and databases from Centurion using Docker–via host networking.

We credit our simple approach for our success with Docker. Take it one step at a time, build simple tools (or use ours!), do the pieces that offer the biggest gain for the least pain first, and you too may be able to take advantage of what Docker has to offer.

About the author

Paul is a senior software engineer on the Site Engineering team. He has had a hand in lots of big Burning Man art and is constructing his own CNC laser cutter. He wrote some of the software the IRS uses to do your taxes and worked on hardware data management for the iPad.
