This post is part 3 of a series on Designing for Scale. Part 1 addressed When the Rewrite Is the Right Thing to Do, while Part 2 covered Building What You Need. We suggest reading parts 1 and 2 first to get essential context.
In Part 1 and Part 2 of this series, I discussed how the New Relic Insights team planned and implemented the replacement for our high-throughput customer-facing API service, the legacy Insights Custom Event Inserter (ICEI). To create a more reliable and maintainable software system, we chose to implement the new system in Ratpack, a modern, asynchronous HTTP request-processing framework.
As we implemented the application, we used the testability of Ratpack to create a deep unit test framework that validated all of our functional requirements, including repairs to standing bugs that existed in the legacy system. All that was left was to scale it up to production readiness.
The last 10% is 90%
When operating software at New Relic scale, there is a big difference between “feature complete” and “ready for production.” As I mentioned in part 2 of this series, we built the replacement for the legacy ICEI in two weeks, at which point it was feature-complete. It took the team another two months to prepare the software for operations at production scale. That may sound like a long time, but the effort required to operationalize a system typically dwarfs the effort it takes to build it.
Also in Part 2, I discussed the throughput and payload size requirements, and how we used the New Relic Java agent together with Insights to identify the shape of the payload distribution seen by our legacy production system. We knew that our replacement would have to meet these same demands. Preparing the ICEI for production required us to prove that the system was ready before we released it.
This was to make sure that no customer data would be lost or compromised as a result of the transition. In fact, we wanted the transition to the new ICEI to be completely transparent to our customers. If we did our jobs diligently, our customers would never notice a difference.
But how do you prove that a system can handle the demands of a production environment before you ship that system into your production environment?
Load and stress testing at scale
We knew up front what the required performance specs were for the ICEI because we had all the production performance data from the existing system readily available in New Relic APM and Insights. To properly test for the load and stress conditions of production, we needed to emulate the production load characteristics in a controlled environment.
Ideally, the stress and load testing would be performed using actual production data. When considering stress testing for the ICEI, the team originally intended to fork the production data and send it to both the legacy system and the ICEI in parallel. This approach would allow us to directly compare the data inserted by both systems to determine whether they were exhibiting identical behavior.
At New Relic, our data ingestion pipeline heavily leverages Apache Kafka—an open source, distributed messaging system designed for high throughput and reliability. Unlike the systems at New Relic that consume from Kafka, the ICEI accepts data uploads via HTTP. Kafka allows multiple consumers to read from the same queue, making it possible for both the legacy and replacement systems to work in parallel and consume the exact same data.
However, since the ICEI is a Kafka producer, not a consumer, dual-dispatching the requests required forking the data stream and sending it to both systems for processing. Forking the data stream would have meant modifying the legacy ICEI to replay all of its input via HTTP to the new implementation.
We decided that was too risky, so the Insights team needed an alternative approach. We still had to put the system under a load equivalent to that seen by the legacy ICEI in our production environment, but we could not dual-dispatch the real data. Given these restrictions, we needed to construct an artificial load generator to exercise and test the new system.
Generating artificial load is not necessarily difficult, as there are many tools available that will do the job. But just generating a load is not sufficient to validate that the system is ready for production. Simply generating a uniform load that approximates the throughput of the production environment will prove that the system can sustain the traffic, but it will not surface deeper information, such as how the software responds to highly variable conditions, how it handles the wide range of error conditions that arise during processing, or what occurs when the system is placed under duress due to infrastructure issues. To provide a useful measurement of the response to load, the load itself needs to emulate the conditions seen in production as closely as possible.
If you recall from Part 2, the Insights team used the histogram function in the New Relic Query Language (NRQL), our SQL-like query language, to determine the distribution of payload sizes seen by the legacy system in our production environment:
We used this data to design a load generator that would produce payloads in the same distribution and give us the ability to adjust to larger and larger volumes. Then we included a histogram of the generated payload sizes on the same dashboard that we used to monitor the new ICEI. With one dashboard we could display the histogram of the load generator right next to the histogram of the payloads processed by the system-under-test, and easily discern whether the behavior met our predictions.
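To make the idea concrete, here is a minimal sketch of how a generator can reproduce a measured payload-size distribution. The bucket boundaries and weights below are hypothetical placeholders, not New Relic's real production numbers, and the helper names (`next_payload_size`, `build_payload`) are our own for illustration:

```python
import random

# Hypothetical payload-size buckets (in bytes) and their relative weights.
# In practice these would come from the NRQL histogram of production traffic.
BUCKETS = [(0, 1_000), (1_000, 10_000), (10_000, 100_000), (100_000, 1_000_000)]
WEIGHTS = [0.55, 0.30, 0.12, 0.03]  # must sum to 1.0

def next_payload_size(rng: random.Random = random) -> int:
    """Sample a payload size so generated traffic matches the measured distribution."""
    low, high = rng.choices(BUCKETS, weights=WEIGHTS, k=1)[0]
    return rng.randint(low, high - 1)

def build_payload(size: int) -> bytes:
    """Build a JSON-ish payload padded out to the target size."""
    body = b'{"eventType":"LoadTest","pad":"'
    padding = b"x" * max(0, size - len(body) - 2)
    return body + padding + b'"}'
```

Because each generated size is drawn from the same buckets that appear on the monitoring dashboard, the generator's histogram and the system-under-test's histogram can be compared side by side during a test run.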
So far I have focused on load testing. Equally important is how the system responds to stress. A production system must handle not only good payloads delivered via fast, reliable networks, but also invalid, corrupted, or just plain nonsensical payloads delivered via unreliable and unpredictable networks.
As with the distribution of payload sizes the ICEI needed to support, we also needed to consider the types of errors normally experienced in production, and the frequency with which they occur. Since the legacy ICEI was an HTTP API, errors produced during processing of a request resulted in the response containing a known HTTP status code. The legacy ICEI was instrumented with the New Relic Java agent, which automatically records HTTP status codes on the Transaction events stored in the New Relic Database (NRDB). This allowed us to write a NRQL query that would tell us the relative frequencies of the various error codes we see in production:
The table visualization provides the numerical counts of the different status codes, while the pie chart visualization provides a legend that includes the percentages:
Using this data, we modified our load generator to induce these same types of errors at the same frequencies and apply them to randomly generated payloads that fit the histogram of payload sizes we measured in production. To validate our work, we added a pie chart to the load generator dashboard to display the breakdown of errors generated during a test run. With this data we could compare the load to the system under test and determine whether it was behaving properly.
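The error-injection step can be sketched the same way: pick a request kind per generated payload, weighted by the relative frequencies measured in production. The error mix below is illustrative only (the real percentages came from the NRQL query over Transaction events), and `choose_request_kind` is a hypothetical helper name:

```python
import random

# Hypothetical mix of request outcomes and their relative frequencies.
# The real values would come from the NRQL breakdown of HTTP status codes.
ERROR_MIX = {
    "ok": 0.97,               # well-formed payload, expect HTTP 200
    "malformed_json": 0.015,  # expect HTTP 400
    "oversized": 0.01,        # expect HTTP 413
    "bad_auth": 0.005,        # expect HTTP 403
}

def choose_request_kind(rng: random.Random = random) -> str:
    """Pick a request kind at the same relative frequency observed in production."""
    kinds = list(ERROR_MIX)
    weights = [ERROR_MIX[k] for k in kinds]
    return rng.choices(kinds, weights=weights, k=1)[0]
```

Charting the generated mix on the load-generator dashboard then gives a direct visual comparison against the status codes the system under test actually returns.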
The load generator we built is deployed as a Docker container, allowing us to bring multiple instances online in parallel to increase the load. It spreads the load across multiple accounts to force the system under test to respond to the widest variety of conditions, and it lets the team reconfigure the error rates and payload-size distribution when initiating a test run, for maximum flexibility.
With this tool in hand, we began testing the system in our staging environment to validate that it responded properly to inputs of various kinds while also determining the maximum sustainable throughput for one instance of the new ICEI.
Because we deployed the new ICEI using Docker, we needed to know how many instances would be required in our production environment to handle the load. Once we knew the scaling factor for one instance, we could calculate the total number of instances required for a given memory and CPU limit configuration:
This reduction in hardware consumption shows the efficiency gains yielded by Ratpack's asynchronous request processing. It also reflects one of the upsides of designing the system using data that we measured: dramatically lower cost of operations. Additionally, the switch to Docker means scaling up to handle increasing traffic is merely a matter of deploying more containers. The legacy ICEI could be scaled only by adding physical hardware to racks in our data center.
Once we knew how many instances we required to host the system, we dark deployed the new ICEI into production. Then we ran the load and stress test against the production system for multiple days, using our Insights dashboards and New Relic Alerting policies to track performance and stability. Once the testing was completed with 100% success, we knew that we had built a system that was ready to handle our production load.
Time to deploy…
Freeze frame. Record scratch.
I don’t want to spoil the surprise, but of course it wasn’t that easy. Once we exposed the system to actual production traffic, we discovered that the load generator we built was still nowhere close to emulating the real thing. We had a lot more work to do before the job was done.
In the next post in the “Designing for Scale” series, I’ll dive into the complex problems we encountered when we began canary testing the application in production. Stay tuned!