In Part 1 of this series, I covered the decision-making process the New Relic Insights team used when we opted to replace a legacy service implementation with a rewrite in a new technology stack. One key aspect of this project was taking advantage of the opportunity provided by the rewrite to re-examine the needs of the system and implement only what was required.
In this second post in the series, I dive into the process by which we identified the actual real-world requirements of the software and built the system we needed.
What do we need?
While planning, we identified the challenge to validate that the new software would properly handle the variety and magnitude of data seen in our production system. This would be a critical part of the rewrite of the legacy Custom Event API (ICEI), which is a high-throughput, customer-facing API service.
So our first task was to identify the behavior we needed to retain and implement a functional equivalent to prove our theory that Ratpack, a modern, asynchronous HTTP request-processing framework, would solve the problem more cleanly than did the existing implementation.
Ratpack models HTTP request processing using handlers. Each handler is a link in a chain, and can either render a response or forward to the next handler in the chain. This fine-grained level of detail allows the application intake pipeline to be efficiently modeled with a set of handlers, each of which is custom tailored to a specific behavior.
The team dedicated time up front to determine all the functional behavior we needed, and to construct a set of handlers that matched those functions. For instance, we created a handler that extracts the account ID from the URL of the request and validates that it a) exists and b) can be parsed to a valid numeric value. If either condition fails, the handler returns an error to the caller indicating the exact reason for the failure. Following this pattern, the entire request pipeline was modeled as a sequence of handlers, terminating in a response with HTTP status 200 if the data is valid and has been dispatched for insertion into the New Relic Database (NRDB).
From inception to feature-completion it took the team less than two weeks to implement the system, which included the time spent building a thorough unit test suite for all of the handlers and utility classes. This includes unit tests for all of the possible error states and the specific HTTP status code and body content we return to the customer, which lets us detect any regressions in behavior as we enhance the product. It is worth noting that the enhanced error reporting is an improvement over the legacy system that satisfies frequent feature requests from customers.
Functionally, we now had a system that could replace the legacy software. However, we had no idea whether it would perform at scale.
The needs of production
The legacy Custom Event API had been operating in production for more than two years, and had become quite popular with our customers:
The service handles more than 350K requests per minute at peak traffic, with payloads that vary in size. The core responsibility of the API is to accept custom event data in JSON format, convert it to event batches, and dispatch those batches to NRDB via Kafka. Over its lifetime, the legacy system was hit by increasingly large payloads—up to 800MB per payload.
To be clear: 800MB of JSON is a lot of JSON.
We identified two distinct production requirements: high throughput and variable payload size. These two requirements posed an interesting challenge: the higher the percentage of traffic that consists of large payloads, the more difficult it was going to be to maintain high throughput. It takes a lot of processor cycles and a lot of memory to deserialize hundreds of megabytes of JSON and convert it into event batches.
As a result, the system we had was heavily optimized to efficiently ingest massive JSON uploads.
But then it occurred to me: there’s a hardcoded 5MB limit on the payload size the system will accept. We ran across it while investigating the source code when we inherited the code base. So we didn’t actually need to process JSON payloads in the hundreds of megabytes. We only had to handle 5MB at a time. However, processing a 5MB payload can still be a very slow operation, which was still going to be an issue for throughput. We had to plan ahead for scale.
I decided that before we did any optimization, in addition to the throughput we needed to know the shape of the production traffic. I needed to see the distribution of payload sizes. The first step was to add additional instrumentation. I did this by adding a custom attribute to the existing transaction events, which were being generated by the New Relic Java agent monitoring the ICEI. This attribute captured the payload size in bytes:
final long contentLength = context.getRequest().getContentLength(); NewRelic.addCustomParameter("payloadSizeBytes", contentLength);
Notice how simple that was: one line of code that calls the New Relic agent and stores the payload size as a custom attribute on the transaction events. I was able to make this change and deploy it within a matter of minutes.We immediately began seeing the payload size information in our Insights dashboards for ICEI. Once the change was deployed, I was able to use the power of the NRQL histogram function to generate a visualization of the payload size distribution:
In this query, each bucket is 100 KB. The largest bucket contains everything between 100 KB and 5MB. It is immediately apparent that we see far more small payloads than anything else. In fact, more than 90% of all payloads are less than 1 KB and 99% of all payloads are less than 10 KB.
Now we had identified a fundamentally important aspect of scaling the new software for production: the vast majority of our traffic is small payloads that will produce a small number of event batches. They will process quickly, and we need to handle many of them in parallel. Although we do have to allow for payloads up to 5MB, they are rare. We decided that the system would gate all uploads larger than 100KB and limit them to no more than 8 at a time to prevent CPU starvation and memory problems. Smaller payloads would not be limited in order to let them process as swiftly as possible with high concurrency.
At this point, we had seen a lot positive outcomes from the initial experiment, and we had determined the core operational needs of the production system. The software had a unit test suite that validated all of the functional needs, giving us a high degree of confidence that it was API-compliant with the legacy system. Now it was time to scale it up and deploy it, right?
In fact, we were just beginning the real work. Preparing a system for operational readiness is a much bigger effort than the construction of the initial software. Consider this: once the initial experiment was kicked off, the team had it functionally equivalent to the legacy system two weeks later. Getting the system ready for production operation took another two months of hard work.
In the next post in this series, I’ll address the load- and stress-testing process we used to scale up the system to production levels, and the tools we used to track our progress.
Don’t miss all installments of the “Designing for Scale” series:
- Designing for Scale: Part 1—When the Rewrite Is the Right Thing to Do
- Designing for Scale: Part 3—Scaling Under Stress
- Designing for Scale: Part 4—Deployment Surprises