This post is the conclusion of a four-part series on Designing for Scale. Part 1 in the series addressed When the Rewrite Is the Right Thing to Do, Part 2 covered Building What You Need, and Part 3 was about Scaling Under Stress. We suggest reading the earlier parts first to get up to speed on the story.
In the previous installments of this series, I discussed how the New Relic Insights team planned and implemented a replacement for our Insights Custom Event API—(ICEI), a high-throughput, customer-facing API service—and then scaled it to handle New Relic production loads. This required us to model the traffic we see in production, and build a load simulator that would mimic that traffic sufficiently to ensure that our system was capable of processing our full production load. Once the team had the software stable and operating at production volumes, it was time to deploy it!
Everything went wrong
At this point in the process, the team had high confidence in the new system. We knew that it could process the volume of requests we see in production, and we knew it could handle all the standard error conditions. Of course, we also realized that actual production data is far more variable, random, and unpredictable than simulated traffic.
To minimize the risk of introducing new systems or behavior into our production environment, we use canary deploys. Our legacy system operated on 12 hosts that sat behind a load balancer. This environment lent itself well to a canary deploy, because we could simply add a new host to the load balancer running our new system. The load balancer would distribute a share of our incoming traffic to the new instance, and we would monitor the behavior while comparing it to the legacy system.
Shortly after the new instance began accepting traffic, we started seeing a growing number of 503 status codes returned from the new software, indicating that we were denying incoming requests due to the server being too busy. Since this service is a data-ingestion endpoint, denying requests is a very bad thing.
CPU consumption was nominal, memory profile was flat with no garbage collection issues, but for some reason the server was denying requests. We immediately shut down the instance to investigate the issue.
Check the throttle
One of the many cool features of Ratpack (our request-handling framework) is support for throttles. A throttle is a concurrency limiter that can be applied to any Promise chain that places an upper boundary on the number of requests that will be executing concurrently through the chain. This feature is fantastic when you need to limit the number of requests executing a particular code block, such as outbound calls to other services.
In our case, though, we had to be cognizant that ICEI is a data ingestion service, and should never allow data to be lost. The ICEI ultimately publishes data out to Kafka for downstream ingestion into the New Relic Database (NRDB). If the ICEI is unable to publish to Kafka, requests will begin backing up in memory. If this continues for too long, the JVM runs out of memory and the process will be killed. All the requests that were in memory are lost when the process shuts down, taking their data with them.
We use the throttle mechanism to limit the number of requests we allow into memory at once. When we added the throttle to the ICEI, we also added back pressure that would proactively deny incoming requests if the throttle was full and we did not have adequate resources to allow any more payloads into memory. When this happens, new requests should be queued by Netty, the underlying async processing framework that powers Ratpack.
It took the team a week to track down the issue, but we discovered that instead of queueing the incoming requests, the service was returning a 503 status code to the caller. This was a hint that we needed to look at the throttle.
We instrumented the application by recording a set of custom attributes on the Transaction event to track the status of the throttle. Ratpack makes it easy to obtain the number of requests executing inside the throttle, as well as the number that are queued awaiting entry, so we grabbed those values and used the New Relic Java agent to record them:
final Throttle throttle = ctx.get(Throttle.class); NewRelic.addCustomParameter("throttleSize", throttle.getSize()); NewRelic.addCustomParameter("throttleActive", throttle.getActive()); NewRelic.addCustomParameter("throttleWaiting", throttle.getWaiting());
We then redeployed the canary and accumulated enough data to reproduce the issue, while being careful to take the instance out of our load balancer pool before it completely consumed available memory. Then I used Insights to chart these values as a time series using the max function. This showed me the worst-case values per interval over the course of our testing:
It was immediately clear from that chart that we were leaking requests through the throttle. This caused incoming requests to begin queuing, and eventually the system stopped accepting requests entirely.
Chunked encoding and timeouts
The investigation into why we were leaking throttles uncovered two scenarios that might induce the issue. The first was chunked encoding, a method of splitting a large data transfer into fixed-sized chunks. As it happened, some of our customers were using chunked encoding to send us event payloads and the legacy application had support for it. We instrumented the application to record a custom attribute whenever we detected that the content was chunk encoded, and used Insights to determine how frequently we saw them.
The number of chunked payloads in a single hour made it clear that they accounted for a lot of our leaked throttles. Ratpack supports chunked encoding, but since the content length is not known in advance for a chunk encoded payload, a chunked payload can exceed our maximum supported payload size of 5MB. Ratpack promise chains allow for the insertion of an error handler invoked if the maximum payload size is exceeded when streaming a payload into memory, which is how chunked encoding is handled. We added an explicit error handler to the Ratpack promise chain to trap and cleanly handle oversize chunked payloads. We deployed the change, and saw an immediate difference. This time, the canary ran for four hours before a leak was detected. Leaks were still occurring, but at a much slower rate, and not because of chunked encoding.
Further investigation uncovered another error case: If the client timed out or prematurely disconnected during the request transfer, a throttle counter could leak. It took a solid week of investigation and testing but we were able to determine that Netty supported a customizable socket timeout duration. If the connection was idle for longer than the timeout, Netty terminated the connection and cleaned up the resources. The version of Ratpack we were using did not let us alter the timeout value, nor did it handle the error from Netty when a timeout occurred.
Now that we suspected a cause for the leak, we needed to reproduce it. So we enhanced our load generator (see Part 3 of Designing for Scale for more information) to include chunked payloads and unstable clients. The unstable clients randomly introduced delays in transmission, and sometimes terminated the connection prior to completion. We ran the load generator against the ICEI, and were immediately able to reproduce the leak.
(Let’s take a moment to let everyone know how amazing the Ratpack developers are. They operate a Slack channel where they respond to user issues, and provided immediate feedback on our issue. The Ratpack developers were able to craft an enhancement that addressed our needs and push the fix into the upcoming 1.5.0 version of the framework. We got a test build of the new feature, and retested it.)
We ran the load generator against the ICEI with the modified Ratpack version and found that we no longer leaked any requests through our throttles. The bug was fixed, and our load generator detected any regressions.
The next step was to try another canary deploy to our production environment and validate that the changes fixed the remaining issues. However, we were working with a pre-release build of Ratpack 1.5.0, and our team was not comfortable operating a pre-release framework in our production environment. We decided that we would be willing to accept the risk of deploying a release candidate as long as it passed all our testing, but would not risk operating a pre-release.
This decision placed the team in a tough spot: we needed a release candidate build of Ratpack that would allow us to deploy the application into our production environment, but that meant we had to wait for Ratpack. We opted to develop a fallback plan to get the system into production if a Ratpack release candidate was delayed.
Our fallback plan was to fork Ratpack 1.4 and cherry-pick the new features from 1.5 that we needed to support socket timeouts. This would require us to own a fork of the Ratpack framework and accept responsibility for keeping it updated. Eventually, of course, we would migrate to Ratpack 1.5, so we would be supporting this custom fork for only a short period of time. The team decided that if a Ratpack release candidate was not published within two weeks, we would proceed with our fallback plan.
Remember a few paragraphs ago when I was complimenting the Ratpack developers for their responsiveness? Two weeks after the initial change was made, and shortly before we were planning to begin work on our fallback plan, Ratpack 1.5.0-RC1 was published. We immediately upgraded from the pre-release version, and began a regression test of the system. The framework was stable, and the application was behaving itself, so the Insights team took a vote and decided to move forward with final production testing. It was time to build a canary and deploy it into our production environment.
We put the canary into service at 11 a.m. on a Wednesday and monitored it closely. After eight hours, two things were clear. First, there had not been a single throttle leak or failed request. Second, the new system was significantly more efficient than the legacy system it had replaced.
We had 11 production instances of the legacy system, so when we added the new instance to the pool, we expected it to take 1/12 of the traffic. Within five minutes of joining the load balancer, the new implementation was processing 1/8 of all our production traffic. Our load balancers are “smart” and route more traffic to instances that can process more transactions. In this case, the new implementation was processing requests faster and with higher parallelism than the legacy system, and the load balancer proactively routed more traffic to it.
After 48 hours, the team decided we were ready to begin transitioning the systems. We added nine more instances of the new system to the load balancer, and then let the system run for 72 hours. During this time, the new systems took over 70% of all traffic, despite accounting for less than half the instances in the pool. The systems were stable and no memory issues or request leaks were detected. We decided it was time to decommission the legacy software, and began shutting down the instances. We transitioned slowly to ensure that all requests were processed and that we didn’t see any sudden spikes in traffic in the new software that would indicate trouble. After two hours the old system was offline and the new ICEI was handling 100% of our customer traffic.
A new definition of success
The new ICEI was fully in production on a Monday at noon. We closely monitored the system for any indications of trouble for the rest of the day, and didn’t see any issues. Our support organization didn’t receive any customer tickets. We didn’t get paged. Our data insert rates were nominally the same as before.
We had successfully replaced a customer-facing API with a new software system, and nobody noticed! The new system offers a better user experience for our customers by improving the error messages we return when there is an issue with the payload. The engineering team now has a software system that is far easier to enhance, allowing us to build new functionality to support CORS in less than a week, enabling customers to submit custom events from browser applications. This is functionality that we never would have been able to support in the legacy system.
This is the definition of success for software at New Relic. By constantly maintaining focus on preserving customer experience, planning for error-free transitions, and building for resilience we can ship software like the ICEI without issues. This means we can improve our systems and the customer experience we provide without risking degradation of service.
It took nearly three months from inception to full production rollout of the new ICEI software system, but we did it. And, we did it without any critical issues.
Planning the provisioning, testing, and rollout of a system with the needs of the ICEI is difficult. It requires an entire team of dedicated engineers with the support of an organization that understands the value of investing in rework when necessary to increase reliability. An entire team spending three months on replacing an existing system is a big commitment of time and budget. In order to ensure that the work we did justified the expense, we needed to structure the development cycle such that we could quickly change tactics if insurmountable problems arose, including abandoning the work should that become necessary.
Above all else, none of this would have been possible without New Relic APM and New Relic Insights. The products provided us with the instrumentation, analysis, and visualization methods we needed to scale the system to production while maintaining confidence in the stability, reliability, and data integrity required. Our products are capable of this because the engineering teams that build them place the same level of expectations on the products that we place on the microservices that power them, and the data-ingestion tier that supports them.
The Insights team is my team, and I could not be prouder of the work we did to stand up the new ICEI and roll it out to our customers.
Don’t miss all installments of the “Designing for Scale” series:
- Designing for Scale: Part 1—When the Rewrite Is the Right Thing to Do
- Designing for Scale: Part 2—Building What You Need
- Designing for Scale: Part 3—Scaling Under Stress