We take performance and reliability seriously at New Relic. During this past holiday shopping season, for example, we supported Fortune 50 e-commerce customers that rely on New Relic to monitor their digital storefronts. We wanted to make sure our systems would be up and running smoothly under peak traffic, including Black Friday and Cyber Monday.
Making sure we can scale reliably for our customers requires a significant amount of performance and load-testing work by the New Relic Browser team. Here’s a peek behind the scenes at what goes on:
Architecting for scale
To handle all this data, our data pipeline includes a number of different data stores, including Kafka, Cassandra, Percona SQL, and others, which then flow into a massively parallelized SSD array that underpins our New Relic Insights data platform. A containerized microservices environment keeps everything running, in a polyglot language environment including Go, Ruby, Node.js, Java, and more. (No, we’re not actively trying to use all the buzzwords.)
Designing better seasonal performance loads
Every year, in preparation for the holiday shopping season, we conduct extensive performance tests on our systems several months in advance to make sure everything is ready to handle the load. While it might be easier to simply throw a bunch of load at our systems, we think it’s more effective to test for some of the interesting traffic patterns that we see during the holiday season:
- More new errors: E-commerce companies often push out new microsites and webpages for the holidays, which may not get as much testing as their normal sites due to the holiday crunch. So, we generally see a higher volume of errors as well as a greater variety of new errors being reported.
- Open connections: Many e-commerce websites get bombarded with users checking out hot deals, who then bounce immediately in search of the next bargain. This can mean low connection reuse during the Cyber Monday rush, leaving a lot of open connections during the flood of traffic.
- SSL security: Secure purchase and payment traffic spikes during the holiday season, and we see a correspondingly significant uptick in SSL traffic this time of year. Because SSL places extra load on servers to handle encryption, we opted to make our load-testing traffic 100% SSL.
We incorporated these holiday-specific traffic patterns into the larger set of load tests we use to put our systems through their paces at traffic volumes significantly higher than our holiday forecasts. These standard tests include:
- Volume: We spun up a collection of Amazon Web Services instances generating throughput against our systems, each running custom-written services designed to stress our data pipeline. Each instance had its own external IP address to ensure that the network wouldn’t be the bottleneck.
- Ramp: We created a linearly scaling load designed to push our systems to double our peak load. After verifying that the load functioned on a single instance, we ramped up 10 instances at a time. We continued ramping up blocks of instances until we hit our load targets.
- Request types: We replicated the distribution of traffic and data types seen under our typical production load, with load generated across different data types, such as events, transactions, errors, and more.
- Complex session traces: Session traces are a particularly complex data type due to the large number of components that can make up a trace. To push our Cassandra cluster, we constructed abnormally large and complicated session traces to use during our load testing.
- Account traffic types: We used different New Relic account types (Pro and Lite) as a way to replicate the different user traffic bases that we expected to see.
- Account size: We created hypothetical customer accounts many times larger than any actual current customer to ensure that our largest customers would have smooth scaling experiences.
While load testing is an industry standard practice, it can sometimes get lost among competing priorities. So we’ve worked to build a team culture that helps to make sure this important work gets done. The New Relic Browser engineering team sits 10 feet away from a wallboard TV screen that displays an Insights dashboard monitoring the real-time health of our systems, and we include this load testing as part of ongoing system-hardening work that our team manages.
Modeling scaling behavior in production
Designing and setting up these load tests has taught us some invaluable lessons. In particular, it has helped us build a quantitative model of how our systems handle load. Queueing theory suggests that a system where work runs in parallel across shared resources (memory, network, CPU, etc.) will often exhibit flat response times as it begins to process traffic. As traffic increases and begins to saturate the system’s capacity, performance eventually hits an inflection point where response times climb quickly. If traffic continues to rise, things degrade until the system is fully bogged down … and I get woken up by a late-night pager alert.
Freeway congestion is a great analogy. At 3 a.m., few cars are on the road, and traffic runs at full speed. Even as traffic increases during the morning, there is still enough capacity for everyone to continue driving at full speed. However, as traffic continues to increase during rush hour, the road eventually reaches an inflection point when cars start slowing down as more contention creeps into the system due to cars waiting to merge or bottlenecking at ramps. If traffic increases after this inflection point, things degrade quickly until the freeway is fully gridlocked.
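The inflection point in both analogies shows up in the textbook single-queue (M/M/1) result, where response time is R = S / (1 − ρ) for service time S and utilization ρ: nearly flat at low utilization, then climbing steeply as the system saturates. A small illustrative sketch:

```go
package main

import "fmt"

// responseTime applies the M/M/1 formula R = S / (1 - rho): with
// service time S and utilization rho, response time stays close to S
// while the system is lightly loaded, then climbs steeply as rho
// approaches 1 (full saturation).
func responseTime(serviceMs, rho float64) float64 {
	return serviceMs / (1 - rho)
}

func main() {
	for _, rho := range []float64{0.1, 0.5, 0.8, 0.9, 0.99} {
		fmt.Printf("utilization %.2f -> %.1f ms\n", rho, responseTime(100, rho))
	}
	// At 10% utilization the response time is still ~111 ms, barely
	// above the 100 ms service time; at 99% it balloons to 10,000 ms:
	// the freeway is gridlocked.
}
```

Real systems are messier than a single queue, but the shape of the curve, a long flat stretch followed by a sharp knee, is exactly what we watch for in our load tests.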
Not surprisingly, our engineering team uses New Relic to monitor New Relic: we rely on APM and Browser to track these response times, and we have New Relic Alerts set up so that if these response times start to erode, we can step in early before the freeway becomes gridlocked.
During our load-testing exercise, monitoring response time helps us see this inflection point down the road, giving us confidence about how—and how far—our systems will scale. New Relic monitoring helps us know not only what’s happening today, but also helps inform us as to what might happen six months from now.
Load testing has other benefits as well. It helps us do capacity planning, since we have a data-driven inflection point to forecast around. It helps us find bottlenecks to address in order to push out the inflection points. We’ve deployed New Relic instrumentation deep into our different services and data stores so we can find the best places to tune. We can stress the system with load more extreme than our production traffic, which gives us extra confidence that our systems will perform smoothly under production load. And we can examine the different scaling behaviors of our various services, such as by processing time, byte size, or network usage, as well as of our underlying infrastructure.
Black Friday/Cyber Monday traffic patterns
Overall, our load test results inspired confidence that we could handle Black Friday/Cyber Monday without incident. And, in fact, they were pretty much non-events for New Relic Browser customers.
The largest portion of our data is page views, since they are available to all kinds of New Relic accounts, and as you can see in the chart below, we saw a 29% increase in page views:
Getting ready for next year
As planned, our load testing far exceeded what New Relic Browser actually experienced during the holiday rush—a huge comfort to our team members carrying the on-call pagers. More important, though, we made sure our systems remained ready and available to our customers during peak periods.
Whether it’s livecasting the 2016 presidential election for Gannett’s USA Today or making it easy for customers to buy holiday gifts from John Lewis, we’re here to help our customers keep their websites and apps up and running when it matters most.