We take performance and reliability seriously at New Relic. During this past holiday shopping season, for example, we supported Fortune 50 e-commerce customers that rely on New Relic to monitor their digital storefronts. We wanted to make sure our systems would be up and running smoothly under peak traffic, including on Black Friday and Cyber Monday.

Making sure we can scale reliably for our customers requires a significant amount of technical performance load testing work by the New Relic Browser team. Here’s a peek behind the scenes at what goes on:

Architecting for scale

While holiday shopping often brings new traffic records for retailers, the engineering teams at New Relic need to build for scale all year long. In the last quarter alone, our customers used the New Relic Digital Intelligence Platform to collect more than 25 trillion events. Spread over a roughly 90-day quarter (about 7.8 million seconds), that works out to more than 3 million events every second, all of which New Relic needs to ingest, analyze, and store to meet our customers’ digital intelligence needs. For perspective, that’s orders of magnitude more events per second than there are Google searches or Tweets. Users of New Relic Browser collect tremendous volumes of data from their frontend applications using our JavaScript agent, including events, transactions, and errors.

To handle all this data, our pipeline includes a number of different data stores, among them Kafka, Cassandra, and Percona Server for MySQL, with data ultimately flowing into a massively parallelized SSD array that underpins our New Relic Insights data platform. A containerized microservices environment keeps everything running, in a polyglot environment that includes Go, Ruby, Node.js, Java, and more. (No, we’re not actively trying to use all the buzzwords.)
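
To make this a little more concrete, here’s a minimal sketch, in Go (one of the languages we use), of the kind of stage that might sit between Kafka and Cassandra in such a pipeline. It uses the open-source segmentio/kafka-go and gocql client libraries; the broker addresses, topic, keyspace, and table schema are hypothetical stand-ins rather than our actual internals.

    // pipeline_consumer.go: a sketch of a pipeline stage that reads events
    // from Kafka and writes them to Cassandra, using the open-source
    // segmentio/kafka-go and gocql clients. The broker addresses, topic,
    // keyspace, and table are hypothetical stand-ins, not our actual internals.
    package main

    import (
        "context"
        "log"

        "github.com/gocql/gocql"
        "github.com/segmentio/kafka-go"
    )

    func main() {
        // Consume from a (hypothetical) topic of browser events.
        reader := kafka.NewReader(kafka.ReaderConfig{
            Brokers: []string{"kafka-1:9092", "kafka-2:9092"},
            GroupID: "browser-ingest",
            Topic:   "browser-events",
        })
        defer reader.Close()

        // Connect to a (hypothetical) Cassandra keyspace.
        cluster := gocql.NewCluster("cassandra-1", "cassandra-2")
        cluster.Keyspace = "browser"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatalf("cassandra: %v", err)
        }
        defer session.Close()

        for {
            msg, err := reader.ReadMessage(context.Background())
            if err != nil {
                log.Fatalf("kafka: %v", err)
            }
            // Store the raw event keyed by account; analysis happens downstream.
            if err := session.Query(
                `INSERT INTO events (account_id, payload) VALUES (?, ?)`,
                string(msg.Key), msg.Value,
            ).Exec(); err != nil {
                log.Printf("write failed: %v", err)
            }
        }
    }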

Designing better seasonal performance loads

Every year, in preparation for the holiday shopping season, we conduct extensive performance tests on our systems several months in advance to make sure everything is ready to handle the load. While it might be easier to simply throw a bunch of load at our systems, we think it’s more effective to test for some of the interesting traffic patterns that we see during the holiday season:

  • More new errors: E-commerce companies often push out new microsites and webpages for the holidays, which may not get as much testing as their normal sites due to the holiday crunch. So, we generally see a higher volume of errors as well as a greater variety of new errors being reported.
  • Open connections: Many e-commerce websites get bombarded by users checking out hot deals, who then bounce immediately in search of the next bargain. That means low connection reuse during the Cyber Monday rush, with many connections left open during the flood of traffic.
  • SSL security: The holiday season brings increased volumes of secure purchase and payment traffic, and we see a similar uptick in SSL traffic this time of year. Because SSL places extra load on servers to manage encryption, we opted to make our load-testing traffic 100% SSL (the sketch after this list shows one way to mimic this bounce-heavy, all-SSL traffic).
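
To illustrate those last two patterns, here’s a minimal Go sketch of a test client that mimics bounce-heavy holiday traffic: every request is HTTPS, and keep-alives are disabled so each hit opens a fresh connection and pays the full TCP and TLS setup cost. The endpoint URL and payload are hypothetical stand-ins.

    // A client that mimics "bounce" traffic: every request is HTTPS and
    // keep-alives are disabled, so each hit opens a fresh connection and pays
    // the full TCP + TLS setup cost. The endpoint URL and payload are
    // hypothetical stand-ins, not real New Relic ingest addresses.
    package main

    import (
        "log"
        "net/http"
        "strings"
        "time"
    )

    func newBounceClient() *http.Client {
        return &http.Client{
            Timeout: 10 * time.Second,
            Transport: &http.Transport{
                DisableKeepAlives: true, // force a new connection per request
            },
        }
    }

    func sendEvent(client *http.Client, url string) error {
        // A tiny fake event payload; production payloads are far richer.
        resp, err := client.Post(url, "application/json",
            strings.NewReader(`{"eventType":"PageView"}`))
        if err != nil {
            return err
        }
        return resp.Body.Close()
    }

    func main() {
        client := newBounceClient()
        if err := sendEvent(client, "https://load-target.example.com/events"); err != nil {
            log.Fatal(err)
        }
    }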

We incorporated these holiday-specific traffic patterns into the larger set of load tests we use to put our systems through their paces at traffic volumes significantly higher than our holiday forecasts. These standard tests include:

  • Volume: We spun up a collection of Amazon Web Services instances generating throughput against our systems, each running custom-written services designed to stress our data pipeline. Each instance had its own external IP address to ensure that the network wouldn’t be the bottleneck.
  • Ramp: We created a linearly scaling load designed to push our systems to double our peak load. After verifying that the load functioned on a single instance, we ramped up 10 instances at a time, continuing in blocks until we hit our load targets (see the sketch after this list).
  • Request types: We replicated the distribution of traffic and data types seen under our typical production load, with load generated across different data types, such as events, transactions, errors, and more.
  • Complex session traces: Session traces are a particularly complex data type due to the large number of components that can make up a trace. To push our Cassandra cluster, we constructed abnormally large and complicated session traces to use during our load testing.
  • Account traffic types: We used different New Relic account types (Pro and Lite) to replicate the mix of user traffic we expected to see.
  • Account size: We created hypothetical customer accounts many times larger than any actual current customer to ensure that our largest customers would have smooth scaling experiences.
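
Putting the volume and ramp ideas together, here’s a simplified Go sketch of the ramp pattern. In our real tests each load generator was an AWS instance with its own external IP; in this sketch goroutines stand in for instances, and the target URL, payload, and numbers are purely illustrative.

    // rampup.go: a sketch of the ramp pattern described above. In the real
    // tests each load generator was an AWS instance with its own external IP;
    // here goroutines stand in for instances, and the target URL, payload,
    // and all numbers are illustrative.
    package main

    import (
        "log"
        "net/http"
        "strings"
        "sync/atomic"
        "time"
    )

    const (
        blockSize    = 10              // generators added per ramp step
        maxBlocks    = 20              // stop after blockSize*maxBlocks generators
        stepInterval = 2 * time.Minute // settle time between ramp steps
    )

    var requestsSent int64

    func generator(url string) {
        // 100% HTTPS with keep-alives disabled, as in the earlier sketch.
        client := &http.Client{
            Timeout:   10 * time.Second,
            Transport: &http.Transport{DisableKeepAlives: true},
        }
        for {
            resp, err := client.Post(url, "application/json",
                strings.NewReader(`{"eventType":"PageView"}`))
            if err != nil {
                continue
            }
            resp.Body.Close()
            atomic.AddInt64(&requestsSent, 1)
        }
    }

    func main() {
        url := "https://load-target.example.com/events" // hypothetical endpoint
        for block := 0; block < maxBlocks; block++ {
            for i := 0; i < blockSize; i++ {
                go generator(url)
            }
            time.Sleep(stepInterval)
            log.Printf("generators=%d requests=%d",
                (block+1)*blockSize, atomic.LoadInt64(&requestsSent))
        }
        select {} // hold at peak load until the test is stopped
    }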

While load testing is an industry-standard practice, it can sometimes get lost among competing priorities. So we’ve worked to build a team culture that makes sure this important work gets done. The New Relic Browser engineering team sits 10 feet from a wallboard TV displaying an Insights dashboard that monitors the real-time health of our systems, and we treat this load testing as part of the ongoing system-hardening work our team manages.

Modeling scaling behavior in production

Designing and setting up these load tests has taught us some invaluable lessons. In particular, it’s helped us create a quantitative model of how our systems handle load. Queueing theory suggests that a system that processes work concurrently across shared resources (memory, network, CPU, and so on) will often exhibit flat response times as it begins to process traffic. As traffic increases and begins to saturate the system’s capacity, performance eventually hits an inflection point where response times increase quickly. If traffic continues to rise, things degrade rapidly until the system is fully bogged down … and I get woken up by a late-night pager alert.
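
To make that inflection concrete, consider the simplest queueing model, M/M/1 (purely illustrative; we’re not claiming our systems follow this model exactly). If S is the time to service one request and ρ is utilization (offered load divided by capacity), the mean response time is

    R = S / (1 - ρ)

At ρ = 0.5 the response time is only 2S; at ρ = 0.9 it is 10S; at ρ = 0.95 it is 20S. The curve stays nearly flat for a long time, then bends sharply upward as utilization approaches 1, which is exactly the elbow we watch for in our load tests.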

Freeway congestion is a great analogy. At 3 a.m., few cars are on the road, and traffic runs at full speed. Even as traffic increases during the morning, there is still enough capacity for everyone to continue driving at full speed. However, as traffic continues to increase during rush hour, the road eventually reaches an inflection point when cars start slowing down as more contention creeps into the system due to cars waiting to merge or bottlenecking at ramps. If traffic increases after this inflection point, things degrade quickly until the freeway is fully gridlocked.

[Chart: response time model. Response times degrade quickly after an inflection point is reached.]

Not surprisingly, our engineering team uses New Relic to monitor New Relic. We rely on New Relic APM and Browser to track these response times, and we have New Relic Alerts set up so that if these response times start to erode, we can step in early, before the freeway becomes gridlocked.

During our load-testing exercises, monitoring response times helps us see this inflection point down the road, giving us confidence about how, and how far, our systems will scale. New Relic monitoring tells us not only what’s happening today but also what might happen six months from now.

Load testing has other benefits as well. It helps us do capacity planning, since we have a data-driven inflection point to forecast around (see the sketch below). It helps us find bottlenecks to address in order to push out the inflection points; we’ve deployed New Relic instrumentation deep into our services and data stores so we can find the best places to tune. We can stress the system with load well beyond our production traffic, which gives us extra confidence that our systems will perform smoothly under production load. And we can examine the different scaling behaviors of our various services, such as by processing time, payload size, or network, as well as the scaling behavior of our underlying infrastructure.
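
As a toy example of that kind of forecasting, here’s a small Go calculation that turns a measured inflection point and an assumed growth rate into an estimate of remaining headroom. The throughput figures and growth rate are entirely made up for illustration.

    // headroom.go: a toy capacity-planning calculation. Given the throughput
    // at which load tests showed response times inflecting, current production
    // throughput, and an assumed monthly growth rate, estimate how many months
    // of headroom remain. All numbers here are made up for illustration.
    package main

    import (
        "fmt"
        "math"
    )

    func monthsOfHeadroom(inflectionRPS, currentRPS, monthlyGrowth float64) float64 {
        // Solve current * (1+g)^m = inflection for m.
        return math.Log(inflectionRPS/currentRPS) / math.Log(1+monthlyGrowth)
    }

    func main() {
        // Hypothetical: inflection at 6M req/s, current 3.2M req/s, 5% monthly growth.
        fmt.Printf("%.1f months of headroom\n", monthsOfHeadroom(6_000_000, 3_200_000, 0.05))
        // Prints roughly 12.9 months before forecast traffic hits the inflection point.
    }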

Black Friday/Cyber Monday traffic patterns

Overall, our load test results inspired confidence that we could handle Black Friday and Cyber Monday without incident. And, in fact, both days were pretty much non-events for New Relic Browser customers.

Our historical New Relic Insights dashboards show how New Relic Browser performed during the Black Friday/Cyber Monday peak of the holiday season. Black Friday turned out to be a complete non-event, with volumes roughly in line with a typical business day. At 8 a.m. on Cyber Monday, however, we set a Browser customer traffic record for all data types (excluding load testing). We saw a 24% increase in bandwidth across a mix of five traffic data types: page views, page actions, JavaScript errors/AJAX, single-page-application events, and session-trace data.

The largest portion of our data is page views, since they are available to all New Relic account types, and as the chart below shows, page views rose 29%:

[Chart: page view data traffic during Cyber Monday.]

In addition, as we predicted, many customers deployed new deal pages specifically for Cyber Monday with less-than-average testing. Page load and JavaScript error data showed by far the largest percentage increase in traffic volume, with a 56% bump:

[Chart: increased volumes of page load and JavaScript error data.]

Getting ready for next year

As planned, our load testing far exceeded what New Relic Browser actually experienced during the holiday rush—a huge comfort to our team members carrying the on-call pagers. More important, though, we made sure our systems remained ready and available to our customers during peak periods.

Whether it’s streaming Game 7 of the World Series for MLBAM, livecasting the 2016 presidential election for Gannett’s USA Today, or making it easy for customers to buy holiday gifts from John Lewis, we’re here to help our customers keep their websites and apps up and running when it matters most.

 

For more information, see New Relic’s Browser documentation and video library.

David Copeland is a senior site reliability engineer on the New Relic Browser team based in Portland, Oregon.
