How To Store Over 200 Billion Data Points a Day to Disk

People often ask how we store the flood of data points that we receive every day, and the answer is surprising to most! The 200-billion-plus data points per day that New Relic receives are all stored in Percona Server, a variant of the MySQL database.

Most would expect us to use a NoSQL solution to store all of this data. That's what the big data players do, right? We chose Percona Server instead because of its performance characteristics, wide developer knowledge base, and proven track record. That lets us move and adapt faster to our changing needs, using tools our developers already know. Six years later, we're still using Percona, because it still performs for us.

As you can imagine, writing all of this data isn't a trivial task. A single database instance would never be able to handle this level of punishment, or scale into the future. If you're looking to scale your own database tier, here are a few of the high points of how we pull it off:

Quick Rundown

  • Data Sharding was an obvious choice from the beginning. The three categories we focus on in our sharding are Scalability, Performance, and Resiliency. Data is split based on its type and usage, which lets us tune performance for each specific workload (see the routing sketch after this list).
  • Bare Metal always wins the performance battle when you're I/O bound. All of our data storage is direct-attached and 100% SSD. There has been a lot of concern about using SSDs with databases, but we've found the drives hold up just fine: on average, we rewrote our initial batch of SSDs 180 times in 17 months, with 91% of their wear life remaining!
  • Custom Tooling to manage and monitor the health of our system has proven invaluable. Health monitoring of the databases became critical for graceful degradation of service if something broke, and we built tools that let us shuffle data around for seamless maintenance and hardware upgrades (a minimal health-check sketch follows this list). Take the time now, and save yourself the heartache later!
  • Tuning for your specific workload seems obvious, but it's harder than you might think. Try to look at as many aspects of the problem as you can. We are extremely write-heavy, and in the days of spinning disks we had to optimize for sequential writes to get data written in time. That killed reads, which are mostly random, and the solution was to create read-optimized covering indexes (see the example below). Remember the tuning that you've done and revisit it whenever you make architectural changes.
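
To make the sharding idea a bit more concrete, here is a minimal routing sketch in Python. The shard map, data-type names, and hash-on-(account, day) scheme are purely illustrative assumptions, not our production design; the point is simply that a thin routing layer decides which database gets each data point based on its type and usage.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical shard map: each data type gets its own pool of database hosts,
# so each pool can be tuned for its specific workload.
SHARD_MAP = {
    "metric_timeslice": ["metrics-db-01", "metrics-db-02",
                         "metrics-db-03", "metrics-db-04"],
    "transaction_trace": ["traces-db-01", "traces-db-02"],
}

def shard_for(data_type: str, account_id: int, timestamp: datetime) -> str:
    """Pick the database host that should store one data point.

    Hashing on (account_id, day) keeps an account's data for a given day
    on a single shard, which keeps time-range reads cheap while still
    spreading accounts evenly across the pool.
    """
    hosts = SHARD_MAP[data_type]
    day_bucket = timestamp.strftime("%Y%m%d")
    key = f"{account_id}:{day_bucket}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return hosts[digest % len(hosts)]

# Example: route a metric data point for account 42 to its shard.
print(shard_for("metric_timeslice", 42, datetime.now(timezone.utc)))
```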
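
Our internal tooling isn't public, so here's a rough sketch of the kind of health check that tooling like ours builds on, assuming the PyMySQL driver and a classic MySQL replica. The host names and threshold are placeholders.

```python
import pymysql

# Placeholder threshold; the right value depends on your workload and SLAs.
MAX_REPLICATION_LAG_SECONDS = 30

def replica_is_healthy(host: str, user: str, password: str) -> bool:
    """Return True if the replica is replicating and reasonably caught up."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cursor:
            cursor.execute("SHOW SLAVE STATUS")
            status = cursor.fetchone()
            if status is None:
                return False  # host is not configured as a replica
            threads_running = (status["Slave_IO_Running"] == "Yes"
                               and status["Slave_SQL_Running"] == "Yes")
            lag = status["Seconds_Behind_Master"]
            return (threads_running
                    and lag is not None
                    and lag <= MAX_REPLICATION_LAG_SECONDS)
    finally:
        conn.close()
```

A scheduler can run a check like this against every shard and kick off alerting, failover, or the data-shuffling tooling when a replica falls behind or drops out.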
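
And here is a sketch of what a read-optimized covering index looks like in practice. The table and column names are hypothetical; the idea is that the index contains every column the hot query touches, so MySQL can answer the read from the index alone (EXPLAIN reports "Using index") instead of doing random lookups back into the table.

```python
import pymysql

# Hypothetical schema: the covering index includes every column the query reads.
COVERING_INDEX_DDL = """
CREATE INDEX idx_metric_data_covering
    ON metric_data (account_id, metric_id, time_bucket, value_count, value_total)
"""

# Every column referenced here appears in the index, so the query is "covered".
COVERED_QUERY = """
SELECT time_bucket, value_count, value_total
  FROM metric_data
 WHERE account_id = %s
   AND metric_id = %s
   AND time_bucket BETWEEN %s AND %s
"""

def query_is_covered(conn: pymysql.connections.Connection) -> bool:
    """Check the query plan to confirm MySQL can satisfy the read from the index."""
    with conn.cursor(pymysql.cursors.DictCursor) as cursor:
        cursor.execute("EXPLAIN " + COVERED_QUERY, (42, 7, 1385000000, 1385003600))
        plan = cursor.fetchone()
        return "Using index" in (plan.get("Extra") or "")
```

The trade-off is extra write cost to maintain the wider index, which is exactly the kind of tuning worth revisiting whenever the workload or architecture changes.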

Hopefully this has piqued your interest, because I covered the lower-level details of how we accomplish our data storage in a Velocity Europe 2013 presentation (at which point we were collecting 194 billion data points a day). I addressed how to scale MySQL through hardware optimizations and sharding, all from a Site Engineering perspective, and included some real-world examples of finding pain points, identifying risks, and evaluating the tradeoffs between cloud and hardware scaling.

Still interested? Dive into the fine points in my slides below!

Jonathan Thurman is a senior solutions architect at New Relic. As a longtime site reliability engineer for New Relic, he gained valuable experience managing large-scale systems that he now uses to help New Relic customers.
