In modern software environments, like those built on scalable microservices architectures, hitting capacity limits is a common cause of production-level incidents. It’s also, arguably, a type of incident teams can often prevent through proactive planning.
At New Relic, for example, our platform is made up of services written and maintained by more than 50 engineering teams, and capacity planning is a mandate for every one of them: we can’t afford for our real-time data platform to hit capacity limits. The first time through, each team spends several days focused on the analysis and development work needed to model their capacity needs. Once they have their capacity models in place, the ongoing practice of planning occupies, at most, a few hours a quarter, a time investment that’s more than worth it if it prevents just one incident per year.
To help make the process as smooth and repeatable as possible, the New Relic site reliability engineering team publishes a “capacity planning how-to guide” that walks teams through it. This post was adapted from that guide.
What is capacity planning?
Simply put, capacity planning is the work teams do to make sure that, between planning iterations, their services have enough spare capacity to handle any likely increases in workload and enough buffer capacity to absorb normal workload spikes.
During the capacity-planning process, teams answer these four questions:
- How much free capacity currently exists in each of our services?
- How much capacity buffer do we need for each of our services?
- How much workload growth do we expect between now and our next capacity-planning iteration, factoring in both natural customer-driven growth and new product features?
- How much capacity do we need to add to each of our services so that we’ll still have our targeted free capacity buffer after any expected workload growth?
The answers to those four questions—along with the architectures and uses of the services—help determine the methodology our teams use to calculate their capacity needs.
We use three common methodologies to calculate how much free capacity exists for a given service:
- Service starvation
- Load generation
- Static-resource analysis
It’s important to note that each component of a service tier (for example, application host, load balancer, or database instances) requires separate capacity analysis.
Service starvation involves reducing the number of service instances available to a service tier until the service begins to falter under a given workload. The amount of resource “starvation” that’s possible without causing the service to fail represents the free capacity in the service tier.
For example, a team has 10 deployed instances of service x, which together handle 10K requests per minute (RPM) in a production environment. The team finds that it’s able to reduce the number of instances of service x to 8 and still support the same workload.
This tells the team two things:
- A single service instance is able to handle a maximum of 1.25K RPM (in other words, 10K RPM divided by 8 instances).
- The service tier normally has 20% free capacity: Two “free” instances equals 20% of the service tier.
Of course, this scenario assumes that the service tier supports a steady state of 10K RPM; if the workload is spiky, there may actually be less (or more) than 20% free capacity across the 10 service instances.
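The arithmetic in this example fits in a few lines of code. The following is a minimal Python sketch rather than anything from the original guide: the function name `starvation_results` and its parameters are hypothetical, and it assumes the steady-state workload described above.

```python
def starvation_results(deployed: int, minimum: int, workload_rpm: float) -> tuple[float, float]:
    """Return (max RPM per instance, free-capacity fraction) from a starvation test.

    deployed:     instances normally running in the service tier
    minimum:      fewest instances that still supported the workload
    workload_rpm: steady-state workload during the test (an assumption)
    """
    if not 0 < minimum <= deployed:
        raise ValueError("minimum must be between 1 and deployed")
    max_rpm_per_instance = workload_rpm / minimum
    free_fraction = (deployed - minimum) / deployed
    return max_rpm_per_instance, free_fraction


per_instance, free = starvation_results(deployed=10, minimum=8, workload_rpm=10_000)
print(f"{per_instance:.0f} RPM per instance, {free:.0%} free")
# 1250 RPM per instance, 20% free
```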
Load generation is effectively the inverse of service starvation. Rather than scaling down a service tier to the point of failure, you generate synthetic loads on your services until they reach the point of failure.
The amount of synthetic workload that you were able to process successfully, expressed as a percentage of your normal workload, represents the free capacity in your service tier.
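As a rough illustration, the sketch below turns a load-generation result into a free-capacity figure. It assumes the synthetic load was added on top of the normal production workload. The function name and the input numbers are hypothetical. The second return value, headroom as a share of total capacity, is an added assumption that makes the result comparable with the service-starvation figure.

```python
def load_test_headroom(normal_rpm: float, extra_rpm: float) -> tuple[float, float]:
    """Return headroom as (fraction of normal workload, fraction of total capacity).

    normal_rpm: the tier's usual workload
    extra_rpm:  synthetic load handled on top of it before the tier faltered
    """
    if normal_rpm <= 0 or extra_rpm < 0:
        raise ValueError("normal_rpm must be positive and extra_rpm non-negative")
    of_normal = extra_rpm / normal_rpm
    of_total = extra_rpm / (normal_rpm + extra_rpm)
    return of_normal, of_total


# Hypothetical test: 2.5K RPM of synthetic load on top of a 10K RPM workload.
of_normal, of_total = load_test_headroom(normal_rpm=10_000, extra_rpm=2_500)
print(f"{of_normal:.0%} of normal workload, {of_total:.0%} of total capacity")
# 25% of normal workload, 20% of total capacity
```

The two percentages differ because they use different denominators, so it helps to record which one a team is reporting.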
Static-resource analysis involves identifying the most constrained computational resource for a given service tier (typically CPU, memory, disk space, or network I/O) and determining what percentage of that resource is available to the service as it’s currently deployed.
Although this can be a quick way to estimate free capacity in a service, there are a few important gotchas:
- Some services have dramatically different resource consumption profiles at different points in their lifecycle (for example, in startup mode versus normal operation).
- It may be necessary to look at an application’s internals to determine free memory. For example, an application may allocate its maximum configured memory at startup time even if it’s not using that memory.
- Resources in a network interface controller (NIC) or switch typically reach saturation at a throughput rate lower than the maximum advertised by manufacturers. Because of this, it’s important to benchmark the actual maximum possible throughput rather than relying on the manufacturer’s specs.
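A minimal sketch of static-resource analysis might look like the following. The function name, resource names, and measurements are all hypothetical. In line with the gotchas above, the "available" figures should be benchmarked maximums rather than advertised specs, and the "in use" memory figure may need to come from the application’s internals.

```python
def most_constrained(usage: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Return (resource name, free fraction) for the most heavily used resource.

    usage maps a resource name to (amount in use, amount available).
    """
    utilization = {name: used / total for name, (used, total) in usage.items()}
    name = max(utilization, key=utilization.get)
    return name, 1.0 - utilization[name]


# Hypothetical peak-period measurements for one service instance.
peak_usage = {
    "cpu_cores": (8.0, 10.0),
    "memory_gb": (24.0, 64.0),
    "disk_gb": (55.0, 100.0),
    "network_gbps": (2.0, 8.0),
}
resource, free = most_constrained(peak_usage)
print(f"{resource} is most constrained with {free:.0%} free")
# cpu_cores is most constrained with 20% free
```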
No matter which methodology you choose, experiment during both peak and non-peak workload periods to get an accurate understanding of what the service can handle.
Now, let’s look at how to apply these methodologies in a capacity-planning exercise.
Capacity planning: a complete example
Our capacity planning comprises five main steps. Teams work through these steps and calculate their capacity needs in a template, an example of which is included below.
- List your services, and calculate each service’s free capacity using one of the methodologies discussed above. Free capacity is generally best expressed as a percentage of overall capacity; for example, “This service tier has 20% free capacity.”
- Determine the safest minimum amount of free capacity you need for each service. Free capacity is your buffer against unexpected workload spikes, server outages, or performance regressions in your code.
  - Typically, we recommend a minimum of 30% free capacity for each service.
  - In all cases, teams should scale their services to at least n+2; that is, they should be able to lose two instances and still support the service tier’s workload.
- Determine when you will next review your capacity needs, and hold that date. You should review services that are mature and experiencing typical growth quarterly, and review new services or those that are experiencing rapid growth monthly.
- Project the percentage of workload growth that your service is likely to experience before your next capacity review meeting. Base this projection on historical trend data and any known changes—such as new product features or architectural changes—that may impact this growth.
- Calculate how much capacity you’ll need to add to your service before your next capacity review, so that you can maintain your target free capacity and support your expected growth.
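The two recommendations in step two (a 30% minimum and n+2) can pull in different directions for small service tiers. The sketch below shows one way to reconcile them by taking whichever implies the larger buffer. It is an interpretation rather than a rule from the guide, and the function name and defaults are hypothetical.

```python
def minimum_target_free(tier_size: int, baseline: float = 0.30, spare_instances: int = 2) -> float:
    """Smallest free-capacity fraction that satisfies both recommendations.

    tier_size:       total instances in the service tier
    baseline:        recommended minimum free capacity (30% in the steps above)
    spare_instances: instances the tier must be able to lose (2 for n+2)
    """
    if tier_size <= spare_instances:
        raise ValueError("tier is too small to lose the spare instances")
    return max(baseline, spare_instances / tier_size)


print(minimum_target_free(60))  # 0.3  (30% already covers two instances)
print(minimum_target_free(5))   # 0.4  (two of five instances is 40%)
```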
Capacity planning template
Record the results of your calculations in a template, and make the information accessible to all stakeholders in the larger engineering organization (for example, site reliability engineers, engineering managers, and product owners). This sample template covers capacity planning for a Java-based service:
Java SVC instances
| Row | Item | Value or calculation |
|---|---|---|
| C | Scheduled date for next capacity planning exercise | 7/1/2019 |
| D | Methodology used | Static-resource analysis |
| E | Current service tier size (# of hosts or container instances) | 60 |
| F | Current cores per service instance | 10 |
| G | Current storage per service instance | 100GB |
| H | Determined free capacity (as a percentage) | 20% |
| I | Formula: E - (E * H) | 60 - (60 * .2) = 48 |
| J | Target free capacity (as a percentage; should represent at least two free instances, or n+2) | 30% |
| K | Expected workload growth until date of C | 15% |
| L | Capacity needed to service minimum planned workload. Formula: I + (I * K) | 48 + (48 * .15) = 55.2 |
| M | Capacity needed to maintain target free capacity. Formula: L + (L * J) | 55.2 + (55.2 * .3) = 71.76 |
| N | Additional capacity to be added. Formula: roundup(M - E) | roundup(71.76 - 60) = 12 |
| O | Additional cores needed. Formula: N * F | 12 * 10 = 120 |
| P | Additional storage needed. Formula: N * G | 12 * 100GB = 1.2TB |
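The template’s formulas also translate directly into code, which can help teams rerun the numbers each quarter. The following is a sketch rather than an official tool: the class and field names are hypothetical, and the inputs are read off the worked calculations in the template (a tier of 60 instances, 20% free capacity, a 30% target, and 15% expected growth).

```python
import math
from dataclasses import dataclass


@dataclass
class CapacityPlan:
    tier_size: int                # E: current hosts or container instances
    cores_per_instance: int       # F
    storage_gb_per_instance: int  # G
    free_fraction: float          # H: determined free capacity
    target_free_fraction: float   # J: target free capacity
    expected_growth: float        # K: growth until the next review

    def instances_to_add(self) -> int:
        in_use = self.tier_size - self.tier_size * self.free_fraction  # I
        planned = in_use + in_use * self.expected_growth               # L
        with_buffer = planned + planned * self.target_free_fraction    # M
        # Round away floating-point noise before rounding up (N).
        return math.ceil(round(with_buffer - self.tier_size, 9))


plan = CapacityPlan(
    tier_size=60,
    cores_per_instance=10,
    storage_gb_per_instance=100,
    free_fraction=0.20,
    target_free_fraction=0.30,
    expected_growth=0.15,
)
extra = plan.instances_to_add()
print(extra, extra * plan.cores_per_instance, extra * plan.storage_gb_per_instance)
# 12 120 1200  (instances, cores, GB of storage: rows N, O, and P)
```

A negative result from `instances_to_add` would mean the tier already has more capacity than the plan calls for.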
Iterate on the capacity plan
After teams establish a regular cadence of capacity planning, it’s necessary to iterate on the data they collect. Future decisions about capacity should be informed by the difference between the forecasted capacity and the capacity they actually needed.
Teams should ask such questions as: What accounted for any differences? Did services grow as expected? Or did growth slow? Were there architectural changes or new features to account for?
These questions can reveal whether teams’ forecasts accounted accurately for organic growth.
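One simple way to start that review is to compare the growth that was forecast with the growth that actually occurred. The sketch below assumes a team records its workload at the start and end of the planning period; the function name and the numbers are hypothetical.

```python
def growth_forecast_error(start_workload: float, end_workload: float, forecast_growth: float) -> float:
    """Return actual growth minus forecast growth, both as fractions.

    A positive result means the service grew faster than forecast.
    """
    if start_workload <= 0:
        raise ValueError("start_workload must be positive")
    actual_growth = (end_workload - start_workload) / start_workload
    return actual_growth - forecast_growth


# Hypothetical quarter: forecast 15% growth, workload went from 10K to 12K RPM.
error = growth_forecast_error(10_000, 12_000, forecast_growth=0.15)
print(f"{error:+.1%}")  # +5.0%
```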
In rare cases, teams may struggle to properly calculate their capacity needs. Here we advise teams to plan time to reduce their capacity needs or work to make their services more efficient.
We also encourage teams to set up proactive alerting (for example, for out-of-memory kills or CPU throttling) in case actual growth exceeds their forecasts and their services hit capacity limits before the next review.
Of course, even the best capacity planning efforts won’t necessarily prevent all related production incidents in modern software environments. But careful, consistent, iterative, and proactive planning can go a long way towards minimizing capacity-related incidents. And that makes it totally worth the effort.