At Rackspace, we’re immensely excited about our partnership with New Relic that helps power our DevOps Automation Service. Since announcing the partnership in early October, we’ve turned our attention to maximizing the impact of Fanatical Support and Application Monitoring in solving common customer problems. In this post I’ll address one of these areas, auto scale.
Auto scale is the act of changing the pool of resources able to serve requests without manual intervention. Unfortunately, the term is often used as a throwaway comment, which does a disservice to the complexity of the engineering required to do auto scale properly. To demonstrate, let’s examine what good auto scaling looks like and how we are building it at Rackspace.
Before we dive into the details, it may be helpful to parse the definition:
The act of changing the pool of resources able to serve requests without manual intervention
My choice of words was very deliberate, helping to show just how complex good auto scaling is:
- Changing: I used “changing” rather than “increasing” because a proper auto scale implementation should be as good (if not better!) at scaling down as it is scaling up. Otherwise it’s a recipe for customer disruption and an ever-growing infrastructure bill!
- Pool of Resources: This phrase implies that you are intending to horizontally scale by adding new nodes rather than vertically scale by adding more RAM, CPU, etc. It is incredibly hard to auto scale vertically. Equally, horizontal auto scale requires a particular kind of application design.
- Without Manual Intervention: This seems obvious; it suggests that the process to scale up or down can be triggered automatically. This is the element that typically defines the quality of auto scale and can be the difference between success and failure in a scaling situation.
Successful auto scale is all about the when
Setting the trigger point for an auto scale workflow is critical. Too early and you’re paying for more infrastructure than you need. Too late and your application may struggle to cope with the increased demand.
There are two sides to optimizing scaling: One is to decrease the time required to “spin-up” new infrastructure. The other is to find early indicators in your monitoring data that maximize the time before disruption occurs.
Most auto scale implementations use very rudimentary points of measurement for triggering auto scale, such as CPU load average, RAM utilization, or basic HTTP availability checks. Those are blunt instruments where a much more targeted and surgical approach is required.
CPU load, for example, varies across the day based on user load, but also due to scheduled jobs like patching and backup. High CPU usage does not mean you need to scale; it doesn’t even mean your application is down for your users. On the other hand, high CPU load may be the final warning before a system crash. Because it cannot tell these cases apart, it is not the right metric to rely on by itself when deciding when to start an auto scale event.
When designing an auto scale strategy, we consider the following points:
- Severity of spike: Is this a Reddit spike or a casual increase? One million extra users in 10 minutes is different from a million extra users over four days.
- Time to provision: Benchmark and test the time it takes to create new infrastructure. How long does it take for a Web server to be built and serve content?
- Per node capacity: Load test individual nodes and your baseline platform to better understand how much natural elasticity is in your platform before scaling is needed.
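To make these three points concrete, here is a minimal Python sketch of how they might combine into a trigger decision. It is an illustration only, not a description of how Rackspace implements this: the function names, the linear growth assumption, and the safety factor are all hypothetical, and the numbers you feed in would come from your own load tests and provisioning benchmarks.

```python
import math


def minutes_until_saturation(current_rps, growth_rps_per_min,
                             node_count, per_node_capacity_rps):
    """Estimate minutes until the pool has no headroom left.

    Assumes (hypothetically) that load keeps growing linearly at the
    observed rate. per_node_capacity_rps comes from load testing a node.
    """
    headroom = node_count * per_node_capacity_rps - current_rps
    if headroom <= 0:
        return 0.0            # already saturated
    if growth_rps_per_min <= 0:
        return math.inf       # flat or falling load never saturates
    return headroom / growth_rps_per_min


def should_scale_now(current_rps, growth_rps_per_min, node_count,
                     per_node_capacity_rps, provision_minutes,
                     safety_factor=1.5):
    """Trigger when the time left is no longer comfortably longer than
    the benchmarked time to provision a new node."""
    remaining = minutes_until_saturation(
        current_rps, growth_rps_per_min, node_count, per_node_capacity_rps)
    return remaining <= provision_minutes * safety_factor


def nodes_needed(target_rps, per_node_capacity_rps):
    """Smallest node count whose combined capacity covers target_rps."""
    return math.ceil(target_rps / per_node_capacity_rps)
```

With four nodes rated at 200 requests per second each, 600 requests per second of current load, and a five-minute build time, a spike growing by 50 requests per second every minute should trigger scaling immediately, while growth of 1 request per second per minute leaves hours of headroom and should not.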
Finally, consider the cost/benefit of improving your scaling strategy. If it costs $X in terms of time and resource to improve scaling by 20%, does that pay for itself in users that you can serve during a spike versus having them bounce out until your solution catches up? The difference in this metric between brochure-ware sites and e-commerce operations should be factored in before embarking on a potentially expensive engineering project.
Rackspace and New Relic
At Rackspace, we use New Relic to help our customers build smart triggers for their scaling events by understanding the profile of their application and the things the business cares about, including error rates, transaction/page load times, and total connections. With these kinds of measurements combined with classic CPU/RAM/disk monitoring we can build more accurate pictures of a customer environment and understand whether scaling is needed and how aggressive it needs to be.
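As a rough sketch of what such a trigger could look like, the Python below combines application signals with CPU. Everything in it is an assumption made for illustration: the `AppMetrics` and `Thresholds` types, the threshold values, and the scoring rule are hypothetical, and the metrics are assumed to have already been fetched from your monitoring system (no New Relic API calls are shown).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppMetrics:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    response_time_ms: float  # transaction / page load time
    total_connections: int
    cpu_percent: float       # classic infrastructure metric


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; tune them to what the business cares about.
    error_rate: float = 0.02
    response_time_ms: float = 800.0
    total_connections: int = 5000
    cpu_percent: float = 85.0


def nodes_to_add(m, t=Thresholds()):
    """Return how many nodes to add, scaled by how many signals agree."""
    app_breaches = sum([
        m.error_rate > t.error_rate,
        m.response_time_ms > t.response_time_ms,
        m.total_connections > t.total_connections,
    ])
    if app_breaches == 0:
        # High CPU with healthy application metrics (a backup job, say)
        # is not treated as a reason to scale.
        return 0
    cpu_hot = m.cpu_percent > t.cpu_percent
    # More agreeing signals means a more aggressive scale-up.
    return app_breaches + (1 if cpu_hot else 0)
```

The design choice here is that CPU never triggers scaling by itself; it only makes an application-level trigger more aggressive.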
On the flip side, when this same rich set of measures shows stable traffic, we can then work out when and how to scale services back down as quickly and safely as possible, lowering infrastructure costs in line with usage.
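A scale-down check might be sketched as follows. This is a simplified, hypothetical example rather than our production logic: the window length, minimum node count, and target utilization are assumed values, and request rate stands in for whichever traffic measure you trust.

```python
def can_scale_down(recent_rps, node_count, per_node_capacity_rps,
                   min_nodes=2, window=6, target_utilization=0.7):
    """Decide whether one node can be removed safely.

    recent_rps is a list of request-rate samples, oldest first.
    A node is removed only if every sample in the most recent window
    would still fit on the smaller pool at the target utilization.
    """
    if node_count <= min_nodes:
        return False          # never drop below the safety floor
    if len(recent_rps) < window:
        return False          # not enough history to call traffic stable
    samples = recent_rps[-window:]
    capacity_after = (node_count - 1) * per_node_capacity_rps * target_utilization
    return max(samples) <= capacity_after
```

Using the peak of the window rather than the average is a deliberately cautious choice: one busy sample is enough to postpone the scale-down until traffic has been stable for a full window.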
Since the launch of Rackspace’s DevOps Automation Service earlier this year, we have found that we can spend a lot of engineering time on optimizing provisioning time of new infrastructure, but that the easiest and quickest way to improve customer scaling is to build more intelligent triggers based on real application metrics. We get a plethora of potential data points from New Relic, which helps us serve customers across very broad application backgrounds and workload types.
For more information on the Rackspace DevOps Automation Service, please visit http://www.rackspace.com/devops