From the infrastructure layer to the application’s front end, your system has limits. And when you push it beyond its thresholds, your system behaves in unexpected ways. Services degrade. Outages occur. And although we often think of incident response in terms of finding and correcting the root cause of the problem, in many cases, there isn’t a single cause. A sudden spike in demand for a service can lead to a large number of requests building up in a message queue, which is normally not a problem. But combine that with a misconfigured parameter that prevents the message queue cluster from scaling up resources, and you’re left with a degraded and possibly disrupted service.

While specific behaviors may be difficult to predict, SRE and DevOps teams can plan for macro-level issues such as saturation, high latency, and excessive workloads. Part of that planning should include designing for graceful degradation of services, which allows for more limited functionality while avoiding catastrophic failure. Network engineers, for example, design networks to reroute traffic if a path is no longer available. This may lead to higher saturation on other routes and longer latencies, but the traffic eventually reaches its destination. Large-scale enterprise applications can similarly benefit from designing for graceful degradation.

When designing your systems for resiliency and graceful degradation, consider which of these four established practices works best for your services. They are listed from most to least impactful on your users:

  1. Shedding workload
  2. Time shifting workloads
  3. Reducing quality of service
  4. Adding more capacity

Let’s take a closer look at each one:

1. Shed the workload

When demand on a power generation facility exceeds capacity and risks damaging the entire system, engineers shut down power in parts of the grid. This is known as load shedding and is an apt analogy for addressing excessive load on a distributed system. For example, when requests for API calls, database connections, or persistent storage exceed current capacity, some of those requests are dropped.

A key design consideration with load shedding is deciding which requests to drop. The simplest approach is to drop every request that arrives while a threshold is exceeded. While relatively easy to implement, this approach doesn’t distinguish between different levels of service or the priority of a particular request. For example, health checks should be given priority over other requests to a service.
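To make the priority idea concrete, here is a minimal sketch of a priority-aware load shedder. The `Priority` tiers, the `LoadShedder` class, and the threshold values are all hypothetical names and numbers chosen for illustration; in practice you would tune the thresholds to your own service.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical request tiers; a lower value means more important."""
    CRITICAL = 0    # e.g. health checks
    NORMAL = 1      # e.g. interactive user requests
    BACKGROUND = 2  # e.g. batch or prefetch traffic


def default_thresholds():
    # Utilization at which each tier starts being dropped (assumed values).
    return {
        Priority.CRITICAL: 1.0,
        Priority.NORMAL: 0.85,
        Priority.BACKGROUND: 0.6,
    }


@dataclass
class LoadShedder:
    capacity: int  # maximum concurrent requests the service can handle
    thresholds: dict = field(default_factory=default_thresholds)
    in_flight: int = 0

    def try_admit(self, priority):
        """Return True and count the request, or False if it should be shed."""
        utilization = self.in_flight / self.capacity
        if utilization >= self.thresholds[priority]:
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Call when an admitted request finishes."""
        self.in_flight = max(0, self.in_flight - 1)
```

With this policy, background traffic is shed first, normal traffic next, and critical requests such as health checks are only dropped once the service is completely full.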

2. Time shift the workload

Shedding load may not be an option for some services. If you run a wildly successful marketing campaign and new customers flood your e-commerce site, you don’t want to drop order transactions. A better option is to time shift the processing of the excessive workload. This technique decouples the generation of a request from the processing of that request.
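The decoupling can be sketched with a simple buffer that sits between accepting an order and processing it. This example uses Python's standard-library `queue.Queue` as an in-memory stand-in for a durable message broker; the `OrderBuffer` class and its method names are hypothetical, and a real system would need persistence so that buffered orders survive a restart.

```python
import queue


class OrderBuffer:
    """In-memory stand-in for a durable message broker (illustrative only)."""

    def __init__(self, max_pending):
        self._pending = queue.Queue(maxsize=max_pending)

    def accept(self, order):
        """Producer side: enqueue the order and acknowledge right away."""
        try:
            self._pending.put_nowait(order)
            return True
        except queue.Full:
            # The buffer has limits too; the caller must decide what to do.
            return False

    def drain(self, handler, batch_size):
        """Consumer side: process at most batch_size orders per call."""
        handled = 0
        while handled < batch_size:
            try:
                order = self._pending.get_nowait()
            except queue.Empty:
                break
            handler(order)
            handled += 1
        return handled

    def backlog(self):
        """Number of orders waiting to be processed."""
        return self._pending.qsize()
```

The front end calls `accept` at whatever rate customers arrive, while a worker calls `drain` at the rate the backend can sustain. The backlog grows during the spike and shrinks afterward.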

Message queues, such as Apache Kafka and Google Cloud Pub/Sub, are widely used to buffer data for this kind of asynchronous processing. However, you should consider the scope of transactions when using this technique. Each decoupled service may implement transactions, but if a series of service calls must succeed or fail together, you may need additional logic to ensure transactions that span multiple services are rolled back correctly.
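One common way to add that logic is with compensating actions, often called the saga pattern: each step is paired with an undo operation, and if a later step fails, the completed steps are undone in reverse order. The sketch below is a simplified illustration of that idea, not a prescription. The `run_saga` function is a hypothetical name, and it assumes every compensating action succeeds, which a production system can't take for granted.

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order.

    If any action raises, call the compensations for the steps that
    already completed, most recent first, and return False.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

For an order flow, the steps might be reserving inventory, charging a payment, and scheduling shipment. If the payment call fails, the inventory reservation is released rather than left dangling.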

3. Reduce the quality of service

When you don’t want to shed or time shift workloads but still need to reduce the load on the system, you may be able to reduce the quality of service. For example, if your system becomes stressed, you could temporarily reduce which features are available or switch to approximate database queries instead of exact queries. An advantage of this approach is that you can still service all requests instead of dropping some. This strategy also helps you avoid the delays associated with time shifting.
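As a rough illustration of reducing which features are available, the sketch below switches off non-essential features as utilization climbs. The feature names and cutoff values are hypothetical; the point is only that each feature has a load level beyond which it is temporarily disabled, while core functionality stays on.

```python
# Hypothetical features, each with the utilization at or above which it is
# switched off. float("inf") marks a core feature this policy never disables.
FEATURE_CUTOFFS = {
    "checkout": float("inf"),
    "search": 0.95,
    "recommendations": 0.80,
    "live_inventory_counts": 0.70,
}


def enabled_features(utilization):
    """Return the set of features that stay on at the given utilization (0.0 to 1.0)."""
    return {
        name
        for name, cutoff in FEATURE_CUTOFFS.items()
        if utilization < cutoff
    }
```

Every request is still served, but under heavy load a customer might see a page without recommendations or live inventory counts.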

4. Add more capacity

From a customer experience point of view, it’s best not to shed workloads, time shift workloads, or degrade services at all. That often makes adding capacity the best choice for dealing with spikes in workloads. Cloud providers allow you to autoscale VMs and other infrastructure components, and Kubernetes can automatically scale pods in response to changes in workloads. These operations rarely require human intervention unless the pool of available resources is exhausted; for example, a failure in a zone within a public cloud can lead to a sudden demand for resources in the remaining zones within the region.
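To show how an autoscaler might size a deployment, here is a simplified sketch based on the ratio rule that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric). The function name and the `capped` flag are assumptions for illustration, and the sketch ignores the tolerance and stabilization behavior of the real autoscaler.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Return (replicas, capped) using the ratio-based scaling rule.

    capped is True when the ideal replica count exceeds max_replicas,
    meaning the pool is exhausted and a human may need to step in.
    """
    ideal = math.ceil(current_replicas * current_metric / target_metric)
    replicas = max(min_replicas, min(ideal, max_replicas))
    return replicas, ideal > max_replicas
```

For example, four pods averaging 90% CPU against a 60% target would scale to six pods. If the ideal count exceeds the maximum, the `capped` flag signals the exhausted-pool case where one of the other three practices has to take over.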

In the end, some advance capacity planning and proactive monitoring can help your teams create more resilient services.

Expect the expected

The ultimate promise of modern software is 100% uptime, but SRE and DevOps teams have to prepare for the workload patterns and complex failure modes in their systems that are unavoidable yet expected. If you take these factors into account when designing your applications, you’ll be better prepared to build systems that are resilient and capable of degrading gracefully, ideally without ever affecting your uptime.

Beyond graceful degradation, learn how to monitor your full cloud experience in our ebook, Adopt, Experiment, and Scale: Core Capabilities for Cloud Native Success.

Dan Sullivan is a Principal Engineer and Developer Advocate at New Relic. He specializes in data and cloud architecture.
