The Alerts team at New Relic has two high-level requirements in managing the alerts pipeline: timeliness and accuracy. We must ensure that alerts are delivered in a timely manner, and we must accurately recognize violations (no false positives or negatives). Late detection of an issue is not acceptable, and neither is waking up our users in the middle of the night with a false alarm. Because of these essential concerns, we must be vigilant: even the smallest change to our pipeline can have serious consequences.

Typically, when people think of things that affect software applications, they think of code changes. But what about code deployments? While code deployments are necessary to the health and improvement of any system, there is probably no other intentional action that you take on a recurring basis that deliberately causes service disruptions. It is how you deal with these intentional disruptions that makes all the difference.

If you make regular code deployments to an active pipeline, and need to limit service disruptions for users as well as upstream and downstream services, you may want to make use of what we call “rolling” deployments.

The core challenge: alerts are two-dimensional

Before I go any further, here is a high-level overview of the alerts pipeline:

[Diagram: a high-level view of the alerts pipeline]

User data from New Relic comes into the alerts pipeline on Kafka topics, and we evaluate it based on the user’s alert settings, sending notifications when appropriate.

Now, to help you understand our deployment scheme and how we mitigate disruption, I first need to explain that alerts are “two-dimensional.” Essentially, the configuration of an alert covers both a threshold (a comparison of a value to a limit) and a period of time (the duration that the signal must be in violation before a notification occurs).

[Chart: a threshold violation in the alerts pipeline]

Example of a violation that opened after X minutes of violating the condition and closed after being healthy for X minutes.
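To make the two dimensions concrete, here is a minimal sketch in Java of what such a condition might look like. The `AlertCondition` name and its fields are illustrative assumptions on my part, not New Relic’s actual API:

```java
import java.time.Duration;

/**
 * Illustrative sketch of a "two-dimensional" alert condition: a threshold
 * that each value is compared against, plus how long the signal must stay
 * in violation before a notification opens. Names are hypothetical.
 */
public record AlertCondition(double threshold, Duration requiredDuration) {

    /** True if a single data point breaches the threshold. */
    public boolean breaches(double value) {
        return value > threshold;
    }
}
```

For example, `new AlertCondition(0.95, Duration.ofMinutes(5))` would read as “notify when the signal stays above 0.95 for five minutes.”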

In order to provide this functionality, certain portions of the system have to maintain a stateful buffer of the data to be evaluated. As data arrives, it’s added to any already buffered data and then evaluated to find sequences that violate (or no longer violate) the configured threshold. If the sequence of stateful data is broken by a deployment, that could have undesirable consequences; for example, we could fail to detect violating sequences, fail to post notifications on time, or fail to detect that a violation has stopped.
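As a simplified sketch (again my own illustration, not New Relic’s implementation, and it keeps only summary state rather than the full data buffer), the stateful evaluation could look something like this:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Simplified sketch of stateful threshold evaluation: a violation opens
 * only after the signal has continuously breached the threshold for the
 * required duration, and it closes as soon as a healthy point arrives.
 */
public class ThresholdEvaluator {
    private final double threshold;
    private final Duration requiredDuration;

    private Instant breachStart;   // when the current breaching sequence began
    private boolean violationOpen; // whether a violation has already opened

    public ThresholdEvaluator(double threshold, Duration requiredDuration) {
        this.threshold = threshold;
        this.requiredDuration = requiredDuration;
    }

    /** Feed one data point; returns true when a violation opens. */
    public boolean onDataPoint(Instant timestamp, double value) {
        if (value > threshold) {
            if (breachStart == null) {
                breachStart = timestamp; // a breaching sequence starts
            }
            if (!violationOpen
                    && Duration.between(breachStart, timestamp)
                               .compareTo(requiredDuration) >= 0) {
                violationOpen = true; // breached long enough: open a violation
                return true;
            }
        } else {
            breachStart = null;    // a healthy point breaks the sequence
            violationOpen = false; // and closes any open violation
        }
        return false;
    }
}
```

The fields `breachStart` and `violationOpen` are exactly the kind of state a deployment can destroy: if they are lost while an instance restarts mid-sequence, the evaluator either misses a violation or fails to notice that one has ended, which is why the state must be flushed and restored rather than discarded.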

Enter rolling deployments

A traditional process for deploying web applications typically works something like this:

  1. Shift new traffic away from the instance via a load balancer.
  2. Allow current connections to finish.
  3. Stop the old instance.
  4. Start the new instance.
  5. Add the new instance to the load balancer, or shift traffic to the new instance.

To mitigate the issues that we face in the alerts system, we use a process very similar to that of a traditional web application. We call this approach a “rolling deployment,” which is the process outlined above combined with a controlled shutdown of each service instance. This means that a deployment of any service in the system is done one instance at a time. As each instance is stopped by our deploy mechanism, we ensure that (see the sketch after this list):

  1. The consumption of incoming data is stopped.
  2. All inflight processing completes.
  3. All current state is flushed to a shared memory cache.
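Here is a hedged sketch of what that controlled shutdown could look like for a Kafka-based instance. `KafkaConsumer.wakeup()` is Kafka’s standard way to interrupt a blocked `poll()`; the `StateStore` interface stands in for whatever shared memory cache is actually used and is an assumption of this sketch:

```java
import java.time.Duration;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

/** Sketch of a controlled shutdown for one stream-processing instance. */
public class ControlledShutdownLoop {

    /** Stand-in for the shared memory cache; a hypothetical interface. */
    interface StateStore {
        void put(Map<String, byte[]> state);
    }

    private final KafkaConsumer<String, String> consumer;
    private final StateStore sharedCache;

    public ControlledShutdownLoop(KafkaConsumer<String, String> consumer,
                                  StateStore sharedCache) {
        this.consumer = consumer;
        this.sharedCache = sharedCache;
    }

    /** Runs the poll loop; expected to be called from the main thread. */
    public void run() {
        Thread pollThread = Thread.currentThread();
        // The deploy mechanism stops the process (e.g., SIGTERM); this hook
        // interrupts the blocked poll() and waits for shutdown to finish.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            consumer.wakeup(); // step 1: stop consuming incoming data
            try {
                pollThread.join();
            } catch (InterruptedException ignored) {
            }
        }));

        try {
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofMillis(500))) {
                    process(record); // normal evaluation path
                }
            }
        } catch (WakeupException e) {
            // Expected: wakeup() was called, so no more data is consumed.
        } finally {
            drainInflightWork();                   // step 2: finish in-flight work
            sharedCache.put(currentBufferState()); // step 3: flush state
            consumer.close();                      // leave the group cleanly
        }
    }

    private void process(ConsumerRecord<String, String> record) { /* ... */ }
    private void drainInflightWork() { /* ... */ }
    private Map<String, byte[]> currentBufferState() { return Map.of(); }
}
```

Calling `close()` at the end makes the consumer leave its group cleanly, so its partitions are handed to the surviving instances right away.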

Once we’ve stopped the instance, we can start the new instance in the same place. We repeat this process until all instances that make up the application cluster have been replaced with the new distribution of code.

No matter your deployment strategy, minimizing the disruption to the processing of traffic and data during a deploy of new code, or during the addition of new instances to the system, is always crucial. In our case, we want to ensure that no more than one instance is lost at a time, so that data processing continues throughout the deployment and the capacity of our system is never meaningfully reduced. Stopping many (or all) instances at once could affect alert detection.

What about upstream and downstream services?

Equally important to ensuring an uninterrupted flow of data during a deployment is knowing how the systems around the deployment target behave. In other words, how will the systems upstream and downstream from the system that is currently being deployed react to the deployment?

For instance, the alerts system uses Apache Kafka to provide a non-blocking, persistent stream of data between services in the pipeline. One feature of Kafka is that topics can be subdivided into partitions that are assigned to consumers to increase the parallel throughput from the topic. This feature has one small downside with regard to deployment: each time a consumer stops or releases its partitions, Kafka rebalances the partitions among the remaining consumers to ensure that the data is processed. Losing one consumer is not a big deal. However, if you stop all n consumers in your pipeline, one at a time, Kafka will perform n partition rebalances, which can cause issues for any stateful system.

To combat the issue of data being reassigned during a deployment, we use the same controlled shutdown mechanism discussed above. Our goal is to ensure that any remaining instance that receives a partition from the stopped instance can retrieve the up-to-date buffer that was flushed when that instance shut down.
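One way to wire this up, assuming the standard Kafka consumer API (the `SharedBufferCache` and `BufferManager` interfaces below are hypothetical stand-ins for the shared memory cache and the instance’s in-memory buffers), is a `ConsumerRebalanceListener` that flushes buffers when partitions are revoked and restores them when partitions are assigned:

```java
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

/**
 * Sketch: flush per-partition buffers when partitions are revoked and
 * restore them when partitions are assigned, so state survives rebalances.
 */
public class StatefulRebalanceListener implements ConsumerRebalanceListener {

    /** Hypothetical interfaces, shown only to make the sketch compile. */
    interface SharedBufferCache {
        void save(TopicPartition partition, byte[] buffer);
        byte[] load(TopicPartition partition);
    }
    interface BufferManager {
        byte[] remove(TopicPartition partition);
        void put(TopicPartition partition, byte[] buffer);
    }

    private final SharedBufferCache cache;
    private final BufferManager buffers;

    public StatefulRebalanceListener(SharedBufferCache cache, BufferManager buffers) {
        this.cache = cache;
        this.buffers = buffers;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Before giving a partition away, persist its buffer so the next
        // owner can pick up exactly where this instance left off.
        for (TopicPartition partition : partitions) {
            cache.save(partition, buffers.remove(partition));
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // On receiving a partition, load whatever buffer its previous
        // owner flushed (empty if the partition is brand new).
        for (TopicPartition partition : partitions) {
            buffers.put(partition, cache.load(partition));
        }
    }
}
```

The listener is registered via `KafkaConsumer.subscribe(topics, listener)`, so the flush and restore happen automatically on every rebalance, including those triggered by a rolling deployment.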

Maintaining the quality of your application

As noted, a deployment of new code in the alerts pipeline is an incident that we intentionally cause. Disruption to our users’ data could result in a violation being missed, a violation being detected late, or a false alarm, depending on how the condition is configured. Our users don’t want to be woken up because of a false alarm, or, worse, miss a real alarm, so it’s critical that in any service disruption we create, we don’t lose track of the data. Using rolling deployments for stream processing systems has helped us minimize disruption without sacrificing the timeliness or accuracy of the evaluation and notification of alerts.

Many factors go into choosing the best deployment strategy for a given system. Starting from the premise that you’re disrupting your service can help guide you toward asking key questions to mitigate the impact of rolling out new code:

  • How will the deployment impact the ability of the system to operate?
  • Will you drop any data or connections?
  • Will upstream or downstream components be disrupted?
  • Do you need to change the application itself to better accommodate your deployment strategies?

The end result should be the ability to deploy as often as you like, increasing the value of your system to your customers with zero disruption. Treating your deployment strategy as just another feature of your service is a great way to maintain the quality and stability of your application.

Jonathan Pearlin is a lead software engineer at New Relic. He loves to apply his experience to solving problems of scale and growth, especially those involving near real-time processing.
