One way our Dynamic Baseline Alerts, currently in limited release, do that effectively is by automatically sensing anomalies and keeping them from unduly biasing your baseline. Our approach to incident recognition is built to instantaneously identify anomalies and then reduce, or dampen, their influence on the baseline forecast. Competing approaches, meanwhile, often make you fuss with manual configurations or sub-optimize the rest of your baseline to account for anomalies. Who wants to deal with that?
To see how it works, take a look at these two charts. Basically, we went from this:
[Before Dampening chart]

To this:
[After Dampening chart]

You can see that before dampening, the incident (the red line) causes the dotted line that represents the baseline to jump dramatically; after dampening, the effect is significantly reduced.
The problem with anomalies
As the team looked at our first builds of the baselines we wanted to use for Dynamic Baseline Alerts, we saw that a single incident, especially an anomaly significantly different from typical behavior, could dramatically change what the baseline considers “normal,” even though it really shouldn’t.
If your system has a sudden abnormal spike, you don’t want your model to treat that spike as typical behavior and adjust the baseline in response. For example, look at February 6 in the “Before Dampening” image above: a sudden spike in the value causes the baseline to increase quickly.
We could have handled this by letting customers manually select a time window to exclude from the baseline calculation, or by letting them pick a “typical” period and calculating the baseline from just that known “good” period, free of anomalies. This manual approach would be more of a hassle for users, and potentially less accurate. Another approach would have let customers select an algorithm biased toward smoothing out anomalies, but that might sacrifice accuracy as well. More to the point, any of these options would have meant more fussy work for our customers. At New Relic, our job is to make the lives of engineering and operations teams easier, not harder. So our team came up with a different solution.
Remember that baselines are predictions. Based on all the metric data we’ve observed to date, we predict the next value for the metric using a mathematical model called triple exponential smoothing, which evaluates three factors (recency, trend, and seasonality) to create the next predicted value in the baseline.
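To make that concrete, here’s a rough sketch of additive triple exponential smoothing (also known as Holt-Winters). This is not New Relic’s production implementation; the smoothing factors `alpha`, `beta`, `gamma` and the season length `m` are illustrative assumptions:

```python
def triple_exponential_smoothing(series, m, alpha=0.5, beta=0.3, gamma=0.2):
    """Return one-step-ahead forecasts for each point in `series`.

    m: season length (number of observations per seasonal cycle).
    Assumes len(series) >= 2 * m so the components can be initialized.
    """
    # Initialize level (recency), trend, and seasonal components
    # from the first two seasonal cycles.
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    season = [series[i] - level for i in range(m)]

    forecasts = []
    for t, obs in enumerate(series):
        # Forecast before seeing the observation:
        # recency + trend + the seasonal offset for this slot.
        forecasts.append(level + trend + season[t % m])
        # Update the three components with the new observation.
        last_level = level
        level = alpha * (obs - season[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        season[t % m] = gamma * (obs - level) + (1 - gamma) * season[t % m]
    return forecasts
```

On a perfectly flat metric this model simply predicts the same value forever; on a trending or seasonal one, the trend and seasonal components pull the forecast along with the pattern.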
But we actually have more information we can apply to making our predictions. Because baselines are a statistical model, we know our forecast error. If an observation lies significantly outside of the forecast, even when considering the error range, we can identify it as an outlier and decrease its weight when calculating the next prediction.
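In code, that outlier check might look something like the following sketch. The running variance estimate, the smoothing factor `phi`, and the 3-standard-deviation threshold `k` are our assumptions for illustration, not the actual model parameters:

```python
import math

def is_outlier(observation, forecast, error_var, k=3.0):
    """Flag an observation more than k error standard deviations
    away from the forecast."""
    return abs(observation - forecast) > k * math.sqrt(error_var)

def update_error_var(error_var, observation, forecast, phi=0.1):
    """Exponentially weighted update of the forecast error variance,
    so the error band adapts as forecast quality changes."""
    residual = observation - forecast
    return (1 - phi) * error_var + phi * residual ** 2
```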
To automatically build in incident recognition, we’re applying a special case of a Kalman Filter, an algorithm commonly used to weight time-series data based on the certainty of the predictions. The filter acts as a “cleaning function” to dilute the effect of the outlier. Essentially, we set a boundary on the observed value using the recency portion of our model. If the observed value lies outside of the expected error range, we replace it with a fixed value. The overall model still adapts the baseline based on observations, but now there are guardrails in place to prevent wild swings based on individual incidents.
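In spirit, the guardrail behaves like this hypothetical sketch, where an out-of-band observation is replaced with a clamped value before the model updates. The function names, the error band width `k`, and the update factor `alpha` are illustrative assumptions, not the actual filter:

```python
def dampen(observation, forecast, error_std, k=3.0):
    """Clamp an observation to forecast +/- k standard deviations
    of forecast error; in-band observations pass through unchanged."""
    upper = forecast + k * error_std
    lower = forecast - k * error_std
    if observation > upper:
        return upper      # outlier above the band: use the guardrail value
    if observation < lower:
        return lower      # outlier below the band
    return observation    # normal observation, no dampening needed

def update_baseline(baseline, observation, error_std, alpha=0.3):
    """Update the baseline from the dampened ("cleaned") observation,
    so a single wild spike can only move it a bounded amount."""
    cleaned = dampen(observation, baseline, error_std)
    return alpha * cleaned + (1 - alpha) * baseline
```

For example, with a baseline of 10 and an error standard deviation of 2, an observed spike of 100 is clamped to 16 before it influences the next prediction.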
This technique automatically identifies outliers and dampens their influence on future predictions. This has the additional benefit of smoothing the baseline in general, as you can see in the Before and After charts above. Our customers can use Dynamic Baselines with even more confidence and we’ve made their jobs a bit easier.
Dynamic Baseline Alerts are currently in limited release and will be more widely available later this year. We think our approach to incident recognition in our Dynamic Baselines helps make them more useful for you. And remember, this is just what we’ve done recently. The data nerds at New Relic are always looking for ways to improve. Who knows what they will come up with next?