DevOps is all about improving the way teams work in order to ship software faster, more frequently, and with greater reliability. And that means being able to respond quickly when problems occur that may impact customer experience or service level objectives (SLOs).

As software teams modernize and adopt cloud-native technologies, there are now a lot more things to monitor and react to—a wider surface area, more software changes happening, more operational data emitted across fragmented tools, more dashboards, more alerts—plus increased pressure to find and fix incidents quickly, as well as prevent them from occurring in the first place.

As the volume of data increases, so does the time required to understand problems and resolve them. Many Ops teams we talk with still spend too much time in reactive mode, constantly firefighting incidents, and never find time to implement processes that would let them identify problems before they cause outages or performance issues.

And response fatigue is real. Between noisy alerts and thousands of “unknown unknowns,” it’s still very hard to separate signals from the noise and quickly determine the root cause of incidents, let alone respond to issues proactively. Every minute that DevOps, SRE, and NOC teams have to spend interpreting their data to detect anomalies, or manually diagnosing and responding to incidents, has a real impact on SLOs, company reputation, and the bottom line.

The emergence of AIOps

In the last few years, a new category of technology has emerged that puts AI and machine learning (ML) in the hands of on-call teams so they can prevent more incidents and respond to them faster. Gartner coined the term “AIOps” (Artificial Intelligence for IT Operations) to describe this space. As Gartner has stated, AIOps uses AI and machine learning to analyze data generated by software systems in order to predict possible problems, determine the root causes, and drive automation to fix them.

AIOps works by “combining big data and machine learning functionality to analyze the ever-increasing volume, variety and velocity of data generated by IT in response to digital transformation. AIOps platforms enhance a broad range of IT operations processes including, but not limited to, anomaly detection, event correlation and root cause analysis (RCA) to improve monitoring, service management and automation tasks. The goal of the analytics effort is the discovery of patterns — clusters or groups naturally occurring in the data that are used to predict possible incidents and emerging behavior. These patterns are used to determine the root causes of current system issues and to intelligently drive automation to resolve them.”

— Gartner Research, Market Guide for AIOps Platforms

So how does AIOps fit in with monitoring? At New Relic, we believe AIOps capabilities are a key requirement for observability. A connected, real-time view of all telemetry data in one place lets teams pinpoint issues faster, understand not only what caused an issue but why, and get the context they need to quickly analyze that data and proactively act on it.

AIOps augments the value you get from monitoring by providing an intelligent feed of incident information alongside your telemetry, and applying AI and ML to analyze and take action on that data, so you can troubleshoot and respond to problems faster.

Use cases for AIOps

There are four main ways that DevOps, SRE, and on-call teams are putting AIOps to use:

1. Proactive anomaly detection

The first step in the incident response process is detecting potential problems in your software, before an issue hits production or impacts customer experience. AIOps tools automatically detect anomalies in your environment and trigger notifications to your monitoring solution as well as other tools where your teams collaborate and get work done, like Slack.
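As a rough illustration of what automatic anomaly detection can mean in practice, here is a minimal sketch that flags data points far from a trailing average. It is a simple rolling z-score, not the method any particular AIOps product uses; the function name, window size, threshold, and sample latencies are all assumptions made for the example.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds; the spike at index 7 is made up.
latencies_ms = [102, 98, 101, 99, 100, 103, 97, 450, 101, 99]
print(detect_anomalies(latencies_ms))  # prints [7]
```

A real system would typically account for seasonality and trend as well, and would send the flagged points to a notification channel rather than printing them.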

2. Event correlation & noise reduction

The next step in the incident response process is diagnosis. AIOps tools help teams prioritize and focus on the issues that matter most by correlating related alerts, events, and incidents, and enriching them with context from historical data or other tools in your stack. The most advanced tools use both machine-generated decisions (such as time-based clustering, similarity algorithms, and other ML models) and human-generated decisions to power the correlation logic, and they let you enable automatic flapping detection and suppress noisy or low-priority alerts.
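To make time-based clustering concrete, the sketch below groups alerts that arrive close together, so a burst of related notifications can be handled as one incident. It is a simplified illustration under stated assumptions: the `Alert` fields, the `correlate_by_time` name, and the 120-second gap are hypothetical, and production tools combine this kind of grouping with similarity models and human-defined rules.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


def correlate_by_time(alerts, max_gap_seconds=120):
    """Group alerts so that consecutive alerts within a group are at most
    `max_gap_seconds` apart."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= max_gap_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


# Hypothetical alerts: three in a burst, one much later.
alerts = [
    Alert(0, "checkout-api", "error rate high"),
    Alert(30, "postgres", "connection pool exhausted"),
    Alert(90, "checkout-api", "latency high"),
    Alert(600, "ingress", "certificate expiring"),
]
groups = correlate_by_time(alerts)
print([len(group) for group in groups])  # prints [3, 1]
```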

AIOps tools also provide valuable context by classifying incidents based on the four SRE golden signals—latency, traffic, errors, and saturation—so you can more easily diagnose the root cause of an issue and determine how to resolve it.
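One way to picture golden-signal classification is a lookup from incident text to a signal. The sketch below uses plain keyword matching as a stand-in for the ML models a real tool would run; the keyword lists and the `classify_golden_signal` name are assumptions chosen for illustration only.

```python
# Hypothetical keyword lists; a real classifier would be learned from data.
GOLDEN_SIGNAL_KEYWORDS = {
    "latency": ("latency", "response time", "slow", "timeout"),
    "traffic": ("throughput", "requests per", "traffic", "qps"),
    "errors": ("error", "exception", "5xx", "failed"),
    "saturation": ("cpu", "memory", "disk", "queue depth", "saturation"),
}


def classify_golden_signal(description):
    """Return the first golden signal whose keywords appear in the text."""
    text = description.lower()
    for signal, keywords in GOLDEN_SIGNAL_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return signal
    return "unclassified"


print(classify_golden_signal("p95 response time above 2s on checkout"))  # prints latency
print(classify_golden_signal("Disk usage at 95% on db-01"))  # prints saturation
```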

3. Intelligent alerting & escalation

In addition to detecting anomalies and providing intelligence to diagnose incidents, AIOps tools can automatically route incident data to the individuals or teams best equipped to respond to them. Particularly for decentralized, distributed teams that have embraced self-service, this reduces toil by decreasing the number of noisy alerts sent to the wrong people and cutting the time it takes to route critical incident data to the right folks.

AIOps tools run ML models to evaluate data from your incident management and monitoring tools and suggest an individual or a team that can resolve a particular problem faster, either because they’ve seen something similar in the past or because they are experts in the specific components that are failing.
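The routing idea can be sketched as a similarity search over past incidents: suggest the team that resolved the most similar one. The example below uses Jaccard similarity over component names in place of a trained model; the function names, team names, and history records are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_responder(incident_components, history):
    """Return the team whose past incident overlaps most with this one.

    `history` is a list of (team, components) pairs for resolved incidents.
    Returns None when nothing in the history overlaps at all.
    """
    best_team, best_score = None, 0.0
    for team, components in history:
        score = jaccard(incident_components, components)
        if score > best_score:
            best_team, best_score = team, score
    return best_team


# Hypothetical record of who resolved which past incidents.
history = [
    ("payments", ["checkout-api", "postgres"]),
    ("platform", ["kubernetes", "ingress"]),
    ("payments", ["checkout-api", "redis"]),
]
print(suggest_responder(["checkout-api", "redis", "postgres"], history))  # prints payments
```

Returning `None` when there is no overlap leaves room for a fallback, such as the default on-call rotation, instead of guessing.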

4. Automated incident remediation

The last, and most critical, step of the incident response process is actually fixing the problem. This includes workflows and automation that resolve the incident when it occurs and reduce mean time to resolution (MTTR).

As on-call teams look to close the gap between detecting a problem, diagnosing it, and fixing it, the scope of AIOps is increasing to solve these last-mile challenges through automatic remediation capabilities.
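A common shape for automatic remediation is a registry of runbooks keyed by incident type, with escalation to a human when nothing matches. The sketch below only returns a description of the action instead of performing it; the runbook functions, the `signal` and `service` fields, and the mapping itself are assumptions for illustration, not a description of any product's behavior.

```python
def restart_service(incident):
    """Hypothetical runbook: describe a restart of the affected service."""
    return f"restart {incident['service']}"


def scale_out(incident):
    """Hypothetical runbook: describe adding capacity to the affected service."""
    return f"scale out {incident['service']} by 1 instance"


# Assumed mapping from incident type to runbook.
RUNBOOKS = {
    "errors": restart_service,
    "saturation": scale_out,
}


def remediate(incident):
    """Return the action to take, escalating when no runbook matches."""
    runbook = RUNBOOKS.get(incident["signal"])
    if runbook is None:
        return f"escalate {incident['service']} to on-call"
    return runbook(incident)


print(remediate({"signal": "saturation", "service": "checkout-api"}))
print(remediate({"signal": "latency", "service": "search"}))
```

Keeping the escalation path as the default means an unrecognized incident always reaches a person rather than being silently dropped.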

How New Relic can help

As the complexity of operating production systems increases, software teams need faster and easier ways to resolve incidents. They need assistance and automation that augments their existing incident management teams and workflows, so they can find and fix problems faster. Our customers also have shared that they’re looking for AIOps solutions that are easier to onboard, learn, and use.

That’s why we recently announced New Relic AI, an AIOps solution that helps busy DevOps and SRE teams find, troubleshoot, and resolve problems faster. New Relic AI empowers your team to cut toil, get out of reactive “firefighting” mode, and return to the creative, challenging, and exciting work of building and running great software.

Unlike incident management tools alone or other approaches to AIOps, New Relic AI utilizes its access to raw monitoring data to fuel ML models and enable an intelligent, context-rich, incident response workflow.

By deeply integrating with the incident management tools you already use, we bring intelligence to your existing incident response process and workflow, providing the fastest time to detection and noise reduction without requiring you to reinvent your DevOps process.

If your team is looking to detect, diagnose, and resolve incidents faster with the help of an AIOps solution that’s easy to learn and use, learn more about New Relic AI.

Guy Fighel is the General Manager of applied intelligence and Vice President of product engineering at New Relic. He leads New Relic’s AIOps product and engineering, and is responsible for the company’s overall artificial intelligence and machine learning strategy. Guy was the co-founder and chief technology officer of SignifAI, an event-intelligence company, which was acquired by New Relic in 2019.
