Operations professionals, site reliability engineers, and anyone else who lives and breathes the practice of systems monitoring pride themselves on knowing what to monitor, both through experience and through the shared knowledge of like-minded professionals. With expertly crafted alerts, thresholds, dynamic baselines, and other elegant solutions, they deftly defy the gremlins that stalk uptime and performance goals, using data collected at intervals typically ranging from 60 seconds to 15 minutes.
What if there were a better way? Josh Biggley is a monitoring practitioner with decades of experience across market segments and customer verticals. After serving as a SolarWinds Community MVP for five years, Josh joined New Relic with a focus on enabling fellow Ops engineers to level up their observability.
This two-part blog series explores why you’d want to improve your observability with SolarWinds, and how to do it.
Our world has changed, and while monitoring is still part of what we do, it is simply a waypoint on our observability journey, not the destination itself. In spite of our best efforts to anticipate problems, modern systems fail in complex, unknown, and sometimes spectacular ways.
We continue to focus on KPIs to measure our success, striving to improve our mean time between failures (MTBF), mean time to detect (MTTD), and mean time to resolution (MTTR), and to stay within our SLO error budgets. That improvement is enabled not by monitoring but by observability, which lets us, to paraphrase Sir Isaac Newton, see further by standing on the shoulders of giants. Extending our vision has a very real impact on the business: Gartner estimates downtime costs an average of $5,600 per minute.
But wait, what exactly is observability? The ebook The Age of Observability summarizes the view of Uber Technologies engineer Yuri Shkuro this way:
“Monitoring is about measuring what you decide in advance is important while observability is the ability to ask questions that you don’t know upfront about your system.”
Monitoring tells you that something went wrong; observability empowers you to see where and why.
When I began the monitoring practitioner phase of my career with a focus on the SolarWinds Orion platform, I did not understand that metrics were only part of observability. A mentor on my path taught me a very important and difficult lesson: “I was doing monitoring wrong.” As much as I bristled at that message, it was something I needed to hear and that we all need to hear. If I was doing it wrong, what was the right way?
Here is how I leveled up my monitoring as part of the observability journey.
1. Democratized access to data
While this might seem like a grandiose ideal that channels a more radical, youthful spirit, it is absolutely critical for modern business operations. Did you notice that I didn’t say “IT/Network/Systems operations?” This is about making systems performance data available to a broad audience within an enterprise. Understanding how system performance impacts customer satisfaction, how network latency drives backend latency, or how customer demand is met by a scalable platform improves insights and decision making. Or, as the network admins emphatically declare: “It’s not the network and I can prove it! (And, if it is the network, we need more bandwidth!)”
2. Metrics, events, logs, and traces
Metrics, events, and logs are the foundation of all troubleshooting. Every systems admin, network engineer, or would-be troubleshooter knows the tools of the trade: PerfMon, Windows Event Logs, /var/log/messages, top—and the list goes on. We become infinitely more efficient when we can see those metrics, events, and logs across multiple systems.
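To make the "across multiple systems" point concrete, here is a minimal sketch of the idea: tag each log line with the host it came from and merge everything into a single timeline. The hostnames and log lines are hypothetical; in practice the lines would arrive via an agent or log shipper rather than hard-coded data.

```python
from datetime import datetime

# Hypothetical syslog-style lines gathered from two hosts; in a real setup
# these would be shipped from each host's /var/log/messages by a collector.
LOGS = {
    "web-01": [
        "2020-06-01T10:00:05 nginx: upstream timed out",
        "2020-06-01T10:00:01 sshd: session opened",
    ],
    "db-01": [
        "2020-06-01T10:00:03 mysqld: slow query: 4.2s",
    ],
}

def merge_logs(logs_by_host):
    """Tag each line with its host and sort all lines into one timeline."""
    merged = []
    for host, lines in logs_by_host.items():
        for line in lines:
            timestamp, _, message = line.partition(" ")
            merged.append((datetime.fromisoformat(timestamp), host, message))
    return sorted(merged)

for ts, host, msg in merge_logs(LOGS):
    print(f"{ts.isoformat()} {host} {msg}")
```

Seen one host at a time, the slow query and the upstream timeout look unrelated; interleaved on one timeline, the causal chain is hard to miss.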
In the modern application stack, we may not have direct console access and are most certainly going to be asked: “What is the impact on our customers?” Having a view of the complete stack—compute to customer, and all parts in between—elevates our view from monitoring to observability.
3. Consolidate data
When we talk about consolidating data into a single platform, the question engineers and operations teams often ask is “Why?” In some cases, the answer is strictly financial—why pay for two platforms to do the same function? But the better, more elegant answer is to establish a single source of truth. If you’ve ever built out a configuration management database (CMDB), you know the siren song of a trusted, unified, and current data repository. Data consolidated from disparate sources unifies that telemetry, eliminates the battle between tools and teams, and drives improvement in MTTx measures.
4. Cardinality matters
The philosophical argument of Schrödinger’s cat has an equivalent within the practice of monitoring. If you collect data at a five-minute interval (or even 10m, 15m, or longer), what happens while you aren’t observing your environment?
The concept of real-time data collection is the happy place of data geeks everywhere, though often far from reasonable from a technical and financial perspective. Collecting data more often closes the “unknown” gaps in our observability. Layering data from multiple sources and inferring the state of an environment is the essence of observability: “a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” (Check out Proactive Detection in New Relic AI for an excellent example of observability at play.)
Shifting data collection to a high-cardinality platform opens opportunities to improve native cardinality. For example, shortening the polling interval in SolarWinds NPM decreases the total number of elements that can be assigned to a single polling engine and to the Orion platform overall. Moving the collection of OS performance and process metrics to New Relic Infrastructure, where collection can be tuned as frequently as every 5 seconds, opens an opportunity to tighten metrics collection on the remaining devices without investing in more compute, licensing, and storage.
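The gap between a 5-minute and a 5-second polling interval is easy to see in a simulation. The sketch below uses a synthetic CPU curve (all numbers are illustrative) with a 30-second spike; coarse polling never lands on the spike, while fine-grained polling catches it.

```python
# Simulate a 30-second CPU spike and sample it at two polling intervals.
# The workload is synthetic; the point is that coarse polling can miss
# short-lived events entirely.

def cpu_percent(t):
    """Synthetic CPU utilization: 20% baseline, 95% spike from t=100s to t=130s."""
    return 95.0 if 100 <= t < 130 else 20.0

def sample(interval_s, duration_s=600):
    """Poll cpu_percent every interval_s seconds over duration_s seconds."""
    return [cpu_percent(t) for t in range(0, duration_s, interval_s)]

five_minute = sample(300)  # samples at t=0 and t=300: the spike is invisible
five_second = sample(5)    # samples every 5 seconds: the spike shows up

print("5m polling sees spike:", max(five_minute) > 90)  # False
print("5s polling sees spike:", max(five_second) > 90)  # True
```

The five-minute poller reports a flat 20% all day, which is exactly the "what happens while you aren't observing?" gap described above.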
5. Ask three questions
Every time we change our strategy, design a new environment, or, as in our case, look for ways to level up observability, we should ask ourselves three questions:
- “What can I save?” is often the easiest to answer. Reducing capital and operational expenses has an immediate impact on the budget and frees up room to shift those dollars to an initiative that aligns with current business priorities.
- “What can I gain?” requires more effort to measure but can have a much deeper impact on our successes. What is the value of improving customer satisfaction? How would your organization invest the time saved with faster MTTD and MTTR? What insights could you gain by sharing performance data widely within your organization? Gains may not be directly measured in dollars and cents, but there is no doubt that they drive real returns to your business.
- “What can I create?” is the most esoteric of the three questions, but it is also the most exciting. Improving observability begins to ease the toil on application and infrastructure teams. Fewer hours spent fighting fires unlocks time to innovate, increase efficiency, and drive customer value. What you create is not a prescriptive outcome; it is driven by what your customers need, what your team defines as priorities, and what the markets demand.
There is a better way
Where does that leave SolarWinds Orion administrators who want to level up to observability? The answer is found in the New Relic platform.
As practitioners, we’ve grown accustomed to having data silos, enduring “swivel chair” troubleshooting, and getting the rush of adrenaline that comes from unraveling the complexity to solve a problem, often at 3 a.m.
Observability means bringing together the telemetry that enables insights and innovation across a broad spectrum of the enterprise. It means enabling teams to ask questions that they oftentimes didn’t even know they needed to ask, to find where a problem exists and then dive deep to discover why.
For monitoring practitioners, it means being empowered to consolidate data into a single platform that can ingest more than 2 billion events and metrics per minute, a platform that measures real user performance alongside application traces, and links that data with the infrastructure that supports it all.
It means answering the question “Is it the network?” not by blaming the network or shifting blame to database administrators (DBAs), developers, or the storage team, but by sharing data that network engineers have collected and trusted in the Orion platform with everyone. Consolidating telemetry on the New Relic platform means faster time-to-glass for data whether collected by NPM from your network infrastructure, natively by New Relic Infrastructure, or ingested with codeless custom integrations.
There is a better way to do monitoring. You can level up to observability and you can do it today.
Unifying data from the SolarWinds Orion platform and New Relic is not only possible; we have already built a simple process that enables any SolarWinds Orion administrator to decide which data stays in Orion, which data should be sent to New Relic, and which data should be collected natively by New Relic Infrastructure.
We’ve done the heavy lifting by building out the API integration and the SolarWinds Query Language (SWQL) queries, and by designing an easily mapped process for both sharing data and transplanting data collection.
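To give a feel for the shape of that pattern, here is a minimal, hypothetical sketch: query Orion's SolarWinds Information Service (SWIS) with SWQL, reshape the rows into the New Relic Metric API format, and POST them. The endpoint paths and payload fields follow the publicly documented SWIS REST and Metric API interfaces, but this is an illustrative sketch under those assumptions, not the actual integration described in this post; hostnames, credentials, and the metric name are placeholders.

```python
import time

# Hypothetical SWQL query against Orion's node inventory.
SWQL = "SELECT NodeID, Caption, CPULoad FROM Orion.Nodes"

def to_metric_payload(rows, timestamp_ms=None):
    """Convert SWQL result rows into a New Relic Metric API payload."""
    ts = timestamp_ms or int(time.time() * 1000)
    metrics = [
        {
            "name": "orion.node.cpuLoad",  # placeholder metric name
            "type": "gauge",
            "value": row["CPULoad"],
            "timestamp": ts,
            "attributes": {"nodeId": row["NodeID"], "caption": row["Caption"]},
        }
        for row in rows
    ]
    return [{"metrics": metrics}]

# Posting the payload would look roughly like this (requires the `requests`
# package, a reachable Orion server, and a New Relic license key):
#
# import requests
# rows = requests.get(
#     "https://orion.example.com:17778/SolarWinds/InformationService/v3/Json/Query",
#     params={"query": SWQL}, auth=("user", "pass"), verify=False,
# ).json()["results"]
# requests.post(
#     "https://metric-api.newrelic.com/metric/v1",
#     headers={"Api-Key": "YOUR_LICENSE_KEY"},
#     json=to_metric_payload(rows),
# )
```

Keeping the reshaping step as a pure function makes it easy to test which data you forward before any of it leaves Orion.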
Do you want to learn more about Observability for SolarWinds Orion, including how you can implement this solution today? Stay tuned for the second installment in our blog series and sign up for our upcoming webinar.