Back in May we introduced Real User Monitoring, which expanded visibility upwards to measure performance in the user’s browser. With the addition of Server Monitoring, we are expanding to look beyond the application itself and into the environment that it runs in.
Application developers and Ops teams already use many different tools for monitoring their servers. But too many of these simply try to report every available piece of information to the user. We take a different approach: our job at New Relic goes beyond simply providing information. We provide the right information. With that in mind, we designed Server Monitoring to help our users find answers to the same two fundamental questions that all Application Performance Monitoring tools should answer:
- Is my application (or server) performing well?
- If it isn’t, what can I do to make it better?
We want to give the user the data that they need to answer these questions in a variety of scenarios, but we need to be just as careful about what we don’t show. Every item on the screen that isn’t helping answer these questions is an obstacle that makes the right answer harder to find.
The Four Key Metrics
When a user is trying to answer the first question, they need to quickly determine the status of many servers at once. The Servers Dashboard page shows a list of every server currently reporting, along with the four key server metrics and a graphical “traffic light” that tells you at a glance what the overall health of the server is. It can also be filtered by application or hostname if you already have a general idea of what you are looking for. This interface should be familiar to anyone who has used New Relic before, since it is purposefully very similar to our Applications Dashboard page.
For an application, the key metrics are response time, throughput, and error rate. For servers, we needed to identify a set of similar key metrics, which can be used to immediately tell the overall health of a server. We chose CPU Busy, Disk Busy, Memory Used, and Disk Space Used. All four are shown as a percentage, where higher numbers are bad. This makes them easy and quick to compare across different servers.
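To illustrate why a common percentage scale makes servers easy to compare, here is a minimal Python sketch that rolls the four metrics up into a single “traffic light” color. The thresholds, the ServerMetrics type, and the traffic_light function are all hypothetical names and values chosen for this example; the post does not say how the real dashboard decides its colors.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
WARN_PCT = 70.0
CRITICAL_PCT = 90.0


@dataclass
class ServerMetrics:
    """The four key metrics, each expressed as a percentage (0-100)."""
    cpu_busy: float
    disk_busy: float
    memory_used: float
    disk_space_used: float


def traffic_light(m: ServerMetrics) -> str:
    """Return "green", "yellow", or "red" based on the worst of the four metrics."""
    # Because higher is always worse, the maximum is the metric to worry about.
    worst = max(m.cpu_busy, m.disk_busy, m.memory_used, m.disk_space_used)
    if worst >= CRITICAL_PCT:
        return "red"
    if worst >= WARN_PCT:
        return "yellow"
    return "green"
```

Because all four values share the same scale and direction, one comparison against the worst value is enough; no per-metric unit conversion is needed.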
CPU Busy and Disk Busy measure the percentage of the time that your system is using the CPU or performing Disk IO. As these numbers climb, processes will have a harder and harder time getting resources. This means that applications on this server will start to slow down. If the Disk or CPU becomes completely busy, it is very likely that those applications will become unresponsive.
By contrast, Memory Used and Disk Space Used won’t affect performance much until they start to run out, but when they do run out the consequences are severe. A climbing Memory number can cause other processes to be “swapped” out, which will have a devastating effect on performance. A server will run smoothly as disks fill up, but if the disks become completely full then you may experience many seemingly unrelated problems, especially if this server hosts a database.
Once a problem has been detected, it is time to dig deeper. If Disk IO is high, which disk is the one having trouble? If the CPU is busy, what is it busy doing? The starting point for any of these questions should be the Overview page for the server in question. Again, this page should be immediately familiar to New Relic users, since it follows the same design as our Application Overview page.
The purpose of this page is two-fold. First, it needs to make it clear to the user when something is not right. To accomplish this first goal, we have used a consistent 100% max for the CPU, Memory, and Disk I/O charts, and have tried not to overload the user with too many charts.
This page is also the starting point for exploring a problem. To accomplish this second goal, we have a relatively dense right-hand column, and have included some context information in addition to actionable performance data. Of particular note is the Processes table, which functions like the “top” command and tracks the top 10 processes in terms of CPU and Memory usage. Often we don’t need to go any further than this page when we are trying to diagnose a problem where either CPU or Memory resources are being exhausted. The Disk I/O and Network charts and the Processes table are also jumping-off points to their respective detail pages. On those detail pages you can see additional details (like disk Reads vs. Writes) broken down by specific disk device or network interface.
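The idea behind a “top 10 processes” table can be sketched in a few lines of Python. The ProcessSample type, its field names, and the sample values below are hypothetical and exist only for this example; how the real Processes table gathers and ranks its data is not described in this post.

```python
import heapq
from typing import NamedTuple


class ProcessSample(NamedTuple):
    """One process as it might appear in a top-style listing (hypothetical shape)."""
    name: str
    cpu_pct: float
    mem_mb: float


def top_processes(samples, field, limit=10):
    """Return up to `limit` samples with the largest value of `field`, biggest first."""
    return heapq.nlargest(limit, samples, key=lambda p: getattr(p, field))


# Made-up sample data for illustration.
samples = [
    ProcessSample("ruby", 42.0, 310.0),
    ProcessSample("mysqld", 18.5, 920.0),
    ProcessSample("nginx", 3.2, 45.0),
    ProcessSample("cron", 0.1, 2.0),
]

by_cpu = top_processes(samples, "cpu_pct", limit=2)
by_memory = top_processes(samples, "mem_mb", limit=2)
```

Ranking the same samples twice, once by CPU and once by memory, mirrors the two questions the table is meant to answer: what is the CPU busy doing, and what is using the memory.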
Try It Yourself
We believe that Server Monitoring, like Real User Monitoring, is a vital part of Application Performance Monitoring, so we are including it for free with all of our subscription levels. If you already use New Relic, just click on the “Servers” tab to get started; the install only takes a few minutes. If you don’t already use New Relic, try it out for free. We think you’ll like it.
For more information, check out these articles in our knowledge base:
Note: Right now Server Monitoring only supports Linux servers, and requires* root access to install, but keep an eye out for more platforms in the future.
(* This isn’t strictly true, but only advanced** users should attempt a non-root install.)
(** You know who you are.)