Creating production-ready software is about much more than shipping features. Building a “functionally complete” system is barely half the battle. Gone are the days of shipping software as soon as it passes your QA team’s functional validation; systems must be built to much higher standards to survive in today’s market.

You must be prepared to deal with third-party dependency failures, to cope with malicious users, to scale your system as you add customers (you do plan to add customers, right?), and to otherwise meet your reliability service-level objectives (SLOs), indicators (SLIs), and agreements (SLAs).

One critical part of reliability, of course, is monitoring. If you don’t have visibility into the health of your system, you’ll know something is wrong only when customers call—or tweet—to complain (which is bad). And the only way you’ll find the precise problem is to fumble around blindly (which is very bad).

But when reliability experts tell you that you need to monitor the health of your systems, how do you know exactly what you need to monitor? Throughput? Response time? Latency? These are the most obvious choices, and these metrics can often indicate when you’ve got a problem, but they tell you almost nothing about what’s actually causing the problem.

Yep, that’s what lag looks like. But now what?

You’ve got all this data, yet you have no information. So … you need a different set of data.

In my four years at New Relic, I’ve been responsible for the health of dozens of different services. They all had one thing in common: when they failed, you could always see something going on in their resource pools.

You need to monitor your resource pools

Any non-trivial software system will have pools of resources that are ready to do work as requests arrive. Talking to a database requires a pool of database connections. Processing work from a queue requires a pool of threads. The work queue itself is also a pool, albeit one that fills up instead of drains. (Consider also that a single “non-pooled” connection is effectively just a pool with a single connection.)

All streaming systems, made up of any number of services, are a series of resource pools. Even if your service, say a simple windowing data aggregator, doesn’t talk to any databases or make any external requests, just reading from and writing to your message broker involves a number of threads and buffers.

HTTP services are no different. An ASP.NET application running on Microsoft Internet Information Services (IIS), for example, does request queueing, which is just a pool of requests waiting to be handled by a pool of request threads.

The sizes of resource pools are easy to measure, and this can be valuable data. When something is going wrong with your system, symptoms will inevitably show up in one or more of your resource pools.
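To make “easy to measure” concrete, here is a minimal sketch that reads the size and fullness of a standard java.util.concurrent thread pool and its work queue. The PoolSnapshot class and its field names are hypothetical, invented for this illustration; only the ThreadPoolExecutor and BlockingQueue calls come from the JDK.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;

/** Point-in-time reading of one thread pool and its work queue (hypothetical helper). */
final class PoolSnapshot {
    final int poolSize;       // threads currently in the pool
    final int activeCount;    // threads running a task right now (approximate)
    final int queuedTasks;    // tasks waiting in the work queue
    final int queueRemaining; // free slots left in the work queue

    PoolSnapshot(final int poolSize, final int activeCount,
                 final int queuedTasks, final int queueRemaining) {
        this.poolSize = poolSize;
        this.activeCount = activeCount;
        this.queuedTasks = queuedTasks;
        this.queueRemaining = queueRemaining;
    }

    /** Reads the current numbers straight from the executor. */
    static PoolSnapshot of(final ThreadPoolExecutor executor) {
        final BlockingQueue<Runnable> queue = executor.getQueue();
        return new PoolSnapshot(
                executor.getPoolSize(),
                executor.getActiveCount(),
                queue.size(),
                queue.remainingCapacity());
    }

    /** Fraction of queue slots in use, from 0.0 (empty) to 1.0 (full). */
    double queueFullness() {
        // Use long arithmetic so the sum cannot overflow for very large queues.
        final long capacity = (long) queuedTasks + queueRemaining;
        return capacity == 0 ? 0.0 : (double) queuedTasks / capacity;
    }
}
```

For an unbounded queue the fullness will sit near zero, so the raw queuedTasks count is the more useful number in that case.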

Let’s consider an example.

Monitoring the agent state downsampler

The agent state downsampler is a fairly simple service, running on Apache Kafka, that reduces the amount of data flowing from the language agents our customers have installed in their apps to our downstream consumers. It takes in a high-volume stream of agent metadata and produces only one message per hour per agent. It uses Memcached to keep track of which agents have already had a message in the last hour.

The agent state downsampler

So, how do we monitor this thing? Let’s start with those obvious bits we mentioned earlier: throughput, processing time, and lag.

Monitoring the agent state downsampler

This looks like some good data. But what happens to these charts if the downsampler starts lagging? We’ll see throughput plummet and processing duration and lag spike up. Great, but then what? By itself, this data can’t tell us anything other than “something’s wrong,” which is great for alerting purposes but does nothing to help us uncover the actual cause of the problem. We need to go deeper.

Here is a slightly more detailed look at the agent state downsampler:

Detailed view of agent state downsampler

With this enhanced view of the service we can think more critically. When something goes wrong, the first question we should ask is, “How full are our queues and buffers, and how busy are our thread pools?”

Here is a partial list of lag scenarios we can quickly diagnose by monitoring our resource pools:

| Symptoms | Problem | Next steps |
| --- | --- | --- |
| Throughput is down and the Memcached thread pool is fully utilized | Memcached is down/slow | Investigate the health of the Memcached cluster |
| Throughput is down and the Kafka producer buffer is full | The destination Kafka brokers are down/slow | Investigate the health of the destination Kafka brokers |
| Throughput is down and the work queue is mostly empty | The source Kafka brokers are down/slow, and the consumer thread isn’t pulling messages fast enough | Investigate the health of the source Kafka cluster |
| Throughput is up and the Kafka producer buffer is full | An increase in traffic has caused us to hit a bottleneck in the producer | Address the bottleneck (tune the producer, possibly by increasing the buffer) or scale the service |
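The scenarios listed above can also be written down as explicit rules, which is handy for a runbook or an automated hint on a dashboard. The sketch below is only an illustration: the LagDiagnosis class, its parameter names, and the order in which the rules are checked are assumptions, not part of the downsampler itself.

```java
/** Encodes the four lag scenarios as simple rules (hypothetical helper). */
final class LagDiagnosis {
    enum Throughput { DOWN, NORMAL, UP }

    /**
     * Returns a suggested next step for the observed symptoms.
     * The rule order is an assumption; the first matching rule wins.
     */
    static String nextStep(final Throughput throughput,
                           final boolean memcachedPoolSaturated,
                           final boolean producerBufferFull,
                           final boolean workQueueMostlyEmpty) {
        if (throughput == Throughput.DOWN && memcachedPoolSaturated) {
            return "Investigate the health of the Memcached cluster";
        }
        if (throughput == Throughput.DOWN && producerBufferFull) {
            return "Investigate the health of the destination Kafka brokers";
        }
        if (throughput == Throughput.DOWN && workQueueMostlyEmpty) {
            return "Investigate the health of the source Kafka cluster";
        }
        if (throughput == Throughput.UP && producerBufferFull) {
            return "Address the producer bottleneck or scale the service";
        }
        return "No known scenario matched; keep investigating";
    }
}
```

The point is not the code itself but that each rule pairs a top-level symptom with the state of one resource pool, which is what narrows the search.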

A proven technique for monitoring resource pools

The first thing you need to do is collect the data about your resource pools. As mentioned earlier, this is actually pretty simple: set up a background thread in your service whose only task is to routinely measure the size and fullness of each of your resource pools. For example, ThreadPoolExecutor.getPoolSize() and ThreadPoolExecutor.getActiveCount() will tell you how many threads a pool currently holds and approximately how many of them are busy.

Here’s a simplified example using Guava’s AbstractScheduledService and Apache’s HttpClient libraries:

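import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Third-party imports; the package names assume Jackson 2.x and Apache HttpClient 4.x.
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import com.google.common.io.CharStreams;
import com.google.common.util.concurrent.AbstractScheduledService;
import com.newrelic.api.agent.NewRelic;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;

public class ThreadPoolReporter extends AbstractScheduledService {
    private final ObjectMapper jsonObjectMapper = new ObjectMapper();

    private final ThreadPoolExecutor threadPoolToWatch;
    private final HttpClient httpClient;

    public ThreadPoolReporter(final ThreadPoolExecutor threadPoolToWatch, final HttpClient httpClient) {
        this.threadPoolToWatch = threadPoolToWatch;
        this.httpClient = httpClient;
    }

    // Called once per scheduled interval: sample the pool and post one event.
    @Override
    protected void runOneIteration() {
        try {
            final int poolSize = threadPoolToWatch.getPoolSize();
            final int activeTaskCount = threadPoolToWatch.getActiveCount();

            // The attribute names here are the ones you will query on later.
            final ImmutableMap<String, Object> attributes = ImmutableMap.of(
                    "eventType", "ServiceStatus",
                    "timestamp", System.currentTimeMillis(),
                    "poolSize", poolSize,
                    "activeTaskCount", activeTaskCount);
            final String json = jsonObjectMapper.writeValueAsString(ImmutableList.of(attributes));

            final HttpResponse response = sendRequest(json);
            handleResponse(response);
        } catch (final Exception e) {
            // Never let a reporting failure escape; that would stop the scheduled service.
            NewRelic.noticeError(e);
        }
    }

    private HttpResponse sendRequest(final String json) throws IOException {
        final HttpPost request = new HttpPost("http://example-api.net");
        request.setHeader("X-Insert-Key", "secret key value");
        request.setHeader("content-type", "application/json");
        request.setHeader("accept-encoding", "compress, gzip");
        request.setEntity(new StringEntity(json));
        return httpClient.execute(request);
    }

    // Closing the response stream releases the connection back to the HTTP client's pool.
    private void handleResponse(final HttpResponse response) throws Exception {
        try (final InputStream responseStream = response.getEntity().getContent()) {
            final int statusCode = response.getStatusLine().getStatusCode();
            if (statusCode != 200) {
                final String responseBody = extractResponseBody(responseStream);
                throw new Exception(String.format("Received HTTP %s response from Insights API. Response body: %s", statusCode, responseBody));
            }
        }
    }

    private String extractResponseBody(final InputStream responseStream) throws Exception {
        try (final InputStreamReader responseReader = new InputStreamReader(responseStream, Charset.defaultCharset())) {
            return CharStreams.toString(responseReader);
        }
    }

    // Sample once per second, starting one second after the service starts.
    @Override
    protected Scheduler scheduler() {
        return Scheduler.newFixedDelaySchedule(1, 1, TimeUnit.SECONDS);
    }
}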

You want to check the thread pool’s stats pretty often (I recommend once per second) so that you have good data granularity.
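If Guava isn’t on your classpath, the same once-per-second sampling loop can be sketched with nothing but the JDK’s ScheduledExecutorService. This is a stripped-down illustration, not a replacement for the reporter above: the PoolSampler class and its sink callback are hypothetical, and the sink stands in for whatever actually ships the numbers to your monitoring backend.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

/** Samples one thread pool on a background daemon thread (hypothetical helper). */
final class PoolSampler implements AutoCloseable {
    private final ThreadPoolExecutor pool;
    private final BiConsumer<Integer, Integer> sink; // receives (poolSize, activeCount)
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                final Thread thread = new Thread(runnable, "pool-sampler");
                thread.setDaemon(true); // never keep the JVM alive just for monitoring
                return thread;
            });

    PoolSampler(final ThreadPoolExecutor pool, final BiConsumer<Integer, Integer> sink) {
        this.pool = pool;
        this.sink = sink;
    }

    /** Takes one sample; failures are logged so they cannot cancel future runs. */
    void sampleOnce() {
        try {
            sink.accept(pool.getPoolSize(), pool.getActiveCount());
        } catch (final RuntimeException e) {
            System.err.println("Pool sample failed: " + e);
        }
    }

    /** Starts sampling once per second, after an initial one-second delay. */
    void start() {
        scheduler.scheduleWithFixedDelay(this::sampleOnce, 1, 1, TimeUnit.SECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Catching the exception inside sampleOnce() matters: a scheduled task that throws is silently cancelled by the executor, and you would stop getting data exactly when things are going wrong.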

After you set up the background thread, you’ll be able to visualize the data in New Relic Insights. Here’s what the query might look like:

SELECT histogram(activeTaskCount, width: 300, buckets: 30) FROM ServiceStatus SINCE 1 minute ago FACET host LIMIT 100

You can view the data as a line chart (like the throughput and lag charts shown above), and that’s a fine choice. However, I prefer viewing my resource pool utilizations as two-dimensional histograms (or heat maps) because they make it easy to see when something is amiss.

Agent state downsampler active workers

You can see that during “normal” operation, our thread pools are mostly idle; we like to have a lot of head room for traffic bursts. If the dark squares start creeping to the right, it’s a clear signal that something is going wrong.

Repeat this process of adding monitoring code for each of your resource pools. Consider combining the data from each of the pools into a single Insights event if you want to reduce the number of events you’re storing.
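One way to combine several pools into a single event is to prefix each attribute with the pool’s name. The sketch below builds such an event as a plain map that you could then serialize to JSON. The CombinedPoolEvent class and the attribute naming scheme (for example, memcachedActiveTaskCount) are assumptions made for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadPoolExecutor;

/** Builds one ServiceStatus event covering several thread pools (hypothetical helper). */
final class CombinedPoolEvent {
    static Map<String, Object> build(final Map<String, ThreadPoolExecutor> poolsByName,
                                     final long timestampMillis) {
        final Map<String, Object> event = new LinkedHashMap<>();
        event.put("eventType", "ServiceStatus");
        event.put("timestamp", timestampMillis);
        for (final Map.Entry<String, ThreadPoolExecutor> entry : poolsByName.entrySet()) {
            final String name = entry.getKey();
            final ThreadPoolExecutor pool = entry.getValue();
            // Three attributes per pool, each prefixed with the pool's name.
            event.put(name + "PoolSize", pool.getPoolSize());
            event.put(name + "ActiveTaskCount", pool.getActiveCount());
            event.put(name + "QueueSize", pool.getQueue().size());
        }
        return event;
    }
}
```

The trade-off is that every query now has to name the prefixed attribute, so pick names you will be comfortable typing into dashboards.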

Finally, build an Insights dashboard to bring it all together. The image below shows our full agent state downsampler dashboard—one glance is all it takes to see if anything is wrong in our service or resource pools.

Dashboard for monitoring the agent state downsampler

It’s all about being proactive!

Every system I’ve worked on at New Relic has benefited from having resource pool monitoring in place, but the high-throughput streaming services have benefited the most. We’ve diagnosed quite a few nasty issues in record time.

For example, recently we encountered a crippling problem in one of the highest throughput streaming services in the New Relic ecosystem; it halted all processing. It turned out to be an issue with the Kafka producer not having enough buffer space, which would have been very difficult to figure out without monitoring. Instead, we were able to pull up the service’s dashboards, read through the Kafka producer charts, and immediately see that the buffer was completely full. We reconfigured the producer with a larger buffer and we were back in action within minutes.

Monitoring gives you the ability to solve problems before they happen. Glance through your dashboards not just during incidents, but on a regular cadence (say, once a week) and look for historical trends. If you see your thread pool utilization slowly increasing, scale the service before it starts lagging and avoid a potential incident. You can even use New Relic Alerts to set up low-priority email or Slack alerts to proactively warn you when you’re approaching your resource limits.
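The arithmetic behind “approaching your resource limits” is simple enough to sketch. In practice you would express this as an alert condition rather than application code; the UtilizationCheck class and the idea of averaging a window of samples against a warning threshold are assumptions for illustration only.

```java
/** Utilization arithmetic for a proactive warning (hypothetical helper). */
final class UtilizationCheck {
    /** Fraction of the pool's maximum size that is busy, from 0.0 to 1.0. */
    static double utilization(final int activeCount, final int maximumPoolSize) {
        if (maximumPoolSize <= 0) {
            throw new IllegalArgumentException("maximumPoolSize must be positive");
        }
        return (double) activeCount / maximumPoolSize;
    }

    /** True when the average of the recent samples is at or above the warning threshold. */
    static boolean approachingLimit(final double[] recentUtilizations, final double warnThreshold) {
        if (recentUtilizations.length == 0) {
            return false; // no data, nothing to warn about
        }
        double sum = 0.0;
        for (final double sample : recentUtilizations) {
            sum += sample;
        }
        return sum / recentUtilizations.length >= warnThreshold;
    }
}
```

Averaging over a window keeps a single traffic burst from triggering the warning; the right threshold depends on how much head room your service needs.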

Evan Nelson, a senior engineer, has worked at New Relic since 2014. He has a passion for building high-throughput streaming systems and solving the challenges that come with them.
