Momentum keeps building around Kubernetes. Our recent survey on serverless dynamic cloud computing and DevOps revealed that by mid-2018, more than 60% of respondents will be evaluating, prototyping, or deploying to production solutions involving container orchestration. And we see Kubernetes emerging as the de facto standard for container orchestration.

Kubernetes is helping a wide variety of businesses automate the deployment, scaling, and management of their applications. New solutions that make it easier to create Kubernetes clusters are released regularly. A growing number of public cloud offerings (such as Azure Container Service and Google Container Engine) manage the underlying infrastructure of Kubernetes clusters, handling tasks like cluster resizing, virtual machine creation and destruction, and automatic software updates.

With this in mind, we wanted to explore some strategies for instrumenting applications running inside these clusters. This post demonstrates how to use the New Relic application performance monitoring (APM) agent API to add an orchestration layer to application monitoring data for exploring performance issues and troubleshooting application transaction errors.

Improving application monitoring within the orchestration layer

By design, applications typically aren’t aware if they’re running in a container or on an orchestration platform. A Java web application, for example, doesn’t have any special code paths that would get executed only when running inside a Docker container on a Kubernetes cluster. This is a key benefit of containers (and container orchestration engines): your application and its business logic are decoupled from the specific details of its runtime environment. If you ever have to shift the underlying infrastructure to a new container runtime or Linux version, you won’t have to completely rewrite the application code.

Similarly, New Relic APM’s language agents, which instrument the application code to track rich events, metrics, and traces, don’t care where the application is running. It could be running on an ancient Linux server in a forgotten rack or on the latest 72-CPU Amazon EC2 instance. When monitoring applications managed by an orchestration layer, however, being able to relate an application error trace, for instance, to the container, pod, or host it’s running in can be very useful for debugging or troubleshooting.

You always want to be able to answer that all-too-common question: Is the problem in the code or the underlying infrastructure?

Application instrumentation with the Kubernetes Downward API

Traditionally, for any application transaction trace you collect using New Relic APM, our agent can tell you exactly which server the code was running on. In many container environments, though, this gets more interesting: the worker nodes (hosts) where the application runs (in the containers/pods) are often ephemeral—they come and go. It’s fairly common to configure policies for applications running in Kubernetes to automatically scale their host count in response to traffic fluctuations, so you may investigate a slow transaction trace, but the container or host where that application was running no longer exists. Knowing the containers or hosts where the application is currently running is not necessarily an indication of where it was running 5, 15, or 30 minutes ago—when the issue occurred.

Fortunately, the Kubernetes Downward API allows containers to consume information about the cluster and pod they’re running in using environment variables instead of external API calls. Recently I created a small sample app to demonstrate how this works in a Node.js application. (Feel free to fork this repo for your own use.)

To integrate with the Downward API, I edited the environment section of the application manifest, and defined environment variables that would give a containerized application access to the Kubernetes node name, host IP, pod name, pod namespace, pod tier, and pod service account it’s running in. For example, here are a few of those variables:

- name: K8S_NODE_NAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
- name: K8S_HOST_IP
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP
- name: K8S_POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name

From here, I use the New Relic Node.js agent API to expose those environment variables as custom parameters for all application transaction traces in an Express.js application:

// Load the New Relic agent before any other module so it can instrument them
var newrelic = require('newrelic');

// Middleware for adding custom attributes
// These map to environment variables exposed in the pod spec
var CUSTOM_PARAMETERS = {
    'K8S_NODE_NAME': process.env.K8S_NODE_NAME,
    'K8S_HOST_IP': process.env.K8S_HOST_IP,
    'K8S_POD_NAME': process.env.K8S_POD_NAME,
    'K8S_POD_NAMESPACE': process.env.K8S_POD_NAMESPACE,
    'K8S_POD_IP': process.env.K8S_POD_IP,
    'K8S_POD_SERVICE_ACCOUNT': process.env.K8S_POD_SERVICE_ACCOUNT,
    'K8S_POD_TIER': process.env.K8S_POD_TIER
};

// `app` is the Express application created elsewhere in the sample app.
// The middleware attaches the Kubernetes metadata to every transaction.
app.use(function(req, res, next) {
  newrelic.addCustomParameters(CUSTOM_PARAMETERS);
  next();
});
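When the same code runs outside Kubernetes (on a laptop, for example), none of these variables are set and every value in the object above is `undefined`. The following is a minimal sketch of one way to handle that, assuming you would rather send only the attributes that actually have values. The helper name `collectK8sAttributes` and the constant `K8S_ENV_KEYS` are hypothetical and not part of the sample app or the New Relic agent.

```javascript
// Environment variable names used in the middleware above.
const K8S_ENV_KEYS = [
  'K8S_NODE_NAME',
  'K8S_HOST_IP',
  'K8S_POD_NAME',
  'K8S_POD_NAMESPACE',
  'K8S_POD_IP',
  'K8S_POD_SERVICE_ACCOUNT',
  'K8S_POD_TIER'
];

// Hypothetical helper: copies only the Kubernetes variables that are set
// to a non-empty string, so unset values are never sent as attributes.
function collectK8sAttributes(env) {
  const attributes = {};
  for (const key of K8S_ENV_KEYS) {
    const value = env[key];
    if (typeof value === 'string' && value.length > 0) {
      attributes[key] = value;
    }
  }
  return attributes;
}

// Example with a stand-in environment object instead of process.env.
const sample = collectK8sAttributes({
  K8S_POD_NAME: 'web-1',
  K8S_POD_TIER: '',
  PATH: '/usr/bin'
});
console.log(sample);
```

In the middleware, the result of `collectK8sAttributes(process.env)` could be computed once at startup and passed to `newrelic.addCustomParameters` in place of the hand-written object.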

Exploring application performance in Kubernetes with New Relic

To get the instrumentation in place, I deployed the updated application code to the Kubernetes cluster, and then the custom parameters from Kubernetes began to appear in the New Relic UI.

Here we get some error details; the transaction attributes show us, among other details, the Kubernetes hostname and IP address where the error occurred:

Kubernetes pod metadata exposed as transaction attributes

Next, I used New Relic Insights to look at the same instrumentation to see the performance of application transactions based on pod names. To do this in Insights, I simply wrote the following custom New Relic Query Language (NRQL) query:

SELECT percentile(duration, 95) FROM Transaction WHERE appName='newrelic-k8s-node-redis' AND name='WebTransaction/Expressjs/GET//' FACET K8S_POD_NAME TIMESERIES auto

And here’s the result:

custom transaction query in Insights
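If you want to run the same kind of query for other applications or facet on a different attribute (the node name instead of the pod name, say), the query text can be generated. This is only a sketch: the function `buildPodLatencyQuery` is hypothetical, it only builds a string, and it rejects values containing single quotes rather than assuming anything about how NRQL escapes them.

```javascript
// Hypothetical helper: builds the 95th-percentile duration query shown above
// for a given application, transaction name, and facet attribute.
function buildPodLatencyQuery(appName, transactionName, facetAttribute) {
  for (const value of [appName, transactionName]) {
    if (typeof value !== 'string' || value.includes("'")) {
      throw new Error('appName and transactionName must be strings without single quotes');
    }
  }
  // Only allow simple attribute names such as K8S_POD_NAME.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(facetAttribute)) {
    throw new Error('facetAttribute must be a plain attribute name');
  }
  return `SELECT percentile(duration, 95) FROM Transaction ` +
    `WHERE appName='${appName}' AND name='${transactionName}' ` +
    `FACET ${facetAttribute} TIMESERIES auto`;
}

const query = buildPodLatencyQuery(
  'newrelic-k8s-node-redis',
  'WebTransaction/Expressjs/GET//',
  'K8S_NODE_NAME'
);
console.log(query);
```

Swapping the last argument for any of the other custom parameters, such as `K8S_POD_NAMESPACE`, gives the same chart grouped by that attribute.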

The Downward API exposes a good deal of useful metadata for monitoring solutions. You can use it to gain useful information about performance outliers and track down individual errors. In aggregate, APM Error Profiles also surfaces any correlations between errors in transactions that have been instrumented with data you’ve gathered with the Downward API.

For instance, in this example, APM Error Profiles automatically notices that nearly 57% of errors come from the same pods and pod IP addresses:

pod errors shown in APM Error Profiles

APM Error Profiles automatically incorporates the custom parameters and uses different statistical measures to determine if an unusual number of errors is coming from a certain pod, IP, or host within the container cluster. From there, you can zero in on infrastructure or cluster-specific root causes of the errors (or maybe you’ll just discover some bad code).
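To make the idea of "an unusual share of errors from one pod" concrete, here is a rough sketch that computes the share of errors per attribute value from a list of error events. It is not how APM Error Profiles works internally (the statistical measures it uses aren't described here); it is a plain proportion, and the function name `errorShareByAttribute` and the event shape are assumptions for illustration.

```javascript
// Hypothetical helper: counts error events per value of one attribute
// (for example K8S_POD_NAME) and returns each value's share of the total,
// sorted with the largest contributor first.
function errorShareByAttribute(errorEvents, attribute) {
  const counts = new Map();
  let total = 0;
  for (const event of errorEvents) {
    const value = event[attribute];
    if (value === undefined) continue; // skip events without the attribute
    counts.set(value, (counts.get(value) || 0) + 1);
    total += 1;
  }
  return Array.from(counts, ([value, count]) => ({
    value,
    count,
    share: count / total
  })).sort((a, b) => b.count - a.count);
}

// Made-up error events: 4 from pod-a, 2 from pod-b, 1 from pod-c.
const shares = errorShareByAttribute([
  { K8S_POD_NAME: 'pod-a' }, { K8S_POD_NAME: 'pod-b' },
  { K8S_POD_NAME: 'pod-a' }, { K8S_POD_NAME: 'pod-a' },
  { K8S_POD_NAME: 'pod-c' }, { K8S_POD_NAME: 'pod-b' },
  { K8S_POD_NAME: 'pod-a' }
], 'K8S_POD_NAME');
console.log(shares);
```

A raw share like this only becomes meaningful when compared with how much traffic each pod handles, which is the kind of comparison a statistical tool does for you.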

Customers don’t care about Kubernetes … but you must

Customers don’t care if you’re using traditional virtual machines, a bleeding-edge multi-cloud federated Kubernetes cluster, or a home-grown artisanal orchestration layer—they just want the applications and services you provide to be reliable, available, and fast. The core business responsibility of DevOps teams is to deliver good software that’s unaffected by any changes to your platforms, tools, languages, or frameworks.

However, with the rise of Kubernetes and new technical concepts like orchestration layers, teams need new contexts for understanding and exploring performance in their applications and code. We believe that to efficiently build modern software, you need visibility into the application layer, especially when those applications are running inside Kubernetes.

Like many of our customers, we’re always seeking solutions to take new and existing applications even further. By experimenting with these kinds of integrations between New Relic APM and the Kubernetes API, we’re working toward creating even deeper visibility into all contexts of the application layer.


Clay Smith is a Developer Advocate at New Relic in San Francisco. He has previously worked at early-stage software companies as a senior software engineer, including founding the mobile engineering team at PagerDuty and shipping one of the first iOS apps written in Swift.
