At New Relic we have a companywide initiative to move all our applications and services onto Container Fabric, our homegrown container orchestration and runtime platform. That’s why the New Relic Mobile application team has been busy moving our application stack, which depended heavily on Capistrano deployments to bare-metal servers, off traditional hardware and into Docker containers on the Container Fabric platform.
As we got started with Container Fabric, we saw no immediate issues with auto-scheduled deployments, and our HTTP(S) and REST APIs seemed to work well inside containers. But because of memory imbalances between the Java Virtual Machine (JVM) and Apache Kafka, which we use as our data-streaming platform, the Mobile data ingest pipeline presented a unique challenge as we worked to containerize the full system.
Here’s how we did some JVM troubleshooting and overcame those memory imbalances to achieve better management of our containerized data ingest pipeline.
The New Relic Mobile data ingest pipeline
Before I go further into the problem, let me quickly explain how the data ingest pipeline works. Consider this graphic:
Using the SDK for New Relic Mobile, mobile developers instrument their applications to send payloads from the mobile client (running on iOS or Android) to the Mobile router, which publishes to a Kafka cluster. The router serves the function of a Kafka producer, essentially starting the Kafka stream. The data payload is then added to a Kafka topic (a record of the data) and is divided into topic “partitions” for load balancing and replication.
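For context on what a producer in this role needs, little more than broker addresses and serializers is required to start publishing to a topic. Here is a minimal configuration sketch; the broker hostnames, serializer choices, and acks setting are illustrative assumptions, not our production values:

```properties
# Brokers to bootstrap from (hypothetical hostnames).
bootstrap.servers=kafka-1:9092,kafka-2:9092
# Payloads are serialized before being appended to a topic partition.
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
# Wait for the partition leader to acknowledge each write.
acks=1
```

Kafka then assigns each record to one of the topic’s partitions, which is what enables the load balancing and replication described above.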
The Mobile consumer reads the data stored in the topics. The Mobile application reads that data and sends it to various parts of the New Relic ecosystem (for example, the web UI for New Relic Mobile or to New Relic Insights). We store all of this data as events in our NRDB database cluster. The producer and consumer applications run on the JVM in a Docker container, and the container sits on a larger application host.
JVM troubleshooting: diagnosing the problem with Kafka
To fully understand the issues between the Kafka in the container and the JVM, you need to know a few key terms related to the JVM:
- Heap: the region of memory where the JVM allocates space for new objects.
- Garbage collection: the process by which the JVM identifies which objects are still in use and reclaims the memory of those that are not.
- Maximum transmission unit (MTU): the largest size of data packet that can be transmitted over a network link.
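You can see the first two of these concepts directly from inside a running JVM via the standard `java.lang.management` API. This small sketch (the class name is ours) prints the heap figures that the terms above refer to:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapInfo {
    public static void main(String[] args) {
        // Heap settings the JVM is actually running with.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        long mb = 1024 * 1024;
        System.out.printf("init (~ -Xms): %d MB%n", heap.getInit() / mb);
        System.out.printf("used:          %d MB%n", heap.getUsed() / mb);
        System.out.printf("committed:     %d MB%n", heap.getCommitted() / mb);
        System.out.printf("max  (~ -Xmx): %d MB%n", heap.getMax() / mb);
    }
}
```

The gap between `committed` and `max` is exactly the organic growth we were counting on, as described below.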
When we deployed a test version of our ingest pipeline with Container Fabric, we noticed that the consumer was not properly processing the data. The JVM was struggling in a few critical places.
Our high-throughput Kafka consumers ingest data at very high rates, and as new data came in, the JVM’s heap was not growing as needed. Using the New Relic Java agent, we discovered that the heap was reaching its minimum size (-Xms) but never growing toward the maximum (-Xmx). The JVM became resource starved, and when the heap ran out of space it threw an OutOfMemoryError (OOM).
We found that setting the heap’s -Xms to match the -Xmx allowed our heap to claim the resources we had already committed to it. Because Container Fabric enforces hard resource limits, trying to coax the heap into growing organically just meant underutilizing memory we had already reserved, and it didn’t work.
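In practice the fix amounts to pinning both flags to the same value at launch. A hedged sketch of what such a launch line might look like (the 4g figure and the jar name are illustrative, not our actual values):

```shell
# Pin the heap: initial size (-Xms) equals maximum size (-Xmx), so the
# JVM claims its full memory allotment up front instead of relying on
# organic growth inside a container with hard resource limits.
java -Xms4g -Xmx4g -jar mobile-consumer.jar
```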
We also saw issues with MTUs between the container and the underlying host. The JVM would launch successfully and take ownership of its partitions, but large data packets could not be transferred. We worked with our Container Fabric team and discovered that the MTU set on the host did not match the MTU set on the container. Small packets (such as ownership changes) were transmitted correctly, but large packets never made it from the host to the container.
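One way to spot this kind of mismatch from the JVM itself is the standard `java.net.NetworkInterface` API, which exposes each interface’s MTU. A small sketch (the class name is ours); run it both on the host and inside the container and compare the values:

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;

public class MtuCheck {
    public static void main(String[] args) throws SocketException {
        // Print the MTU of every active interface the JVM can see.
        for (NetworkInterface nic :
                Collections.list(NetworkInterface.getNetworkInterfaces())) {
            if (nic.isUp()) {
                System.out.printf("%-10s MTU %d%n", nic.getName(), nic.getMTU());
            }
        }
    }
}
```

If the container reports a larger MTU than the host, packets above the host’s limit will be dropped in exactly the way we observed.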
Essentially, we weren’t ingesting any payloads or client data.
Taking control of the memory
Once we realized our pipeline was failing because of heap and garbage-collection issues, we pinned the -Xms and -Xmx heap sizes as described above. This made the amount of memory the JVM used predictable, and ensured our JVM consumed all the resources that we granted to it. This presented a true view of our JVM process, so we could tune our -Xmx values and memory resources to avoid both OOM kills and underutilization of resources.
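When tuning values like these, it helps to watch what the garbage collectors are actually doing. The JVM exposes per-collector counts and times through the same management API; a small sketch (collector names vary by GC algorithm, and the allocation loop just generates short-lived garbage):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Allocate short-lived objects so at least one collection is
        // likely to run before we report the statistics.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] junk = new byte[1024];
        }
        // Report how often each collector ran and how long it took.
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Watching these numbers alongside heap usage is how you distinguish a heap that is genuinely too small from one that is merely refusing to grow.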
(Note: If you’re handling any kind of data transfer, you want to be sure the packets can get through your network at the size you want.)
Once we adjusted for these problems, we were able to move the data ingest pipeline entirely off our traditional hardware stack and into containers. We can now control how many instances are deployed and horizontally scale our ingest pipeline as adoption of our product grows.
Having to solve—or plan for—JVM heap and garbage collection issues certainly isn’t work unique to the New Relic container orchestration platform. So we’ve shared our story to help you get ahead of your own JVM memory issues.