Here at New Relic, the Edge team is responsible for the pipelines that handle all the data coming into our company. We were an early adopter of Apache Kafka, which we began using to power this data pipeline. Our initial results were outstanding. Our cluster handled any amount of data we threw at it; it showed incredible fault tolerance and scaled horizontally. Our implementation was so stable for so long that we basically forgot about it. Which is to say, we totally neglected it. And then one day we experienced a catastrophic incident.
Our main cluster seized up. All graphs, charts, and dashboards went blank. Suddenly we were totally in the dark—and so were our customers. The incident lasted almost four hours, and in the end, an unsatisfactory number of customers experienced some kind of data loss. It was an epic disaster. Our Kafka infrastructure had been running like a champ for more than a year and suddenly it had ground to a halt.
This happened several years ago, but to this day we still refer to the incident as the “Kafkapocalypse.”
Sometimes in software things can go terribly wrong, but it’s how you bounce back that really matters.
Once we got everything back up and running and our customers squared away, we did a lot of soul-searching. We had years of experience monitoring complex and distributed systems, but it was clear that with Kafka, we didn’t understand all of the key indicators to monitor and alert on. In this post, I’ll share our key takeaways from that horrible day, and leave you with six tips designed to help DevOps teams avoid having to suffer through their own Kafkapocalypse.
What’s Kafka and how does it work?
This post is not going to deep dive into all the technical details about Kafka, but simply put, Kafka is a scalable, fault-tolerant messaging system. It’s used for building real-time data pipelines and streaming applications, and runs in production environments in companies like Netflix, Walmart, Twitter, and Uber.
Here’s a rundown of how it works:
- Producers send messages to topics, which are named streams of records. The topics exist on Kafka servers, also known as brokers.
- Consumers read the topic data from the brokers.
- The topics are divided up into partitions, and it’s these individual topic partitions that producers and consumers interact with.
- Topic partitions are distributed throughout your cluster to balance load. This means that when producers and consumers interact with topic partitions, they’re sending and receiving from multiple brokers at the same time.
Kafka supports replication natively. Each topic partition has a Replication Factor (RF) that determines the number of copies you have of your data. At New Relic we use RF3, so we have three copies of all data.
For each replicated partition, only one broker is the leader. It’s the leader of a partition that producers and consumers interact with.
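To make those moving parts concrete, here’s a toy Python model of topic partitions spread across brokers with RF3. It’s a simplified sketch, not Kafka’s actual assignment algorithm: the round-robin placement and the names `PartitionAssignment`, `assign_partitions`, and `leader_of` are assumptions made up for illustration. The one detail borrowed from Kafka is that the first replica in the list is treated as the leader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionAssignment:
    topic: str
    partition: int
    replicas: tuple  # broker IDs; the first entry is treated as the leader


def assign_partitions(topic, num_partitions, brokers, replication_factor=3):
    """Spread partitions round-robin across brokers (simplified model)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignments = []
    for p in range(num_partitions):
        # Each partition starts one broker further along, so both leaders
        # and followers end up spread across the cluster.
        replicas = tuple(
            brokers[(p + i) % len(brokers)] for i in range(replication_factor)
        )
        assignments.append(PartitionAssignment(topic, p, replicas))
    return assignments


def leader_of(assignment):
    """Producers and consumers talk only to this broker for the partition."""
    return assignment.replicas[0]


if __name__ == "__main__":
    for a in assign_partitions("events", num_partitions=6, brokers=[1, 2, 3, 4, 5]):
        print(a.partition, "leader:", leader_of(a), "replicas:", a.replicas)
```

With six partitions on five brokers, every broker leads at least one partition and follows several others, which is why a single client ends up talking to many brokers at once.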
The evolution of our Kafka stack
In 2014 our Kafka deployment had five brokers processing around 1 to 2 Gbps each, which roughly translates to processing one episode of your favorite HBO show per second. But as our customer base grew and the amount of data we were ingesting increased, we scaled our deployment to meet our needs.
As of this writing, we have more than 100 brokers processing more than 300 Gbps deployed across three different data centers, which is like processing the entire season of your favorite HBO show two times every second. In messaging terminology, this is roughly 15 million messages per second.
Obviously, this represents significant growth in volume and complexity.
It’s not really practical for us to monitor every single event happening in this system. If we tracked every message and sent a notification for every event that happens in the cluster, we’d lose our minds to information overload and alert fatigue. Instead, we’ve found a “middle path” where we’re able to have a healthy, performant cluster and get a full night’s sleep.
So what exactly is this middle path?
Monitoring Kafka while maintaining sanity
We have three very important things we now monitor and alert on for our Kafka clusters:
- Retention: How much data can we store on disk for each topic partition?
- Replication: How many copies of the data can we make?
- Consumer Lag: How far behind the producers are our consumer applications?
Monitoring Kafka while maintaining sanity: retention
Here’s another quick story …
A couple months ago we had a consumer lag incident on a high-volume topic: we weren’t processing data as fast as it was arriving, so it was buffering in Kafka. At its peak we were lagging 8 or 9 minutes behind real time; it wasn’t another Kafkapocalypse but it was serious enough.
The topic in question was configured to retain up to 2 hours of data, but when we calculated the storage based on the increased data rate, it turned out we were storing a lot less data than we should have been.
In this case, we weren’t thinking about retention in terms of space vs. time.
In Kafka you can configure retention settings by time (minutes, hours, etc.) and by space (megabytes, gigabytes, etc.).
If you configure only by time, say to retain 24 hours of data, and the volume increases, 24 hours of data could turn out to be more than you expect and the server could run out of space. Kafka won’t like that.
And if you configure only by space, an increase in volume could shorten the interval of time you’re keeping on disk; suddenly you have only 10 minutes worth of data on disk when you should have 2 hours.
So space vs. time—how do you pick which one to configure?
The quick answer is you don’t choose one; you configure both.
Let’s consider DevOps Tip #1: Configure both space and time retention settings.
Keep in mind, though, that while having a cap on space protects you from overwhelming the server, an increase in volume can still make it so you hold less data than you think.
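The space vs. time trade-off comes down to a little arithmetic. The sketch below assumes both limits are set, using Kafka’s topic-level `retention.ms` and `retention.bytes` settings (the latter applies per partition), and assumes a steady data rate. The function name and the example numbers are hypothetical.

```python
def effective_retention_seconds(retention_ms, retention_bytes, bytes_per_second):
    """Return roughly how many seconds of data a partition really holds.

    Old data is deleted as soon as either limit is exceeded, so the
    effective window is whichever limit is hit first.
    """
    time_limit = retention_ms / 1000
    if bytes_per_second <= 0:
        # No incoming data: only the time limit matters.
        return time_limit
    space_limit = retention_bytes / bytes_per_second
    return min(time_limit, space_limit)


if __name__ == "__main__":
    two_hours_ms = 2 * 60 * 60 * 1000
    cap_bytes = 6_000_000_000  # hypothetical 6 GB cap per partition

    # At 500 KB/s the time limit wins: the full 2 hours stays on disk.
    print(effective_retention_seconds(two_hours_ms, cap_bytes, 500_000))

    # At 10 MB/s the space cap wins: only 10 minutes stays on disk.
    print(effective_retention_seconds(two_hours_ms, cap_bytes, 10_000_000))
```

Comparing this number against the configured time retention is one way to notice that a volume increase has quietly shortened your window.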
To protect against this, we’ve written code that monitors data rates of our partitions, size on disk, and the current topic configurations and sends that data to New Relic Insights. We combine that with New Relic Alerts to warn us if we’re getting close to our time thresholds.
This Insights query shows the ratio of actual bytes on disk compared to the configured maximum bytes for the topic, by topic partition. If the ratio gets above 100, data gets deleted by size instead of by time, and time is how we want to handle retention. New Relic makes it super easy to set up alerts on these queries.
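For readers without Insights, the same ratio is easy to compute yourself. This is a minimal sketch rather than our production code: the sample tuple layout, the function names, and the 90 percent warning threshold are all assumptions for illustration.

```python
def retention_usage_percent(bytes_on_disk, retention_bytes):
    """Bytes on disk as a percentage of the configured size cap."""
    return 100.0 * bytes_on_disk / retention_bytes


def partitions_near_cap(samples, warn_at_percent=90.0):
    """Flag partitions close to (or over) their size cap.

    samples: iterable of (topic, partition, bytes_on_disk, retention_bytes).
    Returns (topic, partition, percent) tuples, worst offender first.
    """
    flagged = []
    for topic, partition, on_disk, cap in samples:
        pct = retention_usage_percent(on_disk, cap)
        if pct >= warn_at_percent:
            flagged.append((topic, partition, round(pct, 1)))
    return sorted(flagged, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    samples = [
        ("metrics", 0, 950, 1000),
        ("metrics", 1, 400, 1000),
        ("events", 0, 1010, 1000),
    ]
    for row in partitions_near_cap(samples):
        print(row)
```

Warning below 100 gives the owning team time to raise retention before the size cap starts trimming data they expected to keep.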
This brings us to DevOps Tips #2 and #3.
DevOps Tip #2: Keep as little data as you can while still maintaining business goals and obligations. You’ll be able to move data around quickly during partition reassignments, and getting a replacement broker in sync if an old one fails will be far less painful.
DevOps Tip #3: Have a quick way to increase retention. (This is actually a hedge against the previous tip.) Our team created a feature we call the Big Red Retention Button, which allows other New Relic teams to instantly increase the retention of topics they own. If there’s an incident in the middle of the night, they can protect themselves without waking up our team. All retention changes made this way go through source control, so the increase can also be easily reverted after things catch up.
Nothing screams “DevOps” quite like self-service.
Monitoring Kafka while maintaining sanity: replication
Kafkapocalypse was a perfect storm of monitoring issues. One thing that made it so hard to resolve quickly was that we ended up in an under-replicated state—we’d lost our data redundancy. There were Kafka brokers we needed to restart, but couldn’t because they were the only brokers holding a particular set of data.
Replication monitoring had been a blind spot for us. Today, though, we monitor the hell out of replication.
If we drop from three copies to two on a single partition, we’re notified immediately. The alert message tells us exactly which broker and which partition are affected. We’re even notified about non-preferred replica assignments and broker-to-broker consumer lag events that are so brief the cluster itself doesn’t even change state.
However, alerting on replication can be a double-edged sword. Monitoring all the things means you can get notified about all the things. Most DevOps teams participate in pager rotations, so if you’re on one of these teams (or sleep next to someone on one of these teams), you know multi-tiered alerting is important.
We have at least three copies of all our data. If a server fails in the middle of the night, we send only an email or Slack message—after all, we still have two copies of the data and we can fix the single failure in the morning. However, if two servers fail, someone’s getting woken up.
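That tiering rule fits in a few lines. The sketch below is one way to express it, not our actual alerting configuration. The `Severity` names are hypothetical, and the rule “page as soon as one copy or fewer remains” is an assumed generalization of the RF3 behavior described above.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    NOTIFY = "notify"  # email or Slack; fix it in the morning
    PAGE = "page"      # wake someone up


def replication_severity(in_sync_replicas, replication_factor=3):
    """Map the number of healthy copies of a partition to an alert tier."""
    if in_sync_replicas >= replication_factor:
        return Severity.OK
    if in_sync_replicas >= 2:
        # Redundancy is reduced, but at least two copies still exist.
        return Severity.NOTIFY
    # One copy (or none) left: a single further failure means data loss.
    return Severity.PAGE


if __name__ == "__main__":
    for copies in (3, 2, 1):
        print(copies, "copies ->", replication_severity(copies).value)
```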
Hardware breaks all the time and it’s critical to plan for that inevitability, a point that leads to DevOps Tip #4: Keep your replicated data in separate failure domains.
For example, if you are using replication factor 3 and keep everything in one data center—like we used to—at least keep your Kafka brokers that share partitions in different racks. You don’t want a single rack failure to put you in a single-point-of-failure situation.
Today our cluster is spread across multiple data centers. We keep every replica of a topic partition in a separate location; even if we lose an entire data center, we still have multiple copies of the data.
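A quick audit can confirm that no partition keeps two replicas in the same failure domain. This is a hedged sketch: the input shapes (a partition-to-brokers mapping and a broker-to-domain mapping) and the function names are assumptions, and a “domain” can be a rack or a data center depending on how you’re deployed.

```python
from collections import Counter


def shared_failure_domains(replicas, broker_domain):
    """Return the failure domains holding more than one replica of a partition."""
    counts = Counter(broker_domain[b] for b in replicas)
    return sorted(d for d, n in counts.items() if n > 1)


def audit(assignments, broker_domain):
    """Find partitions whose replicas share a failure domain.

    assignments: {(topic, partition): [broker_id, ...]}
    broker_domain: {broker_id: "rack or data center name"}
    """
    problems = {}
    for topic_partition, replicas in assignments.items():
        shared = shared_failure_domains(replicas, broker_domain)
        if shared:
            problems[topic_partition] = shared
    return problems


if __name__ == "__main__":
    broker_domain = {1: "rack-a", 2: "rack-a", 3: "rack-b", 4: "rack-c"}
    assignments = {
        ("events", 0): [1, 3, 4],  # three racks: fine
        ("events", 1): [1, 2, 3],  # two replicas in rack-a: flagged
    }
    print(audit(assignments, broker_domain))
```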
Given our replication needs, we also have to ensure that disk I/O doesn’t become a constraint.
Kafka’s throughput is limited by how fast you can write to your disks, so it’s important to monitor disk I/O and network traffic; the two are usually correlated. You need to make sure that the number of topic partitions is evenly distributed among the brokers, but since not all topics are created equal in terms of traffic, you’ll want to take that into account as well.
Balancing leadership refers to making sure the number of partitions that a broker is leading at any one time isn’t totally lopsided because leaders do more work. Consider the following diagram:
As you can see, while the replicas have to handle only inbound data from the leader, the leader has to handle inbound data from the producer, outbound data to the consumers, and outbound data to all the replicas.
You want to make sure that when a broker fails, the traffic and leadership that get redistributed stay balanced as other brokers pick up the slack.
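To see why lopsided leadership hurts, here’s a rough traffic model. It’s a sketch under simplifying assumptions: every consumer group reads the full stream from the leader, and protocol overhead and compression are ignored. The function name and input shape are made up for illustration.

```python
from collections import defaultdict


def broker_traffic(partitions, consumer_groups=1):
    """Estimate per-broker inbound and outbound bytes per second.

    partitions: list of (bytes_per_second, [leader, follower, ...]).
    """
    traffic = defaultdict(lambda: {"in": 0, "out": 0})
    for rate, replicas in partitions:
        leader, followers = replicas[0], replicas[1:]
        traffic[leader]["in"] += rate                     # from producers
        traffic[leader]["out"] += rate * consumer_groups  # to consumers
        traffic[leader]["out"] += rate * len(followers)   # to follower replicas
        for follower in followers:
            traffic[follower]["in"] += rate               # copy from the leader
    return dict(traffic)


if __name__ == "__main__":
    lopsided = [(100, [1, 2, 3]), (100, [1, 3, 2])]
    balanced = [(100, [1, 2, 3]), (100, [2, 3, 1]), (100, [3, 1, 2])]
    print(broker_traffic(lopsided))
    print(broker_traffic(balanced))
```

In the lopsided case broker 1 carries all of the outbound traffic while brokers 2 and 3 only receive. Rerunning the model with a broker removed and its leadership reassigned is a cheap way to check how a failure would shift the load.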
Needless to say, we watch everything using lots of dashboards (we are New Relic). This screenshot shows a dashboard capturing the total inbound and outbound traffic for our main cluster in one of the data centers, as well as disk I/O per broker.
Dashboards are great: they can show you interesting things at a glance, and they can validate assumptions during operational tasks. But it’s a mistake to rely only on dashboards for detecting problems. The goal is to detect a problem before it becomes critical—by the time it makes it to the dashboard, you should already be firefighting.
Which brings me to DevOps Tip #5: Use New Relic Alerts to detect issues before they get critical, then investigate with queries and dashboards.
Monitoring Kafka while maintaining sanity: consumer lag
Most Kafka users understand that consumer lag is a very big deal. If the rate you’re consuming data out of a topic is slower than the rate of data being produced into that topic, you’re going to experience consumer lag.
So how important is it to stay on top of lag—what’s the worst that could happen? Well, are you scared of permanent data loss?
The information flowing through our Kafka brokers is the lifeblood of our company, but, more important, it’s the lifeblood of our customers’ companies. We can’t afford to lose any of it. In fact, on my team we hate the idea of losing data more than we like the idea of successfully delivering it. We monitor consumer lag very carefully at New Relic.
Let’s dig a little deeper …
Consider the following diagram of a topic partition:
Data comes in from the right side as the producer appends messages to Kafka. They’re usually referred to as offsets, but we’ve listed them as message IDs. The next message produced would be appended with the ID of 130. From the left side, the consumer moves along processing messages. The producer has just appended message 129 and the consumer has just consumed message 126, so it’s lagging by three messages.
The Append time is when the Kafka broker writes the new message to the topic. We’ve written code that monitors this value as well as another important timestamp: Commit time, which is the timestamp a consumer writes after it processes and commits the message.
We use the Append time and Commit time to monitor two types of lag: Append Lag and Commit Lag.
Commit Lag is the difference between the Append time and the Commit time of a consumed message. It basically represents how long the message sat in Kafka before your consumer processed it. For example, you see that the Commit Lag of message 126, which was appended at 1:09 and processed at 1:11, is 2 seconds.
Append Lag is the difference between the Append time of the latest message and the Append time of the last committed message. As you can see, the Append Lag for this consumer is currently 9 seconds.
We recommend that if teams are going to alert on one of these, they should alert on Append Lag. Commit Lag is useful only if you have healthy consumers that can commit. If all of your consumers die at the same time (this has happened), you won’t be able to figure out the Commit Lag. However, Append Lag will continue to increase as more messages are produced, and you can always create alerts for that.
So, DevOps Tip #6: Make sure your lag monitoring still works, especially if all consumers stop committing.
Note: These measurements don’t come out-of-the-box with Kafka. At New Relic we wrote a custom application that monitors all the values shown in the previous diagram for each topic partition and sends that data to Insights, where we can apply alerts.
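The lag definitions above reduce to simple timestamp arithmetic. The sketch below is not our monitoring application. The function names are hypothetical, the calendar date is arbitrary, and the 1:18 append time for message 129 is inferred from the 9-second Append Lag in the example rather than stated directly.

```python
from datetime import datetime


def commit_lag_seconds(append_time, commit_time):
    """How long a message sat in Kafka before the consumer committed it."""
    return (commit_time - append_time).total_seconds()


def append_lag_seconds(latest_append_time, last_committed_append_time):
    """Gap between the newest message and the last committed message.

    This needs no new commits, so it keeps growing even if every
    consumer has died.
    """
    return (latest_append_time - last_committed_append_time).total_seconds()


def offset_lag(latest_offset, last_consumed_offset):
    """Number of messages the consumer still has to process."""
    return latest_offset - last_consumed_offset


if __name__ == "__main__":
    def at(minute, second):
        return datetime(2000, 1, 1, 0, minute, second)  # arbitrary date

    # Message 126: appended at 1:09, committed at 1:11.
    print(commit_lag_seconds(at(1, 9), at(1, 11)))
    # Message 129 is the latest, appended at 1:18 (inferred).
    print(append_lag_seconds(at(1, 18), at(1, 9)))
    print(offset_lag(129, 126))
```

Note that only `commit_lag_seconds` depends on a fresh commit timestamp, which is why Append Lag is the safer value to alert on.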
Avoid your own Kafkapocalypse
After our Kafkapocalypse we started monitoring Replication, Retention, and Consumer Lag, and haven’t experienced such a situation again.
Replication and Retention are particularly important to DevOps engineers monitoring a Kafka cluster. You want to:
- Ensure that you monitor both space and time retention.
- Keep as little transient data around as you can, but have a quick way to increase retention.
- Balance your cluster by I/O and leadership to account for broker failures.
- Use multi-tiered replication alerts to detect problems and use dashboards and queries to investigate them.
Consumer Lag is of more interest to application developers who have to read data from the Kafka partitions.
Finally, one more quick story:
Several months ago, just after we’d started the multi-tiered alerting for replication, one of our Kafka brokers died because of a hardware failure. This was at around 3 a.m.
The next morning my team saw the emails and Slack messages about having only two copies of that data, but no one had been paged. For a brief moment we started to panic, but then we realized this was by design—we’d gotten a full night’s sleep. We replaced the faulty server, and went about our day.
We had found a level of monitoring that we’re comfortable with. We no longer suffer from alert fatigue. At the same time, these practices are not so hands-off that we risk another Kafkapocalypse.
OK, so that last one was kind of boring. But the happiest DevOps stories are those in which nothing really happens.