
Diagnose and Debug Apache Kafka Issues: Understanding Increased Connections


When you encounter a problem with Apache Kafka®—for example, an exploding number of connections to your brokers or perhaps some wonky record batching—it’s easy to consider these issues as something to be solved in and of themselves. But, as you’ll soon see, more often than not, these issues are merely symptoms of a wider problem. Rather than treat individual symptoms, wouldn’t it be better to get to the root of the problem with a proper diagnosis?

If you're looking to level up your Kafka debugging game and understand common problems as well as the ailments that an individual symptom could be pointing to, then this blog series is for you. 

Symptoms in the series

Throughout this blog series, we’ll cover a number of common symptoms you may encounter while using Kafka, including:

These issues are common enough that, depending on how badly they’re affecting your normal operations, they might not even draw much attention to themselves. Let’s dive into each of these symptoms individually, learn more about what they are and how they make an impact, and then explore questions to ask yourself to determine the root cause.

In this post, we’ll cover…

Increased connections

If you’ve used Kafka for any amount of time, you’ve likely heard about connections; the most common place that they come up is in regard to clients. Sure, producer and consumer clients connect to the cluster to do their jobs, but it doesn’t stop there. Nearly all interactions across a Kafka cluster occur over connections, so they’re admittedly pretty critical. 

But there’s such a thing as being too connected. Too many connections across a cluster can bog down brokers, potentially impacting requests.

Connections across the cluster

Before diving into an issue caused by increased numbers of connections, it’s important to know the types of connections that are made across your cluster and when they are being made.

Producer and consumer connections

Every time a producer or a consumer client wants to write data to or read data from a Kafka cluster, it initiates and maintains a connection to the brokers. That makes sense. Consumer clients that are part of consumer groups also have the added responsibility of maintaining a connection, sending heartbeats, and providing their membership to the ConsumerGroupCoordinator, which runs within a broker.

Connections are made as we produce data to and consume data from Kafka. But how many connections are made? Well, it depends on a combination of the number of topics, partitions, and brokers involved, as well as a bit of chance.

For both consumers and producers, the number of connections from a single client is capped by the number of topic-partitions with which the client is interacting, since each of those partitions is led by a single broker at any given time. Producers can potentially produce to every partition within a given topic, so a single producer may have to maintain an open connection to every broker, depending on where the leader replica of each topic-partition resides. Consumers, on the other hand, can be more efficient in their connections to brokers. Consumers can act within a consumer group and, as such, each one is assigned only a set number of topic-partitions from which to consume. It’s also important to note that when a consumer or producer starts up for the first time, it connects to one of the bootstrap servers to receive the necessary metadata.

As an example, consider the case where a producer is writing to a topic with 3 partitions and is operating in a cluster with 5 brokers. In this scenario, the producer won’t necessarily maintain an open connection to the brokers that don’t host a partition leader for that topic. So, assuming each of the 3 partition leaders sits on a different broker, we’d just need 3 connections for that producer.
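To make the arithmetic concrete, here is a minimal Python sketch that counts the brokers a producer would need to reach. It assumes one connection per broker that leads at least one of the topic's partitions. The function name and the leader placements are hypothetical, chosen only for illustration.

```python
def brokers_needed(partition_leaders):
    """Return the set of broker ids that lead at least one partition.

    partition_leaders maps partition number -> id of the broker leading it.
    Assumption: the client keeps one connection per distinct leader broker.
    """
    return set(partition_leaders.values())


# Hypothetical placements for a 3-partition topic in a 5-broker cluster.
spread_out = {0: 1, 1: 2, 2: 3}   # every leader on a different broker
co_located = {0: 1, 1: 1, 2: 3}   # two leaders share broker 1

print(len(brokers_needed(spread_out)))   # 3
print(len(brokers_needed(co_located)))   # 2
```

When two leaders land on the same broker, the count drops below the number of partitions, which is where the "bit of chance" comes in.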

All that being said, it’s actually possible that this producer will maintain 4 open connections, depending on which broker the producer connects to on startup. Note that this 4th connection will only be used for the initial metadata call and may not be maintained as long as the other connections. This is affected by metadata.max.age.ms (default 5 minutes), which controls the interval at which metadata is refreshed, and connections.max.idle.ms (default 9 minutes), which allows idle connections to be cleaned up and dropped.
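For reference, the two settings can be written out in milliseconds, which is the unit the configuration keys expect. This is only a sketch: the dictionary and the `to_properties` helper are hypothetical, and the values simply restate the defaults quoted above.

```python
from datetime import timedelta


def minutes_to_ms(minutes):
    """Convert whole minutes to milliseconds."""
    return int(timedelta(minutes=minutes).total_seconds() * 1000)


# Defaults as quoted in the text; adjust to your own client configuration.
client_config = {
    "metadata.max.age.ms": minutes_to_ms(5),
    "connections.max.idle.ms": minutes_to_ms(9),
}


def to_properties(config):
    """Render a config dict as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


print(to_properties(client_config))
# connections.max.idle.ms=540000
# metadata.max.age.ms=300000
```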

Broker connections

Brokers connect with each other, but this depends on specific cluster settings. For example, when topics are replicated, each broker that hosts a follower replica of a given topic-partition will maintain an open connection to the broker on which the leader replica resides. It uses this connection to periodically fetch data from the leader and stay in sync.
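As a rough illustration, the sketch below derives which broker pairs would be linked for replication from a partition assignment. It assumes a single fetch connection per (follower broker, leader broker) pair, which simplifies real deployments, and the assignment shown is hypothetical.

```python
def replication_links(assignments):
    """Return the set of (follower_broker, leader_broker) pairs.

    assignments maps partition -> (leader_broker, list_of_replica_brokers).
    Assumption: one connection per distinct follower/leader broker pair.
    """
    links = set()
    for leader, replicas in assignments.values():
        for broker in replicas:
            if broker != leader:
                links.add((broker, leader))
    return links


# Hypothetical topic: 3 partitions, replication factor 3, brokers 1-3.
assignments = {
    0: (1, [1, 2, 3]),
    1: (2, [2, 3, 1]),
    2: (3, [3, 1, 2]),
}

links = replication_links(assignments)
print(sorted(links))
# [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]
```

With leaders spread evenly, every broker ends up fetching from every other broker, so the number of inter-broker connections grows quickly with cluster size.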

Additional connections

It doesn’t end there! Depending on what kinds of applications you’re building, there are other ways that connections can be made across your cluster. For example, the AdminClient will create individual connections for each topic it attempts to create. 

More metrics to know

We’re not saying it always comes down to metrics, but it doesn’t not always come down to metrics. When it comes to the number of connections to your cluster at any given time, there are a couple of broker and client metrics to keep in mind.

  • kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-count, kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=connection-count and kafka.consumer:type=consumer-metrics,client-id=([-.\w]+),name=connection-count: Quite simply, this is the total number of active connections to the brokers at any given time. While each of the brokers in the latest Kafka versions can handle thousands of simultaneous connections, you’ll want to keep an eye on the trend of your connection counts. Any unexplained spikes could be cause for concern. The same measurement is also conveniently available as a consumer and a producer metric so that you can see the breakdown for your clients.

  • kafka.server:type=socket-server-metrics,listener={listener_name},networkProcessor={#},name=connection-creation-rate, kafka.producer:type=producer-metrics,client-id=([-.\w]+),name=connection-creation-rate and kafka.consumer:type=consumer-metrics,client-id=([-.\w]+),name=connection-creation-rate: This broker-, producer-, and consumer-level metric goes hand-in-hand with connection-count, showing the number of new connections that are being created per second. It’s a good metric to alert on in order to pinpoint a connection storm as it happens and also identify which type of client (producer or consumer) could be causing the issue.

  • kafka.network:type=Acceptor,name=AcceptorBlockedPercent,listener={listener_name}: Internally at Confluent, this metric is crucial for identifying when connection storms are occurring through Confluent Cloud; it’s just as important for you to be aware of. Use it in conjunction with any listener, e.g., the replication listener or an external one. This metric will give you insight into the percentage of time that the listener’s acceptor is blocked from accepting new connections. As an example, for the replication listener, this value will identify any bottlenecks that might be happening in your replication process. Ideally, this value will be 0; any positive value indicates that connections are being throttled.
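One simple way to alert on the metrics above is to compare each new sample against a short rolling baseline. The sketch below is an assumption-laden illustration rather than a recommended alerting rule: the function names, the window size, the spike factor, and the sample values are all hypothetical.

```python
from statistics import mean


def find_spikes(samples, window=3, factor=2.0):
    """Return indices whose value exceeds factor x the mean of the prior window."""
    spikes = []
    for i in range(window, len(samples)):
        baseline = mean(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            spikes.append(i)
    return spikes


def acceptor_is_blocked(blocked_percent):
    """Per the text, any positive AcceptorBlockedPercent deserves a look."""
    return blocked_percent > 0


# Hypothetical connection-creation-rate samples (new connections per second).
rates = [2.0, 3.0, 2.5, 2.0, 40.0, 3.0]
print(find_spikes(rates))          # [4]
print(acceptor_is_blocked(0.0))    # False
print(acceptor_is_blocked(1.5))    # True
```

A rolling baseline keeps the check relative to your own traffic, which matters because a healthy connection count varies widely between clusters.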

Continuing the diagnosis

In addition to seeing an increased number of connections… 

… do you see increased memory consumption? You may want to check if you erroneously created one consumer per thread within a single service instance. See the explanation in the increased rebalance time diagnosis section for more details. But the bottom line is that if you’re moving to a multi-threaded consumer model, avoid creating a consumer per thread as it can increase connections and memory consumption.
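A quick back-of-the-envelope calculation shows why one consumer per thread adds up. This sketch is a rough upper bound only; the function name and the numbers are hypothetical, and it assumes every consumer connects to the same number of brokers.

```python
def estimated_consumer_connections(instances, consumers_per_instance, brokers_per_consumer):
    """Rough upper bound on broker connections opened by a consumer service."""
    return instances * consumers_per_instance * brokers_per_consumer


# Hypothetical service: 10 instances, each consumer reaching 3 brokers.
one_per_thread = estimated_consumer_connections(10, 8, 3)    # 8 threads, 8 consumers
one_per_instance = estimated_consumer_connections(10, 1, 3)  # 1 consumer, shared workers

print(one_per_thread)     # 240
print(one_per_instance)   # 30
```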

… are you witnessing an increased consumer group size and more time to rebalance? Check into your cloud-based KafkaConsumer workloads to see if they’re undersized. This has come up a few times in this blog series; it’s especially relevant in a world with cloud-based Kafka services. If you’re using cloud-based Kafka, it’s reasonable to say that one of the first things you should check when you encounter any issue is whether or not your cloud-based workloads are appropriately sized. It may just save you some time!

… have you seen an increased rate of requests? This could indicate that you’re using multiple KafkaProducer instances within a single service or process. Maybe you’ve recently migrated from another messaging technology and were trying to minimize code changes or perhaps you didn’t quite understand the thread safety of a KafkaProducer. Either way, it could be time to check into your client code.
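If the client code turns out to create a producer per thread, the usual remedy is to share one instance across the process. The sketch below shows that pattern with a hypothetical `StubProducer` standing in for a thread-safe producer client; it is not the Kafka client API, and `get_producer` is an illustrative helper name.

```python
import threading


class StubProducer:
    """Hypothetical stand-in for a thread-safe producer client."""

    instances_created = 0

    def __init__(self):
        StubProducer.instances_created += 1
        self._lock = threading.Lock()
        self.sent = []

    def send(self, topic, value):
        # The lock keeps concurrent sends from interleaving in this stub.
        with self._lock:
            self.sent.append((topic, value))


_producer = None
_producer_lock = threading.Lock()


def get_producer():
    """Return the single process-wide producer, creating it on first use."""
    global _producer
    with _producer_lock:
        if _producer is None:
            _producer = StubProducer()
        return _producer


def worker(n):
    get_producer().send("events", f"message-{n}")


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(StubProducer.instances_created)   # 1
print(len(get_producer().sent))         # 8
```

Eight threads share one producer here, so the process opens one set of broker connections instead of eight.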

Conclusion

Given their nature, broker connections can be tough to understand and keep track of, but that doesn’t mean that you can’t have control of your Kafka cluster! With a fresh understanding of all of the connections being made across your cluster and metrics to watch, you should be able to debug and diagnose your next connection-related issue with more confidence. 


  • Danica began her career as a software engineer in data visualization and warehousing with a business intelligence team where she served as a point-person for standards and best practices in data visualization across her company. In 2018, Danica moved to San Francisco and pivoted to backend engineering with a derivatives data team which was responsible for building and maintaining the infrastructure that processes millions of financial market data per second in near real-time. Her first project on this team involved Kafka Streams – she never looked back. Danica now works as a Developer Advocate with Confluent where she helps others get the most out of their event-driven pipelines.

    Outside of work, Danica is passionate about sustainability, increasing diversity in the technical community, and keeping her many houseplants alive. She can be found on Twitter, tweeting about tech, plants, and baking @TheDanicaFine.

  • Nikoleta Verbeck is a staff solutions engineer at Confluent with the Advanced Technology Group. She has many years of experience with distributed systems, big data, and streaming technologies, serving on the Project Management Committee (PMC) for the Apache Flume project, being an early contributor to the Apache Storm project, and being an early part of the Professional Services Team at Confluent. Nikoleta has headed up architectural design and oversight at a number of companies across real estate, ad tech, telecommunications, and more.
