A Deep Dive into Kafka ZooKeeper Connection Problems

Kafka ZooKeeper connection problems mostly affect older Kafka clusters and clusters that have not moved to KRaft mode. Newer Kafka deployments can run without ZooKeeper, but plenty of production systems still depend on it. If your brokers use zookeeper.connect in server.properties, ZooKeeper is still part of your control plane and deserves the same care as Kafka itself.

When a Kafka broker cannot maintain its ZooKeeper session, the symptoms can look bigger than a simple connection issue. Brokers may restart. Controller elections may repeat. Partitions may become unavailable. Logs may show session expiration, controller resignation, or repeated reconnect attempts. Producers and consumers may only see the downstream effect: metadata errors, timeouts, or unstable leaders.

Start with the role ZooKeeper plays. In ZooKeeper-based Kafka clusters, brokers register themselves there, controller election depends on it, and cluster metadata coordination goes through it. If a broker loses its ZooKeeper connection long enough for the session to expire, Kafka treats that broker as gone from the cluster. Even if the broker process is still running, the cluster may move leadership away from it.

The first check is boring and often catches the problem: verify zookeeper.connect on every broker.

zookeeper.connect=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/kafka
zookeeper.connection.timeout.ms=18000
zookeeper.session.timeout.ms=18000

The connection string should list the ensemble members Kafka can reach. If you use a chroot path such as /kafka, append it once at the end of the string, not after each host, and include it consistently on every broker. Do not configure half the brokers with /kafka and the other half without it; they will behave as if they are talking to different Kafka clusters. If you do use a chroot, create it first or confirm it exists with ZooKeeper tooling.
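
A chroot mismatch is easy to catch mechanically. The following is a minimal Python sketch that splits a zookeeper.connect value into host/port pairs and a chroot, then groups brokers by the chroot they use. The function names and broker labels are illustrative, and the parser assumes hostnames or IPv4 addresses rather than bracketed IPv6 literals.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_zk_connect(value: str) -> Tuple[List[Tuple[str, int]], str]:
    """Split a zookeeper.connect value into (host, port) pairs and a chroot."""
    value = value.strip()
    # The chroot is a single suffix after the last host:port, not one per host.
    slash = value.find("/")
    if slash == -1:
        hosts_part, chroot = value, ""
    else:
        hosts_part, chroot = value[:slash], value[slash:].rstrip("/")
    servers = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:
            # No port given: fall back to ZooKeeper's default client port.
            host, port = entry, "2181"
        servers.append((host, int(port)))
    return servers, chroot


def group_by_chroot(connect_by_broker: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each chroot to the brokers configured with it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for broker, value in sorted(connect_by_broker.items()):
        groups[parse_zk_connect(value)[1]].append(broker)
    return dict(groups)
```

If group_by_chroot returns more than one key, some brokers are pointing at a different namespace than the rest.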

Check DNS as well as the text of the config. A hostname that resolves correctly from your laptop may fail from a broker subnet. Run the checks from the Kafka broker host, not from a bastion unless the bastion has the same network path.

getent hosts zk01.example.com
nc -vz zk01.example.com 2181
nc -vz zk02.example.com 2181
nc -vz zk03.example.com 2181

A successful TCP connection does not prove ZooKeeper is healthy, but a failed one is reason enough to dig into firewalls, security groups, routing, DNS, or listener configuration. Test every Kafka broker against every ZooKeeper node. Partial connectivity can be harder to diagnose than a clean outage because the failure may only appear when a broker tries to connect to one specific ensemble member.
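
If nc is not installed on the broker host, the same reachability check can be approximated with Python's standard library. This is a sketch, not a health check: it only confirms that a TCP connection can be opened. The function names are illustrative, and the probe argument exists so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, Dict, Iterable, List, Tuple

Target = Tuple[str, int]


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def probe_all(
    targets: Iterable[Target],
    probe: Callable[[str, int], bool] = tcp_probe,
) -> Dict[Target, bool]:
    """Probe every ZooKeeper node from the host this script runs on."""
    return {(host, port): probe(host, port) for host, port in targets}


def unreachable(results: Dict[Target, bool]) -> List[Target]:
    """List the targets that could not be reached."""
    return sorted(target for target, ok in results.items() if not ok)
```

Run it on each broker host in turn; a result from a laptop or bastion says nothing about the broker's network path.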

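ZooKeeper's four-letter commands can help when they are enabled. Many installations restrict them, and recent ZooKeeper releases disable most of them unless they are listed in 4lw.commands.whitelist, so do not assume they work. If allowed, ruok should return imok, and mntr can show useful server stats.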

echo ruok | nc zk01.example.com 2181
echo mntr | nc zk01.example.com 2181

If these commands are disabled, use the supported admin tooling or your monitoring stack instead. The point is to answer a simple question: is ZooKeeper listening, participating in the ensemble, and responding quickly?
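
When mntr is available, its output is a list of key/value lines that is easy to turn into a quick verdict. The sketch below parses that text and flags a few conditions. The thresholds are placeholders chosen for illustration, not recommendations, and the function names are hypothetical.

```python
from typing import Dict, List


def parse_mntr(text: str) -> Dict[str, str]:
    """Parse mntr output into a dict, keeping only zk_* stat lines."""
    stats: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Ignore anything that is not a stat, such as a "not in the
        # whitelist" message returned when the command is disabled.
        if len(parts) == 2 and parts[0].startswith("zk_"):
            stats[parts[0]] = parts[1].strip()
    return stats


def find_problems(
    stats: Dict[str, str],
    max_avg_latency_ms: float = 50.0,  # assumed threshold, tune for your cluster
    max_outstanding: int = 10,         # assumed threshold, tune for your cluster
) -> List[str]:
    """Return human-readable warnings for one ZooKeeper server."""
    problems: List[str] = []
    if "zk_server_state" not in stats:
        problems.append("no server state reported")
    if float(stats.get("zk_avg_latency", 0)) > max_avg_latency_ms:
        problems.append("high average latency")
    if int(stats.get("zk_outstanding_requests", 0)) > max_outstanding:
        problems.append("requests are queuing")
    return problems
```

An empty list from find_problems only means these particular checks passed; it does not replace proper monitoring.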

Next, inspect ZooKeeper ensemble health. ZooKeeper needs a majority of the ensemble to be up. A three-node ensemble can tolerate one node being down, but not two. A five-node ensemble can tolerate two. Avoid even-sized ensembles: a four-node ensemble still needs three nodes for a majority, so it tolerates one failure, the same as three nodes, at higher cost. Three and five are common choices.
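
The quorum arithmetic is short enough to write down. This Python sketch computes the majority size and the number of node failures an ensemble of a given size can absorb; the function names are illustrative.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be down while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for size in range(3, 7):
        print(size, quorum_size(size), tolerated_failures(size))
```

Sizes three and four both tolerate one failure, and sizes five and six both tolerate two, which is why the extra node in an even-sized ensemble buys nothing for availability.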

On the ZooKeeper side, look at zoo.cfg. Confirm clientPort, tickTime, initLimit, syncLimit, and the server lines. Make sure the hostnames on the server lines are reachable between ZooKeeper nodes, not only from Kafka brokers. ZooKeeper peers talk to each other on separate quorum and leader election ports, set on those server lines (2888 and 3888 by convention). A Kafka broker may reach 2181 while the ZooKeeper ensemble itself is unhealthy because peer traffic is blocked.
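
To know which peer endpoints to test between ZooKeeper nodes, you can read them straight out of zoo.cfg. The sketch below is a deliberately small parser, assuming the common server.N=host:quorumPort:electionPort form (with an optional ;clientPort suffix); it ignores roles such as observer and is not a full implementation of the file format.

```python
from typing import Dict, List, Tuple


def parse_zoo_cfg(text: str) -> Tuple[Dict[str, str], Dict[int, Dict[str, object]]]:
    """Return (plain settings, server entries keyed by server id)."""
    settings: Dict[str, str] = {}
    servers: Dict[int, Dict[str, object]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("server."):
            # Drop an optional ";clientPort" suffix before splitting fields.
            fields = value.split(";", 1)[0].split(":")
            if len(fields) < 3:
                raise ValueError(f"unexpected server line: {line}")
            servers[int(key.split(".", 1)[1])] = {
                "host": fields[0],
                "quorum_port": int(fields[1]),
                "election_port": int(fields[2]),
            }
        else:
            settings[key] = value
    return settings, servers


def peer_endpoints(servers: Dict[int, Dict[str, object]]) -> List[Tuple[str, int]]:
    """List every host/port pair that other ensemble members must reach."""
    endpoints: List[Tuple[str, int]] = []
    for server_id in sorted(servers):
        entry = servers[server_id]
        endpoints.append((entry["host"], entry["quorum_port"]))
        endpoints.append((entry["host"], entry["election_port"]))
    return endpoints
```

Feed the result of peer_endpoints into whatever connectivity check you already use, run from each ZooKeeper node rather than from a Kafka broker.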

Session timeout tuning is another common source of confusion. Kafka asks ZooKeeper for a session timeout, but ZooKeeper enforces limits based on its own configuration. In ZooKeeper, the minimum session timeout is typically 2 * tickTime and the maximum is typically 20 * tickTime, unless overridden by minSessionTimeout and maxSessionTimeout on the server. A Kafka timeout value outside the allowed range is not rejected; ZooKeeper negotiates it to the nearest bound, so the effective timeout can differ from what server.properties says.

If tickTime=2000, the usual allowed session range is roughly 4 seconds to 40 seconds. A Kafka setting such as zookeeper.session.timeout.ms=18000 fits inside that range. A very low timeout may produce false failures during short network pauses or garbage collection pauses. A very high timeout can make real broker failures take longer to detect. You are choosing between sensitivity and stability.
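
The negotiation described above amounts to clamping the requested value into the server's range. A minimal Python sketch, assuming the default bounds of 2 and 20 ticks unless explicit minimum and maximum values are supplied (the function and parameter names are illustrative):

```python
from typing import Optional


def negotiated_session_timeout(
    requested_ms: int,
    tick_time_ms: int,
    min_session_timeout_ms: Optional[int] = None,
    max_session_timeout_ms: Optional[int] = None,
) -> int:
    """Clamp the timeout a client asks for into the server's allowed range."""
    low = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    high = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(low, min(requested_ms, high))
```

With tickTime=2000, a request for 18000 ms is granted as asked, a request for 60000 ms comes back as 40000 ms, and a request for 1000 ms comes back as 4000 ms.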

Do not change tickTime casually. It affects the whole ZooKeeper ensemble, not only Kafka: initLimit and syncLimit are counted in ticks, so a new tickTime also changes how long followers get to connect and sync. If you need more tolerance for broker pauses, it is often better to start by reviewing Kafka's zookeeper.session.timeout.ms, broker JVM behavior, and network health before changing ZooKeeper timing.

Logs usually tell the story if you line them up by timestamp. On Kafka brokers, search for messages around ZooKeeper disconnects and session expiration:

rg -i "zookeeper|session|expired|controller|reconnect" /var/log/kafka/server.log

Patterns matter more than a single line. A one-time reconnect during a planned ZooKeeper restart may be harmless. Repeated expiration every few minutes points to instability. Expiration during garbage collection points toward JVM pauses or broker overload. Expiration at the same time on many brokers points toward ZooKeeper, the network, or a shared infrastructure event.
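
Telling these patterns apart is mostly a matter of grouping timestamps. The sketch below pulls expiration timestamps out of broker log lines and reports whether several brokers expired together or one broker keeps expiring. It assumes each line starts with a bracketed timestamp like [2024-05-01 12:00:00,123] and that expiration lines contain the word "expired"; check both against your own log format. The window and thresholds are arbitrary starting points.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, Iterable, List

# Assumed line prefix: [YYYY-MM-DD HH:MM:SS,mmm]
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")


def expiration_times(lines: Iterable[str]) -> List[datetime]:
    """Collect timestamps of lines that mention an expired session."""
    times: List[datetime] = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and "expired" in line.lower():
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def classify(
    by_broker: Dict[str, List[datetime]],
    window: timedelta = timedelta(seconds=60),
    repeat_threshold: int = 3,
) -> Dict[str, object]:
    """Flag shared events (most brokers at once) and repeat offenders."""
    events = sorted((ts, broker) for broker, times in by_broker.items() for ts in times)
    shared = False
    for start, _ in events:
        nearby = {b for ts, b in events if start <= ts <= start + window}
        # "Shared" here means at least two brokers and more than half of them.
        if len(nearby) >= 2 and len(nearby) * 2 > len(by_broker):
            shared = True
            break
    repeating = sorted(b for b, times in by_broker.items() if len(times) >= repeat_threshold)
    return {"shared_event": shared, "repeating_brokers": repeating}
```

A shared event suggests looking at ZooKeeper or the network first; a single repeating broker suggests looking at that broker's JVM and host.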

On ZooKeeper nodes, check for leader changes, fsync warnings, connection throttling, and long request latency. ZooKeeper is sensitive to disk latency because it writes transaction logs. A slow disk can make the service appear reachable while still failing to respond quickly enough for stable sessions.

Network latency and packet loss are more important than raw bandwidth for ZooKeeper. Kafka brokers do not need huge throughput to ZooKeeper, but they need reliable, low-latency communication. If brokers and ZooKeeper are split across distant networks, expect trouble. Keep them close. In cloud environments, avoid routing broker-to-ZooKeeper traffic through unnecessary NAT, overloaded firewalls, or cross-region paths.

Resource contention on the Kafka broker can look exactly like a ZooKeeper problem. If the JVM stops the world for a long garbage collection pause, the broker may miss heartbeats. If CPU is saturated, heartbeat handling can be delayed. If the host is stuck in high I/O wait, Kafka may not keep up with coordination work. Check broker metrics at the same timestamps as the ZooKeeper disconnects.

Useful broker-side questions include: did heap usage climb before the disconnect, did GC pause time spike, was disk I/O wait high, did network retransmits increase, and were there large partition reassignments or leader movements at the same time? A broker drowning under load may need fewer partition leaders, better disk, JVM tuning, or a traffic shift. Increasing ZooKeeper timeouts may hide the symptom without fixing the cause.
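
One way to answer the GC question is to line up pause records with the disconnect time. The sketch below assumes you have already extracted pauses from GC logs or metrics into start time and duration pairs; the Pause type, the lookback window, and the "half the session timeout" cutoff are all illustrative choices, not established thresholds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Pause:
    """One stop-the-world pause, as extracted from GC logs or metrics."""
    start: datetime
    duration_ms: int

    @property
    def end(self) -> datetime:
        return self.start + timedelta(milliseconds=self.duration_ms)


def suspicious_pauses(
    disconnect_at: datetime,
    pauses: Iterable[Pause],
    session_timeout_ms: int = 18000,
    lookback: timedelta = timedelta(seconds=60),
    fraction: float = 0.5,  # assumed cutoff: pauses of at least half the timeout
) -> List[Pause]:
    """Return long pauses that overlap the window just before a disconnect."""
    window_start = disconnect_at - lookback
    return [
        pause
        for pause in pauses
        if pause.duration_ms >= fraction * session_timeout_ms
        and pause.end >= window_start
        and pause.start <= disconnect_at
    ]
```

An empty result does not clear the broker; it only means long GC pauses were not the obvious trigger, so CPU, I/O wait, and network retransmits are next.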

Configuration consistency is easy to overlook. All Kafka brokers in the same cluster should use the same ZooKeeper connection string and chroot. They should also have unique broker.id values. A duplicated broker ID causes confusing registration behavior because two processes are trying to represent the same broker; typically the second one fails to register while the first still holds the ID.
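
Both checks can be scripted once the server.properties files are collected in one place. This is a sketch with hypothetical function names: it reads simple key=value properties text, then reports duplicated broker.id values and any disagreement in zookeeper.connect. It compares the connection strings literally, so the same hosts listed in a different order will be reported as a difference.

```python
from collections import defaultdict
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def find_inconsistencies(configs: Dict[str, Dict[str, str]]) -> List[str]:
    """configs maps a host label to its parsed server.properties."""
    by_id: Dict[str, List[str]] = defaultdict(list)
    by_connect: Dict[str, List[str]] = defaultdict(list)
    for host, props in sorted(configs.items()):
        # broker.id may be absent when IDs are auto-generated; skip those.
        if "broker.id" in props:
            by_id[props["broker.id"]].append(host)
        by_connect[props.get("zookeeper.connect", "<missing>")].append(host)
    problems: List[str] = []
    for broker_id, hosts in sorted(by_id.items()):
        if len(hosts) > 1:
            problems.append("broker.id %s used by %s" % (broker_id, ", ".join(hosts)))
    if len(by_connect) > 1:
        for value, hosts in sorted(by_connect.items()):
            problems.append("zookeeper.connect=%s on %s" % (value, ", ".join(hosts)))
    return problems
```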

If you recently changed ZooKeeper hostnames, certificates, firewall rules, or Kafka broker configs, compare the working broker with the failing one. Small differences are common: an old DNS suffix, a missing chroot path, a security group attached to two brokers but not the third, or a typo in one systemd environment file.

Recovery depends on what broke. If a firewall rule was missing, fix it and restart the affected broker if it does not reconnect cleanly. If ZooKeeper lost quorum, restore quorum before bouncing Kafka brokers. If a broker expired because it was overloaded, restarting may bring it back temporarily, but the problem will return unless you remove the pressure.

Use rolling restarts. Restarting every Kafka broker at once because ZooKeeper was flaky can turn a partial outage into a full one. Bring back ZooKeeper health, then restart or recover brokers one at a time while watching controller stability and partition leadership.
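
The shape of a rolling restart is a loop with a health gate between brokers. The sketch below keeps the actual restart and health check as caller-supplied functions, since those depend entirely on your tooling; restart, is_healthy, and the timing defaults are placeholders rather than a real Kafka API.

```python
import time
from typing import Callable, Iterable, List


def rolling_restart(
    brokers: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    wait_seconds: float = 10.0,
    max_checks: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart brokers one at a time, stopping at the first that stays unhealthy."""
    done: List[str] = []
    for broker in brokers:
        restart(broker)
        for _ in range(max_checks):
            if is_healthy(broker):
                break
            sleep(wait_seconds)
        else:
            # Never move on to the next broker while this one is still down.
            raise RuntimeError(
                "%s did not become healthy; restarted so far: %s" % (broker, done)
            )
        done.append(broker)
    return done
```

A reasonable is_healthy would check that the broker has rejoined, the controller is stable, and under-replicated partitions have drained, but what counts as healthy is a decision for your cluster.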

For long-term stability, monitor both sides. On ZooKeeper, watch request latency, outstanding requests, leader changes, follower sync status, disk space, and process restarts. On Kafka, watch controller changes, offline partitions, under-replicated partitions, broker restarts, and logs mentioning ZooKeeper session expiration. Alert on repeated patterns, not only total process death.

The cleanest fix, for teams planning a larger upgrade, may be migration away from ZooKeeper to Kafka's KRaft mode. That is a project, not an incident response step. It requires version planning, compatibility checks, and careful migration work. Until then, treat ZooKeeper as production infrastructure. Keep it small, close to Kafka, consistently configured, monitored, and boring.

One practical runbook pattern is to build a small matrix during the incident. Put Kafka brokers on one axis and ZooKeeper nodes on the other. Fill each cell with the result of nc -vz host 2181 and, if available, a simple ZooKeeper health check. This turns vague "Kafka cannot reach ZooKeeper" reports into a visible pattern. If every broker fails to reach zk02, investigate zk02 or its network path. If only broker-4 fails to reach every ZooKeeper node, investigate that broker's host, route table, DNS, or firewall.
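
Once the cells are filled in, reading the matrix can be automated too. This Python sketch takes results keyed by broker and ZooKeeper node (True for reachable) and pulls out the two patterns described above. How the results are collected is left open, and the names are illustrative.

```python
from typing import Dict, List

Matrix = Dict[str, Dict[str, bool]]  # broker -> zk node -> reachable?


def summarize_matrix(matrix: Matrix) -> Dict[str, List[str]]:
    """Find ZooKeeper nodes no broker can reach and brokers that reach none."""
    brokers = sorted(matrix)
    zk_nodes = sorted({zk for row in matrix.values() for zk in row})
    return {
        "zk_unreachable_from_all": [
            zk for zk in zk_nodes
            if all(not matrix[b].get(zk, False) for b in brokers)
        ],
        "brokers_reaching_none": [
            b for b in brokers
            if all(not matrix[b].get(zk, False) for zk in zk_nodes)
        ],
    }


def render(matrix: Matrix) -> str:
    """Render the matrix as tab-separated text for an incident channel."""
    zk_nodes = sorted({zk for row in matrix.values() for zk in row})
    lines = ["\t".join(["broker"] + zk_nodes)]
    for broker in sorted(matrix):
        cells = ["ok" if matrix[broker].get(zk, False) else "FAIL" for zk in zk_nodes]
        lines.append("\t".join([broker] + cells))
    return "\n".join(lines)
```

A missing cell is treated as a failure, which errs on the side of prompting someone to go and run the check.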

Time synchronization can also matter. ZooKeeper session mechanics do not require perfectly identical wall clocks for every operation, but badly skewed clocks make logs harder to interpret and can break surrounding automation, certificates, and monitoring. Keep NTP or chrony healthy on Kafka and ZooKeeper nodes. When timestamps disagree during an outage, people waste time chasing the wrong sequence of events.

Be careful with containerized or orchestrated ZooKeeper deployments. ZooKeeper stores identity and data on disk. If pods move and lose persistent identity, or if service discovery points clients at nodes that are not ready, Kafka may see unstable connection behavior. StatefulSet-style identity, persistent volumes, stable DNS, and readiness checks matter. A ZooKeeper ensemble should not behave like a set of disposable stateless web pods.

Security settings add another layer. If SASL, TLS, or network policy was recently introduced, connection failures may look like plain reachability problems at first. Check whether Kafka logs show authentication failures, handshake failures, or authorization errors rather than TCP timeouts. A port can be open while the session still fails because the broker cannot authenticate to ZooKeeper.

After the incident, keep a short record of the exact symptom, root cause, and fix. ZooKeeper problems often repeat because the original repair was local: one firewall rule, one broker restart, one timeout bump. A good post-incident note should say whether the cluster had quorum, which brokers lost sessions, whether ISR shrank, whether the controller changed, and what monitoring will catch it earlier next time.

If you are troubleshooting from Kubernetes or another scheduler, also check where the Kafka and ZooKeeper workloads landed. A node-level network issue, disk issue, or CPU starvation event can affect only the pods scheduled there. Moving a pod may appear to fix the problem, but the real issue may be the host. Compare events and node metrics before declaring the application repaired.

Backups and snapshots deserve caution. ZooKeeper data directories should not be snapshotted casually while the process is active unless your backup method is designed for it. For Kafka metadata, a damaged or stale ZooKeeper state can be extremely disruptive. Follow ZooKeeper-supported backup practices and test restore procedures away from production. A backup that no one has restored is only a hopeful file.

The best preventive move is to keep ZooKeeper boring. Do not co-locate it with heavy Kafka brokers if you can avoid it. Keep its disks reliable. Keep heap sizing conservative and monitored. Limit who can change ensemble membership. Most ZooKeeper incidents I have seen were not caused by exotic bugs; they came from ordinary infrastructure drift around a small service everyone forgot was critical.