Best Practices for Monitoring Kafka Health with Built-in Commands
Use Kafka CLI commands to check topic replication, consumer lag, broker API status, and basic cluster health during incidents.
Kafka's built-in commands are the fastest way to answer basic incident questions: does every partition have a leader, are replicas in sync, and are consumers falling behind? They do not replace Prometheus, JMX, or a managed monitoring platform, but they are excellent for quick checks from a bastion host or admin container.
The examples below use --bootstrap-server, which connects the admin tools directly to the brokers and is the option current Kafka releases expect; the older --zookeeper option is deprecated and absent from recent versions.
Set a Clean Command Environment
Keep the broker list in a variable so every command is repeatable:
export KAFKA_HOME=/opt/kafka
export BOOTSTRAP_SERVER="kafka1:9092,kafka2:9092,kafka3:9092"
cd "$KAFKA_HOME/bin"
If your cluster uses TLS or SASL, put client settings in a properties file and pass it with --command-config:
./kafka-topics.sh \
--bootstrap-server "$BOOTSTRAP_SERVER" \
--command-config /etc/kafka/admin-client.properties \
--list
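As an illustration of what such a properties file might contain, the sketch below writes one for a cluster that uses SASL/SCRAM over TLS. The helper name write_admin_props, the user name, the passwords, and the truststore path are placeholders assumed for this example; the property keys are standard Kafka client settings, but the right values depend on how your cluster is secured.

```shell
# Hypothetical helper: write a sample admin client config to the path given as $1.
# All values are placeholders; adjust the mechanism and paths to your cluster.
write_admin_props() {
  cat > "$1" <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin-user" password="change-me";
ssl.truststore.location=/etc/kafka/truststore.jks
ssl.truststore.password=change-me
EOF
  # The file holds credentials, so restrict it to the owner.
  chmod 600 "$1"
}
```

Calling write_admin_props /etc/kafka/admin-client.properties would produce the file that --command-config reads in the command above.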
Check Topic and Partition Health
Start with kafka-topics.sh --describe. A healthy topic should have a leader for each partition and an in-sync replica list that matches the expected replication factor.
./kafka-topics.sh \
--bootstrap-server "$BOOTSTRAP_SERVER" \
--describe \
--topic orders
Look for these fields:
- Leader: should not be none.
- Replicas: the assigned brokers for the partition.
- Isr: replicas currently in sync with the leader.
If Replicas lists three brokers but Isr lists only one, that partition is under-replicated. That usually points to broker downtime, disk pressure, network trouble, or a replica that cannot keep up.
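The Replicas versus Isr comparison can be automated when a topic has many partitions. The following is a minimal sketch, not an official tool: the function name isr_shortfall is made up for this example, and it assumes the describe output labels fields as "Topic:", "Partition:", "Replicas:", and "Isr:", so it looks fields up by label rather than by position.

```shell
# Hypothetical helper: read `kafka-topics.sh --describe` output on stdin and
# print "topic partition isr_count/replica_count" for each partition whose
# ISR is smaller than its replica list.
isr_shortfall() {
  awk '
    /Partition:/ && /Replicas:/ {
      topic = part = replicas = isr = ""
      # Find each value by the label that precedes it.
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      # An empty ISR leaves the next label (or nothing) after "Isr:".
      if (isr ~ /:$/) isr = ""
      n_rep = split(replicas, r, ",")
      n_isr = (isr == "") ? 0 : split(isr, s, ",")
      if (n_isr < n_rep) print topic, part, n_isr "/" n_rep
    }
  '
}
```

A typical use is to pipe the describe command into it, for example ./kafka-topics.sh --bootstrap-server "$BOOTSTRAP_SERVER" --describe --topic orders | isr_shortfall. No output means every partition's ISR is as large as its replica list.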
Find Under-Replicated Partitions Quickly
Use the built-in filter when you need a fast cluster-wide check:
./kafka-topics.sh \
--bootstrap-server "$BOOTSTRAP_SERVER" \
--describe \
--under-replicated-partitions
No output is good news. Any output deserves investigation, especially for topics that set min.insync.replicas: once the ISR shrinks below that value, producers using acks=all start receiving errors instead of acknowledgements.
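For cron jobs or runbook scripts, the "no output is good" rule can be turned into an exit code. This is a rough sketch under the assumption that each under-replicated partition appears as one line containing "Partition: "; the function name check_urp is hypothetical.

```shell
# Hypothetical helper: read --under-replicated-partitions output on stdin.
# Prints a one-line summary and returns 1 if any partition line is present.
check_urp() {
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  count=$(grep -c 'Partition: ' || true)
  if [ "$count" -eq 0 ]; then
    echo "OK: no under-replicated partitions"
  else
    echo "WARN: $count under-replicated partition(s)"
    return 1
  fi
}
```

It would be used as ./kafka-topics.sh --bootstrap-server "$BOOTSTRAP_SERVER" --describe --under-replicated-partitions | check_urp. Note that an empty result also occurs when the command itself fails to connect, so this check is only meaningful after broker connectivity has been confirmed.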
Monitor Consumer Lag
Consumer lag tells you whether a consumer group is keeping up with produced records.
./kafka-consumer-groups.sh \
--bootstrap-server "$BOOTSTRAP_SERVER" \
--describe \
--group payments-worker
Important columns include:
- CURRENT-OFFSET: where the group has committed progress.
- LOG-END-OFFSET: the latest offset in the partition.
- LAG: the difference between the two.
- CONSUMER-ID, HOST, and CLIENT-ID: which consumer owns the partition.
Short spikes can be normal during deploys or traffic bursts. Sustained lag means the group needs attention. Common causes are slow processing, too few consumers, partition imbalance, downstream dependency latency, and broker-side fetch delays.
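To tell a spike from sustained lag, it helps to reduce the describe output to a single number that can be sampled repeatedly. The sketch below sums the LAG column; the function name total_lag is an assumption of this example, and it locates the column from the header row because column order can differ between Kafka versions.

```shell
# Hypothetical helper: read `kafka-consumer-groups.sh --describe` output on
# stdin and print the total lag across all partitions.
total_lag() {
  awk '
    # Until the header row is seen, look for the column named LAG.
    lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
    # Data rows: add numeric lag values; rows showing "-" are skipped.
    $lag_col ~ /^[0-9]+$/ { sum += $lag_col }
    END { print sum + 0 }
  '
}
```

For example, ./kafka-consumer-groups.sh --bootstrap-server "$BOOTSTRAP_SERVER" --describe --group payments-worker | total_lag can be run a few times a minute apart; a total that keeps growing suggests the group is not catching up.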
List Active Consumer Groups
When you do not know the group name, list groups first:
./kafka-consumer-groups.sh \
--bootstrap-server "$BOOTSTRAP_SERVER" \
--list
Then inspect the group that maps to the affected application.
Check Broker API Reachability
kafka-broker-api-versions.sh is a simple way to confirm that your client can reach brokers and complete a metadata/API handshake.
./kafka-broker-api-versions.sh \
--bootstrap-server "$BOOTSTRAP_SERVER"
If this fails, check DNS, security groups or firewalls, TLS/SASL settings, and whether the advertised listener addresses are reachable from where you run the command.
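When the handshake fails, a plain TCP check against each bootstrap address can separate network problems from TLS/SASL problems. The helper below is a small sketch that only splits the broker list into host and port pairs; the name split_bootstrap is hypothetical, and the nc usage shown afterwards assumes a netcat variant that supports -z and -w.

```shell
# Hypothetical helper: print "host port" for each entry in a comma-separated
# bootstrap list such as "kafka1:9092,kafka2:9092".
split_bootstrap() {
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    # Everything before the last colon is the host, everything after is the port.
    printf '%s %s\n' "${entry%:*}" "${entry##*:}"
  done
}
```

One way to use it: split_bootstrap "$BOOTSTRAP_SERVER" | while read -r host port; do nc -z -w 3 "$host" "$port" && echo "$host:$port open" || echo "$host:$port unreachable"; done. This tests only the bootstrap addresses; the advertised listener addresses returned by the brokers may differ and still need to be reachable.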
Use CLI Checks During Incidents
A practical triage flow looks like this:
- Run kafka-broker-api-versions.sh to confirm connectivity.
- Run kafka-topics.sh --describe --under-replicated-partitions to check replication health.
- Describe the affected topic and verify leaders and ISR.
- Describe the affected consumer group and check lag by partition.
- Compare the slow partitions with broker, disk, and application logs.
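The command-driven steps of this flow can be wrapped in one function so they run in the same order every time. This is a sketch rather than a finished tool: the name kafka_triage is made up, it assumes KAFKA_HOME and BOOTSTRAP_SERVER are set as in the environment section, and a secured cluster would also need --command-config added to each call.

```shell
# Hypothetical helper: run the CLI triage steps for one topic and one group.
# Usage: kafka_triage <topic> <consumer-group>
kafka_triage() {
  topic=$1
  group=$2
  bin="${KAFKA_HOME:-/opt/kafka}/bin"

  echo "== 1. Broker connectivity =="
  if ! "$bin/kafka-broker-api-versions.sh" --bootstrap-server "$BOOTSTRAP_SERVER" >/dev/null; then
    echo "Cannot reach brokers; check DNS, firewalls, TLS/SASL, and advertised listeners" >&2
    return 1
  fi
  echo "reachable"

  echo "== 2. Under-replicated partitions =="
  "$bin/kafka-topics.sh" --bootstrap-server "$BOOTSTRAP_SERVER" \
    --describe --under-replicated-partitions

  echo "== 3. Topic $topic =="
  "$bin/kafka-topics.sh" --bootstrap-server "$BOOTSTRAP_SERVER" \
    --describe --topic "$topic"

  echo "== 4. Consumer group $group =="
  "$bin/kafka-consumer-groups.sh" --bootstrap-server "$BOOTSTRAP_SERVER" \
    --describe --group "$group"
}
```

Running kafka_triage orders payments-worker prints the four sections in sequence and stops early if the brokers are unreachable. The final step, comparing slow partitions with broker, disk, and application logs, stays manual.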
Takeaway
Kafka's built-in commands give you a reliable first look at cluster health. Keep admin client configs ready, use --bootstrap-server, and focus on leaders, ISR, under-replicated partitions, and consumer lag. Once the CLI shows where the problem sits, deeper broker metrics and logs are much easier to interpret.