Best Practices for Monitoring Kafka Health with Built-in Commands

Use Kafka CLI commands to check topic replication, consumer lag, broker API status, and basic cluster health during incidents.

Kafka's built-in commands are the fastest way to answer basic incident questions: are partitions led, are replicas in sync, and are consumers falling behind? They do not replace Prometheus, JMX, or a managed monitoring platform, but they are excellent for quick checks from a bastion host or admin container.

The examples below use --bootstrap-server, which connects the tools directly to the brokers. It replaces the older --zookeeper option, which kafka-topics.sh dropped in Kafka 3.0.

Set a Clean Command Environment

Keep the broker list in a variable so every command is repeatable:

export KAFKA_HOME=/opt/kafka
export BOOTSTRAP_SERVER="kafka1:9092,kafka2:9092,kafka3:9092"
cd "$KAFKA_HOME/bin"

If your cluster uses TLS or SASL, put client settings in a properties file and pass it with --command-config:

./kafka-topics.sh \
  --bootstrap-server "$BOOTSTRAP_SERVER" \
  --command-config /etc/kafka/admin-client.properties \
  --list
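As a rough sketch of what such a properties file might contain, the helper below writes one for a cluster that is assumed to use SASL_SSL with SCRAM. The function name write_admin_config, the credentials, and the truststore path are all placeholders for illustration; the property keys are standard Kafka client settings, but the values must match how your brokers are actually configured.

```shell
# Hypothetical helper: writes an example admin client config to the given path.
# Every value is a placeholder; match the protocol and mechanism to your brokers.
write_admin_config() {
  path="$1"
  (
    umask 077   # the file holds credentials, so create it owner-readable only
    cat > "$path" <<'EOF'
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="change-me";
ssl.truststore.location=/etc/kafka/secrets/client.truststore.jks
ssl.truststore.password=change-me
EOF
  )
}
```

For example, write_admin_config ./admin-client.properties creates the file in the current directory; after editing the values, pass that path to --command-config.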

Check Topic and Partition Health

Start with kafka-topics.sh --describe. A healthy topic has a leader for every partition and an in-sync replica (ISR) list that contains every assigned replica.

./kafka-topics.sh \
  --bootstrap-server "$BOOTSTRAP_SERVER" \
  --describe \
  --topic orders

Look for these fields:

  • Leader: should not be none.
  • Replicas: the assigned brokers for the partition.
  • Isr: replicas currently in sync with the leader.

If Replicas lists three brokers but Isr lists only one, that partition is under-replicated. That usually points to broker downtime, disk pressure, network trouble, or a replica that cannot keep up.
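Comparing Replicas and Isr by eye gets tedious on topics with many partitions. The sketch below reads --describe output on stdin and prints only the partitions whose Isr list is shorter than their Replicas list. The function name find_shrunk_isr is made up for this example, and the parsing assumes the "Label: value" layout printed by recent Kafka releases, so treat it as a starting point rather than a finished tool.

```shell
# Reads `kafka-topics.sh --describe` output on stdin and prints partitions
# whose Isr list is shorter than the Replicas list.
# Assumes the "Label: value" layout of recent Kafka releases.
find_shrunk_isr() {
  awk '
    /Partition: [0-9]/ {
      topic = part = replicas = isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      # An empty Isr value would otherwise pick up the next label.
      if (isr ~ /:$/) isr = ""
      n_rep = split(replicas, r, ",")
      n_isr = split(isr, s, ",")
      if (n_isr < n_rep)
        printf "%s-%s: %d of %d replicas in sync\n", topic, part, n_isr, n_rep
    }
  '
}
```

To use it, pipe the describe command for a topic into find_shrunk_isr. No output means every partition in the input has a full ISR.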

Find Under-Replicated Partitions Quickly

Use the built-in filter when you need a fast cluster-wide check:

./kafka-topics.sh \
  --bootstrap-server "$BOOTSTRAP_SERVER" \
  --describe \
  --under-replicated-partitions

No output is good news. Any output deserves investigation, especially for topics that set min.insync.replicas: once a partition's ISR shrinks below that value, producers using acks=all have their writes rejected.
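For scripts and cron checks, "no output" is easier to act on as an exit status. The following sketch wraps that idea in a function, here called assert_fully_replicated (a hypothetical name), which reads the command output on stdin and fails when any partition line is present. It assumes each reported partition appears on a line containing "Partition: ".

```shell
# Hypothetical wrapper: succeeds only when the under-replicated check
# printed no partition lines. Reads the command output on stdin.
assert_fully_replicated() {
  count="$(grep -c 'Partition: ' || true)"   # grep exits 1 on zero matches
  if [ "$count" -gt 0 ]; then
    echo "WARN: $count under-replicated partition(s)" >&2
    return 1
  fi
  echo "OK: no under-replicated partitions"
}
```

Pipe the --under-replicated-partitions command into assert_fully_replicated to use it. One caveat: a command that fails to connect also prints no partitions, so check the exit status of kafka-topics.sh as well (for example with set -o pipefail in bash) before trusting an OK.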

Monitor Consumer Lag

Consumer lag tells you whether a consumer group is keeping up with produced records.

./kafka-consumer-groups.sh \
  --bootstrap-server "$BOOTSTRAP_SERVER" \
  --describe \
  --group payments-worker

Important columns include:

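  • CURRENT-OFFSET: the last offset the group committed for the partition.
  • LOG-END-OFFSET: the offset the next record written to the partition will receive.
  • LAG: LOG-END-OFFSET minus CURRENT-OFFSET, the number of records the group has not yet committed.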
  • CONSUMER-ID, HOST, and CLIENT-ID: which consumer owns the partition.

Short spikes can be normal during deploys or traffic bursts. Sustained lag needs attention. Common causes are slow processing, too few consumers, partition imbalance, downstream dependency latency, and broker-side fetch delays.
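When a group spans many partitions, a one-line summary helps separate "everything is a little behind" from "one partition is stuck". The sketch below reads the --describe output on stdin and reports total lag and the worst partition. The function name summarize_lag is invented for this example. It locates the TOPIC, PARTITION, and LAG columns from the header row rather than by fixed position, and it skips rows where LAG is "-" (no committed offset).

```shell
# Reads `kafka-consumer-groups.sh --describe --group <name>` output on stdin
# and prints total lag plus the partition with the highest lag.
# Column positions come from the header row, since they differ across versions.
summarize_lag() {
  awk '
    /LOG-END-OFFSET/ {                 # header row: remember column positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    ("LAG" in col) && $(col["LAG"]) ~ /^[0-9]+$/ {
      lag = $(col["LAG"]) + 0
      total += lag
      if (worst == "" || lag > max) {
        max = lag
        worst = $(col["TOPIC"]) "-" $(col["PARTITION"])
      }
    }
    END {
      printf "total_lag=%d worst=%s max_lag=%d\n", total, (worst == "" ? "none" : worst), max
    }
  '
}
```

Running the describe command piped into summarize_lag every minute or so during an incident shows whether the total is shrinking, flat, or growing, which is usually more informative than any single reading.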

List Consumer Groups

When you do not know the group name, list groups first. The output includes every group the cluster knows about, not only those with active members:

./kafka-consumer-groups.sh \
  --bootstrap-server "$BOOTSTRAP_SERVER" \
  --list

Then inspect the group that maps to the affected application.

Check Broker API Reachability

kafka-broker-api-versions.sh is a simple way to confirm that your client can reach the brokers: it connects to each one and prints the API versions that broker supports.

./kafka-broker-api-versions.sh \
  --bootstrap-server "$BOOTSTRAP_SERVER"

If this fails, check DNS, security groups or firewalls, TLS/SASL settings, and whether the advertised listener addresses are reachable from where you run the command.

Use CLI Checks During Incidents

A practical triage flow looks like this:

  1. Run kafka-broker-api-versions.sh to confirm connectivity.
  2. Run kafka-topics.sh --describe --under-replicated-partitions to check replication health.
  3. Describe the affected topic and verify leaders and ISR.
  4. Describe the affected consumer group and check lag by partition.
  5. Compare the slow partitions with broker, disk, and application logs.
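Steps 1 through 4 of this flow can be sketched as a single function, so that nobody has to remember flags under pressure. The function name kafka_triage and the KAFKA_BIN and ADMIN_CONFIG variables are assumptions made for this example; BOOTSTRAP_SERVER is the variable set earlier. Step 5 still needs a human looking at logs.

```shell
# Sketch of the triage flow. KAFKA_BIN and ADMIN_CONFIG are hypothetical
# variables: the CLI directory and an optional --command-config file.
kafka_triage() {
  topic="$1"
  group="$2"
  bin="${KAFKA_BIN:-/opt/kafka/bin}"

  echo "== 1. Broker reachability =="
  "$bin/kafka-broker-api-versions.sh" --bootstrap-server "$BOOTSTRAP_SERVER" \
    ${ADMIN_CONFIG:+--command-config "$ADMIN_CONFIG"} >/dev/null || {
    echo "Cannot reach brokers; check connectivity before anything else." >&2
    return 1
  }

  echo "== 2. Under-replicated partitions (no output is good) =="
  "$bin/kafka-topics.sh" --bootstrap-server "$BOOTSTRAP_SERVER" \
    ${ADMIN_CONFIG:+--command-config "$ADMIN_CONFIG"} \
    --describe --under-replicated-partitions

  echo "== 3. Topic: $topic =="
  "$bin/kafka-topics.sh" --bootstrap-server "$BOOTSTRAP_SERVER" \
    ${ADMIN_CONFIG:+--command-config "$ADMIN_CONFIG"} \
    --describe --topic "$topic"

  echo "== 4. Consumer group: $group =="
  "$bin/kafka-consumer-groups.sh" --bootstrap-server "$BOOTSTRAP_SERVER" \
    ${ADMIN_CONFIG:+--command-config "$ADMIN_CONFIG"} \
    --describe --group "$group"
}
```

A call such as kafka_triage orders payments-worker runs the checks in order and stops early if the brokers are unreachable, since the later steps would only produce connection errors.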

Takeaway

Kafka's built-in commands give you a reliable first look at cluster health. Keep admin client configs ready, use --bootstrap-server, and focus on leaders, ISR, under-replicated partitions, and consumer lag. Once the CLI shows where the problem sits, deeper broker metrics and logs are much easier to interpret.