Essential Tools and Techniques for Debugging Elasticsearch Cluster Issues

Debug Elasticsearch cluster issues with cat APIs, allocation explain, logs, node stats, and focused shard checks.

Elasticsearch cluster issues usually show up as a red or yellow health status, slow searches, rejected writes, or nodes dropping from the cluster. The fastest way to debug them is to start with cluster health, then narrow the problem to shards, nodes, allocation rules, logs, or resource pressure.

This guide walks through the built-in tools you will use most often: _cat APIs, _cluster/allocation/explain, node stats, pending tasks, and Elasticsearch logs.

Understanding Elasticsearch Cluster Health

Cluster health gives you the first signal:

  • green: All primary and replica shards are allocated.
  • yellow: All primary shards are allocated, but one or more replica shards are not.
  • red: One or more primary shards are unassigned, so some data is unavailable.

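A yellow cluster still serves all reads and writes because every primary shard is allocated, but it has less redundancy: losing a node that holds a primary without an allocated replica can turn the cluster red. A red cluster needs immediate investigation because the affected primary shards, and the data they hold, are unavailable.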
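For monitoring scripts, the three states can be mapped to exit codes. The following is a minimal sketch, not part of any Elasticsearch tooling: the function name health_to_code and the exit-code convention (0, 1, 2, 3) are assumptions chosen for illustration.

```shell
# Map a cluster health status word to a monitoring-style exit code.
# Reads the status on stdin, for example from:
#   curl -s "localhost:9200/_cat/health?h=status"
# health_to_code is a hypothetical helper name, not an Elasticsearch command.
health_to_code() {
  # Strip whitespace: _cat output ends with a newline and may be padded.
  status=$(tr -d '[:space:]')
  case "$status" in
    green)  echo "OK: all shards allocated";        return 0 ;;
    yellow) echo "WARN: replica shards unassigned"; return 1 ;;
    red)    echo "CRIT: primary shards unassigned"; return 2 ;;
    *)      echo "UNKNOWN: unexpected status '$status'"; return 3 ;;
  esac
}
```

A typical call would be: curl -s "localhost:9200/_cat/health?h=status" | health_to_code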

Start With the _cat APIs

The _cat APIs are built for quick human-readable checks.

curl -X GET "localhost:9200/_cat/health?v"
curl -X GET "localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,disk.used_percent,load_1m,node.role"
curl -X GET "localhost:9200/_cat/shards?v"
curl -X GET "localhost:9200/_cat/indices?v"

Use _cat/health to confirm the overall state. Use _cat/shards to find UNASSIGNED, INITIALIZING, or repeatedly relocating shards. Use _cat/nodes to spot heap, CPU, or disk pressure on a specific node.
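Spotting pressure by eye gets harder as the node count grows. As a rough sketch, the _cat/nodes output can be filtered with awk. The helper name flag_pressured_nodes, the HEAP_MAX and DISK_MAX variables, and the 85 percent defaults are assumptions for illustration, not recommended limits. The function expects headerless rows (no ?v) and node names without spaces.

```shell
# Flag nodes whose heap or disk usage crosses a threshold.
# Expects headerless rows in this column order:
#   name heap.percent cpu disk.used_percent
# for example from:
#   curl -s "localhost:9200/_cat/nodes?h=name,heap.percent,cpu,disk.used_percent"
# HEAP_MAX and DISK_MAX are hypothetical knobs; 85 is an illustrative default.
flag_pressured_nodes() {
  awk -v heap_max="${HEAP_MAX:-85}" -v disk_max="${DISK_MAX:-85}" '
    {
      # Rows with a missing disk column have fewer than four fields,
      # so each check guards on the field count.
      if (NF >= 2 && $2 + 0 >= heap_max) print $1 ": heap " $2 "%"
      if (NF >= 4 && $4 + 0 >= disk_max) print $1 ": disk " $4 "%"
    }'
}
```

Treat a flagged node as a prompt to look closer rather than as a diagnosis. A single high heap reading may only mean that garbage collection has not run yet.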

For a red or yellow cluster, this command gives you a focused view:

curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node&s=state,index"
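On a cluster with many shards, a summary is often more useful than the raw list. The sketch below counts unassigned shards by type (p or r) and unassigned.reason. The helper name count_unassigned_by_reason is an assumption. It expects the same columns as the command above but without ?v, so there is no header row.

```shell
# Count unassigned shards grouped by primary/replica and unassigned.reason.
# Expects headerless rows in this column order:
#   index shard prirep state unassigned.reason node
# Unassigned rows have no node, so the reason is the fifth field.
count_unassigned_by_reason() {
  awk '$4 == "UNASSIGNED" { count[$3 " " $5]++ }
       END { for (k in count) print k, count[k] }' | sort
}
```

A typical call would be: curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node" | count_unassigned_by_reason. Many replicas with NODE_LEFT usually point to a lost node, while any line starting with p explains a red status.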

Explain Shard Allocation

When a shard is unassigned, _cluster/allocation/explain tells you why Elasticsearch cannot place it.

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "index": "my_index",
  "shard": 0,
  "primary": true
}'

You can also ask Elasticsearch to explain the first unassigned shard it finds. If no shard is unassigned, this request returns an error instead of an explanation:

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' -d'{}'

Read the can_allocate, allocate_explanation, and node_allocation_decisions fields. Common causes include disk watermarks, allocation filtering, missing nodes, incompatible index settings, or too few data nodes for the requested replica count.
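When several shards are unassigned, it can help to generate one request body per shard instead of editing JSON by hand. This is a minimal sketch. The helper name explain_bodies is an assumption, and it relies on index names never containing quotes or backslashes, which Elasticsearch does not allow.

```shell
# Turn _cat/shards rows into allocation-explain request bodies.
# Expects headerless rows in this column order: index shard prirep state
explain_bodies() {
  awk '$4 == "UNASSIGNED" {
    # prirep is "p" for a primary and "r" for a replica.
    primary = ($3 == "p") ? "true" : "false"
    printf "{\"index\":\"%s\",\"shard\":%d,\"primary\":%s}\n", $1, $2, primary
  }'
}

# Possible usage against a live cluster (left commented out so the
# function can be loaded without contacting one):
#   curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state" |
#     explain_bodies |
#     while IFS= read -r body; do
#       curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
#         -H 'Content-Type: application/json' -d "$body"
#     done
```

On a large cluster, consider explaining only a few shards first. Many unassigned shards often share one cause.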

Check Node and Cluster Stats

When health is green but searches or writes are slow, check resource pressure.

curl -X GET "localhost:9200/_nodes/stats/jvm,fs,os,process,thread_pool?pretty"
curl -X GET "localhost:9200/_cluster/stats?pretty"

Look for high JVM heap usage, disk pressure, rejected search or write tasks, and nodes with much higher load than their peers. A single overloaded node can slow the whole cluster if it owns hot shards.

For thread pool rejections, use:

curl -X GET "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed"

Rejected tasks usually mean the node could not keep up with the request rate. Fix the cause before raising queues: reduce query cost, spread shards, scale nodes, or slow bulk indexing.
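The rejected counter is cumulative since the node started, so one reading cannot tell you whether rejections are happening now. One way to check is to compare two snapshots taken a short time apart. The sketch below assumes each snapshot was saved without a header using h=node_name,name,rejected. The helper name rejected_delta and the file arguments are illustrative.

```shell
# Print thread pools whose rejected count grew between two snapshots.
# $1 = earlier snapshot file, $2 = later snapshot file.
# Each file holds headerless rows: node_name name rejected
# Assumes node names contain no spaces.
rejected_delta() {
  awk 'FILENAME == ARGV[1] { before[$1 " " $2] = $3; next }
       {
         key = $1 " " $2
         # A pool missing from the first snapshot counts from zero.
         # A node restart resets the counter, giving a negative delta
         # that is skipped here.
         delta = $3 - before[key]
         if (delta > 0) print key, "+" delta
       }' "$1" "$2"
}
```

For example, save curl -s "localhost:9200/_cat/thread_pool/search,write?h=node_name,name,rejected" to one file, wait a minute, save it again to a second file, then pass both files to rejected_delta.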

Review Pending Tasks and Recovery

If cluster state changes feel stuck, check pending tasks:

curl -X GET "localhost:9200/_cluster/pending_tasks?pretty"

A long queue can point to master node pressure, frequent mapping updates, shard churn, or unstable nodes.

For shard movement and recovery, use:

curl -X GET "localhost:9200/_cat/recovery?v&active_only=true"

This helps you separate a cluster that is actively recovering from one that is blocked by allocation rules or missing data.

Read the Elasticsearch Logs

Elasticsearch logs often explain what the APIs only hint at. Check logs on the affected node, not just a random node in the cluster.

Search for messages such as:

  • master not discovered
  • flood stage disk watermark
  • CircuitBreakingException (circuit_breaking_exception in API responses)
  • rejected execution
  • failed to obtain node locks
  • shard failed
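A quick count of these messages shows which problem dominates a log file. The sketch below is a rough starting point. The helper name scan_es_log is an assumption, and the patterns are loose approximations because exact log wording varies between Elasticsearch versions. Pass the log file path for your installation as the only argument.

```shell
# Count lines matching common trouble patterns in one log file.
# $1 = path to an Elasticsearch log file.
# Patterns are case-insensitive extended regexes and deliberately loose.
scan_es_log() {
  for pattern in \
    "master not discovered" \
    "flood[- ]stage disk watermark" \
    "circuit_?breaking_?exception" \
    "rejected execution" \
    "failed to obtain node locks" \
    "shard failed"
  do
    # grep -c prints the number of matching lines. It exits non-zero
    # when there are none, hence the "|| true".
    n=$(grep -c -i -E -- "$pattern" "$1" || true)
    if [ "${n:-0}" -gt 0 ]; then
      echo "$n $pattern"
    fi
  done
  return 0
}
```

Patterns with no matches are omitted, so empty output means none of these messages appear in the file.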

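For example, when a node crosses the flood-stage disk watermark, Elasticsearch applies the index.blocks.read_only_allow_delete block to every index that has a shard on that node, which rejects writes to those indices. Since version 7.4, the block is released automatically once disk usage drops below the high watermark. On older versions, or to lift the block sooner, clear it manually after freeing disk or adding capacity, and only once you understand why the disk filled: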

curl -X PUT "localhost:9200/*/_settings?expand_wildcards=all" \
  -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}'

A Practical Debugging Flow

Use this order when you are not sure where to start:

  1. Check _cat/health?v to see whether the problem is cluster-wide.
  2. Use _cat/shards?v to find unassigned, initializing, or relocating shards.
  3. Run _cluster/allocation/explain for unassigned shards.
  4. Check _cat/nodes for heap, CPU, disk, and node roles.
  5. Review node logs for allocation, disk, JVM, and circuit breaker messages.
  6. Use node stats and thread pool stats if the issue is latency or rejected requests.
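The API steps in this flow can be collected into one small script. The following is only a sketch. ES_URL, DRY_RUN, es_get, and triage are names assumed for illustration, and the log review in step 5 is left out because it happens on the node itself. With DRY_RUN=1 the script prints the commands instead of running them, which is useful for reviewing the sequence before pointing it at a cluster.

```shell
# Base address of the cluster; override ES_URL for other hosts.
ES_URL="${ES_URL:-localhost:9200}"

# Run one GET request, or only print it when DRY_RUN=1.
es_get() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'curl -s "%s/%s"\n' "$ES_URL" "$1"
  else
    curl -s "$ES_URL/$1"
  fi
}

# Run the checks from broad to narrow.
triage() {
  es_get "_cat/health?v"
  es_get "_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node&s=state,index"
  # Without a body this explains the first unassigned shard and
  # returns an error when nothing is unassigned.
  es_get "_cluster/allocation/explain?pretty"
  es_get "_cat/nodes?v&h=name,heap.percent,cpu,disk.used_percent,node.role"
  es_get "_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected"
}
```

Define both functions in a shell session, then call triage. Set DRY_RUN=1 first to preview the requests.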

Key Takeaway

Debugging Elasticsearch works best when you move from broad health checks to the exact shard, node, or setting causing the issue. Start with _cat/health, _cat/shards, and allocation explain, then use logs and node stats to confirm the root cause before changing settings.