Essential Tools and Techniques for Debugging Elasticsearch Cluster Issues
Debug Elasticsearch cluster issues with cat APIs, allocation explain, logs, node stats, and focused shard checks.
Elasticsearch cluster issues usually show up as a red or yellow health status, slow searches, rejected writes, or nodes dropping from the cluster. The fastest way to debug them is to start with cluster health, then narrow the problem to shards, nodes, allocation rules, logs, or resource pressure.
This guide walks through the built-in tools you will use most often: _cat APIs, _cluster/allocation/explain, node stats, pending tasks, and Elasticsearch logs.
Understanding Elasticsearch Cluster Health
Cluster health gives you the first signal:
- green: All primary and replica shards are allocated.
- yellow: All primary shards are allocated, but one or more replica shards are not.
- red: One or more primary shards are unassigned, so some data is unavailable.
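A yellow cluster still serves reads and writes because every primary shard is allocated, but it has less redundancy: losing a node that holds a primary without an allocated replica can turn the cluster red. A red cluster needs immediate investigation because the data in the affected primary shards cannot be searched or written until those shards are assigned.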
Start With the _cat APIs
The _cat APIs are built for quick human-readable checks.
curl -X GET "localhost:9200/_cat/health?v"
curl -X GET "localhost:9200/_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,disk.used_percent,load_1m,node.role"
curl -X GET "localhost:9200/_cat/shards?v"
curl -X GET "localhost:9200/_cat/indices?v"
Use _cat/health to confirm the overall state. Use _cat/shards to find UNASSIGNED, INITIALIZING, or repeatedly relocating shards. Use _cat/nodes to spot heap, CPU, or disk pressure on a specific node.
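To make the _cat/nodes check repeatable, you can pipe its output through a small filter. The sketch below is one possible approach: the function name flag_pressured_nodes and the 75 and 85 percent defaults are illustrative assumptions, not Elasticsearch settings, and it assumes node names contain no spaces.

```shell
# Flag nodes whose heap or disk usage is at or above a threshold.
# Reads the plain-text output of:
#   _cat/nodes?h=name,heap.percent,disk.used_percent
# The default limits (75 and 85) are illustrative, not Elasticsearch defaults.
flag_pressured_nodes() {
  heap_limit="${1:-75}"
  disk_limit="${2:-85}"
  awk -v heap="$heap_limit" -v disk="$disk_limit" '
    # Skip the header row (present when ?v is used) and blank lines.
    $2 !~ /^[0-9.]+$/ { next }
    ($2 + 0) >= heap { printf "%s heap=%s%%\n", $1, $2 }
    ($3 + 0) >= disk { printf "%s disk=%s%%\n", $1, $3 }
  '
}

# Usage (assumes a node reachable at localhost:9200):
# curl -s "localhost:9200/_cat/nodes?h=name,heap.percent,disk.used_percent" \
#   | flag_pressured_nodes 75 85
```

A node that appears in this output while its peers do not is a good candidate for the log and shard checks later in this guide.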
For a red or yellow cluster, this command gives you a focused view:
curl -X GET "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node&s=state,index"
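On a cluster with many shards, it can help to summarize that output instead of reading it row by row. The following is a minimal sketch that counts unassigned shards per unassigned.reason value. The function name unassigned_by_reason is hypothetical, and the filter assumes the exact column order requested in the command above.

```shell
# Count unassigned shards grouped by unassigned.reason.
# Expects the column order: index shard prirep state unassigned.reason node
# For unassigned shards the node column is empty, so the reason is field 5.
unassigned_by_reason() {
  awk '
    $4 == "UNASSIGNED" {
      reason = ($5 == "" ? "UNKNOWN" : $5)
      count[reason]++
    }
    END { for (r in count) printf "%s %d\n", r, count[r] }
  ' | sort
}

# Usage (assumes a node reachable at localhost:9200):
# curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node" \
#   | unassigned_by_reason
```

A single dominant reason, such as NODE_LEFT, usually points to one underlying event rather than many unrelated shard problems.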
Explain Shard Allocation
When a shard is unassigned, _cluster/allocation/explain tells you why Elasticsearch cannot place it.
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' -d'
{
  "index": "my_index",
  "shard": 0,
  "primary": true
}'
You can also ask Elasticsearch to explain the first unassigned shard it finds:
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' -d'{}'
Read the can_allocate, allocate_explanation, and node_allocation_decisions fields. Common causes include disk watermarks, allocation filtering, missing nodes, incompatible index settings, or too few data nodes for the requested replica count.
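When the response lists many nodes, a quick tally of the deciders that blocked allocation shows which rule matters most. The sketch below is a text-based shortcut, not a full JSON parser: it pulls every "decider" value out of the response and counts them. The function name tally_deciders is hypothetical. By default the response lists only deciders that did not return YES, so the counts reflect blocking or throttling decisions.

```shell
# Tally the "decider" names found in an allocation explain response.
# This is a grep-based shortcut that matches the "decider": "<name>" pairs
# in the JSON text; use a real JSON tool if you need per-node detail.
tally_deciders() {
  grep -o '"decider" *: *"[^"]*"' |
    sed 's/.*: *"\([^"]*\)"/\1/' |
    sort | uniq -c | sort -rn |
    awk '{ print $2, $1 }'
}

# Usage (assumes a node reachable at localhost:9200 and at least one
# unassigned shard; otherwise the API returns an error):
# curl -s "localhost:9200/_cluster/allocation/explain" | tally_deciders
```

If one decider such as disk_threshold or filter accounts for most entries, start with the matching setting before looking at individual nodes.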
Check Node and Cluster Stats
When health is green but searches or writes are slow, check resource pressure.
curl -X GET "localhost:9200/_nodes/stats/jvm,fs,os,process,thread_pool?pretty"
curl -X GET "localhost:9200/_cluster/stats?pretty"
Look for high JVM heap usage, disk pressure, rejected search or write tasks, and nodes with much higher load than their peers. A single overloaded node can slow the whole cluster if it owns hot shards.
For thread pool rejections, use:
curl -X GET "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed"
Rejected tasks usually mean the node could not keep up with the request rate. Fix the cause before raising queues: reduce query cost, spread shards, scale nodes, or slow bulk indexing.
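To see only the pools that have rejected work, you can filter that output. This is a minimal sketch that assumes the column order from the command above; the function name rejecting_pools is hypothetical.

```shell
# Print thread pools whose rejected counter is above zero.
# Expects the column order: node_name name active queue rejected completed
rejecting_pools() {
  awk '
    # The numeric test also skips the header row when ?v is used.
    $5 ~ /^[0-9]+$/ && ($5 + 0) > 0 {
      printf "%s %s rejected=%s queue=%s\n", $1, $2, $5, $4
    }
  '
}

# Usage (assumes a node reachable at localhost:9200):
# curl -s "localhost:9200/_cat/thread_pool/search,write?h=node_name,name,active,queue,rejected,completed" \
#   | rejecting_pools
```

The rejected counter is cumulative since the node started, so run the check twice a few minutes apart and compare the values to tell ongoing rejections from an old spike.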
Review Pending Tasks and Recovery
If cluster state changes feel stuck, check pending tasks:
curl -X GET "localhost:9200/_cluster/pending_tasks?pretty"
A long queue can point to master node pressure, frequent mapping updates, shard churn, or unstable nodes.
For shard movement and recovery, use:
curl -X GET "localhost:9200/_cat/recovery?v&active_only=true"
This helps you separate a cluster that is actively recovering from one that is blocked by allocation rules or missing data.
Read the Elasticsearch Logs
Elasticsearch logs often explain what the APIs only hint at. Check logs on the affected node, not just a random node in the cluster.
Search for messages such as:
- master not discovered
- flood-stage disk watermark
- circuit_breaking_exception
- rejected execution
- failed to obtain node locks
- shard failed
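For example, when a node crosses the flood-stage disk watermark, Elasticsearch adds the index.blocks.read_only_allow_delete block to every index that has a shard on that node, which rejects writes until disk pressure is resolved. Elasticsearch 7.4 and later release the block automatically once disk usage falls back below the high watermark. On older versions, or if the block remains after you free disk or add capacity, clear it manually, but only after you understand why the disk filled: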
curl -X PUT "localhost:9200/*/_settings?expand_wildcards=all" \
  -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}'
A Practical Debugging Flow
Use this order when you are not sure where to start:
- Check _cat/health?v to see whether the problem is cluster-wide.
- Use _cat/shards?v to find unassigned, relocating, or hot shards.
- Run _cluster/allocation/explain for unassigned shards.
- Check _cat/nodes for heap, CPU, disk, and node roles.
- Review node logs for allocation, disk, JVM, and circuit breaker messages.
- Use node stats and thread pool stats if the issue is latency or rejected requests.
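The first branch of this flow can be sketched as a small helper that maps the health status to the endpoints worth querying next. This is an illustration under stated assumptions rather than a complete runbook: the function name next_checks is hypothetical, and the mapping simply mirrors this guide (shard and allocation checks for red or yellow, resource checks for green).

```shell
# Print the API paths to query next, based on the cluster health status.
next_checks() {
  case "$1" in
    red|yellow)
      # Unassigned shards are the likely cause: list them, then ask why.
      echo "_cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node&s=state,index"
      echo "_cluster/allocation/explain?pretty"
      ;;
    green)
      # Shards are allocated, so look for resource pressure instead.
      echo "_cat/nodes?v&h=name,ip,heap.percent,ram.percent,cpu,disk.used_percent,load_1m,node.role"
      echo "_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed"
      ;;
    *)
      echo "unknown status: $1" >&2
      return 1
      ;;
  esac
}

# Usage (assumes a node reachable at localhost:9200):
# status=$(curl -s "localhost:9200/_cat/health?h=status" | tr -d '[:space:]')
# next_checks "$status" | while read -r path; do
#   curl -s "localhost:9200/$path"; echo
# done
```

Log review stays a manual step here because the relevant files live on the affected node rather than behind an API.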
Key Takeaway
Debugging Elasticsearch works best when you move from broad health checks to the exact shard, node, or setting causing the issue. Start with _cat/health, _cat/shards, and allocation explain, then use logs and node stats to confirm the root cause before changing settings.