Troubleshooting: Checking and Interpreting Elasticsearch Cluster Health Status

Master the essential techniques for diagnosing Elasticsearch cluster health. This guide details how to use the `_cat/health` API to check status and interpret the crucial Green, Yellow, and Red indicators. Learn the root causes of unassigned shards, how to use advanced APIs like `_cat/shards` and `_cluster/allocation/explain` for deep diagnostics, and the actionable steps required to resolve critical cluster instability quickly and effectively.

Elasticsearch cluster health is one of those checks that looks simple until the pager goes off. The API gives you a color, but the color is only the starting point. A green cluster can still be slow. A yellow cluster can be perfectly usable for a short maintenance window. A red cluster can mean one small test index is unavailable, or it can mean customer-facing search is missing real data.

When I check Elasticsearch cluster health, I try not to jump straight from red to dangerous recovery commands. I want to answer three questions first: are primary shards assigned, are replicas assigned, and is the cluster currently trying to recover by itself? The commands below are the ones I use to move from a broad health color to a specific reason.

Start with the health API

For a quick terminal view, _cat/health is fine:

curl -s "http://localhost:9200/_cat/health?v"

A typical response looks like this:

epoch      timestamp cluster     status node.total node.data shards pri relo init unassign pending_tasks
1762219800 12:10:00  logs-prod   yellow          3         3    124  62    0    0        2             0

The fields I look at first are status, node.total, node.data, relo, init, unassign, and pending_tasks. A yellow status with init or relo greater than zero may simply be a cluster recovering after a restart. A yellow status with unassigned shards and no movement usually needs investigation.

For automation, use the JSON API instead of parsing _cat output:

curl -s "http://localhost:9200/_cluster/health?pretty"

That response includes fields such as active_primary_shards, active_shards, relocating_shards, initializing_shards, unassigned_shards, and delayed_unassigned_shards. Those names are easier to use in scripts and monitoring checks.
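For a monitoring check, the JSON response can be reduced to the three questions from the introduction. The sketch below is one possible shape rather than an official client. The names `fetch_health` and `summarize_health` are made up for illustration, and the base URL assumes the same unsecured localhost cluster as the curl examples.

```python
import json
from urllib.request import urlopen

# Assumed base URL, matching the curl examples in this article.
BASE_URL = "http://localhost:9200"


def fetch_health(base_url=BASE_URL):
    """GET /_cluster/health and return the parsed JSON body."""
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.load(resp)


def summarize_health(health):
    """Reduce a _cluster/health response to the fields worth alerting on."""
    moving = health["relocating_shards"] + health["initializing_shards"]
    return {
        "status": health["status"],
        "unassigned": health["unassigned_shards"],
        "delayed": health["delayed_unassigned_shards"],
        # Shards are initializing or relocating, so the cluster is doing work.
        "recovering": moving > 0,
    }
```

Against a live cluster, `summarize_health(fetch_health())` returns a small dictionary that a check script can compare against thresholds. A secured cluster would also need credentials and TLS handling, which this sketch leaves out.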

What green, yellow, and red really mean

Green means every primary shard and every configured replica shard is assigned. It does not mean queries are fast, disk is healthy, or mappings are well designed. It only means Elasticsearch has placed the shards it is supposed to place.

Yellow means all primary shards are assigned, but at least one replica shard is unassigned. Your data should still be searchable because primaries are available. The risk is redundancy. If the node holding a primary fails while its replica is still unassigned, that index can become red.

Red means at least one primary shard is unassigned. Searches against affected indices may fail or return partial results, and writes to those shards cannot proceed normally. Red deserves immediate attention, but the correct action depends on why the primary is unassigned.
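The three definitions can be written down as a few lines of code. This is a simplified model of the rules as stated here, not Elasticsearch's own implementation, which handles more cases such as shards that are still initializing. The function name and the row shape (dicts with `prirep` and `state`) are assumptions for illustration.

```python
def health_color(shards):
    """Derive a health color from shard rows, using the definitions above.

    Each row is a dict with "prirep" ("p" or "r") and "state".
    Simplified: only shards in state UNASSIGNED count as missing.
    """
    unassigned = [s for s in shards if s["state"] == "UNASSIGNED"]
    if any(s["prirep"] == "p" for s in unassigned):
        return "red"      # at least one primary is missing
    if unassigned:
        return "yellow"   # only replicas are missing
    return "green"        # every primary and replica is placed
```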

A common small-cluster example is a single-node development cluster with one replica configured. It will stay yellow because Elasticsearch will not put a replica on the same node as its primary. That is not a mystery and not a reason to force allocation. Either add another data node or set replicas to zero for that index:

curl -X PUT "http://localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index":{"number_of_replicas":0}}'

Do not use that setting casually in production. It removes redundancy for that index.

Find the exact unassigned shards

After the health color, list the shards:

curl -s "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason" | sort

Look for UNASSIGNED. The prirep column tells you whether the shard is a primary (p) or replica (r). That distinction matters more than the color itself. A few unassigned replicas usually mean reduced fault tolerance. One unassigned primary means at least part of an index is unavailable.
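When there are more than a handful of shards, it can help to group the unassigned ones by primary and replica instead of reading the table by eye. The following is a minimal sketch, assuming the same localhost cluster and using the cat API's `format=json` option. The function names are hypothetical.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:9200"  # assumed, as in the curl examples


def fetch_shards(base_url=BASE_URL):
    """GET _cat/shards as JSON with the same columns as the curl command."""
    query = urlencode(
        {"format": "json", "h": "index,shard,prirep,state,node,unassigned.reason"},
        safe=",",
    )
    with urlopen(f"{base_url}/_cat/shards?{query}", timeout=10) as resp:
        return json.load(resp)


def group_unassigned(rows):
    """Split UNASSIGNED rows into primaries ("p") and replicas ("r") by index."""
    grouped = {"p": defaultdict(list), "r": defaultdict(list)}
    for row in rows:
        if row["state"] == "UNASSIGNED":
            grouped[row["prirep"]][row["index"]].append(
                (int(row["shard"]), row.get("unassigned.reason"))
            )
    return {kind: dict(by_index) for kind, by_index in grouped.items()}
```

If the `"p"` entry of `group_unassigned(fetch_shards())` is non-empty, those indices are the ones to look at first.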

If you see many unassigned shards after a planned node restart, also check delayed allocation:

curl -s "http://localhost:9200/_cluster/health?pretty" | grep delayed_unassigned_shards

Elasticsearch waits before reallocating replicas after a node leaves (one minute by default, controlled by index.unassigned.node_left.delayed_timeout), because the node may come back quickly. That behavior avoids unnecessary network and disk churn during rolling restarts.

Ask Elasticsearch why allocation failed

The allocation explain API is the best next step. You can ask for any unassigned shard:

curl -X GET "http://localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' \
  -d '{}'

Or ask about a specific shard:

curl -X GET "http://localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "index": "logs-2026.05.24",
    "shard": 0,
    "primary": false
  }'

Read unassigned_info, can_allocate, and node_allocation_decisions. The useful part is usually plain English: disk watermark exceeded, allocation disabled, no matching node attribute, too many shards on a node, or a replica cannot be placed because only one node exists.

If can_allocate is allocation_delayed, wait only if the missing node is expected to return soon. If it is no, meaning no node satisfies the allocation rules, waiting will not fix it.
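The explain response is long, so during an incident I find it useful to boil it down to the decision and the deciders that said no. The sketch below assumes the usual response layout, with `can_allocate`, `unassigned_info`, and `node_allocation_decisions` entries that each carry a `deciders` list. Both function names are made up for illustration.

```python
def summarize_explain(explain):
    """Pull the decision and the blocking deciders out of an explain response."""
    blockers = {}
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                # Group node names under the decider that rejected them.
                blockers.setdefault(decider["decider"], []).append(node["node_name"])
    return {
        "can_allocate": explain.get("can_allocate"),
        "reason": explain.get("unassigned_info", {}).get("reason"),
        "blockers": blockers,
    }


def waiting_might_help(summary):
    """Only a delayed allocation is expected to resolve by waiting."""
    return summary["can_allocate"] == "allocation_delayed"
```

A result such as `{"disk_threshold": ["es-1", "es-2"]}` under `blockers` points at capacity, while `same_shard` on every node usually means there are not enough nodes for the replica count.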

Yellow cluster playbook

For yellow health, I use this order:

  1. Check whether the cluster has enough data nodes for the configured replica count.
  2. Check disk watermarks with _cat/allocation.
  3. Check whether allocation was disabled during maintenance.
  4. Check index-level routing filters and awareness rules.
  5. Decide whether to add capacity, lower replica count, or fix a bad rule.

The node count check is simple. If an index has number_of_replicas: 2, Elasticsearch needs three suitable data nodes to place one primary plus two replicas. “Suitable” matters. If allocation awareness requires separate zones, you need nodes in those zones, not just any three nodes.
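The arithmetic is small enough to script. This sketch only counts nodes and does not model awareness or routing filters, so treat a clean result as necessary rather than sufficient. The function name and input shape are assumptions.

```python
def under_provisioned(replica_counts, data_nodes):
    """Return indices whose copies cannot all sit on distinct data nodes.

    replica_counts maps index name -> number_of_replicas.
    One primary plus N replicas needs N + 1 suitable data nodes.
    The returned dict maps each affected index to the node count it needs.
    """
    return {
        index: replicas + 1
        for index, replicas in replica_counts.items()
        if replicas + 1 > data_nodes
    }
```

For example, `under_provisioned({"logs": 2, "orders": 1}, data_nodes=2)` reports that `logs` needs three nodes, while `orders` fits.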

Check allocation and disk:

curl -s "http://localhost:9200/_cat/allocation?v"

Once a node crosses the low disk watermark (85% used by default), Elasticsearch stops allocating new shards to it, and above the high watermark (90% by default) it starts moving shards away. Free space, add nodes, expand disks, or delete old indices after taking snapshots. Raising watermarks can buy time in a controlled emergency, but it does not create capacity.
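To turn the allocation table into a quick list of nodes under pressure, something like the following can work on the `_cat/allocation?format=json` output. The thresholds are the default percentages (85, 90, 95) and are an assumption here, since a cluster may override them. `disk.percent` is a rounded figure, so nodes close to a threshold deserve a second look.

```python
# Default watermark percentages; assumed, since clusters can override them.
LOW, HIGH, FLOOD = 85, 90, 95


def disk_pressure(rows, low=LOW, high=HIGH, flood=FLOOD):
    """Label each node in _cat/allocation JSON rows by the watermark it has reached."""
    labels = {}
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            continue  # the UNASSIGNED summary row has no disk figures
        used = int(percent)
        if used >= flood:
            labels[row["node"]] = "flood_stage"
        elif used >= high:
            labels[row["node"]] = "high"
        elif used >= low:
            labels[row["node"]] = "low"
    return labels
```

Nodes below the low threshold are left out of the result, so an empty dictionary means disk is probably not the reason for the unassigned shards.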

Check allocation settings:

curl -s "http://localhost:9200/_cluster/settings?include_defaults=true&pretty"

If cluster.routing.allocation.enable is none, no shards are allocated at all; if it is primaries or new_primaries, replicas stay unassigned. Either is common after maintenance scripts that forgot to restore the setting. Re-enable allocation with:

curl -X PUT "http://localhost:9200/_cluster/settings?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "cluster.routing.allocation.enable": "all"
    }
  }'

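Also check whether the value was set as transient. A transient value takes precedence over a persistent one, so a leftover transient none keeps allocation disabled even after the persistent setting is all. Clear it by setting the transient key to null.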
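Because the same key can appear in more than one section of the settings response, a small resolver makes the effective value explicit. This sketch assumes the response was requested with `flat_settings=true` in addition to `include_defaults=true`, so keys are dotted strings. The function name is hypothetical.

```python
def effective_setting(settings, key):
    """Resolve a cluster setting from a flat_settings=true response.

    Transient values override persistent ones, which override defaults.
    Returns (value, source), or (None, None) when the key is absent.
    """
    for source in ("transient", "persistent", "defaults"):
        value = settings.get(source, {}).get(key)
        if value is not None:
            return value, source
    return None, None
```

If this returns `("none", "transient")` for `cluster.routing.allocation.enable`, updating only the persistent value will not re-enable allocation.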

Red cluster playbook

For red health, slow down and identify the blast radius. Do not start with allocate_empty_primary. That command accepts data loss by design.

First, find the affected primary shards:

curl -s "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason" \
  | grep ' p ' \
  | grep UNASSIGNED

Then inspect one with allocation explain:

curl -X GET "http://localhost:9200/_cluster/allocation/explain?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "index": "affected-index",
    "shard": 0,
    "primary": true
  }'

If the primary is unassigned because a node is down, your best recovery may be to restore that node. Check the service, disk, JVM logs, and network path. If a replica copy exists on another node, Elasticsearch should normally promote it. If it does not, the explain output and logs usually tell you why.

If the data is lost or corrupted, restore from a snapshot. That is the clean recovery path. If no snapshot exists and the data can be rebuilt from another source, you may decide to allocate an empty primary:

curl -X POST "http://localhost:9200/_cluster/reroute?pretty" \
  -H 'Content-Type: application/json' \
  -d '{
    "commands": [
      {
        "allocate_empty_primary": {
          "index": "affected-index",
          "shard": 0,
          "node": "target-node-name",
          "accept_data_loss": true
        }
      }
    ]
  }'

Only use that when losing the shard contents is acceptable. The name is literal: Elasticsearch allocates an empty primary and moves on.
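One way to make the destructive call harder to run by accident is to build it as a dry run unless someone explicitly confirms. The reroute API accepts a `dry_run` query parameter that simulates the commands without applying them. The helper below is a hypothetical sketch that only builds the path and body; sending the request, and who is allowed to pass `confirmed=True`, is left to your own tooling.

```python
import json


def empty_primary_request(index, shard, node, confirmed=False):
    """Build (path, body) for an allocate_empty_primary reroute.

    Defaults to dry_run=true so the cluster only reports what would change.
    Pass confirmed=True to build the real, destructive request.
    """
    path = "/_cluster/reroute" if confirmed else "/_cluster/reroute?dry_run=true"
    body = {
        "commands": [{
            "allocate_empty_primary": {
                "index": index,
                "shard": shard,
                "node": node,
                "accept_data_loss": True,
            }
        }]
    }
    return path, json.dumps(body)
```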

Watch recovery instead of guessing

After a fix, watch shard movement:

curl -s "http://localhost:9200/_cat/recovery?v&active_only=true"
curl -s "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason"
curl -s "http://localhost:9200/_cluster/health?pretty"

Recovery can be limited by disk speed, network bandwidth, shard size, and cluster recovery settings. A large shard may sit in INITIALIZING for longer than you expect. That is different from being stuck. If byte counts and file counts are moving in _cat/recovery, let it work.
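Telling slow from stuck is easier with two samples than with one. The sketch below compares two `_cat/recovery` snapshots taken some time apart, assuming they were requested with `format=json`, `active_only=true`, and `bytes=b` so that byte counts are plain integers. The function name is made up, and a recovery that has finished copying files but is still replaying translog operations would also show up here, so check it by hand before calling it stuck.

```python
def stalled_recoveries(previous, current):
    """Return (index, shard, target_node) keys that made no progress.

    Both arguments are lists of _cat/recovery JSON rows. A recovery counts
    as stalled when its recovered byte and file counts are unchanged.
    """
    def progress(rows):
        return {
            (r["index"], r["shard"], r["target_node"]):
                (int(r["bytes_recovered"]), int(r["files_recovered"]))
            for r in rows
        }

    before, after = progress(previous), progress(current)
    return sorted(
        key for key in after
        if key in before and after[key] == before[key]
    )
```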

Also check pending cluster tasks when health is not changing:

curl -s "http://localhost:9200/_cat/pending_tasks?v"

A long queue can point to an overloaded master node or repeated allocation decisions that cannot complete.

A practical example

Say _cat/health shows yellow with two unassigned shards. _cat/shards shows both are replicas for logs-2026.05.24. Allocation explain says the cluster cannot allocate because every data node is above the low disk watermark. The fix is not to reroute shards manually. The fix is capacity: delete old indices after snapshotting them, add storage, add data nodes, or move cold data elsewhere.

Another example: a three-node cluster is yellow after a rolling restart. _cluster/health shows delayed_unassigned_shards: 8. The stopped node is already coming back. In that case, waiting a few minutes may be correct. Forcing allocation immediately can create extra recovery work and make the restart slower.

A third example: a single-node lab cluster is yellow forever. _cat/shards shows every unassigned shard is a replica. The index has one replica. Elasticsearch is behaving correctly. Set replicas to zero for the lab or add a second data node.

Keep the health check honest

Cluster health should be part of monitoring, but alert rules need context. Alert immediately on red. Alert on yellow when it lasts beyond a short maintenance window, when unassigned replicas are increasing, or when the reason is disk pressure. Track disk watermarks, node count, JVM pressure, and snapshot success alongside health color. The color tells you where to start; the shard and allocation APIs tell you what to do next.
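Those alert rules fit in a single predicate. The sketch below is an illustration of the policy described here, not a recommendation for exact thresholds. The 15-minute maintenance window and all parameter names are assumptions to adapt.

```python
def should_alert(status, yellow_seconds=0, unassigned_rising=False,
                 disk_pressure=False, maintenance_window=900):
    """Decide whether a health sample should page someone.

    Red always alerts. Yellow alerts when it outlasts the maintenance
    window (assumed 900 seconds), when unassigned replicas keep
    increasing, or when disk pressure is the cause.
    """
    if status == "red":
        return True
    if status == "yellow":
        return (yellow_seconds > maintenance_window
                or unassigned_rising
                or disk_pressure)
    return False
```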

When health checks disagree with user symptoms

Sometimes the cluster is green and users are still complaining. That is not a contradiction. Cluster health is about shard assignment, not query latency or correctness. If health is green but searches are slow, move to search latency, thread pools, hot shards, JVM pressure, and storage latency. A green cluster with one overloaded data node can still feel broken.

The reverse also happens. A cluster can be yellow for a harmless reason, such as a single-node development environment with replicas configured. The useful habit is to connect the health state to business impact. Which index is affected? Is it a primary or replica? Is the application reading from that index right now? Is this during planned maintenance? Those questions keep you from treating every yellow status like a disaster.

For customer-facing systems, I like to keep a small runbook table outside Elasticsearch: index pattern, owning service, data source, snapshot policy, whether data can be replayed, and who approves destructive recovery. During a red incident, that table is often more useful than another dashboard. If clickstream-* can be replayed from Kafka, the recovery choice is different from an index that holds user-generated documents with no upstream copy.

Safer command habits

Use explicit index names when you can. Wildcards are convenient, but they hide blast radius. Before running any command that changes settings or deletes data, list what the pattern matches:

curl -s "http://localhost:9200/_cat/indices/logs-prod-*?v&s=index"

Keep command output from the incident. Paste allocation explain results, shard listings, and health responses into the ticket. Elasticsearch state changes quickly during recovery, and you may need the earlier output to understand why a decision was made.

If security is enabled, run these commands with a user that has only the privileges needed for diagnostics, and keep destructive operations behind a separate, more tightly controlled account or process. In a stressful incident, it is too easy to paste a write command into the same shell where you were only inspecting health.

What to check after the cluster returns to green

Green is not the end of the incident. Check whether replicas rebuilt on the nodes you expected, whether disk is still close to watermarks, and whether any index was left with temporary settings such as number_of_replicas: 0, a long refresh_interval, or disabled allocation.
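A post-incident sweep for leftover settings can be scripted against the index settings API. This is a minimal sketch that assumes the input is the parsed response of `GET /_all/_settings?flat_settings=true`, where values are strings and `index.refresh_interval` only appears when it was set explicitly. The function name is hypothetical, and some indices legitimately run with zero replicas, so review the findings rather than reverting them blindly.

```python
def leftover_settings(index_settings):
    """Flag indices that still carry settings often changed during incidents."""
    findings = {}
    for index, body in index_settings.items():
        settings = body.get("settings", {})
        notes = []
        if settings.get("index.number_of_replicas") == "0":
            notes.append("number_of_replicas is 0")
        interval = settings.get("index.refresh_interval")
        if interval is not None:
            # Present only when someone set it explicitly on the index.
            notes.append(f"refresh_interval explicitly set to {interval}")
        if notes:
            findings[index] = notes
    return findings
```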

Also confirm snapshots are succeeding after recovery. A cluster that just had shard trouble may have exposed a gap in retention, repository credentials, or snapshot scheduling. If the recovery depended on luck because no snapshot existed, write that down and fix it before the next failure.

Finally, review alerts. If humans noticed the issue before monitoring did, add or tune alerts for red health, long-lasting yellow health, disk watermark pressure, missing nodes, failed snapshots, and repeated master elections. A cluster health color is useful, but the best alert tells you why the color changed and which index is affected.