Troubleshooting Elasticsearch Cluster Health: A Step-by-Step Guide

Facing a yellow or red Elasticsearch cluster? Diagnose unassigned shards, disk pressure, node loss, and safe recovery options.

A yellow or red Elasticsearch cluster is not a mystery state. It usually means Elasticsearch cannot place one or more shards where it wants to place them. The work is to find which shard is stuck, why allocation is blocked, and whether the right fix is to wait, free resources, bring back a node, restore from snapshot, or deliberately accept data loss.

I treat cluster health as a triage signal, not as the diagnosis itself. Green means every primary and replica shard is assigned. Yellow means all primary shards are assigned, so searches and writes can usually continue, but at least one replica is missing. Red means at least one primary shard is unassigned, so part of at least one index is unavailable. Red is the one that can break application reads or writes immediately.

Start by getting the simple view:

GET /_cluster/health?pretty
GET /_cat/health?v

Look at status, number_of_nodes, active_primary_shards, unassigned_shards, initializing_shards, and relocating_shards. If you see initializing or relocating shards after a node restart, the cluster may already be recovering. Do not start changing allocation settings before you know whether Elasticsearch is simply doing work.
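The "is it already recovering?" check can be written down so nobody skips it mid-incident. This is a minimal sketch that assumes you have already fetched the _cluster/health response as a dict; the function names and the sample values are hypothetical, not output from a real cluster.

```python
def recovery_in_progress(health: dict) -> bool:
    """True when shards are still initializing or relocating."""
    return (
        health.get("initializing_shards", 0) > 0
        or health.get("relocating_shards", 0) > 0
    )


def triage(health: dict) -> str:
    """Turn a _cluster/health response into a one-line triage hint."""
    status = health["status"]
    if status == "green":
        return "green: all shards assigned"
    if recovery_in_progress(health):
        return f"{status}: recovery in progress, watch before changing settings"
    unassigned = health.get("unassigned_shards", 0)
    return f"{status}: {unassigned} unassigned shards, investigate allocation"


# Hand-written sample, shaped like a _cluster/health response.
sample = {
    "status": "yellow",
    "number_of_nodes": 3,
    "active_primary_shards": 12,
    "unassigned_shards": 4,
    "initializing_shards": 2,
    "relocating_shards": 0,
}
print(triage(sample))
```

The point of the ordering is that a recovering cluster is reported as recovering first, even when it still has unassigned shards.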

Then list the unassigned shards:

GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state,index,shard

The prirep column matters. A p shard is primary. A red cluster always has at least one unassigned primary. An r shard is a replica. A yellow cluster usually has unassigned replicas only.
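If you request the same listing with format=json, separating primaries from replicas becomes a few lines of code. The sketch below assumes rows shaped like the _cat/shards JSON output for the columns used above; the helper name and the sample rows are made up for illustration.

```python
def split_unassigned(rows):
    """Split unassigned shards into (primaries, replicas).

    Each entry is (index, shard number, unassigned reason).
    """
    primaries, replicas = [], []
    for row in rows:
        if row["state"] != "UNASSIGNED":
            continue
        target = primaries if row["prirep"] == "p" else replicas
        target.append((row["index"], int(row["shard"]), row.get("unassigned.reason")))
    return primaries, replicas


# Hand-written rows; _cat APIs return every value as a string or null.
rows = [
    {"index": "orders-2026.05.24", "shard": "2", "prirep": "p",
     "state": "UNASSIGNED", "node": None, "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-2026.05.24", "shard": "0", "prirep": "r",
     "state": "UNASSIGNED", "node": None, "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-2026.05.24", "shard": "0", "prirep": "p",
     "state": "STARTED", "node": "es-data-01", "unassigned.reason": None},
]
primaries, replicas = split_unassigned(rows)
print("unassigned primaries:", primaries)
print("unassigned replicas:", replicas)
```

A non-empty primaries list is what makes the cluster red; a non-empty replicas list on its own is the yellow case.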

The most useful API in this situation is allocation explain. Called with no request body, it explains an arbitrary unassigned shard:

GET /_cluster/allocation/explain?pretty

For a specific shard, be explicit:

GET /_cluster/allocation/explain?pretty
{
  "index": "logs-2026.05.24",
  "shard": 0,
  "primary": false
}

Read the can_allocate answer and the node-level decisions. Elasticsearch will usually tell you exactly what rule blocked allocation: disk watermarks, allocation filtering, same-shard rules, delayed allocation after a node left, missing primary data, incompatible versions, or a node role mismatch.
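When the explain output covers many nodes, I find it easier to group the blocking deciders than to read every node entry. This is a rough sketch that assumes the response has the usual node_allocation_decisions list with per-node deciders; the sample below is hand-written and its explanation strings are placeholders, not real Elasticsearch messages.

```python
def blocking_deciders(explain: dict) -> dict:
    """Map each decider that answered NO to the nodes it blocked."""
    blockers = {}
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider["decision"] == "NO":
                blockers.setdefault(decider["decider"], []).append(node["node_name"])
    return blockers


# Hand-written sample in the shape of an allocation explain response.
sample = {
    "index": "logs-2026.05.24",
    "shard": 0,
    "primary": False,
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "es-data-01", "node_decision": "no",
         "deciders": [{"decider": "same_shard", "decision": "NO",
                       "explanation": "placeholder text"}]},
        {"node_name": "es-data-02", "node_decision": "no",
         "deciders": [{"decider": "disk_threshold", "decision": "NO",
                       "explanation": "placeholder text"}]},
    ],
}
for name, nodes in blocking_deciders(sample).items():
    print(f"{name} blocks allocation on: {', '.join(nodes)}")
```

In this sample one node is ruled out because it already holds the primary and the other because of disk, which points at disk pressure rather than a missing node.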

When the Cluster Is Yellow

Yellow is common on small clusters. The classic case is a one-node development cluster with number_of_replicas: 1. Elasticsearch cannot put a replica on the same node as its primary, so the replica remains unassigned forever. That is not an emergency in a laptop environment. It is a configuration mismatch.

Check the replica count:

GET /my-index/_settings?filter_path=*.settings.index.number_of_replicas

For a single-node non-production cluster, set replicas to zero:

PUT /my-index/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}

For production, do not hide the problem by reducing replicas unless you are intentionally accepting less redundancy. If the index is supposed to have one replica, you need at least two eligible data nodes. If it has two replicas, you need at least three eligible data nodes. Tiering can make this less obvious: a warm-index replica cannot allocate to a hot-only node if allocation rules require warm nodes.
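The node arithmetic above is simple enough to sanity-check in code. This sketch only counts nodes; it assumes you have already worked out how many data nodes are actually eligible after tiering and allocation filters, and both function names are hypothetical.

```python
def min_data_nodes(number_of_replicas: int) -> int:
    """Each shard copy needs its own node: one primary plus its replicas."""
    return number_of_replicas + 1


def assignable_replicas(number_of_replicas: int, eligible_data_nodes: int) -> int:
    """How many replicas per shard can be placed on the eligible nodes."""
    return max(0, min(number_of_replicas, eligible_data_nodes - 1))


print(min_data_nodes(1))          # one replica needs two eligible nodes
print(assignable_replicas(1, 1))  # single node: no replica can be placed
print(assignable_replicas(2, 2))  # two nodes: only one of two replicas fits
```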

Disk pressure is the next common yellow cause. Check node disk usage:

GET /_cat/allocation?v
GET /_cat/nodes?v&h=name,node.role,disk.used_percent,disk.avail,heap.percent,cpu,load_1m

Elasticsearch uses disk watermarks to avoid filling nodes. Defaults vary by version and configuration, so inspect your actual cluster settings:

GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk.watermark

If a node is over the low watermark, Elasticsearch stops allocating new shards to it (primaries of newly created indices are the exception). Over the high watermark, it also tries to relocate shards away from that node. If a node reaches the flood-stage watermark, Elasticsearch puts a write block (index.blocks.read_only_allow_delete) on every index that has a shard on that node. The durable fix is to delete old data, move data to more nodes, increase disk, shrink oversized shard counts, or adjust ILM retention. Temporarily raising watermarks can buy time, but it should not be your first move.
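To see how close each node is to a threshold, you can compare disk.used_percent with the three watermarks. The sketch below assumes percentage-style watermarks and uses 85, 90, and 95 as illustrative defaults; read the real values from your cluster settings, and note that watermarks can also be configured as absolute sizes, which this sketch does not handle.

```python
def parse_percent(text: str) -> float:
    """Convert a setting such as '85%' to 85.0."""
    return float(text.strip().rstrip("%"))


def watermark_state(used_percent: float, low: float = 85.0,
                    high: float = 90.0, flood: float = 95.0) -> str:
    """Classify a node's disk usage against the three watermarks.

    The default thresholds are assumptions for illustration only.
    """
    if used_percent >= flood:
        return "flood_stage"
    if used_percent >= high:
        return "high"
    if used_percent >= low:
        return "low"
    return "ok"


# Hypothetical nodes and usage values.
usage = {"es-data-01": 72.0, "es-data-02": 91.5, "es-data-03": 96.0}
for node, used in usage.items():
    print(node, watermark_state(used, low=parse_percent("85%")))
```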

A practical cleanup sequence looks like this:

GET /_cat/indices?v&s=store.size:desc
GET /_cat/shards?v&s=store:desc

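Find large old indices, verify retention expectations with the owning team, snapshot if needed, then delete only data you are allowed to remove. The wildcard form below only works where action.destructive_requires_name is false; Elasticsearch 8.x defaults it to true, so on those versions you must name each index explicitly: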

DELETE /old-logs-2025.12.*
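One way to avoid deleting by pattern is to turn the retention rule into an explicit list of index names first. This is a sketch that assumes daily indices named with a prefix and a yyyy.MM.dd suffix, as in the examples here; the prefix, the retention period, and the index names are all hypothetical.

```python
from datetime import date, datetime, timedelta


def expired_indices(names, prefix: str, today: date, retention_days: int):
    """Return index names whose date suffix is older than the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


names = [
    "old-logs-2025.12.01",
    "old-logs-2026.05.20",
    "old-logs-latest",
    "orders-2025.12.01",
]
print(expired_indices(names, "old-logs-", date(2026, 5, 24), retention_days=90))
```

Anything that does not match the prefix or does not parse as a date is skipped, so the list errs on the side of deleting less.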

After freeing space, allocation may resume automatically. If it does not, rerun allocation explain. The old reason may still be cached in your head, but the cluster may now be blocked by a different rule.

Allocation filtering is another frequent yellow cause, especially after hardware migrations. Someone may have set an index to require a node attribute that no longer exists:

GET /my-index/_settings?filter_path=*.settings.index.routing.allocation
GET /_cluster/settings?filter_path=*.cluster.routing.allocation

If the rule is wrong, remove or update it:

PUT /my-index/_settings
{
  "index.routing.allocation.require.box_type": null,
  "index.routing.allocation.include._name": null,
  "index.routing.allocation.exclude._name": null
}

Use the exact keys your settings show. Do not paste a broad reset into production without reading it; allocation rules are sometimes there for a good reason, such as keeping certain data on a compliance-controlled tier.

When the Cluster Is Red

Red deserves slower hands and better notes. The first question is whether the missing primary shard has a recoverable copy somewhere.

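List unassigned primary shards. The grep filter needs a shell rather than the Kibana console, so this one uses curl; the host is a placeholder, and you should add whatever authentication your cluster requires:

curl -s "http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep -E ' p +UNASSIGNED'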

Then check which nodes are present:

GET /_cat/nodes?v&h=name,ip,node.role,master,uptime,heap.percent,disk.avail

If a node is missing, your best recovery path is often to bring that node back. Check the service, disk mount, host networking, certificates, and logs on the missing node. A node that lost access to its data path may start as a different empty node, which does not help recover the primary shard.

On the Elasticsearch node, logs usually show the real failure sooner than the APIs do. Look for messages about shard lock failures, corrupt index files, master discovery, TLS handshake errors, disk read-only filesystems, or node role changes. A common real-world failure is a node restart after a disk was remounted under a different path. Elasticsearch comes up, but the data path is empty, so the cluster still lacks the shard copy it needs.

Run allocation explain for the primary:

GET /_cluster/allocation/explain?pretty
{
  "index": "orders-2026.05.24",
  "shard": 2,
  "primary": true
}

If the explanation says no valid shard copy can be found, stop and check snapshots before doing anything destructive:

GET /_snapshot/_all
GET /_snapshot/my-repository/_all?verbose=false

Restoring a snapshot is usually safer than allocating an empty primary. An empty primary creates a new blank shard for that shard ID. It is not a repair operation. It tells Elasticsearch, "I accept that the old data for this shard is gone."

The last-resort command looks like this:

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "orders-2026.05.24",
        "shard": 2,
        "node": "es-data-03",
        "accept_data_loss": true
      }
    }
  ]
}

Use it only after you have confirmed there is no usable node copy and no snapshot you can restore. In an incident, write down who approved that choice and which index and shard were affected. Future debugging is much easier when the data-loss decision is explicit.
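If you script this step at all, make the approval part of the code path. The sketch below only builds the request body shown above and refuses to do so without a named approver; the function name and the approved_by parameter are my own convention, not part of the Elasticsearch API, and nothing here sends the request.

```python
import json


def empty_primary_reroute(index: str, shard: int, node: str, approved_by: str) -> dict:
    """Build the _cluster/reroute body for allocate_empty_primary.

    approved_by is a local safeguard (hypothetical), not an API field.
    """
    if not approved_by.strip():
        raise ValueError("record who approved the data loss before building this command")
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


body = empty_primary_reroute("orders-2026.05.24", 2, "es-data-03",
                             approved_by="incident commander")
print(json.dumps(body, indent=2))
```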

Cases That Look Like Allocation Problems But Are Really Cluster Problems

Sometimes shards are unassigned because the cluster cannot keep stable membership. If master-eligible nodes cannot talk to each other, the elected master may change repeatedly, and allocation will churn or pause. Check master stability:

GET /_cat/master?v
GET /_cat/nodes?v&h=name,node.role,master,ip

If the master changes often, inspect network reliability, DNS, node certificates, and discovery settings. For modern Elasticsearch clusters, cluster.initial_master_nodes is for initial cluster bootstrapping, not a setting to leave as a general discovery crutch forever. discovery.seed_hosts should point to appropriate seed hosts, and all nodes must use the same cluster name and compatible security settings.
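"Changes often" is easier to argue about with a number. A minimal sketch, assuming you poll _cat/master at a fixed interval and record the elected node name each time; the polling itself is left out, and the node names are hypothetical.

```python
def count_master_changes(observations) -> int:
    """Count how many times the elected master differs from the previous poll."""
    changes = 0
    previous = None
    for current in observations:
        if previous is not None and current != previous:
            changes += 1
        previous = current
    return changes


# Node names recorded from successive polls (made-up values).
polls = ["es-master-1", "es-master-1", "es-master-2", "es-master-1", "es-master-1"]
print(count_master_changes(polls))
```

A stable cluster should report zero over any reasonable window; more than an occasional change during planned restarts is worth correlating with network and GC logs.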

High JVM pressure can also cause allocation symptoms. A data node stuck in long garbage collection pauses may leave and rejoin the cluster from the master's point of view. That can create unassigned shards even though the machine never fully crashed.

Check heap and garbage collection logs:

GET /_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,node.role
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem,nodes.*.jvm.gc

If heap is consistently high, do not just increase heap blindly. Elasticsearch generally performs best when heap leaves enough memory for the filesystem cache. Look for oversized aggregations, heavy fielddata use, too many shards, aggressive indexing, or queries that need better mappings.
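To pick out the nodes worth a closer look, you can scan the JVM stats response for high heap usage. This sketch assumes the response shape returned by the filtered _nodes/stats/jvm call above (node IDs as keys, heap_used_percent under jvm.mem); the 85 percent threshold and the node IDs are arbitrary examples, not recommendations.

```python
def nodes_over_heap(stats: dict, threshold_percent: int = 85) -> dict:
    """Return {node_id: heap_used_percent} for nodes at or above the threshold."""
    hot = {}
    for node_id, node in stats.get("nodes", {}).items():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_percent:
            hot[node_id] = used
    return hot


# Hand-written sample with hypothetical node IDs.
sample = {
    "nodes": {
        "node-a": {"jvm": {"mem": {"heap_used_percent": 62}}},
        "node-b": {"jvm": {"mem": {"heap_used_percent": 91}}},
    }
}
print(nodes_over_heap(sample))
```

A single reading proves little; the claim in the text is about heap that stays high, so run this over several samples before drawing conclusions.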

Shard count can be the quiet cause behind many health problems. A cluster with many tiny shards spends too much effort tracking metadata and moving shards around. Use:

GET /_cat/indices?v&h=index,pri,rep,docs.count,store.size,pri.store.size&s=pri:desc
GET /_cluster/stats?filter_path=indices.shards,indices.count,nodes.count

If every daily log index has many primary shards but little data, fix the index template for future indices. Then consider shrink, rollover, or reindex plans for existing data.
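Spotting "many primaries, little data" is a matter of dividing primary store size by primary count. The sketch assumes rows from _cat/indices requested with format=json and bytes=b so sizes arrive as plain byte counts; the 1 GiB floor and the minimum of two primaries are illustrative thresholds, and the index names are invented.

```python
GIB = 1024 ** 3


def undersized_indices(rows, min_primaries: int = 2, min_avg_bytes: int = GIB):
    """Flag indices whose average primary shard is smaller than min_avg_bytes.

    Returns (index, primary count, average primary shard size in bytes).
    """
    flagged = []
    for row in rows:
        size = row.get("pri.store.size")
        if size is None:
            continue  # no size reported, for example a closed index
        primaries = int(row["pri"])
        average = int(size) / primaries
        if primaries >= min_primaries and average < min_avg_bytes:
            flagged.append((row["index"], primaries, int(average)))
    return flagged


# Hand-written rows; _cat APIs return numbers as strings.
rows = [
    {"index": "logs-2026.05.24", "pri": "5", "pri.store.size": "52428800"},
    {"index": "orders-2026.05.24", "pri": "3", "pri.store.size": "96636764160"},
    {"index": "closed-index", "pri": "1", "pri.store.size": None},
]
print(undersized_indices(rows))
```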

A Practical Triage Order

When someone says "Elasticsearch is red," I use this order:

  1. Confirm health with _cluster/health.
  2. List unassigned shards with _cat/shards.
  3. Separate primary failures from replica failures.
  4. Run _cluster/allocation/explain on one representative shard.
  5. Check whether all expected nodes are present.
  6. Check disk watermarks and allocation rules.
  7. For red primaries, try to recover the missing node or restore from snapshot before considering empty primary allocation.
  8. After the cluster turns green, find the cause that made it unhealthy in the first place.

That last step matters. A cluster can go green after you add disk, restart a node, or reduce replicas, but the same incident will return if ILM retention is wrong, shard counts are too high, nodes are undersized, or a deployment process keeps changing node attributes.

Cluster health troubleshooting is less about memorizing one magic command and more about refusing to guess. Elasticsearch exposes the allocation decision. Read it, verify it against node and index settings, and choose the smallest fix that matches the actual blocker.

After the Cluster Is Green Again

Do not close the incident just because the color changed. Green only means shards are assigned now. It does not prove the cluster is healthy enough for the next traffic spike, disk growth cycle, or node restart. I like to capture a short after-action note while the details are still fresh: which indices were affected, which nodes were involved, which allocation rule blocked recovery, and what command or infrastructure change fixed it.

Check whether the fix created a new risk. If you reduced replicas to turn yellow into green, record that the index now has less redundancy. If you raised disk watermarks, add a reminder to lower them after capacity is added. If you restored a snapshot, verify the restored index has the expected aliases and write settings before applications resume normal writes.

A few quick checks help catch unfinished work:

GET /_cat/recovery?v&active_only=true
GET /_cat/pending_tasks?v
GET /_cat/aliases?v
GET /_cluster/health?wait_for_status=green&timeout=30s

pending_tasks should not grow forever. Recovery should eventually empty out. Aliases matter because restoring an index under a different name can leave the application writing to the old broken target or reading from only part of the intended data.

Also check write blocks after disk incidents:

GET /*/_settings?filter_path=*.settings.index.blocks*

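If Elasticsearch set a flood-stage block, remove it only after disk pressure is fixed. Elasticsearch 7.4 and later release the block automatically once the node's disk usage falls below the high watermark, so check whether it is still present before clearing it by hand: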

PUT /my-index/_settings
{
  "index.blocks.read_only_allow_delete": null
}
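Across many indices, a small filter over the settings response shows which ones still carry the block. This sketch assumes the nested response shape of the index settings call above, where setting values come back as strings; the function name and sample index names are hypothetical.

```python
def indices_with_flood_block(settings: dict):
    """Return index names where index.blocks.read_only_allow_delete is true."""
    blocked = []
    for index, body in settings.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        value = str(blocks.get("read_only_allow_delete", "false")).lower()
        if value == "true":
            blocked.append(index)
    return sorted(blocked)


# Hand-written sample response.
sample = {
    "logs-2026.05.24": {"settings": {"index": {"blocks": {"read_only_allow_delete": "true"}}}},
    "orders-2026.05.24": {"settings": {"index": {"blocks": {"write": "true"}}}},
    "metrics-2026.05.24": {"settings": {"index": {"blocks": {"read_only_allow_delete": "false"}}}},
}
print(indices_with_flood_block(sample))
```

Other blocks, such as index.blocks.write in the sample, are deliberately ignored here because they are usually set by people or by ILM rather than by the disk allocator.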

The most useful prevention work is usually boring: working snapshots, tested restores, realistic ILM retention, enough disk headroom, and shard counts that match the size of the cluster. A cluster with reliable snapshots and sane shard sizing is far easier to recover than a cluster with clever emergency commands and no restore path.

What Not to Do During Health Incidents

Do not restart every node at once. It is tempting when the cluster looks unhealthy, but a rolling, observed approach is safer. Restarting healthy nodes can remove shard copies that Elasticsearch needs for recovery. If you must restart, do one node at a time and wait for the cluster to stabilize between steps.

Do not disable allocation and forget about it. Temporary allocation changes are common during maintenance, but a forgotten cluster.routing.allocation.enable value can leave replicas unassigned long after the maintenance window ends. Always check both persistent and transient settings:

GET /_cluster/settings?flat_settings=true&include_defaults=false
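A forgotten allocation override is easy to detect mechanically. The sketch below assumes the flat-settings response from the call above and looks only at cluster.routing.allocation.enable, treating anything other than "all" as a leftover restriction; the function name and the sample values are made up.

```python
def allocation_restrictions(settings: dict):
    """Return (scope, value) pairs where shard allocation is not fully enabled."""
    key = "cluster.routing.allocation.enable"
    found = []
    for scope in ("persistent", "transient"):
        value = settings.get(scope, {}).get(key)
        if value is not None and value != "all":
            found.append((scope, value))
    return found


# Hand-written sample: a maintenance override that was never reverted.
sample = {
    "persistent": {"cluster.routing.allocation.enable": "primaries"},
    "transient": {},
}
print(allocation_restrictions(sample))
```

An empty list means the setting is either unset or explicitly "all" in both scopes.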

Do not delete indices based only on size. Large indices may be business-critical. Small indices may be safe to remove. Tie cleanup to retention policy, snapshots, and application ownership. In a real outage, the fastest safe cleanup is usually deleting known-expired log or metric indices, not guessing from a sorted size list.

Do not assume Kibana and Elasticsearch use the same language for the problem. Kibana may show a broad red status while Elasticsearch APIs show the precise unassigned shard. Use the UI for visibility, but use the APIs for the decision.