Troubleshooting Common Elasticsearch Shard Allocation Failures

Shard allocation failures are where Elasticsearch stops being abstract. The cluster health turns yellow or red, searches start returning partial results, indexing slows down, and the team has to work out whether the problem is disk, a missing node, a bad allocation rule, or damaged shard data.

The mistake I see most often is treating every unassigned shard the same. A replica shard that is delayed after a planned restart is not the same emergency as an unassigned primary shard for the main orders index. Start by finding which shard is unassigned, whether it is primary or replica, and what Elasticsearch says about the allocation decision.

Read the health signal correctly

Start with cluster health:

GET /_cluster/health?pretty

The important fields are status, active_primary_shards, active_shards, relocating_shards, initializing_shards, unassigned_shards, and delayed_unassigned_shards.

Yellow means all primary shards are assigned, but one or more replicas are not. Your data should still be available, but redundancy is reduced.

Red means one or more primary shards are unassigned. Data in those shards is unavailable unless Elasticsearch can promote a replica, recover the node that had the shard, or restore from a snapshot.

If relocating_shards or initializing_shards is nonzero, the cluster may already be healing. Do not interrupt a normal recovery just because the color is temporarily yellow.
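As a rough illustration of that reading order, here is a minimal Python sketch that turns a health response into a suggested next step. The function name triage_health and the wording of the returned strings are my own; only the field names come from the health API, and the thresholds for "wait" are a simplification rather than a rule.

```python
def triage_health(health: dict) -> str:
    """Classify a GET /_cluster/health response body into a rough next step.

    Hypothetical helper: the categories are illustrative, not an official scheme.
    """
    status = health["status"]
    healing = health.get("relocating_shards", 0) + health.get("initializing_shards", 0)
    delayed = health.get("delayed_unassigned_shards", 0)
    unassigned = health.get("unassigned_shards", 0)

    if status == "red":
        return "incident: at least one primary shard is unassigned"
    if status == "green":
        return "ok: all primaries and replicas are assigned"
    # Status is yellow from here on: primaries are fine, some replicas are not.
    if healing > 0:
        return "wait: recovery is in progress"
    if unassigned > 0 and delayed == unassigned:
        return "wait: every unassigned replica is in delayed allocation"
    return "investigate: replicas are unassigned and nothing is recovering"
```

A yellow cluster with initializing shards lands in "wait", while a yellow cluster with unassigned replicas and no movement lands in "investigate", which is the point where the rest of this article applies.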

List the unassigned shards

Use _cat/shards to see the exact problem:

GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state,index

Look for UNASSIGNED. The prirep column tells you whether the shard is a primary (p) or a replica (r). The unassigned.reason column gives a short reason such as NODE_LEFT, INDEX_CREATED, CLUSTER_RECOVERED, or ALLOCATION_FAILED.
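If you pull the same listing with format=json, a few lines of Python can put the worst problems first. This is a sketch under assumptions: the helper name unassigned_first and the sample rows are invented for illustration, and the rows are shaped the way I expect the JSON form of _cat/shards to look with the columns requested above.

```python
def unassigned_first(rows: list[dict]) -> list[dict]:
    """Keep only UNASSIGNED rows, primaries before replicas, then by index and shard."""
    unassigned = [r for r in rows if r["state"] == "UNASSIGNED"]
    return sorted(
        unassigned,
        # False sorts before True, so primaries ("p") come first.
        key=lambda r: (r["prirep"] != "p", r["index"], int(r["shard"])),
    )


# Illustrative rows, not real cluster output.
sample_rows = [
    {"index": "logs-2026.05.24", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
     "node": None, "unassigned.reason": "NODE_LEFT"},
    {"index": "orders", "shard": "1", "prirep": "p", "state": "UNASSIGNED",
     "node": None, "unassigned.reason": "ALLOCATION_FAILED"},
    {"index": "orders", "shard": "0", "prirep": "p", "state": "STARTED",
     "node": "node-1", "unassigned.reason": None},
]

for row in unassigned_first(sample_rows):
    print(row["index"], row["shard"], row["prirep"], row["unassigned.reason"])
```

Sorting primaries to the top keeps the unassigned orders primary from getting buried under a page of delayed log replicas.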

For a large cluster, narrow it:

GET /_cat/shards/logs-*?v&h=index,shard,prirep,state,node,unassigned.reason

Once you have the index, shard number, and primary/replica flag, ask Elasticsearch for the real explanation.

Use the allocation explain API

With no shard specified, the API explains an arbitrary unassigned shard (and returns an error if there are none):

GET /_cluster/allocation/explain
{}

For a specific shard:

GET /_cluster/allocation/explain
{
  "index": "logs-2026.05.24",
  "shard": 0,
  "primary": false
}

Read can_allocate, allocate_explanation, unassigned_info, and node_allocation_decisions. The node decisions are especially useful because they show why each node was rejected. Common deciders include disk thresholds, same-shard rules, allocation filters, awareness rules, and total-shards-per-node limits.

If the output says no_valid_shard_copy for a primary, treat it seriously. Elasticsearch does not currently see a usable copy of that primary shard.
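On a cluster with many nodes, node_allocation_decisions gets long. The Python sketch below counts which deciders said NO across all nodes, so the dominant blocker stands out. The helper name blocking_deciders and the sample response are illustrative assumptions; the sample is trimmed to the fields the function reads, and the explanation text is a placeholder.

```python
from collections import Counter


def blocking_deciders(explain: dict) -> Counter:
    """Count how many nodes each decider rejected with a NO decision."""
    counts = Counter()
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                counts[decider["decider"]] += 1
    return counts


# Trimmed, illustrative allocation explain response.
sample_explain = {
    "index": "logs-2026.05.24",
    "shard": 0,
    "primary": False,
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "node-1", "node_decision": "no",
         "deciders": [{"decider": "same_shard", "decision": "NO", "explanation": "(omitted)"}]},
        {"node_name": "node-2", "node_decision": "no",
         "deciders": [{"decider": "disk_threshold", "decision": "NO", "explanation": "(omitted)"}]},
        {"node_name": "node-3", "node_decision": "no",
         "deciders": [{"decider": "disk_threshold", "decision": "NO", "explanation": "(omitted)"}]},
    ],
}

for name, nodes in blocking_deciders(sample_explain).most_common():
    print(f"{name}: rejected on {nodes} node(s)")
```

In this sample, same_shard on one node is expected (that node already holds the primary), so the real problem is disk_threshold on the other two.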

Cause 1: not enough suitable nodes

A single-node cluster with one replica per shard will stay yellow forever, because Elasticsearch never places a replica on the same node as its primary. In general, an index with N replicas needs at least N + 1 suitable data nodes, so an index configured for two replicas needs three. If allocation awareness says copies must be spread across zones, you also need enough nodes in the required zones.
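The arithmetic is simple enough to sketch in Python. The function name replica_capacity is hypothetical, and the sketch deliberately ignores awareness rules, filters, and disk: it only applies the rule that copies of the same shard cannot share a node.

```python
def replica_capacity(data_nodes: int, number_of_replicas: int) -> dict:
    """Apply the one-copy-per-node rule to a single shard.

    Ignores awareness, filters, and disk; those can only make things worse.
    """
    copies_per_shard = 1 + number_of_replicas  # one primary plus its replicas
    # The primary takes one node, so at most data_nodes - 1 replicas can be placed.
    assignable_replicas = max(0, min(number_of_replicas, data_nodes - 1))
    return {
        "copies_per_shard": copies_per_shard,
        "nodes_needed": copies_per_shard,
        "unassignable_replicas_per_shard": number_of_replicas - assignable_replicas,
    }


print(replica_capacity(data_nodes=1, number_of_replicas=1))
print(replica_capacity(data_nodes=3, number_of_replicas=2))
```

If unassignable_replicas_per_shard is above zero, no amount of waiting will turn the index green.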

Check replica settings:

GET /my-index/_settings?filter_path=*.settings.index.number_of_replicas

If this is a lab or temporary environment, reduce replicas:

PUT /my-index/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}

For production, the better answer is usually to add suitable data nodes or adjust an unrealistic replica count. Lowering replicas removes redundancy.

Cause 2: disk watermarks

Disk pressure is one of the most common allocation blockers. Elasticsearch uses disk watermarks to avoid filling nodes. By default, a node above the low watermark (85% disk used) stops receiving new shards, and a node above the high watermark (90%) has shards relocated away from it.

Check allocation and disk usage:

GET /_cat/allocation?v
GET /_cat/nodes?v&h=name,ip,disk.used_percent,disk.avail,heap.percent,ram.percent,node.role

The allocation explain output may say a node is above the low or high disk watermark. If a node has crossed the flood-stage watermark (95% by default), Elasticsearch also sets index.blocks.read_only_allow_delete on every index that has a shard on that node.
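To read the disk.used_percent column quickly, a small Python sketch can label each node by the watermark it has crossed. This assumes the default percentage watermarks of 85, 90, and 95; clusters that override them, or that express them as absolute free space, need different numbers. The function name disk_stage is my own, and treating "at or above" as crossed is a simplification.

```python
def disk_stage(used_percent: float, low: float = 85.0, high: float = 90.0,
               flood_stage: float = 95.0) -> str:
    """Label a node by the highest disk watermark it has reached.

    Defaults mirror the stock percentage watermarks; pass your own if the
    cluster overrides cluster.routing.allocation.disk.watermark.* settings.
    """
    if used_percent >= flood_stage:
        return "flood_stage"  # indices with shards here may be write-blocked
    if used_percent >= high:
        return "high"         # Elasticsearch tries to move shards off this node
    if used_percent >= low:
        return "low"          # no new shards are allocated to this node
    return "ok"


# Illustrative values, as they might appear in _cat/nodes.
for name, used in [("node-1", 72.4), ("node-2", 87.0), ("node-3", 96.1)]:
    print(name, disk_stage(used))
```

If every data node reports low or worse, replicas have nowhere to go and the fix is capacity, not tuning.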

Good fixes are capacity fixes: delete old indices after confirming snapshots, add disk, add data nodes, move data to another tier, or shorten retention through Index Lifecycle Management.

Changing watermarks can be reasonable in a controlled emergency, but it is not a capacity plan. If every data node is nearly full, raising thresholds just lets the cluster run closer to failure.

After freeing space from a flood-stage event, check for a read-only block:

GET /my-index/_settings?filter_path=*.settings.index.blocks.write,*.settings.index.blocks.read_only_allow_delete

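Elasticsearch 7.4 and later release this block automatically once disk usage on the node falls below the high watermark. On older versions, or if the block is still present, remove it only after disk pressure is resolved: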

PUT /my-index/_settings
{
  "index.blocks.read_only_allow_delete": null
}

Cause 3: allocation disabled after maintenance

Teams often disable allocation during rolling maintenance, then forget to turn it back on.

Check cluster settings:

GET /_cluster/settings?include_defaults=true&pretty

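Look for cluster.routing.allocation.enable. Valid values are all (the default), primaries, new_primaries, and none. With none, Elasticsearch allocates no shards at all; with primaries or new_primaries, replicas stay unassigned, which is why a cluster left in maintenance mode sits at yellow.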

Re-enable allocation:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}

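Also check transient settings. Transient values take precedence over persistent ones, so a leftover transient maintenance setting can still block allocation even if the persistent section looks fine. Setting the value to null in the relevant section resets it.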
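The precedence rule is easy to get wrong when reading a long settings dump by eye. Here is a minimal Python sketch that resolves the effective value, assuming the response was fetched with include_defaults=true and flat_settings=true so that each section is a flat map of dotted keys. The helper name effective_setting and the sample response are illustrative.

```python
def effective_setting(settings: dict, key: str):
    """Return (value, scope) for a cluster setting, honoring transient > persistent > defaults."""
    for scope in ("transient", "persistent", "defaults"):
        value = settings.get(scope, {}).get(key)
        if value is not None:
            return value, scope
    return None, None


# Illustrative flat-settings response: persistent looks fine, transient does not.
sample_settings = {
    "persistent": {"cluster.routing.allocation.enable": "all"},
    "transient": {"cluster.routing.allocation.enable": "none"},
    "defaults": {"cluster.routing.allocation.enable": "all"},
}

value, scope = effective_setting(sample_settings, "cluster.routing.allocation.enable")
print(f"effective value is {value!r}, set in the {scope} section")
```

In this sample the persistent section says all, but the transient none wins, which is exactly the leftover-maintenance case described above.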

Cause 4: restrictive allocation filters

Index-level filters can pin an index to certain nodes:

GET /my-index/_settings?filter_path=*.settings.index.routing.allocation.*

Cluster-level filters can exclude nodes from allocation:

GET /_cluster/settings?include_defaults=true&filter_path=**.cluster.routing.allocation.*

Node attributes matter too:

GET /_cat/nodeattrs?v

A typical failure looks like this: an index requires box_type: hot, but the hot nodes were replaced and the new nodes do not have node.attr.box_type: hot. Elasticsearch is following the rule exactly; the rule is now wrong.
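To check a require rule against the nodes you actually have, you can compare it with the JSON form of _cat/nodeattrs. The Python sketch below is a simplification under stated assumptions: it handles only exact-match require rules (no wildcards, no include or exclude logic), the helper name nodes_matching_require is hypothetical, and the node names and rows are invented.

```python
def nodes_matching_require(nodeattr_rows: list[dict], required: dict[str, str]) -> list[str]:
    """Return nodes whose attributes satisfy every exact-match 'require' rule."""
    attrs_by_node: dict[str, dict[str, str]] = {}
    for row in nodeattr_rows:
        attrs_by_node.setdefault(row["node"], {})[row["attr"]] = row["value"]
    return sorted(
        node
        for node, attrs in attrs_by_node.items()
        if all(attrs.get(name) == value for name, value in required.items())
    )


# Illustrative rows shaped like GET /_cat/nodeattrs?format=json.
# The replacement hot node was provisioned without node.attr.box_type.
sample_nodeattrs = [
    {"node": "hot-1", "attr": "box_type", "value": "hot"},
    {"node": "hot-2-new", "attr": "xpack.installed", "value": "true"},
    {"node": "warm-1", "attr": "box_type", "value": "warm"},
]

print(nodes_matching_require(sample_nodeattrs, {"box_type": "hot"}))
```

An empty or shorter-than-expected result tells you the rule and the fleet have drifted apart before you touch any index settings.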

To remove overly restrictive index filters:

PUT /my-index/_settings
{
  "index.routing.allocation.require.box_type": null,
  "index.routing.allocation.include.box_type": null,
  "index.routing.allocation.exclude.box_type": null
}

Use the exact setting names present in your index. Do not wipe allocation rules blindly if they encode real zone or tier requirements.

Cause 5: delayed allocation after a node leaves

When a node leaves, Elasticsearch may delay allocating replica shards because the node might come back quickly. This avoids copying large shards across the network during a normal restart.

Check delayed shards:

GET /_cluster/health?pretty

If delayed_unassigned_shards is greater than zero and the node is expected back, waiting may be the best action. You can also inspect index settings:

GET /my-index/_settings?filter_path=*.settings.index.unassigned.node_left.delayed_timeout

The default is one minute (1m), but always check your cluster and version. Some teams increase it for planned rolling restarts of large shards. Do not make it so long that real failures leave replicas missing for an uncomfortable amount of time.

Cause 6: too many shards on a node

index.routing.allocation.total_shards_per_node limits how many shards from one index may live on the same node. The cluster-level setting cluster.routing.allocation.total_shards_per_node applies the same kind of limit across all indices. These settings are useful, but they can block allocation in small clusters.

Check index settings:

GET /my-index/_settings?filter_path=*.settings.index.routing.allocation.total_shards_per_node

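If you have five primary shards, one replica, and two data nodes, there are ten shard copies to place, so each node must be allowed to hold five of them. With a per-node limit of four, two copies have no legal placement and stay unassigned. Fix the limit, add nodes, or redesign the shard count.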
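The same check works for any index. This Python sketch computes how many shard copies cannot be placed under a per-node limit; it is a necessary-condition check only, since other deciders such as disk or filters can still block a placement that the limit allows. The function name shard_limit_shortfall is my own.

```python
def shard_limit_shortfall(primaries: int, replicas: int, data_nodes: int,
                          total_shards_per_node: int) -> int:
    """Return how many shard copies of one index exceed the per-node limit's capacity."""
    copies = primaries * (1 + replicas)            # every primary plus its replicas
    capacity = data_nodes * total_shards_per_node  # most copies the limit allows
    return max(0, copies - capacity)


# Five primaries, one replica, two data nodes, limit of four per node.
print(shard_limit_shortfall(primaries=5, replicas=1, data_nodes=2, total_shards_per_node=4))
```

A result above zero means the limit alone guarantees unassigned shards, regardless of how healthy the nodes are.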

Cause 7: no valid copy of a primary

This is the scary case. Allocation explain may report that there is no valid shard copy for a primary. Maybe the only node with the primary is gone. Maybe the disk failed. Maybe shard data is corrupted.

First, try to recover the missing node if it is expected to return. Check system logs, Elasticsearch logs, disk health, and network connectivity. If a valid replica exists, Elasticsearch should normally promote it.

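If no valid copy exists, restore from a snapshot. A restore will not overwrite an open index, so delete or close the damaged index first, or restore it under a different name: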

POST /_snapshot/my_repository/snapshot_name/_restore
{
  "indices": "affected-index"
}

If the data can be rebuilt from a source system and you accept losing the shard contents, allocate_empty_primary is available, but it is a data-loss operation:

POST /_cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "affected-index",
        "shard": 0,
        "node": "target-node",
        "accept_data_loss": true
      }
    }
  ]
}

Do not use this to “make the cluster green” unless you have consciously decided that the missing data is gone or rebuildable.

Watch recovery

After making a change, watch progress:

GET /_cat/recovery?v&active_only=true
GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason
GET /_cluster/health?pretty

Large shards take time. If byte counts are moving in _cat/recovery, the cluster is working. If nothing changes, check allocation explain again. Elasticsearch’s decision may have changed after your first fix, revealing the next blocker.
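One way to tell "moving" from "stuck" is to sample _cat/recovery twice and compare. The Python sketch below assumes you requested the index, shard, target_node, and bytes_percent columns with format=json, and that bytes_percent arrives as a string such as "40.0%"; verify that format on your version. The helper name stalled_recoveries and the sample rows are illustrative.

```python
def stalled_recoveries(before: list[dict], after: list[dict]) -> list[tuple]:
    """Return (index, shard, target_node) keys whose bytes_percent did not increase."""

    def progress(rows: list[dict]) -> dict:
        return {
            (r["index"], r["shard"], r["target_node"]): float(r["bytes_percent"].rstrip("%"))
            for r in rows
        }

    earlier, later = progress(before), progress(after)
    # Recoveries that finished between samples drop out of the active list,
    # so only shards present in both samples are compared.
    return sorted(key for key, pct in later.items() if key in earlier and pct <= earlier[key])


# Two illustrative samples taken a minute apart.
sample_before = [
    {"index": "logs-2026.05.24", "shard": "0", "target_node": "node-2", "bytes_percent": "40.0%"},
    {"index": "orders", "shard": "1", "target_node": "node-3", "bytes_percent": "10.0%"},
]
sample_after = [
    {"index": "logs-2026.05.24", "shard": "0", "target_node": "node-2", "bytes_percent": "65.5%"},
    {"index": "orders", "shard": "1", "target_node": "node-3", "bytes_percent": "10.0%"},
]

print(stalled_recoveries(sample_before, sample_after))
```

A shard that shows up here across several samples is the one to run allocation explain against next.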

Prevention that actually helps

Monitor disk before watermarks are reached. Alert on trends, not just full disks.

Use ILM or data streams for logs and metrics so retention is automatic.

Keep snapshots current and test restores. A snapshot you have never restored is only a hope.

Keep shard sizes and shard counts reasonable. Too many tiny shards make allocation and recovery slower than the data volume suggests.

Document allocation filters and node attributes. Six months later, someone will replace a node and forget the attribute that made an index allocatable.

Treat yellow as a warning and red as an incident. Yellow can be acceptable during maintenance, but it should not become background noise. Red means at least one primary shard is unavailable, and the longer you wait, the fewer easy recovery options you may have.

A field checklist for incidents

When shard allocation breaks, collect the same evidence every time. It keeps the team from bouncing between theories.

Run:

GET /_cluster/health?pretty
GET /_cat/nodes?v&h=name,ip,node.role,master,disk.used_percent,heap.percent,ram.percent
GET /_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason&s=state,index
GET /_cat/allocation?v
GET /_cat/recovery?v&active_only=true
GET /_cluster/settings?include_defaults=true&pretty

Then run allocation explain for one representative unassigned shard. If there are many, group them by reason. Ten unassigned replicas blocked by disk watermarks are one problem. Three primaries with no_valid_shard_copy are a different problem.
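Picking the representative shard can be scripted. This Python sketch groups unassigned rows by primary/replica flag and reason, then builds one allocation explain request body per group. It assumes rows from _cat/shards with format=json and the columns used in the checklist; the helper name representative_explain_bodies and the sample rows are invented for illustration.

```python
def representative_explain_bodies(rows: list[dict]) -> dict:
    """Map (prirep, reason) to one allocation explain request body for that group."""
    picked: dict = {}
    for r in rows:
        if r["state"] != "UNASSIGNED":
            continue
        group = (r["prirep"], r["unassigned.reason"])
        # setdefault keeps the first shard seen for each group.
        picked.setdefault(group, {
            "index": r["index"],
            "shard": int(r["shard"]),
            "primary": r["prirep"] == "p",
        })
    return picked


# Illustrative rows, not real cluster output.
checklist_rows = [
    {"index": "logs-a", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "logs-a", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "unassigned.reason": "NODE_LEFT"},
    {"index": "orders", "shard": "1", "prirep": "p", "state": "UNASSIGNED", "unassigned.reason": "ALLOCATION_FAILED"},
    {"index": "orders", "shard": "0", "prirep": "p", "state": "STARTED", "unassigned.reason": None},
]

for group, body in representative_explain_bodies(checklist_rows).items():
    print(group, body)
```

Each body can be sent as-is to the allocation explain API, which gives you one explanation per distinct problem instead of one per shard.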

Write down whether the affected data can be rebuilt. Logs from an upstream queue, metrics from agents, and derived search indices may be recoverable from source systems. User-created content or compliance records may not be. Recovery commands should follow that business reality.

When to wait and when to act

Wait when recovery is actively progressing, the missing node is expected back soon, or delayed allocation is doing exactly what you configured it to do. You can verify progress with _cat/recovery; moving byte counts and file counts are good signs.

Act when allocation explain shows a permanent blocker: no suitable nodes, disk watermarks on every node, allocation disabled, missing node attributes, or no valid shard copy. Waiting will not fix a rule that rejects every node.

Escalate quickly when primary shards are unassigned for important indices. Replica failures reduce safety. Primary failures reduce availability.

Avoid making recovery slower

Large recoveries compete with normal search and indexing. Adding too many nodes at once, restarting more nodes, or raising recovery concurrency without checking disk and network capacity can make the cluster less stable.

If you tune recovery settings, do it deliberately and record the original values. Settings such as cluster.routing.allocation.node_concurrent_recoveries and indices.recovery.max_bytes_per_sec can help in some environments and hurt in others. Faster recovery on paper can overload disks and increase query latency enough that users experience a worse outage.

Keep an eye on hot nodes. Allocation may technically succeed while placing too much work on one node because of shard sizes, tier rules, or uneven disk usage. Use _cat/allocation, node stats, and your monitoring system to confirm the cluster is balanced after the immediate failure clears.

After-action fixes

Most shard allocation incidents have a prevention story. Disk watermark incidents point to retention, ILM, or capacity planning. Allocation filter incidents point to missing runbook documentation. No-valid-copy incidents point to snapshots and upstream replay. Slow recovery points to shard sizing and hardware.

Do not close the incident just because health is green. Remove temporary replica changes, restore normal refresh intervals, re-enable allocation if it was changed, verify snapshots, and add the alert that would have caught the issue earlier.

Special case: closed and hidden indices

Sometimes the unassigned shard belongs to an index you did not expect: a closed index, a hidden index, or part of a system feature you did not realize you were touching. Be careful with broad wildcard commands when system indices are present. In modern clusters, security, Kibana, transforms, and other stack features may maintain their own indices.

Use narrow patterns and include hidden indices (for example, through the expand_wildcards query parameter) only when you mean to inspect them. If a system index has allocation trouble, check the related stack component logs as well as Elasticsearch. For example, a Kibana saved-object index problem may show up as Elasticsearch shard allocation trouble and as Kibana startup failures.

The rule is the same as with user data: identify the exact index, understand what owns it, then choose the fix. Do not delete or force-allocate a system index just to clear a red health status unless you understand the product-level impact.