Resolving the Red Cluster Status: A Step-by-Step Elasticsearch Troubleshooting Guide
A practical Elasticsearch red cluster checklist covering unassigned primaries, allocation explain, disk watermarks, and node loss.
Red Elasticsearch cluster status means at least one primary shard is not allocated. That is the part that matters. Some data may be unavailable, searches against affected indices may return partial or failed results, and writes to those shards cannot proceed normally.
Yellow is different: primaries are allocated, but one or more replicas are not. Yellow still deserves attention because you have less redundancy, but red is the incident. Do not begin by deleting data or rerouting shards by hand. First find which primary is unassigned and why Elasticsearch refuses to allocate it.
Understanding Elasticsearch Cluster Health
Elasticsearch provides a Cluster Health API that offers a snapshot of the cluster's status and shard allocation. This API is your primary tool for diagnosing health issues.
GET _cluster/health
The output of this command will include a status field, which can be green, yellow, or red. It also provides information about the number of active and unassigned shards.
- Green: All primary and replica shards are allocated and functioning correctly.
- Yellow: All primary shards are allocated, but some replica shards are unassigned.
- Red: One or more primary shards are unassigned, leading to data unavailability for those shards.
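The three definitions above can be written as a small rule. The following is a minimal sketch, not Elasticsearch's own implementation: it assumes each shard copy is given as a (prirep, state) pair in the style of _cat/shards, and it treats STARTED and RELOCATING as the active states. The function name is made up for illustration.

```python
from typing import Iterable, Tuple

# States in which a shard copy is serving requests. RELOCATING counts as
# active because the source copy keeps serving until the move completes.
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def cluster_status(shards: Iterable[Tuple[str, str]]) -> str:
    """Derive green/yellow/red from (prirep, state) pairs.

    prirep is "p" for a primary and "r" for a replica.
    """
    status = "green"
    for prirep, state in shards:
        if state in ACTIVE_STATES:
            continue
        if prirep == "p":
            return "red"   # any inactive primary makes the cluster red
        status = "yellow"  # inactive replica: reduced redundancy only
    return status
```

A single inactive primary decides the result no matter how many replicas are healthy, which is why red is handled before anything else.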
Use a more detailed health call when you are in an incident:
GET _cluster/health?level=indices
Then list the unassigned shards:
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason,node&s=state,index
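For scripting, the same listing is easier to handle as JSON, because empty columns make the whitespace-separated text output ambiguous. The sketch below assumes the request above was sent with format=json added (and without v), so each row is an object keyed by column name. The sample body is hypothetical data, not output from a real cluster.

```python
import json


def unassigned_primaries(cat_shards_json: str):
    """Return (index, shard, reason) for every unassigned primary."""
    rows = json.loads(cat_shards_json)
    return [
        (row["index"], int(row["shard"]), row.get("unassigned.reason"))
        for row in rows
        if row["prirep"] == "p" and row["state"] == "UNASSIGNED"
    ]


# Hypothetical response body shaped like _cat/shards with format=json.
SAMPLE = """[
  {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED",
   "unassigned.reason": null, "node": "node-a"},
  {"index": "logs-1", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
   "unassigned.reason": "NODE_LEFT", "node": null},
  {"index": "logs-2", "shard": "0", "prirep": "p", "state": "UNASSIGNED",
   "unassigned.reason": "NODE_LEFT", "node": null}
]"""

print(unassigned_primaries(SAMPLE))  # [('logs-2', 0, 'NODE_LEFT')]
```

Only the logs-2 primary is reported. The unassigned logs-1 replica would make the cluster yellow, not red, so it is left out of the incident list.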
Common Causes and Troubleshooting Steps for Red/Yellow Status
When your cluster is not green, it's time to investigate. Here are the most common reasons for unassigned shards and how to address them:
1. Insufficient Disk Space
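Elasticsearch uses disk watermarks to keep nodes from filling their disks. With the default settings it stops allocating new shards to a node above the low watermark (85% used), tries to relocate shards away from a node above the high watermark (90%), and blocks writes to indices that have shards on a node above the flood stage (95%). A primary that cannot find a node within these limits stays unassigned.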
Diagnosis:
- Check disk usage on each node.
- Use the Cluster Allocation Explain API to understand why shards are unassigned.
GET _cluster/allocation/explain
Called without a request body, this API explains an arbitrary unassigned shard. The response gives detailed reasoning, often pointing to disk watermarks.
Resolution:
- Free up disk space: Delete old indices, move data to another tier, or add capacity. Force merging active indices is not a quick disk-space fix and can add heavy I/O during an incident.
- Add more disk space: Increase the storage capacity of your nodes.
- Configure disk watermarks: Adjust cluster.routing.allocation.disk.watermark.low, .high, and .flood_stage only when the current values are wrong for your environment. Raising watermarks can buy time, but it can also hide a real capacity problem.
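To see which nodes are near a watermark, per-node disk usage can be compared against the thresholds. This is a rough sketch under several assumptions: the rows come from GET _cat/allocation?format=json, the thresholds are the default percentages (85, 90, 95) rather than whatever your cluster has configured, and usage at or above a threshold counts as a breach, which only approximates the real decider. The function names are illustrative.

```python
# Default thresholds as used-disk percentages; a cluster may override them,
# including with absolute byte values, which this sketch does not handle.
WATERMARKS = (("flood_stage", 95.0), ("high", 90.0), ("low", 85.0))


def watermark_breached(used_percent: float, watermarks=WATERMARKS) -> str:
    """Return the highest watermark the usage has reached, or 'ok'."""
    for name, threshold in watermarks:
        if used_percent >= threshold:
            return name
    return "ok"


def nodes_over_watermark(allocation_rows):
    """Map node name -> breached watermark, skipping healthy nodes."""
    result = {}
    for row in allocation_rows:
        percent = row.get("disk.percent")
        if percent is None:  # the UNASSIGNED pseudo-row has no disk figures
            continue
        level = watermark_breached(float(percent))
        if level != "ok":
            result[row["node"]] = level
    return result
```

A node reported as low can still hold its existing shards but will not receive new ones. A node reported as high or flood_stage needs space freed before allocation can recover.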
2. Node Left the Cluster (Node Eviction)
Nodes can leave a cluster due to network issues, crashes, or being intentionally removed. If a node holding shards (especially primary shards) leaves, those shards become unassigned.
Diagnosis:
- Check the cluster logs for nodes that have recently left.
- Monitor network connectivity between nodes.
- Ensure all nodes are discoverable by each other. Check discovery.seed_hosts, transport connectivity, and cluster logs. Do not reintroduce cluster.initial_master_nodes into an already formed cluster as a generic fix.
Resolution:
- Restart the node: If the node crashed or became unresponsive, try restarting it.
- Address network issues: Resolve any network connectivity problems between nodes.
- Re-add the node: If the node was intentionally removed, ensure it's configured correctly before rejoining the cluster.
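A quick way to confirm which node is gone is to compare the nodes the cluster currently reports against the nodes you expect. The sketch below assumes you keep your own inventory of expected node names (hypothetical here) and that the rows come from GET _cat/nodes?format=json&h=name.

```python
def missing_nodes(expected, cat_nodes_rows):
    """Return expected node names absent from a _cat/nodes listing, sorted."""
    present = {row["name"] for row in cat_nodes_rows}
    return sorted(set(expected) - present)


# Hypothetical inventory and response rows.
expected = ["es-data-1", "es-data-2", "es-data-3"]
rows = [{"name": "es-data-1"}, {"name": "es-data-3"}]
print(missing_nodes(expected, rows))  # ['es-data-2']
```

The missing name tells you which host's logs and network path to check first.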
3. Shard Allocation Filtering and Awareness
Improperly configured shard allocation rules can prevent shards from being assigned to available nodes.
Diagnosis:
- Review your cluster.routing.allocation.* settings, particularly the cluster.routing.allocation.include, exclude, and require filters.
- Check cluster.routing.allocation.awareness.attributes if you are using zone or rack awareness.
Resolution:
- Adjust allocation filters: Modify the filters to allow shards to be allocated to the appropriate nodes.
- Correct awareness attributes: Ensure nodes are correctly tagged with awareness attributes if used, and that your allocation rules respect these.
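To review these settings quickly, the response of GET _cluster/settings?flat_settings=true can be scanned for filter and awareness keys. This is a minimal sketch that only looks at the persistent and transient scopes. Values set in elasticsearch.yml and index-level filters (index.routing.allocation.*) are outside what it covers. The function name and sample values are illustrative.

```python
FILTER_PREFIXES = (
    "cluster.routing.allocation.include.",
    "cluster.routing.allocation.exclude.",
    "cluster.routing.allocation.require.",
    "cluster.routing.allocation.awareness.",
)


def allocation_rules(settings_response: dict) -> dict:
    """Collect cluster-level allocation filter and awareness settings."""
    found = {}
    for scope in ("persistent", "transient"):
        for key, value in settings_response.get(scope, {}).items():
            if key.startswith(FILTER_PREFIXES):
                found[f"{scope}:{key}"] = value
    return found


# Hypothetical response: a leftover exclude from an old maintenance window.
sample = {
    "persistent": {"cluster.routing.allocation.awareness.attributes": "zone"},
    "transient": {"cluster.routing.allocation.exclude._name": "es-data-2"},
}
print(allocation_rules(sample))
```

A forgotten exclude filter left over from a node drain is a common finding. Setting the key to null through the cluster settings API removes it.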
4. Insufficient Disk Space for Allocation (Post-Index Creation)
Even if a disk isn't full, Elasticsearch might prevent shard allocation if it predicts the disk will exceed high watermarks after allocation. This is related to the disk watermarks but specifically impacts new allocations.
Diagnosis:
- The _cluster/allocation/explain API is invaluable here.
- Check the free space available versus the expected size of the shards.
Resolution:
- Similar to the general disk space issue: free up space, add more storage, or adjust watermarks cautiously.
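The free-space-versus-shard-size check is simple arithmetic. The sketch below assumes a 90% high watermark (the default) and byte counts you have gathered yourself. It illustrates the idea only and is not the exact calculation Elasticsearch performs.

```python
def would_exceed_high_watermark(total_bytes, used_bytes, shard_bytes,
                                high_percent=90.0):
    """Return (exceeds, projected_used_percent) if the shard were placed."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    projected = 100.0 * (used_bytes + shard_bytes) / total_bytes
    return projected > high_percent, round(projected, 1)


GB = 1000 ** 3
# A 1000 GB disk at 80% usage: a 150 GB shard does not fit, a 50 GB one does.
print(would_exceed_high_watermark(1000 * GB, 800 * GB, 150 * GB))  # (True, 95.0)
print(would_exceed_high_watermark(1000 * GB, 800 * GB, 50 * GB))   # (False, 85.0)
```

This is why a node that looks comfortably below the watermark can still be rejected for a large shard.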
5. Shard Size and Node Capacity
Very large shards or a large number of shards can strain node resources (CPU, memory) and affect allocation. Also, if a node has reached its shard limit (cluster.routing.allocation.total_shards_per_node), new shards won't be allocated to it.
Diagnosis:
- Check shard sizes (GET _cat/shards?v).
- Monitor node resource utilization (CPU, memory).
- Review the cluster.routing.allocation.total_shards_per_node setting.
Resolution:
- Reduce shard pressure: For future indices, adjust rollover and shard counts so shards land in a manageable size range. For existing indices, use reindex, shrink, or split only after the cluster is stable enough to handle the work.
- Increase node capacity: Add more powerful nodes or nodes with more memory/CPU.
- Adjust shard limit: If necessary and you have sufficient resources, increase cluster.routing.allocation.total_shards_per_node.
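Shard sizes and per-node shard counts can be pulled from one listing. The sketch below assumes rows from GET _cat/shards?format=json&bytes=b&h=index,shard,prirep,store,node, where store is the shard size in bytes. The 50 GiB threshold is an assumption chosen for illustration, so adjust it to your own sizing target.

```python
from collections import Counter

GIB = 1024 ** 3


def shard_pressure(rows, max_shard_bytes=50 * GIB):
    """Return (oversized shards, shard count per node) from _cat/shards rows."""
    oversized = []
    per_node = Counter()
    for row in rows:
        if row.get("node"):  # unassigned shards have no node
            per_node[row["node"]] += 1
        store = row.get("store")
        if store is not None and int(store) > max_shard_bytes:
            oversized.append((row["index"], int(row["shard"]), int(store)))
    return oversized, dict(per_node)
```

Compare the per-node counts with cluster.routing.allocation.total_shards_per_node. A node already at the limit will reject further shards regardless of free disk.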
6. Master Node Issues
An unstable master node can lead to shard allocation problems. If the master is unavailable or unable to perform its duties, shards may become unassigned.
Diagnosis:
- Check the cluster logs for master-related errors or warnings.
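- Run at least three master-eligible nodes so the cluster keeps a voting majority when one of them fails. An odd count (typically 3 or 5) is the usual choice. Since 7.0, Elasticsearch manages the voting configuration itself, so split-brain protection no longer depends on a manually set quorum.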
- Verify that master-eligible nodes can elect a master.
Resolution:
- Stabilize the master: Ensure master-eligible nodes are healthy, have sufficient resources, and are well-connected.
- Check bootstrap history: cluster.initial_master_nodes is for first cluster formation only. After bootstrap, remove it from node configs and troubleshoot master instability through logs, transport networking, and voting configuration.
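The preference for an odd number of master-eligible nodes follows from majority arithmetic. The sketch below is generic majority-voting math and does not reproduce Elasticsearch's voting configuration logic. The function name is illustrative.

```python
def quorum(master_eligible: int):
    """Return (votes needed for a majority, node failures tolerated)."""
    majority = master_eligible // 2 + 1
    return majority, master_eligible - majority


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n))
```

Three and four nodes both tolerate one failure, so the fourth node adds cost without adding resilience. Five nodes tolerate two.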
Advanced Troubleshooting with _cluster/allocation/explain
The _cluster/allocation/explain API is your most powerful tool for understanding why a specific shard is unassigned.
Example:
GET _cluster/allocation/explain
{
"index": "my-index",
"shard": 0,
"primary": true
}
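This returns JSON explaining why primary shard 0 of my-index cannot be allocated. Three parts matter most: unassigned_info.reason says why the shard became unassigned (e.g., NODE_LEFT), can_allocate gives the overall verdict (e.g., no or no_valid_shard_copy), and the deciders listed under each entry of node_allocation_decisions name the rules that rejected that node (e.g., disk_threshold, filter).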
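The explain response can be long on clusters with many nodes, so it helps to reduce it to the fields that drive the decision. The following sketch assumes the response has already been parsed into a dict. The sample is a trimmed, hypothetical response, and the helper name is made up for illustration.

```python
def summarize_explain(explain: dict) -> dict:
    """Reduce an allocation explain response to its decisive fields."""
    blockers = {}
    for node in explain.get("node_allocation_decisions", []):
        # Keep only the deciders that voted against this node.
        reasons = [d["decider"] for d in node.get("deciders", [])
                   if d.get("decision") == "NO"]
        if reasons:
            blockers[node.get("node_name", node.get("node_id"))] = reasons
    return {
        "shard": f'{explain.get("index")}[{explain.get("shard")}]',
        "unassigned_reason": explain.get("unassigned_info", {}).get("reason"),
        "can_allocate": explain.get("can_allocate"),
        "blockers": blockers,
    }


# Trimmed, hypothetical response for illustration only.
sample = {
    "index": "my-index", "shard": 0, "primary": True,
    "unassigned_info": {"reason": "NODE_LEFT"},
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "es-data-1", "deciders": [
            {"decider": "disk_threshold", "decision": "NO"}]},
        {"node_name": "es-data-3", "deciders": [
            {"decider": "filter", "decision": "NO"}]},
    ],
}
print(summarize_explain(sample))
```

If every node is blocked by the same decider, fix that cause cluster-wide. If the blockers differ per node, the cheapest one to clear is usually the place to start.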
Resolving Yellow Cluster Status
A yellow status means primary shards are allocated, but replicas are not. This primarily impacts data redundancy and fault tolerance.
Common Causes:
- Insufficient nodes: You don't have enough nodes to accommodate the required number of replicas for your indices.
- Shard allocation filtering: Similar to red status, filters might be preventing replicas from being allocated.
- Disk space constraints: Nodes might have enough space for primary shards but not enough for replicas, especially if disk watermarks are active.
Resolution:
- Add more nodes: Increase the number of nodes in your cluster.
- Adjust replica count: Reduce the number of replicas per index (index.number_of_replicas) if fault tolerance is not critical for all indices.
- Check allocation settings: Ensure replica shards are allowed to be allocated to the available nodes.
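The insufficient-nodes case is a counting problem: Elasticsearch does not place two copies of the same shard on one node, so each shard needs one node for the primary plus one per replica. The sketch below captures only that rule and ignores allocation filters, awareness, and disk limits. The function name is illustrative.

```python
def assignable_replicas(data_nodes: int, number_of_replicas: int) -> int:
    """Replica copies per shard that can be placed, one copy per node."""
    return max(0, min(number_of_replicas, data_nodes - 1))


print(assignable_replicas(1, 1))  # 0: a single-node cluster stays yellow
print(assignable_replicas(3, 2))  # 2: every replica can be placed
```

When the result is lower than index.number_of_replicas, the cluster stays yellow until you add data nodes or lower the replica count.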
Best Practices for Maintaining Cluster Health
- Monitor Disk Usage: Proactively monitor disk space on all nodes and set up alerts.
- Right-size your Cluster: Ensure you have enough nodes and resources for your data volume and query load.
- Shard Management: Keep shard sizes within recommended ranges and avoid over-sharding.
- Regularly Review Cluster Health: Use GET _cluster/health and GET _cluster/allocation/explain as part of your routine monitoring.
- Test Changes: Before making significant changes to allocation settings or disk watermarks, test them in a staging environment.
Once you know why the shard is unassigned, the path is usually clear. A disk_threshold decider means capacity. An unassigned reason of NODE_LEFT means recover or replace the missing node. A can_allocate value of no_valid_shard_copy means you may need a snapshot restore or a deliberate data-loss decision using Elasticsearch's documented unsafe recovery procedures (the reroute commands allocate_stale_primary or allocate_empty_primary with accept_data_loss set to true). That last case should be handled slowly, with backups checked first, because the command that gets the cluster out of red can also confirm permanent loss of the missing primary's latest data.