Troubleshooting Elasticsearch Cluster Health: A Step-by-Step Guide
Elasticsearch is a robust distributed system, but like any distributed architecture, it requires active monitoring and occasional intervention to maintain optimal health. Cluster health status is the most critical metric for determining the operational readiness and data safety of your deployment. When the cluster transitions from Green to Yellow, resilience is reduced; when it turns Red, data availability itself is at risk.
This comprehensive guide provides expert steps for diagnosing and resolving common Elasticsearch cluster health issues, focusing specifically on recovering from Yellow and Red statuses. We will use practical Cat APIs and step-by-step checks to quickly identify the root cause and implement corrective actions.
1. Understanding Elasticsearch Cluster Health Status
Before troubleshooting, it is essential to understand what each cluster health color signifies. The health status is determined by the allocation state of your primary and replica shards across the cluster nodes.
| Status | Meaning | Implications |
|---|---|---|
| Green | All primary and replica shards are successfully allocated. | Cluster is fully operational and resilient. |
| Yellow | All primary shards are allocated, but one or more replica shards are unassigned. | Data is available, but the cluster lacks full resilience to node failures. |
| Red | At least one primary shard is unassigned (and thus unavailable). | Data loss or inaccessibility for the index containing the failed shard(s). Critical action required. |
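The cluster always reports the worst status of any index it contains. To see which indices are responsible for a Yellow or Red status, you can ask the Cluster Health API for a per-index breakdown (a quick sketch using the standard level parameter):
GET /_cluster/health?level=indices
Using level=shards instead goes one step further and reports per-shard detail for each index.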
2. Initial Diagnosis: Checking Cluster Health
The first step in any troubleshooting process is to confirm the current status and gather basic metrics using the _cat/health and _cat/nodes APIs.
Step 2.1: Check Cluster Health
Use the _cat/health API to get a high-level summary, including the number of nodes, the total shard count, and the number of unassigned shards. The ?v parameter adds column headers to the output.
GET /_cat/health?v
Example Output (Yellow State):
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1678233600 09:00:00 my-cluster yellow 3 3 10 5 0 0 5 0 - 50.0%
If the status is Yellow or Red, note the value under unassign.
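If you prefer a JSON response (for scripts or monitoring), the _cluster/health endpoint exposes the same numbers, including an unassigned_shards count, and can optionally block until a target status is reached. A minimal sketch:
# Returns status, number_of_nodes, unassigned_shards, etc.
GET /_cluster/health

# Wait up to 30 seconds for the cluster to reach at least Yellow (useful in scripts)
GET /_cluster/health?wait_for_status=yellow&timeout=30s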
Step 2.2: Check Node Status and Memory
Ensure all expected nodes are connected and operating correctly. Also, check the heap utilization (critical for performance and stability).
GET /_cat/nodes?v&h=name,node.role,version,heap.percent,disk.total,disk.used,disk.avail
If a node is missing from this list, you may have a connectivity issue or a stopped service.
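A related check worth running at this stage is the _cat/allocation API, which shows how many shards each data node holds and how much disk it is using; a node holding zero shards or running a nearly full disk is an immediate red flag. A sketch:
GET /_cat/allocation?v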
3. Resolving Red Cluster Status (Primary Shard Failure)
A Red status means data is immediately inaccessible. The goal is to bring the primary shard back online as quickly as possible.
Step 3.1: Identify the Unassigned Primary Shards
Use the _cat/shards API to pinpoint the exact index and shard causing the issue. Look specifically for entries marked as UNASSIGNED with a p (primary) role.
GET /_cat/shards?v | grep UNASSIGNED
Example Output:
index_logs 0 p UNASSIGNED
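If you are running the request from Kibana Dev Tools rather than curl, the shell pipe to grep is not available; the cat API's own h (columns) and s (sort) parameters get you close, and the unassigned.reason column is particularly useful here. A sketch:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state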
Step 3.2: Check the Allocation Explanation
This is the single most important diagnostic step. The Allocation Explain API tells you why a specific shard (or any unassigned shard) cannot be allocated.
GET /_cluster/allocation/explain
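By default, the API explains the first unassigned shard it finds. To target the exact shard identified in Step 3.1, pass a request body (index_logs and shard 0 here simply mirror the example output above):
GET /_cluster/allocation/explain
{
  "index": "index_logs",
  "shard": 0,
  "primary": true
}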
Common Reasons for Red Status:
- Node Failure: The node holding the primary shard has crashed or been removed from the cluster. If the shard has replicas on other nodes, one of them should automatically be promoted to primary. If all copies (primary and replicas) were on the failed node, the shard is lost unless the node is recovered.
- Corrupted Data: The primary shard files on the disk have become corrupt, preventing the node from initializing them.
Step 3.3: Action Plan for Red Status
- Scenario A: Node Offline (Preferred)
- If the node that held the primary shard is simply offline, restore the node service (e.g., restart Elasticsearch or fix network issues). Once the node rejoins the cluster, the primary shard should recover.
- Scenario B: Primary Shard Lost (Last Resort)
- If the node is permanently lost and no replicas existed, the data is gone. Your only option is to force allocation of a fresh, empty primary using the allocate_empty_primary command. Warning: This creates a brand new, empty primary shard, resulting in permanent data loss for the documents that lived in that shard.
POST /_cluster/reroute
{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "[index-name]",
        "shard" : [shard-id],
        "node" : "[target-node-name]",
        "accept_data_loss" : true
      }
    }
  ]
}
Best Practice: Before resorting to allocate_empty_primary, always check whether a snapshot or backup of the index exists; restoring from a snapshot is far preferable to accepting permanent data loss.
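A quick way to perform that check is the Snapshot API; [repo-name] below is a placeholder for whichever repository your cluster has registered:
# List registered snapshot repositories
GET /_snapshot/_all

# List snapshots in a given repository
GET /_snapshot/[repo-name]/_all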
4. Resolving Yellow Cluster Status (Replica Shard Failure)
A Yellow status means the cluster is operational but vulnerable. The primary objective is to allocate the missing replicas.
Step 4.1: Use Allocation Explain
If the status is Yellow, use the _cluster/allocation/explain API (Section 3.2) to understand why the replica cannot be assigned. The explanation for replicas is typically more straightforward.
Common Reasons for Yellow Status:
| Reason Code | Explanation | Fix |
|---|---|---|
| NO_AVAILABLE_NODES | Cluster size is too small (e.g., replica count is 2, but only 2 data nodes exist). | Add more data nodes or reduce number_of_replicas. |
| NOT_ENOUGH_DISK_SPACE | Nodes hit the low or high disk watermark threshold. | Delete old indices, free up disk space, or adjust the disk watermarks. |
| ALLOCATION_DISABLED | Shard allocation was explicitly disabled by cluster settings. | Re-enable allocation using PUT /_cluster/settings (see the example after this table). |
| PRIMARY_NOT_ACTIVE | The primary shard is still initializing or recovering. | Wait for the primary to become active. |
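For the ALLOCATION_DISABLED case, allocation is usually switched off deliberately (for example during a rolling restart) and simply never switched back on. A sketch of re-enabling it:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}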
Step 4.2: Checking Node Requirements and Constraints
Ensure that the cluster meets the basic requirements for replica allocation:
- Node Count: For N replicas, you need at least N+1 data nodes, because a replica is never allocated to the same node as its primary. If adding nodes is not possible, you can lower the replica count instead (see the sketch after this list).
- Disk Watermarks: Elasticsearch stops allocating new shards to a node once its disk usage exceeds the low watermark (default 85%), and starts moving shards away once it crosses the high watermark (default 90%).
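If adding data nodes is not an option, the quickest fix for the node-count case is to lower the replica count on the affected index; [index-name] is a placeholder, and the right value depends on how much redundancy you actually need:
PUT /[index-name]/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}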
# Check disk allocation settings
GET /_cluster/settings?flat_settings=true&filter_path=*watermark*
# Example: Setting high watermark to 95% (Temporarily!)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
Step 4.3: Manual Reroute (If Allocation Logic Fails)
In rare cases, if the standard allocation process seems stuck despite sufficient resources, you can manually force the allocation of the replica to a specific healthy node using the allocate_replica command.
POST /_cluster/reroute
{
  "commands" : [
    {
      "allocate_replica" : {
        "index" : "[index-name]",
        "shard" : [shard-id],
        "node" : "[target-node-name]"
      }
    }
  ]
}
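One common reason allocation appears stuck is that a shard has failed to allocate several times in a row (the default limit is 5 attempts), after which Elasticsearch stops retrying until asked. Before forcing a manual allocation, it is worth a simple retry:
# Ask the cluster to retry shards whose allocation previously failed repeatedly
POST /_cluster/reroute?retry_failed=true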
5. Advanced Troubleshooting and Common Pitfalls
If Red or Yellow status persists, the root cause may be outside of standard shard allocation logic.
5.1 Network Connectivity and Split-Brain
In distributed systems, partitioning (split-brain) can cause severe issues. If master-eligible nodes cannot communicate, the cluster might fail to elect a stable master, leading to unassigned shards.
- Action: Verify network connectivity between all nodes, especially between master-eligible nodes.
- Configuration Check: Ensure your discovery.seed_hosts list is accurate and that the cluster.initial_master_nodes setting was correctly used during cluster bootstrap. A quick check of the elected master is sketched below.
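To confirm that a stable master has been elected, the _cat/master API shows which node currently holds the role; if the result is empty or keeps changing between calls, you are likely looking at a discovery or quorum problem:
GET /_cat/master?v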
5.2 High JVM Memory Pressure
Excessive heap usage (often above 75%) leads to frequent, long Garbage Collection (GC) pauses. During these pauses, a node can appear unresponsive, causing the master node to drop it, leading to unassigned shards.
- Action: Monitor heap usage (_cat/nodes?h=heap.percent). If it is consistently high, consider scaling up node memory, optimizing indexing processes, or implementing index lifecycle management (ILM). A more detailed per-node JVM view is sketched below.
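For a more detailed per-node JVM view, the Nodes Stats API can be filtered down; the filter_path shown here is just one convenient slice (GC statistics live under the same jvm subtree):
# Per-node heap usage as a percentage
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent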
5.3 Shard Allocation Filtering
Accidental application of allocation filters (using node attributes like tags or IDs) can prevent shards from being allocated to nodes that might otherwise be eligible.
# Check for index-level allocation rules
GET /[index-name]/_settings
# Look for: index.routing.allocation.require.*
# Reset index allocation rules if necessary (null out each specific attribute key that is set)
PUT /[index-name]/_settings
{
  "index.routing.allocation.require.[attribute-name]": null
}
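Allocation filters can also exist at the cluster level (cluster.routing.allocation.include, exclude, and require settings), so it is worth checking there too, using the same flat_settings pattern as earlier:
# Check for cluster-level allocation filters
GET /_cluster/settings?flat_settings=true&filter_path=*allocation*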
Summary Checklist for Quick Recovery
| Status | Primary Diagnostic Tool | Key Action Steps |
|---|---|---|
| Yellow | GET /_cluster/allocation/explain | 1. Check disk space. 2. Verify node count vs. replica count. 3. Look for allocation filtering rules. 4. Wait for primary recovery. |
| Red | GET /_cat/shards?v \| grep UNASSIGNED | 1. Check logs on the node that last hosted the shard. 2. Try to restart the failed node. 3. If the primary is confirmed lost and no backup exists, use allocate_empty_primary (data loss risk). |
By systematically utilizing the _cat APIs and the critical _cluster/allocation/explain endpoint, you can rapidly pinpoint the cause of cluster health degradation and implement the necessary corrective steps to restore your cluster to the stable Green status.