Common Elasticsearch Log Analysis for Effective Troubleshooting

Elasticsearch log analysis is usually the fastest way to explain a red cluster, a failed indexing request, or a slow search complaint. When a cluster has several nodes, the logs tell you which node saw the first failure, which component reacted, and whether the problem is disk, memory, discovery, security, or shard recovery.

This guide shows you how to read Elasticsearch logs without chasing noise. You'll learn where logs usually live, which fields matter, what common failure messages mean, and when to switch from the main server log to slow logs or allocation APIs.

Understanding Elasticsearch Log Structure

Elasticsearch uses Log4j 2 for logging. Package installs usually write log files under /var/log/elasticsearch/. Containerized deployments often send logs to standard output, where your container runtime or logging agent collects them. Depending on your version and log4j2.properties, you may see plain text logs, JSON logs, or both.

Installation Type	Typical Log Path
RPM/DEB Linux package	`/var/log/elasticsearch/`
Docker	Container standard output
ZIP or tarball	`$ES_HOME/logs/`

Common files include the main server log, deprecation logs, slow logs, and sometimes audit logs if security auditing is enabled.

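JSON log entries usually include fields like these. Exact names depend on the version and layout; the ECS-formatted JSON logs in Elasticsearch 8.x, for example, use names such as log.level and log.logger instead of level and component: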

@timestamp: When the event occurred.
level: The severity, such as INFO, WARN, or ERROR.
component: The Elasticsearch class or subsystem that logged the message.
cluster.uuid: The cluster identifier.
node.name: The node that generated the log line.
message: The human-readable event text.

{
  "@timestamp": "2024-01-15T10:30:00.123Z",
  "level": "WARN",
  "component": "o.e.c.r.a.DiskThresholdMonitor",
  "cluster.uuid": "abcde12345",
  "node.name": "es-node-01",
  "message": "high disk watermark [90%] exceeded on [es-node-01]"
}
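As a rough illustration of how these fields can be used, the sketch below reads JSON log lines and keeps only WARN and ERROR entries. It assumes one JSON object per line, which is how JSON log files are normally written; the helper names and the sample lines are illustrative, not part of Elasticsearch.

```python
import json


def parse_json_log_lines(lines):
    """Yield one dict per valid JSON log line, skipping anything else."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain-text line or stack trace continuation
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or partially written line


def filter_by_level(entries, levels=("WARN", "ERROR")):
    """Keep entries whose severity is in `levels`.

    Checks both "level" and "log.level" because the field name
    depends on the Elasticsearch version and log layout.
    """
    wanted = set(levels)
    return [e for e in entries if (e.get("level") or e.get("log.level")) in wanted]


# Hand-written sample lines for illustration only.
sample = [
    '{"@timestamp": "2024-01-15T10:29:58.000Z", "level": "INFO", '
    '"node.name": "es-node-01", "message": "started"}',
    '{"@timestamp": "2024-01-15T10:30:00.123Z", "level": "WARN", '
    '"node.name": "es-node-01", "message": "high disk watermark [90%] exceeded on [es-node-01]"}',
    "java.lang.IllegalStateException: not a JSON line",
]

for entry in filter_by_level(parse_json_log_lines(sample)):
    print(entry["@timestamp"], entry["node.name"], entry["message"])
```

In practice you would pass an open file handle instead of the `sample` list, since a file object also iterates line by line.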

Prioritizing Messages by Log Level

Filter for WARN and ERROR first, then widen the search around the same timestamp. The lines before the first ERROR often explain the cause better than the final stack trace.

Level	What It Usually Means	First Action
`ERROR`	A request, shard, node, or subsystem failed.	Investigate immediately.
`WARN`	Elasticsearch detected a risky condition.	Check before it becomes an outage.
`INFO`	Normal lifecycle activity.	Use for context around warnings and errors.
`DEBUG` / `TRACE`	Deep diagnostic detail.	Enable briefly only when you need it.

Avoid leaving production nodes at DEBUG or TRACE. Verbose logging can consume disk quickly and add avoidable overhead.
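One way to widen the search around the first ERROR is to pull every entry logged shortly before it. The following is a minimal sketch, assuming entries are already parsed into dicts with "@timestamp" and "level" keys; the function names and the 60-second window are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as 2024-01-15T10:30:00.123Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def context_before_first_error(entries, window_seconds=60):
    """Return entries logged in the window that ends at the first ERROR.

    The first ERROR itself is included. Returns an empty list when
    there is no ERROR entry at all.
    """
    ordered = sorted(entries, key=lambda e: parse_ts(e["@timestamp"]))
    first_error = next((e for e in ordered if e["level"] == "ERROR"), None)
    if first_error is None:
        return []
    end = parse_ts(first_error["@timestamp"])
    start = end - timedelta(seconds=window_seconds)
    return [e for e in ordered if start <= parse_ts(e["@timestamp"]) <= end]


# Hand-written entries for illustration only.
sample_entries = [
    {"@timestamp": "2024-01-15T10:28:00.000Z", "level": "INFO", "message": "old"},
    {"@timestamp": "2024-01-15T10:29:30.000Z", "level": "WARN", "message": "disk"},
    {"@timestamp": "2024-01-15T10:30:00.123Z", "level": "ERROR", "message": "failed"},
    {"@timestamp": "2024-01-15T10:31:00.000Z", "level": "ERROR", "message": "later"},
]

for entry in context_before_first_error(sample_entries):
    print(entry["@timestamp"], entry["level"], entry["message"])
```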

Troubleshooting Common Log Patterns

Elasticsearch logs rarely say "the root cause is X" in one clean sentence. Look for a pattern: the first warning, the component name, the affected index or shard, and the repeated message that follows.

Bootstrap Check Failures

Elasticsearch performs bootstrap checks in production-like network configurations. These checks catch unsafe host settings such as low file descriptor limits, low virtual memory limits, or memory locking problems. If a required check fails, the node refuses to start.

Search for `bootstrap checks failed`:

[2024-01-15T10:00:00,123][ERROR][o.e.b.BootstrapCheck$Bootstrap] [es-node-01] bootstrap checks failed
[2024-01-15T10:00:00,124][ERROR][o.e.b.BootstrapCheck$Bootstrap] [es-node-01] max file descriptors [4096] for elasticsearch process is too low, increase to at least [65535]

Fix the host setting, restart the node, and confirm the startup log reaches the point where the node joins the cluster.
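Before restarting, it can help to confirm the host limits from the same account that runs Elasticsearch. The sketch below is one way to do that on Linux. The thresholds are assumptions based on commonly documented minimums (65535 file descriptors and a `vm.max_map_count` of 262144), and the function names are hypothetical. Limits reported for your shell may differ from the limits systemd applies to the service, so treat the result as a hint.

```python
import resource  # Unix only; not available on Windows

MIN_FILE_DESCRIPTORS = 65535  # assumed minimum for the file descriptor check
MIN_MAX_MAP_COUNT = 262144    # assumed minimum for vm.max_map_count


def find_limit_problems(nofile_soft, max_map_count):
    """Return human-readable problems for the given limit values."""
    problems = []
    unlimited = nofile_soft == resource.RLIM_INFINITY
    if not unlimited and nofile_soft < MIN_FILE_DESCRIPTORS:
        problems.append(
            f"max file descriptors [{nofile_soft}] is below {MIN_FILE_DESCRIPTORS}"
        )
    if max_map_count is not None and max_map_count < MIN_MAX_MAP_COUNT:
        problems.append(
            f"vm.max_map_count [{max_map_count}] is below {MIN_MAX_MAP_COUNT}"
        )
    return problems


def read_current_limits():
    """Read the soft file descriptor limit and vm.max_map_count for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        with open("/proc/sys/vm/max_map_count") as fh:
            max_map_count = int(fh.read().strip())
    except (OSError, ValueError):
        max_map_count = None  # not Linux, or the value is not readable
    return soft, max_map_count


if __name__ == "__main__":
    for problem in find_limit_problems(*read_current_limits()):
        print(problem)
```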

Network Binding and Discovery Failures

If the node starts but does not join the cluster, search for `BindException`, `master not discovered`, `discovery`, and `cluster.initial_master_nodes`. A `BindException` usually points to an address or port conflict. Discovery messages often point to bad seed hosts, blocked transport port 9300, mismatched cluster names, or security settings that stop nodes from trusting each other.
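To rule out a blocked transport port quickly, you can test plain TCP reachability from the node that fails to join. This is a minimal sketch with hypothetical helper names; it assumes the default transport port 9300. It only proves that a TCP connection opens, so it says nothing about TLS trust or cluster name mismatches.

```python
import socket


def can_connect(host, port=9300, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_seed_hosts(seed_hosts, port=9300, timeout=3.0):
    """Map each seed host to whether its transport port is reachable."""
    return {host: can_connect(host, port, timeout) for host in seed_hosts}
```

For example, `check_seed_hosts(["es-node-01", "es-node-02"])` returns a dict such as `{"es-node-01": True, "es-node-02": False}`, where the host names stand in for your own seed hosts.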

Circuit Breaker Exceptions

Circuit breakers stop requests that would use too much memory. The failed request returns an error, but the node should stay alive.

Search for `CircuitBreakingException` or `Data too large`:

[2024-01-15T11:45:20,500][WARN][o.e.c.c.CircuitBreakerService] [es-node-02]
CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [536870912/512mb], which is larger than the limit of [524288000/500mb]

Common causes include large aggregations, requests that return too many fields, heavy bulk indexing, or fielddata loaded for text fields. Identify the request pattern, then reduce request size, fix mappings, or add capacity.
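When many of these errors pile up, extracting the breaker name and byte counts makes it easier to see which breaker trips and by how much. The sketch below assumes the message reports sizes as `[bytes/human-readable]`, which is the usual form but can differ between versions; the regular expression, function name, and sample message are illustrative.

```python
import re

# Assumed message shape:
# [<breaker>] Data too large, data for [<label>] would be [<bytes>/<size>],
# which is larger than the limit of [<bytes>/<size>]
BREAKER_RE = re.compile(
    r"\[(?P<breaker>\w+)\] Data too large, data for \[(?P<label>[^\]]+)\] "
    r"would be \[(?P<wanted>\d+)/[^\]]+\], "
    r"which is larger than the limit of \[(?P<limit>\d+)/[^\]]+\]"
)


def parse_breaker_message(message):
    """Return breaker details as a dict, or None if the message does not match."""
    match = BREAKER_RE.search(message)
    if match is None:
        return None
    wanted = int(match.group("wanted"))
    limit = int(match.group("limit"))
    return {
        "breaker": match.group("breaker"),
        "label": match.group("label"),
        "wanted_bytes": wanted,
        "limit_bytes": limit,
        "over_by_bytes": wanted - limit,
    }


# Hand-written sample message for illustration only.
sample_message = (
    "CircuitBreakingException: [parent] Data too large, data for "
    "[<transport_request>] would be [536870912/512mb], "
    "which is larger than the limit of [524288000/500mb]"
)

print(parse_breaker_message(sample_message))
```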

Garbage Collection Warnings

The main Elasticsearch log can report long JVM garbage collection pauses. Search for `gc`, `JvmGcMonitorService`, and `overhead`. A few warnings during a load spike may be normal. Repeated warnings paired with rising search latency usually mean the heap is under pressure.
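To judge whether GC warnings are occasional or sustained, it helps to turn each overhead line into a ratio of time spent collecting. The sketch below assumes the overhead message has the form `overhead, spent [X] collecting in the last [Y]` with `ms`, `s`, or `m` units; verify that against your own logs, since the exact wording is an assumption here. The function name and sample line are illustrative.

```python
import re

GC_OVERHEAD_RE = re.compile(
    r"overhead, spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s|m)\] "
    r"collecting in the last \[(?P<total>[\d.]+)(?P<total_unit>ms|s|m)\]"
)

MILLIS_PER_UNIT = {"ms": 1.0, "s": 1000.0, "m": 60000.0}


def gc_overhead_ratio(line):
    """Return the fraction of the interval spent in GC, or None if no match."""
    match = GC_OVERHEAD_RE.search(line)
    if match is None:
        return None
    spent = float(match.group("spent")) * MILLIS_PER_UNIT[match.group("spent_unit")]
    total = float(match.group("total")) * MILLIS_PER_UNIT[match.group("total_unit")]
    if total == 0:
        return None  # avoid dividing by zero on a malformed line
    return spent / total


# Hand-written sample line for illustration only.
sample_line = (
    "[2024-01-15T11:50:00,000][WARN ][o.e.m.j.JvmGcMonitorService] [es-node-02] "
    "[gc][4567] overhead, spent [2.1s] collecting in the last [2.3s]"
)

ratio = gc_overhead_ratio(sample_line)
print(f"GC overhead: {ratio:.0%}")
```

A ratio that stays high across many consecutive lines is a stronger signal of heap pressure than a single spike.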

Shard Recovery and Corruption

When a shard fails to allocate or a node detects a bad local shard copy, Elasticsearch logs the index and shard number.

Search for `shard failed`, `failed shard`, `failed to recover`, or the affected index name:

[2024-01-15T12:05:10,999][ERROR][o.e.i.e.Engine] [es-node-03] [my_index][2] fatal error in engine loop
java.io.IOException: Corrupt index files, checksum mismatch

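If the message mentions corruption, do not delete files by hand. Preserve the logs, check whether a healthy replica exists, and rely on Elasticsearch's own tooling, such as the cluster allocation explain API and, as a last resort, the elasticsearch-shard command-line tool, rather than editing the data path directly.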

Disk Watermarks

Elasticsearch changes shard allocation behavior when nodes cross disk watermarks. Search for `DiskThresholdMonitor`, `low disk watermark`, `high disk watermark`, or `flood stage disk watermark` (the log message writes "flood stage" without a hyphen). Defaults can vary by version and configuration, so confirm your cluster settings before acting:

GET /_cluster/settings?include_defaults=true&filter_path=**.disk.watermark*

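If an index becomes read-only after a flood-stage event, free disk space first. Elasticsearch 7.4 and later release the block automatically once the node drops back below the high watermark. On older versions, or if the block persists, remove it manually only after the node is safely below the watermark:

PUT /my_index/_settings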
{
  "index.blocks.read_only_allow_delete": null
}
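To see how close a data path is to each watermark before touching the block, a rough local check can help. The sketch below assumes percentage-based watermarks of 85%, 90%, and 95%; these are common defaults but must be confirmed against your cluster settings. It approximates usage from the free space reported by the operating system, which may not match Elasticsearch's own accounting exactly. The function names are hypothetical.

```python
import shutil


def classify_disk_usage(used_percent, low=85.0, high=90.0, flood=95.0):
    """Return which watermark, if any, the given usage has crossed."""
    if used_percent >= flood:
        return "flood_stage"
    if used_percent >= high:
        return "high"
    if used_percent >= low:
        return "low"
    return "ok"


def disk_used_percent(path):
    """Approximate used space for the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * (1.0 - usage.free / usage.total)


if __name__ == "__main__":
    # "." is a placeholder; point this at the Elasticsearch data path.
    percent = disk_used_percent(".")
    print(f"{percent:.1f}% used: {classify_disk_usage(percent)}")
```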

Using Slow Logs for Performance Problems

For slow searches or indexing operations, the main server log is often too broad. Slow logs track operations that exceed configured thresholds. Configure them per index with the index settings API.

PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.indexing.slowlog.threshold.index.warn": "1s"
}

Slow logs show the index, shard, elapsed time, and request source when configured to include it. Use them to spot repeated expensive queries, broad date ranges, wildcard-heavy searches, and aggregations on fields that are not mapped for efficient aggregation.
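To spot which shards produce the most slow entries, you can aggregate the elapsed times per index and shard. The sketch below assumes the plain-text slow log layout, where each line contains `[index][shard] took[...], took_millis[...]`; JSON slow logs would need JSON parsing instead. The regular expression, function name, and sample lines are illustrative.

```python
import re
from collections import defaultdict

SLOWLOG_RE = re.compile(
    r"\[(?P<index>[^\[\]]+)\]\[(?P<shard>\d+)\]\s+"
    r"took\[[^\]]+\],\s+took_millis\[(?P<millis>\d+)\]"
)


def summarize_slow_log(lines):
    """Return {(index, shard): {"count", "max_ms", "total_ms"}} for matching lines."""
    summary = defaultdict(lambda: {"count": 0, "max_ms": 0, "total_ms": 0})
    for line in lines:
        match = SLOWLOG_RE.search(line)
        if match is None:
            continue  # not a slow log entry in the assumed layout
        key = (match.group("index"), int(match.group("shard")))
        millis = int(match.group("millis"))
        stats = summary[key]
        stats["count"] += 1
        stats["total_ms"] += millis
        stats["max_ms"] = max(stats["max_ms"], millis)
    return dict(summary)


# Hand-written sample lines for illustration only.
sample_lines = [
    "[2024-01-15T13:00:00,000][WARN ][index.search.slowlog.query] [es-node-01] "
    "[my_index][0] took[1.2s], took_millis[1200], total_shards[5]",
    "[2024-01-15T13:00:05,000][WARN ][index.search.slowlog.query] [es-node-01] "
    "[my_index][0] took[1.5s], took_millis[1500], total_shards[5]",
    "[2024-01-15T13:00:07,000][INFO ][index.search.slowlog.query] [es-node-02] "
    "[my_index][1] took[600ms], took_millis[600], total_shards[5]",
]

for key, stats in sorted(summarize_slow_log(sample_lines).items()):
    print(key, stats)
```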

A Practical Review Workflow

Start with the user-visible symptom and work backward:

Check the cluster health and affected index.
Search WARN and ERROR logs around the incident time.
Compare logs across nodes using node.name and cluster.uuid.
Follow the first repeated warning, not just the final exception.
Use a targeted API next: allocation explain for unassigned shards, slow logs for slow requests, and node stats for resource pressure.
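The cross-node comparison in the steps above can be sketched as a merge of parsed entries from several nodes, followed by a search for the earliest WARN or ERROR in the same cluster. This assumes every entry carries "@timestamp", "level", "node.name", and "cluster.uuid", and that all timestamps share one UTC format so that string order equals time order. The function names and sample entries are hypothetical.

```python
PROBLEM_LEVELS = {"WARN", "ERROR"}


def merge_node_logs(per_node_entries):
    """Flatten per-node entry lists into one timeline sorted by timestamp.

    Sorting by the raw string is only valid when every timestamp uses
    the same ISO 8601 format and time zone.
    """
    merged = [entry for entries in per_node_entries for entry in entries]
    return sorted(merged, key=lambda e: e["@timestamp"])


def earliest_problem(per_node_entries, cluster_uuid):
    """Return the first WARN or ERROR entry for the given cluster, or None."""
    for entry in merge_node_logs(per_node_entries):
        if entry.get("cluster.uuid") != cluster_uuid:
            continue  # ignore entries from another cluster
        if entry.get("level") in PROBLEM_LEVELS:
            return entry
    return None


# Hand-written entries for illustration only.
node_01 = [
    {"@timestamp": "2024-01-15T10:30:05.000Z", "level": "ERROR",
     "node.name": "es-node-01", "cluster.uuid": "abcde12345",
     "message": "shard failed"},
]
node_02 = [
    {"@timestamp": "2024-01-15T10:29:40.000Z", "level": "INFO",
     "node.name": "es-node-02", "cluster.uuid": "abcde12345",
     "message": "started"},
    {"@timestamp": "2024-01-15T10:29:50.000Z", "level": "WARN",
     "node.name": "es-node-02", "cluster.uuid": "abcde12345",
     "message": "high disk watermark exceeded"},
]

first = earliest_problem([node_01, node_02], "abcde12345")
print(first["node.name"], first["message"])
```
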

For example, if Kibana shows a red index, first find the unassigned shard, then search logs for that index and shard number. If the logs mention disk watermarks, fix disk pressure before rerouting anything. If they mention a missing node, recover that node or restore from a snapshot before considering risky allocation commands.
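Once you have the allocation explain response for the unassigned shard, a short summary of the deciders that said NO usually tells you what to search for in the logs. The sketch below works on an already decoded response dict and assumes the commonly documented field names (`current_state`, `unassigned_info`, `node_allocation_decisions`, `deciders`). The sample response is hand-written for illustration and is not real cluster output.

```python
def summarize_allocation_explain(response):
    """Return short text lines describing why a shard is not allocated."""
    lines = [
        f'[{response["index"]}][{response["shard"]}] '
        f'primary={response["primary"]} state={response["current_state"]}'
    ]
    reason = response.get("unassigned_info", {}).get("reason")
    if reason:
        lines.append(f"unassigned reason: {reason}")
    for node in response.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                lines.append(
                    f'{node.get("node_name")}: {decider.get("decider")}: '
                    f'{decider.get("explanation")}'
                )
    return lines


# Hand-written sample response for illustration only.
sample_response = {
    "index": "my_index",
    "shard": 2,
    "primary": True,
    "current_state": "unassigned",
    "unassigned_info": {"reason": "NODE_LEFT"},
    "node_allocation_decisions": [
        {
            "node_name": "es-node-01",
            "deciders": [
                {
                    "decider": "disk_threshold",
                    "decision": "NO",
                    "explanation": "the node is above the low watermark",
                }
            ],
        }
    ],
}

for line in summarize_allocation_explain(sample_response):
    print(line)
```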

Takeaway

Start every Elasticsearch incident by finding the first relevant warning or error, not the loudest final stack trace. Use the main logs for node, discovery, disk, memory, and shard failures. Use slow logs when the cluster is healthy but specific searches or indexing workloads are slow.