Troubleshooting Common Elasticsearch Performance Bottlenecks
A practical workflow for finding Elasticsearch performance bottlenecks in indexing, search, heap, storage, and shard design.
Troubleshooting Elasticsearch performance bottlenecks works best when you resist the first easy theory. A slow dashboard might be a bad query, but it might also be a hot shard, a saturated disk, a heap problem, a mapping mistake, or a recovery process competing for I/O. Start with evidence, then narrow the scope.
I usually split the question into three parts: what is slow, where is it slow, and what changed. "Elasticsearch is slow" is not actionable. "Search latency for logs-prod-* doubled after yesterday's mapping change, mostly on two data nodes" gives you somewhere to work.
Diagnosing Performance Issues
Before changing anything, collect evidence. Elasticsearch exposes several APIs, logs, and metrics that cover most diagnostic needs.
Key Tools and Metrics:
- Cluster Health API (_cluster/health): Provides an overview of the cluster's status (green, yellow, red), number of nodes, shards, and pending tasks. Pending tasks are cluster-state changes waiting on the master node, so a persistently high count usually points at a master struggling with mapping updates, index creation, or shard allocation.
- Node Stats API (_nodes/stats): Offers detailed statistics for each node, including CPU usage, memory, disk I/O, network traffic, and JVM heap usage. This is critical for identifying resource-bound nodes.
- Index Stats API (_stats): Provides statistics for individual indices, such as indexing rates, search rates, and cache usage. This helps pinpoint problematic indices.
- Slow Log: Elasticsearch can log slow indexing and search requests. Slow-log thresholds are index settings, so you can apply them to one noisy index instead of turning the whole cluster into a log generator.
  - Indexing Slow Log: Useful when bulk writes pause or ingest latency jumps.
  - Search Slow Log: Useful when you need the actual request pattern, not just a latency chart.
- Monitoring Tools: Solutions like Kibana's Monitoring UI, Prometheus with the Elasticsearch Exporter, or commercial APM tools provide dashboards and historical data for deeper analysis.
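Because slow-log thresholds are per-index settings, enabling them is a single settings update. The sketch below only builds the request body; the threshold values are illustrative assumptions, not recommendations, and you would send the result with PUT to the target index's _settings endpoint using whatever client you already have.

```python
import json


def slowlog_settings(query_warn="2s", fetch_warn="1s", index_warn="5s"):
    """Build an index-settings body that enables slow logs for one index.

    The default thresholds here are placeholders; tune them to your own
    latency expectations.
    """
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
        "index.indexing.slowlog.threshold.index.warn": index_warn,
    }


if __name__ == "__main__":
    # Body for: PUT /<your-index>/_settings
    print(json.dumps(slowlog_settings(query_warn="500ms"), indent=2))
```

Starting with only the warn level on one index keeps log volume manageable while you look for the offending request pattern.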
Common Bottlenecks and Solutions
1. Slow Indexing
Slow indexing can be caused by various factors, including network latency, disk I/O bottlenecks, insufficient resources, inefficient mapping, or suboptimal bulk API usage.
Causes and Solutions:
Disk I/O Saturation: Elasticsearch is heavily reliant on fast disk I/O for indexing. SSDs are highly recommended.
- Diagnosis: Monitor disk read/write IOPS and throughput using _nodes/stats or OS-level tools. Look for high queue depths.
- Solution: Upgrade to faster storage (SSDs), distribute shards across more nodes, or optimize your shard strategy to reduce I/O per node.
JVM Heap Pressure: If the JVM heap is constantly under pressure, garbage collection can become a significant bottleneck, slowing down all operations, including indexing.
- Diagnosis: Monitor JVM heap usage in Kibana Monitoring or _nodes/stats. High heap usage and frequent, long garbage collection pauses are red flags.
- Solution: Increase JVM heap size (but keep it at or below 50% of system RAM and under the compressed object pointer threshold, which is roughly 26 to 30 GB depending on the JVM), optimize mappings to reduce document size, or add more nodes to distribute the load.
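As a rough illustration of the heap check, the function below scans a parsed _nodes/stats/jvm response and lists nodes above a heap threshold along with their cumulative old-generation GC time. The 85% limit is an assumed alerting threshold, and the sample response is invented to show the shape of the data.

```python
def heap_pressure(node_stats, heap_pct_limit=85):
    """Return (node_name, heap_used_percent, old_gc_ms) for nodes over the limit.

    `node_stats` is the parsed JSON body of GET _nodes/stats/jvm.
    Results are sorted with the most heap-constrained node first.
    """
    flagged = []
    for node in node_stats.get("nodes", {}).values():
        jvm = node.get("jvm", {})
        heap_pct = jvm.get("mem", {}).get("heap_used_percent", 0)
        old_gc = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        if heap_pct >= heap_pct_limit:
            flagged.append(
                (node.get("name", "?"), heap_pct,
                 old_gc.get("collection_time_in_millis", 0))
            )
    return sorted(flagged, key=lambda row: row[1], reverse=True)


# Invented sample with the same structure as a real response.
sample_stats = {
    "nodes": {
        "a1": {"name": "data-1", "jvm": {
            "mem": {"heap_used_percent": 91},
            "gc": {"collectors": {"old": {"collection_time_in_millis": 42000}}}}},
        "b2": {"name": "data-2", "jvm": {
            "mem": {"heap_used_percent": 55},
            "gc": {"collectors": {"old": {"collection_time_in_millis": 300}}}}},
    }
}

if __name__ == "__main__":
    for name, pct, gc_ms in heap_pressure(sample_stats):
        print(f"{name}: heap {pct}%, old GC {gc_ms} ms total")
```

A single reading says little; the same check run a few minutes apart shows whether heap drops after collections or stays pinned near the limit.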
Inefficient Mapping: Overly complex mappings, dynamic mapping with many new fields being created, or incorrect data types can increase indexing overhead.
- Diagnosis: Analyze index mappings (_mapping API). Look for nested objects, large numbers of fields, or fields indexed unnecessarily.
- Solution: Define explicit mappings with appropriate data types. Use dynamic: false or dynamic: strict where applicable. Avoid deeply nested structures if not essential.
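A minimal sketch of an explicit mapping with dynamic: strict follows. The field names are hypothetical and stand in for whatever your documents contain; the point is that every field has a deliberate type and that unknown fields are rejected instead of silently growing the mapping.

```python
def strict_log_mapping():
    """Build an index-creation body with an explicit, strict mapping.

    All field names below are hypothetical examples.
    """
    return {
        "mappings": {
            "dynamic": "strict",  # reject documents with unmapped fields
            "properties": {
                "@timestamp": {"type": "date"},
                "service": {"type": "keyword"},      # exact-match filtering
                "status_code": {"type": "short"},
                "message": {"type": "text"},         # full-text search
                # Stored in _source but not parsed or indexed.
                "raw_payload": {"type": "object", "enabled": False},
            },
        }
    }


if __name__ == "__main__":
    import json
    # Body for: PUT /<new-index>
    print(json.dumps(strict_log_mapping(), indent=2))
```

With dynamic: false instead of strict, unmapped fields are kept in _source but not indexed, which is gentler on producers you do not control.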
Network Latency: High latency between nodes or between clients and the cluster can slow down bulk indexing requests.
- Diagnosis: Measure network latency between your clients/nodes. Analyze bulk API response times.
- Solution: Keep cluster nodes on a low-latency private network, place bulk clients close to the cluster when possible, and reduce unnecessary cross-region traffic. Request cache settings will not fix network latency.
Suboptimal Bulk API Usage: Sending individual requests instead of using bulk requests, or sending excessively large/small bulk requests, can be inefficient.
- Diagnosis: Monitor the throughput of your bulk indexing. Analyze the size of your bulk requests.
- Solution: Use the Bulk API for all indexing operations. Experiment with bulk size (typically 5-15 MB per bulk request is a good starting point) to find the optimal balance between throughput and latency. Ensure your bulk requests are properly batched.
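One way to experiment with bulk size is to batch by encoded bytes rather than by document count. The generator below is a sketch under that assumption: it produces newline-delimited bulk bodies that stay under a byte budget, and leaves sending them to your own client. The index name and budget are placeholders.

```python
import json


def bulk_batches(index, docs, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON bulk bodies whose encoded size stays under max_bytes.

    A single document larger than max_bytes is still emitted, alone in
    its own batch, rather than being dropped.
    """
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        source = json.dumps(doc)
        # +2 accounts for the newline that follows each of the two lines.
        pair_size = len(action.encode("utf-8")) + len(source.encode("utf-8")) + 2
        if lines and size + pair_size > max_bytes:
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines.extend([action, source])
        size += pair_size
    if lines:
        yield "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = ({"n": i} for i in range(10))
    for body in bulk_batches("logs-test", docs, max_bytes=120):
        print(len(body.encode("utf-8")), "bytes")
```

Measure throughput at a few budgets within the suggested range and keep the smallest one that saturates the cluster; larger requests mostly add memory pressure.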
Translog Durability: The index.translog.durability setting controls when the transaction log is fsynced to disk. request (default) fsyncs and commits the translog before each request is acknowledged, which is safer but slower than async, which fsyncs in the background every index.translog.sync_interval (5 seconds by default).
- Diagnosis: This is a configuration setting; check the current value in the index settings.
- Solution: For maximum indexing throughput, consider async durability. However, be aware that acknowledged writes made since the last fsync can be lost if a node crashes.
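If you only want async durability for the duration of a bulk load, it helps to treat the setting as a toggle so that restoring the default is not forgotten. This is a small sketch that builds the two settings bodies; applying them through the index settings API is left to your client.

```python
def translog_settings(bulk_load):
    """Return the settings body for a bulk-load window or for normal operation.

    bulk_load=True  -> async durability (faster, small window of possible loss)
    bulk_load=False -> request durability (the default, fsync per request)
    """
    durability = "async" if bulk_load else "request"
    return {"index": {"translog": {"durability": durability}}}


if __name__ == "__main__":
    print("before load:", translog_settings(bulk_load=True))
    print("after load: ", translog_settings(bulk_load=False))
```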
2. Slow Queries
Query performance is influenced by shard size, query complexity, caching, and the efficiency of the underlying data structure.
Causes and Solutions:
Large Shards: Shards that are too large can slow down queries as Elasticsearch has to search through more data and merge results from more segments.
- Diagnosis: Check shard sizes using _cat/shards (the store column), or per-index totals using _cat/indices. Index settings show the configured shard count but not shard size.
- Solution: Aim for shard sizes between 10GB and 50GB. Consider reindexing data into a new index with smaller shards or using Index Lifecycle Management (ILM) to manage shard size over time.
Too Many Shards: Having an excessive number of small shards can lead to high overhead for the cluster, especially during searches. Each shard requires resources for management.
- Diagnosis: Count the total number of shards per node and per index using _cat/shards.
- Solution: Consolidate indices if possible. Optimize your data model to reduce the number of indices and thus the total number of shards. For time-series data, ILM can help manage shard count.
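Both shard checks can be run from a single _cat/shards call. The sketch below assumes the response was requested as JSON with sizes in bytes (format=json and bytes=b), and reports shards outside the 10 to 50 GB range along with the shard count per node. The sample rows are invented.

```python
from collections import Counter

GB = 1024 ** 3


def shard_report(cat_shards, min_gb=10, max_gb=50):
    """Summarize parsed output of GET _cat/shards?format=json&bytes=b."""
    per_node = Counter()
    oversized, undersized = [], []
    for row in cat_shards:
        # Unassigned or initializing shards have no node or store size yet.
        if row.get("state") != "STARTED" or row.get("store") is None:
            continue
        per_node[row["node"]] += 1
        size_gb = int(row["store"]) / GB
        label = f'{row["index"]}[{row["shard"]}]{row["prirep"]}'
        if size_gb > max_gb:
            oversized.append((label, round(size_gb, 1)))
        elif size_gb < min_gb:
            undersized.append((label, round(size_gb, 1)))
    return {"per_node": dict(per_node),
            "oversized": oversized,
            "undersized": undersized}


# Invented sample rows in the shape the cat API returns.
sample_rows = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "state": "STARTED",
     "store": str(60 * GB), "node": "data-1"},
    {"index": "logs-a", "shard": "0", "prirep": "r", "state": "STARTED",
     "store": str(60 * GB), "node": "data-2"},
    {"index": "tiny", "shard": "0", "prirep": "p", "state": "STARTED",
     "store": str(1 * GB), "node": "data-1"},
    {"index": "tiny", "shard": "0", "prirep": "r", "state": "UNASSIGNED",
     "store": None, "node": None},
]

if __name__ == "__main__":
    print(shard_report(sample_rows))
```

Small shards are not automatically a problem on a young index that is still filling; the undersized list is most useful for indices that have stopped receiving writes.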
Inefficient Queries: Complex queries, queries that involve heavy scripting, wildcard searches at the beginning of terms, or regular expressions can be very resource-intensive.
- Diagnosis: Use the Profile API (set "profile": true in the search request body) to analyze query execution time and identify slow parts. Analyze slow logs.
- Solution: Simplify queries. Avoid leading wildcards and expensive regex. Use term queries instead of match for exact matches on keyword fields. For type-ahead suggestions, consider the search_as_you_type field type or the completion suggester. Put non-scoring clauses in filter context rather than query context so they skip scoring and can be cached.
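To make the filter-context and profiling advice concrete, here is a sketch with two helpers. The first builds a search body that scores only the full-text clause and moves exact-match and range conditions into filter; the second ranks the query components in a profile response by time. The field names (message, service, @timestamp) are hypothetical.

```python
def filtered_search(service, text, profile=True):
    """Search body: score on `text`, filter (no scoring) on service and time."""
    return {
        "profile": profile,
        "query": {
            "bool": {
                "must": [{"match": {"message": text}}],
                "filter": [
                    {"term": {"service": service}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ],
            }
        },
    }


def slowest_components(response, top=5):
    """Return the `top` slowest query components from a profiled response.

    Each row is (time_in_nanos, shard_id, query_type, description).
    A parent's time includes its children, so read rows as a tree, not a sum.
    """
    rows = []

    def walk(node, shard_id):
        rows.append((node["time_in_nanos"], shard_id,
                     node["type"], node["description"]))
        for child in node.get("children", []):
            walk(child, shard_id)

    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for node in search.get("query", []):
                walk(node, shard["id"])
    return sorted(rows, reverse=True)[:top]
```

Profiling adds overhead of its own, so compare components within one profiled run rather than treating the absolute timings as production latency.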
Lack of Caching: Insufficient or ineffective caching can lead to repeated computations and data retrieval.
- Diagnosis: Monitor cache hit rates for the query cache and request cache using _nodes/stats/indices/query_cache and _nodes/stats/indices/request_cache.
- Solution: Ensure caching is actually being used. The node query cache stores results of clauses run in filter context, so it matters most for repeated filters. The shard request cache is enabled by default but only caches requests with size: 0 (such as aggregations) unless you opt in per request with request_cache=true, which is worth doing for frequently repeated identical searches.
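The node stats response reports raw hit and miss counters rather than a ratio. A minimal sketch for turning them into hit rates per node, assuming a parsed response from the two stats endpoints named above:

```python
def cache_hit_rates(node_stats):
    """Compute query-cache and request-cache hit rates per node.

    Returns {node_name: {"query_cache": rate, "request_cache": rate}},
    where rate is hits / (hits + misses), or None if the cache is unused.
    """
    rates = {}
    for node in node_stats.get("nodes", {}).values():
        indices = node.get("indices", {})
        per_cache = {}
        for cache in ("query_cache", "request_cache"):
            stats = indices.get(cache, {})
            hits = stats.get("hit_count", 0)
            misses = stats.get("miss_count", 0)
            total = hits + misses
            per_cache[cache] = hits / total if total else None
        rates[node.get("name", "?")] = per_cache
    return rates
```

The counters are cumulative since node start, so for a current picture take two samples and compute the rate from the difference.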
Segment Merging Overhead: Elasticsearch merges smaller segments into larger ones in the background. This process consumes I/O and CPU resources, which can sometimes impact real-time query performance.
- Diagnosis: Monitor the number of segments per shard using _cat/segments.
- Solution: Avoid changing merge settings casually. During a large backfill, reduce refresh frequency, control bulk concurrency, and watch merge throttling and disk I/O. Force merges are usually for read-only indices, not active hot indices.
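The _cat/segments output has one row per segment, so counting segments per shard is a simple grouping. The sketch below assumes the response was requested with format=json; the cutoff of 50 segments is an arbitrary illustration, not a documented limit.

```python
from collections import Counter


def segments_per_shard(cat_segments, warn_above=50):
    """Count segments per shard copy from GET _cat/segments?format=json.

    Returns (counts, noisy) where counts maps (index, shard, prirep) to a
    segment count and noisy lists the keys above `warn_above`, busiest first.
    """
    counts = Counter(
        (row["index"], row["shard"], row["prirep"]) for row in cat_segments
    )
    noisy = [key for key, n in counts.most_common() if n > warn_above]
    return dict(counts), noisy
```

A high count on an index that is still being written to is normal between merges; it is more telling on an index that stopped receiving writes long ago.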
3. Resource Contention (CPU, Memory, Network)
Resource contention is a broad category that can manifest in both indexing and query performance degradation.
Causes and Solutions:
CPU Overload: High CPU usage can be caused by complex queries, intensive aggregations, too many indexing operations, or excessive garbage collection.
- Diagnosis: Monitor CPU usage per node (_nodes/stats). Identify which operations are consuming the most CPU (e.g., search, indexing, JVM GC); the hot threads API (_nodes/hot_threads) shows what the busiest threads on each node are doing.
- Solution: Optimize queries and aggregations. Distribute load across more nodes. Reduce indexing rate if it's overwhelming the CPU. Ensure adequate JVM heap settings to minimize GC overhead.
Memory Issues (JVM Heap and System Memory): Insufficient JVM heap leads to frequent GC. Running out of system memory can cause swapping, drastically reducing performance.
- Diagnosis: Monitor JVM heap usage and overall system memory (RAM, swap) on each node.
- Solution: Allocate sufficient JVM heap (no more than 50% of system RAM, and below the compressed object pointer threshold of roughly 26 to 30 GB). Avoid swapping by leaving enough free system memory for the filesystem cache. Consider adding more nodes or using dedicated nodes for specific roles (master, data, ingest).
Network Bottlenecks: High network traffic can slow down inter-node communication, replication, and client requests.
- Diagnosis: Monitor network bandwidth usage and latency between nodes and clients.
- Solution: Optimize network infrastructure. Reduce unnecessary data transfer. Ensure optimal shard allocation and replication settings.
Disk I/O Saturation: As mentioned in indexing, this also impacts query performance when reading data from disk.
- Diagnosis: Monitor disk I/O metrics.
- Solution: Upgrade to faster storage, distribute data across more nodes, or optimize queries to reduce the amount of data read.
Best Practices for Performance Tuning
- Monitor Continuously: Performance tuning is an ongoing process. Regularly monitor your cluster's health and resource utilization.
- Optimize Mappings: Define explicit, efficient mappings tailored to your data. Avoid unnecessary fields or indexing.
- Shard Strategy: Aim for optimal shard sizes (10-50GB) and avoid having too many or too few shards.
- Use Bulk API: Use the Bulk API for indexing and the multi-search API when you need to bundle independent searches.
- Tune JVM Heap: Allocate sufficient heap, but do not over-allocate. Avoid swapping.
- Understand Query Performance: Profile queries, simplify them, and leverage the filter context.
- Leverage Caching: Ensure query and request caches are used effectively.
- Hardware: Use SSDs for storage and ensure adequate CPU and RAM.
- Dedicated Nodes: Consider using dedicated nodes for master, data, and ingest roles to isolate workloads.
- Index Lifecycle Management (ILM): For time-series data, ILM is essential for managing indices, rolling over shards, and eventually deleting old data, which helps control shard count and size.
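As one illustration of the ILM point, the sketch below builds a policy body that rolls an index over when a primary shard reaches a size limit or the index reaches an age limit, then deletes it later. The 30-day and 90-day values are placeholders, and the policy name you PUT it under (via _ilm/policy) is your choice.

```python
def rollover_policy(max_shard_size="50gb", max_age="30d", delete_after="90d"):
    """Build an ILM policy body: roll over on size or age, then delete.

    All default values are illustrative; pick them from your own
    retention and shard-size targets.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": max_shard_size,
                            "max_age": max_age,
                        }
                    }
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    import json
    # Body for: PUT _ilm/policy/<policy-name>
    print(json.dumps(rollover_policy(), indent=2))
```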
When you find a bottleneck, make the smallest change that directly addresses it. Add nodes when the cluster is genuinely out of capacity. Fix mappings when heap is being wasted. Rewrite queries when the profile output points at expensive clauses. Adjust shard strategy when one node is doing work that should be spread out. That discipline keeps performance work from becoming a pile of unrelated tuning knobs.