Guide to Elasticsearch Indexing Performance: Best Practices Unveiled

Improve Elasticsearch indexing with bulk requests, refresh and replica tuning, mapping choices, hardware checks, and shard planning.

Elasticsearch indexing performance problems become visible when your ingest pipeline starts backing up, bulk requests get rejected, or searches slow down during heavy writes. The fix is rarely one magic setting; you need to tune request size, refresh behavior, mappings, shard layout, and hardware together.

This guide focuses on practical Elasticsearch indexing performance checks you can apply before and during a large ingest job. Use them with metrics from your own cluster, because document size, analyzers, storage, and replica count can change the result.

Understanding the Indexing Process

Before diving into optimization, it's essential to grasp how Elasticsearch handles indexing. When a document is indexed, Elasticsearch parses the document, analyzes the text fields (tokenization, stemming, etc.), and then writes the resulting terms to an inverted index alongside other data structures. Analysis is mostly CPU-bound, while writing segments and the translog is I/O-bound. In a distributed environment, this work happens on the nodes that hold the primary shard and each of its replicas, making cluster-wide configuration and node resources critical.
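To make the analysis and inverted index steps concrete, here is a minimal sketch in Python. It is a toy, not how Lucene works internally: the analyze function is a hypothetical stand-in that only lowercases and splits on non-alphanumeric characters, and the index is a plain dictionary.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "Disk I/O is slow", 2: "Fast disk, fast index"}
index = build_inverted_index(docs)
print(index["disk"])  # [1, 2]
print(index["fast"])  # [2]
```

Even in this toy version, every document costs tokenization work (CPU) plus an update to a shared structure that must eventually be persisted (I/O), which is why both resources matter for indexing speed.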

Key Factors Influencing Indexing Speed

Several factors can significantly impact how quickly Elasticsearch can index documents:

  • Hardware Resources: CPU, RAM, and especially disk I/O speed are paramount. SSDs are highly recommended over HDDs for their superior read/write performance.
  • Cluster Configuration: Shard allocation, replication settings, and node roles play a role.
  • Indexing Strategy: The method used to send data (e.g., single document requests vs. bulk API).
  • Mapping and Data Types: How your fields are defined and their corresponding data types.
  • Refresh Interval: How often data becomes visible for search.
  • Translog Settings: Durability settings for acknowledged writes.

Optimizing Indexing Performance: Best Practices

This section covers actionable strategies to enhance your Elasticsearch indexing throughput.

1. Leverage the Bulk API

The most fundamental optimization for indexing is to use the Bulk API. Instead of sending individual indexing requests, which incur network overhead and processing cost per request, the Bulk API allows you to send a list of operations (index, create, update, delete) in a single HTTP request. This significantly reduces network latency and improves overall throughput.

Best Practices for Bulk API:

  • Batch Size: Experiment with batch sizes. Start with modest payloads, then increase while watching indexing latency, memory pressure, and 429 rejections. Document count alone is not enough because one document may be tiny and another may be several megabytes.
  • Concurrency: Use multiple threads or asynchronous clients to send bulk requests concurrently. However, avoid overwhelming your cluster. Monitor CPU and I/O usage to find the sweet spot.
  • Error Handling: Implement robust error handling. A bulk request can return HTTP 200 even when some operations failed, so check the top-level errors flag and, when it is true, inspect the status and error of each entry in the items array.
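One way to batch by payload size rather than document count is sketched below in Python, using only the standard library. The function name bulk_batches and the 5 MiB default are assumptions chosen for illustration, not recommended values; tune the limit against your own cluster.

```python
import json

def bulk_batches(docs, index_name, max_bytes=5 * 1024 * 1024):
    """Yield NDJSON bulk bodies of at most max_bytes each.

    A single document larger than max_bytes is still sent, alone in its own batch.
    """
    lines, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index_name}})
        source = json.dumps(doc)
        # +2 accounts for the newline that terminates each of the two lines
        pair_size = len(action.encode("utf-8")) + len(source.encode("utf-8")) + 2
        if lines and size + pair_size > max_bytes:
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines.extend([action, source])
        size += pair_size
    if lines:
        # the bulk body must end with a newline
        yield "\n".join(lines) + "\n"

sample_docs = [{"field1": "x" * 40} for _ in range(10)]
for body in bulk_batches(sample_docs, "my-index", max_bytes=300):
    print(len(body.encode("utf-8")), "bytes")
```

Each yielded string can be sent as the body of one bulk request. Because it is a generator, the full dataset never has to sit in memory, and several worker threads can pull batches from it if you guard it with a lock or a queue.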

Example Bulk Request:

POST /_bulk
{ "index": { "_index": "my-index", "_id": "1" } }
{ "field1": "value1", "field2": "value2" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "field1": "value3", "field2": "value4" }
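Checking each operation's result might look like the following Python sketch. It works on a bulk response that has already been parsed into a dictionary. The function name split_bulk_failures, the choice to treat only 429 as retryable, and the hand-written sample response (including its error type strings) are assumptions for illustration.

```python
RETRYABLE_STATUSES = {429}  # assumption: only rejections are worth resending

def split_bulk_failures(response):
    """Return (retryable, fatal) lists of (position, status, error_type)."""
    retryable, fatal = [], []
    if not response.get("errors"):
        return retryable, fatal
    for position, item in enumerate(response["items"]):
        # each item has exactly one key naming the operation (index, create, update, delete)
        (result,) = item.values()
        if "error" not in result:
            continue
        entry = (position, result["status"], result["error"].get("type"))
        if result["status"] in RETRYABLE_STATUSES:
            retryable.append(entry)
        else:
            fatal.append(entry)
    return retryable, fatal

# Hand-written sample shaped like a bulk response, not real cluster output
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 429,
                   "error": {"type": "es_rejected_execution_exception"}}},
        {"index": {"_id": "3", "status": 400,
                   "error": {"type": "mapper_parsing_exception"}}},
    ],
}
retryable, fatal = split_bulk_failures(sample_response)
print(retryable)  # [(1, 429, 'es_rejected_execution_exception')]
print(fatal)      # [(2, 400, 'mapper_parsing_exception')]
```

The positions line up with the order of operations in the request, so retryable entries can be resent after a backoff while fatal ones are logged or routed to a dead-letter store.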

2. Tune Indexing Settings

Elasticsearch provides several settings that can be adjusted to optimize the indexing process. These are typically set on a per-index basis.

Refresh Interval (index.refresh_interval)

The refresh interval controls how often data becomes visible for search. Commonly, active indices refresh about once per second when they are being searched, but defaults can vary by version and index type. During heavy indexing, you can increase this interval to reduce refresh work. Setting it to -1 disables automatic refreshes, meaning data will not become searchable until you manually refresh or restore automatic refreshes.

  • Recommendation: For bulk indexing operations, temporarily increase index.refresh_interval or set it to -1 when search freshness is not required. After the bulk operation is complete, restore the setting you use for normal search behavior and run a manual refresh if needed.

Example using Index Settings API:

# Temporarily disable refresh
PUT /my-index/_settings
{
  "index" : {
    "refresh_interval" : "-1"
  }
}

# ... perform bulk indexing ...

# Re-enable refresh
PUT /my-index/_settings
{
  "index" : {
    "refresh_interval" : "1s"
  }
}

Translog Durability (index.translog.durability)

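The translog is a write-ahead log that protects acknowledged writes that have not yet been committed to Lucene segments. index.translog.durability can be set to request (default) or async. With request, Elasticsearch fsyncs the translog before acknowledging each request. With async, it fsyncs in the background every index.translog.sync_interval (5s by default), which can improve indexing speed but means writes acknowledged since the last sync can be lost if a node crashes.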

  • Recommendation: For bulk import scenarios where durability is less critical than speed, async can be beneficial. Always consider your application's tolerance for data loss.
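To put a number on that tolerance, a back-of-the-envelope estimate can help. The Python sketch below assumes a steady indexing rate and uses the default index.translog.sync_interval of 5 seconds; the function name is hypothetical, and the result is only a rough ceiling for a single failure, not a prediction.

```python
def unsynced_ops_upper_bound(ops_per_second, sync_interval_seconds=5.0):
    """Rough ceiling on acknowledged operations written since the last translog fsync."""
    return int(ops_per_second * sync_interval_seconds)

print(unsynced_ops_upper_bound(2000))        # 10000
print(unsynced_ops_upper_bound(2000, 30.0))  # 60000
```

If re-sending that many operations from your source system after a crash is cheap, async is easier to justify; if the source data cannot be replayed, it is not.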

Number of Replicas (index.number_of_replicas)

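Replicas are copies of your primary shards, used for high availability and read scaling. However, each replica needs to process every indexing operation. During initial large data loads, setting index.number_of_replicas to 0 can significantly speed up indexing, at the cost of having no redundant copy if a node is lost mid-load. After the data is loaded, increase the replica count again; the replicas are then built by copying the finished shard data rather than by indexing each document a second time.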

Example during bulk load:

# Temporarily set replicas to 0
PUT /my-index/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}

# ... perform bulk indexing ...

# Restore replicas (e.g., to 1)
PUT /my-index/_settings
{
  "index" : {
    "number_of_replicas" : 1
  }
}
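The risk with temporary settings is forgetting to restore them when the load fails partway. One way to guard against that is a context manager, sketched here in Python. The put_settings argument is a hypothetical callable you supply (for example, a thin wrapper around your HTTP client that sends PUT /<index>/_settings); the example uses a fake that only records calls, so no cluster is needed to run it.

```python
from contextlib import contextmanager

# Load-time values taken from the examples above
LOAD_SETTINGS = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}

@contextmanager
def bulk_load_settings(put_settings, index_name, normal_settings):
    """Apply load-time settings, then restore normal_settings even if the load raises."""
    put_settings(index_name, LOAD_SETTINGS)
    try:
        yield
    finally:
        put_settings(index_name, normal_settings)

# Fake transport that records what would have been sent
calls = []
def fake_put_settings(index_name, body):
    calls.append((index_name, body))

normal = {"index": {"refresh_interval": "1s", "number_of_replicas": 1}}
try:
    with bulk_load_settings(fake_put_settings, "my-index", normal):
        raise RuntimeError("simulated ingest failure")
except RuntimeError:
    pass

print(calls[-1] == ("my-index", normal))  # True
```

The finally clause is the point of the design: the normal settings are re-applied whether the body of the with block completes or raises.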

3. Optimize Mappings

Mappings define how documents and their fields are stored and indexed. Poorly designed mappings can lead to performance issues.

  • Avoid Dynamic Mapping for Large Datasets: While convenient, dynamic mapping can lead to mapping explosions and unexpected field types. Define explicit mappings for your indices, especially for high-volume data.
  • Choose Appropriate Data Types: Use the most efficient data types. For example, keyword is more efficient for exact value matching than text if full-text search isn't required.
  • Disable Unnecessary Features: Norms store length-normalization factors used for relevance scoring. If a text field is only filtered on and never scored, disabling them (norms: false) saves space and some indexing work; keyword fields already have norms disabled by default. Similarly, you can disable doc_values on a field you never sort or aggregate on, but doc_values are what make sorting and aggregations efficient, so only turn them off when you are sure.
  • _source Field: If you don't need the original JSON document, disabling _source can save disk space and some I/O, but it also disables the update, update-by-query, and reindex APIs and makes debugging harder. If you keep it enabled, note that stored fields are already compressed; setting index.codec to best_compression trades some CPU for a smaller footprint.

Example Mapping (with explicit types and disabled norms):

PUT /my-index
{
  "mappings": {
    "properties": {
      "timestamp": {"type": "date"},
      "message": {"type": "text", "norms": false},
      "user_id": {"type": "keyword"}
    }
  }
}
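With explicit mappings in place, it can be useful to catch unexpected fields on the client before they reach the cluster (Elasticsearch itself rejects such documents if the mapping sets "dynamic": "strict"). The Python sketch below compares a document against the properties section of a mapping. The function name unmapped_fields is hypothetical, and the check is deliberately simplified: it ignores multi-fields, arrays of objects, and dotted field names.

```python
def unmapped_fields(doc, properties, prefix=""):
    """Return dotted paths of fields in doc that have no entry in the mapping properties."""
    missing = []
    for name, value in doc.items():
        path = f"{prefix}{name}"
        field = properties.get(name)
        if field is None:
            missing.append(path)
        elif isinstance(value, dict) and "properties" in field:
            # descend into object fields that declare sub-properties
            missing.extend(unmapped_fields(value, field["properties"], path + "."))
    return missing

# Same properties as the example mapping above
properties = {
    "timestamp": {"type": "date"},
    "message": {"type": "text", "norms": False},
    "user_id": {"type": "keyword"},
}
doc = {
    "timestamp": "2024-01-01T00:00:00Z",
    "message": "hi",
    "user_id": "u1",
    "session": {"ip": "10.0.0.1"},
}
print(unmapped_fields(doc, properties))  # ['session']
```

Running a check like this on a sample of real documents before a large load is a cheap way to find fields that would otherwise trigger dynamic mapping updates.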

4. Hardware and Infrastructure Considerations

Even with perfect software configurations, inadequate hardware will limit indexing speed.

  • Disk I/O: Use fast SSDs. NVMe SSDs offer the best performance. Avoid network-attached storage (NAS) for indexing nodes if possible.
  • CPU and RAM: Sufficient CPU cores are needed for analysis, and ample RAM helps with caching and overall JVM performance.
  • Ingest and coordinating capacity: For very high ingestion rates, consider dedicated ingest nodes for pipelines or coordinating nodes for client bulk traffic. Data nodes still do the actual indexing work, so do not starve them of CPU, memory, or disk I/O.
  • Network: Ensure sufficient bandwidth and low latency between your clients and Elasticsearch nodes, and between nodes in the cluster.

5. Shard Sizing and Count

While not directly an indexing setting, the number and size of shards affect indexing performance. Each shard carries fixed overhead, so too many small shards waste memory and CPU, while a single massive shard is slow to recover or relocate and cannot spread writes across nodes. A common rule of thumb is to aim for shards between 10GB and 50GB, but the right size depends on your data and query patterns.

  • Recommendation: Plan your primary shard count before indexing large amounts of data. The number of primary shards is fixed when the index is created; changing it later means using the split or shrink APIs or reindexing into a new index.
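The planning arithmetic is simple enough to sketch in Python. The function name and the 30GB default target are assumptions picked from the middle of the 10GB to 50GB range mentioned above; the expected index size should come from a test load of your own data.

```python
import math

def primary_shard_count(expected_index_gb, target_shard_gb=30.0):
    """Smallest primary shard count that keeps each shard at or under the target size."""
    if expected_index_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(expected_index_gb / target_shard_gb))

print(primary_shard_count(400))        # 14
print(primary_shard_count(400, 50.0))  # 8
print(primary_shard_count(20))         # 1
```

Treat the result as a starting point: you may round it to fit your node count so that primaries spread evenly across the cluster.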

6. Index Lifecycle Management (ILM)

For time-series data, using Index Lifecycle Management (ILM) is crucial. While ILM primarily helps manage indices over time (rollover, shrink, delete), the rollover action can be configured to create new indices based on size or age. This ensures that indices remain within optimal size ranges, which indirectly benefits indexing performance.

  • Rollover: When the write index reaches a configured size or age, ILM can automatically create a new, empty index and make it the write target for the alias or data stream. Because each new index takes its settings from the matching index template, you can adjust settings for future indices there and keep active indices manageable.
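As an illustration, the Python sketch below builds the body of an ILM policy with a single hot phase and a rollover action, which you would send with PUT _ilm/policy/<policy-name>. The helper name and the threshold values are assumptions; max_primary_shard_size and max_age are rollover conditions, but check that your Elasticsearch version supports the ones you choose.

```python
import json

def rollover_policy(max_primary_shard_size="50gb", max_age="7d"):
    """Build an ILM policy body that rolls over on shard size or index age, whichever comes first."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": max_primary_shard_size,
                            "max_age": max_age,
                        }
                    }
                }
            }
        }
    }

print(json.dumps(rollover_policy(), indent=2))
```

Rolling over on primary shard size ties the policy directly to the shard sizing target, while the age condition keeps low-volume indices from staying open indefinitely.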

Practical Takeaway

Start with bulk indexing, explicit mappings, and enough disk I/O. For one-time loads, relax refreshes and replicas only while you can tolerate reduced search freshness or redundancy, then restore normal settings and verify cluster health. Keep testing with your real documents; generic batch sizes and shard counts are only starting points.