Efficiently Managing Data Using the Elasticsearch _bulk API Command

Use the Elasticsearch _bulk API correctly with NDJSON examples, response checks, batch sizing, and safe retry guidance.

The Elasticsearch _bulk API is the right tool when your app needs to index, update, or delete many documents without paying for one HTTP request per document. The part that trips people up is the request body: it is newline-delimited JSON, not one pretty-printed JSON array.

Use _bulk when you are loading logs, syncing records from another database, or applying a batch of cleanup deletes. You still need to inspect every item in the response, because one operation can fail while the overall HTTP request succeeds.

Understanding the _bulk API Structure

The _bulk API accepts newline-delimited JSON, usually called NDJSON. Each action is defined on one line. Actions that need a document body use the next line as the source or update payload. The final line must also end with a newline.

Key Components of a _bulk Request:

  • Action and Metadata Line: This line specifies the operation type (index, create, update, or delete), the target index, and optionally the document ID. Document types are not used in modern Elasticsearch APIs.
  • Source Line: This line contains the actual JSON document to be indexed or updated. This line is omitted for delete operations.
  • Newline Delimiter: Every JSON object, whether an action/metadata line or a source line, must sit on its own line and be terminated by a newline character (\n). The request body must end with a newline character, including after the last line.

Example Structure:

{ "index": { "_index": "my-index", "_id": "1" } }
{ "field1": "value1" }
{ "delete": { "_index": "my-index", "_id": "2" } }

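To send this body, save it to a file (bulk.ndjson here) and post it with curl. Use --data-binary rather than -d, because -d strips the newlines the API depends on: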

curl -sS -H 'Content-Type: application/x-ndjson' \
  -X POST 'http://localhost:9200/_bulk' \
  --data-binary @bulk.ndjson
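
If you generate the body in code instead of writing it by hand, the main risk is losing the one-object-per-line layout or the trailing newline. The following is a minimal Python sketch of one way to serialize operations; the function name build_bulk_body and the (action, source) pair format are assumptions made for this example, not part of any Elasticsearch client.

```python
import json

def build_bulk_body(operations):
    """Serialize (action, source) pairs into an NDJSON string for _bulk.

    `operations` is an iterable of (action_dict, source_dict_or_None).
    Pass None as the source for delete actions, which take no body line.
    """
    lines = []
    for action, source in operations:
        # json.dumps without indent never emits a raw newline, so each
        # object stays on exactly one line.
        lines.append(json.dumps(action, separators=(",", ":")))
        if source is not None:
            lines.append(json.dumps(source, separators=(",", ":")))
    # The body must end with a newline, including after the last line.
    return "\n".join(lines) + "\n" if lines else ""

body = build_bulk_body([
    ({"index": {"_index": "my-index", "_id": "1"}}, {"field1": "value1"}),
    ({"delete": {"_index": "my-index", "_id": "2"}}, None),
])
print(body, end="")
```

The resulting string can be written to bulk.ndjson for the curl command above or sent directly as the request body with Content-Type: application/x-ndjson.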

Performing Common Operations with _bulk

A single _bulk request can mix index, create, update, and delete actions, and each action names its own target index. That lets you apply a whole batch of changes in one round trip. The request is not a transaction, though: each operation succeeds or fails on its own.

Indexing Multiple Documents

To index multiple documents, you use the index action. If a document with the specified ID already exists, index will overwrite it. If you want to ensure a document is only indexed if it doesn't already exist, use the create action instead.

Example: Indexing two new documents.

{ "index": { "_index": "my-index", "_id": "1" } }
{ "field1": "value1", "field2": "value2" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "field1": "another_value", "field2": "different_value" }
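
When the documents come from application records rather than a hand-written file, the action lines can be generated from the records themselves. The sketch below assumes each record is a dict with an "id" key that should become the document ID; the helper name index_actions and its parameters are hypothetical.

```python
import json

def index_actions(records, index_name, id_field="id", op_type="index"):
    """Yield NDJSON lines that index each record under its own ID.

    op_type="index" overwrites an existing document with the same ID;
    op_type="create" makes that operation fail instead.
    """
    if op_type not in ("index", "create"):
        raise ValueError("op_type must be 'index' or 'create'")
    for record in records:
        action = {op_type: {"_index": index_name, "_id": str(record[id_field])}}
        yield json.dumps(action)
        # The whole record, including its id field, becomes the source.
        yield json.dumps(record)

records = [
    {"id": 1, "field1": "value1", "field2": "value2"},
    {"id": 2, "field1": "another_value", "field2": "different_value"},
]
lines = list(index_actions(records, "my-index", op_type="create"))
body = "\n".join(lines) + "\n"
print(body, end="")
```

Switching op_type between "index" and "create" is the only change needed to move between overwrite and fail-if-exists behavior.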

Updating Documents

The update action takes the document ID on the action line and a payload on the next line. For a partial update, put the fields you want to change under the doc key. For a scripted update, supply a script object in the payload instead.

Example: Updating a field in an existing document.

{ "update": { "_index": "my-index", "_id": "1" } }
{ "doc": { "field1": "updated_value" } }
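
The two payload shapes are easy to confuse, so here is a hedged Python sketch that builds both. The helper names and the counter field in the script are invented for illustration; doc, doc_as_upsert, and script are the payload keys the update action accepts.

```python
import json

def partial_update(index_name, doc_id, fields, upsert=False):
    """Return the two NDJSON lines for a partial-document update."""
    action = {"update": {"_index": index_name, "_id": doc_id}}
    payload = {"doc": fields}
    if upsert:
        # Create the document from `fields` if it does not exist yet.
        payload["doc_as_upsert"] = True
    return [json.dumps(action), json.dumps(payload)]

def scripted_update(index_name, doc_id, source, params=None):
    """Return the two NDJSON lines for a scripted update (Painless)."""
    action = {"update": {"_index": index_name, "_id": doc_id}}
    payload = {"script": {"source": source, "lang": "painless", "params": params or {}}}
    return [json.dumps(action), json.dumps(payload)]

lines = partial_update("my-index", "1", {"field1": "updated_value"})
# Hypothetical script: increments a numeric field named "counter".
lines += scripted_update("my-index", "3", "ctx._source.counter += params.step", {"step": 1})
print("\n".join(lines))
```

Passing values through params rather than concatenating them into the script source keeps the script text constant across documents.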

Deleting Documents

To delete documents, you use the delete action, specifying the _index and _id of the document to be removed. No source document is required for delete operations.

Example: Deleting a document.

{ "delete": { "_index": "my-index", "_id": "2" } }

Combining Operations

The real efficiency comes from mixing these operations. You can index new documents, update existing ones, and delete others all in the same _bulk request.

Example: Indexing, updating, and deleting in one request.

{ "index": { "_index": "my-index", "_id": "3" } }
{ "field1": "new_document_field", "field2": "new_document_value" }
{ "update": { "_index": "my-index", "_id": "1" } }
{ "doc": { "field1": "further_updated_value" } }
{ "delete": { "_index": "my-index", "_id": "2" } }

Response Handling

The _bulk API returns a JSON response that details the outcome of each individual operation. It's crucial to parse this response to verify that all operations were successful and to identify any errors.

The response contains an items array, where each element corresponds to one of the operations in your request, in the same order. Each item is keyed by its operation type (index, create, update, or delete) and includes an HTTP-style status code, a result (e.g., created, updated, deleted, noop), and other relevant metadata.

Example Response Snippet:

{
  "took": 150,
  "errors": false,
  "items": [
    {
      "index": {
        "_index": "my-index",
        "_id": "3",
        "_version": 1,
        "result": "created",
        "_shards": {"total": 2, "successful": 1, "failed": 0},
        "_seq_no": 0,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "update": {
        "_index": "my-index",
        "_id": "1",
        "_version": 2,
        "result": "updated",
        "_shards": {"total": 2, "successful": 1, "failed": 0},
        "_seq_no": 1,
        "_primary_term": 1,
        "status": 200
      }
    },
    {
      "delete": {
        "_index": "my-index",
        "_id": "2",
        "_version": 2,
        "result": "deleted",
        "_shards": {"total": 2, "successful": 1, "failed": 0},
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 200
      }
    }
  ]
}

If any operation fails, the top-level errors field in the response will be true, and the individual item for the failed operation will contain an error object detailing the issue.
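
A client can turn that rule into a small check. The sketch below walks the items array and collects every operation that carries an error object. The function name failed_items is an assumption, and the sample response is hand-written and trimmed for the example, not captured from a real cluster.

```python
def failed_items(response):
    """Return (position, action, status, error) for each failed operation."""
    if not response.get("errors"):
        return []
    failures = []
    for position, item in enumerate(response["items"]):
        # Each item is a single-key dict: {"index": {...}}, {"update": {...}}, ...
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append((position, action, result.get("status"), result["error"]))
    return failures

# Hand-written sample: the first operation succeeded, the second failed.
sample = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "my-index", "_id": "3", "result": "created", "status": 201}},
        {"create": {"_index": "my-index", "_id": "1", "status": 409,
                    "error": {"type": "version_conflict_engine_exception",
                              "reason": "document already exists"}}},
    ],
}
for position, action, status, error in failed_items(sample):
    print(position, action, status, error["type"])
```

Because items come back in request order, the position value maps each failure back to the operation that caused it.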

Best Practices and Tips

  • Batch Size: Very large batches can strain client memory, coordinating nodes, and data nodes. Start with modest payloads, measure throughput and rejection rates, then adjust for your cluster and document size.
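  • Error Handling: Always parse the response for errors, even when the HTTP request itself succeeds. Retry only items that failed for transient reasons, such as a 429 status from an overloaded cluster, and back off between attempts. Mapping errors (400) and version conflicts (409) will fail again if resent unchanged.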
  • Newline Delimiters: Ensure that newline characters (\n) are correctly used between each JSON object. Incorrect formatting is a common cause of _bulk API failures.
  • Parallelization: For very high ingestion rates, consider sending multiple _bulk requests in parallel, but be mindful of your cluster's capacity.
  • create vs. index: Use create when you want the operation to fail if the ID already exists. Use index when replacing an existing document is acceptable.
  • API Clients: Most Elasticsearch client libraries provide convenient methods for constructing and executing _bulk requests, abstracting away some of the manual formatting.
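
To make the batch-size advice concrete, here is a sketch that groups operations into request bodies capped by both byte size and operation count. The function name chunk_operations and the default limits are placeholders chosen for this example, not recommended values; tune them against your own cluster.

```python
def chunk_operations(operations, max_bytes=5_000_000, max_ops=1000):
    """Group operations into NDJSON request bodies (bytes).

    Each operation is a list of NDJSON lines (one line for delete, two for
    index/create/update), so an action line is never separated from its source.
    """
    batch, batch_bytes, batch_ops = [], 0, 0
    for op_lines in operations:
        encoded = "".join(line + "\n" for line in op_lines).encode("utf-8")
        # Flush before adding if this operation would exceed either limit.
        if batch and (batch_bytes + len(encoded) > max_bytes or batch_ops >= max_ops):
            yield b"".join(batch)
            batch, batch_bytes, batch_ops = [], 0, 0
        batch.append(encoded)
        batch_bytes += len(encoded)
        batch_ops += 1
    if batch:
        yield b"".join(batch)

ops = [
    ['{"index":{"_index":"my-index","_id":"%d"}}' % i, '{"field1":"value%d"}' % i]
    for i in range(5)
]
bodies = list(chunk_operations(ops, max_ops=2))
print(len(bodies), "request bodies")
```

A single operation larger than max_bytes is still sent, alone in its own body, so the byte limit is a target rather than a hard guarantee.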

Practical Takeaway

The _bulk API is fast because it reduces request overhead, but it is only safe when your client treats the response as a list of individual outcomes. Send valid NDJSON with Content-Type: application/x-ndjson, keep batches at a size your cluster can absorb, and retry only the operations that failed for transient reasons.
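
As a closing illustration, the retry rule might be sketched like this in Python. The set of retryable status codes is an assumption for the example (429 signals a rejected, overloaded request; treating 503 as transient is a judgment call), and the helper names are hypothetical.

```python
import random

# Assumption: these item statuses are worth retrying; everything else is not.
RETRYABLE_STATUSES = {429, 503}

def split_failures(operations, response):
    """Return (retryable_operations, permanent_failures) for one _bulk response.

    Relies on response["items"] being in the same order as `operations`.
    """
    retryable, permanent = [], []
    for op, item in zip(operations, response["items"]):
        result = next(iter(item.values()))
        if "error" not in result:
            continue  # this operation succeeded
        if result.get("status") in RETRYABLE_STATUSES:
            retryable.append(op)
        else:
            permanent.append((op, result["error"]))
    return retryable, permanent

def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter, in seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

operations = ["op-a", "op-b", "op-c"]  # stand-ins for real NDJSON operations
response = {"errors": True, "items": [
    {"index": {"status": 201, "result": "created"}},
    {"index": {"status": 429, "error": {"type": "es_rejected_execution_exception"}}},
    {"index": {"status": 400, "error": {"type": "mapper_parsing_exception"}}},
]}
retryable, permanent = split_failures(operations, response)
print(retryable, [error["type"] for _, error in permanent])
```

Resending an index action with an explicit ID or a delete is harmless if the first attempt actually went through. A scripted update that increments a value is not, so treat those retries with more care.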