Best Practices for Elasticsearch Daily Backup and Restore Operations

This guide shows how to build a reliable daily backup strategy for Elasticsearch: configure durable repositories, automate routine snapshots with Snapshot Lifecycle Management (SLM), and coordinate with Index Lifecycle Management (ILM) for long-term retention. It also covers security, performance throttling, and regular restore testing, so that you know your data can actually be recovered.

Daily backups protect your Elasticsearch cluster from the failures replicas cannot fix: accidental deletes, bad mappings, corrupt data, failed upgrades, and full-cluster loss. Replicas help availability, but snapshots are what let you go back to a known good copy.

These best practices for Elasticsearch daily backup and restore operations cover repository setup, Snapshot Lifecycle Management (SLM), restore testing, and the places where Index Lifecycle Management (ILM) fits in.

Understanding the Elasticsearch Snapshot Mechanism

Elasticsearch snapshots are not simple file copies. They are incremental and build on the immutable segment files of the underlying Lucene indices: each snapshot copies only the segments that are not already stored in the repository. After the first full snapshot, later snapshots are therefore much cheaper in time and storage, and deleting one snapshot never removes segments that another snapshot still references.

Snapshots capture two primary components:

  1. Index Data: The actual Lucene segments for selected indices.
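  2. Cluster State (optional): Persistent cluster settings, index templates, ingest pipelines, and ILM policies. Recent versions also capture feature states, which is where security data such as native roles is stored.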
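
To make the incremental behavior concrete, here is a toy model in Python. It is not Elasticsearch code: the repository is just a set of file names, and the segment names are made up for illustration. It only shows why a later snapshot copies less data than the first one.

```python
# Toy model of incremental snapshots (illustrative only, not Elasticsearch code).
# The "repository" is a set of segment file names that were already uploaded.

def take_snapshot(repository: set, live_segments: set) -> set:
    """Copy only the segments missing from the repository; return what was copied."""
    to_copy = live_segments - repository
    repository |= to_copy
    return to_copy


repo = set()
day1 = {"_0.cfs", "_1.cfs", "_2.cfs"}
# By day 2 a merge has replaced _0 and _1 with _3, and new data produced _4.
day2 = {"_2.cfs", "_3.cfs", "_4.cfs"}

first = take_snapshot(repo, day1)
second = take_snapshot(repo, day2)
print(sorted(first))   # every segment is new to the repository
print(sorted(second))  # only the segments the repository has not seen yet
```

The same model explains why heavy merging between snapshots increases the amount of data copied: a merged segment is a new file, even if the documents in it were already backed up.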

1. Establishing the Snapshot Repository

Before taking any snapshot, you must register a repository: the secure location where snapshot files will be stored. The choice of repository is crucial for durability and recovery speed.

Repository Types

Repository Type | Description | Best for | Requirements
fs (Shared File System) | Shared file system mounted at the same path on every master and data node. | Small clusters, quick local backups. | The mount path must be listed under path.repo in elasticsearch.yml on each of those nodes.
s3, azure, gcs | Cloud object storage. Some distributions and versions bundle these repository types; others require installing the matching repository plugin on every node. | Production environments, disaster recovery. | Version-appropriate repository support and proper IAM/service principal credentials.

Example: Registering an S3 Repository

For production environments, cloud storage is usually the better choice for durability and off-site recovery. Confirm repository support for your Elasticsearch version, configure credentials securely, then register the repository through the API.

PUT /_snapshot/my_s3_daily_repo
{
  "type": "s3",
  "settings": {
    "bucket": "es-backup-bucket-name",
    "region": "us-east-1",
    "base_path": "daily_snapshots/production",
    "compress": true
  }
}

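Tip: Dedicate the bucket or file system path to this cluster's snapshots, restrict access to it, and do not let other tools modify its contents. Treat provider-side immutability or object-lock features with care: Elasticsearch must be able to delete and rewrite repository objects when retention removes old snapshots.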
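
After registering a repository it is worth confirming that every node can reach it, which the verify repository API (POST /_snapshot/<repository>/_verify) does. The sketch below uses only the Python standard library. The cluster URL is an assumption, authentication and TLS are left out, and the helper names build_request and verify_repository are made up for this example.

```python
import json
import urllib.request
from typing import Optional

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster URL


def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the Elasticsearch REST API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def verify_repository(repo: str) -> list:
    """Call POST /_snapshot/<repo>/_verify and return the names of the nodes that passed."""
    request = build_request("POST", f"/_snapshot/{repo}/_verify")
    with urllib.request.urlopen(request, timeout=30) as response:
        nodes = json.load(response).get("nodes", {})
    return sorted(info.get("name", node_id) for node_id, info in nodes.items())


# Example call (needs a reachable cluster, so it is left commented out):
# print(verify_repository("my_s3_daily_repo"))
```

If a node is missing from the result, or the call returns an error, fix the repository credentials or mount before relying on scheduled snapshots.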

2. Implementing Daily Automation with SLM

Manual snapshots are acceptable for one-off operations, but routine daily backups must be automated using Snapshot Lifecycle Management (SLM). SLM is the native mechanism within Elasticsearch designed specifically for defining schedules, retention policies, and management of snapshots.

Defining an SLM Policy

A typical daily policy defines a schedule, the indices to include (or exclude), and how long to retain the snapshots.

PUT /_slm/policy/daily_archive_policy
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-{now/d}>",
  "repository": "my_s3_daily_repo",
  "config": {
    "indices": ["logstash-*", "application-metrics-*"],
    "ignore_unavailable": true,
    "include_global_state": false 
  },
  "retention": {
    "expire_after": "30d", 
    "min_count": 5, 
    "max_count": 30 
  }
}

Key SLM Configuration Points:

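  • schedule: Uses Elasticsearch cron syntax, which is similar to Quartz. The expression 0 30 1 * * ? runs daily at 01:30, and cron schedules are evaluated in UTC. Pick a time in your low-usage hours.
  • include_global_state: false: Keeps the daily data snapshots limited to index data. The trade-off is that cluster-level items such as index templates and ingest pipelines are not captured, so back them up with a separate policy if you need to recover them.
  • retention: Defines which snapshots are eligible for cleanup. In the example, snapshots older than 30 days expire, at least 5 are kept even if they have expired, and no more than 30 are kept even if they have not.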
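
The interaction between expire_after, min_count, and max_count is easier to reason about with a small simulation. The Python sketch below is a simplified model of a retention pass, not the actual SLM implementation: it assumes every snapshot succeeded and belongs to the same policy, and the function name snapshots_to_delete is made up for this example.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshot_times, now, expire_after, min_count, max_count):
    """Return the snapshot timestamps a simplified retention pass would delete."""
    newest_first = sorted(snapshot_times, reverse=True)
    doomed = []
    for position, taken_at in enumerate(newest_first):
        over_cap = position >= max_count        # beyond max_count: deleted even if young
        protected = position < min_count        # newest min_count: kept even if old
        expired = now - taken_at > expire_after
        if over_cap or (expired and not protected):
            doomed.append(taken_at)
    return doomed


now = datetime(2024, 6, 1, 2, 0)

# 40 daily snapshots taken at 01:30: the oldest ones are expired and over the cap.
daily = [datetime(2024, 6, 1, 1, 30) - timedelta(days=i) for i in range(40)]
print(len(snapshots_to_delete(daily, now, timedelta(days=30), min_count=5, max_count=30)))

# Snapshots stopped two months ago: all 8 are expired, but min_count still protects 5.
stale = [now - timedelta(days=60 + i) for i in range(8)]
print(len(snapshots_to_delete(stale, now, timedelta(days=30), min_count=5, max_count=30)))
```

The second case is the reason to set min_count at all: if the policy stops producing snapshots, retention does not quietly delete the last good copies.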

Monitoring SLM

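Check your policies regularly to confirm they are running. GET /_slm/status reports whether SLM itself is running or stopped, while the policy response includes the most recent success and failure recorded for that policy.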

GET /_slm/status
GET /_slm/policy/daily_archive_policy
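
A monitoring job can turn the policy response into an alert. The Python sketch below inspects the per-policy object from GET /_slm/policy/daily_archive_policy. The field names last_success, last_failure, time (epoch milliseconds), and details are modeled on that response and should be confirmed against your version; the sample data and the function name policy_problems are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_millis(moment: datetime) -> int:
    """Convert an aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)


def policy_problems(policy_info: dict, now: datetime, max_age: timedelta) -> list:
    """Return human-readable problems found in one policy's status object."""
    problems = []
    success = policy_info.get("last_success")
    failure = policy_info.get("last_failure")
    if success is None:
        problems.append("no successful snapshot recorded")
    else:
        last_ok = datetime.fromtimestamp(success["time"] / 1000, tz=timezone.utc)
        if now - last_ok > max_age:
            problems.append(f"last success is older than {max_age}")
    if failure and (success is None or failure["time"] > success["time"]):
        problems.append("most recent run failed: " + failure.get("details", "no details"))
    return problems


# Invented sample: a success on 1 May followed by a failure on 2 May.
ok_run = datetime(2024, 5, 1, 1, 30, tzinfo=timezone.utc)
bad_run = datetime(2024, 5, 2, 1, 30, tzinfo=timezone.utc)
sample = {
    "last_success": {"snapshot_name": "daily-2024.05.01-example", "time": to_millis(ok_run)},
    "last_failure": {
        "snapshot_name": "daily-2024.05.02-example",
        "time": to_millis(bad_run),
        "details": "example failure",
    },
}
now = datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc)
for problem in policy_problems(sample, now, max_age=timedelta(hours=26)):
    print(problem)
```

A max_age slightly longer than the schedule interval (26 hours for a daily policy here) leaves room for a slow snapshot without hiding a missed run.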

3. Integrating with Index Lifecycle Management (ILM)

For large time-series data, such as logs and metrics, Index Lifecycle Management (ILM) manages indices from creation to deletion. ILM does not replace daily SLM snapshots, but it can help you coordinate deletion with a completed snapshot policy.

ILM and Data Tiering

Before ILM deletes old indices, you can make the delete phase wait for an SLM policy to run. That gives your daily or long-term snapshot policy a chance to capture the data before the index is removed from the cluster.

  1. Create an SLM policy that snapshots the relevant index pattern.
  2. Reference that SLM policy from the ILM delete phase with wait_for_snapshot.

The delete phase of the ILM policy then looks like this:
...
"delete": {
  "min_age": "90d",
  "actions": {
    "wait_for_snapshot": {
      "policy": "daily_archive_policy"
    },
    "delete": {}
  }
}
...

This waits for a successful snapshot from the named SLM policy before ILM deletes the index. If you use data streams, test the lifecycle flow in staging so you know which backing indices are covered by the snapshot policy.

If your goal is to keep older searchable data on cheaper storage instead of deleting it, look at the searchable snapshot action in the cold or frozen phase. That is a different pattern from a plain disaster-recovery snapshot:

...
"cold": {
  "min_age": "30d",
  "actions": {
    "searchable_snapshot": {
      "snapshot_repository": "my_s3_daily_repo"
    }
  }
}
...

Use one lifecycle pattern per goal: SLM for recoverable backups, wait_for_snapshot before deletion, and searchable snapshots when you need lower-cost search access.

4. Best Practices for Restoration Testing

A backup routine is incomplete without a proven recovery strategy. You must regularly test your restoration process to validate data integrity and meet Recovery Time Objective (RTO) goals.

Restoration Testing Environment

  • Do not test restores directly onto a live production cluster. Use a dedicated staging or testing environment that runs a compatible Elasticsearch version and has enough disk space for restored shards.
  • Frequency: Test restoration at least quarterly, or after major upgrades/configuration changes.

Executing a Restoration

Restoration can target specific indices or the entire cluster state.

Step 1: Get Snapshot Details

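Identify the snapshot name you need to restore. Snapshots created by SLM carry the policy's name pattern plus a generated suffix, so copy the exact name from the response.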

GET /_snapshot/my_s3_daily_repo/_all

Step 2: Execute the Restore Operation

To restore specific indices, use the indices parameter. An index cannot be restored over an existing open index of the same name, so in a test environment it is often necessary to rename indices during restoration to avoid conflicts with active indices.

POST /_snapshot/my_s3_daily_repo/snapshot_20240501/_restore
{
  "indices": ["logstash-2024-05-01"],
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1",
  "include_aliases": false
}
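
Before running a restore, it can help to preview what rename_pattern and rename_replacement will produce. The Python sketch below approximates the behavior with the re module. Elasticsearch applies Java regular expressions, so treat this as a check for simple patterns only; the function name preview_renames is made up for this example.

```python
import re


def preview_renames(index_names, rename_pattern, rename_replacement):
    """Approximate the restore rename for each index name (simple patterns only)."""
    # Java-style "$1" group references become Python's "\g<1>".
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", rename_replacement)
    return {name: re.sub(rename_pattern, py_replacement, name) for name in index_names}


print(preview_renames(["logstash-2024-05-01"], "(.+)", "restored-$1"))
print(preview_renames(["logstash-2024-05-01"], "logstash-(.+)", "restored-logs-$1"))
```

If two source indices map to the same target name, or a target name already exists in the cluster, adjust the pattern before starting the restore.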

Verifying Restoration Success

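After restoration, verify shard health and document counts. A count match is useful, but also run a sample query that your application depends on. On a single-node test cluster, an index restored with replicas stays yellow because the replicas cannot be assigned, so wait for yellow there or restore with the replica count set to 0.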

GET /_cluster/health/restored-logstash-2024-05-01?wait_for_status=green&timeout=60s
GET /restored-logstash-2024-05-01/_count
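
A restore drill is easier to repeat when the checks are scripted. The Python sketch below evaluates the two responses above against an expected document count that you recorded from the source index. The sample dictionaries are invented and trimmed to the fields used here, and the function name restore_failures is made up for this example.

```python
def restore_failures(health: dict, count: dict, expected_docs: int) -> list:
    """Return a list of failed checks; an empty list means the restore looks good."""
    failures = []
    if health.get("timed_out"):
        failures.append("cluster health wait timed out")
    if health.get("status") != "green":
        failures.append(f"index health is {health.get('status')}, not green")
    if health.get("unassigned_shards", 0) > 0:
        failures.append(f"{health['unassigned_shards']} shard(s) unassigned")
    if count.get("count") != expected_docs:
        failures.append(f"expected {expected_docs} docs, found {count.get('count')}")
    return failures


# Invented sample responses, trimmed to the fields the checks read.
good_health = {"status": "green", "timed_out": False, "unassigned_shards": 0}
slow_health = {"status": "yellow", "timed_out": True, "unassigned_shards": 1}

print(restore_failures(good_health, {"count": 120000}, expected_docs=120000))
print(restore_failures(slow_health, {"count": 119500}, expected_docs=120000))
```

Record the elapsed time of each drill alongside these checks, since the restore duration is what you compare against your RTO.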

5. Security and Performance Considerations

Security

  • Repository Access: Ensure the credentials used by Elasticsearch to access the repository follow least privilege for the repository path. In practice, snapshot management needs read, write, list, and delete permissions for the objects it owns, especially when retention deletes old snapshots.
  • Encryption: Enable server-side encryption on the storage side (for S3, SSE-S3 or SSE-KMS) so that snapshot data is encrypted at rest, and require TLS for traffic between the cluster and the repository.

Performance Throttling

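Snapshots can be I/O intensive. If you notice performance degradation during the snapshot window, throttle snapshot traffic at the repository level. For many repository types, the relevant settings are max_snapshot_bytes_per_sec and max_restore_bytes_per_sec, which you set when you register or update the repository. Both limits apply per node. The values in the example are illustrative: the snapshot rate commonly defaults to 40mb per second, so check the default for your version and choose a lower value if the goal is to reduce load.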

PUT /_snapshot/my_s3_daily_repo
{
  "type": "s3",
  "settings": {
    "bucket": "es-backup-bucket-name",
    "region": "us-east-1",
    "base_path": "daily_snapshots/production",
    "max_snapshot_bytes_per_sec": "100mb",
    "max_restore_bytes_per_sec": "100mb"
  }
}

indices.recovery.max_bytes_per_sec controls peer and snapshot recovery traffic, so tune it only when you understand the effect on shard recovery. Keep snapshot schedules outside peak indexing and search windows when possible.
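
When choosing a rate limit, a rough estimate of how long the daily snapshot will take helps confirm it still fits the backup window. The Python sketch below is back-of-the-envelope arithmetic under stated assumptions: new segment data is spread evenly across data nodes, the per-node rate limit is the only bottleneck, and the input figures are invented.

```python
def estimated_snapshot_minutes(new_data_gb: float, data_nodes: int, rate_mb_per_sec: float) -> float:
    """Estimate snapshot duration when the per-node rate limit is the bottleneck."""
    per_node_mb = new_data_gb * 1024 / data_nodes
    return per_node_mb / rate_mb_per_sec / 60


# Invented workload: 600 GB of new segments per day across 5 data nodes.
for rate in (20, 40, 100):
    minutes = estimated_snapshot_minutes(600, 5, rate)
    print(f"{rate} MB/s per node: about {minutes:.1f} minutes")
```

Real durations also depend on repository latency, shard distribution, and concurrent load, so compare the estimate with the durations of actual snapshots before tightening the limit further.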

Daily Backup Workflow

  1. Configure Durable Repository: Use cloud storage (S3/Azure/GCS) for production environments.
  2. Define SLM Policy: Schedule snapshots (e.g., daily at 1:30 AM) using SLM, ensuring appropriate retention rules are set.
  3. Coordinate ILM (if applicable): Use wait_for_snapshot before deletion or searchable snapshots for lower-cost searchable history.
  4. Monitor Status: Regularly verify SLM policy execution via the _slm/policy and _slm/status APIs.
  5. Test Recovery: At least quarterly, and after major upgrades or configuration changes, perform a full restoration to a segregated environment to validate RTO readiness.

The useful backup is the one you can restore. Keep the daily SLM policy simple, monitor failures, and schedule restore drills often enough that your team knows the exact steps before an incident.