Four Essential Strategies to Troubleshoot Redis Memory Leaks and Spikes
Memory leaks and sudden spikes can cripple Redis performance. This expert guide provides four essential strategies to proactively manage and troubleshoot memory consumption. Learn how to leverage `INFO` and `MEMORY USAGE` commands for deep diagnostics, implement effective `maxmemory` eviction policies, identify and prune massive keys causing unexpected growth, and resolve system-level fragmentation issues using Active Defragmentation. Stabilize your cache performance and ensure the reliability of your in-memory data store with these proven, actionable techniques.
Redis is an in-memory data store, so memory problems show up fast. A small mistake in TTL handling, one oversized list, or a background save on a host with no spare RAM can turn into latency, write errors, evictions, swapping, or a Redis process killed by the operating system.
The first useful habit is to stop calling every memory increase a leak. True Redis leaks are uncommon. Most incidents are one of three things: real data growth, allocator fragmentation, or temporary copy-on-write overhead during persistence. They look similar on a dashboard, but the fixes are completely different.
If used_memory keeps climbing, your application is probably storing more data than expected. If used_memory is stable but used_memory_rss jumps, look at fragmentation, forked background work, or the operating system. If both climb during a traffic spike and then never fall, check TTLs, eviction policy, and large keys.
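The triage rule above can be sketched as a small function that compares two samples of INFO memory. This is only an illustration: the function name, the dictionary shape, and the 10% threshold are assumptions made for the example, not Redis defaults.

```python
def classify_growth(before: dict, after: dict, threshold: float = 0.10) -> str:
    """Compare two INFO memory samples and name the most likely cause.

    `threshold` is a hypothetical cutoff: relative growth above it is
    treated as a real change rather than noise.
    """

    def relative_growth(field: str) -> float:
        old = before[field]
        return (after[field] - old) / old if old else 0.0

    used = relative_growth("used_memory")
    rss = relative_growth("used_memory_rss")

    if used > threshold:
        # Redis itself allocated more, so the application is storing more data.
        return "data growth: check TTLs, eviction policy, and large keys"
    if rss > threshold:
        # Allocation is flat but resident memory rose.
        return "rss growth: check fragmentation, forked background work, and the OS"
    return "stable"


# Example: used_memory is almost flat while RSS jumps.
before = {"used_memory": 1_000_000_000, "used_memory_rss": 1_100_000_000}
after = {"used_memory": 1_010_000_000, "used_memory_rss": 1_600_000_000}
print(classify_growth(before, after))
```

A real check would also look at INFO persistence to see whether a background save was running when the second sample was taken.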
Strategy 1: Detailed Monitoring of Usage and Fragmentation Metrics
The first step in diagnosing any memory issue is to establish a baseline and understand how Redis reports memory usage. The INFO memory command separates the memory Redis has allocated for data and internal structures from the memory the operating system has actually assigned to the process.
Key metrics for diagnosis
When a spike occurs, look immediately at these metrics from INFO memory:
- used_memory: memory Redis allocated for data and internal structures.
- used_memory_dataset: memory used by the actual dataset, excluding some overhead.
- used_memory_rss: resident memory the operating system has assigned to the Redis process.
- mem_fragmentation_ratio: a rough comparison of RSS to allocated memory. Treat it as a clue, not a verdict.
# Check basic memory stats
redis-cli INFO memory
# Sample output snippet
# used_memory:1073741824 # 1 GiB allocated by Redis (data plus internal overhead)
# used_memory_rss:1509949440 # ~1.41 GiB resident in RAM
# mem_fragmentation_ratio:1.40625 # RSS is about 40% higher than used_memory
Interpreting the fragmentation ratio
A ratio near 1.0 is usually healthy. A ratio above 1.5 is worth investigating, especially if RSS is high enough to threaten the host. A ratio below 1.0 means RSS is smaller than the memory Redis has allocated. That often points to swapping, but it does not prove it on its own; it can also come from measurement edge cases or shared memory accounting. Check OS swap metrics directly with vmstat, top, sar, or your monitoring system.
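To make the ratio concrete, here is a minimal sketch that parses INFO-style text and recomputes RSS divided by used_memory. The helper names are made up for this example, and the sample text reuses the numbers from the snippet above.

```python
def parse_info(text: str) -> dict:
    """Parse the `key:value` lines of INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def fragmentation_ratio(fields: dict) -> float:
    """Resident memory divided by memory allocated by Redis."""
    return int(fields["used_memory_rss"]) / int(fields["used_memory"])


sample = """# Memory
used_memory:1073741824
used_memory_rss:1509949440
"""

ratio = fragmentation_ratio(parse_info(sample))
print(f"{ratio:.5f}")  # 1.40625
```

Recomputing the ratio from raw fields is mostly useful for dashboards and alerts, where you may want to ignore the ratio entirely when used_memory is very small.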
If used_memory is flat but RSS spikes during BGSAVE or BGREWRITEAOF, copy-on-write is a likely cause. The child process is writing a persistence file while the parent continues handling writes. Pages changed by the parent may need to be copied, which temporarily increases memory pressure.
Strategy 2: Implementing Robust Eviction Policies
Unbounded growth is the single most frequent cause of perceived memory "leaks" in Redis. If the instance is used as a cache, it must have a defined ceiling for memory usage, enforced by the maxmemory directive.
If maxmemory is not set, Redis can keep allocating memory until the host is under pressure. On a dedicated Redis box that may end with the kernel killing Redis. In a container, the container runtime may kill it sooner.
Setting maxmemory and Policy Selection
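Set the maximum memory limit in your redis.conf, or at runtime with CONFIG SET. A runtime change is lost on restart unless you also run CONFIG REWRITE or update redis.conf: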
# Set max memory to 4 GB. Leave headroom for Redis overhead, forked children,
# the OS page cache, and other processes.
CONFIG SET maxmemory 4gb
# Configure the eviction policy
# allkeys-lru: Evict the least recently used keys across the *entire* dataset
CONFIG SET maxmemory-policy allkeys-lru
| Policy Name | Description | Use Case |
|---|---|---|
| noeviction | Default. Returns errors on write commands when memory limit is reached. | Databases where no data loss is acceptable. |
| allkeys-lru | Evicts the least recently used keys regardless of expiration. | General-purpose caching. |
| volatile-lru | Evicts the least recently used keys only among those with an expiration set. | Mixed use cases (persisted data + cache data). |
| allkeys-random | Evicts random keys when the limit is reached. | Simple session stores or where access pattern is unpredictable. |
For a pure cache, allkeys-lru or allkeys-lfu is often a reasonable starting point. For a mixed Redis instance where only some keys are disposable, volatile-lru or another volatile-* policy may be safer, but only if every cache key has an expiration. The dangerous setup is a cache with noeviction, no TTL discipline, and no alert before memory is full.
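The risky combinations described above can be checked mechanically. The sketch below assumes you have already fetched the two settings (for example with CONFIG GET) into a dictionary of strings. The function name and the `is_cache` flag are hypothetical, introduced only for this example.

```python
def eviction_warnings(config: dict, is_cache: bool) -> list:
    """Return human-readable warnings for risky maxmemory settings.

    `config` maps setting names to string values, in the shape that
    CONFIG GET returns them. A maxmemory of "0" means no limit.
    """
    warnings = []
    maxmemory = int(config.get("maxmemory", "0"))
    policy = config.get("maxmemory-policy", "noeviction")

    if maxmemory == 0:
        warnings.append("maxmemory is 0: Redis can grow until the host is under pressure")
    if is_cache and policy == "noeviction":
        warnings.append("cache with noeviction: writes will fail once memory is full")
    if policy.startswith("volatile-"):
        warnings.append("volatile-* policy: only keys with a TTL can be evicted")
    return warnings


for warning in eviction_warnings({"maxmemory": "0", "maxmemory-policy": "noeviction"}, is_cache=True):
    print(warning)
```

A check like this fits well in a deploy pipeline or a periodic audit, so a misconfigured cache is caught before memory is full rather than after.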
Strategy 3: Diagnosing and Pruning Large Key Spikes
Sometimes the problem is not the number of keys. It is one key that grew without a boundary: a user feed list that never trims, a sorted set of every event ever seen, or a hash used as a dumping ground for session fields.
Using redis-cli --bigkeys
The redis-cli --bigkeys utility scans the keyspace and reports large keys by type and element count. It does not measure exact byte size, and it can add load on a busy production instance, so run it carefully or against a replica when possible.
# Run the bigkeys analysis
redis-cli --bigkeys
# Sample Output (identifying a massive List)
---------- Summary ----------
...
[5] Biggest list found 'user:1001:feed' with 859387 items
Using MEMORY USAGE (Redis 4.0+)
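To measure a suspect key in bytes, use the MEMORY USAGE command. For collections, Redis estimates the size by sampling nested values (five by default); pass SAMPLES 0 to sample every element when you need a more exact figure and can afford the extra work.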
# Check the memory usage of a specific key (in bytes)
redis-cli MEMORY USAGE user:1001:feed
# Output: (e.g.) 84329014
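Because --bigkeys ranks by element count rather than bytes, it can help to rank keys by MEMORY USAGE instead. The following is a rough sketch, not a tuned tool: `largest_keys` is a hypothetical helper, and it assumes a client object that exposes `scan_iter(match=..., count=...)` and `memory_usage(key, samples=...)`, as the redis-py client does.

```python
import heapq


def largest_keys(client, top_n: int = 10, match: str = "*", samples: int = 5) -> list:
    """Return the top_n (size_in_bytes, key) pairs, largest first.

    Uses SCAN-style iteration so the server is never blocked by a single
    large command, but it still visits every matching key.
    """
    sizes = []
    for key in client.scan_iter(match=match, count=1000):
        size = client.memory_usage(key, samples=samples)
        # The key may expire or be deleted between SCAN and MEMORY USAGE.
        if size is not None:
            sizes.append((size, key))
    return heapq.nlargest(top_n, sizes)
```

With redis-py installed, a call might look like `largest_keys(redis.Redis(), top_n=20, match="user:*")`. As with --bigkeys, a full scan adds load, so prefer a replica or a quiet period.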
If you identify large keys, review the write path. Common fixes are trimming lists with LTRIM, expiring transient structures, splitting very large hashes or sorted sets into smaller keyed partitions, and replacing "load everything" reads with paged access such as HSCAN, SSCAN, or ZSCAN. The real fix is usually in application behavior, not a Redis knob.
Strategy 4: Managing Memory Fragmentation and Copy-on-Write
High fragmentation or sudden RSS spikes are often mistaken for data leaks. These problems relate to memory allocation, object churn, and fork-based persistence.
Active Defragmentation
Active defragmentation can help Redis reclaim wasted allocator space while the server keeps running. It is useful for workloads that create and delete many differently sized values. It also uses CPU, so enable it deliberately and watch latency after the change.
Enable and configure it in redis.conf:
# Enable active defragmentation
activedefrag yes
# Lower and upper thresholds are percentage-style config values.
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
Reducing Copy-on-Write Overhead
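When Redis forks a child process for an RDB snapshot or an AOF rewrite, the operating system shares memory pages between parent and child using copy-on-write (CoW). Each page the parent modifies while the child is still running has to be duplicated, so heavy writes during persistence temporarily raise the memory the host must provide. In the worst case, where nearly every page is modified before the child finishes, the total footprint can approach double the dataset size.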
Mitigation Steps:
- Schedule persistence during low-traffic periods.
- Leave memory headroom above maxmemory for allocator overhead, clients, replication buffers, and forked children. The right margin depends on dataset size and write rate; measure it during a real BGSAVE or BGREWRITEAOF.
- Avoid overlapping heavy background work such as snapshots, AOF rewrites, backups, and host-level scans.
- Reduce write churn during persistence if a batch job is causing copy-on-write growth.
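The headroom advice can be turned into simple arithmetic. The sketch below is an assumption-heavy illustration: it reads `rdb_last_cow_size` and `aof_last_cow_size` from INFO persistence (reported in bytes by Redis 4.0 and later), takes the larger as the observed copy-on-write cost, and subtracts it and a reserve for the OS from host RAM. The function name and the 1 GiB default reserve are hypothetical.

```python
GIB = 1024 ** 3


def suggested_maxmemory(host_ram: int, info_persistence: dict,
                        reserved_for_os: int = 1 * GIB) -> int:
    """Estimate a maxmemory value in bytes that leaves room for CoW.

    The *_last_cow_size fields only describe the most recent fork, so
    measure them after a save that ran under realistic write load.
    """
    cow_peak = max(
        int(info_persistence.get("rdb_last_cow_size", 0)),
        int(info_persistence.get("aof_last_cow_size", 0)),
    )
    budget = host_ram - reserved_for_os - cow_peak
    if budget <= 0:
        raise ValueError("no memory left for Redis after headroom")
    return budget


# 8 GiB host, 512 MiB copied during the last RDB save, 1 GiB kept for the OS.
limit = suggested_maxmemory(8 * GIB, {"rdb_last_cow_size": str(512 * 1024 ** 2)})
print(limit)  # 6979321856 bytes, i.e. 6.5 GiB
```

Treat the result as a starting point only. Client buffers, replication buffers, and fragmentation also need room, and a single measurement can understate the copy-on-write cost of a busier day.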
Do not reach for allocator environment variables as a first response. Redis is commonly built with jemalloc, and changing allocator behavior without testing can create new latency or memory behavior. If fragmentation remains severe after active defrag and workload fixes, test changes on a staging instance or replica before touching production.
A practical incident flow
When memory jumps, collect the facts before restarting Redis. A restart may hide the evidence.
Run:
redis-cli INFO memory
redis-cli INFO persistence
redis-cli DBSIZE
redis-cli --bigkeys
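To keep that evidence around after the incident, the cheap commands can be captured into one timestamped text snapshot. This is a minimal sketch that assumes redis-cli is on the PATH and can reach the instance with default connection settings. `collect_snapshot` and its `runner` parameter are hypothetical names, and --bigkeys is left out on purpose because it scans the whole keyspace.

```python
import subprocess
from datetime import datetime, timezone

# Cheap, read-only commands that are safe to run during an incident.
COMMANDS = [
    ["redis-cli", "INFO", "memory"],
    ["redis-cli", "INFO", "persistence"],
    ["redis-cli", "DBSIZE"],
]


def run(cmd: list) -> str:
    """Run one command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return result.stdout


def collect_snapshot(runner=run) -> str:
    """Return one text blob with a UTC timestamp and each command's output."""
    stamp = datetime.now(timezone.utc).isoformat()
    sections = [f"# snapshot {stamp}"]
    for cmd in COMMANDS:
        sections.append("$ " + " ".join(cmd))
        sections.append(runner(cmd))
    return "\n".join(sections)
```

Writing the returned string to a file before any restart gives you something to compare against the next sample or the next incident.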
Then ask what changed. Did a deploy remove TTLs? Did a queue consumer stop, causing lists to grow? Did a new reporting job run HGETALL on huge hashes? Did an AOF rewrite start during the traffic peak? Did the container memory limit change?
The best Redis memory fixes are usually plain: set a realistic maxmemory, choose an eviction policy that matches the workload, give every cache key a TTL, break up unbounded structures, keep persistence from running with no memory headroom, and alert on memory trends before the instance reaches the cliff.