Why Is Redis Using High CPU? Debugging and Optimization Techniques

Investigate sudden high CPU utilization in Redis, a critical in-memory data store. This guide details how to debug load using `SLOWLOG` and `INFO` commands to pinpoint inefficient operations like `KEYS *` or large key deletions. Learn practical optimization techniques, including switching to asynchronous `UNLINK`, utilizing pipelining, and tuning persistence settings, to immediately reduce server load and restore optimal Redis performance.

High Redis CPU usually means one of three things: Redis is doing too much command work on its main execution path, background work such as persistence is adding pressure, or clients are sending traffic in a shape Redis cannot process efficiently. The fix depends on which one is true.

Do not start by restarting Redis unless the service is already falling over. A restart may clear the symptom and erase the evidence. Start by capturing command latency, command mix, client count, persistence state, and host CPU. Those facts tell you whether you have a bad command, a bad traffic pattern, an overloaded single core, or a noisy host.

Understanding Redis Architecture and CPU Load

Redis is often described as single-threaded, which is mostly true for command execution, but modern Redis can also use background threads and optional I/O threading. The practical point is still the same: a command that takes too long can delay other clients, and one saturated core can be enough to create visible latency even when the machine has idle CPU elsewhere.

Key Factors Influencing Redis CPU Load

Common causes are expensive commands, large values, Lua scripts, too many small commands sent one round trip at a time, heavy connection churn, persistence activity, and memory pressure that forces the kernel to work harder than Redis expects.

Debugging High CPU Utilization

Before optimizing, you must accurately identify the source of the load. Monitoring tools and built-in Redis commands are essential for diagnosis.

1. Using INFO and LATENCY Commands

The INFO command provides a snapshot of server status. Focus on the CPU section and command statistics.

redis-cli INFO cpu

Look at the rate of change, not only the absolute values. used_cpu_user increasing quickly often points to command processing. used_cpu_sys increasing quickly can point to kernel work such as networking, memory management, or disk-related activity.
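
To make "rate of change" concrete, here is a minimal sketch that compares two saved `INFO cpu` outputs taken a known number of seconds apart. The helper names `parse_info` and `cpu_rates` are hypothetical, and the sketch assumes you captured the raw text yourself (for example by redirecting `redis-cli INFO cpu` to a file twice).

```python
def parse_info(text):
    """Parse raw INFO output into a dict of field name -> string value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# CPU".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def cpu_rates(before, after, interval_seconds):
    """Return CPU seconds consumed per wall-clock second between two samples."""
    first, second = parse_info(before), parse_info(after)
    return {
        name: (float(second[name]) - float(first[name])) / interval_seconds
        for name in ("used_cpu_user", "used_cpu_sys")
    }
```

A combined rate near 1.0 suggests the process is consuming roughly one core's worth of CPU, which is the point where command execution tends to become the bottleneck.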

The latency tools show event classes Redis has observed:

redis-cli LATENCY LATEST
redis-cli LATENCY DOCTOR

2. Identifying Slow Commands with SLOWLOG

The Redis slow log records commands whose execution time exceeds a configurable threshold, which makes it the most direct tool for finding poorly performing operations. It measures execution time only. It does not include network time or time spent waiting in a client pool, so use it alongside application latency metrics.

Example Configuration (the threshold is in microseconds, so 1000 logs any command slower than 1 ms):

slowlog-log-slower-than 1000
slowlog-max-len 1024

Retrieving the Log:

redis-cli SLOWLOG GET 10

Review the command name, key names, and duration. If KEYS, large HGETALL, huge SMEMBERS, broad sorted-set ranges, or Lua scripts dominate the log, the CPU issue is probably application-driven.
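
When the log is long, grouping entries by command name shows which commands dominate. The sketch below is one way to do that. It assumes you have already reduced each slow log entry to a `(command_line, duration_us)` pair; that shape and the function name `summarize_slowlog` are illustrative choices, not a Redis or client-library format.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group (command_line, duration_us) pairs by command name.

    Returns (name, count, total_us) tuples, slowest total first.
    """
    totals = defaultdict(lambda: [0, 0])
    for command_line, duration_us in entries:
        parts = command_line.split()
        name = parts[0].upper() if parts else "(empty)"
        totals[name][0] += 1
        totals[name][1] += duration_us
    rows = [(name, count, total) for name, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Sorting by total time rather than count keeps a rare but very expensive command from hiding behind many cheap ones.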

3. Monitoring Network and Client Activity

MONITOR is tempting during an incident, but it is expensive on a busy server. Prefer INFO commandstats, INFO clients, slow log, client-library metrics, and sampling from a replica if you have one.

Useful commands:

redis-cli INFO commandstats
redis-cli INFO clients
redis-cli CLIENT LIST

If command volume doubled after a deploy, you may see it in cmdstat_get, cmdstat_hgetall, or similar counters. If clients are constantly connecting and disconnecting, fix pooling before tuning Redis.
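
The counters in `INFO commandstats` are cumulative, so a before/after comparison is more useful than a single reading. The following sketch diffs two saved outputs. It assumes each line has the `cmdstat_<name>:calls=...,usec=...` shape; the function names are hypothetical.

```python
def parse_commandstats(text):
    """Parse raw INFO commandstats output into {command: {"calls", "usec"}}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue
        name, _, payload = line.partition(":")
        fields = dict(pair.split("=", 1) for pair in payload.split(",") if "=" in pair)
        stats[name[len("cmdstat_"):]] = {
            "calls": int(fields["calls"]),
            "usec": int(fields["usec"]),
        }
    return stats


def commandstats_delta(before, after):
    """Return (command, new_calls, new_usec) between two samples, costliest first."""
    old, new = parse_commandstats(before), parse_commandstats(after)
    rows = []
    for name, current in new.items():
        previous = old.get(name, {"calls": 0, "usec": 0})
        calls = current["calls"] - previous["calls"]
        usec = current["usec"] - previous["usec"]
        if calls > 0:
            rows.append((name, calls, usec))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Taking one sample before a deploy and one after makes a new or suddenly heavier command stand out at the top of the list.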

Common Causes and Optimization Strategies

Once you have identified problematic commands or processes, apply targeted optimization techniques.

1. Eliminating Blocking Commands

The fastest wins usually come from removing commands that force Redis to walk a huge keyspace or serialize a huge value.

Inefficient Command | Why it causes high CPU | Optimization / Alternative
------------------- | ---------------------- | --------------------------
KEYS * | Scans the entire keyspace, O(N). | Use SCAN iteratively or restructure data access.
FLUSHALL / FLUSHDB | Deletes every key, synchronously unless the ASYNC option is used. | Use careful scoped deletion, UNLINK, or an async flush only when appropriate.
HGETALL, SMEMBERS (on very large hashes or sets) | Reads the entire structure and serializes it into a single reply. | Use HSCAN, SSCAN, or break large structures into smaller keys.

Use UNLINK instead of DEL for very large keys. DEL frees memory synchronously. UNLINK removes the key from the keyspace and frees memory asynchronously, which usually reduces visible latency during large deletions.

# Instead of DEL large_key
UNLINK large_key
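
Replacing `KEYS` with `SCAN` and `DEL` with `UNLINK` often go together when cleaning up keys by pattern. The sketch below assumes a client object exposing `scan(cursor=..., match=..., count=...)` returning a `(next_cursor, keys)` pair and `unlink(*names)` returning the number of keys removed, which is how the redis-py client behaves to my knowledge. The function name `unlink_matching` and the default page size are illustrative.

```python
def unlink_matching(client, pattern, scan_count=500):
    """Walk the keyspace with SCAN and UNLINK matching keys one page at a time."""
    cursor = 0
    removed = 0
    while True:
        # Each SCAN call does a bounded amount of work, unlike KEYS.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=scan_count)
        if keys:
            removed += client.unlink(*keys)
        # A returned cursor of 0 means the iteration is complete.
        if cursor == 0:
            break
    return removed
```

Note that `count` is only a hint for how much work each `SCAN` call does, so a page can come back empty even when matching keys remain; the loop ends only when the cursor returns to 0.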

2. Optimizing Persistence (RDB and AOF)

RDB snapshots and AOF rewrites use background children and can still affect the parent through fork cost, copy-on-write memory, disk bandwidth, and CPU contention.

  • RDB Snapshots: If you are frequently saving (e.g., every minute), the repeated fork() calls will cause recurring CPU spikes. Reduce the frequency of automatic saves.
  • AOF Rewriting: AOF rewriting (BGREWRITEAOF) is also resource-intensive. The rewrite runs in a forked child so the parent can keep serving clients, but the fork, copy-on-write memory, and disk writes still raise CPU and I/O load while it runs.

If persistence lines up with CPU spikes, check INFO persistence and host disk metrics. You can reduce RDB frequency, schedule heavy backups away from traffic peaks, leave more memory headroom, or improve storage. Pausing persistence can reduce load, but it also increases data loss risk, so it should be a deliberate operational decision.
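
As a rough illustration of lining persistence up with a spike, the sketch below scans saved INFO output for signs of background persistence work. It assumes the text contains the `rdb_bgsave_in_progress` and `aof_rewrite_in_progress` fields from `INFO persistence` and, optionally, `latest_fork_usec` from `INFO stats`. The 100 ms fork threshold is an arbitrary assumption for the example, not a Redis recommendation.

```python
def persistence_flags(info_text, fork_warn_usec=100_000):
    """Return human-readable notes about persistence activity in INFO output."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value

    notes = []
    if fields.get("rdb_bgsave_in_progress") == "1":
        notes.append("RDB background save in progress")
    if fields.get("aof_rewrite_in_progress") == "1":
        notes.append("AOF rewrite in progress")
    fork_usec = int(fields.get("latest_fork_usec", "0"))
    if fork_usec > fork_warn_usec:
        notes.append(f"last fork took {fork_usec / 1000:.0f} ms")
    return notes
```

An empty result does not rule persistence out; it only means nothing was running at the moment the sample was taken.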

3. Handling Memory Fragmentation and Swapping

Memory problems usually show up as high memory usage, but severe fragmentation or, worse, the operating system swapping Redis pages to disk (thrashing) also drives CPU usage up sharply as the kernel struggles to manage memory.

  • Check Swapping: Use OS tools (vmstat, top) to check if the system is actively swapping memory pages belonging to the Redis process.
  • Memory Fragmentation Ratio: Check mem_fragmentation_ratio in INFO memory. A high ratio is a clue that allocator behavior may be wasting memory, but confirm with RSS, dataset size, and host memory metrics.

If swapping occurs, reduce the dataset, lower maxmemory, move work off the host, or add memory. Redis is not designed to perform well when its hot dataset is being paged to disk.
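
The fragmentation ratio is roughly resident set size divided by the memory Redis believes it is using, so it can be recomputed from `INFO memory` as a sanity check. The sketch below does that from saved output containing `used_memory` and `used_memory_rss`. The 1.5 cutoff and the verdict strings are assumptions for illustration; treat any verdict as a prompt to check host metrics, not as a diagnosis.

```python
def memory_check(info_text, high_ratio=1.5):
    """Return (rss / used_memory, verdict) from raw INFO memory output."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            fields[key] = value

    used = int(fields.get("used_memory", "0"))
    rss = int(fields.get("used_memory_rss", "0"))
    if used == 0:
        return 0.0, "no data"

    ratio = rss / used
    if ratio < 1.0:
        # RSS below the logical dataset size hints that pages were swapped out.
        verdict = "rss below used_memory: possible swapping"
    elif ratio > high_ratio:
        verdict = "high fragmentation ratio"
    else:
        verdict = "ok"
    return round(ratio, 2), verdict
```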

4. Network Optimization and Pipelining

If CPU load tracks a high number of small commands, the problem may be command overhead and network churn rather than one obviously slow command.

Pipelining lets a client send multiple commands without waiting for a response after each one. It reduces round trips and can improve throughput for bulk writes or reads. Keep pipeline batches bounded; a pipeline with thousands of heavy commands can create its own latency spike.
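
A bounded pipeline can be sketched as follows. The code assumes a client whose `pipeline(transaction=False)` method returns an object that queues commands and sends them on `execute()`, as redis-py does to my knowledge. The helper names `chunked` and `pipelined_set` and the default batch size of 500 are illustrative, not tuned values.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def pipelined_set(client, pairs, batch_size=500):
    """SET many key/value pairs using one round trip per bounded batch."""
    written = 0
    for batch in chunked(pairs, batch_size):
        # transaction=False requests plain pipelining without MULTI/EXEC.
        pipe = client.pipeline(transaction=False)
        for key, value in batch:
            pipe.set(key, value)
        pipe.execute()
        written += len(batch)
    return written
```

Capping the batch size limits how long other clients wait behind a single batch, while still removing most of the per-command round trips.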

Best Practices for Sustained Performance

To prevent future CPU spikes, adopt these architectural and configuration best practices:

  1. Use UNLINK for keys that may be large.
  2. Replace KEYS with SCAN, and replace full collection reads with cursor-based reads.
  3. Track INFO commandstats after deployments so a new command pattern does not surprise you.
  4. Tune persistence with actual disk and memory headroom in mind.
  5. If one Redis instance is legitimately saturated after command fixes, split the workload with Redis Cluster, client-side sharding, separate cache/session instances, or a larger instance with better single-core performance.

A quick incident checklist

During a spike, run:

redis-cli INFO cpu
redis-cli INFO commandstats
redis-cli INFO clients
redis-cli INFO memory
redis-cli INFO persistence
redis-cli SLOWLOG GET 20
redis-cli LATENCY LATEST

Then line those results up with application deploys, cron jobs, traffic changes, persistence events, and host metrics. High Redis CPU is usually fixable, but the fix is specific: remove the expensive command, batch the chatty client, stop connection churn, give persistence room to work, or split the workload when a single instance is genuinely at its limit.