Troubleshooting MongoDB Replication Lag: Causes and Solutions
Learn how to diagnose and resolve replication lag in MongoDB replica sets. This guide covers common causes, including high write loads, hardware bottlenecks, and network issues. Discover actionable monitoring techniques using `rs.printReplicationInfo()` and practical solutions to maintain data synchronization, ensuring high availability and read consistency across all your database nodes.
MongoDB replication lag usually starts as a small operational annoyance. A chart begins climbing. A secondary falls behind by 15 seconds, then 2 minutes. Someone asks whether reads are stale. Someone else suggests restarting the node. Before you do that, slow down and figure out which part of replication is losing ground.
MongoDB secondaries copy operations from the primary's oplog and apply them locally. Replication lag means a secondary has not applied operations as recently as the primary has. That can affect secondary reads, backups taken from secondaries, analytics jobs, and failover. It can also hide a bigger risk: if the secondary falls behind farther than the oplog window, it may not be able to catch up from the oplog at all.
The fastest troubleshooting path is to answer three questions:
- Is every secondary lagging, or only one?
- Is the lag temporary, steady, or growing?
- Is the secondary still within the oplog window?
Those answers decide what you do next.
Measure Lag Without Guessing
Start in mongosh:
rs.status()
Find the primary and compare its optimeDate with each secondary's optimeDate. Also look for unhealthy members, heartbeat messages, and members stuck in states such as RECOVERING or STARTUP2.
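The optimeDate comparison can be scripted. The following is a minimal sketch, assuming only the members[].name, stateStr, and optimeDate fields of the rs.status() output; the function name replicationLag is made up for illustration.

```javascript
// Compute lag in seconds for every data-bearing non-primary member of a
// document shaped like the output of rs.status().
function replicationLag(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no PRIMARY found in status output");
  }
  return status.members
    .filter((m) => m.stateStr !== "PRIMARY" && m.stateStr !== "ARBITER")
    .map((m) => ({
      name: m.name,
      state: m.stateStr,
      // Subtracting two Date objects yields milliseconds.
      lagSeconds: Math.max(0, (primary.optimeDate - m.optimeDate) / 1000),
    }));
}

// In mongosh the rs helper exists; anywhere else this block is skipped.
if (typeof rs !== "undefined") {
  printjson(replicationLag(rs.status()));
}
```

The state is kept in the result so that a member stuck in RECOVERING or STARTUP2 stands out next to its lag value.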
For a friendlier summary, run:
rs.printSecondaryReplicationInfo()
Older material uses rs.printSlaveReplicationInfo(), a deprecated name for the same helper. If you maintain older systems, you may still see it. The modern wording is "secondary".
Then check the oplog window:
rs.printReplicationInfo()
The oplog window is the amount of history currently retained in the oplog. If your secondary is 40 minutes behind and the oplog window is several days, you have room to troubleshoot. If your secondary is 40 minutes behind and the oplog window is 1 hour during peak traffic, you are close to a rebuild situation.
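The comparison between lag and oplog window is simple arithmetic. A rough sketch, where the function name and the 0.5 warning ratio are illustrative choices rather than MongoDB settings:

```javascript
// Compare a member's lag with the oplog window. Both values are in seconds.
// warnRatio is an arbitrary example threshold; pick one that fits your set.
function oplogHeadroom(lagSeconds, windowSeconds, warnRatio = 0.5) {
  const ratio = lagSeconds / windowSeconds;
  let verdict = "ok";
  if (ratio >= 1) {
    verdict = "beyond window";
  } else if (ratio >= warnRatio) {
    verdict = "at risk";
  }
  return { headroomSeconds: windowSeconds - lagSeconds, ratio, verdict };
}

// The two scenarios from the paragraph above.
console.log(oplogHeadroom(40 * 60, 3 * 24 * 3600).verdict); // ok
console.log(oplogHeadroom(40 * 60, 3600).verdict); // at risk
```

In mongosh, the timeDiff field returned by db.getReplicationInfo() should give the window in seconds, so it can be passed in as windowSeconds.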
Do not rely only on SecondsBehind-style values from a single tool. Clock skew, delayed members, and brief bursts can make one number misleading. Compare status output with monitoring graphs for write volume, disk latency, CPU, and network throughput.
If All Secondaries Are Lagging
When every secondary falls behind at roughly the same time, the cause is usually upstream of any one secondary. Look at the primary's write workload first.
Common triggers include:
- Bulk imports or backfills.
- Large updateMany or deleteMany operations.
- TTL cleanup after a period of backlog.
- Application deploys that changed write volume.
- Index builds or schema maintenance.
- A sudden increase in small writes that create many oplog entries.
Ask what changed at the same time the lag started. A spike that begins exactly when a nightly job starts is rarely a MongoDB mystery.
On the primary, inspect active operations:
db.currentOp({ active: true })
If you find a batch job, consider throttling it instead of letting it finish at maximum speed. For example, process documents in _id ranges, sleep between batches, and watch lag. This is especially useful for cleanup jobs where finishing in 30 minutes is less important than keeping the replica set healthy.
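One way to structure that kind of throttling is sketched below. It assumes three caller-supplied hooks, all hypothetical: processBatch handles one _id range and returns the number of documents it touched, getLagSeconds returns the worst current secondary lag, and pause blocks for a number of milliseconds (mongosh provides sleep for that).

```javascript
// Run a batch job while watching replication lag.
//   processBatch()  -> documents handled in this batch (0 means the job is done)
//   getLagSeconds() -> current worst secondary lag in seconds
//   pause(ms)       -> blocks for ms milliseconds
// The default limits are illustrative, not recommendations.
function runThrottled(processBatch, getLagSeconds, pause, opts = {}) {
  const { maxLagSeconds = 30, pauseMs = 1000, maxWaits = 600 } = opts;
  let total = 0;
  let waits = 0;
  for (;;) {
    // Back off while the secondaries are too far behind.
    while (getLagSeconds() > maxLagSeconds) {
      waits += 1;
      if (waits > maxWaits) {
        throw new Error("lag did not recover; stopping the job");
      }
      pause(pauseMs);
    }
    const handled = processBatch();
    if (handled === 0) {
      return { total, waits };
    }
    total += handled;
    pause(pauseMs); // short breather between batches
  }
}
```

The job stops with an error instead of waiting forever, so a secondary that never recovers becomes a visible failure and not a silently stalled cleanup.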
If sustained write volume is simply higher than the replica set can handle, you need a capacity or architecture change. Better disks, more CPU, a different instance class, write-path optimization, or sharding may be the right answer. Changing read preference will not fix a primary that is producing more work than the set can apply.
If Only One Secondary Is Lagging
One lagging secondary usually points to a local problem. Log in to that host and check the basics:
iostat -xz 1
vmstat 1
top
For a MongoDB-level view, run the database tools from a shell:
mongostat --host secondary.example.com:27017
mongotop --host secondary.example.com:27017
Disk is a common culprit. A secondary using slower storage than the primary may be fine during normal traffic and then fall behind during bursts. Cloud volumes can also hit throughput or IOPS ceilings. Look for high utilization, high await times, and queueing.
CPU can matter when the workload includes many updates, compression, encryption, or heavy query traffic on the same member. Memory pressure matters when the secondary cannot keep hot data and indexes in cache while applying writes.
Also check what else runs on the host. Backups, antivirus scans, filesystem snapshots, log compression, and reporting queries can all compete with replication. If the lagging node is also the "safe place" where everyone runs ad-hoc analytics, you have probably found the problem.
Reads on Secondaries Can Create Lag
Secondary reads are not free. They use the same cache, CPU, and disk that replication needs. A single aggregation that scans a large collection can be enough to make a secondary fall behind during a busy period.
Look for long-running reads:
db.currentOp({ active: true })
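The full output is long on a busy member. A small filter can narrow it to operations that have been running for a while. This sketch assumes the inprog[].op, secs_running, opid, and ns fields of the db.currentOp() result; the function name and the default list of operation types are illustrative, and aggregations are assumed to appear with op set to "command".

```javascript
// Pick long-running operations out of a document shaped like the result of
// db.currentOp(), slowest first.
function longRunningOps(currentOp, minSeconds, opTypes = ["query", "getmore", "command"]) {
  return currentOp.inprog
    .filter((o) => opTypes.includes(o.op) && (o.secs_running || 0) >= minSeconds)
    .sort((a, b) => (b.secs_running || 0) - (a.secs_running || 0))
    .map((o) => ({ opid: o.opid, op: o.op, ns: o.ns, secs: o.secs_running }));
}

// In mongosh, run it against the live member; elsewhere this is skipped.
if (typeof db !== "undefined") {
  printjson(longRunningOps(db.currentOp({ active: true }), 30));
}
```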
If the application sends reads to secondaries, review the read preference. The secondary mode can force reads to lagging members, and secondaryPreferred can still return stale data. For user flows that must read their own writes, use the primary. For eventually consistent reads, set maxStalenessSeconds so the driver avoids secondaries that are too far behind.
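As an illustration, the staleness bound can be set in the connection string. The helper below is a sketch with made-up host and replica set names; it also enforces the 90-second minimum that MongoDB applies to maxStalenessSeconds.

```javascript
// Build a connection string that prefers secondaries but skips stale ones.
// Host names and the replica set name are placeholders.
function staleBoundedUri(hosts, replicaSet, maxStalenessSeconds) {
  if (!Number.isInteger(maxStalenessSeconds) || maxStalenessSeconds < 90) {
    throw new RangeError("maxStalenessSeconds must be an integer of at least 90");
  }
  const params = new URLSearchParams({
    replicaSet,
    readPreference: "secondaryPreferred",
    maxStalenessSeconds: String(maxStalenessSeconds),
  });
  return `mongodb://${hosts.join(",")}/?${params}`;
}

console.log(
  staleBoundedUri(["db1.example.com:27017", "db2.example.com:27017"], "rs0", 120)
);
```

A bound like this limits how stale a chosen secondary can be. It does not give read-your-own-writes behavior, so flows that need it still belong on the primary.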
For reporting workloads, consider a hidden secondary or a separate analytics pipeline. Hidden members can still replicate, but drivers will not choose them for normal reads. That makes them a better place for backups or controlled reporting jobs, as long as you size them properly.
Oplog Size Is a Recovery Margin, Not a Speed Fix
A too-small oplog does not usually cause lag by itself. It makes lag dangerous. If a secondary falls behind and the needed oplog entries are overwritten, it cannot catch up normally.
Your oplog window should be longer than your realistic outage and maintenance scenarios. If a secondary may be offline for 6 hours during patching, a 4-hour oplog window is not enough. If a quarterly import burns through the oplog in a few hours, size for that workload or change how the import runs.
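A back-of-the-envelope estimate helps here. The sketch below assumes the write rate seen in the current window stays roughly constant, which is a simplification; the function name and the 1.5 safety factor are illustrative.

```javascript
// Estimate the oplog size (in MB) needed to cover a target window, given how
// much oplog the current window occupies.
function oplogSizeForWindow(usedMB, windowHours, targetHours, safetyFactor = 1.5) {
  const mbPerHour = usedMB / windowHours;
  return Math.ceil(mbPerHour * targetHours * safetyFactor);
}

// Example: 4 hours of history occupy 8192 MB, and patching may take 6 hours.
console.log(oplogSizeForWindow(8192, 4, 6)); // 18432
```

Measure the inputs during peak traffic, not on a quiet afternoon. A window that looks generous at low write volume can shrink quickly when a bulk job starts.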
On supported versions, resize with replSetResizeOplog on each member that needs a larger oplog:
use admin
db.adminCommand({ replSetResizeOplog: 1, size: 20480 })
The size is given in megabytes, so that example requests 20 GB. In managed platforms, use the managed configuration method. Avoid old advice that drops and recreates the oplog unless you are following a carefully tested recovery procedure.
After increasing the oplog, keep troubleshooting the underlying lag. A larger oplog gives you more time; it does not remove disk saturation, network limits, or excessive write bursts.
Network Checks That Actually Help
Network issues are more likely when lag affects a remote secondary, one availability zone, or one data center path. Start simple:
ping primary.example.com
traceroute primary.example.com
Then look beyond latency. Replication needs reliable throughput. Packet loss, firewall inspection, VPN limits, cross-region bandwidth caps, or overloaded network interfaces can create lag even when ping looks acceptable.
If only the cross-region member lags, compare it with a local secondary under the same write load. You may need a different topology, a bigger link, or a clearer expectation that remote members are for disaster recovery rather than fresh reads.
Data and Index Drift
Replica set members should have the same indexes. If they do not, oplog application can slow down or fail. This usually comes from manual changes, failed maintenance, or a member restored from an inconsistent source.
Compare indexes on hot collections:
db.orders.getIndexes()
Run it on the primary and on the lagging secondary. If definitions differ, fix the drift deliberately. Rebuilding a large index can add more load, so schedule it carefully or rebuild the member from a clean source if the differences are widespread.
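Comparing two long index lists by eye is error-prone. The sketch below diffs two arrays shaped like getIndexes() output. It compares only name, key, and unique, which is a deliberate simplification; the function name is hypothetical.

```javascript
// Report index differences between two members for one collection.
function diffIndexes(primaryIdx, secondaryIdx) {
  // Key order matters for compound indexes, and JSON.stringify preserves it.
  const fingerprint = (ix) => JSON.stringify({ key: ix.key, unique: !!ix.unique });
  const toMap = (list) => new Map(list.map((ix) => [ix.name, fingerprint(ix)]));
  const p = toMap(primaryIdx);
  const s = toMap(secondaryIdx);
  const report = { missingOnSecondary: [], extraOnSecondary: [], different: [] };
  for (const [name, fp] of p) {
    if (!s.has(name)) {
      report.missingOnSecondary.push(name);
    } else if (s.get(name) !== fp) {
      report.different.push(name);
    }
  }
  for (const name of s.keys()) {
    if (!p.has(name)) {
      report.extraOnSecondary.push(name);
    }
  }
  return report;
}
```

Save the getIndexes() output from each member, for example as JSON, and pass both arrays in. Extend the fingerprint if you rely on options such as partial filters or TTL settings.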
Data divergence is more serious. If replication errors show missing records or duplicate keys, lag is no longer the only problem. You need to inspect the error, compare data, and decide whether a collection-level repair, resync, or full rebuild is the safest path.
Be Conservative with Restarts and Initial Sync
Restarting a lagging secondary sometimes helps if the process is stuck behind a transient issue. It is not a universal fix. If the member is close to the edge of the oplog window, a restart can cost enough time to push it into an unrecoverable state.
Before restarting, check:
- Current lag.
- Current oplog window.
- Whether the member is syncing.
- Whether other healthy secondaries exist.
- Whether the replica set can tolerate the member being down.
Initial sync is the clean answer when a secondary cannot catch up or its data is not trustworthy. It is also heavy. It copies data, builds indexes, and consumes resources from another member. Rebuild one member at a time, and make sure your voting configuration still supports safe elections while the node is rebuilding.
When You Should Not Rush to Fix It
Some lag is expected during controlled work. If you are running a planned backfill, restoring a secondary, or importing historical data, the useful question is whether the secondary is catching up at an acceptable rate. A lag graph that rises for 20 minutes and then steadily falls may not need intervention. A lag graph that rises every day and never returns to baseline does.
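The difference between those two graphs can be checked mechanically. This is a deliberately crude sketch that compares the average of the older half of a series of lag samples with the newer half; the function name and the 5-second tolerance are illustrative.

```javascript
// Classify a series of lag samples (seconds, oldest first, evenly spaced).
// toleranceSeconds is an arbitrary noise threshold for this example.
function lagTrend(samples, toleranceSeconds = 5) {
  if (samples.length < 2) {
    return "unknown";
  }
  const half = Math.floor(samples.length / 2);
  const avg = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const delta = avg(samples.slice(half)) - avg(samples.slice(0, half));
  if (delta > toleranceSeconds) {
    return "growing";
  }
  if (delta < -toleranceSeconds) {
    return "recovering";
  }
  return "steady";
}

console.log(lagTrend([10, 20, 40, 80])); // growing
console.log(lagTrend([80, 40, 20, 10])); // recovering
```

A real alert would look at longer windows and at whether the lag returns to baseline each day, but even this split separates a backlog that is draining from one that is not.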
This distinction matters because some fixes are disruptive. Killing a batch job may leave application data half-updated. Restarting a secondary may cost cache warmth and make catch-up slower. Rebuilding a member may consume more network and disk than simply letting it apply the backlog.
For planned jobs, set a lag budget before the job starts. For example, you might decide that a maintenance backfill can create up to 10 minutes of lag on a reporting secondary, but not on a failover candidate. Watch the lag, the oplog window, and the write rate while the job runs. If the job approaches the budget, pause it or reduce batch size.
It also helps to separate user-facing replicas from maintenance replicas. A secondary used for application reads should have a tighter lag tolerance than a hidden member used for backups. If every secondary has a different job, alert thresholds should reflect those jobs instead of using one number for the whole set.
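Per-role thresholds are easy to express as data. In the sketch below, the role names and budget values are examples (the 600-second reporting budget mirrors the 10-minute figure above), and each member is assumed to be tagged with a role by whoever collects the lag numbers.

```javascript
// Hypothetical per-role lag budgets in seconds; tune these to your own set.
const LAG_BUDGETS = { failover: 10, appReads: 30, reporting: 600 };

// Return one alert line for every member whose lag exceeds its role's budget.
function overBudget(members, budgets = LAG_BUDGETS) {
  const alerts = [];
  for (const m of members) {
    const limit = budgets[m.role];
    if (limit === undefined) {
      throw new Error(`no lag budget defined for role "${m.role}"`);
    }
    if (m.lagSeconds > limit) {
      alerts.push(`${m.name} (${m.role}): ${m.lagSeconds}s > ${limit}s`);
    }
  }
  return alerts;
}

console.log(
  overBudget([
    { name: "db2", role: "failover", lagSeconds: 45 },
    { name: "db3", role: "reporting", lagSeconds: 45 },
  ])
);
```

The same 45 seconds of lag raises an alert for the failover candidate and none for the reporting member, which is the point of budgeting by role.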
What to Record During an Incident
Replication incidents are much easier to understand after the fact if you save the right evidence. Before changing configuration, capture:
rs.status()
rs.conf()
rs.printReplicationInfo()
rs.printSecondaryReplicationInfo()
Also save host-level metrics from the primary and the lagging secondary: disk latency, CPU, memory, and network throughput. If a batch job or deploy was running, record its start time and command or release version.
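Capturing the shell output in one timestamped document makes it easier to line up with host metrics later. The following is a sketch only: the shell argument is a hypothetical wrapper so the function can run outside mongosh, and it uses db.getReplicationInfo() to get the oplog details as an object instead of printed text.

```javascript
// Gather replication evidence into one timestamped object.
// `shell` is a hypothetical wrapper exposing status(), config(), and oplogInfo().
function captureReplicationSnapshot(shell, note = "") {
  return {
    capturedAt: new Date().toISOString(),
    note,
    status: shell.status(),
    config: shell.config(),
    oplog: shell.oplogInfo(),
  };
}

// In mongosh, wire the wrapper to the real helpers; elsewhere this is skipped.
if (typeof rs !== "undefined" && typeof db !== "undefined") {
  const snapshot = captureReplicationSnapshot(
    {
      status: () => rs.status(),
      config: () => rs.conf(),
      oplogInfo: () => db.getReplicationInfo(),
    },
    "before config change"
  );
  print(EJSON.stringify(snapshot, null, 2));
}
```

Redirect the output to a file named after the incident so that the snapshot survives the shell session.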
This is not paperwork for its own sake. Without a timeline, the next incident starts from zero. With a timeline, you may notice that lag always follows a specific export, backup, or cleanup task. That turns a vague database problem into a schedulable capacity problem.
A Practical Fix Map
Use the symptom to choose the next move:
| Symptom | Likely area | Next action |
|---|---|---|
| All secondaries lag during batch job | Write burst | Throttle or split the job |
| One secondary always lags | Local resource issue | Check disk, CPU, memory, and local reads |
| Lag grows on remote member only | Network/topology | Check throughput, packet loss, and cross-region design |
| Lag is near oplog window | Recovery risk | Increase oplog and reduce lag source |
| Secondary serves stale reads | Read preference | Use primary for fresh reads or set maxStalenessSeconds |
| Member cannot catch up after downtime | Missing oplog history | Rebuild from backup or initial sync |
Good MongoDB replication troubleshooting is mostly disciplined observation. Find whether the primary is producing too much work, the secondary is applying too slowly, or the link between them is constrained. Then change the thing that is actually limiting replication instead of applying a generic restart, resync, or configuration tweak.