5 Common MongoDB Troubleshooting Scenarios and Quick Fixes

MongoDB troubleshooting usually starts when your app gets slow, writes fail, or a replica set falls behind. This guide walks through five common scenarios you are likely to see in production and shows where to look first.

Use these checks as a first pass before you make bigger changes. They help you separate query problems from infrastructure, replication, or sharding issues.

1. Slow Query Performance

Slow queries are among the most common performance problems in production. A query that takes seconds instead of milliseconds can severely degrade application responsiveness.

Diagnosis: Using `explain()`

The first step in diagnosing a slow query is understanding why it is slow. MongoDB's explain() method is the essential tool for this analysis. It shows the execution plan, detailing which indexes were used (or not used).

Command example:

db.collection.find({ field: 'value' }).explain('executionStats')

Analyze the output, specifically looking for:

winningPlan.stage: If the stage is COLLSCAN, MongoDB is reading every document. That often points to a missing or unusable index.
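executionStats.nReturned compared with executionStats.totalKeysExamined and executionStats.totalDocsExamined: If MongoDB examines far more keys or documents than it returns, the query is doing wasted work, and the index is either missing or a poor match for the filter.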
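
As a rough sketch of that comparison, the helper below takes an object shaped like the result of explain('executionStats') and reports how many documents and keys were examined per document returned. The function name summarizeExplain and the sample values are illustrative assumptions. It reads only the top-level winning stage, so a COLLSCAN nested under another stage would need a recursive walk.

```javascript
// Hypothetical helper: summarizes an explain('executionStats') result.
function summarizeExplain(plan) {
  const stats = plan.executionStats;
  const stage = plan.queryPlanner.winningPlan.stage;
  // Avoid dividing by zero when the query matched nothing.
  const returned = Math.max(stats.nReturned, 1);
  return {
    stage,
    collectionScan: stage === 'COLLSCAN',
    docsExaminedPerReturned: stats.totalDocsExamined / returned,
    keysExaminedPerReturned: stats.totalKeysExamined / returned,
  };
}

// Sample data shaped like explain output (values are made up for illustration).
const samplePlan = {
  queryPlanner: { winningPlan: { stage: 'COLLSCAN' } },
  executionStats: { nReturned: 10, totalKeysExamined: 0, totalDocsExamined: 50000 },
};

const explainSummary = summarizeExplain(samplePlan);
console.log(explainSummary);
```

In the shell, you could pass the real result instead of the sample, for example summarizeExplain(db.orders.find({ user_id: 42 }).explain('executionStats')). A ratio close to 1 suggests the index fits the query; a ratio in the thousands suggests it does not.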

Quick Fixes

Create the right index: If the query plan shows a collection scan, add an index that matches the filter and sort pattern. For example, if your app frequently searches orders by user_id and newest timestamp, create a compound index:

db.orders.createIndex({ user_id: 1, timestamp: -1 })

Refine the query: Check whether you are fetching too much data. Use projection to return only the fields the page or job actually needs.
Review slow query logs: Use the profiler or slow query log with a threshold that fits your workload. Treat any exact threshold as an operational choice, not a universal rule.

Tip: Indexes improve read speed but add overhead to writes, because every insert, update, or delete must also update each affected index. Only index fields that are frequently used in query predicates (find()), sort operations (sort()), or range queries.
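
To illustrate how an index can be matched to a filter and sort pattern, here is a minimal sketch that builds a compound index key pattern from the fields a query filters on by equality and the sort it requests. The helper name buildIndexSpec is hypothetical, and placing equality fields before sort fields is a common guideline assumed here, not a rule stated by this guide.

```javascript
// Hypothetical helper: builds a compound index key pattern that lists
// equality-filter fields first, followed by the sort fields.
function buildIndexSpec(equalityFields, sortSpec) {
  const spec = {};
  for (const field of equalityFields) {
    spec[field] = 1;
  }
  for (const [field, direction] of Object.entries(sortSpec)) {
    // Keep the first occurrence if a field is both filtered and sorted.
    if (!(field in spec)) spec[field] = direction;
  }
  return spec;
}

// Orders filtered by user_id and sorted newest first.
const orderIndex = buildIndexSpec(['user_id'], { timestamp: -1 });
console.log(JSON.stringify(orderIndex));
```

The resulting object has the same shape as the argument to createIndex() in the orders example, so it could be passed as db.orders.createIndex(orderIndex) after you have confirmed with explain() that the index is needed.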

2. Replication Lag in Replica Sets

Replication lag occurs when secondary members of a replica set fall significantly behind the primary member in applying operations from the oplog (operation log).

Diagnosis: Checking `replSetGetStatus`

Run rs.status(), the shell helper for the replSetGetStatus command, on any member of the replica set to examine the health and synchronization status of all members.

Command example:

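// Oplog size and time window on the member you are connected to:
rs.printReplicationInfo()
// How far each secondary is behind the primary:
rs.printSecondaryReplicationInfo()
// Full member status (wraps the replSetGetStatus command):
rs.status()

In the rs.status() output, compare the optimeDate of the primary with the optimeDate of each secondary. The difference between the two is that secondary's lag. rs.printSecondaryReplicationInfo() prints the same figure directly as the number of seconds each secondary is behind the primary.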
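
The optime comparison can be sketched as a small function over an rs.status()-shaped object. The helper name replicationLagSeconds and the sample hosts and timestamps are assumptions for illustration; only the members, name, stateStr, and optimeDate fields are read.

```javascript
// Hypothetical helper: computes per-secondary lag in seconds from an
// rs.status()-shaped object by comparing optimeDate values.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === 'PRIMARY');
  if (!primary) return null; // no primary, so lag cannot be computed
  return status.members
    .filter((m) => m.stateStr === 'SECONDARY')
    .map((m) => ({
      name: m.name,
      // Subtracting two Date objects yields milliseconds.
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

// Sample data with made-up hosts and timestamps.
const sampleStatus = {
  members: [
    { name: 'db1:27017', stateStr: 'PRIMARY', optimeDate: new Date('2024-01-01T00:01:00Z') },
    { name: 'db2:27017', stateStr: 'SECONDARY', optimeDate: new Date('2024-01-01T00:00:58Z') },
    { name: 'db3:27017', stateStr: 'SECONDARY', optimeDate: new Date('2024-01-01T00:00:15Z') },
  ],
};

const lagReport = replicationLagSeconds(sampleStatus);
console.log(lagReport);
```

In the shell, replicationLagSeconds(rs.status()) would run the same calculation against live data. A lag of a few seconds on a busy set is often acceptable; what counts as too much depends on your workload.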

Quick Fixes

Check network latency: High latency between members can slow oplog transfer.
Check the lagging secondary: High CPU, slow disk I/O, or noisy neighbor workloads can stop a secondary from applying writes fast enough.
Review oplog coverage: If the lag is severe, the secondary may no longer have the oplog entries it needs. In that case, you may need to resync or rebuild that member.

3. Connection Errors and Authentication Failures

Application services frequently fail to connect to MongoDB due to configuration errors, firewall issues, or incorrect credentials.

Diagnosis: Checking Logs and Network

First, verify if the MongoDB server is listening on the expected IP address and port. Check the MongoDB server logs for specific errors.

Common Log Errors:

Address already in use: Another process is using the port.
Connection refused (reported by the client): The server process is down, blocked by a firewall, or listening on a different address or port.
Authentication failed: The username, password, authentication database, or role assignment is wrong.

Quick Fixes

Check firewall rules: Make sure the MongoDB port, often 27017, is reachable from the application hosts.
Verify bindIp: If mongod.conf binds only to 127.0.0.1, remote clients cannot connect. Bind to a specific private interface when possible. Avoid 0.0.0.0 unless network controls and authentication are already in place.
Check authSource: If the user was created in admin, the connection string may need ?authSource=admin.
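
As an example of the authSource fix, the sketch below assembles a connection string with the authentication database set explicitly. The helper name buildMongoUri, the host, the database, and the credentials are all made up for illustration. It percent-encodes the username and password, since characters such as @ or : would otherwise break the URI.

```javascript
// Hypothetical helper: builds a MongoDB connection string with authSource.
function buildMongoUri({ user, password, host, port = 27017, database, authSource }) {
  // Percent-encode credentials so characters such as '@' or ':' do not break the URI.
  const credentials = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  const query = authSource ? `?authSource=${encodeURIComponent(authSource)}` : '';
  return `mongodb://${credentials}@${host}:${port}/${database}${query}`;
}

// Illustrative values only; do not hard-code real credentials.
const uri = buildMongoUri({
  user: 'appUser',
  password: 'p@ss:word',
  host: 'db.internal.example',
  database: 'shop',
  authSource: 'admin',
});
console.log(uri);
// prints "mongodb://appUser:p%40ss%3Aword@db.internal.example:27017/shop?authSource=admin"
```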

4. Running Out of Disk Space

MongoDB persists its data files, indexes, and journal on disk. Unexpected data growth or cleanups that never release space can exhaust the volume, at which point write operations start to fail.

Diagnosis: Monitoring and `db.stats()`

Use OS monitoring tools (df -h on Linux) to check overall disk usage. Within MongoDB, use the db.stats() command to see how much space individual databases are consuming.

Command example:

db.stats()

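Look specifically at dataSize (the uncompressed size of the documents) and storageSize (the space allocated on disk for them). Add indexSize to storageSize for an approximate on-disk footprint of the database.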
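
To make those fields easier to compare with df -h output, here is a minimal sketch that converts the byte counts in a db.stats()-shaped object to MiB. The helper name summarizeDbStats and the sample numbers are assumptions for illustration, and it assumes db.stats() was called without a scale argument, so the values are in bytes.

```javascript
// Hypothetical helper: converts the byte counts from db.stats() into MiB
// and adds up the on-disk footprint of data plus indexes.
function summarizeDbStats(stats) {
  const toMiB = (bytes) => Math.round(bytes / (1024 * 1024));
  return {
    db: stats.db,
    dataMiB: toMiB(stats.dataSize),
    storageMiB: toMiB(stats.storageSize),
    indexMiB: toMiB(stats.indexSize),
    onDiskMiB: toMiB(stats.storageSize + stats.indexSize),
  };
}

// Sample data with made-up sizes: 8 GiB of documents stored in 3 GiB on disk.
const sampleStats = {
  db: 'shop',
  dataSize: 8 * 1024 * 1024 * 1024,
  storageSize: 3 * 1024 * 1024 * 1024,
  indexSize: 512 * 1024 * 1024,
};

const statsSummary = summarizeDbStats(sampleStats);
console.log(statsSummary);
```

In the shell, summarizeDbStats(db.stats()) would produce the same summary for the current database.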

Quick Fixes

Buy time if writes are failing: Stop non-essential jobs, remove unrelated temporary files, or expand the volume if your platform supports it.
Remove unused data: Drop old collections or databases only after you confirm they are no longer needed and backups exist.
Compact carefully: For collections with many deletes or updates, compact may free reserved space, but it can be disruptive. Test the impact for your MongoDB version and storage engine:

db.runCommand({ compact: 'myCollection' })

Increase storage capacity: The long-term fix is usually larger disks, better retention rules, or separate storage for logs and backups.

Warning: If the disk fills up entirely, writes fail and the mongod process may shut down. Free or add space before you try to resume normal operations.

5. Sharding Cluster Errors (Stale Routers/Config Servers)

In sharded environments, connectivity or state issues within the configuration servers (config servers) or query routers (mongos instances) can halt the entire system.

Diagnosis: Checking Cluster Health

The sh.status() command run against a mongos instance is the primary diagnostic tool for sharding health.

Command example:

sh.status()

Key areas to check in the output include:

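Config servers: Confirm the config server replica set has a healthy majority. sh.status() needs the config servers to answer, so if it hangs or returns an error, run rs.status() on the config server replica set.
Shards: Verify that every expected shard is listed and reachable.
Stale status: Look for signs that a router or shard has stale metadata, such as StaleConfig errors in the mongos logs.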
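
The majority check for the config server replica set can be sketched as follows, working on an rs.status()-shaped object. The helper name hasHealthyMajority and the sample hosts are hypothetical, and the sketch assumes every member is a voting member, which may not hold for your deployment.

```javascript
// Hypothetical helper: checks whether a replica set has a healthy majority,
// given an rs.status()-shaped object. Assumes all members are voting members.
function hasHealthyMajority(status) {
  const healthyStates = new Set(['PRIMARY', 'SECONDARY']);
  const healthy = status.members.filter(
    (m) => m.health === 1 && healthyStates.has(m.stateStr)
  ).length;
  // A majority means strictly more than half of the members.
  return healthy > status.members.length / 2;
}

// Sample data: a three-member config server set with one member down.
const configStatus = {
  members: [
    { name: 'cfg1:27019', health: 1, stateStr: 'PRIMARY' },
    { name: 'cfg2:27019', health: 1, stateStr: 'SECONDARY' },
    { name: 'cfg3:27019', health: 0, stateStr: '(not reachable/healthy)' },
  ],
};

const configHealthy = hasHealthyMajority(configStatus);
console.log(configHealthy);
```

With one of three members down the set still has a majority. With two down it does not, and metadata operations that depend on the config servers can fail.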

Quick Fixes

Restart mongos when appropriate: If one router is stale or unresponsive, restarting it forces it to reload metadata from the config servers. Running the flushRouterConfig command on that mongos is a less disruptive first attempt.
Fix config server health first: If the config server replica set lacks a healthy majority, shard metadata operations can fail.
Resolve shard-level problems: If a shard is down because of disk pressure or replication lag, fix that root cause before chasing router symptoms.

When to See a Professional

Bring in a MongoDB administrator or platform engineer when data loss is possible, a replica set needs a resync, config servers are unhealthy, or disk space is already affecting writes. Get help before running disruptive commands such as compaction or member rebuilds in production.

Takeaway

Start MongoDB troubleshooting with the symptom closest to the user impact: slow page, failed connection, stalled write, lagging secondary, or sharded cluster error. Then use explain(), rs.status(), db.stats(), and sh.status() to confirm the cause before changing indexes, restarting routers, or rebuilding members.
