Troubleshooting Common MySQL Replication Failures Quickly
Quickly resolve common MySQL replication failures with this practical guide. Learn to interpret error codes from `SHOW REPLICA STATUS`, inspect MySQL error logs, and understand the purpose of binary logs. This article provides actionable steps and best practices for diagnosing issues like duplicate entries, missing binlog files, and data divergence, helping you maintain a healthy replication setup.
Troubleshooting Common MySQL Replication Failures Quickly
MySQL replication failures are easier to fix when you separate two questions: can the replica fetch events from the source, and can it apply the events it already fetched? Those are different failures. A network issue, missing binary log, bad password, or wrong host grant usually stops the I/O thread. A duplicate key, missing row, DDL mismatch, or data drift usually stops the SQL thread.
Start with the status output. On modern MySQL:
SHOW REPLICA STATUS\G
On older systems:
SHOW SLAVE STATUS\G
Use whichever command your server supports. Newer output uses names such as Replica_IO_Running, Replica_SQL_Running, and Seconds_Behind_Source. Older output uses Slave_IO_Running, Slave_SQL_Running, and Seconds_Behind_Master.
The first useful read is:
Replica_IO_Running: whether the replica is connected and reading source binary logs.Replica_SQL_Running: whether the replica is applying relay log events.Last_IO_ErrnoandLast_IO_Error: why fetching failed.Last_SQL_ErrnoandLast_SQL_Error: why applying failed.Relay_Master_Log_File,Exec_Master_Log_Pos, or newer source-position fields: where the replica is in the stream.
Do not skip straight to a fix. Copy the full status output into your incident notes first. Once you run RESET REPLICA, skip a transaction, or re-point the replica, some of the best evidence disappears.
If the I/O Thread Is Stopped
When Replica_IO_Running is No, the replica is not successfully reading from the source. The SQL thread may still be applying older relay log events for a while, but eventually it will run out.
Common causes are:
- The source host or port is wrong.
- A firewall, security group, or routing rule blocks the connection.
- The replication user password is wrong.
- The replication user is allowed from a different host than the replica actually uses.
- Binary logging is disabled on the source.
- The source has purged the binary log file the replica requested.
- TLS settings changed and the replica can no longer authenticate.
Test from the replica host:
mysql -h source-db.example.com -u repl_user -p
If a direct login fails, replication will fail too. Check the account on the source:
SHOW GRANTS FOR 'repl_user'@'replica_host_or_ip';
The account needs REPLICATION SLAVE privilege. The privilege name still uses "SLAVE" in MySQL grants.
Also check whether binary logging is enabled:
SHOW VARIABLES LIKE 'log_bin';
SHOW MASTER STATUS;
On newer versions, SHOW BINARY LOG STATUS may be available. The point is the same: the source must have binary logs, and the requested file must still exist.
Error 1236: Missing or Unreadable Binary Log
Last_IO_Errno: 1236 is one of the errors that usually means the replica is asking for a binary log file or position the source cannot provide. The exact message varies. It may say the first log file cannot be found, a log event could not be read, or the source closed the connection while reading.
The most common operational case is simple: the replica was down too long, and the source purged the binary logs it needed.
Check what logs remain on the source:
SHOW BINARY LOGS;
Then compare that list with the file named in the replica status. If the replica needs mysql-bin.000120 and the source now starts at mysql-bin.000140, the replica cannot catch up from binary logs.
You have three realistic choices:
- Restore or rebuild the replica from a fresh backup taken from the source.
- Use another replica that still has the needed data as a clone source, if your process supports that.
- If using GTID and the missing transactions exist elsewhere, reconfigure from a valid source that can provide them.
Do not guess a newer log position just to make replication start. That creates a replica with missing transactions. It may look healthy while quietly returning wrong data.
After recovery, increase binary log retention if disk capacity allows it:
[mysqld]
binlog_expire_logs_seconds=604800
That example is about 7 days. Choose a value based on how long replicas may be offline during maintenance or incidents.
If the SQL Thread Is Stopped
When Replica_SQL_Running is No, the replica fetched events but could not apply one. This is often a data consistency problem, not a connection problem.
Read the full Last_SQL_Error. It usually tells you the table, key, failing operation, and sometimes the source log position. Then inspect the relevant row on both source and replica before changing anything.
For a failed event around a known binary log position, mysqlbinlog can show the event:
mysqlbinlog --start-position=123456 --stop-position=124500 /var/lib/mysql/mysql-bin.000321
If the source binary logs are not on the local host, use remote options or inspect a copied log file. Be careful with row-based events: they may need decoding options and table metadata to be readable.
Error 1062: Duplicate Entry
Last_SQL_Errno: 1062 means the replica tried to insert or update a row and hit a unique key that already exists.
Typical causes include:
- Someone wrote directly to the replica.
- The replica was initialized from the wrong snapshot.
- A previous replication error was skipped.
- Auto-increment settings are wrong in a multi-source or active-active design.
- Application writes went to two writable servers by mistake.
The tempting fix is:
STOP REPLICA;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START REPLICA;
Older syntax uses STOP SLAVE and START SLAVE. This may be acceptable for a disposable reporting replica after you confirm the row does not matter. It is dangerous for a replica that may be promoted later. Skipping means the replica no longer has the same transaction history as the source.
A safer process is:
- Identify the conflicting table and key.
- Compare the row on source and replica.
- Decide whether the replica row should be deleted, updated, or the replica should be rebuilt.
- Record the decision, because this is now a data consistency event.
If the replica is meant for failover, rebuilding is often cleaner than patching several unknown differences by hand.
Error 1032: Cannot Find Record
Last_SQL_Errno: 1032 usually means the replica tried to update or delete a row that does not exist locally. This is the mirror image of many duplicate-key problems. The source had a row; the replica did not.
Common causes are:
- A row was deleted manually on the replica.
- A previous transaction was skipped.
- The initial dump missed data.
- Replication filters excluded earlier writes.
Do not assume the missing row is harmless. If an UPDATE cannot find a row, the replica is already different from the source. Compare counts and sample data around the affected key. If the table is small, a table reload may be reasonable. If it is large or critical, use a consistency tool or rebuild the replica.
Authentication and Host Grant Problems
A very common failure after password rotation or network changes is an I/O error that looks like access denied:
Access denied for user 'repl_user'@'10.0.2.15'
The host in the error is the one MySQL sees. It may not match the hostname you expected, especially with NAT, proxies, or container networking.
On the source, inspect users:
SELECT user, host, plugin FROM mysql.user WHERE user = 'repl_user';
If the replica connects from 10.0.2.15, a grant for 'repl_user'@'replica.internal' may not match unless name resolution and grants are aligned. Prefer explicit host patterns that match your network design.
If the plugin differs, older clients may fail against accounts using newer authentication plugins. Updating the client is usually better than weakening authentication, but in mixed-version environments you may need a planned compatibility change.
Relay Log Problems
Sometimes the source connection is fine, but the replica has relay log corruption or local disk trouble. The error may mention a relay log read failure, truncated event, or relay log position.
First check disk health and free space. A full disk can create several strange replication symptoms:
df -h
iostat -xz 1
If the relay log is corrupt but the source still has the needed binary logs, you can often reset relay logs and let the replica fetch again. The exact command depends on version and topology. Do not run reset commands casually; confirm you know the source log file and position that has already been executed.
In many cases, this kind of issue is a sign that the replica host had an underlying storage problem. Fix that before trusting the replica again.
Replication Lag Is Not Always Failure
Seconds_Behind_Source can be high while both threads are running. That means replication is alive but behind. Treat lag differently from a stopped thread.
Check:
- Is the replica disk saturated?
- Is the source generating a write burst?
- Are long reads on the replica competing with the SQL thread?
- Is the replica smaller or slower than the source?
- Did a backup job or snapshot start at the same time?
If lag is falling, the replica is catching up. If lag is growing, remove load or add capacity. Restarting a lagging replica rarely fixes a sustained resource bottleneck.
Filters and Multi-Source Replication
Replication filters can make failures harder to read. A replica may intentionally ignore some databases or tables, but the application may still expect related data to exist. If you use filters, inspect them before assuming the replica is corrupt:
SHOW REPLICA STATUS\G
Look for fields that mention Replicate_Do_DB, Replicate_Ignore_DB, Replicate_Do_Table, or rewrite rules. Older output uses the same general names under SHOW SLAVE STATUS.
Filtering is especially risky with cross-database writes. If a transaction updates app.orders and audit.order_events, but the replica filters out audit, the resulting copy may be technically consistent with the filter and still useless for a workflow that expects audit rows. Statement-based logging can make database filters even more surprising because the selected default database can influence whether an event is replicated.
Multi-source replication adds another layer. One channel can be healthy while another is stopped. In that case, check status for all channels instead of reading only the first block of output:
SHOW REPLICA STATUS\G
On channel-based setups, the status output includes a channel name. Fix the failed channel without resetting healthy channels. If two sources can write overlapping keys into the same table, duplicate key errors are often a design issue rather than a one-off replication failure.
Avoid Hidden Data Drift
The worst replication failure is the one that says Yes and still contains wrong data. Drift can happen after skipped transactions, direct writes to replicas, failed imports, bad filters, or manual repairs.
For important replicas, schedule consistency checks. Percona Toolkit's pt-table-checksum is commonly used for this, and pt-table-sync can help repair differences in controlled situations. These tools can create load, so test them first and run them with limits that match your production environment.
Also protect replicas from accidental writes:
[mysqld]
read_only=ON
super_read_only=ON
Use separate credentials for application reads. Do not let application users have broad write privileges on replicas "just in case".
A Fast Incident Checklist
Use this order when replication breaks:
- Save
SHOW REPLICA STATUS\Goutput. - Check whether the I/O thread or SQL thread stopped.
- Read
Last_IO_ErrororLast_SQL_Error; do not rely only on the error number. - Check the MySQL error log for matching timestamps.
- For I/O failures, test network, credentials, grants, TLS, and binary log availability.
- For SQL failures, inspect the affected row or table on both source and replica.
- Decide whether to repair, skip with documented risk, reload a table, or rebuild the replica.
- After recovery, run a real write test and monitor lag.
Most MySQL replication failures are not solved by one magic command. They are solved by preserving evidence, identifying which thread failed, and choosing a fix that does not leave you with a replica that is running but untrustworthy.