Troubleshooting High WAL Activity and Managing Archive Log Disk Space
Learn to troubleshoot and manage excessive Write-Ahead Log (WAL) generation in PostgreSQL. This guide covers common causes of high WAL activity, such as bulk operations and replication issues, and provides practical solutions for configuring WAL archiving, managing replication slots, and preventing disk space exhaustion. Essential reading for PostgreSQL administrators focused on stability and efficient disk space utilization.
Troubleshooting High WAL Activity and Managing Archive Log Disk Space
High Write-Ahead Log (WAL) activity in PostgreSQL is not automatically a problem. A busy database should generate WAL. The problem is when the WAL rate surprises you, when archived WAL never gets cleaned up, or when pg_wal grows because something is preventing PostgreSQL from recycling old segments.
The fastest way to make a WAL incident worse is to delete files from pg_wal by hand. Do not do that. Treat a full WAL disk as a recovery situation: identify what is retaining WAL, create breathing room if you safely can, then fix the failed archive, lagging standby, or abandoned slot that caused the growth.
Understanding Write-Ahead Logging (WAL)
PostgreSQL writes change records to WAL before the related data pages are safely written. After a crash, PostgreSQL replays WAL so committed changes are not lost. The same stream is also used for streaming replication and point-in-time recovery.
WAL files are stored in fixed-size segment files. Many installations use 16 MB segments because that is the common default, but the size is chosen when the cluster is initialized. A write-heavy workload can create a large number of segments quickly. Old segments are recycled or removed only after PostgreSQL no longer needs them for crash recovery, checkpoints, archiving, replication, or slots.
Key WAL Concepts:
- Durability: committed transactions can be recovered after a crash.
- Replication: standbys receive WAL records from the primary.
- Point-in-Time Recovery (PITR): base backups plus archived WAL let you recover to a chosen point within your retention window.
- WAL segments: WAL is stored in segment files under
pg_wal.
Common Causes of High WAL Activity
Several factors can contribute to an unusually high volume of WAL generation. Identifying the root cause is the first step in effective troubleshooting.
1. Bulk Data Loading and Modifications
Operations like INSERT, UPDATE, DELETE, TRUNCATE, and COPY can generate significant amounts of WAL. Bulk operations, especially on large tables, will naturally produce more WAL records than small, individual transactions.
- Example: A single
COPY FROMcommand to insert millions of rows can generate gigabytes of WAL data. - Example: Running a large-scale data migration or a batch update script.
2. Replication Lag and Standby Issues
If your standby servers are not keeping up with the primary (replication lag), WAL files will accumulate on the primary. The primary server cannot remove completed WAL segments until they have been confirmed to be sent to and processed by all connected standbys (if wal_keep_size or max_slot_wal_keep_size is not configured, or if slots are not used correctly).
- Scenario: A standby server is down, disconnected, or experiencing performance issues, preventing it from consuming WAL records from the primary.
3. Full-Page Writes After Checkpoints
After a checkpoint, the first change to a data page may log a full-page image when full_page_writes is enabled. That setting protects recovery from torn pages and is normally left on. If checkpoints happen too often, full-page images can noticeably increase WAL volume. The fix is usually to tune checkpoint behavior, not to disable durability protections.
4. Unmanaged pg_wal Directory Growth
If WAL archiving is enabled and failing, PostgreSQL keeps WAL segments that still need to be archived. If archiving is not enabled, pg_wal should still recycle old segments when they are no longer needed, unless replication, slots, or checkpoint pressure are retaining them.
5. Replication Slots Not Being Reclaimed
Replication slots guarantee that WAL segments are not removed before they are consumed by a specific standby or logical decoding client. If a slot is created but the consumer stops or disconnects without the slot being dropped, the WAL segments required by that slot will be retained, even if the standby is no longer active.
Managing WAL Disk Space: Configuration and Solutions
Addressing high WAL activity requires a multi-pronged approach involving monitoring, configuration tuning, and proper maintenance procedures.
1. Enable and Monitor WAL Archiving
WAL archiving is the most critical mechanism for managing disk space and enabling PITR. When archiving is enabled, completed WAL files are copied to a separate location (e.g., a network file share, S3 bucket, or a different disk).
Configuration:
Modify your postgresql.conf file:
wal_level = replica # Or logical for logical replication
archive_mode = on # Enable archiving
archive_command = 'cp %p /path/to/archive/%f'
# Example for S3 using wal-g or similar tool:
# archive_command = 'wal-g wal-push %p'
%p: Placeholder for the full path to the WAL file to be archived.%f: Placeholder for the filename of the WAL file.
Important: The archive_command must be able to execute successfully. If it returns a non-zero exit code, PostgreSQL will consider archiving to have failed, which can lead to WAL files not being removed. Ensure the destination directory has sufficient space and the user running PostgreSQL has write permissions.
Monitoring Archiving
Use SQL queries to check the status of archiving:
SELECT archived_count,
failed_count,
last_archived_wal,
last_archived_time,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
If failed_count keeps increasing or last_archived_time is old while the database is still writing, fix the archive destination before tuning WAL size parameters.
2. Managing pg_wal Directory Size
Even with archiving enabled, the pg_wal directory on the primary can grow if WAL segments are not removed after archiving. This happens if:
- Standbys are not keeping up and the primary is retaining extra WAL for replication.
- Replication slots are holding onto WAL files.
wal_keep_size
This parameter keeps extra WAL on the primary for streaming replication. It replaced the older wal_keep_segments setting in PostgreSQL 13. It is useful for standbys that do not use slots, but it is not a guarantee that a badly lagging standby can always catch up.
# postgresql.conf on primary
wal_keep_size = 1024 # Keep 1GB of WAL on disk
Replication slots are often preferred when you need the primary to retain WAL for a specific consumer, but slots must be monitored because they can retain WAL indefinitely.
max_slot_wal_keep_size (PostgreSQL 13+)
This setting limits how much WAL a replication slot can retain. It is a guardrail against unlimited growth from a broken slot, but it can also cause a lagging consumer to lose the WAL it needs and require reinitialization.
# postgresql.conf on primary
max_slot_wal_keep_size = 2048 # Limit slots to retain 2GB of WAL
# Also consider: wal_keep_size -- still relevant for non-slot based streaming
# wal_keep_size = 1024 # Keep 1GB for non-slot streaming
If a slot falls too far behind and exceeds this limit, PostgreSQL may remove needed WAL at checkpoint time. That protects disk space, but the affected standby or logical replication client may no longer be able to continue from its old position.
Replication Slots
Replication slots are crucial for preventing WAL loss and ensuring reliable replication. However, they can cause WAL files to accumulate if not managed correctly.
- Problem: A replication slot is created, but the consumer (standby or logical client) disconnects or fails, and the slot is never dropped. The primary server will keep all WAL files that the slot is waiting for.
- Solution: Regularly monitor replication slots and drop any that are no longer in use.
-- List replication slots
SELECT slot_name,
plugin,
slot_type,
active,
restart_lsn,
wal_status,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
-- Drop an unused slot
SELECT pg_drop_replication_slot('slot_name_to_drop');
Warning: Dropping a replication slot will cause any connected consumer to lose its position. Ensure the consumer is no longer needed or has been properly re-initialized before dropping.
3. Tuning min_wal_size and max_wal_size
These parameters influence checkpoint frequency and how much WAL PostgreSQL tries to keep around for reuse. They do not cap WAL retained for archiving or replication.
min_wal_size: encourages PostgreSQL to keep at least this much WAL for reuse instead of removing it immediately.max_wal_size: the WAL volume that tends to trigger a checkpoint. It is not a hard maximum when WAL is retained for other reasons.
# postgresql.conf
min_wal_size = 1GB
max_wal_size = 4GB
Increasing max_wal_size can give the system more headroom during peak write loads, but it also means more disk space will be occupied by pre-allocated WAL files.
4. Regular Cleanup of Archived WAL Files
WAL archiving, while essential for recovery, can also lead to disk space issues if the archived files are never cleaned up. You must have a strategy for managing the retention of your archived WAL files.
Strategy: Implement a script or use a dedicated tool (like
pg_archivecleanup,pgBackRest,wal-g,barman) to remove old WAL files from the archive location once they are no longer needed for PITR or replication.Using
pg_archivecleanup: This utility can be run on the primary server to remove old WAL files from the archive directory.pg_archivecleanup /path/to/archive/location 0000000100000037000000AFThe second argument is the oldest WAL file name that still needs to be kept, not an arbitrary age. In practice, backup tools such as pgBackRest, Barman, and WAL-G are safer because they understand backup retention and the WAL required for recovery.
Important: Always ensure your cleanup strategy aligns with your backup and recovery Point-In-Time Recovery (PITR) requirements. You need to retain WAL files long enough to cover your desired recovery window.
5. Monitoring Disk Space and WAL Generation Rate
Proactive monitoring is key to preventing disk space exhaustion.
Monitor Disk Space: Track free space in the data directory,
pg_wal, temporary file locations, and archive destinations.Monitor WAL Generation: Use LSN differences over time to estimate generation rate.
SELECT now() AS sample_time, pg_current_wal_lsn() AS current_lsn;Store that value periodically and compare samples with
pg_wal_lsn_diff(new_lsn, old_lsn). For a quick view of the currentpg_waldirectory size:SELECT pg_size_pretty(sum(size)) AS pg_wal_size FROM pg_ls_waldir();
Troubleshooting Steps for Full Disks
If your disk is already full due to WAL activity, immediate action is required:
- Identify the Cause: Check
pg_stat_archiverfor archiving failures. Examinepg_replication_slotsfor unused or problematic slots. Check replication lag on standbys. - Free Up Space Without Damaging Recovery:
- Do not delete files from
pg_walmanually. - If the archive destination is full, remove old archived WAL only when it is outside your backup retention window.
- If possible, add storage or move the archive destination, then let PostgreSQL archive and recycle normally.
- Do not delete files from
- Address the Root Cause:
- Fix Archiving: Ensure
archive_commandis correct and the destination has space. - Manage Slots: Drop any unused replication slots.
- Fix Replication: Address issues causing standby lag.
- Increase Disk Space: Temporarily or permanently add more storage.
- Fix Archiving: Ensure
- Nudge the Archiver: After fixing the archive command or destination, PostgreSQL should retry. A reload may be enough for configuration changes; a full restart should be a last resort during a disk incident.
A Safer Mental Model
When pg_wal is growing, ask three questions in order:
- Is PostgreSQL generating more WAL than usual because the workload changed?
- Is PostgreSQL unable to archive WAL?
- Is PostgreSQL being told to retain WAL for replication or a slot?
Those answers point to different fixes. Bulk writes may need scheduling, batching, or checkpoint tuning. Archive failures need storage and command fixes. Slot retention needs consumer recovery, slot cleanup, or a retention limit. Guessing at max_wal_size rarely solves the real problem by itself.