Setting Up Synchronous Replication for High Availability in PostgreSQL

Configuring PostgreSQL for high availability (HA) usually starts with a hard question: how much data can you afford to lose if the primary server disappears right after a commit? With normal asynchronous streaming replication, the answer is "maybe some." The primary can tell the application that a transaction committed before the standby has received or replayed the WAL record. If the primary fails during that small window, the promoted standby may not contain the last few committed transactions.

Synchronous streaming replication changes that tradeoff. PostgreSQL waits for one or more named standbys before reporting commit success. Depending on the synchronous_commit level, the standby may only need to write the WAL to the operating system, flush it to durable storage, or replay it so queries on the standby can see it. That can give you an RPO of zero for committed transactions, but it also means the write path now depends on the network and the standby's health.

That tradeoff matters. Synchronous replication is a good fit for the small set of data where losing even one acknowledged transaction is unacceptable: payments, account balances, inventory reservations, order state, audit trails. It is often a poor fit for high-volume event logs, clickstream data, metrics, or workloads where availability and latency matter more than perfect durability across nodes. Before you enable it globally, decide which part of your workload actually needs it.

Prerequisites

Before starting, ensure you have two PostgreSQL servers set up (Primary and Standby) running identical major versions of PostgreSQL. The standby must be able to reach the primary on the PostgreSQL port (5432 by default). For this guide, we assume:

Primary Hostname/IP: pg_primary
Standby Hostname/IP: pg_standby
Replication User: repl_user
Database Name: mydb

You also need a working backup and a maintenance window for the initial base backup. The examples assume PostgreSQL 12 or newer, where standby mode is controlled with standby.signal and connection settings are usually written by pg_basebackup -R.

Step 1: Configuring the Primary Server

The primary server requires specific settings to enable streaming replication and manage the Write-Ahead Log (WAL) required by synchronous commits.

A. Adjusting `postgresql.conf` on the Primary

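Edit the primary server's postgresql.conf file. The following parameters are needed for streaming replication. Recent releases already default to wal_level = replica, max_wal_senders = 10, and max_replication_slots = 10, so check the current values before changing them: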

# --- Required for Replication ---
listen_addresses = '*'         # Allows connections from standby
wal_level = replica            # Must be 'replica' or higher (e.g., 'logical')
max_wal_senders = 10           # Max concurrent connections from standbys
max_replication_slots = 10     # Slots needed for persistent replication streams

# --- Essential for Synchronous Commit ---
synchronous_standby_names = 'FIRST 1 (standby1)' # Specifies required standbys by application_name

# --- Optional but Recommended ---
wal_log_hints = on             # Needed by pg_rewind unless data checksums are enabled; increases WAL volume
shared_preload_libraries = 'pg_stat_statements' # Unrelated to replication; only if you use pg_stat_statements for query monitoring

Explanation of Key Parameters:

wal_level = replica: This ensures that sufficient information is written to the WAL to allow a standby server to reconstruct the database state. For synchronous commits, this level is the minimum requirement.
synchronous_standby_names: This is the core setting for defining which standbys must acknowledge writes. The names here are replication connection application_name values, not replication slot names. FIRST 1 (standby1) is priority-based: PostgreSQL waits for the highest-priority connected standby in the list, where priority follows list order. With a single name, that is always standby1. ANY 1 (standby1, standby2) is quorum-based: an acknowledgment from any one of the listed standbys satisfies the commit.
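
To make the FIRST and ANY semantics concrete, here is a small Python sketch that models how the primary would label connected standbys. It is an illustration under simplifying assumptions, not PostgreSQL's own parser: the helper names parse_sync_names and classify are hypothetical, names are compared exactly (PostgreSQL compares them case-insensitively), and quoted names, the * wildcard, and the bare comma-separated form are not handled.

```python
import re

def parse_sync_names(setting):
    """Parse 'FIRST n (a, b)' or 'ANY n (a, b)' into (method, n, names)."""
    m = re.fullmatch(r"\s*(FIRST|ANY)\s+(\d+)\s*\(([^)]*)\)\s*", setting, re.IGNORECASE)
    if not m:
        raise ValueError(f"unsupported synchronous_standby_names: {setting!r}")
    names = [n.strip() for n in m.group(3).split(",") if n.strip()]
    return m.group(1).upper(), int(m.group(2)), names

def classify(setting, connected):
    """Return ({application_name: sync_state}, commits_can_proceed)."""
    method, num_sync, names = parse_sync_names(setting)
    # Listed standbys that are connected, kept in list (priority) order.
    candidates = [n for n in names if n in connected]
    # Anything connected but not listed stays asynchronous.
    states = {name: "async" for name in connected}
    for rank, name in enumerate(candidates):
        if method == "ANY":
            states[name] = "quorum"
        else:
            states[name] = "sync" if rank < num_sync else "potential"
    # Commits can be acknowledged only if enough listed standbys are connected.
    return states, len(candidates) >= num_sync

print(classify("FIRST 1 (standby1, standby2)", {"standby1", "standby2", "reporting"}))
print(classify("ANY 1 (standby1, standby2)", {"standby2"}))
print(classify("FIRST 1 (standby1)", set()))
```

The last call shows the case that matters operationally: with no listed standby connected, the second element is False, which corresponds to commits waiting on the primary.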

B. Configuring Host-Based Authentication (`pg_hba.conf`)

The primary server must allow the replication user from the standby server(s) to connect for replication purposes.

Add an entry to pg_hba.conf on the primary:

# TYPE  DATABASE        USER            ADDRESS                 METHOD
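host    replication     repl_user       pg_standby              scram-sha-256

The ADDRESS column accepts either a host name or an IP range in CIDR notation, but not a host name with a mask: PostgreSQL rejects an entry such as pg_standby/32. If you prefer to match by address, replace pg_standby with the standby's IP address and mask (a /32 for a single IPv4 host) or with its subnet.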
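
Before reloading, it can help to sanity-check the value you put in the ADDRESS column. The sketch below is a simplified Python check, assuming only the common forms (keyword, host name, or IP address with a CIDR mask). The function name check_hba_address and the sample address 192.0.2.10 are hypothetical, and the separate IP-mask column form is not covered.

```python
import ipaddress

def check_hba_address(address):
    """Classify a pg_hba.conf ADDRESS value (simplified, illustrative)."""
    if address in {"all", "samehost", "samenet"}:
        return "keyword"
    host, slash, _mask = address.partition("/")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not a numeric address, so it is treated as a host name,
        # and a host name must not carry a CIDR mask.
        return "invalid: host name with CIDR mask" if slash else "host name"
    if not slash:
        return "invalid: IP address needs a /mask"
    try:
        # strict=False: only the mask length is validated here.
        ipaddress.ip_network(address, strict=False)
    except ValueError:
        return "invalid: bad CIDR mask"
    return "cidr"

for value in ["pg_standby", "pg_standby/32", "192.0.2.10/32", "192.0.2.10"]:
    print(value, "->", check_hba_address(value))
```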

C. Creating the Replication Slot and User

Connect to PostgreSQL on the primary server to create the necessary user and the replication slot.

1. Create Replication User:

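SET password_encryption = 'scram-sha-256';  -- already the default on PostgreSQL 14+; on 12 and 13 the default is md5, which does not work with scram-sha-256 in pg_hba.conf
CREATE ROLE repl_user WITH REPLICATION LOGIN PASSWORD 'a_strong_password';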

2. Create Replication Slot:

This slot ensures WAL segments are retained until the standby confirms receipt, preventing the standby from falling behind so far that it needs a new base backup. Slots are useful, but they can also fill disks if a standby is down for a long time, so monitor retained WAL.

SELECT pg_create_physical_replication_slot('standby1_slot');

The slot name does not have to match synchronous_standby_names. In this example, standby1 is the application_name used for synchronous standby selection, while standby1_slot is the physical replication slot used for WAL retention.

D. Restarting the Primary

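Apply the configuration changes by restarting the PostgreSQL service on the primary server. A restart is required for listen_addresses, wal_level, max_wal_senders, and max_replication_slots; pg_hba.conf and synchronous_standby_names only need a reload. Be aware that once synchronous_standby_names is active, commits on the primary wait until standby1 connects. On a live system, leave it empty until the standby is streaming, then set it and reload.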

sudo systemctl restart postgresql

Step 2: Configuring the Standby Server

The standby server is configured to stream WAL records from the primary through a standby.signal file and a primary_conninfo setting.

A. Base Backup

Before starting streaming, the standby needs a full copy of the primary's data directory. Stop PostgreSQL on the standby first.

sudo systemctl stop postgresql

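Take the base backup using pg_basebackup. The target directory must be empty or not yet exist, so move any old data directory out of the way first, and run the command as the operating system user that owns the PostgreSQL data directory (usually postgres). Replace paths and connection details as necessary: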

# Example using the pg_basebackup utility
pg_basebackup -h pg_primary -D /var/lib/postgresql/15/main/ -U repl_user -P -Xs -R -W

-h: The primary's host name or IP address.
-D: The target data directory on the standby.
-U: The replication user.
-P: Show progress.
-Xs: Stream the WAL generated during the backup over a second connection, so the copy contains everything it needs to reach a consistent state.
-R: Create the standby.signal file and write the connection settings to postgresql.auto.conf.
-W: Prompt for the replication user's password.

B. Configuring `postgresql.conf` on the Standby

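On the standby, make sure PostgreSQL knows how to connect back to the primary. The key detail for synchronous replication is application_name: it must match the name listed in synchronous_standby_names, otherwise the standby connects under a default name (walreceiver, or cluster_name if set) and is treated as asynchronous. Because pg_basebackup -R already wrote a primary_conninfo line to postgresql.auto.conf, and that file overrides postgresql.conf, update the generated entry there (or remove it) rather than only adding the settings below to postgresql.conf.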

# --- Required on Standby ---
primary_conninfo = 'host=pg_primary port=5432 user=repl_user password=a_strong_password application_name=standby1'
primary_slot_name = 'standby1_slot'
hot_standby = on          # Allows read queries during recovery/standby mode
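
A mismatched or missing application_name is easy to overlook, so a quick check before starting the standby can save a debugging round. The following Python sketch extracts application_name from a simple key=value conninfo string and compares it with the names you listed on the primary. The helper names are hypothetical, and the parsing is deliberately simplified: it assumes no spaces around the equal sign and relies on shlex for basic quoting only.

```python
import shlex

def conninfo_application_name(conninfo):
    """Extract application_name from a simple key=value conninfo string."""
    params = dict(
        token.split("=", 1) for token in shlex.split(conninfo) if "=" in token
    )
    return params.get("application_name")

def will_match(conninfo, sync_names):
    """True if the standby's application_name appears in the given name list."""
    name = conninfo_application_name(conninfo)
    # PostgreSQL compares standby names case-insensitively.
    return name is not None and name.lower() in {n.lower() for n in sync_names}

conninfo = (
    "host=pg_primary port=5432 user=repl_user "
    "password=a_strong_password application_name=standby1"
)
print(conninfo_application_name(conninfo))
print(will_match(conninfo, ["standby1"]))
print(will_match("host=pg_primary user=repl_user", ["standby1"]))
```

The third call models a conninfo without application_name, which is the usual reason a standby shows up as async on the primary.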

C. Starting the Standby

Start the PostgreSQL service on the standby server.

sudo systemctl start postgresql

Step 3: Verification and Testing Synchronous Commit

Once both servers are running, verify the connection and then test the synchronous behavior.

A. Verifying Replication Status

Connect to the primary database and check the pg_stat_replication view:

SELECT client_addr, application_name, state, sync_state FROM pg_stat_replication;

You should see an entry for standby1 with sync_state as sync (or quorum if you use the ANY form of synchronous_standby_names). If it shows potential, the standby is connected but is not currently the one satisfying synchronous commits. If it shows async, PostgreSQL is not treating it as a synchronous standby; check the spelling of application_name and synchronous_standby_names.

B. Testing Synchronous Commit

The parameter that controls what a commit waits for is synchronous_commit. It can be set globally, per session, or per transaction. For RPO=0, use a level that waits for the standby to flush WAL to durable storage: on or remote_apply.

1. Setting Global Behavior

If you configured synchronous_standby_names on the primary as shown in Step 1, the default synchronous_commit = on waits until the synchronous standby has flushed the WAL record to durable storage. remote_write waits until the standby has written the WAL record to the operating system, which is usually faster but not as strong if the standby host crashes before flushing. remote_apply waits until the standby has replayed the transaction, which is useful when your application reads from the standby immediately after writing to the primary.

For most zero-data-loss HA setups, on is the practical starting point. Use remote_apply only when read-after-write behavior on the standby matters enough to justify the extra latency.

# In postgresql.conf on Primary
synchronous_commit = on

Warning: Synchronous commit can noticeably increase write latency compared to off or local, neither of which waits for a standby. The added latency comes from network round trips, standby WAL write and flush speed, and, for remote_apply, replay speed.
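
As a way to reason about the levels, the sketch below models which standby-side position each synchronous_commit value waits for. It is a toy model in Python, not how PostgreSQL implements the wait: LSNs are plain integers, the names WAITS_FOR and commit_can_return are hypothetical, and it assumes synchronous_standby_names is set (without it, on behaves like local).

```python
# Which standby-side progress each level waits for (None = no standby wait).
WAITS_FOR = {
    "off": None,
    "local": None,
    "remote_write": "write",
    "on": "flush",
    "remote_apply": "replay",
}

def commit_can_return(level, commit_lsn, standby):
    """standby maps 'write', 'flush', 'replay' to the LSN reached so far.

    Assumes replay <= flush <= write on the standby, so checking the
    replay position for remote_apply also implies the WAL was flushed.
    """
    stage = WAITS_FOR[level]
    if stage is None:
        return True
    return standby[stage] >= commit_lsn

# Hypothetical snapshot: WAL written up to 1200, flushed to 1100, replayed to 900.
standby = {"write": 1200, "flush": 1100, "replay": 900}
for level in WAITS_FOR:
    print(level, commit_can_return(level, 1150, standby))
```

In this snapshot a commit at position 1150 could already return under remote_write, but would still be waiting under on and remote_apply, which is the latency difference the warning above describes.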

2. Testing Within a Transaction

To test transactionally (without requiring a global configuration change), you can set it per session or transaction:

-- Connect to Primary

BEGIN;
SET LOCAL synchronous_commit = on;

INSERT INTO sales (item, amount) VALUES ('Widget A', 100);  -- sales is an example table
-- The INSERT generates WAL; nothing waits for the standby yet.

COMMIT;
-- With synchronous_commit = on, COMMIT returns only after the synchronous standby confirms it has flushed the commit record.

If no configured synchronous standby is available at commit time, the commit waits. That is the point of the feature, but it can surprise teams during an outage: the primary may still be up, yet writes appear frozen because PostgreSQL is waiting for a synchronous acknowledgment. There is no general PostgreSQL setting that automatically "falls back" to asynchronous commit when a synchronous standby disappears. If you want that behavior, your HA tooling must change synchronous_standby_names, or you must have a runbook for doing it manually after deciding that availability is more important than zero data loss.
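
If you decide to script that runbook step, one option is to generate the statements and have an operator review them before they run. The Python sketch below is a minimal example under stated assumptions: the function name relax_statements and its accept_risk flag are hypothetical, standby names are assumed to need no quoting, and the decision to trade durability for availability stays with a human. ALTER SYSTEM and pg_reload_conf() are standard PostgreSQL facilities.

```python
def relax_statements(method, num_sync, names, connected, accept_risk=False):
    """Build SQL that removes unreachable standbys from synchronous_standby_names.

    Returns [] when enough listed standbys are still connected. Otherwise the
    change weakens durability, so it requires accept_risk=True.
    """
    remaining = [n for n in names if n in connected]
    if len(remaining) >= num_sync:
        return []  # commits are not blocked by this setting; nothing to relax
    if not accept_risk:
        raise RuntimeError("relaxing would weaken durability; pass accept_risk=True")
    if remaining:
        setting = f"{method} {len(remaining)} ({', '.join(remaining)})"
    else:
        setting = ""  # an empty value disables synchronous replication
    return [
        f"ALTER SYSTEM SET synchronous_standby_names = '{setting}';",
        "SELECT pg_reload_conf();",
    ]

# standby1 is unreachable and it was the only synchronous standby.
for stmt in relax_statements("FIRST", 1, ["standby1"], set(), accept_risk=True):
    print(stmt)
```

Raising by default is a deliberate choice in this sketch: the safe failure mode is that writes keep waiting until someone confirms that availability matters more than zero data loss.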

A Safer Two-Standby Pattern

A single synchronous standby gives strong durability while everything is healthy, but it also creates a single point of write availability. If that standby is down, slow, or isolated from the primary, commits wait. In production, a common pattern is to run at least two standbys and require one synchronous acknowledgment:

synchronous_standby_names = 'ANY 1 (standby1, standby2)'

With this setup, either standby can satisfy the commit. If standby1 is rebooting, standby2 can still acknowledge writes. You still need to monitor both replicas, because a long outage on one standby can cause its replication slot to retain a large amount of WAL, but the primary is less likely to stall because of a single standby failure.

Requiring two acknowledgments is possible:

synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'

That is a stricter durability choice. It is usually reserved for environments with very low-latency links and a clear reason to require more than one remote copy before commit. For many application databases, "any one of two nearby standbys" is the better balance.
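
The availability side of these choices is simple arithmetic, shown here as a short Python sketch. The function name is hypothetical, and the copy count assumes a level such as on, where each acknowledging standby has flushed the WAL.

```python
def sync_tradeoff(num_sync, listed_standbys):
    """Return (standby failures tolerated before commits stall,
    copies of the WAL flushed at commit time, counting the primary)."""
    if not 1 <= num_sync <= listed_standbys:
        raise ValueError("num_sync must be between 1 and the number of listed standbys")
    return listed_standbys - num_sync, num_sync + 1

configs = {
    "FIRST 1 (standby1)": (1, 1),
    "ANY 1 (standby1, standby2)": (1, 2),
    "ANY 2 (standby1, standby2, standby3)": (2, 3),
}
for setting, (num_sync, listed) in configs.items():
    tolerated, copies = sync_tradeoff(num_sync, listed)
    print(f"{setting}: tolerates {tolerated} standby failure(s), {copies} copies at commit")
```

Note that ANY 2 of three standbys tolerates the same single standby failure as ANY 1 of two, but guarantees one more flushed copy and waits for the slower of two acknowledgments.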

What to Monitor After Enabling It

Do not stop at "the standby connects." Synchronous replication can be technically working while user-facing latency gets worse. Watch these signals after rollout:

SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

On the primary, sync_state = 'sync' tells you which standby is currently synchronous; with the ANY form, every listed standby that is connected reports quorum instead. write_lag, flush_lag, and replay_lag help explain where time is going. If write_lag is high, suspect network or standby WAL write pressure. If flush_lag is high, suspect storage. If only replay_lag is high, the standby may be receiving WAL but applying it slowly because of I/O, CPU, locks, or long-running queries on the standby.
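
The reading of the three lag columns described above can be encoded as a small triage function for a monitoring script. This Python sketch assumes the lag values arrive as datetime.timedelta objects (or None when the column is NULL); the function name diagnose_lag and the 50 ms threshold are hypothetical and should be tuned to your own latency budget.

```python
from datetime import timedelta

def diagnose_lag(write_lag, flush_lag, replay_lag,
                 threshold=timedelta(milliseconds=50)):
    """Map write_lag, flush_lag, replay_lag to the likely bottleneck."""
    zero = timedelta(0)
    # NULL lag (no recent activity) is treated as zero.
    write_lag = write_lag or zero
    flush_lag = flush_lag or zero
    replay_lag = replay_lag or zero
    if write_lag > threshold:
        return "network or standby WAL write pressure"
    if flush_lag > threshold:
        return "standby storage (flush)"
    if replay_lag > threshold:
        return "standby replay (I/O, CPU, locks, or long-running queries)"
    return "healthy"

ms = lambda n: timedelta(milliseconds=n)
print(diagnose_lag(ms(2), ms(4), ms(6)))
print(diagnose_lag(ms(3), ms(120), ms(130)))
print(diagnose_lag(ms(3), ms(5), ms(900)))
```

The checks run in order because the columns are cumulative: a high write_lag will usually drag flush_lag and replay_lag up with it, so the earliest stage over the threshold is the one to investigate first.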

Also monitor slot retention:

SELECT
    slot_name,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

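A replication slot protects a standby, but it does not protect your disk. If a slot is inactive and retained WAL keeps growing, either fix the standby quickly or drop the slot after confirming you no longer need it. On PostgreSQL 13 and newer, max_slot_wal_keep_size can cap how much WAL a slot retains, at the cost of invalidating a slot that falls too far behind.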
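
If your monitoring collects the raw LSN values rather than calling pg_wal_lsn_diff, the same subtraction is easy to reproduce. An LSN such as 2/10000000 is a 64-bit position written as two hexadecimal halves. The Python sketch below is an illustration: the function names and the 10 GiB alert limit are hypothetical.

```python
def lsn_to_int(lsn):
    """Convert an LSN like '2/10000000' to its 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL between a slot's restart_lsn and the current WAL position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)

def slot_alert(active, retained_bytes, limit_bytes=10 * 1024**3):
    """Classify a slot; an inactive slot over the limit is the urgent case."""
    if retained_bytes < limit_bytes:
        return "ok"
    return "warning: active slot over limit" if active else "critical: inactive slot over limit"

retained = retained_wal_bytes("2/10000000", "1/F0000000")
print(retained)  # 536870912 bytes, i.e. 512 MiB
print(slot_alert(active=False, retained_bytes=retained))
```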

A Practical Rollout Plan

For a busy production system, treat synchronous replication as a staged change rather than a one-line config tweak.

First, build the standby asynchronously and let it run for a while. Confirm it can keep up during peak write periods. If it falls behind asynchronously, it will hurt commit latency when it becomes synchronous.

Second, set application_name and verify that the primary sees the standby exactly as you expect in pg_stat_replication. Spelling mistakes are common here because synchronous_standby_names matches the runtime application_name, not the hostname and not the slot.

Third, enable synchronous replication during a low-traffic window and watch commit latency from the application side. PostgreSQL metrics may look fine while the application connection pool backs up because transactions now hold connections a little longer.
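
The pool effect follows from Little's law: the average number of busy connections equals the transaction rate multiplied by how long each transaction holds a connection. The Python sketch below works through one example; the function name and all numbers are hypothetical and only illustrate the arithmetic.

```python
def connections_in_use(transactions_per_second, hold_time_ms):
    """Average busy connections by Little's law (rate x time held)."""
    return transactions_per_second * hold_time_ms / 1000

rate = 2000           # hypothetical write transactions per second
before_ms = 4         # hypothetical hold time with asynchronous replication
sync_wait_ms = 2      # hypothetical extra wait for the standby's acknowledgment

print(connections_in_use(rate, before_ms))
print(connections_in_use(rate, before_ms + sync_wait_ms))
```

In this example a 2 ms synchronous wait raises average pool occupancy by half, from 8 to 12 connections, which is why a pool sized tightly for the asynchronous case can start queueing even though each commit is only slightly slower.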

Finally, write down the failure decision. If the synchronous standby is gone and the primary is waiting on commits, who is allowed to relax synchronous_standby_names? Under what conditions? How will you verify whether the old primary or old standby contains the newest data before rejoining nodes? These are operational decisions, not just database settings.

Best Practices for Synchronous HA

Use Low-Latency Standbys: Only assign standbys that are physically close (low latency) to the primary to your synchronous replication list. High latency will show up directly in commit time.
Monitor Replication Lag: Even in synchronous mode, monitor the standby lag. A slow standby that is still technically 'sync' but taking too long to process WAL can still impact user experience.
Plan the Availability Tradeoff: Decide in advance whether an operator may temporarily remove a missing standby from synchronous_standby_names during an incident.
Use Multiple Standbys: For better write availability, configure synchronous_standby_names = 'ANY 1 (standby1, standby2)' so either standby can acknowledge commits.