A single PostgreSQL primary is fine until it isn't. Disk failure, host crash, kernel panic: any of these on your only database server means downtime while you restore from backup. Streaming replication changes the equation: a hot standby receives every WAL record from the primary in near real time and can be promoted to primary in seconds. Add Patroni for automated failover and pgBouncer for connection pooling, and you have a production-grade HA setup that handles the failure scenarios that will eventually happen.
Streaming replication fundamentals
PostgreSQL's streaming replication works at the WAL (Write-Ahead Log) level. Every write on the primary generates WAL records. The standby connects to the primary over a replication connection, receives these WAL records as a stream, and applies them continuously, typically staying within seconds of the primary.
# postgresql.conf on the PRIMARY
wal_level = replica # Enable replication WAL content
max_wal_senders = 5 # Max concurrent standby connections
wal_keep_size = 1GB # Keep this much WAL for slow standbys
synchronous_commit = on # Default: commit waits for local WAL flush
# For synchronous replication, also name the standby(s) that must confirm:
# synchronous_standby_names = 'standby_01'
# Async replication (higher throughput, small data-loss window) is simply
# leaving synchronous_standby_names empty
# Create a replication user
CREATE USER replicator REPLICATION LOGIN PASSWORD 'strong-password-here';
# pg_hba.conf โ allow standby to connect for replication
host replication replicator 10.0.2.11/32 scram-sha-256
# On the STANDBY โ take a base backup from primary
pg_basebackup \
--host=db-primary.internal \
--username=replicator \
--pgdata=/var/lib/postgresql/16/main \
--wal-method=stream \
--checkpoint=fast \
--progress \
--verbose
# postgresql.conf on STANDBY
hot_standby = on # Allow read queries on standby
primary_conninfo = 'host=db-primary.internal user=replicator password=xxx sslmode=require'
# Signal file that tells PostgreSQL this is a standby
touch /var/lib/postgresql/16/main/standby.signal
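Since PostgreSQL 12, that signal file is the whole mechanism: the server starts in standby mode if and only if standby.signal exists in the data directory. A minimal sketch of a deploy-script check (the temporary directory stands in for the real PGDATA):

```python
import tempfile
from pathlib import Path

def is_standby(data_dir: str) -> bool:
    """PostgreSQL 12+ starts in standby mode iff standby.signal exists
    in the data directory (the old recovery.conf is gone)."""
    return (Path(data_dir) / "standby.signal").exists()

# Demo with a throwaway directory; a real check would pass the PGDATA path
with tempfile.TemporaryDirectory() as d:
    print(is_standby(d))                      # False: no signal file, starts as primary
    (Path(d) / "standby.signal").touch()
    print(is_standby(d))                      # True: node will start as a standby
```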
Replication slots: don't lose WAL the standby needs
Without a replication slot, the primary may recycle WAL before the standby has consumed it. If the standby falls behind (network issue, heavy load), it can fall far enough that its only recovery path is a full re-sync with pg_basebackup. A replication slot makes the primary retain WAL until the named slot has consumed it:
-- Create a replication slot on the primary
SELECT pg_create_physical_replication_slot('standby_01');
-- Check slot state
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
# On the standby, point at the slot in postgresql.conf
primary_slot_name = 'standby_01'
Monitor slot lag carefully. An inactive replication slot with a fallen-behind standby causes WAL to accumulate on the primary; if not caught, it fills the disk and brings down the primary too. Alert when pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) exceeds 5GB on any slot, and drop slots for permanently offline standbys.
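The lag computation itself is plain byte arithmetic on LSNs. A sketch of what pg_wal_lsn_diff does with the textual X/Y LSN form (the example LSN values and 5 GB threshold are illustrative):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Parse PostgreSQL's textual LSN 'X/Y' (two hex halves) into a
    64-bit WAL byte position: high 32 bits / low 32 bits."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Equivalent of pg_wal_lsn_diff(current_lsn, restart_lsn)."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

ALERT_THRESHOLD = 5 * 1024**3  # the 5GB alert rule above

lag = slot_lag_bytes("17/B374D848", "16/B374D848")
print(lag)                    # 4294967296 bytes (4 GiB), below the threshold
print(lag > ALERT_THRESHOLD)  # False: no alert yet
```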
pgBouncer: connection pooling for PostgreSQL
PostgreSQL spawns a process per connection. Under heavy load (hundreds of application connections), this overhead is significant. pgBouncer sits in front of PostgreSQL and multiplexes many client connections onto a smaller pool of actual database connections:
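Sizing the server-side pool is mostly Little's law: average in-flight transactions = transactions per second x average transaction duration, plus headroom for bursts. A rough sketch with assumed workload numbers (the 2,000 tps and 8 ms figures are illustrative, not from any benchmark), which lands near the default_pool_size used in the config below:

```python
import math

def pool_size_needed(tps: float, avg_txn_seconds: float, headroom: float = 1.5) -> int:
    """Little's law estimate of concurrent server connections needed.
    headroom covers bursts; round up to whole connections."""
    return math.ceil(tps * avg_txn_seconds * headroom)

# Illustrative workload: 2000 txn/s averaging 8 ms each
print(pool_size_needed(2000, 0.008))  # 24: a pool of ~25 server connections suffices
```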
# /etc/pgbouncer/pgbouncer.ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction # Most efficient: connection returned to pool after each transaction; session state (SET, session-level prepared statements) does not persist
max_client_conn = 1000 # Max application connections pgBouncer accepts
default_pool_size = 25 # Actual connections to PostgreSQL per database
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
# For read replicas, add a second entry under the same [databases]
# section above, pointing at the standby:
#   myapp_ro = host=db-standby.internal port=5432 dbname=myapp
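One consequence of pool_mode = transaction is worth internalizing: consecutive transactions from the same client may run on different server connections, so session state (session-level prepared statements, SET, advisory locks) silently fails to carry over. A toy model of the assignment (a simple free list, not pgBouncer's actual scheduler) makes the hazard visible:

```python
from collections import deque

class TxnPool:
    """Toy transaction-mode pool: a server connection is borrowed for
    one transaction and returned to the free list at commit."""
    def __init__(self, size: int):
        self.free = deque(f"server-{i}" for i in range(size))

    def run_txn(self, client: str) -> str:
        conn = self.free.popleft()   # borrow for the duration of one transaction
        self.free.append(conn)       # returned at commit; anyone may reuse it next
        return conn

pool = TxnPool(size=2)
print(pool.run_txn("client-A"))  # server-0
print(pool.run_txn("client-A"))  # server-1: same client, different server connection
```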
Patroni: automated failover
Patroni manages the primary/standby roles and handles automatic failover using a distributed consensus store (etcd, Consul, or ZooKeeper) as the source of truth for who is the current primary:
# /etc/patroni/config.yml - on every database node
scope: myapp-cluster
namespace: /db/
name: db-01                 # Unique per node

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.2.10:8008

etcd3:
  hosts: etcd-01:2379,etcd-02:2379,etcd-03:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10       # Keep loop_wait + 2*retry_timeout <= ttl
    maximum_lag_on_failover: 1048576  # 1MB - don't promote a far-behind standby

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.2.10:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    replication:
      username: replicator
      password: "{{ replication_password }}"
    superuser:
      username: postgres
      password: "{{ postgres_password }}"
With Patroni running, failover is automatic: when the primary fails, the cluster elects the most up-to-date standby (subject to maximum_lag_on_failover), promotes it, and updates the leader key in etcd, typically within about 30 seconds. pgBouncer does not watch Patroni itself; the usual pattern is HAProxy health-checking the Patroni REST API (GET /primary returns 200 only on the current leader) and routing to the new primary with pgBouncer in front, or a Patroni on_role_change callback that rewrites the pgBouncer config.
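The failover budget follows from the DCS parameters: the leader key lives for ttl seconds, members re-check state every loop_wait, and Patroni expects roughly loop_wait + 2*retry_timeout <= ttl so a leader riding out transient DCS trouble can still refresh its key before expiry. A sketch of that sanity check and the worst-case detection window, using the values from the config above:

```python
def validate_dcs(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Patroni's documented relationship between the three DCS timeouts."""
    return loop_wait + 2 * retry_timeout <= ttl

def worst_case_detection_seconds(ttl: int, loop_wait: int) -> int:
    """A dead primary's leader key expires after ttl; survivors notice on
    their next loop, so detection takes at most ttl + loop_wait seconds
    (promotion itself then adds a few more)."""
    return ttl + loop_wait

print(validate_dcs(ttl=30, loop_wait=10, retry_timeout=10))   # True: values are consistent
print(worst_case_detection_seconds(ttl=30, loop_wait=10))     # 40 second upper bound
```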
In 47Network deployments requiring database HA, we use the Patroni + pgBouncer stack on all production PostgreSQL clusters. The fintech zero-trust engagement runs a 3-node Patroni cluster with synchronous replication to one standby and async to the second: zero data loss on planned switchovers, sub-30-second failover on unplanned primary failure. Application connection strings point to pgBouncer, which stays stable across failovers.