Most schema migrations that cause downtime don't have to. The operations that lock tables (adding a NOT NULL column, creating an index on a large table, renaming a column in use) all have safe alternatives that work on live databases under full production traffic. The key is separating schema changes from application changes and deploying them in stages. This post covers the expand/contract pattern, how to create indexes without locking, and the specific patterns for the operations that trip teams up most often.
The expand/contract pattern
Any breaking schema change can be made safely by splitting it across multiple deployments. The pattern has three phases:
- Expand: Add the new structure alongside the old. Both old and new application code can work. No existing queries break.
- Migrate: Backfill data from old to new structure. Deploy application code that writes to both old and new.
- Contract: Remove the old structure once all application code has been updated and no longer references it.
Adding a column safely
-- DANGEROUS on PostgreSQL < 11: rewrites the entire table under an exclusive lock
-- (PostgreSQL 11+ made this instant for non-volatile defaults such as now();
-- volatile defaults like clock_timestamp() still force a full rewrite)
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ NOT NULL DEFAULT now();
-- SAFE on any PostgreSQL version: nullable column, no default (instant)
ALTER TABLE orders ADD COLUMN processed_at TIMESTAMPTZ;
-- Then backfill in batches (never UPDATE without a WHERE clause on a large table)
DO $$
DECLARE
  batch_size INT := 1000;
  last_id BIGINT := 0;
  max_id BIGINT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM orders;
  WHILE last_id < max_id LOOP
    UPDATE orders
    SET processed_at = created_at -- or whatever the backfill logic is
    WHERE id > last_id AND id <= last_id + batch_size
      AND processed_at IS NULL;
    last_id := last_id + batch_size;
    COMMIT; -- a DO block is otherwise one big transaction; per-batch COMMIT (PostgreSQL 11+) releases locks between batches
    PERFORM pg_sleep(0.05); -- brief pause to avoid overwhelming I/O
  END LOOP;
END $$;
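While the backfill runs, its progress can be watched from a second session; a minimal sketch:

```sql
-- Rows still waiting for the backfill (run from a separate session)
SELECT count(*) AS remaining FROM orders WHERE processed_at IS NULL;
```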
-- After the backfill is complete and the app always writes the column:
-- enforce NOT NULL via a check constraint; NOT VALID skips scanning existing rows
ALTER TABLE orders ADD CONSTRAINT orders_processed_at_not_null
  CHECK (processed_at IS NOT NULL) NOT VALID;
-- Then validate with only a light lock (blocks neither reads nor writes):
ALTER TABLE orders VALIDATE CONSTRAINT orders_processed_at_not_null;
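On PostgreSQL 12 and later, a validated check constraint of this form lets the server prove the column contains no NULLs, so the real NOT NULL can then be added without a table scan and the now-redundant check dropped:

```sql
-- PostgreSQL 12+: SET NOT NULL skips the full-table scan because the
-- validated check constraint already proves no NULLs exist
ALTER TABLE orders ALTER COLUMN processed_at SET NOT NULL;
-- The check constraint is now redundant
ALTER TABLE orders DROP CONSTRAINT orders_processed_at_not_null;
```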
Creating indexes without locking
-- DANGEROUS: blocks all writes to the table for potentially minutes/hours
-- (reads still work, but every INSERT/UPDATE/DELETE waits)
CREATE INDEX idx_orders_user_id ON orders (user_id);
-- SAFE: builds the index in the background without blocking writes
-- (note: CONCURRENTLY cannot run inside a transaction block)
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);
-- This is slower and uses more resources, but the table remains
-- fully available for reads and writes throughout.
-- If the concurrent build fails, it leaves behind an INVALID index:
SELECT indexname, indexdef FROM pg_indexes
WHERE tablename = 'orders' AND indexname LIKE '%user_id%';
-- pg_indexes doesn't expose validity; check pg_index (indisvalid = false
-- means the build failed and the index must be dropped and rebuilt):
DROP INDEX CONCURRENTLY idx_orders_user_id; -- then retry the build
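The validity flag lives in the pg_index catalog; a query like this (a general sketch, not tied to any one table) lists every invalid index left behind by failed concurrent builds:

```sql
-- List all invalid indexes in the current database
SELECT c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;
```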
Renaming a column (the expand/contract way)
-- A rename is a breaking change for any code that still uses the old name,
-- so it needs a full expand/contract cycle rather than a single ALTER ... RENAME
-- Phase 1 (Expand): Add the new column, sync writes with a trigger
ALTER TABLE users ADD COLUMN full_name TEXT;
CREATE OR REPLACE FUNCTION sync_full_name() RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
    NEW.full_name := NEW.name; -- sync old -> new on every write
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_full_name_trigger
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_full_name(); -- PostgreSQL 10 and older: EXECUTE PROCEDURE
-- Phase 2 (Migrate): Backfill, deploy app that writes to both columns
-- Phase 3 (Contract): Once all code uses full_name, drop old column + trigger
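-- Before contracting, verify the backfill actually finished (sketch; this
-- query must return 0 before it is safe to drop the old column):
SELECT count(*) FROM users WHERE full_name IS DISTINCT FROM name;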
DROP TRIGGER sync_full_name_trigger ON users;
DROP FUNCTION sync_full_name();
ALTER TABLE users DROP COLUMN name;
Lock timeout: the safety net for all migrations
Any ALTER TABLE acquires a lock, even a brief one. If a long-running transaction holds a conflicting lock, your migration waits, and while it waits, subsequent queries pile up behind it, creating a queue that looks like an outage. Set a lock timeout before every migration in production:
-- Set in the session before running the migration
SET lock_timeout = '3s'; -- fail fast if the lock isn't acquired within 3s
SET statement_timeout = '30s'; -- also bound total statement runtime
-- Example migration file header
BEGIN;
SET LOCAL lock_timeout = '3s'; -- LOCAL scopes the setting to this transaction
SET LOCAL statement_timeout = '60s';
ALTER TABLE orders ADD COLUMN ... ;
COMMIT;
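The same timeouts can also be attached to a dedicated migration role, so every session that role opens gets them by default (the role name migrator here is hypothetical):

```sql
-- Applies automatically to every new session for this (hypothetical) role
ALTER ROLE migrator SET lock_timeout = '3s';
ALTER ROLE migrator SET statement_timeout = '60s';
```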
In 47Network Studio engagements, all database migrations follow this pattern: nullable first, backfill in batches with a sleep, validate constraints separately, CONCURRENTLY for all index creation, and lock_timeout set on every ALTER. The CI pipeline also runs each migration in a --dry-run mode against a production-sized anonymised snapshot, so run times are known before the actual deployment window.