How to Manage Database Schema Migrations Safely
Deploy database changes without downtime. Covers blue-green migrations, backward compatibility, rollback strategies, and migration testing.
Schema migrations are the scariest deploys. A bad migration can take down production, corrupt data, or lock tables for hours. Application deployments can be rolled back in seconds; schema changes cannot. Ship a bad schema change and you're looking at a multi-hour incident, potential data loss, and a post-mortem that nobody wants to write.
This guide covers how to make schema migrations boring and safe: the golden rules, the expand-and-contract pattern, migration testing, lock-free operations, and the rollback strategies that keep production running.
The Golden Rules
- Every migration must be backward compatible — old code must work with the new schema, because during deployment both versions run simultaneously
- Never rename or drop columns in one step — use the expand-and-contract pattern (three deploys, not one)
- Always test migrations against a production-size dataset — a migration that takes 200ms on your dev laptop can lock tables for 45 minutes on 50 million rows
- Have a rollback plan before you start — write the rollback migration and test it before deploying forward
- Never mix schema changes and data changes in one migration — separate DDL (structure) from DML (data) for cleaner rollbacks
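Several of these rules can be enforced mechanically in code review or CI. Below is a minimal sketch of a migration lint check in Python; the pattern list and function name are illustrative, not a real tool, and the patterns are deliberately naive:

```python
import re

# Hypothetical pre-merge check: flag migration scripts that violate
# the golden rules above. Patterns are illustrative, not exhaustive.
DANGEROUS = [
    (r"\bRENAME\s+COLUMN\b", "rename in one step -- use expand-and-contract"),
    (r"\bDROP\s+COLUMN\b",   "column drop -- require soft-delete steps first"),
    (r"\bDROP\s+TABLE\b",    "table drop -- require soft-delete steps first"),
]

def lint_migration(sql: str) -> list[str]:
    """Return a list of golden-rule violations found in a migration script."""
    warnings = [msg for pat, msg in DANGEROUS
                if re.search(pat, sql, re.IGNORECASE)]
    # Mixing DDL (structure) and DML (data) complicates rollback
    has_ddl = re.search(r"\b(ALTER|CREATE|DROP)\b", sql, re.IGNORECASE)
    has_dml = re.search(r"\b(UPDATE|INSERT|DELETE)\b", sql, re.IGNORECASE)
    if has_ddl and has_dml:
        warnings.append("mixes schema (DDL) and data (DML) changes")
    return warnings
```

A check like this won't catch everything (string matching has no idea about table size or lock behavior), but it makes the obvious violations impossible to merge silently.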
Step 1: Expand and Contract Pattern
The expand-and-contract pattern is the safest way to make schema changes. It separates the change into three deploys, ensuring backward compatibility at every step.
Adding a Column (Safe — Single Deploy)
-- Step 1: Add column as nullable (instant operation in PostgreSQL 11+)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Step 2: Backfill data in batches (not a single UPDATE on millions of rows)
UPDATE users SET display_name = first_name || ' ' || last_name
WHERE display_name IS NULL
AND id BETWEEN 1 AND 100000;
-- Repeat for next batches...
-- Step 3: Once backfill is complete and the app writes to the new column,
-- set the default first (so concurrent inserts can't violate the constraint),
-- then add NOT NULL
ALTER TABLE users ALTER COLUMN display_name SET DEFAULT '';
ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;
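The batched backfill in step 2 needs batch boundaries. A small helper to plan them and render each batch's statement, assuming a dense integer `id` column (function names are illustrative):

```python
def batch_ranges(max_id: int, batch_size: int = 100_000):
    """Yield inclusive (start_id, end_id) ranges for batched backfill UPDATEs."""
    start = 1
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield (start, end)
        start = end + 1

def backfill_statement(start_id: int, end_id: int) -> str:
    """Render one batch's UPDATE, mirroring the SQL above."""
    return (
        "UPDATE users SET display_name = first_name || ' ' || last_name "
        f"WHERE display_name IS NULL AND id BETWEEN {start_id} AND {end_id};"
    )
```

Run each batch as its own transaction with a short pause between batches, so row locks are released and replicas can keep up.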
Renaming a Column (Dangerous — Requires Three Deploys)
-- ❌ WRONG: This breaks all existing code INSTANTLY
ALTER TABLE users RENAME COLUMN name TO full_name;
-- ✅ RIGHT: Expand-and-contract over 3 deploys
-- Deploy 1 (Schema): Add new column, start dual-write
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
UPDATE users SET full_name = name; -- Backfill (batch this on large tables)
-- App code: Write to BOTH 'name' and 'full_name', read from 'full_name'
-- Create trigger to keep columns in sync during transition:
CREATE OR REPLACE FUNCTION sync_name_columns() RETURNS TRIGGER AS $$
BEGIN
    IF NEW.name IS DISTINCT FROM OLD.name THEN
        NEW.full_name := NEW.name;
    END IF;
    IF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
        NEW.name := NEW.full_name;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_name
    BEFORE UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION sync_name_columns();
-- Deploy 2 (Code): Stop reading 'name', read/write only 'full_name'
-- Remove trigger, stop writing to 'name'
-- Deploy 3 (Cleanup): Drop old column after verification period (1-2 weeks)
ALTER TABLE users DROP COLUMN name;
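During Deploy 1 the application itself must dual-write. A minimal sketch of what that looks like at the model layer; the `User` class and its backing dict are hypothetical stand-ins for your ORM:

```python
class User:
    """Hypothetical model illustrating Deploy 1 dual-write semantics."""

    def __init__(self, row: dict):
        self._row = row  # stand-in for the database row

    def set_name(self, value: str) -> None:
        # Dual-write: keep the legacy column populated until Deploy 2
        self._row["full_name"] = value
        self._row["name"] = value

    def get_name(self) -> str:
        # Read the new column; fall back to legacy for not-yet-backfilled rows
        if self._row["full_name"] is not None:
            return self._row["full_name"]
        return self._row["name"]
```

The fallback read is what buys safety: rows the backfill hasn't reached yet still resolve correctly, so the backfill and the code deploy don't have to be ordered precisely.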
Dropping a Table or Column (Requires Soft-Delete First)
| Step | Deploy | What Happens |
|---|---|---|
| 1. Stop writing | Code deploy | App no longer writes to column/table |
| 2. Stop reading | Code deploy | App no longer reads from column/table |
| 3. Wait | 1-2 weeks | Verify nothing breaks, check logs for access |
| 4. Drop | Migration deploy | Remove the column/table |
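Step 3's "check logs for access" can be partially automated: export recent statement text (for example from `pg_stat_statements`) and scan it for the column slated for removal. A naive sketch; substring matching produces false positives, so treat this as a smoke test, not proof:

```python
def find_column_references(statements: list[str], table: str, column: str) -> list[str]:
    """Return statements that mention both the table and the doomed column."""
    t, c = table.lower(), column.lower()
    return [s for s in statements if t in s.lower() and c in s.lower()]
```

An empty result doesn't guarantee safety (dynamic SQL, `SELECT *`, and external consumers won't show up), but a non-empty result is a hard stop: something is still reading the column.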
Step 2: Migration Framework
Alembic (Python / SQLAlchemy)
# alembic/versions/001_add_display_name.py
"""Add display_name column"""
from alembic import op
import sqlalchemy as sa

revision = '001'
down_revision = None

def upgrade():
    op.add_column('users',
        sa.Column('display_name', sa.String(255), nullable=True)
    )

def downgrade():
    op.drop_column('users', 'display_name')

# Migration lifecycle commands
alembic upgrade head    # Apply all pending migrations
alembic downgrade -1    # Roll back the last migration
alembic history         # Show migration history
alembic current         # Show current version
alembic revision --autogenerate -m "add display_name"  # Generate from model diff
Other Migration Frameworks
| Framework | Language | Key Features |
|---|---|---|
| Alembic | Python | Auto-generate from SQLAlchemy models, branching |
| Flyway | Java/JVM | SQL-based, versioned, no ORM dependency |
| Liquibase | Java/JVM | XML/YAML/SQL, rollback support, diff |
| Knex.js | Node.js | JavaScript migrations, Promise-based |
| golang-migrate | Go | CLI + library, database-agnostic |
| Prisma Migrate | TypeScript | Schema-first, shadow database for diffing |
| ActiveRecord | Ruby | Mature, reversible migrations, Rails native |
Step 3: Testing Migrations
Never deploy a migration without testing it against production-size data. What takes milliseconds on 1,000 rows can take hours on 100 million.
# Estimate migration time on production-size data
import time
import psycopg2
def test_migration_timing(conn, migration_sql, table_name):
    """Run the migration on a sample and extrapolate to the full dataset"""
    cur = conn.cursor()

    # Get row count
    cur.execute(f"SELECT COUNT(*) FROM {table_name}")
    row_count = cur.fetchone()[0]

    # Time the migration on a 100K-row sample
    sample_size = min(100_000, row_count)
    start = time.time()
    cur.execute(f"""
        CREATE TEMP TABLE test_migration AS
        SELECT * FROM {table_name} LIMIT {sample_size}
    """)
    cur.execute(migration_sql.replace(table_name, "test_migration"))
    elapsed = time.time() - start

    # Extrapolate linearly to the full table
    estimated_total = (elapsed / sample_size) * row_count
    print(f"Rows: {row_count:,}")
    print(f"Sample time: {elapsed:.2f}s ({sample_size:,} rows)")
    print(f"Estimated full: {estimated_total:.0f}s ({estimated_total/60:.1f} min)")

    if estimated_total > 300:  # > 5 minutes
        print("⚠️ WARNING: Consider batched migration or online DDL tool")
    if estimated_total > 3600:  # > 1 hour
        print("🚨 CRITICAL: Must use background migration with batching")

    conn.rollback()  # Don't actually apply
What to Test
| Test | Why | How |
|---|---|---|
| Timing on production-size data | Catch 45-minute table locks | Sample + extrapolate |
| Backward compatibility | Old code must work | Run old test suite against new schema |
| Forward compatibility | Rollback must work | Apply, then rollback, verify data integrity |
| Lock contention | Exclusive locks block all queries | Monitor pg_stat_activity during test |
| Data integrity | Backfills must be correct | COUNT, SUM, spot-check after migration |
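The "data integrity" row above (COUNT, SUM, spot-checks) can be wrapped in a small helper: snapshot the aggregates before the migration, re-run the same queries after, and diff the two. A sketch, where the metric names are whatever aggregate queries you choose to run:

```python
def verify_snapshots(before: dict, after: dict, tolerance: float = 0.0) -> list[str]:
    """Compare pre- and post-migration aggregate snapshots; return mismatches."""
    problems = []
    for metric, expected in before.items():
        actual = after.get(metric)
        if actual is None or abs(actual - expected) > tolerance:
            problems.append(f"{metric}: expected {expected}, got {actual}")
    return problems
```

A non-zero `tolerance` is useful for tables that take live writes during the migration, where exact counts are expected to drift slightly.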
Step 4: Lock-Free Migrations
Large tables require special techniques to avoid blocking production queries.
-- PostgreSQL: CREATE INDEX CONCURRENTLY (no table lock)
-- Regular CREATE INDEX locks the table for writes!
CREATE INDEX CONCURRENTLY idx_orders_customer
ON orders (customer_id);
-- Takes longer, but allows reads AND writes during creation
-- Add column with default (PostgreSQL 11+ is INSTANT — no rewrite)
ALTER TABLE orders ADD COLUMN priority INTEGER DEFAULT 0;
-- Pre-PG11 this would rewrite the entire table!
-- Batch updates instead of single UPDATE on millions of rows
DO $$
DECLARE
    batch_size INT := 10000;
    rows_updated INT;
    total_updated INT := 0;
BEGIN
    LOOP
        UPDATE users
        SET display_name = first_name || ' ' || last_name
        WHERE display_name IS NULL
          AND id IN (
              SELECT id FROM users
              WHERE display_name IS NULL
              LIMIT batch_size
              FOR UPDATE SKIP LOCKED  -- Skip rows being modified by others
          );
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        total_updated := total_updated + rows_updated;
        EXIT WHEN rows_updated = 0;
        COMMIT;  -- Release row locks per batch (PostgreSQL 11+ allows COMMIT in DO blocks)
        PERFORM pg_sleep(0.1);  -- Brief pause to reduce CPU/IO pressure
        RAISE NOTICE 'Updated % rows (total: %)', rows_updated, total_updated;
    END LOOP;
END $$;
Operations That Lock Tables
| Operation | Lock Type | Duration | Safe Alternative |
|---|---|---|---|
| `CREATE INDEX` | ShareLock (blocks writes) | Minutes-hours | `CREATE INDEX CONCURRENTLY` |
| `ALTER TABLE ADD COLUMN` (with default, pre-PG11) | AccessExclusiveLock | Minutes-hours | Upgrade to PG11+ or add nullable first |
| `ALTER TABLE ... SET NOT NULL` | AccessExclusiveLock | Full-table scan under lock | Add CHECK constraint first, then validate |
| `ALTER TABLE ... SET DATA TYPE` | AccessExclusiveLock | Rewrites table | Add new column, dual-write, drop old |
| `UPDATE` on full table | RowExclusiveLock | Hours on large tables | Batch updates with LIMIT + loop |
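The "add CHECK constraint first, then validate" row deserves spelling out. On PostgreSQL 12+, `SET NOT NULL` skips its full-table scan when a validated `CHECK (col IS NOT NULL)` constraint already exists, and `VALIDATE CONSTRAINT` takes only a ShareUpdateExclusiveLock while it scans. A helper that emits the sequence (the statement text is the point; run each as its own migration):

```python
def safe_not_null(table: str, column: str) -> list[str]:
    """Emit the lock-friendly sequence for adding NOT NULL on PostgreSQL 12+."""
    c = f"{column}_not_null"
    return [
        # NOT VALID: brief AccessExclusiveLock, no table scan
        f"ALTER TABLE {table} ADD CONSTRAINT {c} CHECK ({column} IS NOT NULL) NOT VALID;",
        # Scans the table without blocking reads or writes
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {c};",
        # Uses the validated constraint to skip the scan (PG12+)
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;",
        # The CHECK constraint is now redundant
        f"ALTER TABLE {table} DROP CONSTRAINT {c};",
    ]
```

The brief exclusive locks on the first and third statements still need to be acquired, so set a `lock_timeout` and retry rather than queueing behind a long-running query.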
Step 5: Online DDL Tools for MySQL
MySQL has more table-locking operations than PostgreSQL. Use online DDL tools for large tables:
# pt-online-schema-change (Percona Toolkit)
pt-online-schema-change \
  --alter "ADD COLUMN display_name VARCHAR(255)" \
  --execute \
  D=mydb,t=users
# gh-ost (GitHub's Online Schema Migration)
gh-ost \
  --alter "ADD COLUMN display_name VARCHAR(255)" \
  --database=mydb \
  --table=users \
  --execute
Migration Safety Checklist
- Migration is backward compatible with currently deployed code
- Tested on production-size dataset (timing verified, < 5 min goal)
- Rollback migration written AND tested before deploying forward
- No table-level locks on tables > 100K rows
- `CREATE INDEX CONCURRENTLY` used (not `CREATE INDEX`)
- Batch updates used for large backfills (not single UPDATE)
- Schema and data changes in separate migrations
- Migration reviewed by DBA or senior engineer
- Deployed during low-traffic window (if lock-free isn’t possible)
- Monitoring in place during and after migration (query latency, lock waits)
- Verification queries confirm data integrity post-migration
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For database consulting, visit garnetgrid.com. :::