How to Manage Database Schema Migrations Safely
Deploy database changes without downtime. Covers blue-green migrations, backward compatibility, rollback strategies, and migration testing.
Schema migrations are the scariest deploys. A bad migration can take down production, corrupt data, or lock tables for hours. Application deployments can be rolled back in seconds; schema changes cannot. Ship a bad schema change and you're looking at a multi-hour incident, potential data loss, and a post-mortem that nobody wants to write.
This guide covers how to make schema migrations boring and safe: the golden rules, the expand-and-contract pattern, migration testing, lock-free operations, and the rollback strategies that keep production running.
The Golden Rules
- Every migration must be backward compatible — old code must work with the new schema, because during deployment both versions run simultaneously
- Never rename or drop columns in one step — use the expand-and-contract pattern (three deploys, not one)
- Always test migrations against a production-size dataset — a migration that takes 200ms on your dev laptop can lock tables for 45 minutes on 50 million rows
- Have a rollback plan before you start — write the rollback migration and test it before deploying forward
- Never mix schema changes and data changes in one migration — separate DDL (structure) from DML (data) for cleaner rollbacks
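Several of these rules can be enforced mechanically in code review or CI. Below is a minimal sketch of a migration lint check in Python; the pattern list and function name are illustrative, not a real tool, and the patterns are deliberately naive:

```python
import re

# Hypothetical pre-merge check: flag migration scripts that violate
# the golden rules above. Patterns are illustrative, not exhaustive.
DANGEROUS = [
    (r"\bRENAME\s+COLUMN\b", "rename in one step -- use expand-and-contract"),
    (r"\bDROP\s+COLUMN\b",   "column drop -- require soft-delete steps first"),
    (r"\bDROP\s+TABLE\b",    "table drop -- require soft-delete steps first"),
]

def lint_migration(sql: str) -> list[str]:
    """Return a list of golden-rule violations found in a migration script."""
    warnings = [msg for pat, msg in DANGEROUS
                if re.search(pat, sql, re.IGNORECASE)]
    # Mixing DDL (structure) and DML (data) complicates rollback
    has_ddl = re.search(r"\b(ALTER|CREATE|DROP)\b", sql, re.IGNORECASE)
    has_dml = re.search(r"\b(UPDATE|INSERT|DELETE)\b", sql, re.IGNORECASE)
    if has_ddl and has_dml:
        warnings.append("mixes schema (DDL) and data (DML) changes")
    return warnings
```

A check like this won't catch everything (string matching has no idea about table size or lock behavior), but it makes the obvious violations impossible to merge silently.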
Step 1: Expand and Contract Pattern
The expand-and-contract pattern is the safest way to make schema changes. It separates the change into three deploys, ensuring backward compatibility at every step.
Adding a Column (Safe — Single Deploy)
-- Step 1: Add column as nullable (instant operation in PostgreSQL 11+)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Step 2: Backfill data in batches (not a single UPDATE on millions of rows)
UPDATE users SET display_name = first_name || ' ' || last_name
WHERE display_name IS NULL
AND id BETWEEN 1 AND 100000;
-- Repeat for next batches...
-- Step 3: Once backfill is complete and the app writes to the new column,
-- set the default first (so concurrent inserts can't violate the constraint),
-- then add NOT NULL
ALTER TABLE users ALTER COLUMN display_name SET DEFAULT '';
ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;
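The batched backfill in step 2 needs batch boundaries. A small helper to plan them and render each batch's statement, assuming a dense integer `id` column (function names are illustrative):

```python
def batch_ranges(max_id: int, batch_size: int = 100_000):
    """Yield inclusive (start_id, end_id) ranges for batched backfill UPDATEs."""
    start = 1
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield (start, end)
        start = end + 1

def backfill_statement(start_id: int, end_id: int) -> str:
    """Render one batch's UPDATE, mirroring the SQL above."""
    return (
        "UPDATE users SET display_name = first_name || ' ' || last_name "
        f"WHERE display_name IS NULL AND id BETWEEN {start_id} AND {end_id};"
    )
```

Run each batch as its own transaction with a short pause between batches, so row locks are released and replicas can keep up.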
Renaming a Column (Dangerous — Requires Three Deploys)
-- ❌ WRONG: This breaks all existing code INSTANTLY
ALTER TABLE users RENAME COLUMN name TO full_name;
-- ✅ RIGHT: Expand-and-contract over 3 deploys
-- Deploy 1 (Schema): Add new column, start dual-write
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
UPDATE users SET full_name = name; -- Backfill (batch this on large tables)
-- App code: Write to BOTH 'name' and 'full_name', read from 'full_name'
-- Create trigger to keep columns in sync during transition:
CREATE OR REPLACE FUNCTION sync_name_columns() RETURNS TRIGGER AS $$
BEGIN
    IF NEW.name IS DISTINCT FROM OLD.name THEN
        NEW.full_name := NEW.name;
    END IF;
    IF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
        NEW.name := NEW.full_name;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_name
    BEFORE UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION sync_name_columns();
-- Deploy 2 (Code): Stop reading 'name', read/write only 'full_name'
-- Remove trigger, stop writing to 'name'
-- Deploy 3 (Cleanup): Drop old column after verification period (1-2 weeks)
ALTER TABLE users DROP COLUMN name;
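During Deploy 1 the application itself must dual-write. A minimal sketch of what that looks like at the model layer; the `User` class and its backing dict are hypothetical stand-ins for your ORM:

```python
class User:
    """Hypothetical model illustrating Deploy 1 dual-write semantics."""

    def __init__(self, row: dict):
        self._row = row  # stand-in for the database row

    def set_name(self, value: str) -> None:
        # Dual-write: keep the legacy column populated until Deploy 2
        self._row["full_name"] = value
        self._row["name"] = value

    def get_name(self) -> str:
        # Read the new column; fall back to legacy for not-yet-backfilled rows
        if self._row["full_name"] is not None:
            return self._row["full_name"]
        return self._row["name"]
```

The fallback read is what buys safety: rows the backfill hasn't reached yet still resolve correctly, so the backfill and the code deploy don't have to be ordered precisely.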
Dropping a Table or Column (Requires Soft-Delete First)
| Step | Deploy | What Happens |
|---|---|---|
| 1. Stop writing | Code deploy | App no longer writes to column/table |
| 2. Stop reading | Code deploy | App no longer reads from column/table |
| 3. Wait | 1-2 weeks | Verify nothing breaks, check logs for access |
| 4. Drop | Migration deploy | Remove the column/table |
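Step 3's "check logs for access" can be partially automated: export recent statement text (for example from `pg_stat_statements`) and scan it for the column slated for removal. A naive sketch; substring matching produces false positives, so treat this as a smoke test, not proof:

```python
def find_column_references(statements: list[str], table: str, column: str) -> list[str]:
    """Return statements that mention both the table and the doomed column."""
    t, c = table.lower(), column.lower()
    return [s for s in statements if t in s.lower() and c in s.lower()]
```

An empty result doesn't guarantee safety (dynamic SQL, `SELECT *`, and external consumers won't show up), but a non-empty result is a hard stop: something is still reading the column.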
Step 2: Migration Framework
Alembic (Python / SQLAlchemy)
# alembic/versions/001_add_display_name.py
"""Add display_name column"""
from alembic import op
import sqlalchemy as sa

revision = '001'
down_revision = None

def upgrade():
    op.add_column('users',
        sa.Column('display_name', sa.String(255), nullable=True)
    )

def downgrade():
    op.drop_column('users', 'display_name')

# Migration lifecycle commands
alembic upgrade head    # Apply all pending migrations
alembic downgrade -1    # Roll back the last migration
alembic history         # Show migration history
alembic current         # Show current version
alembic revision --autogenerate -m "add display_name"  # Generate from model diff
Other Migration Frameworks
| Framework | Language | Key Features |
|---|---|---|
| Alembic | Python | Auto-generate from SQLAlchemy models, branching |
| Flyway | Java/JVM | SQL-based, versioned, no ORM dependency |
| Liquibase | Java/JVM | XML/YAML/SQL, rollback support, diff |
| Knex.js | Node.js | JavaScript migrations, Promise-based |
| golang-migrate | Go | CLI + library, database-agnostic |
| Prisma Migrate | TypeScript | Schema-first, shadow database for diffing |
| ActiveRecord | Ruby | Mature, reversible migrations, Rails native |
Step 3: Testing Migrations
Never deploy a migration without testing it against production-size data. What takes milliseconds on 1,000 rows can take hours on 100 million.
# Estimate migration time on production-size data
import time
import psycopg2
def test_migration_timing(conn, migration_sql, table_name):
    """Run the migration on a sample and extrapolate to the full dataset"""
    cur = conn.cursor()

    # Get row count
    cur.execute(f"SELECT COUNT(*) FROM {table_name}")
    row_count = cur.fetchone()[0]

    # Time the migration on a 100K-row sample
    sample_size = min(100_000, row_count)
    start = time.time()
    cur.execute(f"""
        CREATE TEMP TABLE test_migration AS
        SELECT * FROM {table_name} LIMIT {sample_size}
    """)
    cur.execute(migration_sql.replace(table_name, "test_migration"))
    elapsed = time.time() - start

    # Extrapolate linearly to the full table
    estimated_total = (elapsed / sample_size) * row_count
    print(f"Rows: {row_count:,}")
    print(f"Sample time: {elapsed:.2f}s ({sample_size:,} rows)")
    print(f"Estimated full: {estimated_total:.0f}s ({estimated_total/60:.1f} min)")

    if estimated_total > 300:  # > 5 minutes
        print("⚠️ WARNING: Consider batched migration or online DDL tool")
    if estimated_total > 3600:  # > 1 hour
        print("🚨 CRITICAL: Must use background migration with batching")

    conn.rollback()  # Don't actually apply
What to Test
| Test | Why | How |
|---|---|---|
| Timing on production-size data | Catch 45-minute table locks | Sample + extrapolate |
| Backward compatibility | Old code must work | Run old test suite against new schema |
| Forward compatibility | Rollback must work | Apply, then rollback, verify data integrity |
| Lock contention | Exclusive locks block all queries | Monitor pg_stat_activity during test |
| Data integrity | Backfills must be correct | COUNT, SUM, spot-check after migration |
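The "data integrity" row above (COUNT, SUM, spot-checks) can be wrapped in a small helper: snapshot the aggregates before the migration, re-run the same queries after, and diff the two. A sketch, where the metric names are whatever aggregate queries you choose to run:

```python
def verify_snapshots(before: dict, after: dict, tolerance: float = 0.0) -> list[str]:
    """Compare pre- and post-migration aggregate snapshots; return mismatches."""
    problems = []
    for metric, expected in before.items():
        actual = after.get(metric)
        if actual is None or abs(actual - expected) > tolerance:
            problems.append(f"{metric}: expected {expected}, got {actual}")
    return problems
```

A non-zero `tolerance` is useful for tables that take live writes during the migration, where exact counts are expected to drift slightly.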
Step 4: Lock-Free Migrations
Large tables require special techniques to avoid blocking production queries.
-- PostgreSQL: CREATE INDEX CONCURRENTLY (no table lock)
-- Regular CREATE INDEX locks the table for writes!
CREATE INDEX CONCURRENTLY idx_orders_customer
ON orders (customer_id);
-- Takes longer, but allows reads AND writes during creation
-- Add column with default (PostgreSQL 11+ is INSTANT — no rewrite)
ALTER TABLE orders ADD COLUMN priority INTEGER DEFAULT 0;
-- Pre-PG11 this would rewrite the entire table!
-- Batch updates instead of single UPDATE on millions of rows
DO $$
DECLARE
    batch_size INT := 10000;
    rows_updated INT;
    total_updated INT := 0;
BEGIN
    LOOP
        UPDATE users
        SET display_name = first_name || ' ' || last_name
        WHERE display_name IS NULL
          AND id IN (
              SELECT id FROM users
              WHERE display_name IS NULL
              LIMIT batch_size
              FOR UPDATE SKIP LOCKED  -- Skip rows being modified by others
          );
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        total_updated := total_updated + rows_updated;
        EXIT WHEN rows_updated = 0;
        COMMIT;  -- Release row locks per batch (PostgreSQL 11+ allows COMMIT in DO blocks)
        PERFORM pg_sleep(0.1);  -- Brief pause to reduce CPU/IO pressure
        RAISE NOTICE 'Updated % rows (total: %)', rows_updated, total_updated;
    END LOOP;
END $$;
Operations That Lock Tables
| Operation | Lock Type | Duration | Safe Alternative |
|---|---|---|---|
| `CREATE INDEX` | ShareLock (blocks writes) | Minutes-hours | `CREATE INDEX CONCURRENTLY` |
| `ALTER TABLE ADD COLUMN` (with default, pre-PG11) | AccessExclusiveLock | Minutes-hours | Upgrade to PG11+ or add nullable first |
| `ALTER TABLE ... SET NOT NULL` | AccessExclusiveLock | Full-table scan under lock | Add CHECK constraint first, then validate |
| `ALTER TABLE ... SET DATA TYPE` | AccessExclusiveLock | Rewrites table | Add new column, dual-write, drop old |
| `UPDATE` on full table | RowExclusiveLock | Hours on large tables | Batch updates with LIMIT + loop |
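The "add CHECK constraint first, then validate" row deserves spelling out. On PostgreSQL 12+, `SET NOT NULL` skips its full-table scan when a validated `CHECK (col IS NOT NULL)` constraint already exists, and `VALIDATE CONSTRAINT` takes only a ShareUpdateExclusiveLock while it scans. A helper that emits the sequence (the statement text is the point; run each as its own migration):

```python
def safe_not_null(table: str, column: str) -> list[str]:
    """Emit the lock-friendly sequence for adding NOT NULL on PostgreSQL 12+."""
    c = f"{column}_not_null"
    return [
        # NOT VALID: brief AccessExclusiveLock, no table scan
        f"ALTER TABLE {table} ADD CONSTRAINT {c} CHECK ({column} IS NOT NULL) NOT VALID;",
        # Scans the table without blocking reads or writes
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {c};",
        # Uses the validated constraint to skip the scan (PG12+)
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;",
        # The CHECK constraint is now redundant
        f"ALTER TABLE {table} DROP CONSTRAINT {c};",
    ]
```

The brief exclusive locks on the first and third statements still need to be acquired, so set a `lock_timeout` and retry rather than queueing behind a long-running query.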
Step 5: Online DDL Tools for MySQL
MySQL has more table-locking operations than PostgreSQL. Use online DDL tools for large tables:
# pt-online-schema-change (Percona Toolkit)
pt-online-schema-change \
  --alter "ADD COLUMN display_name VARCHAR(255)" \
  --execute \
  D=mydb,t=users
# gh-ost (GitHub's Online Schema Migration)
gh-ost \
  --alter "ADD COLUMN display_name VARCHAR(255)" \
  --database=mydb \
  --table=users \
  --execute
Migration Safety Checklist
- Migration is backward compatible with currently deployed code
- Tested on production-size dataset (timing verified, < 5 min goal)
- Rollback migration written AND tested before deploying forward
- No table-level locks on tables > 100K rows
- `CREATE INDEX CONCURRENTLY` used (not `CREATE INDEX`)
- Batch updates used for large backfills (not single UPDATE)
- Schema and data changes in separate migrations
- Migration reviewed by DBA or senior engineer
- Deployed during low-traffic window (if lock-free isn’t possible)
- Monitoring in place during and after migration (query latency, lock waits)
- Verification queries confirm data integrity post-migration
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For database consulting, visit garnetgrid.com. :::