Database Migration Strategies: Changing Schemas Without Breaking Production
Migrate database schemas safely in production without downtime, data loss, or broken deployments. Covers expand-and-contract, backward-compatible migrations, zero-downtime techniques, rollback strategies, data migration patterns, and the deployment workflow that lets you change your schema while traffic is flowing.
Database migrations are the most dangerous part of a deployment. Code changes can be rolled back in seconds. Schema changes can take hours to undo, and some — like dropping a column that had data — cannot be undone at all. The challenge is that your database schema needs to evolve as your application evolves, but you cannot take your application offline every time you need to add a column.
The solution is a set of patterns that make every schema change backward-compatible, so the old code and new code both work with the database at every step of the migration.
The Fundamental Problem
The dangerous moment:
        v1 code                Database                v2 code
─────────────────────────────────────────────────────────────────
SELECT name,               users table             SELECT name,
       email                 id                            email,
FROM users                   name                          phone   ← NEW
                             email                  FROM users
If you deploy v2 code BEFORE adding the phone column → errors
If you add phone column BEFORE deploying v2 code → safe (column exists but unused)
If you drop email column BEFORE deploying v1 rollback code → disaster
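The deploy-order hazard above can be reproduced in miniature with an in-memory SQLite database (SQLite stands in for the production database; table and data are illustrative):

```python
import sqlite3

# Simulate the schema as it exists BEFORE the migration runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'ada@example.com')")

# v2 code deployed BEFORE the migration: query references a missing column.
try:
    conn.execute("SELECT name, email, phone FROM users")
except sqlite3.OperationalError as e:
    print(f"v2 before migration: {e}")  # no such column: phone

# Migration first (expand), then v2 code: the column exists, old rows read NULL.
conn.execute("ALTER TABLE users ADD COLUMN phone TEXT")
rows = conn.execute("SELECT name, email, phone FROM users").fetchall()
print(rows)  # [('Ada', 'ada@example.com', None)]
```

The same statement that crashes in the first ordering is harmless in the second, which is why migrations always ship before the code that depends on them.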
Expand-and-Contract Pattern
The core technique for zero-downtime migrations:
Phase 1: EXPAND (add new structure)
┌──────────────────────────────────────┐
│ Add new column (nullable) │
│ Add new table │
│ Add new index (concurrently) │
│ │
│ Old code: still works (ignores new) │
│ New code: can use new structure │
└──────────────────────────────────────┘
│
▼
Phase 2: MIGRATE (move data)
┌──────────────────────────────────────┐
│ Backfill new columns │
│ Dual-write to old and new │
│ Verify data consistency │
└──────────────────────────────────────┘
│
▼
Phase 3: CONTRACT (remove old structure)
┌──────────────────────────────────────┐
│ Remove old code paths │
│ Drop old columns/tables │
│ Remove dual-write logic │
│ │
│ Only after all code uses new path │
└──────────────────────────────────────┘
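The dual-write step in Phase 2 can be sketched as a function that keeps both columns in sync (the `save_user` helper, the dict-as-row representation, and the column names are illustrative, not a real ORM API):

```python
def save_user(row, display_name):
    """Dual-write: keep old and new columns in sync during MIGRATE.

    Old code still reads `name`; new code reads `full_name`. Writing
    both means either code path sees consistent data until CONTRACT.
    """
    row["name"] = display_name       # old column, still read by v1 code
    row["full_name"] = display_name  # new column, read by v2 code
    return row

user = save_user({"id": 1}, "Ada Lovelace")
```

Once every reader has switched to `full_name` and a verification pass confirms the columns agree, the `row["name"]` write is the line you delete in Phase 3.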
Common Migration Patterns
Adding a Column
-- ✅ Safe: add nullable column (no lock, no rewrite)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- ⚠️ Unsafe before PostgreSQL 11: adding a column with a DEFAULT
-- rewrote the entire table under an exclusive lock.
-- In PostgreSQL 11+, non-volatile defaults are stored as metadata
-- (no rewrite), so this is safe.
ALTER TABLE users ADD COLUMN phone VARCHAR(20) DEFAULT '';
-- ❌ Unsafe: add NOT NULL column without default
ALTER TABLE users ADD COLUMN phone VARCHAR(20) NOT NULL;
-- Fails if table has existing rows
Renaming a Column
-- Never rename directly. Use expand-and-contract:
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Step 2: Backfill data
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- (do in batches for large tables)
-- Step 3: Deploy code using full_name (dual-read from both)
-- Step 4: Stop writing to old column
-- Step 5: Drop old column (weeks later, after verification)
ALTER TABLE users DROP COLUMN name;
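Step 3's dual-read can be sketched as a fallback: prefer the new column, fall back to the old one until the backfill completes (the function name is illustrative; `row` is a plain dict standing in for a database row):

```python
def display_name(row):
    """Dual-read during the rename: new column first, legacy fallback.

    Rows the backfill has not reached yet still have full_name = NULL,
    so the old `name` column keeps serving them.
    """
    return row.get("full_name") or row.get("name")

display_name({"name": "Ada", "full_name": None})            # → 'Ada'
display_name({"name": "Ada", "full_name": "Ada Lovelace"})  # → 'Ada Lovelace'
```

After Step 5 drops the old column, the fallback becomes dead code and can be removed along with it.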
Adding an Index
-- ❌ Unsafe: blocks writes to the table for the duration of the build
CREATE INDEX idx_users_email ON users (email);
-- ✅ Safe: build without blocking (PostgreSQL)
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);
-- Takes longer but does not block reads or writes
-- Cannot run inside a transaction
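Because `CONCURRENTLY` refuses to run inside a transaction, migration code typically builds the statement separately and issues it on an autocommit connection. A minimal sketch (the helper name and `idx_<table>_<column>` naming convention are assumptions):

```python
def concurrent_index_sql(table: str, column: str) -> str:
    """Build a non-blocking index statement (PostgreSQL syntax).

    The caller must execute this on an autocommit connection:
    CREATE INDEX CONCURRENTLY fails inside a transaction block,
    which is why many migration tools need a special flag or a
    dedicated non-transactional migration for it.
    """
    index_name = f"idx_{table}_{column}"
    return f"CREATE INDEX CONCURRENTLY {index_name} ON {table} ({column})"

concurrent_index_sql("users", "email")
# → 'CREATE INDEX CONCURRENTLY idx_users_email ON users (email)'
```

One operational caveat: if a concurrent build fails partway, PostgreSQL leaves behind an INVALID index that must be dropped before retrying.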
Data Migration Best Practices
import logging
import time

logger = logging.getLogger(__name__)

# Batch processing for large tables
# (db is an open connection whose execute() returns a result with
#  rowcount and fetchall(), e.g. a SQLAlchemy connection)
def backfill_phone_numbers(db, batch_size=1000):
    """Backfill new phone column from legacy contact_info JSON."""
    last_id = 0
    total_updated = 0
    while True:
        # Walk the table in id order; short transactions avoid long locks
        rows = db.execute(
            "SELECT id FROM users WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        batch_ids = [r[0] for r in rows]
        result = db.execute("""
            UPDATE users
            SET phone = contact_info->>'phone'
            WHERE id = ANY(%s)
              AND phone IS NULL
              AND contact_info->>'phone' IS NOT NULL
        """, (batch_ids,))
        total_updated += result.rowcount
        # Advance by the highest id actually seen, not by batch_size:
        # id sequences have gaps, and a batch where every row is filtered
        # out (rowcount == 0) must not end the loop early
        last_id = batch_ids[-1]
        # Log progress
        logger.info("Backfilled %d rows so far", total_updated)
        # Small sleep to reduce database load
        time.sleep(0.1)
    logger.info("Backfill complete: %d rows updated", total_updated)
Migration Safety Checklist
| Check | Why |
|---|---|
| Migration is backward-compatible | Old code must still work after migration runs |
| Large table operations are batched | Avoid locking the table for minutes/hours |
| Indexes created CONCURRENTLY | Prevent table locks during index creation |
| Migration has been tested against production-size data | 1M rows behaves differently than 100 |
| Rollback plan exists | What happens if this migration fails halfway? |
| Monitoring in place | Watch for lock waits, slow queries after migration |
Migration Tools
| Tool | Language | Key Feature |
|---|---|---|
| Alembic | Python (SQLAlchemy) | Auto-generation from models |
| Flyway | Java / Any | SQL-based, versioned migrations |
| Liquibase | Java / Any | XML/YAML/SQL, database-agnostic |
| golang-migrate | Go | Simple CLI, SQL files |
| Prisma Migrate | TypeScript/Node | Schema-first, preview changes |
| Django Migrations | Python (Django) | Auto-detection from models |
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Migration and code in same deploy | If migration fails, code is also broken | Deploy migration first, code second |
| Manual schema changes | Untracked, unreproducible | All changes through migration files |
| Irreversible migrations without backup | Cannot undo if something goes wrong | Always test rollback, backup before destructive changes |
| One big migration | Hours to run, hard to debug failures | Break into small, incremental steps |
| NOT NULL without default | Fails on existing data | Add nullable first, backfill, then add constraint |
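The fix for the last anti-pattern is itself a small expand-and-contract, which can be written down as an ordered sequence of statements (table and column names are illustrative; the backfill would run in batches as shown earlier):

```python
# Safely introducing a NOT NULL column, in three deployable steps.
SAFE_NOT_NULL_STEPS = [
    # 1. Expand: nullable column, no table rewrite, old code unaffected.
    "ALTER TABLE users ADD COLUMN phone VARCHAR(20)",
    # 2. Migrate: backfill existing rows (batched on a real table).
    "UPDATE users SET phone = '' WHERE phone IS NULL",
    # 3. Contract: enforce the constraint only once no NULLs remain.
    "ALTER TABLE users ALTER COLUMN phone SET NOT NULL",
]
```

Each step is independently deployable and independently reversible, which is exactly what the one-big-migration anti-pattern gives up.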
Implementation Checklist
- Use expand-and-contract for all schema changes in production
- Deploy migrations separately from application code
- Add columns as nullable first, backfill data, then add constraints
- Create indexes with CONCURRENTLY to avoid table locks
- Batch large data migrations (1000-10000 rows at a time)
- Test migrations against production-size datasets before deploying
- Maintain a rollback script for every migration
- Never rename or drop columns in the same deploy as the code change
- Monitor database lock waits and query performance during migrations
- Version-control all migration files alongside application code