Database Backup & Recovery Engineering
Design database backup strategies. Covers backup types, point-in-time recovery, RPO/RTO, WAL archiving, cross-region replication, backup testing, and disaster recovery for databases.
Database backups are like insurance: everyone knows they need them, nobody checks if they work until it’s too late. The most dangerous assumption in database engineering is “our backups are good.” Untested backups are not backups — they’re hope. This guide covers how to design backup strategies that actually work when you need them.
Backup Types
| Type | Speed | Size | Recovery | Best For |
|---|---|---|---|---|
| Full backup | Slow | Large | Fast (self-contained) | Weekly baseline |
| Incremental | Fast | Small | Slow (needs full + all increments) | Daily between fulls |
| Differential | Medium | Medium | Medium (needs full + latest diff) | Daily alternative to incremental |
| Continuous (WAL/binlog) | Continuous | Small | Point-in-time recovery | Production databases |
| Logical (pg_dump) | Slow | Medium | Object-level restore | Schema migration, selective restore |
PostgreSQL Backup Architecture
Continuous WAL Archiving
PostgreSQL ──── WAL files ──────▶ S3/GCS
│ │
│ ▼
│ Point-in-time recovery
│ (restore to any second)
│
├── pg_basebackup (weekly)
│ └── Full physical backup → S3
│
└── pg_dump (daily, critical schemas)
└── Logical backup → S3
WAL Archiving Configuration
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/wal/%f'
archive_timeout = 60 # Archive every 60 seconds minimum
RPO / RTO Planning
| Tier | Data | RPO | RTO | Strategy |
|---|---|---|---|---|
| Critical | Financial, orders | 0 (zero data loss) | < 1 hour | Synchronous replication + WAL archiving |
| Important | User data, content | < 1 hour | < 4 hours | Async replication + hourly backups |
| Standard | Analytics, logs | < 24 hours | < 24 hours | Daily backups |
| Archival | Historical data | < 1 week | < 1 week | Weekly backups to cold storage |
Backup Testing
#!/bin/bash
# Monthly backup restore test
# 1. Restore latest backup to test environment
pg_restore --dbname=restore_test latest_backup.dump
# 2. Run consistency checks
psql restore_test -c "
SELECT table_name,
(xpath('/row/cnt/text()',
query_to_xml('SELECT count(*) as cnt FROM ' || table_name, false, false, ''))
)[1]::text::int as row_count
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;
"
# 3. Compare row counts with production
# 4. Run application smoke tests against restored database
# 5. Log results and alert on discrepancies
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Never testing restores | Discover backup is corrupted during disaster | Monthly restore tests, automated validation |
| Backups on same server | Server failure = data AND backup lost | Off-site backups (different region/cloud) |
| No point-in-time recovery | Can only restore to last backup (hours ago) | WAL archiving for continuous recovery |
| No backup monitoring | Backup silently fails for weeks | Alert on backup failures and missing backups |
| Logical backup only | Slow restore for large databases | Physical backups (pg_basebackup) for speed |
Checklist
- Backup strategy defined per data tier (RPO/RTO)
- WAL archiving configured for point-in-time recovery
- Physical backups (pg_basebackup) weekly, offsite
- Logical backups (pg_dump) for critical schemas
- Cross-region backup replication
- Backup encryption at rest
- Restore testing: monthly, automated
- Backup monitoring: alerts on failure or missed schedule
- Retention policy: how long to keep backups
- Recovery runbook documented and tested
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For database consulting, visit garnetgrid.com. :::