Verified by Garnet Grid

Database Backup & Recovery Engineering

Design database backup strategies. Covers backup types, point-in-time recovery, RPO/RTO, WAL archiving, cross-region replication, backup testing, and disaster recovery for databases.

Database backups are like insurance: everyone knows they need them, nobody checks if they work until it’s too late. The most dangerous assumption in database engineering is “our backups are good.” Untested backups are not backups — they’re hope. This guide covers how to design backup strategies that actually work when you need them.

Backup Types

Type	Speed	Size	Recovery	Best For
Full backup	Slow	Large	Fast (self-contained)	Weekly baseline
Incremental	Fast	Small	Slow (needs full + all increments)	Daily between fulls
Differential	Medium	Medium	Medium (needs full + latest diff)	Daily alternative to incremental
Continuous (WAL/binlog)	Continuous	Small	Point-in-time recovery	Production databases
Logical (pg_dump)	Slow	Medium	Object-level restore	Schema migration, selective restore

PostgreSQL Backup Architecture

                    Continuous WAL Archiving
PostgreSQL ──── WAL files ──────▶ S3/GCS
     │                              │
     │                              ▼
     │                    Point-in-time recovery
     │                    (restore to any second)
     │
     ├── pg_basebackup (weekly)
     │   └── Full physical backup → S3
     │
     └── pg_dump (daily, critical schemas)
         └── Logical backup → S3

WAL Archiving Configuration

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://backups/wal/%f'
archive_timeout = 60  # Archive every 60 seconds minimum

RPO / RTO Planning

Tier	Data	RPO	RTO	Strategy
Critical	Financial, orders	0 (zero data loss)	< 1 hour	Synchronous replication + WAL archiving
Important	User data, content	< 1 hour	< 4 hours	Async replication + hourly backups
Standard	Analytics, logs	< 24 hours	< 24 hours	Daily backups
Archival	Historical data	< 1 week	< 1 week	Weekly backups to cold storage

Backup Testing

#!/bin/bash
# Monthly backup restore test

# 1. Restore latest backup to test environment
pg_restore --dbname=restore_test latest_backup.dump

# 2. Run consistency checks
psql restore_test -c "
  SELECT table_name, 
         (xpath('/row/cnt/text()', 
          query_to_xml('SELECT count(*) as cnt FROM ' || table_name, false, false, ''))
         )[1]::text::int as row_count
  FROM information_schema.tables 
  WHERE table_schema = 'public'
  ORDER BY table_name;
"

# 3. Compare row counts with production
# 4. Run application smoke tests against restored database
# 5. Log results and alert on discrepancies

Anti-Patterns

Anti-Pattern	Problem	Fix
Never testing restores	Discover backup is corrupted during disaster	Monthly restore tests, automated validation
Backups on same server	Server failure = data AND backup lost	Off-site backups (different region/cloud)
No point-in-time recovery	Can only restore to last backup (hours ago)	WAL archiving for continuous recovery
No backup monitoring	Backup silently fails for weeks	Alert on backup failures and missing backups
Logical backup only	Slow restore for large databases	Physical backups (pg_basebackup) for speed

Checklist

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For database consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev

Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.

View Full Profile →