A disaster recovery plan that hasn’t been tested isn’t a plan — it’s a theory. 37% of organizations fail their first real DR activation because they never practiced. The backup exists but nobody knows the restore procedure. The failover is configured but DNS TTL is 24 hours.
This guide covers how to define recovery objectives, implement backup and replication strategies, automate failover, calculate DR costs, and test your plan before you actually need it.
## Step 1: Define RTO and RPO Targets
| Term | Definition | Business Question | Technical Implication |
|---|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" | Determines failover strategy |
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" | Determines backup frequency |
### Tiers by Business Criticality
| Tier | Systems | RTO | RPO | Strategy | Cost (% of prod) |
|---|---|---|---|---|---|
| Tier 1: Mission Critical | Payment processing, auth, core API | < 15 min | 0 (zero data loss) | Active-Active / Hot Standby | 80-100% |
| Tier 2: Business Critical | CRM, ERP, dashboards | < 1 hour | < 15 min | Warm Standby | 30-50% |
| Tier 3: Essential | Email, file shares, internal tools | < 4 hours | < 1 hour | Pilot Light | 10-20% |
| Tier 4: Non-Critical | Dev environments, archives | < 24 hours | < 24 hours | Backup & Restore | 5-10% |
### How to Classify Systems

```text
Does the system directly generate revenue?
├── YES → Tier 1 (payment, checkout, core product)
└── NO → Continue
Does the system affect customer experience?
├── YES → Tier 2 (CRM, support tools, dashboards)
└── NO → Continue
Do employees depend on it daily?
├── YES → Tier 3 (email, file storage, internal apps)
└── NO → Tier 4 (dev tools, archives)
```
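The decision tree above can be sketched as a small shell helper. The function name and yes/no arguments are illustrative, not part of any real tool:

```shell
# Hypothetical helper mirroring the classification tree.
# Arguments: generates_revenue customer_facing daily_internal (each "yes"/"no")
classify_tier() {
  if [ "$1" = "yes" ]; then
    echo "Tier 1"            # directly generates revenue
  elif [ "$2" = "yes" ]; then
    echo "Tier 2"            # affects customer experience
  elif [ "$3" = "yes" ]; then
    echo "Tier 3"            # employees depend on it daily
  else
    echo "Tier 4"            # everything else
  fi
}

classify_tier yes no no    # payment service  → Tier 1
classify_tier no no yes    # internal email   → Tier 3
```

Encoding the tree as code keeps classification consistent when you inventory dozens of systems, and it can feed a script that tags resources with their tier.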
## Step 2: Implement Backup Strategy
### Database Backups

```bash
# AWS — automated RDS snapshots
aws rds modify-db-instance \
  --db-instance-identifier production-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Cross-region copy for DR (run in the destination region)
aws rds copy-db-snapshot \
  --source-db-snapshot-arn arn:aws:rds:us-east-1:123456789:snapshot:prod-snap \
  --target-db-snapshot-identifier prod-snap-dr \
  --source-region us-east-1 \
  --region us-west-2

# Verify backup integrity monthly
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "dr-verify-$(date +%Y%m%d)" \
  --db-snapshot-identifier latest-production-snap \
  --region us-west-2
# Run integrity checks, then delete the test instance
```
### Object Storage Replication

```bash
# S3 cross-region replication
aws s3api put-bucket-replication \
  --bucket production-data \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {
        "Bucket": "arn:aws:s3:::production-data-dr-west",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'
```
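One prerequisite that is easy to miss: S3 rejects a replication configuration unless versioning is enabled on both the source and destination buckets. A sketch using the same example bucket names:

```bash
# Versioning must be on BEFORE put-bucket-replication will succeed
aws s3api put-bucket-versioning \
  --bucket production-data \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning \
  --bucket production-data-dr-west \
  --versioning-configuration Status=Enabled
```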
### Backup Strategy Matrix

| Data Type | Method | Frequency | Retention | Cross-Region? |
|---|---|---|---|---|
| Production database | Snapshots + WAL | Continuous + daily | 30 days | ✅ Required |
| User uploads (S3) | Cross-region replication | Real-time | Indefinite | ✅ Required |
| Configuration (IaC) | Git repository | Every commit | Indefinite | ✅ GitHub/GitLab |
| Secrets (Vault) | Encrypted backup | Daily | 90 days | ✅ Required |
| Application logs | Log aggregator | Continuous | 30-90 days | Optional |
| DNS configuration | Export + IaC | Every change | Git history | ✅ Multiple providers |
## Step 3: Automate Failover
### DNS-Based Failover (Route 53)

```yaml
# Kubernetes multi-region failover with external-dns
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: api-failover
spec:
  endpoints:
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.1.0.50  # Primary (us-east-1)
      setIdentifier: primary
      providerSpecific:
        - name: aws/failover
          value: PRIMARY
        - name: aws/health-check-id
          value: "abc-123-health-check"
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.2.0.50  # Secondary (us-west-2)
      setIdentifier: secondary
      providerSpecific:
        - name: aws/failover
          value: SECONDARY
```
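The `aws/health-check-id` must reference an existing Route 53 health check. One way to create it; the resource path and thresholds here are illustrative defaults, not values from the example above:

```bash
# HTTPS health check against the primary region's /health endpoint.
# With RequestInterval=10 and FailureThreshold=3, Route 53 marks the
# endpoint unhealthy roughly 30 seconds after it stops responding.
aws route53 create-health-check \
  --caller-reference "api-primary-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api.yourcompany.com",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3
  }'
```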
### Critical: DNS TTL

| TTL Setting | Failover Speed | Recommendation |
|---|---|---|
| 60s (1 min) | Clients switch within 1 min | ✅ For Tier 1 services |
| 300s (5 min) | Clients switch within 5 min | ✅ Default for production |
| 3600s (1 hour) | Stuck for up to 1 hour | ❌ Too slow for DR |
| 86400s (24 hours) | Failover takes 24 hours | ❌ Never for production |
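You can verify the TTL clients actually see with `dig`; the remaining TTL is the second field of each answer line:

```bash
# Query the record and inspect its TTL (second column of the answer)
dig +noall +answer api.yourcompany.com A
# Answer format: api.yourcompany.com.  60  IN  A  10.1.0.50
```

Run this from outside your network too: some corporate and ISP resolvers cache beyond the published TTL, which extends real-world failover time.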
## Step 4: Test Your DR Plan
### Tabletop Exercise (Quarterly)

```markdown
## DR Tabletop Exercise
**Scenario:** Primary database corrupted at 2 AM Saturday.

1. Who receives the first alert? (PagerDuty? Slack?)
2. What's the escalation path? (on-call → lead → VP)
3. How do we confirm it's real vs a false positive?
4. What's the failover command? Who has access?
5. How long does failover take? (MEASURED, not estimated)
6. How do we verify DR is serving correctly?
7. What data was lost between last backup and failure?
8. How do we communicate to customers? Template ready?
9. How do we fail BACK to primary after recovery?
10. What's our post-mortem process?

**Scoring:**
- All 10 answered confidently: PASS
- 7-9 answered: CONDITIONAL PASS (fix gaps in 2 weeks)
- < 7 answered: FAIL (reschedule within 1 month)
```
### Technical DR Test (Semi-Annual)

```bash
#!/bin/bash
set -e
echo "=== DR Test Started ==="
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# 1. Restore database in DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --db-snapshot-identifier "latest-cross-region-snap" \
  --region us-west-2

# 2. Wait for restoration
aws rds wait db-instance-available \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --region us-west-2

# 3. Validate data integrity (look up the restored instance's endpoint)
ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --query 'DBInstances[0].Endpoint.Address' --output text \
  --region us-west-2)
psql -h "$ENDPOINT" -c \
  "SELECT COUNT(*) FROM orders WHERE order_date > NOW() - INTERVAL '1 day';"

# 4. Run smoke tests
curl -f https://dr-api.yourcompany.com/health || echo "FAIL: health check"

# 5. Measure recovery time
echo "Total recovery time: $((SECONDS / 60)) minutes"

# 6. Cleanup
aws rds delete-db-instance \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --skip-final-snapshot --region us-west-2
echo "=== DR Test Complete ==="
```
### Game Day (Annual)
Actually trigger a controlled failover in production:
| Step | Action | Duration |
|---|---|---|
| 1 | Announce game day (30 min warning) | 0:00 |
| 2 | Simulate primary region failure | 0:05 |
| 3 | Measure alert-to-detection time | 0:05-0:10 |
| 4 | Execute failover runbook | 0:10-0:25 |
| 5 | Verify DR region serving correctly | 0:25-0:35 |
| 6 | Restore primary region | 0:35-0:50 |
| 7 | Fail back to primary | 0:50-1:05 |
| 8 | Post-game debrief | 1:05-2:00 |
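During the game day, measure recovery from a client's point of view rather than from internal dashboards. A minimal sketch that polls the public health endpoint once per second and reports when the service answers again (URL is the example domain from earlier):

```bash
#!/bin/bash
# Start this the moment you simulate the failure; it prints the
# externally observed recovery time, which is the number your RTO cares about.
start=$(date +%s)
until curl -sf --max-time 2 https://api.yourcompany.com/health >/dev/null; do
  sleep 1
done
echo "Observed recovery after $(( $(date +%s) - start ))s"
```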
## DR Cost Estimation

| Strategy | Monthly Cost (% of prod) | RTO | RPO | Complexity |
|---|---|---|---|---|
| Backup & Restore | 5-10% | 4-24 hours | Hours | Low |
| Pilot Light | 10-20% | 1-4 hours | Minutes | Medium |
| Warm Standby | 30-50% | 15-60 min | Minutes | Medium-High |
| Active-Active | 80-100%+ | Near-zero | Zero | High |
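A back-of-envelope break-even check helps justify the strategy you pick. All numbers below are illustrative assumptions, not benchmarks; only the 30-50% warm-standby ratio comes from the table above:

```bash
# Compare annual warm-standby cost against the annual loss it avoids
prod_monthly=20000          # $/month production spend (assumed)
dr_pct=40                   # warm standby ≈ 30-50% of prod (table above)
downtime_cost_per_hr=50000  # $/hour of revenue at risk (assumed)
expected_outage_hrs=4       # expected outage hours per year without DR (assumed)

dr_annual=$(( prod_monthly * dr_pct / 100 * 12 ))
loss_avoided=$(( downtime_cost_per_hr * expected_outage_hrs ))
echo "DR cost/year: \$${dr_annual}"          # prints: DR cost/year: $96000
echo "Loss avoided/year: \$${loss_avoided}"  # prints: Loss avoided/year: $200000
```

With these assumptions warm standby costs roughly half of what it saves; rerun the arithmetic with your own figures before committing to a tier.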
### Cost Optimization Tips
- Use cheaper instance types in DR region (upgrade during activation)
- Store DR backups in S3 Standard-IA or Glacier (not Standard)
- Use Spot instances for warm standby compute
- Turn off read replicas in DR when not testing
## Common DR Failures

| Failure | Root Cause | Prevention |
|---|---|---|
| Backup exists, can’t restore | Never tested restoration | Monthly automated restore test |
| DNS failover takes hours | TTL set too high (3600s+) | Set TTL to 60-300s for production |
| DR environment is outdated | Schema drift between regions | Replicate schema changes in CI/CD |
| Nobody knows the runbook | Runbook not updated since launch | Review runbook quarterly |
| Failback causes second outage | Failback procedure never tested | Test failback in every DR drill |
| Data loss exceeds RPO | Backup frequency too low | Match backup frequency to RPO target |
## DR Strategy by RPO and RTO

| Strategy | RPO | RTO | Cost (relative) | Best For |
|---|---|---|---|---|
| Backup and Restore | Hours | Hours-Days | $ | Dev/test, non-critical internal tools |
| Pilot Light | Minutes | 10-30 min | $$ | Databases, core APIs |
| Warm Standby | Seconds-Minutes | Minutes | $$$ | Business-critical applications |
| Multi-Site Active-Active | Zero | Zero | $$$$ | Revenue-generating, SLA-bound |
## DR Testing Cadence

| Test Type | Frequency | Duration | Who Is Involved |
|---|---|---|---|
| Tabletop exercise | Quarterly | 2 hours | Engineering leads, SRE, management |
| Failover test (non-prod) | Monthly | 4 hours | SRE team |
| Failover test (production) | Semi-annually | 2-4 hours | Full engineering + ops team |
| Full DR drill | Annually | Full day | All teams + executive sponsor |
## DR Checklist

- [ ] RTO and RPO targets defined per tier and signed off by the business
- [ ] Automated backups in place, with cross-region copies for Tier 1-2 data
- [ ] Restores verified monthly, not just "backup succeeded"
- [ ] Failover automated with health checks; DNS TTL set to 60-300s
- [ ] Runbook current, accessible, and reviewed quarterly
- [ ] Tabletop exercise quarterly; production failover test semi-annually
- [ ] Failback procedure tested in every drill
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For DR assessment consulting, visit garnetgrid.com.
:::