A disaster recovery plan that hasn’t been tested isn’t a plan — it’s a theory. 37% of organizations fail their first real DR activation because they never practiced. The backup exists but nobody knows the restore procedure. The failover is configured but DNS TTL is 24 hours.
This guide covers how to define recovery objectives, implement backup and replication strategies, automate failover, calculate DR costs, and test your plan before you actually need it.
## Step 1: Define RTO and RPO Targets
| Term | Definition | Business Question | Technical Implication |
|---|---|---|---|
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" | Determines failover strategy |
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" | Determines backup frequency |
### Tiers by Business Criticality
| Tier | Systems | RTO | RPO | Strategy | Cost (% of prod) |
|---|---|---|---|---|---|
| Tier 1: Mission Critical | Payment processing, auth, core API | < 15 min | 0 (zero data loss) | Active-Active / Hot Standby | 80-100% |
| Tier 2: Business Critical | CRM, ERP, dashboards | < 1 hour | < 15 min | Warm Standby | 30-50% |
| Tier 3: Essential | Email, file shares, internal tools | < 4 hours | < 1 hour | Pilot Light | 10-20% |
| Tier 4: Non-Critical | Dev environments, archives | < 24 hours | < 24 hours | Backup & Restore | 5-10% |
### How to Classify Systems

```text
Does the system directly generate revenue?
├── YES → Tier 1 (payment, checkout, core product)
└── NO → Continue
Does the system affect customer experience?
├── YES → Tier 2 (CRM, support tools, dashboards)
└── NO → Continue
Do employees depend on it daily?
├── YES → Tier 3 (email, file storage, internal apps)
└── NO → Tier 4 (dev tools, archives)
```
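The decision tree above can be sketched as a small shell helper. The function name and yes/no arguments are illustrative, not part of any real tool:

```shell
# Hypothetical helper mirroring the classification tree.
# Arguments: generates_revenue customer_facing daily_internal (each "yes"/"no")
classify_tier() {
  if [ "$1" = "yes" ]; then
    echo "Tier 1"            # directly generates revenue
  elif [ "$2" = "yes" ]; then
    echo "Tier 2"            # affects customer experience
  elif [ "$3" = "yes" ]; then
    echo "Tier 3"            # employees depend on it daily
  else
    echo "Tier 4"            # everything else
  fi
}

classify_tier yes no no    # payment service  → Tier 1
classify_tier no no yes    # internal email   → Tier 3
```

Encoding the tree as code keeps classification consistent when you inventory dozens of systems, and it can feed a script that tags resources with their tier.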
## Step 2: Implement Backup Strategy
### Database Backups

```bash
# AWS — automated RDS snapshots
aws rds modify-db-instance \
  --db-instance-identifier production-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Cross-region copy for DR (run in the destination region)
aws rds copy-db-snapshot \
  --source-db-snapshot-arn arn:aws:rds:us-east-1:123456789:snapshot:prod-snap \
  --target-db-snapshot-identifier prod-snap-dr \
  --source-region us-east-1 \
  --region us-west-2

# Verify backup integrity monthly
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "dr-verify-$(date +%Y%m%d)" \
  --db-snapshot-identifier latest-production-snap \
  --region us-west-2
# Run integrity checks, then delete the test instance
```
### Object Storage Replication

```bash
# S3 cross-region replication
aws s3api put-bucket-replication \
  --bucket production-data \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {
        "Bucket": "arn:aws:s3:::production-data-dr-west",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'
```
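One prerequisite that is easy to miss: S3 rejects a replication configuration unless versioning is enabled on both the source and destination buckets. A sketch using the same example bucket names:

```bash
# Versioning must be on BEFORE put-bucket-replication will succeed
aws s3api put-bucket-versioning \
  --bucket production-data \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning \
  --bucket production-data-dr-west \
  --versioning-configuration Status=Enabled
```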
### Backup Strategy Matrix

| Data Type | Method | Frequency | Retention | Cross-Region? |
|---|---|---|---|---|
| Production database | Snapshots + WAL | Continuous + daily | 30 days | ✅ Required |
| User uploads (S3) | Cross-region replication | Real-time | Indefinite | ✅ Required |
| Configuration (IaC) | Git repository | Every commit | Indefinite | ✅ GitHub/GitLab |
| Secrets (Vault) | Encrypted backup | Daily | 90 days | ✅ Required |
| Application logs | Log aggregator | Continuous | 30-90 days | Optional |
| DNS configuration | Export + IaC | Every change | Git history | ✅ Multiple providers |
## Step 3: Automate Failover
### DNS-Based Failover (Route 53)

```yaml
# Kubernetes multi-region failover with external-dns
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: api-failover
spec:
  endpoints:
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.1.0.50  # Primary (us-east-1)
      setIdentifier: primary
      providerSpecific:
        - name: aws/failover
          value: PRIMARY
        - name: aws/health-check-id
          value: "abc-123-health-check"
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.2.0.50  # Secondary (us-west-2)
      setIdentifier: secondary
      providerSpecific:
        - name: aws/failover
          value: SECONDARY
```
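The `aws/health-check-id` must reference an existing Route 53 health check. One way to create it; the resource path and thresholds here are illustrative defaults, not values from the example above:

```bash
# HTTPS health check against the primary region's /health endpoint.
# With RequestInterval=10 and FailureThreshold=3, Route 53 marks the
# endpoint unhealthy roughly 30 seconds after it stops responding.
aws route53 create-health-check \
  --caller-reference "api-primary-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api.yourcompany.com",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3
  }'
```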
### Critical: DNS TTL

| TTL Setting | Failover Speed | Recommendation |
|---|---|---|
| 60s (1 min) | Clients switch within 1 min | ✅ For Tier 1 services |
| 300s (5 min) | Clients switch within 5 min | ✅ Default for production |
| 3600s (1 hour) | Stuck for up to 1 hour | ❌ Too slow for DR |
| 86400s (24 hours) | Failover takes 24 hours | ❌ Never for production |
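You can verify the TTL clients actually see with `dig`; the remaining TTL is the second field of each answer line:

```bash
# Query the record and inspect its TTL (second column of the answer)
dig +noall +answer api.yourcompany.com A
# Answer format: api.yourcompany.com.  60  IN  A  10.1.0.50
```

Run this from outside your network too: some corporate and ISP resolvers cache beyond the published TTL, which extends real-world failover time.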
## Step 4: Test Your DR Plan
### Tabletop Exercise (Quarterly)

```markdown
## DR Tabletop Exercise
**Scenario:** Primary database corrupted at 2 AM Saturday.

1. Who receives the first alert? (PagerDuty? Slack?)
2. What's the escalation path? (on-call → lead → VP)
3. How do we confirm it's real vs a false positive?
4. What's the failover command? Who has access?
5. How long does failover take? (MEASURED, not estimated)
6. How do we verify DR is serving correctly?
7. What data was lost between last backup and failure?
8. How do we communicate to customers? Template ready?
9. How do we fail BACK to primary after recovery?
10. What's our post-mortem process?

**Scoring:**
- All 10 answered confidently: PASS
- 7-9 answered: CONDITIONAL PASS (fix gaps in 2 weeks)
- < 7 answered: FAIL (reschedule within 1 month)
```
### Technical DR Test (Semi-Annual)

```bash
#!/bin/bash
set -e
echo "=== DR Test Started ==="
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# 1. Restore database in DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --db-snapshot-identifier "latest-cross-region-snap" \
  --region us-west-2

# 2. Wait for restoration
aws rds wait db-instance-available \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --region us-west-2

# 3. Validate data integrity (look up the restored instance's endpoint)
ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --query 'DBInstances[0].Endpoint.Address' --output text \
  --region us-west-2)
psql -h "$ENDPOINT" -c \
  "SELECT COUNT(*) FROM orders WHERE order_date > NOW() - INTERVAL '1 day';"

# 4. Run smoke tests
curl -f https://dr-api.yourcompany.com/health || echo "FAIL: health check"

# 5. Measure recovery time
echo "Total recovery time: $((SECONDS / 60)) minutes"

# 6. Cleanup
aws rds delete-db-instance \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --skip-final-snapshot --region us-west-2
echo "=== DR Test Complete ==="
```
### Game Day (Annual)
Actually trigger a controlled failover in production:
| Step | Action | Duration |
|---|---|---|
| 1 | Announce game day (30 min warning) | 0:00 |
| 2 | Simulate primary region failure | 0:05 |
| 3 | Measure alert-to-detection time | 0:05-0:10 |
| 4 | Execute failover runbook | 0:10-0:25 |
| 5 | Verify DR region serving correctly | 0:25-0:35 |
| 6 | Restore primary region | 0:35-0:50 |
| 7 | Fail back to primary | 0:50-1:05 |
| 8 | Post-game debrief | 1:05-2:00 |
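During the game day, measure recovery from a client's point of view rather than from internal dashboards. A minimal sketch that polls the public health endpoint once per second and reports when the service answers again (URL is the example domain from earlier):

```bash
#!/bin/bash
# Start this the moment you simulate the failure; it prints the
# externally observed recovery time, which is the number your RTO cares about.
start=$(date +%s)
until curl -sf --max-time 2 https://api.yourcompany.com/health >/dev/null; do
  sleep 1
done
echo "Observed recovery after $(( $(date +%s) - start ))s"
```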
## DR Cost Estimation

| Strategy | Monthly Cost (% of prod) | RTO | RPO | Complexity |
|---|---|---|---|---|
| Backup & Restore | 5-10% | 4-24 hours | Hours | Low |
| Pilot Light | 10-20% | 1-4 hours | Minutes | Medium |
| Warm Standby | 30-50% | 15-60 min | Minutes | Medium-High |
| Active-Active | 80-100%+ | Near-zero | Zero | High |
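A back-of-envelope break-even check helps justify the strategy you pick. All numbers below are illustrative assumptions, not benchmarks; only the 30-50% warm-standby ratio comes from the table above:

```bash
# Compare annual warm-standby cost against the annual loss it avoids
prod_monthly=20000          # $/month production spend (assumed)
dr_pct=40                   # warm standby ≈ 30-50% of prod (table above)
downtime_cost_per_hr=50000  # $/hour of revenue at risk (assumed)
expected_outage_hrs=4       # expected outage hours per year without DR (assumed)

dr_annual=$(( prod_monthly * dr_pct / 100 * 12 ))
loss_avoided=$(( downtime_cost_per_hr * expected_outage_hrs ))
echo "DR cost/year: \$${dr_annual}"          # prints: DR cost/year: $96000
echo "Loss avoided/year: \$${loss_avoided}"  # prints: Loss avoided/year: $200000
```

With these assumptions warm standby costs roughly half of what it saves; rerun the arithmetic with your own figures before committing to a tier.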
### Cost Optimization Tips
- Use cheaper instance types in DR region (upgrade during activation)
- Store DR backups in S3 Standard-IA or Glacier (not Standard)
- Use Spot instances for warm standby compute
- Turn off read replicas in DR when not testing
## Common DR Failures

| Failure | Root Cause | Prevention |
|---|---|---|
| Backup exists, can’t restore | Never tested restoration | Monthly automated restore test |
| DNS failover takes hours | TTL set too high (3600s+) | Set TTL to 60-300s for production |
| DR environment is outdated | Schema drift between regions | Replicate schema changes in CI/CD |
| Nobody knows the runbook | Runbook not updated since launch | Review runbook quarterly |
| Failback causes second outage | Failback procedure never tested | Test failback in every DR drill |
| Data loss exceeds RPO | Backup frequency too low | Match backup frequency to RPO target |
## DR Strategy by RPO and RTO

| Strategy | RPO | RTO | Cost (relative) | Best For |
|---|---|---|---|---|
| Backup and Restore | Hours | Hours-Days | $ | Dev/test, non-critical internal tools |
| Pilot Light | Minutes | 10-30 min | $$ | Databases, core APIs |
| Warm Standby | Seconds-Minutes | Minutes | $$$ | Business-critical applications |
| Multi-Site Active-Active | Zero | Zero | $$$$ | Revenue-generating, SLA-bound |
## DR Testing Cadence

| Test Type | Frequency | Duration | Who Is Involved |
|---|---|---|---|
| Tabletop exercise | Quarterly | 2 hours | Engineering leads, SRE, management |
| Failover test (non-prod) | Monthly | 4 hours | SRE team |
| Failover test (production) | Semi-annually | 2-4 hours | Full engineering + ops team |
| Full DR drill | Annually | Full day | All teams + executive sponsor |
## DR Checklist

- [ ] RTO and RPO targets defined per tier and signed off by the business
- [ ] Automated backups in place, with cross-region copies for Tier 1-2 data
- [ ] Restores verified monthly, not just "backup succeeded"
- [ ] Failover automated with health checks; DNS TTL set to 60-300s
- [ ] Runbook current, accessible, and reviewed quarterly
- [ ] Tabletop exercise quarterly; production failover test semi-annually
- [ ] Failback procedure tested in every drill
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For DR assessment consulting, visit garnetgrid.com.
:::