How to Build a Disaster Recovery Plan

Design and test disaster recovery for cloud and on-prem workloads. Covers RTO/RPO targets, backup strategies, failover automation, and tabletop exercises.

A disaster recovery plan that hasn’t been tested isn’t a plan — it’s a theory. A large share of organizations fail their first real DR activation simply because they never practiced. The backup exists but nobody knows the restore procedure. The failover is configured but the DNS TTL is 24 hours.

This guide covers how to define recovery objectives, implement backup and replication strategies, automate failover, calculate DR costs, and test your plan before you actually need it.


Step 1: Define RTO and RPO Targets

| Term | Definition | Business Question | Technical Implication |
|------|------------|-------------------|-----------------------|
| RTO (Recovery Time Objective) | Max acceptable downtime | "How long can we be down?" | Determines failover strategy |
| RPO (Recovery Point Objective) | Max acceptable data loss | "How much data can we lose?" | Determines backup frequency |
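
In practice, the backup interval has to be comfortably shorter than the RPO, since everything written after the last backup is what gets lost. A minimal sketch of that rule of thumb (the helper name and the halving margin are assumptions, not a standard):

```shell
# Hypothetical helper: derive a maximum backup interval from an RPO target.
# Halving the RPO leaves headroom for backup runtime and transfer delays.
rpo_to_backup_interval() {
  local rpo_minutes=$1
  echo $(( rpo_minutes / 2 ))
}

rpo_to_backup_interval 15   # Tier 2 RPO of 15 min → back up every 7 minutes
```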

Tiers by Business Criticality

| Tier | Systems | RTO | RPO | Strategy | Cost (% of prod) |
|------|---------|-----|-----|----------|------------------|
| Tier 1: Mission Critical | Payment processing, auth, core API | < 15 min | 0 (zero data loss) | Active-Active / Hot Standby | 80-100% |
| Tier 2: Business Critical | CRM, ERP, dashboards | < 1 hour | < 15 min | Warm Standby | 30-50% |
| Tier 3: Essential | Email, file shares, internal tools | < 4 hours | < 1 hour | Pilot Light | 10-20% |
| Tier 4: Non-Critical | Dev environments, archives | < 24 hours | < 24 hours | Backup & Restore | 5-10% |

How to Classify Systems

```text
Does the system directly generate revenue?
├── YES → Tier 1 (payment, checkout, core product)
└── NO → Continue

Does the system affect customer experience?
├── YES → Tier 2 (CRM, support tools, dashboards)
└── NO → Continue

Do employees depend on it daily?
├── YES → Tier 3 (email, file storage, internal apps)
└── NO → Tier 4 (dev tools, archives)
```
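
The decision tree above can be sketched as a small shell function, e.g. for tagging services in an inventory script (the function name and yes/no inputs are illustrative assumptions):

```shell
# Illustrative tier classifier mirroring the decision tree above.
# Inputs are "yes"/"no" answers to the three questions, in order.
classify_tier() {
  local revenue=$1 customer_facing=$2 daily_internal=$3
  if [ "$revenue" = "yes" ]; then
    echo "Tier 1"
  elif [ "$customer_facing" = "yes" ]; then
    echo "Tier 2"
  elif [ "$daily_internal" = "yes" ]; then
    echo "Tier 3"
  else
    echo "Tier 4"
  fi
}

classify_tier yes no no   # payment API → Tier 1
classify_tier no no yes   # email → Tier 3
```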

Step 2: Implement Backup Strategy

Database Backups

```shell
# AWS — automated RDS snapshots
aws rds modify-db-instance \
  --db-instance-identifier production-db \
  --backup-retention-period 30 \
  --preferred-backup-window "03:00-04:00"

# Cross-region copy for DR
aws rds copy-db-snapshot \
  --source-db-snapshot-arn arn:aws:rds:us-east-1:123456789:snapshot:prod-snap \
  --target-db-snapshot-identifier prod-snap-dr \
  --source-region us-east-1 \
  --region us-west-2

# Verify backup integrity monthly
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "dr-verify-$(date +%Y%m%d)" \
  --db-snapshot-identifier latest-production-snap \
  --region us-west-2

# Run integrity checks, then delete the test instance
aws rds delete-db-instance \
  --db-instance-identifier "dr-verify-$(date +%Y%m%d)" \
  --skip-final-snapshot --region us-west-2
```

Object Storage Replication

```shell
# S3 cross-region replication
aws s3api put-bucket-replication \
  --bucket production-data \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/s3-replication-role",
    "Rules": [{
      "Prefix": "",
      "Status": "Enabled",
      "Destination": {
        "Bucket": "arn:aws:s3:::production-data-dr-west",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'
```
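
One easy-to-miss prerequisite: S3 replication only applies when versioning is enabled on both the source and destination buckets. Assuming the bucket names from the example above:

```shell
# Versioning must be enabled on BOTH buckets before replication takes effect
# (bucket names follow the example above).
aws s3api put-bucket-versioning \
  --bucket production-data \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-versioning \
  --bucket production-data-dr-west \
  --versioning-configuration Status=Enabled
```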

Backup Strategy Matrix

| Data Type | Method | Frequency | Retention | Cross-Region? |
|-----------|--------|-----------|-----------|---------------|
| Production database | Snapshots + WAL | Continuous + daily | 30 days | ✅ Required |
| User uploads (S3) | Cross-region replication | Real-time | Indefinite | ✅ Required |
| Configuration (IaC) | Git repository | Every commit | Indefinite | ✅ GitHub/GitLab |
| Secrets (Vault) | Encrypted backup | Daily | 90 days | ✅ Required |
| Application logs | Log aggregator | Continuous | 30-90 days | Optional |
| DNS configuration | Export + IaC | Every change | Git history | ✅ Multiple providers |

Step 3: Automate Failover

DNS-Based Failover (Route 53)

```yaml
# Kubernetes multi-region failover with external-dns
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: api-failover
spec:
  endpoints:
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.1.0.50   # Primary (us-east-1)
      setIdentifier: primary
      providerSpecific:
        - name: aws/failover
          value: PRIMARY
        - name: aws/health-check-id
          value: "abc-123-health-check"
    - dnsName: api.yourcompany.com
      recordType: A
      targets:
        - 10.2.0.50   # Secondary (us-west-2)
      setIdentifier: secondary
      providerSpecific:
        - name: aws/failover
          value: SECONDARY
```

Critical: DNS TTL

| TTL Setting | Failover Speed | Recommendation |
|-------------|----------------|----------------|
| 60s (1 min) | Clients switch within 1 min | ✅ For Tier 1 services |
| 300s (5 min) | Clients switch within 5 min | ✅ Default for production |
| 3600s (1 hour) | Stuck for up to 1 hour | ❌ Too slow for DR |
| 86400s (24 hours) | Failover takes 24 hours | ❌ Never for production |
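
TTL is only one term in the recovery time a client actually experiences. A back-of-envelope worst case (all numbers are assumptions for illustration):

```shell
# Worst-case client cutover ≈ DNS TTL + failure detection + runbook execution.
ttl=300          # seconds clients may cache the old record
detection=120    # health check interval × failure threshold
runbook=600      # measured failover execution time
echo "Worst-case cutover: $(( ttl + detection + runbook )) seconds"
```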

Step 4: Test Your DR Plan

Tabletop Exercise (Quarterly)

```markdown
## DR Tabletop Exercise

**Scenario:** Primary database corrupted at 2 AM Saturday.

1. Who receives the first alert? (PagerDuty? Slack?)
2. What's the escalation path? (on-call → lead → VP)
3. How do we confirm it's real vs false positive?
4. What's the failover command? Who has access?
5. How long does failover take? (MEASURED, not estimated)
6. How do we verify DR is serving correctly?
7. What data was lost between last backup and failure?
8. How do we communicate to customers? Template ready?
9. How do we fail BACK to primary after recovery?
10. What's our post-mortem process?

**Scoring:**
- All 10 answered confidently: PASS
- 7-9 answered: CONDITIONAL PASS (fix gaps in 2 weeks)
- < 7 answered: FAIL (reschedule within 1 month)
```
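
If drill results are tracked in tooling, the scoring rule is trivial to encode (a sketch; the function name is an assumption):

```shell
# Scoring rule from the exercise above: 10 confident answers = PASS,
# 7-9 = CONDITIONAL PASS, fewer = FAIL.
score_tabletop() {
  local answered=$1
  if [ "$answered" -ge 10 ]; then
    echo "PASS"
  elif [ "$answered" -ge 7 ]; then
    echo "CONDITIONAL PASS"
  else
    echo "FAIL"
  fi
}

score_tabletop 8   # → CONDITIONAL PASS
```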

Technical DR Test (Semi-Annual)

```shell
#!/bin/bash
set -e
echo "=== DR Test Started ==="
TIMESTAMP=$(date +%Y%m%d_%H%M%S)

# 1. Restore database in DR region
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --db-snapshot-identifier "latest-cross-region-snap" \
  --region us-west-2

# 2. Wait for restoration
aws rds wait db-instance-available \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --region us-west-2

# 3. Validate data integrity against the restored instance's endpoint
ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --region us-west-2 \
  --query 'DBInstances[0].Endpoint.Address' --output text)
psql -h "$ENDPOINT" -c \
  "SELECT COUNT(*) FROM orders WHERE order_date > NOW() - INTERVAL '1 day';"

# 4. Run smoke tests
curl -f https://dr-api.yourcompany.com/health || echo "FAIL: health check"

# 5. Measure recovery time (SECONDS is bash's built-in elapsed timer)
echo "Total recovery time: $((SECONDS / 60)) minutes"

# 6. Cleanup
aws rds delete-db-instance \
  --db-instance-identifier "dr-test-${TIMESTAMP}" \
  --skip-final-snapshot --region us-west-2
echo "=== DR Test Complete ==="
```
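
The measured time from the test should then be compared against the tier's RTO target; a sketch of that final check (both numbers are placeholders):

```shell
# Compare measured recovery time against the tier's RTO target.
measured_min=38      # from the DR test script's timer
rto_target_min=60    # e.g. Tier 2: < 1 hour
if [ "$measured_min" -le "$rto_target_min" ]; then
  echo "RTO met: ${measured_min}m <= ${rto_target_min}m"
else
  echo "RTO MISSED: ${measured_min}m > ${rto_target_min}m"
fi
```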

Game Day (Annual)

Actually trigger a controlled failover in production:

| Step | Action | Timeline |
|------|--------|----------|
| 1 | Announce game day (30 min warning) | 0:00 |
| 2 | Simulate primary region failure | 0:05 |
| 3 | Measure alert-to-detection time | 0:05-0:10 |
| 4 | Execute failover runbook | 0:10-0:25 |
| 5 | Verify DR region serving correctly | 0:25-0:35 |
| 6 | Restore primary region | 0:35-0:50 |
| 7 | Fail back to primary | 0:50-1:05 |
| 8 | Post-game debrief | 1:05-2:00 |

DR Cost Estimation

| Strategy | Monthly Cost (% of prod) | RTO | RPO | Complexity |
|----------|--------------------------|-----|-----|------------|
| Backup & Restore | 5-10% | 4-24 hours | Hours | Low |
| Pilot Light | 10-20% | 1-4 hours | Minutes | Medium |
| Warm Standby | 30-50% | 15-60 min | Minutes | Medium-High |
| Active-Active | 80-100%+ | Near-zero | Zero | High |
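
These percentages translate directly into budget. For example, with an assumed production spend of $40,000/month and a warm standby at 40% of prod:

```shell
# Back-of-envelope DR budget (all figures assumed for illustration).
prod_monthly=40000   # USD per month for production
warm_standby_pct=40  # warm standby ≈ 30-50% of prod
echo "Estimated warm standby cost: \$$(( prod_monthly * warm_standby_pct / 100 ))/month"
```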

Cost Optimization Tips

  • Use cheaper instance types in DR region (upgrade during activation)
  • Store DR backups in S3 Standard-IA or Glacier (not Standard)
  • Use Spot instances for warm standby compute
  • Turn off read replicas in DR when not testing

Common DR Failures

| Failure | Root Cause | Prevention |
|---------|------------|------------|
| Backup exists, can’t restore | Never tested restoration | Monthly automated restore test |
| DNS failover takes hours | TTL set too high (3600s+) | Set TTL to 60-300s for production |
| DR environment is outdated | Schema drift between regions | Replicate schema changes in CI/CD |
| Nobody knows the runbook | Runbook not updated since launch | Review runbook quarterly |
| Failback causes second outage | Failback procedure never tested | Test failback in every DR drill |
| Data loss exceeds RPO | Backup frequency too low | Match backup frequency to RPO target |

DR Strategy by RPO and RTO

| Strategy | RPO | RTO | Cost (relative) | Best For |
|----------|-----|-----|-----------------|----------|
| Backup and Restore | Hours | Hours-Days | $ | Dev/test, non-critical internal tools |
| Pilot Light | Minutes | 10-30 min | $$ | Databases, core APIs |
| Warm Standby | Seconds-Minutes | Minutes | $$$ | Business-critical applications |
| Multi-Site Active-Active | Zero | Zero | $$$$ | Revenue-generating, SLA-bound |

DR Testing Cadence

| Test Type | Frequency | Duration | Who Is Involved |
|-----------|-----------|----------|-----------------|
| Tabletop exercise | Quarterly | 2 hours | Engineering leads, SRE, management |
| Failover test (non-prod) | Monthly | 4 hours | SRE team |
| Failover test (production) | Semi-annually | 2-4 hours | Full engineering + ops team |
| Full DR drill | Annually | Full day | All teams + executive sponsor |

DR Checklist

  • RTO/RPO defined for every Tier 1 & 2 system
  • Systems classified by business criticality (Tiers 1-4)
  • Automated backups with cross-region replication
  • Database backup restoration tested monthly
  • DNS failover configured with TTL < 300s for Tier 1
  • Health checks trigger automatic failover
  • DR runbook with step-by-step commands (not prose)
  • On-call rotation includes DR training
  • Tabletop exercise quarterly (with scoring)
  • Technical DR test semi-annually
  • Game day (production failover) annually
  • Customer communication templates prepared
  • Failback procedure documented and tested
  • DR costs budgeted and reviewed quarterly

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For DR assessment consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
