Disaster Recovery & Business Continuity
Design cloud-native disaster recovery. Covers RPO/RTO planning, DR strategies (backup, pilot light, warm standby, active-active), automated failover, testing frameworks, and compliance requirements.
Disaster recovery is the practice most organizations get wrong because they only think about it after a disaster. The reality: your DR strategy determines whether a region outage means 5 minutes of failover or 5 days of scrambling. This guide covers how to design, implement, and test DR architectures that actually work when you need them.
RPO and RTO
| Metric | Definition | Question It Answers |
|---|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss | "How much data can we afford to lose?" |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | "How long until we're back online?" |
RPO/RTO Tiers
| Tier | RPO | RTO | Cost | Strategy |
|---|---|---|---|---|
| Tier 1 (Mission Critical) | 0 (no data loss) | < 15 min | $$$$ | Active-Active multi-region |
| Tier 2 (Business Critical) | < 1 hour | < 1 hour | $$$ | Warm Standby |
| Tier 3 (Important) | < 4 hours | < 4 hours | $$ | Pilot Light |
| Tier 4 (Standard) | < 24 hours | < 24 hours | $ | Backup & Restore |
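The tier table is easier to enforce when it lives in code rather than on a wiki page. Here is a minimal sketch in Python: the targets mirror the table above, and the measured values would come from monitoring or from the last DR test. The `TierTarget` and `meets_targets` names are illustrative, not part of any existing tooling.

```python
from dataclasses import dataclass

@dataclass
class TierTarget:
    rpo_minutes: float  # maximum acceptable data loss
    rto_minutes: float  # maximum acceptable downtime
    strategy: str

# Values mirror the tier table above
TIERS = {
    1: TierTarget(rpo_minutes=0,       rto_minutes=15,      strategy="active-active"),
    2: TierTarget(rpo_minutes=60,      rto_minutes=60,      strategy="warm-standby"),
    3: TierTarget(rpo_minutes=240,     rto_minutes=240,     strategy="pilot-light"),
    4: TierTarget(rpo_minutes=24 * 60, rto_minutes=24 * 60, strategy="backup-restore"),
}

def meets_targets(tier: int, measured_rpo_min: float, measured_rto_min: float) -> bool:
    """Compare measured recovery numbers (e.g. from the last DR test) against the tier's targets."""
    target = TIERS[tier]
    return measured_rpo_min <= target.rpo_minutes and measured_rto_min <= target.rto_minutes

# Example: a Tier 2 service that lost 20 minutes of data and took 45 minutes to recover
print(meets_targets(2, measured_rpo_min=20, measured_rto_min=45))  # True
```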
DR Strategies
Strategy 1: Backup & Restore
```
Primary Region                          DR Region
┌──────────────┐                      ┌──────────────┐
│    Active    │    Backup copies     │     Cold     │
│   Workloads  │  ───────────────▶    │   (nothing   │
│              │    (daily/hourly)    │   running)   │
└──────────────┘                      └──────────────┘
                                             ↓ On disaster
                                        Restore from
                                        backups (hours)
```
RPO: Hours to days | RTO: Hours to days | Cost: Lowest
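On AWS, this strategy typically amounts to copying backups or snapshots into the DR region on a schedule. A minimal sketch assuming boto3 and EBS snapshots; the region names and snapshot ID are placeholders for your own environment:

```python
import boto3

# Hypothetical regions; adjust to your environment.
PRIMARY_REGION = "us-east-1"
DR_REGION = "us-west-2"

def copy_snapshot_to_dr(snapshot_id: str) -> str:
    """Copy an EBS snapshot from the primary region into the DR region."""
    # Cross-region copies are issued from the destination (DR) region.
    ec2_dr = boto3.client("ec2", region_name=DR_REGION)
    response = ec2_dr.copy_snapshot(
        SourceRegion=PRIMARY_REGION,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id}",
    )
    return response["SnapshotId"]
```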
Strategy 2: Pilot Light
```
Primary Region                          DR Region
┌──────────────┐                      ┌──────────────┐
│    Active    │     Real-time        │ Core services│
│   Workloads  │    replication       │ only (DB     │
│              │  ───────────────▶    │ replicas)    │
│              │                      │ No compute   │
└──────────────┘                      └──────────────┘
                                             ↓ On disaster
                                        Scale up compute,
                                        switch DNS (~1 hr)
```
RPO: Minutes | RTO: 30 min - 2 hours | Cost: Low-Medium
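Activating a pilot-light environment means promoting the replicated database and scaling compute up from zero. A hedged sketch assuming AWS, with an RDS cross-region read replica and an Auto Scaling group normally kept at zero capacity; the identifiers are hypothetical:

```python
import boto3

DR_REGION = "us-west-2"  # hypothetical DR region

def activate_pilot_light(replica_id: str, asg_name: str, capacity: int) -> None:
    """Promote the standby DB replica and bring up compute that normally sits at zero."""
    rds = boto3.client("rds", region_name=DR_REGION)
    autoscaling = boto3.client("autoscaling", region_name=DR_REGION)

    # Promote the cross-region read replica to a standalone, writable primary.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Scale the (previously empty) application fleet up to serve traffic.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=capacity,
        DesiredCapacity=capacity,
    )
```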
Strategy 3: Warm Standby
```
Primary Region                          DR Region
┌──────────────┐                      ┌──────────────┐
│    Active    │     Continuous       │ Scaled-down  │
│   Workloads  │    replication       │ copy of full │
│ (full scale) │  ───────────────▶    │ environment  │
│              │                      │(25% capacity)│
└──────────────┘                      └──────────────┘
                                             ↓ On disaster
                                        Scale up, switch
                                        DNS (15-30 min)
```
RPO: Seconds to minutes | RTO: 15-30 minutes | Cost: Medium
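For warm standby, the compute scale-up looks like the pilot-light example above; the remaining step is shifting traffic. A sketch assuming Route 53 weighted records; in practice you would also drop the primary record's weight to 0, and the zone ID and record names here are placeholders:

```python
import boto3

def shift_traffic_to_dr(hosted_zone_id: str, record_name: str, dr_dns_name: str) -> None:
    """Point a weighted DNS record fully at the DR environment (weights are illustrative)."""
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Failover: send 100% of traffic to DR",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "SetIdentifier": "dr",
                    "Weight": 100,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": dr_dns_name}],
                },
            }],
        },
    )
```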
Strategy 4: Active-Active
```
Region A (Active)                      Region B (Active)
┌──────────────┐                      ┌──────────────┐
│     Full     │   Bi-directional     │     Full     │
│  Workloads   │    replication       │  Workloads   │
│ (50% traffic)│ ◀──────────────────▶ │ (50% traffic)│
└──────────────┘                      └──────────────┘
        ↑                                     ↑
        └──────── Global Load Balancer ──────┘
```
On failure: 100% traffic to surviving region
RPO: 0 (no data loss) | RTO: Seconds | Cost: Highest
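Active-active depends on the traffic layer automatically draining a failed region. One common approach is health-check-driven DNS: each region's weighted or latency record references a health check, and the record is withdrawn while the check fails. A sketch of creating such a health check with boto3; the endpoint path and thresholds are assumptions:

```python
import uuid
import boto3

def create_region_health_check(fqdn: str) -> str:
    """Create a Route 53 health check; records referencing it are withdrawn when it fails."""
    route53 = boto3.client("route53")
    response = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),  # must be unique per create request
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": fqdn,
            "ResourcePath": "/health",
            "Port": 443,
            "RequestInterval": 10,  # seconds between checks
            "FailureThreshold": 3,  # consecutive failures before "unhealthy"
        },
    )
    return response["HealthCheck"]["Id"]
```

Each region's record set would then reference the returned ID via `HealthCheckId`, so a failed region drops out of DNS without any manual step.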
Automated Failover Architecture
```python
def check_primary_health():
    """Run independent health checks against the primary region."""
    checks = {
        "api_health": check_endpoint("https://api.primary.example.com/health"),
        "database_replication": check_replication_lag(max_lag_seconds=30),
        "error_rate": get_error_rate(threshold=0.05),
    }
    failed = [name for name, passed in checks.items() if not passed]
    # Require at least two independent failures to avoid failing over on one flaky check
    if len(failed) >= 2:
        return {"healthy": False, "failed_checks": failed}
    return {"healthy": True}


def execute_failover():
    """Automated failover sequence."""
    steps = [
        ("Verify primary is truly down", verify_primary_down),
        ("Promote DR database replica", promote_db_replica),
        ("Scale up DR compute", scale_dr_compute),
        ("Update DNS / traffic routing", switch_dns),
        ("Verify DR health", verify_dr_health),
        ("Notify operations team", send_pagerduty_alert),
    ]
    for step_name, step_fn in steps:
        result = step_fn()
        if not result.success:
            # Halt failover and alert humans
            alert(f"Failover failed at: {step_name}")
            return False
    return True
```
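One way the two functions above might be wired together is a polling loop that counts consecutive failures before triggering failover. The interval, threshold, and `require_human_ack` hook below are illustrative assumptions, matching the "automated failover with human verification" pattern:

```python
import time

CONSECUTIVE_FAILURES_REQUIRED = 3  # assumed threshold to avoid flapping
CHECK_INTERVAL_SECONDS = 30

def monitor_and_failover(require_human_ack=None):
    """Poll primary health and trigger failover after sustained failures.

    `require_human_ack` is an optional callable returning True/False, e.g. a
    pager acknowledgement, so failover stays automated but human-verified.
    """
    failures = 0
    while True:
        status = check_primary_health()
        failures = failures + 1 if not status["healthy"] else 0
        if failures >= CONSECUTIVE_FAILURES_REQUIRED:
            if require_human_ack is None or require_human_ack(status):
                execute_failover()
                return
            failures = 0  # acknowledgement declined; keep monitoring
        time.sleep(CHECK_INTERVAL_SECONDS)
```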
DR Testing
Testing Schedule
| Test Type | Frequency | Scope | Duration |
|---|---|---|---|
| Tabletop exercise | Quarterly | Walk through the DR plan as a team | 2 hours |
| Backup restoration test | Monthly | Restore one database from backup | 1-2 hours |
| Component failover | Monthly | Fail one component, verify recovery | 1 hour |
| Full DR test | Twice a year | Complete failover to DR region | 4-8 hours |
| Chaos engineering | Continuous | Random failure injection | Ongoing |
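The monthly backup-restoration test is the easiest one to automate end to end. A sketch assuming AWS RDS snapshots; the integrity checks themselves are elided and the identifiers are placeholders:

```python
import boto3

def test_backup_restoration(snapshot_id: str, test_instance_id: str) -> None:
    """Restore a snapshot to a throwaway instance, verify it, then clean it up."""
    rds = boto3.client("rds")

    # Restore the snapshot into a temporary instance.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=test_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=test_instance_id)

    # Here you would connect and run integrity checks (row counts, checksums, ...).

    # Tear the test instance down so it does not accrue cost.
    rds.delete_db_instance(DBInstanceIdentifier=test_instance_id, SkipFinalSnapshot=True)
```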
DR Cost Optimization
| Optimization | Impact | Risk |
|---|---|---|
| Use spot/preemptible for DR compute | 60-80% savings | Longer scale-up time |
| Cross-region replication only for critical data | 40-60% savings | Some data has higher RPO |
| Scheduled DR environments (off during nights) | 50% savings | DR not instant during off-hours |
| Multi-purpose DR region (also runs batch jobs) | 30% savings | Resource contention during failover |
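As one example, the "scheduled DR environments" row can be as simple as a scheduled job that resizes the standby fleet. A sketch assuming an Auto Scaling group and boto3; the group name, hours, and capacities are placeholders:

```python
import boto3
from datetime import datetime, timezone

DR_ASG_NAME = "dr-web-asg"         # hypothetical Auto Scaling group
BUSINESS_HOURS_UTC = range(6, 22)  # keep the warm standby running 06:00-22:00 UTC

def resize_dr_for_schedule(now=None):
    """Run on a schedule (e.g. hourly) to shrink the DR fleet outside business hours."""
    now = now or datetime.now(timezone.utc)
    capacity = 4 if now.hour in BUSINESS_HOURS_UTC else 0
    autoscaling = boto3.client("autoscaling", region_name="us-west-2")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=DR_ASG_NAME,
        MinSize=capacity,
        DesiredCapacity=capacity,
    )
```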
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Untested DR plan | Plan exists on paper but doesn’t work | Test quarterly, document results |
| Same-region “DR” | Region outage takes down both primary and DR | DR must be in a different region |
| Manual failover only | Takes hours when on-call engineer is sleeping | Automated failover with human verification |
| No RPO/RTO targets | Nobody knows what “fast enough” means | Define RPO/RTO per application tier |
| DR as afterthought | DR architecture bolted on after production launch | Design DR into architecture from the start |
| No data consistency verification | DR data might be corrupt or stale | Continuous replication monitoring + integrity checks |
Checklist
- RPO and RTO defined for each application tier
- DR strategy selected per tier (backup/pilot light/warm/active-active)
- DR region provisioned with infrastructure-as-code
- Database replication configured and monitored
- Failover procedure documented and automated
- DNS/traffic management configured for failover
- DR testing schedule established (quarterly minimum)
- Last full DR test completed within 6 months
- Runbook: step-by-step failover and failback procedures
- Communication plan: who to notify, when, how
- Compliance: DR meets regulatory requirements
- Cost tracking: DR infrastructure costs monitored
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For DR planning consulting, visit garnetgrid.com.
:::