Disaster Recovery & Business Continuity
Design cloud-native disaster recovery. Covers RPO/RTO planning, DR strategies (backup, pilot light, warm standby, active-active), automated failover, testing frameworks, and compliance requirements.
Disaster recovery is the practice most organizations get wrong because they only think about it after a disaster. The reality: your DR strategy determines whether a region outage means 5 minutes of failover or 5 days of scrambling. This guide covers how to design, implement, and test DR architectures that actually work when you need them.
RPO and RTO
| Metric | Definition | Question It Answers |
|---|---|---|
| RPO (Recovery Point Objective) | Maximum acceptable data loss | "How much data can we afford to lose?" |
| RTO (Recovery Time Objective) | Maximum acceptable downtime | "How long until we're back online?" |
RPO/RTO Tiers
| Tier | RPO | RTO | Cost | Strategy |
|---|---|---|---|---|
| Tier 1 (Mission Critical) | 0 (no data loss) | < 15 min | $$$$ | Active-Active multi-region |
| Tier 2 (Business Critical) | < 1 hour | < 1 hour | $$$ | Warm Standby |
| Tier 3 (Important) | < 4 hours | < 4 hours | $$ | Pilot Light |
| Tier 4 (Standard) | < 24 hours | < 24 hours | $ | Backup & Restore |
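The tier table is easier to enforce when it lives in code rather than on a wiki page. Here is a minimal sketch in Python: the targets mirror the table above, and the measured values would come from monitoring or from the last DR test. The `TierTarget` and `meets_targets` names are illustrative, not part of any existing tooling.

```python
from dataclasses import dataclass

@dataclass
class TierTarget:
    rpo_minutes: float  # maximum acceptable data loss
    rto_minutes: float  # maximum acceptable downtime
    strategy: str

# Values mirror the tier table above
TIERS = {
    1: TierTarget(rpo_minutes=0,       rto_minutes=15,      strategy="active-active"),
    2: TierTarget(rpo_minutes=60,      rto_minutes=60,      strategy="warm-standby"),
    3: TierTarget(rpo_minutes=240,     rto_minutes=240,     strategy="pilot-light"),
    4: TierTarget(rpo_minutes=24 * 60, rto_minutes=24 * 60, strategy="backup-restore"),
}

def meets_targets(tier: int, measured_rpo_min: float, measured_rto_min: float) -> bool:
    """Compare measured recovery numbers (e.g. from the last DR test) against the tier's targets."""
    target = TIERS[tier]
    return measured_rpo_min <= target.rpo_minutes and measured_rto_min <= target.rto_minutes

# Example: a Tier 2 service that lost 20 minutes of data and took 45 minutes to recover
print(meets_targets(2, measured_rpo_min=20, measured_rto_min=45))  # True
```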
DR Strategies
Strategy 1: Backup & Restore
```
Primary Region                          DR Region
┌──────────────┐                      ┌──────────────┐
│    Active    │    Backup copies     │     Cold     │
│   Workloads  │  ───────────────▶    │   (nothing   │
│              │    (daily/hourly)    │   running)   │
└──────────────┘                      └──────────────┘
                                             ↓ On disaster
                                        Restore from
                                        backups (hours)
```
RPO: Hours to days | RTO: Hours to days | Cost: Lowest
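On AWS, this strategy typically amounts to copying backups or snapshots into the DR region on a schedule. A minimal sketch assuming boto3 and EBS snapshots; the region names and snapshot ID are placeholders for your own environment:

```python
import boto3

# Hypothetical regions; adjust to your environment.
PRIMARY_REGION = "us-east-1"
DR_REGION = "us-west-2"

def copy_snapshot_to_dr(snapshot_id: str) -> str:
    """Copy an EBS snapshot from the primary region into the DR region."""
    # Cross-region copies are issued from the destination (DR) region.
    ec2_dr = boto3.client("ec2", region_name=DR_REGION)
    response = ec2_dr.copy_snapshot(
        SourceRegion=PRIMARY_REGION,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id}",
    )
    return response["SnapshotId"]
```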
Strategy 2: Pilot Light
```
Primary Region                          DR Region
┌──────────────┐                      ┌──────────────┐
│    Active    │     Real-time        │ Core services│
│   Workloads  │    replication       │ only (DB     │
│              │  ───────────────▶    │ replicas)    │
│              │                      │ No compute   │
└──────────────┘                      └──────────────┘
                                             ↓ On disaster
                                        Scale up compute,
                                        switch DNS (~1 hr)
```
RPO: Minutes | RTO: 30 min - 2 hours | Cost: Low-Medium
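Activating a pilot-light environment means promoting the replicated database and scaling compute up from zero. A hedged sketch assuming AWS, with an RDS cross-region read replica and an Auto Scaling group normally kept at zero capacity; the identifiers are hypothetical:

```python
import boto3

DR_REGION = "us-west-2"  # hypothetical DR region

def activate_pilot_light(replica_id: str, asg_name: str, capacity: int) -> None:
    """Promote the standby DB replica and bring up compute that normally sits at zero."""
    rds = boto3.client("rds", region_name=DR_REGION)
    autoscaling = boto3.client("autoscaling", region_name=DR_REGION)

    # Promote the cross-region read replica to a standalone, writable primary.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Scale the (previously empty) application fleet up to serve traffic.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=capacity,
        DesiredCapacity=capacity,
    )
```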
Strategy 3: Warm Standby
```
Primary Region                          DR Region
┌──────────────┐                      ┌──────────────┐
│    Active    │     Continuous       │ Scaled-down  │
│   Workloads  │    replication       │ copy of full │
│ (full scale) │  ───────────────▶    │ environment  │
│              │                      │(25% capacity)│
└──────────────┘                      └──────────────┘
                                             ↓ On disaster
                                        Scale up, switch
                                        DNS (15-30 min)
```
RPO: Seconds to minutes | RTO: 15-30 minutes | Cost: Medium
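For warm standby, the compute scale-up looks like the pilot-light example above; the remaining step is shifting traffic. A sketch assuming Route 53 weighted records; in practice you would also drop the primary record's weight to 0, and the zone ID and record names here are placeholders:

```python
import boto3

def shift_traffic_to_dr(hosted_zone_id: str, record_name: str, dr_dns_name: str) -> None:
    """Point a weighted DNS record fully at the DR environment (weights are illustrative)."""
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": "Failover: send 100% of traffic to DR",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "SetIdentifier": "dr",
                    "Weight": 100,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": dr_dns_name}],
                },
            }],
        },
    )
```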
Strategy 4: Active-Active
```
Region A (Active)                      Region B (Active)
┌──────────────┐                      ┌──────────────┐
│     Full     │   Bi-directional     │     Full     │
│  Workloads   │    replication       │  Workloads   │
│ (50% traffic)│ ◀──────────────────▶ │ (50% traffic)│
└──────────────┘                      └──────────────┘
        ↑                                     ↑
        └──────── Global Load Balancer ──────┘
```
On failure: 100% traffic to surviving region
RPO: 0 (no data loss) | RTO: Seconds | Cost: Highest
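Active-active depends on the traffic layer automatically draining a failed region. One common approach is health-check-driven DNS: each region's weighted or latency record references a health check, and the record is withdrawn while the check fails. A sketch of creating such a health check with boto3; the endpoint path and thresholds are assumptions:

```python
import uuid
import boto3

def create_region_health_check(fqdn: str) -> str:
    """Create a Route 53 health check; records referencing it are withdrawn when it fails."""
    route53 = boto3.client("route53")
    response = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),  # must be unique per create request
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": fqdn,
            "ResourcePath": "/health",
            "Port": 443,
            "RequestInterval": 10,  # seconds between checks
            "FailureThreshold": 3,  # consecutive failures before "unhealthy"
        },
    )
    return response["HealthCheck"]["Id"]
```

Each region's record set would then reference the returned ID via `HealthCheckId`, so a failed region drops out of DNS without any manual step.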
Automated Failover Architecture
```python
def check_primary_health():
    """Run independent health checks against the primary region."""
    checks = {
        "api_health": check_endpoint("https://api.primary.example.com/health"),
        "database_replication": check_replication_lag(max_lag_seconds=30),
        "error_rate": get_error_rate(threshold=0.05),
    }
    failed = [name for name, passed in checks.items() if not passed]
    # Require at least two independent failures to avoid failing over on one flaky check
    if len(failed) >= 2:
        return {"healthy": False, "failed_checks": failed}
    return {"healthy": True}


def execute_failover():
    """Automated failover sequence."""
    steps = [
        ("Verify primary is truly down", verify_primary_down),
        ("Promote DR database replica", promote_db_replica),
        ("Scale up DR compute", scale_dr_compute),
        ("Update DNS / traffic routing", switch_dns),
        ("Verify DR health", verify_dr_health),
        ("Notify operations team", send_pagerduty_alert),
    ]
    for step_name, step_fn in steps:
        result = step_fn()
        if not result.success:
            # Halt failover and alert humans
            alert(f"Failover failed at: {step_name}")
            return False
    return True
```
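One way the two functions above might be wired together is a polling loop that counts consecutive failures before triggering failover. The interval, threshold, and `require_human_ack` hook below are illustrative assumptions, matching the "automated failover with human verification" pattern:

```python
import time

CONSECUTIVE_FAILURES_REQUIRED = 3  # assumed threshold to avoid flapping
CHECK_INTERVAL_SECONDS = 30

def monitor_and_failover(require_human_ack=None):
    """Poll primary health and trigger failover after sustained failures.

    `require_human_ack` is an optional callable returning True/False, e.g. a
    pager acknowledgement, so failover stays automated but human-verified.
    """
    failures = 0
    while True:
        status = check_primary_health()
        failures = failures + 1 if not status["healthy"] else 0
        if failures >= CONSECUTIVE_FAILURES_REQUIRED:
            if require_human_ack is None or require_human_ack(status):
                execute_failover()
                return
            failures = 0  # acknowledgement declined; keep monitoring
        time.sleep(CHECK_INTERVAL_SECONDS)
```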
DR Testing
Testing Schedule
| Test Type | Frequency | Scope | Duration |
|---|---|---|---|
| Tabletop exercise | Quarterly | Walk through the DR plan as a team | 2 hours |
| Backup restoration test | Monthly | Restore one database from backup | 1-2 hours |
| Component failover | Monthly | Fail one component, verify recovery | 1 hour |
| Full DR test | Twice a year | Complete failover to DR region | 4-8 hours |
| Chaos engineering | Continuous | Random failure injection | Ongoing |
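The monthly backup-restoration test is the easiest one to automate end to end. A sketch assuming AWS RDS snapshots; the integrity checks themselves are elided and the identifiers are placeholders:

```python
import boto3

def test_backup_restoration(snapshot_id: str, test_instance_id: str) -> None:
    """Restore a snapshot to a throwaway instance, verify it, then clean it up."""
    rds = boto3.client("rds")

    # Restore the snapshot into a temporary instance.
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=test_instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=test_instance_id)

    # Here you would connect and run integrity checks (row counts, checksums, ...).

    # Tear the test instance down so it does not accrue cost.
    rds.delete_db_instance(DBInstanceIdentifier=test_instance_id, SkipFinalSnapshot=True)
```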
DR Cost Optimization
| Optimization | Impact | Risk |
|---|---|---|
| Use spot/preemptible for DR compute | 60-80% savings | Longer scale-up time |
| Cross-region replication only for critical data | 40-60% savings | Some data has higher RPO |
| Scheduled DR environments (off during nights) | 50% savings | DR not instant during off-hours |
| Multi-purpose DR region (also runs batch jobs) | 30% savings | Resource contention during failover |
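As one example, the "scheduled DR environments" row can be as simple as a scheduled job that resizes the standby fleet. A sketch assuming an Auto Scaling group and boto3; the group name, hours, and capacities are placeholders:

```python
import boto3
from datetime import datetime, timezone

DR_ASG_NAME = "dr-web-asg"         # hypothetical Auto Scaling group
BUSINESS_HOURS_UTC = range(6, 22)  # keep the warm standby running 06:00-22:00 UTC

def resize_dr_for_schedule(now=None):
    """Run on a schedule (e.g. hourly) to shrink the DR fleet outside business hours."""
    now = now or datetime.now(timezone.utc)
    capacity = 4 if now.hour in BUSINESS_HOURS_UTC else 0
    autoscaling = boto3.client("autoscaling", region_name="us-west-2")
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=DR_ASG_NAME,
        MinSize=capacity,
        DesiredCapacity=capacity,
    )
```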
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Untested DR plan | Plan exists on paper but doesn’t work | Test quarterly, document results |
| Same-region “DR” | Region outage takes down both primary and DR | DR must be in a different region |
| Manual failover only | Takes hours when on-call engineer is sleeping | Automated failover with human verification |
| No RPO/RTO targets | Nobody knows what “fast enough” means | Define RPO/RTO per application tier |
| DR as afterthought | DR architecture bolted on after production launch | Design DR into architecture from the start |
| No data consistency verification | DR data might be corrupt or stale | Continuous replication monitoring + integrity checks |
Checklist
- RPO and RTO defined for each application tier
- DR strategy selected per tier (backup/pilot light/warm/active-active)
- DR region provisioned with infrastructure-as-code
- Database replication configured and monitored
- Failover procedure documented and automated
- DNS/traffic management configured for failover
- DR testing schedule established (quarterly minimum)
- Last full DR test completed within 6 months
- Runbook: step-by-step failover and failback procedures
- Communication plan: who to notify, when, how
- Compliance: DR meets regulatory requirements
- Cost tracking: DR infrastructure costs monitored
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For DR planning consulting, visit garnetgrid.com.
:::