Cloud Disaster Recovery Architecture
Design disaster recovery strategies that balance cost and recovery speed. Covers RPO and RTO definitions, DR tiers, pilot light and warm standby architectures, failover automation, and the patterns that keep businesses running when entire regions go down.
Disaster recovery (DR) starts with a question only the business can answer: how much data loss and downtime can we tolerate? The answer determines the architecture and the cost. A zero-data-loss, zero-downtime strategy roughly doubles your infrastructure cost, because a second region has to run at full production capacity. A 24-hour recovery strategy built on backups costs little beyond backup storage. Most organizations need something in between.
RPO and RTO
RPO (Recovery Point Objective):
"How much data can we afford to lose?"
RPO = 0: No data loss → synchronous replication
RPO = 1 hour: Lose up to 1 hour of data → async replication
RPO = 24 hours: Lose up to 1 day → daily backups
RTO (Recovery Time Objective):
"How quickly must we be back online?"
RTO = 0: No downtime → active-active multi-region
RTO = 15 min: Quick failover → warm standby
RTO = 1 hour: Manual failover → pilot light
RTO = 24 hours: Full rebuild → backup & restore
Cost vs. Recovery:
| Strategy | RPO | RTO | Cost |
|---|---|---|---|
| Backup/Restore | 24 hours | 24 hours | $ (cheapest) |
| Pilot Light | 1 hour | 1 hour | $$ |
| Warm Standby | minutes | 15 min | $$$ |
| Active-Active | 0 | 0 | $$$$ (most expensive) |
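The comparison above can be read as a lookup: pick the cheapest strategy whose RPO and RTO both meet the business targets. The following is a minimal sketch of that selection, not a sizing tool. The `Strategy` type and `cheapest_strategy` function are hypothetical names, and the 5-minute RPO for warm standby is an assumed stand-in for "minutes" in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo_minutes: float  # worst-case data loss the strategy typically delivers
    rto_minutes: float  # typical time to restore service
    cost_rank: int      # 1 = cheapest, 4 = most expensive


# Values mirror the comparison table; warm standby's "minutes" RPO is
# assumed to be 5 minutes purely for illustration.
STRATEGIES = [
    Strategy("Backup/Restore", 24 * 60, 24 * 60, 1),
    Strategy("Pilot Light", 60, 60, 2),
    Strategy("Warm Standby", 5, 15, 3),
    Strategy("Active-Active", 0, 0, 4),
]


def cheapest_strategy(target_rpo_minutes: float, target_rto_minutes: float) -> Strategy:
    """Return the lowest-cost strategy whose RPO and RTO both meet the targets."""
    if target_rpo_minutes < 0 or target_rto_minutes < 0:
        raise ValueError("targets must be non-negative")
    candidates = [
        s for s in STRATEGIES
        if s.rpo_minutes <= target_rpo_minutes and s.rto_minutes <= target_rto_minutes
    ]
    # Active-Active (0, 0) meets any non-negative target, so the list is never empty.
    return min(candidates, key=lambda s: s.cost_rank)


if __name__ == "__main__":
    # A workload that tolerates 1 hour of data loss but only 30 minutes of downtime.
    print(cheapest_strategy(60, 30).name)
```

Note that the stricter of the two targets drives the choice: a relaxed RPO does not help if the RTO rules out the cheaper tiers.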
DR Architectures
Pilot Light:
Production: us-east-1 (full infrastructure)
DR: us-west-2 (database replica + minimal compute)
Normal: DR region runs database replication only
Failover: Scale up compute in DR region, switch DNS
RTO: 30-60 minutes
Cost: ~15% of production cost
┌──────────────┐          ┌──────────────┐
│  us-east-1   │  async   │  us-west-2   │
│              │  repl    │              │
│  App (full)  │─────────►│  DB replica  │
│  DB (primary)│          │  (no compute)│
└──────────────┘          └──────────────┘
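Because pilot light relies on asynchronous replication, the data loss you would actually suffer at any moment equals the replica's lag behind the primary. A minimal sketch of that check follows. The function names are hypothetical, and how you obtain the timestamp of the last replicated transaction depends on your database and is assumed here.

```python
from datetime import datetime, timedelta, timezone


def achieved_rpo(last_replicated_at: datetime, now: datetime) -> timedelta:
    """Writes made after last_replicated_at would be lost if the primary failed now."""
    return max(now - last_replicated_at, timedelta(0))


def rpo_breached(last_replicated_at: datetime, now: datetime, rpo_target: timedelta) -> bool:
    """True when replication lag exceeds the agreed RPO and should raise an alert."""
    return achieved_rpo(last_replicated_at, now) > rpo_target


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_applied = now - timedelta(minutes=75)  # replica is 75 minutes behind
    print(rpo_breached(last_applied, now, rpo_target=timedelta(hours=1)))
```

Alerting on this value during normal operation tells you whether the RPO on paper is the RPO you would get in a real failover.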
Warm Standby:
Production: us-east-1 (full infrastructure)
DR: us-west-2 (scaled-down copy of everything)
Normal: DR runs at 20% capacity
Failover: Scale up DR, switch traffic
RTO: 5-15 minutes
Cost: ~30% of production cost
┌──────────────┐          ┌──────────────┐
│  us-east-1   │  async   │  us-west-2   │
│              │  repl    │              │
│  App (100%)  │─────────►│  App (20%)   │
│  DB (primary)│          │  DB (replica)│
└──────────────┘          └──────────────┘
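The "scale up DR" step in a warm standby failover is simple arithmetic, and it helps to compute it ahead of time so the runbook states an exact number. The sketch below assumes capacity is measured in identical instances. `instances_to_add` is a hypothetical helper, not part of any cloud SDK.

```python
def instances_to_add(production_instances: int, standby_percent: int) -> int:
    """Instances the DR region must launch to reach full production capacity."""
    if production_instances < 0:
        raise ValueError("production_instances must be non-negative")
    if not 0 < standby_percent <= 100:
        raise ValueError("standby_percent must be in (0, 100]")
    # Ceiling division in integer arithmetic: the standby cannot run a
    # fraction of an instance, so its footprint is rounded up.
    standby_instances = -(-production_instances * standby_percent // 100)
    return production_instances - standby_instances


if __name__ == "__main__":
    # 40 production instances with DR at 20% capacity: 8 already running.
    print(instances_to_add(40, 20))  # prints 32
```

Comparing this number against the DR region's instance quota is a cheap pre-flight check to run before a drill.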
Active-Active:
Both regions serve traffic simultaneously
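RTO: near 0 (traffic shifts to the healthy region automatically)
RPO: 0 with synchronous replication; near 0 with asynchronous replication plus conflict resolution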
Cost: ~100% additional (2x production cost)
┌──────────────┐   sync   ┌──────────────┐
│  us-east-1   │◄────────►│  us-west-2   │
│              │          │              │
│  App (100%)  │          │  App (100%)  │
│  DB (multi-  │          │  DB (multi-  │
│   primary)   │          │   primary)   │
└──────────────┘          └──────────────┘
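When both regions accept writes without synchronous replication, the same record can be changed in two places and the system needs a rule for picking a winner. One common rule is last-writer-wins, sketched below. It is only one of several possible conflict-resolution approaches and is not necessarily what a given multi-primary database does. The `Versioned` type and `resolve` function are hypothetical, and writer clocks are assumed to be roughly synchronized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # write time from the writer's clock (assumed roughly in sync)
    region: str        # tie-breaker so every region picks the same winner


def resolve(a: Versioned, b: Versioned) -> Versioned:
    """Last-writer-wins: the newest timestamp wins; region name breaks exact ties."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


if __name__ == "__main__":
    east = Versioned("shipping=express", 1_700_000_000_500, "us-east-1")
    west = Versioned("shipping=standard", 1_700_000_000_900, "us-west-2")
    print(resolve(east, west).value)  # prints shipping=standard
```

The losing write is silently discarded, which is why an asynchronous active-active design gives a near-zero rather than strictly zero RPO.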
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No DR testing | Discover DR doesn’t work during actual disaster | Quarterly DR drills with full failover |
| DR plan without runbooks | Panic during disaster, slow recovery | Step-by-step runbooks tested quarterly |
| Backup but never test restore | Corrupt or incomplete backups discovered too late | Monthly restore tests from backup |
| Same region for DR | Regional outage takes out both | Cross-region or cross-cloud DR |
| Manual failover only | Depends on engineer availability at 3 AM | Automated failover with manual approval option |
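The last fix in the table, automated failover with a manual approval option, can be sketched as a small state machine fed by health checks. This is an illustrative outline under stated assumptions, not a production controller: `FailoverController`, its threshold of three consecutive failures, and the returned status strings are all hypothetical, and the actual DNS switch or scale-up is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    failure_threshold: int = 3      # consecutive failed checks before acting
    require_approval: bool = False  # True = wait for a human to confirm
    consecutive_failures: int = 0
    approved: bool = False

    def record_check(self, healthy: bool) -> str:
        """Feed one health-check result and return the resulting status."""
        if healthy:
            # Any successful check resets the counter, so brief blips
            # never accumulate into a failover.
            self.consecutive_failures = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return "degraded"
        if self.require_approval and not self.approved:
            return "awaiting_approval"
        return "failover"


if __name__ == "__main__":
    controller = FailoverController(require_approval=True)
    for result in (False, False, False):
        print(controller.record_check(result))
    controller.approved = True  # on-call engineer confirms
    print(controller.record_check(False))
```

Requiring several consecutive failures trades a slightly longer RTO for protection against failing over on a transient network blip.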
Disaster recovery is insurance. You pay for it hoping you never need it. But unlike insurance, DR requires regular testing to ensure it actually works when disaster strikes. The worst time to discover your DR plan has a gap is during an actual disaster.