Cloud Disaster Recovery Architecture

Disaster recovery (DR) starts with a question only the business can answer: how much data loss and downtime can we tolerate? The answer determines the architecture and the cost. A zero-data-loss, zero-downtime strategy roughly doubles your infrastructure cost (about 2x production), because a second region runs at full capacity. A 24-hour recovery strategy built on backups costs almost nothing extra. Most organizations need something in between.

RPO and RTO

RPO (Recovery Point Objective):
  "How much data can we afford to lose?"
  
  RPO = 0:        No data loss → synchronous replication
  RPO = 1 hour:   Lose up to 1 hour of data → async replication
  RPO = 24 hours: Lose up to 1 day of data → daily backups
  
RTO (Recovery Time Objective):
  "How quickly must we be back online?"
  
  RTO = 0:        No downtime → active-active multi-region
  RTO = 15 min:   Quick failover → warm standby
  RTO = 1 hour:   Manual failover → pilot light
  RTO = 24 hours: Full rebuild → backup & restore
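
The RPO definition above can be turned into a simple check: the data you would lose in a failure right now is the time elapsed since the last recovery point (last backup or last replicated write). A minimal sketch, assuming you can obtain that timestamp from your backup or replication tooling; the function names and the example timestamps are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_recovery_point: datetime, now: datetime) -> timedelta:
    """Data that would be lost if the primary failed at `now`."""
    return now - last_recovery_point


def meets_rpo(last_recovery_point: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the current exposure is within the RPO target."""
    return data_loss_window(last_recovery_point, now) <= rpo


# Hypothetical example: nightly backup finished at 02:00 UTC, it is now 12:00 UTC.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_backup = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)

print(data_loss_window(last_backup, now))                # 10:00:00
print(meets_rpo(last_backup, now, timedelta(hours=24)))  # True
print(meets_rpo(last_backup, now, timedelta(hours=1)))   # False
```

Running a check like this on a schedule shows whether the RPO you promised is the RPO you actually have.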

Cost vs. Recovery:

  Strategy          RPO        RTO        Cost
  ─────────────────────────────────────────────
  Backup/Restore    24 hours   24 hours   $ (cheapest)
  Pilot Light       1 hour     1 hour     $$
  Warm Standby      minutes    15 min     $$$
  Active-Active     0          0          $$$$ (most expensive)
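
The table above can be read as a lookup: pick the cheapest strategy whose RPO and RTO both fit the business targets. A minimal sketch of that selection; the table only says "minutes" for the Warm Standby RPO, so the 5-minute value below is an assumption, and the `Strategy` type and `cheapest_strategy` name are illustrative:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta
    rto: timedelta
    cost_rank: int  # 1 = cheapest ($), 4 = most expensive ($$$$)


STRATEGIES = [
    Strategy("Backup/Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(hours=1), 2),
    # Assumption: "minutes" in the table is modeled as 5 minutes.
    Strategy("Warm Standby", timedelta(minutes=5), timedelta(minutes=15), 3),
    Strategy("Active-Active", timedelta(0), timedelta(0), 4),
]


def cheapest_strategy(target_rpo: timedelta, target_rto: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy that satisfies both targets, if any."""
    candidates = [s for s in STRATEGIES if s.rpo <= target_rpo and s.rto <= target_rto]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


print(cheapest_strategy(timedelta(hours=4), timedelta(hours=2)).name)     # Pilot Light
print(cheapest_strategy(timedelta(hours=1), timedelta(minutes=30)).name)  # Warm Standby
```

Note that the stricter of the two targets decides the outcome: a relaxed RPO does not help if the RTO rules out the cheaper options.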

DR Architectures

Pilot Light:
  Production: us-east-1 (full infrastructure)
  DR: us-west-2 (database replica only; app compute is launched at failover)
  
  Normal: DR region runs database replication only
  Failover: Scale up compute in DR region, switch DNS
  RTO: 30-60 minutes
  Cost: ~15% of production cost
  
  ┌──────────────┐          ┌──────────────┐
  │  us-east-1   │  async   │  us-west-2   │
  │              │  repl    │              │
  │  App (full)  │─────────►│  DB replica  │
  │  DB (primary)│          │  (no compute)│
  └──────────────┘          └──────────────┘
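
One way to sanity-check the 30-60 minute RTO is to write the failover down as ordered steps and add up their estimated durations. A minimal sketch, assuming the steps run one after another; the step list extends the two steps named above with a replica promotion and a smoke test, and every duration is a hypothetical placeholder to be replaced with timings from your own drills:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class Step:
    name: str
    estimate: timedelta  # hypothetical; measure during DR drills


PILOT_LIGHT_FAILOVER = [
    Step("Promote us-west-2 DB replica to primary", timedelta(minutes=10)),
    Step("Scale up compute in us-west-2", timedelta(minutes=25)),
    Step("Switch DNS to us-west-2", timedelta(minutes=5)),
    Step("Smoke-test the DR stack", timedelta(minutes=5)),
]


def estimated_rto(steps: List[Step]) -> timedelta:
    """Total recovery time if the steps run sequentially."""
    return sum((s.estimate for s in steps), timedelta())


def within_rto(steps: List[Step], rto: timedelta) -> bool:
    """True if the estimated recovery time fits the RTO target."""
    return estimated_rto(steps) <= rto


print(estimated_rto(PILOT_LIGHT_FAILOVER))                         # 0:45:00
print(within_rto(PILOT_LIGHT_FAILOVER, timedelta(hours=1)))        # True
print(within_rto(PILOT_LIGHT_FAILOVER, timedelta(minutes=15)))     # False
```

The same list doubles as the skeleton of a runbook: each step needs an owner, a command, and a measured duration.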

Warm Standby:
  Production: us-east-1 (full infrastructure)
  DR: us-west-2 (scaled-down copy of everything)
  
  Normal: DR runs at 20% capacity
  Failover: Scale up DR, switch traffic
  RTO: 5-15 minutes
  Cost: ~30% of production cost
  
  ┌──────────────┐          ┌──────────────┐
  │  us-east-1   │  async   │  us-west-2   │
  │              │  repl    │              │
  │  App (100%)  │─────────►│  App (20%)   │
  │  DB (primary)│          │  DB (replica)│
  └──────────────┘          └──────────────┘
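
Running DR at 20% capacity means failover has to add the remaining 80% before the region can take full production load. The arithmetic can be sketched as follows, assuming capacity is measured in identical instances; the function names are illustrative:

```python
def standby_instances(production_instances: int, standby_percent: int = 20) -> int:
    """Instances the DR region runs in normal operation (rounded up, at least 1)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (production_instances * standby_percent + 99) // 100)


def instances_to_add(production_instances: int, standby_percent: int = 20) -> int:
    """Instances to launch at failover so DR matches production capacity."""
    return production_instances - standby_instances(production_instances, standby_percent)


print(standby_instances(40))  # 8
print(instances_to_add(40))   # 32
```

The number returned by `instances_to_add` is what the DR region must be able to launch quickly, so it is worth checking against instance quotas and capacity limits in that region before a disaster, not during one.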

Active-Active:
  Both regions serve traffic simultaneously
  
  RTO: near 0 (traffic shifts automatically to the surviving region)
  RPO: 0 with synchronous replication; near 0 with async replication and conflict resolution
  Cost: ~100% additional (2x production cost)
  
  ┌──────────────┐   sync   ┌──────────────┐
  │  us-east-1   │◄────────►│  us-west-2   │
  │              │          │              │
  │  App (100%)  │          │  App (100%)  │
  │  DB (multi-  │          │  DB (multi-  │
  │    primary)  │          │    primary)  │
  └──────────────┘          └──────────────┘
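
When both regions accept writes, the same record can be changed in two places at once, and something has to decide which version survives. The text does not name a method; the sketch below uses last-writer-wins (LWW), one common and simple choice, purely as an illustration. The `Version` type, the key names, and the tie-break on region name are assumptions:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time as recorded by the region that took the write
    region: str


def resolve(a: Version, b: Version) -> Version:
    """Last-writer-wins; the region name breaks timestamp ties deterministically."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


def merge(east: Dict[str, Version], west: Dict[str, Version]) -> Dict[str, Version]:
    """Merge two regions' views of the data, resolving conflicts per key."""
    merged = dict(east)
    for key, version in west.items():
        merged[key] = resolve(merged[key], version) if key in merged else version
    return merged


east = {"cart:1": Version("2 items", 100, "us-east-1")}
west = {
    "cart:1": Version("3 items", 200, "us-west-2"),
    "cart:2": Version("1 item", 50, "us-west-2"),
}

merged = merge(east, west)
print(merged["cart:1"].value)  # 3 items
print(sorted(merged))          # ['cart:1', 'cart:2']
```

LWW discards the losing write and depends on reasonably synchronized clocks, so it trades a small amount of data loss on conflicting writes for simplicity. Workloads that cannot accept that need synchronous replication or a merge strategy designed for their data.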

Anti-Patterns

  Anti-Pattern                   Consequence                                        Fix
  ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  No DR testing                  Discover DR doesn’t work during actual disaster   Quarterly DR drills with full failover
  DR plan without runbooks       Panic during disaster, slow recovery               Step-by-step runbooks tested quarterly
  Backup but never test restore  Corrupt or incomplete backups discovered too late  Monthly restore tests from backup
  Same region for DR             Regional outage takes out both                     Cross-region or cross-cloud DR
  Manual failover only           Depends on engineer availability at 3 AM           Automated failover with manual approval option
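
The testing cadences in the table (quarterly drills, monthly restore tests) are easy to let slip, so it helps to check them mechanically. A minimal sketch, assuming "quarterly" means at most 90 days and "monthly" at most 30 days between runs; the check names and dates are hypothetical:

```python
from datetime import date, timedelta
from typing import Dict, List

# Assumed maximum age for each kind of DR test.
MAX_AGE = {
    "dr_drill": timedelta(days=90),      # quarterly full-failover drill
    "restore_test": timedelta(days=30),  # monthly restore from backup
}


def overdue_checks(last_run: Dict[str, date], today: date) -> List[str]:
    """Return the checks that were never run or were last run too long ago."""
    overdue = []
    for check, max_age in MAX_AGE.items():
        last = last_run.get(check)
        if last is None or today - last > max_age:
            overdue.append(check)
    return overdue


today = date(2024, 6, 1)
last_run = {"dr_drill": date(2024, 4, 15), "restore_test": date(2024, 4, 20)}

print(overdue_checks(last_run, today))  # ['restore_test']
print(overdue_checks({}, today))        # ['dr_drill', 'restore_test']
```

Wiring the result into an alert or a ticket turns the testing schedule from an intention into something the team is reminded of.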

Disaster recovery is insurance. You pay for it hoping you never need it. But unlike insurance, DR requires regular testing to ensure it actually works when disaster strikes. The worst time to discover your DR plan has a gap is during an actual disaster.
