
Cloud Disaster Recovery Architecture

Design disaster recovery strategies that balance cost and recovery speed. Covers RPO and RTO definitions, DR tiers, pilot light and warm standby architectures, failover automation, and the patterns that keep businesses running when entire regions go down.

Disaster recovery (DR) starts with a question the business must answer: how much data loss and downtime can we tolerate? The answer determines the architecture and the cost. A zero-data-loss, zero-downtime DR strategy costs roughly twice as much as production alone, because a second region runs at full capacity. A 24-hour recovery strategy costs little beyond backup storage. Most organizations need something in between.


RPO and RTO

RPO (Recovery Point Objective):
  "How much data can we afford to lose?"
  
  RPO = 0:      No data loss → synchronous replication
  RPO = 1 hour: Lose up to 1 hour of data → async replication
  RPO = 24 hours: Lose up to 1 day → daily backups
  
RTO (Recovery Time Objective):
  "How quickly must we be back online?"
  
  RTO = 0:       No downtime → active-active multi-region
  RTO = 15 min:  Quick failover → warm standby
  RTO = 1 hour:  Manual failover → pilot light
  RTO = 24 hours: Full rebuild → backup & restore
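
The two objectives can be checked against a real incident or a drill. The sketch below is a minimal illustration, not a monitoring tool: the function names and the sample timestamps are assumptions made for this example. It treats the data-loss window as the gap between the last replicated write and the failure, and downtime as the gap between the failure and restored service.

```python
from datetime import datetime, timedelta


def data_loss_window(last_replicated_at: datetime, failed_at: datetime) -> timedelta:
    """Writes made after the last replicated point are lost on failover."""
    return max(failed_at - last_replicated_at, timedelta(0))


def meets_objectives(last_replicated_at, failed_at, restored_at, rpo, rto):
    """Return (rpo_met, rto_met) for one incident or drill."""
    loss = data_loss_window(last_replicated_at, failed_at)
    downtime = restored_at - failed_at
    return loss <= rpo, downtime <= rto


# Hypothetical drill: the replica was 10 minutes behind and service
# came back 40 minutes after the failure.
failed = datetime(2024, 1, 1, 3, 0)
result = meets_objectives(
    last_replicated_at=failed - timedelta(minutes=10),
    failed_at=failed,
    restored_at=failed + timedelta(minutes=40),
    rpo=timedelta(hours=1),
    rto=timedelta(minutes=15),
)
print(result)  # (True, False): RPO of 1 hour met, RTO of 15 minutes missed
```

Recording these two numbers after every drill shows whether the architecture still delivers the objectives the business agreed to.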

Cost vs. Recovery:

  Strategy          RPO        RTO        Cost
  ─────────────────────────────────────────────
  Backup/Restore    24 hours   24 hours   $ (cheapest)
  Pilot Light       1 hour     1 hour     $$
  Warm Standby      minutes    15 min     $$$
  Active-Active     0          0          $$$$ (most expensive)
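
The table can be read as a lookup: pick the cheapest strategy whose RPO and RTO both fit the requirement. The sketch below encodes the table rows directly. The `Strategy` class and `cheapest_strategy` function are hypothetical names, and the table's "minutes" RPO for warm standby is assumed to mean 5 minutes.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo: timedelta
    rto: timedelta
    cost_rank: int  # 1 = cheapest ($), 4 = most expensive ($$$$)


# Values from the table; warm standby RPO of 5 minutes is an assumption.
STRATEGIES = [
    Strategy("Backup/Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(hours=1), 2),
    Strategy("Warm Standby", timedelta(minutes=5), timedelta(minutes=15), 3),
    Strategy("Active-Active", timedelta(0), timedelta(0), 4),
]


def cheapest_strategy(required_rpo: timedelta, required_rto: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy that satisfies both objectives."""
    candidates = [
        s for s in STRATEGIES
        if s.rpo <= required_rpo and s.rto <= required_rto
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


# A business that tolerates 4 hours of data loss but only 30 minutes of downtime:
choice = cheapest_strategy(timedelta(hours=4), timedelta(minutes=30))
print(choice.name)  # Warm Standby
```

Note that the tighter of the two objectives drives the choice: here the 30-minute RTO rules out pilot light even though its RPO would be acceptable.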

DR Architectures

Pilot Light:
  Production: us-east-1 (full infrastructure)
  DR: us-west-2 (database replica + minimal compute)
  
  Normal: DR region runs database replication only; app compute is off or minimal
  Failover: Scale up compute in DR region, switch DNS
  RTO: 30-60 minutes
  Cost: ~15% of production cost
  
  ┌──────────────┐          ┌──────────────┐
  │  us-east-1   │  async   │  us-west-2   │
  │              │  repl    │              │
  │  App (full)  │─────────►│  DB replica  │
  │  DB (primary)│          │  (no compute)│
  └──────────────┘          └──────────────┘
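
A pilot light failover is an ordered sequence, and each step depends on the one before it. The sketch below shows one way to encode that order so a failed step halts the run. The three step functions are hypothetical stand-ins that always succeed; real versions would call your cloud provider's API. Promoting the replica is an assumed step that the outline above implies but does not spell out.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure,
    so later steps never act on a half-recovered region."""
    completed: List[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"failover halted at step: {name}")
        completed.append(name)
    return completed


# Hypothetical stand-ins for provider API calls.
def promote_replica() -> bool:
    return True


def scale_up_compute() -> bool:
    return True


def switch_dns() -> bool:
    return True


PILOT_LIGHT_STEPS: List[Step] = [
    ("promote DB replica in us-west-2", promote_replica),
    ("scale up compute in us-west-2", scale_up_compute),
    ("switch DNS to us-west-2", switch_dns),
]

print(run_failover(PILOT_LIGHT_STEPS))
```

Keeping the steps in a list means the same definition can drive both the automated run and the written runbook.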

Warm Standby:
  Production: us-east-1 (full infrastructure)
  DR: us-west-2 (scaled-down copy of everything)
  
  Normal: DR runs at 20% capacity
  Failover: Scale up DR, switch traffic
  RTO: 5-15 minutes
  Cost: ~30% of production cost
  
  ┌──────────────┐          ┌──────────────┐
  │  us-east-1   │  async   │  us-west-2   │
  │              │  repl    │              │
  │  App (100%)  │─────────►│  App (20%)   │
  │  DB (primary)│          │  DB (replica)│
  └──────────────┘          └──────────────┘
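
The 20% figure translates into a concrete capacity gap that failover must close. The sketch below is simple arithmetic under stated assumptions: `standby_plan` is a hypothetical helper, capacity is counted in identical instances, and the standby always keeps at least one instance running.

```python
def standby_plan(production_instances: int, standby_percent: int = 20) -> dict:
    """Size the warm standby and the scale-up needed at failover."""
    if production_instances < 1:
        raise ValueError("production_instances must be at least 1")
    if not 0 < standby_percent <= 100:
        raise ValueError("standby_percent must be in 1..100")
    # Integer ceiling division avoids floating-point rounding surprises.
    standby = max(1, (production_instances * standby_percent + 99) // 100)
    return {
        "standby": standby,
        "add_on_failover": production_instances - standby,
    }


print(standby_plan(40))  # {'standby': 8, 'add_on_failover': 32}
```

The `add_on_failover` number is worth checking against account quotas and available capacity in the DR region, since that many instances must start within the RTO.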

Active-Active:
  Both regions serve traffic simultaneously
  
  RTO: 0 (automatic)
  RPO: 0 with synchronous replication; near 0 with async replication plus conflict resolution
  Cost: ~100% additional (2x production cost)
  
  ┌──────────────┐   sync   ┌──────────────┐
  │  us-east-1   │◄────────►│  us-west-2   │
  │              │          │              │
  │  App (100%)  │          │  App (100%)  │
  │  DB (multi-  │          │  DB (multi-  │
  │    primary)  │          │    primary)  │
  └──────────────┘          └──────────────┘
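
When both regions accept writes, the same record can change in two places at once, and the regions need a rule for picking a winner. The source does not name a rule; the sketch below uses last-writer-wins, one common option, purely as an illustration. The `Version` class and `resolve` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time reported by the writing region
    region: str        # used only to break timestamp ties deterministically


def resolve(a: Version, b: Version) -> Version:
    """Last-writer-wins: the newer timestamp wins; on a tie,
    the region name that sorts higher wins."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


east = Version("shipped", 1_700_000_000_500, "us-east-1")
west = Version("cancelled", 1_700_000_000_900, "us-west-2")
print(resolve(east, west).value)  # cancelled
```

Both regions reach the same answer regardless of argument order, which is what keeps them convergent. The losing write is discarded, so last-writer-wins trades a small amount of data loss for availability, and it depends on reasonably synchronized clocks.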

Anti-Patterns

  Anti-Pattern                   Consequence                                        Fix
  ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  No DR testing                  Discover DR doesn’t work during actual disaster    Quarterly DR drills with full failover
  DR plan without runbooks       Panic during disaster, slow recovery               Step-by-step runbooks tested quarterly
  Backup but never test restore  Corrupt or incomplete backups discovered too late  Monthly restore tests from backup
  Same region for DR             Regional outage takes out both                     Cross-region or cross-cloud DR
  Manual failover only           Depends on engineer availability at 3 AM           Automated failover with manual approval option
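
"Automated failover with manual approval option" can be sketched as a two-stage decision: detection is automatic, and the final switch is optionally gated on a human. The code below is a minimal illustration under stated assumptions. The function names and the threshold of three consecutive failed health checks are hypothetical and should be tuned to the RTO.

```python
from typing import Iterable


def should_fail_over(health_checks: Iterable[bool], threshold: int = 3) -> bool:
    """True once `threshold` consecutive health checks have failed.
    Requiring consecutive failures avoids failing over on a single blip."""
    consecutive = 0
    for healthy in health_checks:
        consecutive = 0 if healthy else consecutive + 1
        if consecutive >= threshold:
            return True
    return False


def decide(health_checks: Iterable[bool], require_approval: bool, approved: bool = False) -> str:
    """Return 'stay', 'await-approval', or 'fail-over'."""
    if not should_fail_over(health_checks):
        return "stay"
    if require_approval and not approved:
        return "await-approval"
    return "fail-over"


print(decide([True, False, False, False], require_approval=False))  # fail-over
print(decide([False, True, False, False], require_approval=False))  # stay
print(decide([False, False, False], require_approval=True))         # await-approval
```

With `require_approval=True`, the on-call engineer only has to confirm a decision the system has already prepared, which is faster at 3 AM than diagnosing the outage from scratch.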

Disaster recovery is insurance. You pay for it hoping you never need it. But unlike insurance, DR requires regular testing to ensure it actually works when disaster strikes. The worst time to discover your DR plan has a gap is during an actual disaster.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
