
High Availability Architecture Patterns

Design systems that stay up. Covers active-active, active-passive, failover strategies, health checks, circuit breakers, bulkheads, and achieving 99.99% uptime.

High availability isn’t about building perfect systems; it’s about building systems that fail gracefully. Hardware fails. Networks partition. Data gets corrupted. Region-wide outages happen. HA architecture ensures users experience continuity even when the components underneath are failing.


Availability Math

| Target | Allowed Downtime/Year | Allowed Downtime/Month |
|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |

Serial dependency rule: If Service A (99.9%) depends on Service B (99.9%), combined availability = 99.9% × 99.9% ≈ 99.8%. Every additional hard dependency in the chain lowers the ceiling further.

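Parallel redundancy: Two instances at 99% each, failing independently → 1 - (0.01 × 0.01) = 0.9999, or 99.99%. Correlated failures (shared power, shared network, a bad deploy to both) reduce the real-world gain.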
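The two rules and the downtime table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, a year is assumed to be 365.25 days (which matches the table's figures), and the parallel formula assumes failures are independent.

```python
from math import prod

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours; assumption that matches the table above


def serial(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Availability of redundant components where one survivor is enough.

    Assumes failures are independent, which real deployments only approximate.
    """
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Allowed downtime per year, in hours, for a given availability target."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    print(f"serial   99.9% x 99.9%: {serial(0.999, 0.999):.4%}")
    print(f"parallel 99% || 99%   : {parallel(0.99, 0.99):.4%}")
    print(f"99.99% target         : {downtime_hours_per_year(0.9999) * 60:.1f} min/year")
```

Chaining `serial` and `parallel` gives a quick estimate for a whole request path, for example two redundant app instances in front of a single database.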


HA Patterns

Active-Active

                Load Balancer
               ┌──────┴───────┐
          ┌────▼─────┐   ┌────▼─────┐
          │ Region A │   │ Region B │
          │ (active) │   │ (active) │
          │          │   │          │
          │ App + DB │   │ App + DB │
          └────┬─────┘   └────┬─────┘
               │              │
               └──── Sync ────┘
                (replication)

Best for: Global applications, maximum availability, zero-downtime maintenance.

Active-Passive

          ┌────────────┐    ┌────────────┐
          │ Primary    │    │ Standby    │
          │ (active)   │───▶│ (passive)  │
          │            │    │            │
          │ App + DB   │    │ App + DB   │
          └────────────┘    └────────────┘
           Receives all       Ready to
           traffic            take over

Best for: Disaster recovery, lower cost, simpler operations.
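In an active-passive setup, something has to decide when the standby takes over. The sketch below shows one common approach: promote the standby only after several consecutive failed probes of the primary, so a single blip does not trigger a failover. `FailoverMonitor`, `probe_primary`, and `promote_standby` are hypothetical names, and the sketch deliberately leaves out fencing of the old primary, which a production system needs to avoid split-brain.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Promotes the standby after N consecutive failed probes of the primary."""

    probe_primary: Callable[[], bool]    # hypothetical: True if the primary is healthy
    promote_standby: Callable[[], None]  # hypothetical: repoints traffic to the standby
    failure_threshold: int = 3
    consecutive_failures: int = 0
    failed_over: bool = False

    def tick(self) -> None:
        """Run one probe cycle; intended to be called on a fixed interval."""
        if self.failed_over:
            return  # already running on the standby; failback is a separate decision
        try:
            healthy = self.probe_primary()
        except Exception:
            healthy = False  # a probe that errors counts as a failed probe
        if healthy:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.promote_standby()
            self.failed_over = True
```

The threshold and probe interval together bound detection time: with a 10-second interval and a threshold of 3, failover starts roughly 30 seconds after the primary stops responding.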


Resilience Patterns

| Pattern | What It Does | When to Use |
|---|---|---|
| Circuit breaker | Stops calling a failing service | Prevent cascade failures |
| Bulkhead | Isolates failure to one component | Protect critical paths from non-critical failures |
| Retry + backoff | Retries transient failures | Network blips, temporary overload |
| Timeout | Fails fast instead of hanging | Prevent resource exhaustion |
| Fallback | Returns cached/default data | Graceful degradation |
| Rate limiting | Protects from overload | Prevent self-inflicted DDoS |
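To make the first row concrete, here is a minimal circuit breaker sketch. It is not taken from any particular library: the class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration. The breaker opens after a run of consecutive failures, rejects calls while open, and lets a trial call through once the reset timeout has elapsed (the half-open state).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call in half-open state re-opens the circuit at once.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers typically catch `CircuitOpenError` and return a fallback (cached or default data), which pairs the circuit breaker row with the fallback row of the table.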

Health Check Design

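from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# check_db_connection, check_redis_connection, and check_disk_space_above are
# application-specific helpers assumed to be defined elsewhere; each returns a bool.

@app.get("/health/live")
async def liveness():
    """Is the process running? (Kubernetes restarts the container if this fails.)"""
    return {"status": "ok"}

@app.get("/health/ready")
async def readiness():
    """Can the service handle requests? (Kubernetes stops routing traffic if this fails.)"""
    checks = {
        "database": await check_db_connection(),
        "cache": await check_redis_connection(),
        "disk_space": check_disk_space_above(threshold_gb=1),
    }

    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        status_code=status_code,
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
    )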
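The readiness handler calls helpers that this guide does not define. As a rough sketch of what they might look like, the block below gives one possible `check_disk_space_above` built on the standard library's `shutil.disk_usage`, plus a hypothetical `check_with_timeout` wrapper so that a hung dependency makes the probe report "not ready" instead of hanging the health endpoint. The `path` default and the one-second timeout are assumptions.

```python
import asyncio
import shutil
from typing import Awaitable, Callable


def check_disk_space_above(threshold_gb: float, path: str = "/") -> bool:
    """Return True if free space on `path` exceeds `threshold_gb` gigabytes."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes > threshold_gb * 1024**3


async def check_with_timeout(probe: Callable[[], Awaitable[bool]],
                             timeout_s: float = 1.0) -> bool:
    """Await a dependency probe, treating timeouts and errors as unhealthy."""
    try:
        return bool(await asyncio.wait_for(probe(), timeout=timeout_s))
    except Exception:
        # A timeout and a connection error both mean "not ready".
        return False
```

With this wrapper, a database check could be written as `await check_with_timeout(check_db_connection)`, keeping the probe's response time bounded even when the database is unreachable.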

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Single point of failure | One component down = everything down | Redundancy at every layer |
| No health checks | Load balancer sends traffic to dead instances | Liveness + readiness probes |
| Retry without backoff | Failed service overwhelmed by retries | Exponential backoff with jitter |
| No circuit breaker | Cascading failures across services | Circuit breaker on all external calls |
| Manual failover | Takes 30+ minutes, human error | Automated failover with health checks |
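The "exponential backoff with jitter" fix can be sketched as follows. This is one common variant (often called full jitter), not the only one: the wait before each retry is a random value between zero and an exponentially growing cap. The function name, defaults, and the set of retryable exceptions are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    max_delay_s: float = 5.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> T:
    """Call `fn`, retrying transient errors with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt: base * 2^attempt, bounded by max_delay_s.
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Full jitter: sleep a uniform random time in [0, cap] so that many
            # clients retrying at once do not hit the service in lockstep.
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

Only errors listed in `retry_on` are retried; anything else (for example a validation error) propagates immediately, since retrying a non-transient failure only adds load.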

Checklist

  • SLA target defined: availability percentage, RPO, RTO
  • Redundancy: no single point of failure at any layer
  • Health checks: liveness + readiness on all services
  • Circuit breakers on all external dependencies
  • Retry with exponential backoff and jitter
  • Multi-AZ deployment (minimum for production)
  • Automated failover: tested monthly
  • Load testing to verify capacity under failure scenarios
  • Runbooks for common failure modes

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For HA architecture consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
