High Availability Architecture Patterns
Design systems that stay up. Covers active-active, active-passive, failover strategies, health checks, circuit breakers, bulkheads, and achieving 99.99% uptime.
High availability isn’t about building perfect systems — it’s about building systems that fail gracefully. Hardware fails. Networks partition. Databases corrupt. Region-wide outages happen. HA architecture ensures users experience continuity even when components underneath are failing.
Availability Math
| Target | Allowed Downtime/Year | Allowed Downtime/Month |
|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |
Serial dependency rule: if Service A (99.9%) depends on Service B (99.9%), combined availability = 0.999 × 0.999 ≈ 99.8%. Every serial hop multiplies in another failure probability, so long dependency chains lower the ceiling fast.
Parallel redundancy: two independent instances at 99% each → 1 − (0.01 × 0.01) = 99.99%. Redundancy multiplies failure probabilities, not availabilities, so it pays off quickly — but only if the replicas fail independently.
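The two rules above can be expressed in a few lines of Python; the function names here are illustrative, and availabilities are plain fractions (0.999 = 99.9%):

```python
# Availability math: serial chains multiply availabilities,
# parallel redundancy multiplies failure probabilities.

def serial(*components: float) -> float:
    """Combined availability of a serial dependency chain."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel(*replicas: float) -> float:
    """Combined availability of independent redundant replicas."""
    failure = 1.0
    for a in replicas:
        failure *= 1.0 - a  # all replicas must fail simultaneously
    return 1.0 - failure

print(f"{serial(0.999, 0.999):.4%}")   # two three-nines services in series
print(f"{parallel(0.99, 0.99):.4%}")   # two two-nines replicas in parallel
```

Note the independence assumption in `parallel`: replicas sharing a rack, AZ, or deploy pipeline fail together, and the math stops applying.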
HA Patterns
Active-Active
```
         Load Balancer
        ┌──────┴──────┐
   ┌────▼─────┐  ┌────▼─────┐
   │ Region A │  │ Region B │
   │ (active) │  │ (active) │
   │          │  │          │
   │ App + DB │  │ App + DB │
   └────┬─────┘  └────┬─────┘
        │             │
        └─── Sync ────┘
        (replication)
```
Best for: Global applications, maximum availability, zero-downtime maintenance.
Active-Passive
```
┌────────────┐      ┌────────────┐
│  Primary   │      │  Standby   │
│  (active)  │─────▶│ (passive)  │
│            │      │            │
│  App + DB  │      │  App + DB  │
└────────────┘      └────────────┘
 Receives all         Ready to
   traffic            take over
```
Best for: Disaster recovery, lower cost, simpler operations.
Resilience Patterns
| Pattern | What It Does | When to Use |
|---|---|---|
| Circuit breaker | Stops calling failing service | Prevent cascade failures |
| Bulkhead | Isolate failure to one component | Protect critical paths from non-critical failures |
| Retry + backoff | Retry transient failures | Network blips, temporary overload |
| Timeout | Fail fast instead of hanging | Prevent resource exhaustion |
| Fallback | Return cached/default data | Graceful degradation |
| Rate limiting | Protect from overload | Prevent self-inflicted DDoS |
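The circuit breaker in the table above is a small state machine: closed (calls flow), open (fail fast), half-open (one trial call after a cooldown). A minimal sketch — the class name, thresholds, and exception choices are assumptions, not a specific library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    fails fast while open, allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success closes the circuit and resets the failure count.
            self.failures = 0
            self.opened_at = None
            return result
```

Production libraries add per-endpoint state, metrics, and async support, but the state transitions are the same three.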
Health Check Design
```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/live")
async def liveness():
    """Is the process running? (K8s restarts the pod if this fails.)"""
    return {"status": "ok"}

@app.get("/health/ready")
async def readiness():
    """Can the service handle requests? (K8s stops routing if this fails.)"""
    checks = {
        # App-specific check helpers, defined elsewhere in the service.
        "database": await check_db_connection(),
        "cache": await check_redis_connection(),
        "disk_space": check_disk_space_above(threshold_gb=1),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return JSONResponse(
        status_code=status_code,
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
    )
```
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Single point of failure | One component down = everything down | Redundancy at every layer |
| No health checks | Load balancer sends traffic to dead instances | Liveness + readiness probes |
| Retry without backoff | Failed service overwhelmed by retries | Exponential backoff with jitter |
| No circuit breaker | Cascading failures across services | Circuit breaker on all external calls |
| Manual failover | Takes 30+ minutes, human error | Automated failover with health checks |
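The "retry without backoff" anti-pattern has a short fix: exponential backoff with full jitter, where each attempt sleeps a random amount up to an exponentially growing cap. A minimal sketch — the function name and default limits are illustrative:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry transient failures with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the exponential cap,
            # so synchronized clients don't retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, cap))
```

The jitter matters as much as the backoff: without it, every client that failed at the same moment retries at the same moment, re-creating the spike that caused the failure.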
Checklist
- SLA target defined: availability percentage, RPO, RTO
- Redundancy: no single point of failure at any layer
- Health checks: liveness + readiness on all services
- Circuit breakers on all external dependencies
- Retry with exponential backoff and jitter
- Multi-AZ deployment (minimum for production)
- Automated failover: tested monthly
- Load testing to verify capacity under failure scenarios
- Runbooks for common failure modes
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For HA architecture consulting, visit garnetgrid.com.
:::