High Availability Architecture Patterns
Design systems that stay up. Covers active-active, active-passive, failover strategies, health checks, circuit breakers, bulkheads, and achieving 99.99% uptime.
High availability isn’t about building perfect systems — it’s about building systems that fail gracefully. Hardware fails. Networks partition. Databases corrupt. Region-wide outages happen. HA architecture ensures users experience continuity even when components underneath are failing.
Availability Math
| Target | Allowed Downtime/Year | Allowed Downtime/Month |
|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours |
| 99.9% (three nines) | 8.77 hours | 43.8 minutes |
| 99.95% | 4.38 hours | 21.9 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds |
Serial dependency rule: if Service A (99.9%) depends on Service B (99.9%), combined availability = 0.999 × 0.999 ≈ 99.8%. Every serial hop multiplies in another failure probability, so long dependency chains lower the ceiling fast.
Parallel redundancy: two independent instances at 99% each → 1 − (0.01 × 0.01) = 99.99%. Redundancy multiplies failure probabilities, not availabilities, so it pays off quickly — but only if the replicas fail independently.
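The two rules above can be expressed in a few lines of Python; the function names here are illustrative, and availabilities are plain fractions (0.999 = 99.9%):

```python
# Availability math: serial chains multiply availabilities,
# parallel redundancy multiplies failure probabilities.

def serial(*components: float) -> float:
    """Combined availability of a serial dependency chain."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel(*replicas: float) -> float:
    """Combined availability of independent redundant replicas."""
    failure = 1.0
    for a in replicas:
        failure *= 1.0 - a  # all replicas must fail simultaneously
    return 1.0 - failure

print(f"{serial(0.999, 0.999):.4%}")   # two three-nines services in series
print(f"{parallel(0.99, 0.99):.4%}")   # two two-nines replicas in parallel
```

Note the independence assumption in `parallel`: replicas sharing a rack, AZ, or deploy pipeline fail together, and the math stops applying.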
HA Patterns
Active-Active
```
         Load Balancer
        ┌──────┴──────┐
   ┌────▼─────┐  ┌────▼─────┐
   │ Region A │  │ Region B │
   │ (active) │  │ (active) │
   │          │  │          │
   │ App + DB │  │ App + DB │
   └────┬─────┘  └────┬─────┘
        │             │
        └─── Sync ────┘
        (replication)
```
Best for: Global applications, maximum availability, zero-downtime maintenance.
Active-Passive
```
┌────────────┐      ┌────────────┐
│  Primary   │      │  Standby   │
│  (active)  │─────▶│ (passive)  │
│            │      │            │
│  App + DB  │      │  App + DB  │
└────────────┘      └────────────┘
 Receives all         Ready to
   traffic            take over
```
Best for: Disaster recovery, lower cost, simpler operations.
Resilience Patterns
| Pattern | What It Does | When to Use |
|---|---|---|
| Circuit breaker | Stops calling failing service | Prevent cascade failures |
| Bulkhead | Isolate failure to one component | Protect critical paths from non-critical failures |
| Retry + backoff | Retry transient failures | Network blips, temporary overload |
| Timeout | Fail fast instead of hanging | Prevent resource exhaustion |
| Fallback | Return cached/default data | Graceful degradation |
| Rate limiting | Protect from overload | Prevent self-inflicted DDoS |
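The circuit breaker in the table above is a small state machine: closed (calls flow), open (fail fast), half-open (one trial call after a cooldown). A minimal sketch — the class name, thresholds, and exception choices are assumptions, not a specific library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures,
    fails fast while open, allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success closes the circuit and resets the failure count.
            self.failures = 0
            self.opened_at = None
            return result
```

Production libraries add per-endpoint state, metrics, and async support, but the state transitions are the same three.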
Health Check Design
```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health/live")
async def liveness():
    """Is the process running? (K8s restarts the pod if this fails.)"""
    return {"status": "ok"}

@app.get("/health/ready")
async def readiness():
    """Can the service handle requests? (K8s stops routing if this fails.)"""
    checks = {
        # App-specific check helpers, defined elsewhere in the service.
        "database": await check_db_connection(),
        "cache": await check_redis_connection(),
        "disk_space": check_disk_space_above(threshold_gb=1),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return JSONResponse(
        status_code=status_code,
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
    )
```
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Single point of failure | One component down = everything down | Redundancy at every layer |
| No health checks | Load balancer sends traffic to dead instances | Liveness + readiness probes |
| Retry without backoff | Failed service overwhelmed by retries | Exponential backoff with jitter |
| No circuit breaker | Cascading failures across services | Circuit breaker on all external calls |
| Manual failover | Takes 30+ minutes, human error | Automated failover with health checks |
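The "retry without backoff" anti-pattern has a short fix: exponential backoff with full jitter, where each attempt sleeps a random amount up to an exponentially growing cap. A minimal sketch — the function name and default limits are illustrative:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry transient failures with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the exponential cap,
            # so synchronized clients don't retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, cap))
```

The jitter matters as much as the backoff: without it, every client that failed at the same moment retries at the same moment, re-creating the spike that caused the failure.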
Checklist
- SLA target defined: availability percentage, RPO, RTO
- Redundancy: no single point of failure at any layer
- Health checks: liveness + readiness on all services
- Circuit breakers on all external dependencies
- Retry with exponential backoff and jitter
- Multi-AZ deployment (minimum for production)
- Automated failover: tested monthly
- Load testing to verify capacity under failure scenarios
- Runbooks for common failure modes
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For HA architecture consulting, visit garnetgrid.com.
:::