Reliability Patterns
Implement the fundamental reliability patterns that keep distributed systems running under failure: circuit breakers, bulkheads, timeouts, retries with backoff, graceful degradation, and fallback strategies.
Distributed systems fail. Networks partition, services crash, databases slow down, queues back up. Reliability is not about preventing all failures — it is about designing systems that continue to function acceptably when components fail. These patterns are the building blocks of resilient architecture.
Circuit Breaker
A circuit breaker prevents cascading failures by stopping requests to a failing dependency:
States:
CLOSED → Requests pass through normally
OPEN → Requests fail immediately (no network call)
HALF_OPEN → Limited requests to test if dependency recovered
CLOSED ──(5 failures in 60s)──▶ OPEN
OPEN ──(60s timeout)──▶ HALF_OPEN
HALF_OPEN ──(success)──▶ CLOSED
HALF_OPEN ──(failure)──▶ OPEN
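import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Simplified and not thread-safe. It counts consecutive failures; the
    # "5 failures in 60s" rule in the diagram needs timestamped failures.
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.state = 'CLOSED'
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN before probing
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            # time.monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.last_failure_time > self.reset_timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitOpenError("Circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed probe in HALF_OPEN reopens the circuit immediately
            if self.state == 'HALF_OPEN' or self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise
        # Any success closes the circuit and resets the failure count
        self.state = 'CLOSED'
        self.failure_count = 0
        return result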
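The state diagram trips the breaker on "5 failures in 60s", which a single running counter cannot express on its own. One way to track that condition is a sliding window of failure timestamps. The sketch below is illustrative only: `FailureWindow`, its method names, and the injectable `clock` parameter are assumptions, not part of the class above.

```python
from collections import deque
import time


class FailureWindow:
    """Tracks failure timestamps inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold=5, window_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._failures = deque()

    def _evict_expired(self, now):
        # Drop failures that are older than the window
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()

    def record_failure(self):
        now = self.clock()
        self._failures.append(now)
        self._evict_expired(now)

    def should_trip(self):
        # True once the window holds at least `threshold` recent failures
        self._evict_expired(self.clock())
        return len(self._failures) >= self.threshold
```

A breaker could call `record_failure()` in its exception handler and move to OPEN when `should_trip()` returns true, instead of comparing a raw count against the threshold.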
Bulkhead
Isolate failures to prevent one component from consuming all resources:
Without bulkhead:
Thread Pool (100 threads)
├── Payment API calls (slow/failing) → consumes 98 threads
├── Order API calls → 1 thread (starved)
└── User API calls → 1 thread (starved)
With bulkhead:
Payment Pool (30 threads) → failing, but contained
Order Pool (40 threads) → operating normally
User Pool (30 threads) → operating normally
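The same isolation can be approximated without separate thread pools by capping concurrent calls per dependency with a semaphore. This is a minimal sketch of that variant, not the thread-pool layout drawn above; `Bulkhead`, `BulkheadFullError`, and the `BULKHEADS` registry are hypothetical names, and the limits simply mirror the pool sizes in the diagram.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing behind a slow dependency
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Limits mirror the pool sizes in the diagram
BULKHEADS = {
    'payment': Bulkhead('payment', 30),
    'order': Bulkhead('order', 40),
    'user': Bulkhead('user', 30),
}
```

Rejecting when the bulkhead is full, rather than waiting for a slot, keeps a slow payment dependency from tying up callers that could otherwise fail fast or fall back.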
Timeout Strategy
Every external call needs a timeout:
# Tiered timeouts (seconds)
TIMEOUT_CONFIG = {
    'database': {'connect': 3, 'query': 10},
    'cache': {'connect': 1, 'operation': 2},
    'payment_api': {'connect': 5, 'request': 30},
    'internal_api': {'connect': 2, 'request': 5},
    'email_service': {'connect': 3, 'request': 10},
}
# Total request timeout:
#   - must be less than the client's timeout
#   - must account for every attempt (the initial call plus each retry)
#   - example: (1 initial attempt + 2 retries) × 5s = 15s, plus backoff and overhead, < 30s client timeout
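The budget arithmetic in the comments can be checked in code. The sketch below computes an upper bound on the time one retried call can take, assuming the backoff before retry n is at most `base_delay * 2**n + max_jitter` (the same shape as the retry function in this document). The function names and parameters are hypothetical.

```python
def worst_case_seconds(per_attempt_timeout, max_retries, base_delay=1.0, max_jitter=1.0):
    """Upper bound on time spent in one call that retries with exponential backoff."""
    attempts = max_retries + 1  # initial attempt plus retries
    # Sleep before retry n (0-based) is at most base_delay * 2**n + max_jitter
    backoff = sum(base_delay * (2 ** n) + max_jitter for n in range(max_retries))
    return attempts * per_attempt_timeout + backoff


def fits_budget(per_attempt_timeout, max_retries, caller_timeout, **backoff):
    """True if the worst case stays strictly under the caller's timeout."""
    return worst_case_seconds(per_attempt_timeout, max_retries, **backoff) < caller_timeout
```

With a 5s per-attempt timeout and 2 retries, the bound is 15s of attempts plus 5s of backoff, which fits a 30s client timeout but not a 20s one.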
Timeout Ordering
Client timeout: 30s
└── API Gateway timeout: 25s
    └── Service timeout: 20s
        ├── Database timeout: 10s
        └── External API timeout: 5s per attempt (× 3 attempts = 15s max)
Each layer’s timeout must be less than its caller’s timeout, and the comparison must use the layer’s worst case including retries.
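An ordering rule like this is easy to break when timeouts are configured in different places, so it can help to assert it at startup or in a test. A minimal sketch, assuming the chain is supplied as a list of `(name, timeout_seconds)` pairs ordered from the outermost caller inward; `validate_timeout_chain` is a hypothetical helper.

```python
def validate_timeout_chain(layers):
    """Check that each layer's timeout is strictly less than its caller's.

    layers: list of (name, timeout_seconds), outermost caller first.
    Returns a list of human-readable violations; empty means the chain is valid.
    """
    violations = []
    for (caller, caller_t), (callee, callee_t) in zip(layers, layers[1:]):
        if callee_t >= caller_t:
            violations.append(f"{callee} ({callee_t}s) must be < {caller} ({caller_t}s)")
    return violations


# For a layer that retries, pass its worst case (attempts × per-attempt timeout)
chain = [('client', 30), ('api_gateway', 25), ('service', 20), ('external_api', 15)]
problems = validate_timeout_chain(chain)
```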
Retry with Backoff
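import random
import time


class RetryableError(Exception):
    """Transient failure that is safe to retry."""


def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)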
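The delay above grows without bound and adds a fixed 0 to 1s of jitter. A common variant caps the exponential term and randomizes the whole delay ("full jitter"), which spreads retries from many clients more evenly. This is a sketch of that variant, not the function above; `backoff_delay`, `max_delay`, and the injectable `rng` are assumptions introduced for illustration.

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=30.0, rng=random.random):
    """Delay in seconds before the retry that follows failed attempt `attempt` (0-based).

    Full jitter: a uniform random value between 0 and the capped
    exponential ceiling.
    """
    ceiling = min(max_delay, base_delay * (2 ** attempt))
    return rng() * ceiling
```

Passing a fixed `rng` makes the schedule deterministic in tests; with the default it returns a value in `[0, ceiling)`.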
Retry Budget
Limit total retries to prevent amplification:
Without retry budget:
  100 requests × (1 initial attempt + 3 retries) = 400 requests to failing service
  (4x amplification during outage)
With retry budget (10%):
  100 requests + 10 retries = 110 requests
  Retries are capped at 10% of request volume
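A retry budget can be enforced with two counters: requests seen and retries granted. The sketch below is a simplified illustration; `RetryBudget` and its methods are hypothetical, and the counters are cumulative, whereas a production version would usually track them over a rolling time window and guard them with a lock.

```python
class RetryBudget:
    """Allows retries only while they stay under a fixed ratio of requests (hypothetical helper)."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        # Call once per original (non-retry) request
        self.requests += 1

    def try_acquire_retry(self):
        # Permit the retry only if it keeps retries within ratio * requests
        if self.retries + 1 <= self.ratio * self.requests:
            self.retries += 1
            return True
        return False
```

A caller would check `try_acquire_retry()` before each retry and give up immediately when it returns false, so a full outage adds at most 10% extra load instead of multiplying it.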
Graceful Degradation
When a component fails, serve reduced functionality instead of failing entirely:
def get_product_page(product_id):
    # Required: if the product lookup fails, the whole page fails
    product = product_service.get(product_id)
    try:
        reviews = review_service.get_reviews(product_id)
    except ServiceError:
        reviews = []  # Optional: show the page without reviews
    try:
        recommendations = rec_service.get_recommendations(product_id)
    except ServiceError:
        recommendations = get_cached_popular_products()  # Fallback to cached data
    return render_page(product, reviews, recommendations)
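When many optional calls follow the same try/except shape, the pattern can be factored into a small helper that also logs each degradation so it stays visible. A minimal sketch: `with_fallback` is a hypothetical helper, and the `ServiceError` class here is a stand-in for whatever exception the real service clients raise.

```python
import logging

logger = logging.getLogger(__name__)


class ServiceError(Exception):
    """Stand-in for the error raised by a failing optional dependency."""


def with_fallback(primary, fallback, errors=(ServiceError,)):
    """Return primary(); on an expected error, log it and return fallback()."""
    try:
        return primary()
    except errors as exc:
        name = getattr(primary, '__name__', repr(primary))
        logger.warning("Degrading: %s failed (%s)", name, exc)
        return fallback()
```

Only the listed error types trigger the fallback; anything else, such as a programming error, still propagates so it is not silently masked.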
Degradation Levels
Level 0: Full functionality (all systems healthy)
Level 1: Reduced personalization (recommendation service down)
Level 2: Read-only mode (write path down, serve cached data)
Level 3: Static content (serve cached pages from CDN)
Level 4: Maintenance page (complete failure)
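These levels can be made explicit in code so that every handler consults one source of truth. The sketch below is illustrative: the enum member names and the component keys in the `health` dict (`static_cache`, `read_path`, `write_path`, `recommendations`) are hypothetical, as is the mapping from component failures to levels.

```python
from enum import IntEnum


class DegradationLevel(IntEnum):
    FULL = 0
    REDUCED_PERSONALIZATION = 1
    READ_ONLY = 2
    STATIC_CONTENT = 3
    MAINTENANCE = 4


def current_level(health):
    """Pick the most severe applicable level from a dict of component -> healthy flag.

    Missing keys are treated as healthy. Component names are placeholders.
    """
    if not health.get('static_cache', True):
        return DegradationLevel.MAINTENANCE
    if not health.get('read_path', True):
        return DegradationLevel.STATIC_CONTENT
    if not health.get('write_path', True):
        return DegradationLevel.READ_ONLY
    if not health.get('recommendations', True):
        return DegradationLevel.REDUCED_PERSONALIZATION
    return DegradationLevel.FULL
```

Because `IntEnum` members compare as integers, a handler can gate features with checks such as `current_level(health) >= DegradationLevel.READ_ONLY`.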
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No timeouts | Thread pool exhaustion when dependency hangs | Timeout on every external call |
| Retry without backoff | Thundering herd on recovering service | Exponential backoff with jitter |
| No circuit breaker | Cascading failures across services | Circuit breaker on external dependencies |
| Fail completely on partial failure | Users get errors for non-critical features | Graceful degradation with fallbacks |
| Same retry policy everywhere | Over-retry on fast-failing services | Calibrate retries per dependency |
Reliability is not a feature you add — it is a property that emerges from applying these patterns consistently across every service boundary and every external dependency.