Reliability Patterns | The Garnet Wiki

Distributed systems fail. Networks partition, services crash, databases slow down, queues back up. Reliability is not about preventing all failures — it is about designing systems that continue to function acceptably when components fail. These patterns are the building blocks of resilient architecture.

Circuit Breaker

A circuit breaker prevents cascading failures by stopping requests to a failing dependency:

States:
  CLOSED  → Requests pass through normally
  OPEN    → Requests fail immediately (no network call)
  HALF_OPEN → Limited requests to test if dependency recovered

  CLOSED ──(5 failures in 60s)──▶ OPEN
  OPEN ──(60s timeout)──▶ HALF_OPEN
  HALF_OPEN ──(success)──▶ CLOSED
  HALF_OPEN ──(failure)──▶ OPEN

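import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.state = 'CLOSED'
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # Seconds to stay OPEN before probing
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            # After the reset timeout, let a trial request through.
            if time.monotonic() - self.last_failure_time > self.reset_timeout:
                self.state = 'HALF_OPEN'
            else:
                raise CircuitOpenError("Circuit is open")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial request reopens the circuit immediately.
            if self.state == 'HALF_OPEN' or self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise

        # Success closes the circuit and clears the failure count.
        # Simplification: this counts consecutive failures, not failures within a 60s window.
        self.state = 'CLOSED'
        self.failure_count = 0
        return result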
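
The CLOSED to OPEN transition above is specified as "5 failures in 60s", which needs a time window rather than a plain counter. The following is a minimal sketch of one way to track that; `FailureWindow`, its parameters, and the injectable `clock` are illustrative assumptions, not part of the class above.

```python
import time
from collections import deque


class FailureWindow:
    """Counts failures that occurred within the last `window` seconds."""

    def __init__(self, threshold=5, window=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.clock = clock  # Injectable so the window can be tested without sleeping
        self.failures = deque()

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        self._prune(now)

    def _prune(self, now):
        # Drop timestamps that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()

    def should_open(self):
        self._prune(self.clock())
        return len(self.failures) >= self.threshold
```

A breaker could call `record_failure()` in its exception handler and open the circuit when `should_open()` returns true, so that old, isolated failures do not accumulate toward the threshold.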

Bulkhead

Isolate failures to prevent one component from consuming all resources:

Without bulkhead:
  Thread Pool (100 threads)
  ├── Payment API calls (slow/failing) → consumes 98 threads
  ├── Order API calls → 1 thread (starved)
  └── User API calls → 1 thread (starved)

With bulkhead:
  Payment Pool (30 threads) → failing, but contained
  Order Pool (40 threads)   → operating normally
  User Pool (30 threads)    → operating normally
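
As a rough illustration, the same isolation can be approximated with a concurrency limit per dependency. This sketch uses semaphores rather than separate thread pools; the `Bulkhead` and `BulkheadFullError` names are hypothetical, and the limits simply mirror the pool sizes in the diagram.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up callers that are waiting for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Limits taken from the "With bulkhead" diagram.
BULKHEADS = {
    'payment': Bulkhead('payment', 30),
    'order': Bulkhead('order', 40),
    'user': Bulkhead('user', 30),
}
```

With this arrangement, a hanging payment API can occupy at most 30 concurrent callers; order and user calls keep their own capacity.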

Timeout Strategy

Every external call needs a timeout:

# Tiered timeouts (all values in seconds)
TIMEOUT_CONFIG = {
    'database':       {'connect': 3, 'query': 10},
    'cache':          {'connect': 1, 'operation': 2},
    'payment_api':    {'connect': 5, 'request': 30},
    'internal_api':   {'connect': 2, 'request': 5},
    'email_service':  {'connect': 3, 'request': 10},
}

# Total request timeout
# Must be less than the client's timeout
# Must account for retries
# Example: 3 attempts (1 initial + 2 retries) × 5s = 15s + backoff and overhead < 30s client timeout
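
One way to keep per-call timeouts inside the total request timeout is to carry a deadline through the request and shrink each timeout to whatever budget remains. The sketch below assumes a hypothetical `Deadline` helper; it is not tied to any particular client library.

```python
import time


class Deadline:
    """Tracks how much of an overall time budget is left."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + total_seconds

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, configured_timeout):
        # Never wait longer than the configured timeout or the remaining budget.
        remaining = self.remaining()
        if remaining <= 0:
            raise TimeoutError("request deadline exceeded")
        return min(configured_timeout, remaining)
```

A handler might create `Deadline(20)` on entry and pass `deadline.timeout_for(5)` to each internal API call, so a late retry gets a shorter timeout instead of overrunning the caller.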

Timeout Ordering

Client timeout: 30s
  └── API Gateway timeout: 25s
      └── Service timeout: 20s
          ├── Database timeout: 10s
          └── External API timeout: 5s (× 3 attempts = 15s max)

Each layer’s timeout, including any retries it performs, must be less than its caller’s timeout.
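
This ordering rule is easy to check mechanically, for example in a configuration test. The sketch below is one possible check; the function names and the layer names in `chain` are assumptions chosen to match the diagram.

```python
def worst_case_seconds(timeout, attempts=1, backoff_total=0.0):
    """Upper bound on time spent calling one dependency, including retries."""
    return timeout * attempts + backoff_total


def validate_timeout_chain(layers):
    """Check a list of (name, seconds) pairs ordered from outermost caller inward.

    Returns a list of human-readable violations; an empty list means the chain is valid.
    """
    violations = []
    for (outer_name, outer), (inner_name, inner) in zip(layers, layers[1:]):
        if inner >= outer:
            violations.append(
                f"{inner_name} ({inner}s) must be less than {outer_name} ({outer}s)"
            )
    return violations


chain = [
    ('client', 30),
    ('api_gateway', 25),
    ('service', 20),
    ('external_api', worst_case_seconds(5, attempts=3)),  # 15s worst case
]
print(validate_timeout_chain(chain))  # prints []
```

Note that `backoff_total` defaults to zero here; with real backoff delays between attempts, the external API's worst case moves closer to the 20s service timeout.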

Retry with Backoff

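import random
import time


class RetryableError(Exception):
    """Raised by func for transient failures that are safe to retry."""


def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)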
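
The function above adds a small random offset to an uncapped exponential delay. A common variant, often called "full jitter", instead draws the whole delay at random below a capped exponential ceiling, which spreads retrying clients more evenly. The sketch below only computes the delays; `max_delay` and `rng` are assumed parameters, not part of the function above.

```python
import random


def backoff_delays(max_retries=3, base_delay=1.0, max_delay=30.0, rng=random):
    """Yield one sleep duration per retry, using capped full jitter."""
    for attempt in range(max_retries):
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        # Full jitter: pick uniformly between 0 and the exponential ceiling.
        yield rng.uniform(0, ceiling)
```

A retry loop would call `time.sleep(delay)` for each yielded value before the next attempt.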

Retry Budget

Limit total retries to prevent amplification:

Without retry budget:
  100 requests × (1 attempt + 3 retries) = 400 requests to failing service
  (4x amplification during outage)

With retry budget (10%):
  100 requests + 10 retries = 110 requests
  Only 10% of requests get retried
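
A minimal sketch of such a budget, assuming a simple pair of counters: every first attempt is recorded, and a retry is allowed only while total retries stay within the configured ratio. `RetryBudget` and its method names are hypothetical.

```python
class RetryBudget:
    """Allows retries only while they stay within `ratio` of first-attempt requests."""

    def __init__(self, ratio=0.10):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_acquire_retry(self):
        # Permit the retry only if it keeps retries within the budget.
        if self.retries + 1 > self.requests * self.ratio:
            return False
        self.retries += 1
        return True
```

These counters never reset, which keeps the example short; a long-running service would more likely track them over a rolling time window so the budget reflects recent traffic.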

Graceful Degradation

When a component fails, serve reduced functionality instead of failing entirely:

def get_product_page(product_id):
    product = product_service.get(product_id)  # Required
    
    try:
        reviews = review_service.get_reviews(product_id)
    except ServiceError:
        reviews = []  # Show page without reviews
    
    try:
        recommendations = rec_service.get_recommendations(product_id)
    except ServiceError:
        recommendations = get_cached_popular_products()  # Fallback
    
    return render_page(product, reviews, recommendations)

Degradation Levels

Level 0: Full functionality (all systems healthy)
Level 1: Reduced personalization (recommendation service down)
Level 2: Read-only mode (write path down, serve cached data)
Level 3: Static content (serve cached pages from CDN)
Level 4: Maintenance page (complete failure)
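
To make the levels actionable, a service can derive the current level from component health checks. The mapping below is only a sketch: the component names (`cdn`, `read_path`, `write_path`, `recommendations`) and the order in which they are checked are assumptions about how the levels might correspond to failures.

```python
def degradation_level(health):
    """Map component health flags to a degradation level from 0 to 4.

    `health` is a dict of hypothetical component names to booleans;
    a missing key is treated as unhealthy.
    """
    def up(name):
        return health.get(name, False)

    if not up('cdn'):
        return 4  # Nothing left to serve: maintenance page
    if not up('read_path'):
        return 3  # Only cached static pages are available
    if not up('write_path'):
        return 2  # Read-only mode
    if not up('recommendations'):
        return 1  # Reduced personalization
    return 0      # Full functionality
```

Checking the most severe condition first means the function always reports the worst applicable level when several components are down at once.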

Anti-Patterns

Anti-Pattern	Consequence	Fix
No timeouts	Thread pool exhaustion when dependency hangs	Timeout on every external call
Retry without backoff	Thundering herd on recovering service	Exponential backoff with jitter
No circuit breaker	Cascading failures across services	Circuit breaker on external dependencies
Fail completely on partial failure	Users get errors for non-critical features	Graceful degradation with fallbacks
Same retry policy everywhere	Over-retry on fast-failing services	Calibrate retries per dependency

Reliability is not a feature you add — it is a property that emerges from applying these patterns consistently across every service boundary and every external dependency.