Verified by Garnet Grid

Reliability Patterns

Implement the fundamental reliability patterns that keep distributed systems running under failure. Covers circuit breakers, bulkheads, timeouts, retries with backoff, graceful degradation, and fallback strategies.

Distributed systems fail. Networks partition, services crash, databases slow down, queues back up. Reliability is not about preventing all failures — it is about designing systems that continue to function acceptably when components fail. These patterns are the building blocks of resilient architecture.


Circuit Breaker

A circuit breaker prevents cascading failures by stopping requests to a failing dependency:

States:
  CLOSED  → Requests pass through normally
  OPEN    → Requests fail immediately (no network call)
  HALF_OPEN → Limited requests to test if dependency recovered

  CLOSED ──(5 failures in 60s)──▶ OPEN
  OPEN ──(60s timeout)──▶ HALF_OPEN
  HALF_OPEN ──(success)──▶ CLOSED
  HALF_OPEN ──(failure)──▶ OPEN
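
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # Simplification: counts consecutive failures rather than failures
    # inside a 60s window. Not thread-safe as written.
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.state = 'CLOSED'
        self.failure_count = 0  # consecutive failures since the last success
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN before probing
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            # monotonic() is immune to wall-clock adjustments
            if time.monotonic() - self.last_failure_time > self.reset_timeout:
                self.state = 'HALF_OPEN'  # let a trial request through
            else:
                raise CircuitOpenError("Circuit is open")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed HALF_OPEN probe reopens the circuit immediately.
            if self.state == 'HALF_OPEN' or self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise

        # Any success closes the circuit and clears the failure count.
        self.state = 'CLOSED'
        self.failure_count = 0
        return result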

Bulkhead

Isolate failures to prevent one component from consuming all resources:

Without bulkhead:
  Thread Pool (100 threads)
  ├── Payment API calls (slow/failing) → consumes 98 threads
  ├── Order API calls → 1 thread (starved)
  └── User API calls → 1 thread (starved)

With bulkhead:
  Payment Pool (30 threads) → failing, but contained
  Order Pool (40 threads)   → operating normally
  User Pool (30 threads)    → operating normally
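
The isolation shown above can be sketched in Python. This is a minimal illustration, not a definitive implementation: it uses a semaphore-based concurrency limit per dependency rather than separate thread pools, and the names `Bulkhead`, `BulkheadFullError`, and `BULKHEADS` are hypothetical. The limits mirror the pool sizes in the diagram.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already in use."""


class Bulkhead:
    """Caps the number of concurrent calls allowed for one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing, so a slow dependency cannot
        # accumulate waiting callers.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical limits, taken from the pool sizes in the diagram.
BULKHEADS = {
    'payment': Bulkhead('payment', 30),
    'order': Bulkhead('order', 40),
    'user': Bulkhead('user', 30),
}
```

With this arrangement, a hanging payment dependency can hold at most 30 callers; calls made through `BULKHEADS['order']` and `BULKHEADS['user']` are unaffected.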

Timeout Strategy

Every external call needs a timeout:

# Tiered timeouts
TIMEOUT_CONFIG = {
    'database':       {'connect': 3, 'query': 10},
    'cache':          {'connect': 1, 'operation': 2},
    'payment_api':    {'connect': 5, 'request': 30},
    'internal_api':   {'connect': 2, 'request': 5},
    'email_service':  {'connect': 3, 'request': 10},
}

# Total request timeout
# Must be less than client's timeout
# Must account for retries
# Example: 3 attempts (1 initial + 2 retries) × 5s = 15s + backoff overhead < 30s client timeout
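
To make the budget arithmetic concrete, here is a small sketch that computes an upper bound on the time a call can take when every attempt times out. The function name `worst_case_seconds` and its parameters are hypothetical. It counts the initial attempt as well as the retries, and assumes backoff doubles from `base_delay` with at most `max_jitter` seconds of jitter per sleep.

```python
def worst_case_seconds(per_attempt_timeout, max_retries, base_delay=0.0, max_jitter=0.0):
    """Upper bound on elapsed time if every attempt times out."""
    attempts = max_retries + 1  # the initial attempt plus each retry
    # One backoff sleep happens before each retry: base_delay * 2**i plus jitter.
    backoff = sum(base_delay * (2 ** i) + max_jitter for i in range(max_retries))
    return attempts * per_attempt_timeout + backoff


# 2 retries at 5s per attempt, ignoring backoff: 3 attempts x 5s
no_backoff = worst_case_seconds(5, 2)
# Same policy with 1s base delay and up to 1s jitter per sleep
with_backoff = worst_case_seconds(5, 2, base_delay=1.0, max_jitter=1.0)
```

Comparing the result against the caller's timeout (30s in the example) shows whether the retry policy fits inside the budget.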

Timeout Ordering

Client timeout: 30s
  └── API Gateway timeout: 25s
      └── Service timeout: 20s
          ├── Database timeout: 10s
          └── External API timeout: 5s per attempt (× 3 attempts = 15s max)

Each layer’s timeout must be less than its caller’s timeout.
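
The ordering rule can be checked mechanically, for example in a configuration test. The sketch below is illustrative only: `check_timeout_chain` and `CHAIN` are hypothetical names, and the values are the ones from the diagram above.

```python
def check_timeout_chain(chain):
    """Validate a list of (layer_name, timeout_seconds), outermost caller first.

    Returns a list of violation messages. An empty list means every layer's
    timeout is strictly less than its caller's.
    """
    problems = []
    for (caller, caller_t), (callee, callee_t) in zip(chain, chain[1:]):
        if callee_t >= caller_t:
            problems.append(f"{callee} ({callee_t}s) must be < {caller} ({caller_t}s)")
    return problems


# Values from the timeout ordering diagram.
CHAIN = [('client', 30), ('api_gateway', 25), ('service', 20), ('database', 10)]
violations = check_timeout_chain(CHAIN)
```

An inner timeout that equals or exceeds its caller's is reported, because the caller would give up first and the inner work would be wasted.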


Retry with Backoff

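import random
import time

class RetryableError(Exception):
    """Transient failure that is safe to retry (e.g. a timeout)."""

def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)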

Retry Budget

Limit total retries to prevent amplification:

Without retry budget:
  100 requests × (1 attempt + 3 retries) = 400 requests to failing service
  (4x amplification during outage)

With retry budget (10%):
  100 requests + 10 retries = 110 requests
  Only 10% of requests get retried
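
One way to enforce such a budget is a shared counter that grants a retry only while retries stay within a fixed percentage of requests seen. The sketch below is a simplification under stated assumptions: the `RetryBudget` class and its method names are hypothetical, it counts over the lifetime of the object rather than over a sliding time window, and it is not thread-safe.

```python
class RetryBudget:
    """Grants retries only while they stay within a percentage of requests."""

    def __init__(self, percent=10):
        self.percent = percent
        self.requests = 0
        self.retries = 0

    def record_request(self):
        # Call once for every original (non-retry) request.
        self.requests += 1

    def try_acquire_retry(self):
        # Integer arithmetic avoids floating-point rounding at the boundary.
        if (self.retries + 1) * 100 <= self.percent * self.requests:
            self.retries += 1
            return True
        return False


budget = RetryBudget(percent=10)
for _ in range(100):
    budget.record_request()
granted = sum(budget.try_acquire_retry() for _ in range(300))
total_sent = budget.requests + granted
```

Callers check `try_acquire_retry()` before each retry and give up immediately when it returns `False`, which caps the extra load on the failing service.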

Graceful Degradation

When a component fails, serve reduced functionality instead of failing entirely:

def get_product_page(product_id):
    product = product_service.get(product_id)  # Required
    
    try:
        reviews = review_service.get_reviews(product_id)
    except ServiceError:
        reviews = []  # Show page without reviews
    
    try:
        recommendations = rec_service.get_recommendations(product_id)
    except ServiceError:
        recommendations = get_cached_popular_products()  # Fallback
    
    return render_page(product, reviews, recommendations)

Degradation Levels

Level 0: Full functionality (all systems healthy)
Level 1: Reduced personalization (recommendation service down)
Level 2: Read-only mode (write path down, serve cached data)
Level 3: Static content (serve cached pages from CDN)
Level 4: Maintenance page (complete failure)
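
The levels can be selected from component health, as in the sketch below. The mapping is an assumption made for illustration: the component names (`recommendations`, `write_path`, `read_path`, `cdn`) and the function `current_level` are hypothetical, and a real system would choose its own signals for each level.

```python
from enum import IntEnum


class DegradationLevel(IntEnum):
    FULL = 0
    REDUCED_PERSONALIZATION = 1
    READ_ONLY = 2
    STATIC_CONTENT = 3
    MAINTENANCE = 4


def current_level(health):
    """Pick the degradation level from a dict of component name -> healthy (bool).

    Components missing from the dict are assumed healthy. The most severe
    condition is checked first so the highest applicable level wins.
    """
    if not health.get('cdn', True):
        return DegradationLevel.MAINTENANCE  # assumed: nothing can be served
    if not health.get('read_path', True):
        return DegradationLevel.STATIC_CONTENT
    if not health.get('write_path', True):
        return DegradationLevel.READ_ONLY
    if not health.get('recommendations', True):
        return DegradationLevel.REDUCED_PERSONALIZATION
    return DegradationLevel.FULL


level = current_level({'recommendations': False, 'write_path': False})
```

Request handlers can then branch on the level, for example rejecting writes whenever it is `READ_ONLY` or higher.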

Anti-Patterns

Anti-Pattern | Consequence | Fix
------------ | ----------- | ---
No timeouts | Thread pool exhaustion when dependency hangs | Timeout on every external call
Retry without backoff | Thundering herd on recovering service | Exponential backoff with jitter
No circuit breaker | Cascading failures across services | Circuit breaker on external dependencies
Fail completely on partial failure | Users get errors for non-critical features | Graceful degradation with fallbacks
Same retry policy everywhere | Over-retry on fast-failing services | Calibrate retries per dependency

Reliability is not a feature you add — it is a property that emerges from applying these patterns consistently across every service boundary and every external dependency.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
