Backend Health Check Patterns

A health check is the simplest question in distributed systems: “Is this service working?” But the answer is rarely binary. A service can be alive but overloaded, connected to the database but unable to reach the cache, or running but with a corrupted configuration. Good health checks distinguish between these states and give the orchestrator enough information to make smart decisions.

Health Check Types

Liveness: "Is the process alive?"
  What: Basic process health
  When: Kubernetes restarts the container if liveness fails
  Check: Can the HTTP server respond at all?
  Endpoint: /healthz
  Response: 200 OK or no response (process dead)
  
  NEVER include dependency checks in liveness.
  If the database is down and your liveness check fails,
  Kubernetes will restart your container. The database is STILL down.
  Now you have a restart loop instead of a degraded service.
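
As a rough illustration of the rule above, here is a minimal liveness endpoint sketched with only the Python standard library. The names liveness_response, LivenessHandler, and make_server, and the default port 8080, are assumptions made for this example. The point is that the handler touches no database, cache, or other dependency.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def liveness_response():
    """Return (status_code, body_bytes) without consulting any dependency."""
    body = json.dumps({"status": "ok"}).encode("utf-8")
    return 200, body


class LivenessHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz; any other path gets a 404."""

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = liveness_response()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Probes arrive every few seconds; keep them out of the access log.
        pass


def make_server(host="127.0.0.1", port=8080):
    """Build (but do not start) the probe server. Host and port are assumptions."""
    return ThreadingHTTPServer((host, port), LivenessHandler)
```

Calling make_server().serve_forever() would start it. If the process is deadlocked or dead, the request simply gets no answer, which is exactly the signal the orchestrator needs.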

Readiness: "Can this instance serve traffic?"
  What: Is this instance ready to handle requests?
  When: Load balancer removes instance if readiness fails
  Check: Dependencies available, warmup complete, not draining?
  Endpoint: /ready
  Response: 200 (serve traffic) or 503 (remove from pool)
  
  Include dependency checks in readiness.
  If the database is down, this instance cannot serve requests.
  Load balancer routes to healthy instances instead.
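
The readiness decision described above can be sketched as a small pure function that maps check results to an HTTP status. The function name readiness_response, the draining flag, and the failed_critical field are hypothetical choices for this sketch. Treating a missing critical check as a failure is also an assumption.

```python
from typing import Dict, Iterable, Tuple


def readiness_response(
    checks: Dict[str, dict],
    critical: Iterable[str] = ("database",),
    draining: bool = False,
) -> Tuple[int, dict]:
    """Map dependency check results to (http_status, json_ready_body)."""
    # A critical check that is missing or not "healthy" counts as failed.
    failed = [
        name for name in critical
        if checks.get(name, {}).get("status") != "healthy"
    ]
    ready = not failed and not draining
    body = {
        "status": "ready" if ready else "not_ready",
        "failed_critical": failed,
        "draining": draining,
        "checks": checks,
    }
    # 200 keeps the instance in the pool; 503 asks the balancer to remove it.
    return (200 if ready else 503), body
```

Non-critical checks such as the cache are reported in the body but do not flip the status, so a cache outage degrades the service without pulling every instance out of rotation.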

Startup: "Has the service finished initializing?"
  What: One-time check during boot
  When: Kubernetes holds off liveness and readiness probes until the startup probe succeeds
  Check: Migrations complete, cache warmed, config loaded?
  Endpoint: /startup
  Response: 200 when ready, 503 during init
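
One way to back a /startup endpoint is a small gate that tracks which initialization steps are still pending. This is a sketch under assumptions: the class name StartupGate and the step names used below are illustrative, not part of any framework.

```python
import threading


class StartupGate:
    """Tracks one-time initialization steps for a /startup endpoint."""

    def __init__(self, steps):
        self._pending = set(steps)
        self._lock = threading.Lock()

    def mark_done(self, step):
        """Record that an initialization step has finished."""
        with self._lock:
            self._pending.discard(step)

    def status(self):
        """Return (http_status, body): 503 while any step is still pending."""
        with self._lock:
            pending = sorted(self._pending)
        if pending:
            return 503, {"status": "starting", "pending": pending}
        return 200, {"status": "started", "pending": []}
```

A boot sequence might create StartupGate(["migrations", "cache_warm", "config"]) and call mark_done as each step completes. Listing the pending steps in the 503 body makes a slow startup easier to diagnose.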

Implementation

from datetime import datetime, timezone

class HealthChecker:
    """Production health check implementation.

    Assumes the application supplies self.db, self.app_version, and the
    check_* and metric helpers that are not shown here.
    """
    
    def liveness(self):
        """Simple liveness: is the process working?"""
        # datetime.utcnow() is deprecated; use an explicit UTC-aware timestamp
        return {"status": "ok", "timestamp": datetime.now(timezone.utc).isoformat()}
    
    def readiness(self):
        """Readiness: can this instance serve traffic?"""
        checks = {
            "database": self.check_database(),
            "cache": self.check_cache(),
            "disk_space": self.check_disk(),
            "memory": self.check_memory(),
        }
        
        # Only critical checks gate readiness; the rest are reported
        # in the response but do not fail the probe
        critical = ["database"]
        critical_healthy = all(
            checks[c]["status"] == "healthy" for c in critical
        )
        
        return {
            "status": "ready" if critical_healthy else "not_ready",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    
    def deep_health(self):
        """Deep health: full system diagnostic (not for orchestrator)."""
        return {
            "status": "detailed",
            "version": self.app_version,
            "uptime_seconds": self.uptime(),
            "checks": {
                "database": self.check_database_detailed(),
                "cache": self.check_cache_detailed(),
                "external_apis": self.check_external_apis(),
                "queue": self.check_message_queue(),
                "disk": self.check_disk_detailed(),
                "memory": self.check_memory_detailed(),
                "cpu": self.check_cpu(),
            },
            "metrics": {
                "requests_per_second": self.rps(),
                "error_rate": self.error_rate(),
                "p99_latency_ms": self.p99_latency(),
            },
        }
    
    def check_database(self):
        """Quick database connectivity check."""
        try:
            # Simple query with timeout
            self.db.execute("SELECT 1", timeout=2)
            return {"status": "healthy", "latency_ms": self.last_db_latency}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}
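
The check above relies on the database driver honoring a timeout argument and on a separately tracked latency value. Where neither is available, a generic wrapper can enforce the deadline and measure latency itself. The following is a sketch, assuming a hypothetical helper name run_check and a small shared thread pool. Note that a timed-out call keeps running in its worker thread; the wrapper only stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool so a hung dependency cannot spawn unbounded threads.
_executor = ThreadPoolExecutor(max_workers=4)


def run_check(check_fn, timeout_s=2.0):
    """Run check_fn with a deadline and report status plus measured latency."""
    start = time.perf_counter()
    future = _executor.submit(check_fn)
    try:
        future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker may still be blocked; we only stop waiting for it.
        return {"status": "unhealthy", "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000
    return {"status": "healthy", "latency_ms": round(latency_ms, 2)}
```

A call such as run_check(lambda: db.execute("SELECT 1")) would then return the same dictionary shape as check_database, with db standing in for whatever client the application uses.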

Anti-Patterns

Anti-Pattern	Consequence	Fix
Dependency check in liveness probe	Restart loops when dependency is down	Only process health in liveness
Health check runs heavy queries against the production database	Probes themselves add load to the database	Lightweight query (e.g. SELECT 1) with a timeout
No timeout on health checks	Health check hangs, blocks orchestrator	2-3 second timeout on all checks
Same endpoint for all check types	Cannot distinguish liveness vs readiness	Separate endpoints: /healthz, /ready, /startup
Expose deep health publicly	Security information leakage	Deep health behind auth, basic health public
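
For the "health checks cause load" row, one common mitigation is to cache each check result for a few seconds so that frequent probes from many sources do not each hit the dependency. The sketch below assumes a hypothetical CachedCheck wrapper and a 5-second TTL; both are illustrative choices to tune per service.

```python
import threading
import time


class CachedCheck:
    """Re-runs a check at most once per ttl_s; otherwise returns the last result."""

    def __init__(self, check_fn, ttl_s=5.0, clock=time.monotonic):
        self._check_fn = check_fn
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so the TTL can be tested without sleeping
        self._lock = threading.Lock()
        self._result = None
        self._expires_at = 0.0

    def __call__(self):
        # Holding the lock during the check means concurrent probes wait for
        # one in-flight call instead of each querying the dependency.
        with self._lock:
            now = self._clock()
            if self._result is None or now >= self._expires_at:
                self._result = self._check_fn()
                self._expires_at = now + self._ttl_s
            return self._result
```

The trade-off is staleness: a dependency that fails right after a successful check is reported healthy for up to ttl_s seconds, so the TTL should stay well below the probe failure threshold.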

Health checks are the communication protocol between your service and the infrastructure. They tell load balancers, orchestrators, and monitoring systems exactly what they need to know to make intelligent decisions about traffic routing, scaling, and alerting.
