
Backend Health Check Patterns

Implement comprehensive health checks for production services. Covers liveness vs. readiness probes, dependency health, deep health checks, health check aggregation, and the patterns that let orchestrators and load balancers make intelligent routing decisions.

A health check is the simplest question in distributed systems: “Is this service working?” But the answer is rarely binary. A service can be alive but overloaded, connected to the database but unable to reach the cache, or running but with a corrupted configuration. Good health checks distinguish between these states and give the orchestrator enough information to make smart decisions.
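One way to make the non-binary answer concrete is to aggregate individual checks into three levels instead of two. The sketch below is only an illustration: the `Health` enum, the `aggregate` function, and the rule that a failing non-critical check means "degraded" are assumptions, not something this guide prescribes.

```python
from enum import IntEnum


class Health(IntEnum):
    """Ordered so that a larger value means a worse state."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def aggregate(checks, critical):
    """Combine per-dependency results into one overall state.

    checks:   mapping of check name -> bool (True means the check passed)
    critical: set of check names whose failure makes the service unhealthy
    """
    overall = Health.HEALTHY
    for name, passed in checks.items():
        if passed:
            continue
        # A failed critical check is fatal; anything else only degrades.
        level = Health.UNHEALTHY if name in critical else Health.DEGRADED
        overall = max(overall, level)
    return overall
```

With this rule, a service that reaches the database but not the cache reports `DEGRADED` rather than failing outright, which matches the "alive but impaired" cases described above.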


Health Check Types

Liveness: "Is the process alive?"
  What: Basic process health
  When: Kubernetes restarts the pod if liveness fails
  Check: Can the HTTP server respond at all?
  Endpoint: /healthz
  Response: 200 OK or no response (process dead)
  
  NEVER include dependency checks in liveness.
  If the database is down and your liveness check fails,
  Kubernetes will restart your pod. The database is STILL down.
  Now you have a restart loop instead of a degraded service.

Readiness: "Can this instance serve traffic?"
  What: Is this instance ready to handle requests?
  When: Load balancer removes instance if readiness fails
  Check: Dependencies available, warmup complete, not draining?
  Endpoint: /ready
  Response: 200 (serve traffic) or 503 (remove from pool)
  
  Include dependency checks in readiness.
  If the database is down, this instance cannot serve requests.
  Load balancer routes to healthy instances instead.

Startup: "Has the service finished initializing?"
  What: One-time check during boot
  When: Kubernetes holds off liveness and readiness probes until the startup probe succeeds
  Check: Migrations complete, cache warmed, config loaded?
  Endpoint: /startup
  Response: 200 when ready, 503 during init
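
The three probe types above can be sketched as three small functions behind one HTTP handler. This is a minimal illustration using only the Python standard library. `ProbeState`, `STATE`, `ROUTES`, `serve`, and the single `database_ok` flag are hypothetical names chosen for the example; a real service would update these flags from its own initialization and dependency checks.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProbeState:
    """Flags the probes read. All names here are illustrative."""

    def __init__(self):
        self.started = threading.Event()   # set once initialization finishes
        self.draining = threading.Event()  # set when shutdown begins
        self.database_ok = True            # updated by a dependency check


STATE = ProbeState()


def liveness(state):
    # Never consults dependencies: being able to answer is the whole check.
    return 200, {"status": "ok"}


def startup(state):
    if state.started.is_set():
        return 200, {"status": "started"}
    return 503, {"status": "starting"}


def readiness(state):
    ready = (
        state.started.is_set()
        and state.database_ok
        and not state.draining.is_set()
    )
    if ready:
        return 200, {"status": "ready"}
    return 503, {"status": "not_ready"}


ROUTES = {"/healthz": liveness, "/ready": readiness, "/startup": startup}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        probe = ROUTES.get(self.path)
        code, body = probe(STATE) if probe else (404, {"error": "not found"})
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking call; intended to be run from the service's entry point."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Note that a database outage flips only `/ready` to 503. `/healthz` keeps returning 200, so the orchestrator stops routing traffic to the instance without restarting it.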

Implementation

import time
from datetime import datetime, timezone


class HealthChecker:
    """Production health check implementation.

    Assumes the host application provides self.db, self.app_version, and the
    check_* and metric helpers that are referenced below but not shown here.
    """

    def liveness(self):
        """Simple liveness: can the process respond at all?"""
        return {
            "status": "ok",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    def readiness(self):
        """Readiness: can this instance serve traffic?"""
        checks = {
            "database": self.check_database(),
            "cache": self.check_cache(),
            "disk_space": self.check_disk(),
            "memory": self.check_memory(),
        }

        # All critical checks must pass. Non-critical results are still
        # reported in the payload but do not take the instance out of rotation.
        critical = ["database"]
        critical_healthy = all(
            checks[c]["status"] == "healthy" for c in critical
        )

        return {
            "status": "ready" if critical_healthy else "not_ready",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    def deep_health(self):
        """Deep health: full system diagnostic (not for orchestrator)."""
        return {
            "status": "detailed",
            "version": self.app_version,
            "uptime_seconds": self.uptime(),
            "checks": {
                "database": self.check_database_detailed(),
                "cache": self.check_cache_detailed(),
                "external_apis": self.check_external_apis(),
                "queue": self.check_message_queue(),
                "disk": self.check_disk_detailed(),
                "memory": self.check_memory_detailed(),
                "cpu": self.check_cpu(),
            },
            "metrics": {
                "requests_per_second": self.rps(),
                "error_rate": self.error_rate(),
                "p99_latency_ms": self.p99_latency(),
            },
        }

    def check_database(self):
        """Quick database connectivity check."""
        started = time.monotonic()
        try:
            # Cheap query with a short timeout so the probe cannot hang.
            # The timeout keyword depends on the database client in use.
            self.db.execute("SELECT 1", timeout=2)
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}
        # Measure latency here rather than relying on a stored attribute
        latency_ms = round((time.monotonic() - started) * 1000, 1)
        return {"status": "healthy", "latency_ms": latency_ms}
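
The methods above return dictionaries, but orchestrators and load balancers act on the HTTP status code, not the body. A minimal sketch of that last step follows. `to_http_response` is a hypothetical helper, and treating only the `"ok"` and `"ready"` status strings as success is an assumption based on the payloads shown in this section.

```python
import json

# Status strings that should translate to a passing probe (assumed from
# the liveness and readiness payloads shown in this section).
PASSING_STATUSES = {"ok", "ready"}


def to_http_response(result):
    """Map a health payload to (status_code, json_body).

    Anything that is not explicitly passing becomes 503 so that a missing
    or unexpected status never keeps a broken instance in rotation.
    """
    code = 200 if result.get("status") in PASSING_STATUSES else 503
    return code, json.dumps(result)
```

Defaulting to 503 is the conservative choice: an unknown state removes the instance from the pool instead of silently serving traffic.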

Anti-Patterns

| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Dependency check in liveness probe | Restart loops when dependency is down | Only process health in liveness |
| Health check hits production database hard | Health checks cause load | Lightweight query with timeout |
| No timeout on health checks | Health check hangs, blocks orchestrator | 2-3 second timeout on all checks |
| Same endpoint for all check types | Cannot distinguish liveness vs readiness | Separate endpoints: /healthz, /ready, /health |
| Expose deep health publicly | Security information leakage | Deep health behind auth, basic health public |
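
The timeout fix can be applied uniformly by wrapping every check in one helper rather than relying on each client library's own timeout. The sketch below uses `concurrent.futures` from the standard library. `run_with_timeout` and the shared `_executor` pool are hypothetical names, and the 2-second default is taken from the range suggested above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool so a hung check does not block the request thread.
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check, timeout_s=2.0):
    """Run a zero-argument check and report its outcome within timeout_s.

    The check passes if it returns without raising. A check that exceeds
    the timeout is reported as unhealthy; its worker thread keeps running
    in the background because Python threads cannot be forcibly stopped.
    """
    started = time.monotonic()
    future = _executor.submit(check)
    try:
        future.result(timeout=timeout_s)
        result = {"status": "healthy"}
    except FutureTimeout:
        result = {"status": "unhealthy", "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        result = {"status": "unhealthy", "error": str(exc)}
    result["latency_ms"] = round((time.monotonic() - started) * 1000, 1)
    return result
```

Because a timed-out check still occupies a worker until it finishes, this wrapper bounds the probe's response time but not the underlying work. The check itself should still use a client-level timeout where one is available.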

Health checks are the communication protocol between your service and the infrastructure. They tell load balancers, orchestrators, and monitoring systems exactly what they need to know to make intelligent decisions about traffic routing, scaling, and alerting.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
