Backend Health Check Patterns
Implement comprehensive health checks for production services. Covers liveness vs. readiness probes, dependency health, deep health checks, health check aggregation, and the patterns that let orchestrators and load balancers make intelligent routing decisions.
A health check is the simplest question in distributed systems: “Is this service working?” But the answer is rarely binary. A service can be alive but overloaded, connected to the database but unable to reach the cache, or running but with a corrupted configuration. Good health checks distinguish between these states and give the orchestrator enough information to make smart decisions.
Health Check Types
Liveness: "Is the process alive?"
What: Basic process health
When: Kubernetes restarts the container if liveness fails
Check: Can the HTTP server respond at all?
Endpoint: /healthz
Response: 200 OK or no response (process dead)
NEVER include dependency checks in liveness.
If the database is down and your liveness check fails,
Kubernetes will restart your container. The database is STILL down.
Now you have a restart loop instead of a degraded service.
Readiness: "Can this instance serve traffic?"
What: Is this instance ready to handle requests?
When: Load balancer removes instance if readiness fails
Check: Dependencies available, warmup complete, not draining?
Endpoint: /ready
Response: 200 (serve traffic) or 503 (remove from pool)
Include dependency checks in readiness.
If the database is down, this instance cannot serve requests.
Load balancer routes to healthy instances instead.
Startup: "Has the service finished initializing?"
What: One-time check during boot
When: Holds off liveness/readiness checks until a slow startup completes
Check: Migrations complete, cache warmed, config loaded?
Endpoint: /startup
Response: 200 when ready, 503 during init
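The three probe types above can be sketched as one routing function that maps a probe path to an HTTP status. This is a minimal illustration under stated assumptions, not a definitive implementation: `ServiceState`, its fields, and `probe_response` are hypothetical names, and the `dependencies_ok` flag stands in for real dependency checks.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical in-process flags that a real service would update itself."""
    started: bool = False          # set once migrations, cache warmup, config load finish
    draining: bool = False         # set when the instance is shutting down
    dependencies_ok: bool = True   # stand-in for real dependency checks


def probe_response(path: str, state: ServiceState) -> tuple[int, str]:
    """Return (HTTP status, body) for a probe path."""
    if path == "/healthz":
        # Liveness: answering at all is the signal; no dependency checks here.
        return 200, "ok"
    if path == "/startup":
        return (200, "started") if state.started else (503, "initializing")
    if path == "/ready":
        ready = state.started and state.dependencies_ok and not state.draining
        return (200, "ready") if ready else (503, "not ready")
    return 404, "unknown probe"


state = ServiceState(started=True, dependencies_ok=False)
print(probe_response("/healthz", state))  # liveness ignores the failed dependency
print(probe_response("/ready", state))    # readiness takes the instance out of rotation
```

Keeping the decision in a pure function like this makes the probe logic easy to unit test, whichever web framework ends up serving the endpoints.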
Implementation
import time
from datetime import datetime, timezone


class HealthChecker:
    """Production health check implementation."""

    def liveness(self):
        """Simple liveness: is the process working?"""
        return {"status": "ok", "timestamp": datetime.now(timezone.utc).isoformat()}

    def readiness(self):
        """Readiness: can this instance serve traffic?"""
        checks = {
            "database": self.check_database(),
            "cache": self.check_cache(),
            "disk_space": self.check_disk(),
            "memory": self.check_memory(),
        }
        # All critical checks must pass; the other checks are reported
        # but do not affect the readiness status.
        critical = ["database"]
        critical_healthy = all(
            checks[c]["status"] == "healthy" for c in critical
        )
        return {
            "status": "ready" if critical_healthy else "not_ready",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }

    def deep_health(self):
        """Deep health: full system diagnostic (not for orchestrator)."""
        return {
            "status": "detailed",
            "version": self.app_version,
            "uptime_seconds": self.uptime(),
            "checks": {
                "database": self.check_database_detailed(),
                "cache": self.check_cache_detailed(),
                "external_apis": self.check_external_apis(),
                "queue": self.check_message_queue(),
                "disk": self.check_disk_detailed(),
                "memory": self.check_memory_detailed(),
                "cpu": self.check_cpu(),
            },
            "metrics": {
                "requests_per_second": self.rps(),
                "error_rate": self.error_rate(),
                "p99_latency_ms": self.p99_latency(),
            },
        }

    def check_database(self):
        """Quick database connectivity check."""
        start = time.monotonic()
        try:
            # Simple query with timeout (assumes a DB client whose
            # execute() accepts a timeout argument, in seconds)
            self.db.execute("SELECT 1", timeout=2)
            latency_ms = round((time.monotonic() - start) * 1000, 1)
            return {"status": "healthy", "latency_ms": latency_ms}
        except Exception as e:
            return {"status": "unhealthy", "error": str(e)}
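The readiness result only helps a load balancer once it is turned into an HTTP status code. The sketch below shows one way to aggregate individual check results and map them to 200 or 503. The function names and the intermediate `degraded` status are assumptions added for illustration; the original implementation reports only `ready` and `not_ready`.

```python
import json


def aggregate_status(checks: dict, critical: set) -> str:
    """Combine per-check results into one status.

    A failed critical check means "not_ready"; a failed non-critical
    check means "degraded" (hypothetical extra state); otherwise "ready".
    """
    failed = {name for name, result in checks.items()
              if result.get("status") != "healthy"}
    if failed & critical:
        return "not_ready"
    return "degraded" if failed else "ready"


def to_http(status: str, checks: dict) -> tuple[int, str]:
    """Map an aggregated status to (HTTP status code, JSON body)."""
    code = 503 if status == "not_ready" else 200
    return code, json.dumps({"status": status, "checks": checks})


# Example values are made up for illustration.
checks = {
    "database": {"status": "healthy", "latency_ms": 3.2},
    "cache": {"status": "unhealthy", "error": "connection refused"},
}
status = aggregate_status(checks, critical={"database"})
code, body = to_http(status, checks)
print(code, status)
```

Returning 200 for a degraded instance keeps it in the pool while the JSON body still tells monitoring which dependency is failing.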
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Dependency check in liveness probe | Restart loops when dependency is down | Only process health in liveness |
| Health check hits production database hard | Health checks cause load | Lightweight query with timeout |
| No timeout on health checks | Health check hangs, blocks orchestrator | 2-3 second timeout on all checks |
| Same endpoint for all check types | Cannot distinguish liveness vs readiness | Separate endpoints: /healthz (liveness), /ready (readiness), /startup (startup) |
| Expose deep health publicly | Security information leakage | Deep health behind auth, basic health public |
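Two of the fixes in the table, timeouts on every check and keeping checks lightweight, can be sketched with the standard library alone. The names `run_with_timeout` and `CachedCheck`, the pool size, and the TTL value are assumptions for illustration, not part of any particular framework.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Small shared pool so a hung check cannot block the probe handler itself.
_executor = ThreadPoolExecutor(max_workers=4)


def run_with_timeout(check, timeout_s=2.0):
    """Run a check callable; report unhealthy if it exceeds timeout_s."""
    future = _executor.submit(check)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running; only the caller stops waiting.
        return {"status": "unhealthy", "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        return {"status": "unhealthy", "error": str(exc)}


class CachedCheck:
    """Reuse a check result for ttl_s seconds so frequent probes
    do not hit the dependency on every request."""

    def __init__(self, check, ttl_s=5.0, clock=time.monotonic):
        self._check = check
        self._ttl_s = ttl_s
        self._clock = clock
        self._expires_at = float("-inf")
        self._result = None

    def __call__(self):
        now = self._clock()
        if now >= self._expires_at:
            self._result = self._check()
            self._expires_at = now + self._ttl_s
        return self._result


def slow_check():
    time.sleep(0.2)  # simulates a dependency that is slow to answer
    return {"status": "healthy"}


print(run_with_timeout(slow_check, timeout_s=0.05)["status"])
```

The timeout only bounds how long the caller waits, so the underlying client call should still carry its own timeout. `CachedCheck` is not thread-safe as written; a lock around the refresh would be needed if probes are served concurrently.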
Health checks are the communication protocol between your service and the infrastructure. They tell load balancers, orchestrators, and monitoring systems exactly what they need to know to make intelligent decisions about traffic routing, scaling, and alerting.