Self-Healing Infrastructure: Automated Recovery Without Human Intervention
Build infrastructure that detects failures and recovers automatically. Covers health check design, auto-restart policies, auto-scaling, circuit breakers, automated runbooks, and the observability foundation required to trust automated remediation.
Self-healing infrastructure is the practice of building systems that detect degradation and take corrective action without a human picking up the phone at 3 AM. It is not about eliminating outages — it is about eliminating the ones that are routine, predictable, and mechanically fixable.
The 3 AM page for a full disk, a crashed process, or a hung connection pool is not an engineering problem. It is an automation gap.
The Self-Healing Stack
Self-healing operates at multiple layers:
Layer 4: Application → Circuit breakers, retry logic, graceful degradation
Layer 3: Container → Liveness/readiness probes, restart policies
Layer 2: Orchestrator → Auto-scaling, node replacement, pod rescheduling
Layer 1: Infrastructure → Instance recovery, AZ failover, DNS failover
Each layer handles failures invisibly to the layers above it. A crashed container restarts in place before the orchestrator has to reschedule anything. A failing node gets drained before the application sees errors.
Health Check Design
Every self-healing mechanism depends on health checks. Bad health checks cause two problems: false positives (killing healthy services) and false negatives (ignoring sick ones).
Liveness vs. Readiness
- Liveness: “Is this process alive?” Failure triggers a restart.
- Readiness: “Can this process serve traffic?” Failure removes it from the load balancer.
livenessProbe:
  httpGet:
    path: /healthz/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
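With these settings, a hung process is restarted after roughly periodSeconds × failureThreshold = 30 seconds of failed liveness checks, while a pod that cannot serve traffic is pulled out of rotation within about 10 seconds (5s × 2 failures).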
What to Check
Liveness should verify the process can respond at all. Keep it simple:
@app.get("/healthz/live")
def liveness():
return {"status": "alive"}
Readiness should verify the process can do useful work:
@app.get("/healthz/ready")
async def readiness():
checks = {}
# Database reachable?
try:
await db.execute("SELECT 1")
checks["database"] = "ok"
except Exception:
checks["database"] = "failed"
# Cache reachable?
try:
await redis.ping()
checks["cache"] = "ok"
except Exception:
checks["cache"] = "failed"
all_ok = all(v == "ok" for v in checks.values())
status_code = 200 if all_ok else 503
return JSONResponse(checks, status_code=status_code)
Anti-Pattern: Deep Health Checks in Liveness
If your liveness probe checks the database, and the database goes down, Kubernetes restarts all your pods simultaneously. This causes a cascade failure — all pods restart, try to reconnect to the same database, and overwhelm it with connection storms.
Rule: Liveness checks the process. Readiness checks dependencies.
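For contrast, the anti-pattern in code is a liveness handler that reaches into a dependency. This is a hypothetical bad example, shown only to make the failure mode concrete:

# Anti-pattern: a dependency outage now restarts every pod at once
@app.get("/healthz/live")
async def liveness_bad():
    await db.execute("SELECT 1")  # this check belongs in readiness, not liveness
    return {"status": "alive"}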
Container Restart Policies
Kubernetes automatically restarts failed containers with exponential backoff:
Attempt 1: Restart immediately
Attempt 2: Wait 10s
Attempt 3: Wait 20s
Attempt 4: Wait 40s
...
Maximum: Wait 5 minutes between restarts
This handles transient failures (OOM kills, uncaught exceptions) without operator intervention. For persistent failures, the backoff prevents a tight restart-crash loop from consuming cluster resources.
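The schedule amounts to a capped exponential backoff. A minimal sketch of the same policy, should you need it outside Kubernetes (the base and cap here mirror the kubelet defaults above):

def restart_delay(attempt: int, base: float = 10.0, cap: float = 300.0) -> float:
    """Seconds to wait before restart number `attempt` (1-indexed)."""
    if attempt <= 1:
        return 0.0  # first restart is immediate
    return min(base * 2 ** (attempt - 2), cap)  # 10s, 20s, 40s, ... capped at 5 min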
Pod Disruption Budgets
Protect against too many pods failing simultaneously:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: order-service
This guarantees at least 2 pods remain available during voluntary disruptions (node drains, cluster upgrades).
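Note the scope: PDBs govern only voluntary disruptions, where pods are evicted through the eviction API. They do nothing for involuntary failures like node crashes, which is what the restart and rescheduling machinery above is for.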
Auto-Scaling as Self-Healing
Auto-scaling is not just about cost — it is about resilience. When traffic spikes beyond current capacity, adding instances is an automated recovery action.
Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
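The asymmetry is deliberate: scaling up reacts within 30 seconds (adding at most 50% more pods per minute) because running under capacity drops requests, while scaling down waits a full five minutes so a brief lull does not shed capacity right before the next spike.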
Custom Metrics Scaling
Scale on business signals, not just CPU:
metrics:
  - type: Pods
    pods:
      metric:
        name: queue_depth
      target:
        type: AverageValue
        averageValue: 10
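Pods metrics are not available out of the box: the cluster needs a custom metrics adapter (prometheus-adapter is a common choice) exposing queue_depth through the custom metrics API before this HPA has anything to read.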
Circuit Breakers
Circuit breakers stop failures in a downstream service from cascading upstream:
State: CLOSED (normal)
  → Request fails
  → Failure count increments
  → Threshold exceeded (5 failures in 30s)

State: OPEN (rejecting)
  → All requests fail fast with fallback response
  → Timer expires after 30s

State: HALF-OPEN (testing)
  → Allow one request through
  → If success → CLOSED
  → If failure → OPEN
import httpx
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=30)
def call_payment_service(order_id):
    # PAYMENT_URL is the payment service's base URL, taken from config
    response = httpx.post(f"{PAYMENT_URL}/charge", json={"order": order_id})
    response.raise_for_status()
    return response.json()
Fallback Responses
When the circuit is open, return a degraded but functional response:
from circuitbreaker import CircuitBreakerError

try:
    recommendations = call_recommendation_service(user_id)
except CircuitBreakerError:
    recommendations = get_popular_items()  # Static fallback
Automated Runbooks
For failures that require multi-step remediation, automated runbooks codify the response:
async def handle_disk_full_alert(alert):
    """Automated runbook: disk usage above 90%."""
    # Step 1: Clean known safe targets
    await clean_temp_files(alert.host)
    await clean_old_logs(alert.host, days=7)

    # Step 2: Check if resolved
    current_usage = await check_disk_usage(alert.host)
    if current_usage < 80:
        await notify_slack(f"Self-healed: {alert.host} disk at {current_usage}%")
        return

    # Step 3: Expand the volume if this is a cloud host
    if alert.host.is_cloud:
        await expand_ebs_volume(alert.host, increase_gb=50)
        await notify_slack(f"Expanded volume on {alert.host}")
        return

    # Step 4: Escalate if nothing worked
    await page_oncall(f"Disk full on {alert.host}, automated remediation failed")
Guardrails
Automated remediation must have safety limits:
- Maximum actions per hour: Prevent automation from running in a loop
- Scope limits: Never auto-remediate production databases
- Approval gates: Destructive actions require human confirmation
- Audit trail: Log every automated action for post-incident review
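A minimal sketch of the first two guardrails as a gate in front of every runbook action, assuming a host object with hypothetical name and tags attributes:

import time
from collections import defaultdict, deque

MAX_ACTIONS_PER_HOUR = 5
PROTECTED_TAGS = {"production-database"}   # scope limit: never touch these
_recent_actions = defaultdict(deque)       # host name -> timestamps of actions

def allow_remediation(host) -> bool:
    """Return True if an automated action on this host is within guardrails."""
    # Scope limit: refuse to auto-remediate protected infrastructure
    if PROTECTED_TAGS & set(host.tags):
        return False
    # Rate limit: count actions in the trailing hour
    history = _recent_actions[host.name]
    cutoff = time.time() - 3600
    while history and history[0] < cutoff:
        history.popleft()
    if len(history) >= MAX_ACTIONS_PER_HOUR:
        return False  # automation is looping; escalate to a human instead
    history.append(time.time())
    return True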
Building Trust in Automation
The biggest barrier to self-healing is not technology — it is trust. Teams resist automation because they have seen bad automation make things worse.
The Progressive Trust Model
1. Alert only: Automation detects the issue and pages a human
2. Suggest action: Automation recommends a fix, a human approves
3. Act and notify: Automation fixes it and notifies the human
4. Silent healing: Automation fixes it; the human only sees the summary
Start at level 1 for every new runbook. Promote to level 4 only after dozens of successful automated interventions.
Anti-Patterns
| Anti-Pattern | Risk | Mitigation |
|---|---|---|
| Restarting everything | Cascading failures, thundering herd | Stagger restarts, use PDBs |
| Deep liveness probes | False positive restarts during dependency outages | Liveness = process, readiness = deps |
| Scaling without limits | Cost runaway, resource exhaustion | Always set maxReplicas |
| Auto-remediating unknowns | Making novel failures worse | Only automate confirmed patterns |
| No observability | Cannot distinguish healing from hiding problems | Log every automated action |
Self-healing does not make your systems reliable. It reduces the mean time to recovery for predictable failures. Unpredictable failures still require humans. The skill is knowing which category each failure belongs to — and building automation only for the former.