Runbook Engineering
Write operational runbooks that enable anyone to respond to incidents. Covers runbook structure, decision trees, automated diagnostics, escalation paths, pre-computed resolution steps, and the patterns that reduce MTTR by making expert knowledge accessible to everyone on call.
When a pager goes off at 3 AM, the on-call engineer needs to diagnose and resolve the issue without calling the person who built the system. Runbooks bridge that gap: they capture expert knowledge in a structured, actionable format that anyone on call can follow. A good runbook can cut mean time to resolution (MTTR) from hours to minutes.
## Runbook Structure
# Alert: High Error Rate on Payment Service
## Severity: SEV-2
## Expected Resolution Time: 15-30 minutes
## Symptoms
- Error rate > 5% on /api/v1/payments endpoint
- PagerDuty alert: "payment-service-error-rate-high"
- User reports: "Payment failed" messages
## Quick Diagnosis
### Step 1: Check service health
```bash
kubectl get pods -n payments
kubectl top pods -n payments
```
Expected: All pods Running, CPU < 80%
- If pods are CrashLoopBackOff → Go to “Pod Crash” section
- If pods are healthy → Continue to Step 2

### Step 2: Check downstream dependencies
```bash
# Print only the HTTP status code so it can be compared with the expected value
curl -s -o /dev/null -w "%{http_code}\n" https://api.stripe.com/v1/charges -H "Authorization: Bearer $STRIPE_KEY"
```
Expected: HTTP 200
- If Stripe is down → Go to “Stripe Outage” section
- If Stripe is healthy → Continue to Step 3

### Step 3: Check database
```bash
kubectl exec -n payments deploy/payment-service -- \
  pg_isready -h $DB_HOST -p 5432
```
Expected: “accepting connections”
- If database unreachable → Go to “Database Outage” section
## Resolution Playbooks

### Pod Crash
- Get crash logs: `kubectl logs -n payments <pod> --previous`
- Check recent deployments: `kubectl rollout history deploy/payment-service -n payments`
- If caused by recent deploy: `kubectl rollout undo deploy/payment-service -n payments`
- If OOM: Increase memory limits in deployment spec

### Stripe Outage
- Check https://status.stripe.com
- If Stripe confirmed down:
  - Enable payment queue mode (requests queued for retry)
  - Post in #incidents: “Stripe outage, payments queued”
  - Monitor Stripe status for resolution
- When Stripe recovers: Drain payment queue

### Database Outage
- Check RDS console for current status
- If failover needed: Initiate RDS failover
- If connection exhaustion: Restart payment service pods
- Verify connection pool health after recovery

## Escalation
- If unresolved after 30 minutes → Page payments team lead
- If customer-facing impact → Notify support team in #support-escalation
---
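The Quick Diagnosis steps in the example runbook form a linear decision tree: each step either branches to a playbook or falls through to the next step. One way to make that structure explicit is to encode it as data, roughly as sketched below. The `Step` type, the observation keys, and the `next_action` helper are illustrative names rather than part of any tool, and falling back to the Escalation section when every check passes is an assumption; only the step and playbook names come from the runbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    observation: str       # key in the observations dict (hypothetical naming)
    failure_playbook: str  # runbook section to jump to when the check fails


# Order mirrors Steps 1-3 of the example runbook.
QUICK_DIAGNOSIS = [
    Step("Check service health", "pods_healthy", "Pod Crash"),
    Step("Check downstream dependencies", "stripe_healthy", "Stripe Outage"),
    Step("Check database", "database_reachable", "Database Outage"),
]


def next_action(observations: dict) -> str:
    """Return the playbook for the first failed check, else the fallback section."""
    for step in QUICK_DIAGNOSIS:
        if not observations[step.observation]:
            return step.failure_playbook
    # Assumption: when no check fails, the responder falls back to escalation.
    return "Escalation"


print(next_action({
    "pods_healthy": True,
    "stripe_healthy": False,
    "database_reachable": True,
}))  # prints "Stripe Outage"
```

Keeping the tree as an ordered list means the written runbook and any automation can be reviewed side by side, step for step.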
## Runbook Automation
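```python
import subprocess
from dataclasses import dataclass
from typing import Optional

import requests


@dataclass
class CheckResult:
    name: str
    passed: bool
    finding: Optional[str] = None  # set only when the check fails
    confidence: float = 1.0        # assumed default; tune per check

    @property
    def indicates_root_cause(self) -> bool:
        # Assumption: the first failed check is treated as the root cause.
        return not self.passed and self.finding is not None


@dataclass
class DiagnosisResult:
    root_cause: str
    resolution: str
    confidence: float
    auto_resolvable: bool = False


class RunbookAutomation:
    """Automate diagnostic steps from runbooks."""

    def __init__(self, diagnostic_tree, resolutions, auto_resolve=()):
        self.diagnostic_tree = diagnostic_tree  # alert name -> ordered list of checks
        self.resolutions = resolutions          # finding -> resolution text
        self.auto_resolve = set(auto_resolve)   # findings safe to fix automatically

    def auto_diagnose(self, alert_name: str) -> DiagnosisResult:
        """Run the diagnostic tree for an alert; stop at the first failing check."""
        for check in self.diagnostic_tree.get(alert_name, []):
            result = self.run_check(check)
            if result.indicates_root_cause:
                return DiagnosisResult(
                    root_cause=result.finding,
                    resolution=self.resolutions.get(result.finding, "Escalate to team lead"),
                    confidence=result.confidence,
                    auto_resolvable=result.finding in self.auto_resolve,
                )
        return DiagnosisResult(root_cause="unknown", resolution="Escalate to team lead", confidence=0)

    def run_check(self, check) -> CheckResult:
        """Execute a single diagnostic check."""
        if check.type == "http":
            try:
                status = requests.get(check.url, timeout=5).status_code
            except requests.RequestException as exc:
                return CheckResult(name=check.name, passed=False, finding=f"HTTP request failed: {exc}")
            passed = status == check.expected_status
            return CheckResult(name=check.name, passed=passed, finding=None if passed else f"HTTP {status}")
        if check.type == "command":
            # check.command is expected to be an argument list, not a shell string
            proc = subprocess.run(check.command, capture_output=True, text=True, timeout=10)
            passed = check.expected_output in proc.stdout
            return CheckResult(name=check.name, passed=passed, finding=None if passed else f"{check.name} failed")
        raise ValueError(f"Unknown check type: {check.type}")
```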
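The class above reads from `diagnostic_tree`, `resolutions`, and `auto_resolve` without showing how they are populated. A minimal sketch of one possible layout follows, assuming plain dictionaries keyed by alert name and by finding. The `Check` dataclass, the host names, the finding strings, and the `validate` helper are hypothetical placeholders; only the alert name and the attribute names used by `run_check` come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Check:
    name: str
    type: str                                # "http" or "command", as in run_check
    url: Optional[str] = None
    expected_status: int = 200
    command: Optional[Sequence[str]] = None  # argument list for subprocess.run
    expected_output: str = ""


diagnostic_tree = {
    "payment-service-error-rate-high": [
        Check(name="service-health", type="http",
              url="https://payments.example.internal/healthz"),  # placeholder URL
        Check(name="database", type="command",
              command=["pg_isready", "-h", "db.example.internal", "-p", "5432"],
              expected_output="accepting connections"),
    ],
}

# Illustrative mappings from a finding to the runbook section that handles it.
resolutions = {
    "HTTP 503": "Follow the Pod Crash playbook",
    "database failed": "Follow the Database Outage playbook",
}
auto_resolve = set()  # start empty; add findings only once their fix is proven safe


def validate(tree: dict) -> list:
    """Return a list of problems found in a diagnostic tree definition."""
    problems = []
    for alert, checks in tree.items():
        for check in checks:
            if check.type == "http" and not check.url:
                problems.append(f"{alert}/{check.name}: http check needs a url")
            elif check.type == "command" and not check.command:
                problems.append(f"{alert}/{check.name}: command check needs a command")
            elif check.type not in ("http", "command"):
                problems.append(f"{alert}/{check.name}: unknown type {check.type!r}")
    return problems


print(validate(diagnostic_tree))  # prints []
```

Validating the definitions when they are loaded, rather than during an incident, keeps a malformed check from surfacing for the first time at 3 AM.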
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No runbooks | On-call depends on tribal knowledge | Write runbooks for every alert |
| Outdated runbooks | Steps fail, engineer loses trust | Review runbooks quarterly |
| Too much prose, not enough commands | Slow to scan at 3 AM | Provide copy-paste commands and decision trees |
| No escalation path | Engineer stuck, incident drags on | Define a clear escalation path with contact info |
| Runbooks only in wiki | Hard to find during incidents | Link runbook URL directly in alert |
Every alert should link to its runbook. Every runbook should be tested by someone who didn’t write it. If a new engineer cannot resolve the alert by following the runbook, the runbook is incomplete.
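The linking rule and the quarterly review from the table can be checked automatically. The sketch below assumes alert definitions are available as dictionaries with an `annotations` mapping. The `runbook_url` key mirrors a common Prometheus annotation convention but is an assumption here, and the `last_reviewed` field, the 90-day window standing in for "quarterly", the second alert name, and the wiki URL are all hypothetical.

```python
from datetime import date, timedelta

REVIEW_WINDOW = timedelta(days=90)  # assumed approximation of "quarterly"


def lint_alerts(alerts, today):
    """Return human-readable problems: missing runbook links and overdue reviews."""
    problems = []
    for alert in alerts:
        annotations = alert.get("annotations", {})
        if not annotations.get("runbook_url"):
            problems.append(f"{alert['name']}: no runbook linked")
            continue
        reviewed = annotations.get("last_reviewed")  # ISO date string, hypothetical field
        if reviewed is None or today - date.fromisoformat(reviewed) > REVIEW_WINDOW:
            problems.append(f"{alert['name']}: runbook review overdue")
    return problems


alerts = [
    {"name": "payment-service-error-rate-high",
     "annotations": {"runbook_url": "https://wiki.example.com/runbooks/payments",
                     "last_reviewed": "2024-01-10"}},
    {"name": "checkout-latency-high", "annotations": {}},  # hypothetical alert
]

for problem in lint_alerts(alerts, today=date(2024, 6, 1)):
    print(problem)
# prints:
# payment-service-error-rate-high: runbook review overdue
# checkout-latency-high: no runbook linked
```

Running a check like this in CI for the alerting configuration would catch an unlinked or stale runbook before the alert ever fires.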