Runbook Engineering | The Garnet Wiki

When a pager goes off at 3 AM, the on-call engineer needs to diagnose and resolve the issue without calling the person who built the system. Runbooks bridge that gap: they capture expert knowledge in a structured, actionable format that anyone can follow. A good runbook can cut mean time to resolution (MTTR) from hours to minutes.

## Runbook Structure

# Alert: High Error Rate on Payment Service

## Severity: SEV-2
## Expected Resolution Time: 15-30 minutes

## Symptoms
- Error rate > 5% on /api/v1/payments endpoint
- PagerDuty alert: "payment-service-error-rate-high"
- User reports: "Payment failed" messages

## Quick Diagnosis

### Step 1: Check service health
```bash
kubectl get pods -n payments
kubectl top pods -n payments
```

Expected: All pods Running, CPU < 80%

- If pods are in CrashLoopBackOff → Go to “Pod Crash” section
- If pods are healthy → Continue to Step 2

### Step 2: Check downstream dependencies
```bash
# Print only the HTTP status code; the response body is discarded
curl -s -o /dev/null -w '%{http_code}\n' https://api.stripe.com/v1/charges \
  -H "Authorization: Bearer $STRIPE_KEY"
```

Expected: HTTP 200

- If Stripe is down → Go to “Stripe Outage” section
- If Stripe is healthy → Continue to Step 3

### Step 3: Check database
```bash
# Single quotes make $DB_HOST expand inside the container, not in the local shell
# (assumes DB_HOST is set in the container environment)
kubectl exec -n payments deploy/payment-service -- \
  sh -c 'pg_isready -h "$DB_HOST" -p 5432'
```

Expected: “accepting connections”

- If database unreachable → Go to “Database Outage” section
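The three steps above form a linear decision tree, and writing it down as data keeps the branch order explicit. A minimal sketch, assuming each check has already been reduced to a boolean; `DECISION_TREE`, `next_section`, and the result keys are hypothetical names, and falling through to "Escalation" when every check passes is an assumption rather than something the runbook states:

```python
from typing import Mapping

# Ordered (check key, section to jump to when that check fails) pairs,
# mirroring Steps 1-3 of the runbook. The keys are illustrative names.
DECISION_TREE = [
    ("pods_healthy", "Pod Crash"),
    ("stripe_healthy", "Stripe Outage"),
    ("database_reachable", "Database Outage"),
]


def next_section(results: Mapping[str, bool]) -> str:
    """Return the playbook for the first failing check, else escalate."""
    for key, section in DECISION_TREE:
        # results[key] raises KeyError if a check was never run,
        # so a missing result cannot be mistaken for a pass.
        if not results[key]:
            return section
    return "Escalation"
```

Because the checks are evaluated in order, a failing pod check takes priority even when later checks would also fail.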

## Resolution Playbooks

### Pod Crash

1. Get crash logs: `kubectl logs -n payments <pod> --previous`
2. Check recent deployments: `kubectl rollout history deploy/payment-service -n payments`
3. If caused by a recent deploy: `kubectl rollout undo deploy/payment-service -n payments`
4. If OOM: Increase memory limits in the deployment spec
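Whether a crash was an OOM kill can be read from the pod status instead of inferred from logs. A minimal sketch, assuming the input is the output of `kubectl get pod <pod> -n payments -o json`; the helper name `oom_killed_containers` is hypothetical:

```python
import json
from typing import List


def oom_killed_containers(pod_json: str) -> List[str]:
    """Return names of containers whose last termination reason was OOMKilled."""
    pod = json.loads(pod_json)
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return [
        s["name"]
        for s in statuses
        # lastState.terminated is only populated after a container has restarted
        if s.get("lastState", {}).get("terminated", {}).get("reason") == "OOMKilled"
    ]
```

An empty list means the crash had another cause, so the rollout history is the next place to look.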

### Stripe Outage

1. Check https://status.stripe.com
2. If Stripe is confirmed down:
   - Enable payment queue mode (requests queued for retry)
   - Post in #incidents: “Stripe outage, payments queued”
   - Monitor Stripe status for resolution
3. When Stripe recovers: Drain payment queue
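Queue mode is not specified further in this runbook, so the following only illustrates the idea: hold requests while the provider is down, then replay them in order after recovery. `PaymentQueue`, `submit`, and `drain` are hypothetical names, and a production version would likely need durable storage and idempotency keys:

```python
from collections import deque
from typing import Callable, Deque, List


class PaymentQueue:
    """In-memory stand-in for a payment queue (illustrative only)."""

    def __init__(self, charge: Callable[[dict], str]):
        self._charge = charge                  # function that calls the payment provider
        self._pending: Deque[dict] = deque()
        self.queue_mode = False                # switched on during a confirmed outage

    def submit(self, request: dict) -> str:
        """Queue the request in queue mode, otherwise charge immediately."""
        if self.queue_mode:
            self._pending.append(request)
            return "queued"
        return self._charge(request)

    def drain(self) -> List[str]:
        """Leave queue mode and replay queued requests oldest-first."""
        self.queue_mode = False
        results = []
        while self._pending:
            results.append(self._charge(self._pending.popleft()))
        return results
```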

### Database Outage

1. Check RDS console for current status
2. If failover needed: Initiate RDS failover
3. If connection exhaustion: Restart payment service pods
4. Verify connection pool health after recovery

## Escalation

- If unresolved after 30 minutes → Page payments team lead
- If customer-facing impact → Notify support team in #support-escalation
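The two escalation rules are simple enough to encode, which makes them easy to wire into an incident bot or timer. A minimal sketch; the function name `escalation_actions` is hypothetical, and treating exactly 30 minutes as already due is an assumption:

```python
from datetime import datetime, timedelta
from typing import List

ESCALATION_DEADLINE = timedelta(minutes=30)  # from the escalation rule above


def escalation_actions(started_at: datetime, now: datetime, customer_facing: bool) -> List[str]:
    """List the escalation steps that are due at time `now`."""
    actions = []
    if now - started_at >= ESCALATION_DEADLINE:
        actions.append("Page payments team lead")
    if customer_facing:
        actions.append("Notify support team in #support-escalation")
    return actions
```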


---

## Runbook Automation

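```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import Optional

import requests


@dataclass
class CheckResult:
    name: str
    passed: bool
    finding: Optional[str] = None   # description of the failure, None if passed
    confidence: float = 1.0         # assumed default; the original sets no value

    @property
    def indicates_root_cause(self) -> bool:
        # Assumption: any failed check with a finding is treated as the root cause
        return not self.passed and self.finding is not None


@dataclass
class DiagnosisResult:
    root_cause: str
    resolution: str
    confidence: float
    auto_resolvable: bool = False


class RunbookAutomation:
    """Automate diagnostic steps from runbooks."""

    def __init__(self, diagnostic_tree, resolutions, auto_resolve=()):
        self.diagnostic_tree = diagnostic_tree  # alert name -> ordered list of checks
        self.resolutions = resolutions          # finding -> resolution text
        self.auto_resolve = set(auto_resolve)   # findings that are safe to auto-fix

    def auto_diagnose(self, alert_name: str) -> DiagnosisResult:
        """Run automated diagnostic tree for a given alert."""
        for check in self.diagnostic_tree.get(alert_name, []):
            result = self.run_check(check)
            if result.indicates_root_cause:
                return DiagnosisResult(
                    root_cause=result.finding,
                    resolution=self.resolutions.get(result.finding, "Escalate to team lead"),
                    confidence=result.confidence,
                    auto_resolvable=result.finding in self.auto_resolve,
                )
        # No check identified a root cause
        return DiagnosisResult(
            root_cause="unknown",
            resolution="Escalate to team lead",
            confidence=0,
        )

    def run_check(self, check) -> CheckResult:
        """Execute a single diagnostic check."""
        if check.type == "http":
            try:
                status = requests.get(check.url, timeout=5).status_code
            except requests.RequestException:
                return CheckResult(check.name, passed=False, finding="HTTP request failed")
            passed = status == check.expected_status
            return CheckResult(check.name, passed, None if passed else f"HTTP {status}")
        if check.type == "command":
            # Accept either an argv list or a command string
            args = shlex.split(check.command) if isinstance(check.command, str) else check.command
            try:
                proc = subprocess.run(args, capture_output=True, text=True, timeout=10)
            except (OSError, subprocess.TimeoutExpired):
                return CheckResult(check.name, passed=False, finding=f"{check.name} did not run")
            passed = check.expected_output in proc.stdout
            return CheckResult(check.name, passed, None if passed else f"{check.name} failed")
        raise ValueError(f"Unknown check type: {check.type}")
```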
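The `check` objects that `run_check` reads are not defined in the listing. One way they could be declared, as a sketch: the `Check` dataclass is an assumption whose field names are taken from the attributes the code accesses, and the health-check URL is a placeholder rather than a real endpoint:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Check:
    name: str
    type: str                              # "http" or "command"
    url: Optional[str] = None              # used by "http" checks
    expected_status: int = 200
    command: Optional[List[str]] = None    # argv list, used by "command" checks
    expected_output: str = ""


# Alert name -> ordered checks, in the order the runbook walks them.
DIAGNOSTIC_TREE: Dict[str, List[Check]] = {
    "payment-service-error-rate-high": [
        Check(name="pods", type="command",
              command=["kubectl", "get", "pods", "-n", "payments"],
              expected_output="Running"),
        Check(name="service-health", type="http",
              url="https://payments.example.internal/healthz"),  # placeholder URL
    ],
}
```

Keeping the tree as plain data means a runbook change and its automation can be reviewed in the same diff. The substring match on "Running" is deliberately crude and would pass if any single pod is running.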

## Anti-Patterns

| Anti-Pattern | Consequence | Fix |
| --- | --- | --- |
| No runbooks | On-call depends on tribal knowledge | Write runbooks for every alert |
| Outdated runbooks | Steps fail, engineer loses trust | Review runbooks quarterly |
| Too much prose, not enough commands | Slow to scan at 3 AM | Copy-paste commands, decision trees |
| No escalation path | Engineer stuck, incident drags on | Clear escalation with contact info |
| Runbooks only in wiki | Hard to find during incidents | Link runbook URL directly in alert |

Every alert should link to its runbook. Every runbook should be tested by someone who didn’t write it. If a new engineer cannot resolve the alert by following the runbook, the runbook is incomplete.
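Two of the fixes in the table, linking a runbook from every alert and reviewing runbooks quarterly, can be checked mechanically. A minimal sketch, assuming alert rules are available as dictionaries with an `annotations` mapping and that a last-reviewed date is recorded per runbook URL; the `runbook_url` key follows a common Prometheus convention, but the schema and the `lint_alerts` name are assumptions:

```python
from datetime import date, timedelta
from typing import Dict, List

REVIEW_INTERVAL = timedelta(days=90)  # "quarterly", approximated as 90 days


def lint_alerts(alerts: List[dict], last_reviewed: Dict[str, date], today: date) -> List[str]:
    """Return one message per alert whose runbook is missing or overdue for review."""
    problems = []
    for alert in alerts:
        name = alert["name"]
        url = alert.get("annotations", {}).get("runbook_url")
        if not url:
            problems.append(f"{name}: no runbook linked")
        elif url not in last_reviewed:
            problems.append(f"{name}: runbook has no review date")
        elif today - last_reviewed[url] > REVIEW_INTERVAL:
            problems.append(f"{name}: runbook review overdue")
    return problems
```

Run in CI against the alert definitions, a check like this would fail the build when an alert ships without a runbook. It cannot verify that someone other than the author has tested the runbook, so that still needs a human process.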