
# Runbook Engineering

Write operational runbooks that enable anyone to respond to incidents. Covers runbook structure, decision trees, automated diagnostics, escalation paths, pre-computed resolution steps, and the patterns that reduce MTTR by making expert knowledge accessible to everyone on call.

When a pager goes off at 3 AM, the on-call engineer needs to diagnose and resolve the issue without calling the person who built the system. Runbooks bridge that gap: they capture expert knowledge in a structured, actionable format that anyone can follow. A good runbook can cut mean time to resolution (MTTR) from hours to minutes.


## Runbook Structure

# Alert: High Error Rate on Payment Service

## Severity: SEV-2
## Expected Resolution Time: 15-30 minutes

## Symptoms
- Error rate > 5% on /api/v1/payments endpoint
- PagerDuty alert: "payment-service-error-rate-high"
- User reports: "Payment failed" messages

## Quick Diagnosis

### Step 1: Check service health
```bash
kubectl get pods -n payments
kubectl top pods -n payments
```

Expected: All pods Running, CPU < 80%

- If pods are in CrashLoopBackOff → Go to "Pod Crash" section
- If pods are healthy → Continue to Step 2

### Step 2: Check downstream dependencies
```bash
# Print only the HTTP status code so it can be compared with the expected value
curl -s -o /dev/null -w "%{http_code}\n" https://api.stripe.com/v1/charges \
  -H "Authorization: Bearer $STRIPE_KEY"
```

Expected: 200

- If Stripe is down → Go to "Stripe Outage" section
- If Stripe is healthy → Continue to Step 3

### Step 3: Check database
```bash
# Single quotes make $DB_HOST expand inside the pod (assumes it is set there)
kubectl exec -n payments deploy/payment-service -- \
  sh -c 'pg_isready -h "$DB_HOST" -p 5432'
```

Expected: "accepting connections"

- If database unreachable → Go to "Database Outage" section
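The three diagnostic steps above form a linear decision tree: run each check in order and jump to the matching playbook at the first failure. A minimal Python sketch of that control flow follows. The `Step` type, the `diagnose` function, and the stubbed check callables are illustrative assumptions, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One diagnostic step: a check plus the playbook to open if it fails."""
    name: str
    check: Callable[[], bool]   # returns True when the component looks healthy
    playbook_on_failure: str    # runbook section to jump to


def diagnose(steps: List[Step]) -> Optional[str]:
    """Return the playbook for the first failing step, or None if all pass."""
    for step in steps:
        if not step.check():
            return step.playbook_on_failure
    return None


# Stubbed checks stand in for the kubectl, curl, and pg_isready commands.
steps = [
    Step("service health", lambda: True, "Pod Crash"),
    Step("downstream dependencies", lambda: False, "Stripe Outage"),
    Step("database", lambda: True, "Database Outage"),
]

print(diagnose(steps))  # Stripe Outage
```

When every step passes, `diagnose` returns `None`, which the on-call engineer can treat as a signal that the cause lies outside this runbook.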

## Resolution Playbooks

### Pod Crash

1. Get crash logs: `kubectl logs -n payments <pod> --previous`
2. Check recent deployments: `kubectl rollout history deploy/payment-service -n payments`
3. If caused by a recent deploy: `kubectl rollout undo deploy/payment-service -n payments`
4. If OOM: Increase memory limits in deployment spec
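Steps 3 and 4 hinge on telling an out-of-memory kill apart from a crash introduced by a deploy. Kubernetes records the reason for a container's last termination under `status.containerStatuses[].lastState.terminated.reason`, and reports `OOMKilled` when the memory limit was exceeded. A sketch, assuming the pod description was fetched with `kubectl get pod <pod> -n payments -o json`; the `classify_crash` helper and the sample data are hypothetical.

```python
import json


def classify_crash(pod: dict) -> str:
    """Suggest a playbook step from a pod's last container termination."""
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated is None:
            continue  # this container has no recorded previous termination
        if terminated.get("reason") == "OOMKilled":
            return "increase memory limits"
        return "check recent deploys and consider rollback"
    return "no previous termination recorded"


# Trimmed sample in the shape `kubectl get pod -o json` returns (made-up values).
sample = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "payment-service",
        "restartCount": 4,
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(classify_crash(sample))  # increase memory limits
```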

### Stripe Outage

1. Check https://status.stripe.com
2. If Stripe is confirmed down:
   - Enable payment queue mode (requests queued for retry)
   - Post in #incidents: "Stripe outage, payments queued"
   - Monitor Stripe status for resolution
3. When Stripe recovers: Drain payment queue
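The runbook does not say how payment queue mode is implemented. One possible shape, shown purely as an illustration, is an in-memory buffer that holds requests while the provider is down and replays them once it recovers. `PaymentQueue`, `submit`, and `drain` are hypothetical names; a production version would need durable storage and idempotency keys so that retried payments are not charged twice.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class PaymentQueue:
    """Buffer payment requests during an outage and replay them afterwards."""

    def __init__(self, send: Callable[[Dict], bool]) -> None:
        self._send = send                      # returns True when the provider accepts
        self._pending: Deque[Dict] = deque()
        self.queue_mode = False

    def submit(self, payment: Dict) -> str:
        """Send immediately, or queue when queue mode is on or the send fails."""
        if self.queue_mode or not self._send(payment):
            self._pending.append(payment)
            return "queued"
        return "sent"

    def drain(self) -> List[Dict]:
        """Replay queued payments in order; stop at the first failure."""
        self.queue_mode = False
        sent = []
        while self._pending:
            if not self._send(self._pending[0]):
                self.queue_mode = True         # provider still unhealthy, keep the rest
                break
            sent.append(self._pending.popleft())
        return sent


# Simulated provider: flip state["up"] to model outage and recovery.
state = {"up": False}
queue = PaymentQueue(send=lambda payment: state["up"])
queue.queue_mode = True
print(queue.submit({"id": "pay_1"}))  # queued
state["up"] = True
print(len(queue.drain()))             # 1
```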

### Database Outage

1. Check RDS console for current status
2. If failover needed: Initiate RDS failover
3. If connection exhaustion: Restart payment service pods
4. Verify connection pool health after recovery

## Escalation

- If unresolved after 30 minutes → Page payments team lead
- If customer-facing impact → Notify support team in #support-escalation
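The two escalation rules are simple enough to encode, so that a bot or incident tool can prompt the on-call engineer instead of relying on memory. A sketch, assuming the incident start time and a customer-impact flag are available; the function name and the action strings are illustrative.

```python
from datetime import datetime, timedelta
from typing import List

ESCALATION_THRESHOLD = timedelta(minutes=30)  # from the runbook's escalation rule


def escalation_actions(started_at: datetime, now: datetime,
                       resolved: bool, customer_facing: bool) -> List[str]:
    """Return the escalation actions that are due for an incident."""
    actions = []
    if not resolved and now - started_at >= ESCALATION_THRESHOLD:
        actions.append("Page payments team lead")
    if customer_facing:
        actions.append("Notify support team in #support-escalation")
    return actions


start = datetime(2024, 1, 1, 3, 0)
print(escalation_actions(start, start + timedelta(minutes=45),
                         resolved=False, customer_facing=True))
# ['Page payments team lead', 'Notify support team in #support-escalation']
```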

---

## Runbook Automation

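```python
import subprocess
from dataclasses import dataclass
from typing import Optional

import requests

ESCALATE = "Escalate to team lead"


# Minimal assumed definitions of the result types the class relies on.
@dataclass
class CheckResult:
    name: str
    passed: bool
    finding: Optional[str] = None  # description of the failure, None if passed
    confidence: float = 1.0

    @property
    def indicates_root_cause(self) -> bool:
        # Simplification: the first failing check is treated as the root cause.
        return not self.passed


@dataclass
class DiagnosisResult:
    root_cause: str
    resolution: str
    confidence: float
    auto_resolvable: bool = False


class RunbookAutomation:
    """Automate diagnostic steps from runbooks."""

    def __init__(self, diagnostic_tree, resolutions, auto_resolve=()):
        self.diagnostic_tree = diagnostic_tree  # alert name -> ordered list of checks
        self.resolutions = resolutions          # finding -> resolution text
        self.auto_resolve = set(auto_resolve)   # findings that are safe to auto-fix

    def auto_diagnose(self, alert_name: str) -> DiagnosisResult:
        """Run the automated diagnostic tree for a given alert."""
        # Alerts without a tree fall through to escalation instead of raising KeyError.
        for check in self.diagnostic_tree.get(alert_name, []):
            result = self.run_check(check)
            if result.indicates_root_cause:
                return DiagnosisResult(
                    root_cause=result.finding,
                    resolution=self.resolutions.get(result.finding, ESCALATE),
                    confidence=result.confidence,
                    auto_resolvable=result.finding in self.auto_resolve,
                )
        return DiagnosisResult(root_cause="unknown", resolution=ESCALATE, confidence=0)

    def run_check(self, check) -> CheckResult:
        """Execute a single diagnostic check."""
        if check.type == "http":
            try:
                status = requests.get(check.url, timeout=5).status_code
            except requests.RequestException as exc:
                # A timeout or connection error counts as a failed check.
                return CheckResult(check.name, False, f"{check.name}: {type(exc).__name__}")
            passed = status == check.expected_status
            return CheckResult(check.name, passed, None if passed else f"HTTP {status}")
        if check.type == "command":
            # check.command is an argument list, e.g. ["kubectl", "get", "pods"].
            try:
                proc = subprocess.run(check.command, capture_output=True, text=True, timeout=10)
            except (OSError, subprocess.TimeoutExpired) as exc:
                return CheckResult(check.name, False, f"{check.name}: {type(exc).__name__}")
            passed = check.expected_output in proc.stdout
            return CheckResult(check.name, passed, None if passed else f"{check.name} failed")
        raise ValueError(f"Unknown check type: {check.type!r}")
```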
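The class reads `name`, `type`, `url`, `expected_status`, `command`, and `expected_output` from each check object but does not show how checks are defined. One way to express the payment-service diagnosis as data is sketched below. The `Check` dataclass, the check names, and the dictionary layout are assumptions for illustration; the command is written as an argument list because `subprocess.run` without `shell=True` expects one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Check:
    """Declarative description of one diagnostic check."""
    name: str
    type: str                              # "http" or "command"
    url: Optional[str] = None              # used by "http" checks
    expected_status: int = 200
    command: Optional[List[str]] = None    # used by "command" checks
    expected_output: str = ""


# Alert name -> ordered checks, mirroring the runbook's diagnosis steps.
DIAGNOSTIC_TREE: Dict[str, List[Check]] = {
    "payment-service-error-rate-high": [
        Check(
            name="stripe-status",
            type="http",
            url="https://status.stripe.com",
        ),
        Check(
            name="database",
            type="command",
            command=["kubectl", "exec", "-n", "payments", "deploy/payment-service",
                     "--", "sh", "-c", 'pg_isready -h "$DB_HOST" -p 5432'],
            expected_output="accepting connections",
        ),
    ],
}

for check in DIAGNOSTIC_TREE["payment-service-error-rate-high"]:
    print(check.name, check.type)
```

Keeping checks as data rather than code means a new alert can be covered by adding an entry to the dictionary, without touching the automation class.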

## Anti-Patterns

| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No runbooks | On-call depends on tribal knowledge | Write runbooks for every alert |
| Outdated runbooks | Steps fail, engineer loses trust | Review runbooks quarterly |
| Too much prose, not enough commands | Slow to scan at 3 AM | Copy-paste commands, decision trees |
| No escalation path | Engineer stuck, incident drags on | Clear escalation with contact info |
| Runbooks only in wiki | Hard to find during incidents | Link runbook URL directly in alert |

Every alert should link to its runbook. Every runbook should be tested by someone who didn’t write it. If a new engineer cannot resolve the alert by following the runbook, the runbook is incomplete.
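The first rule, together with the quarterly review from the table, can be enforced mechanically in CI. The sketch below assumes alert rules are available as dictionaries with an `alert` name and an `annotations` mapping (the shape Prometheus alerting rules use) and that the team records a last-review date per runbook. The `runbook_url` annotation key, the 90-day window, the example URLs, and the function name are assumptions.

```python
from datetime import date, timedelta
from typing import Dict, List

REVIEW_WINDOW = timedelta(days=90)  # "quarterly", approximated as 90 days


def lint_alerts(alerts: List[dict], last_reviewed: Dict[str, date],
                today: date) -> List[str]:
    """Return one problem string per alert with a missing or stale runbook."""
    problems = []
    for alert in alerts:
        name = alert["alert"]
        url = alert.get("annotations", {}).get("runbook_url")
        if not url:
            problems.append(f"{name}: no runbook linked")
            continue
        reviewed = last_reviewed.get(url)
        if reviewed is None or today - reviewed > REVIEW_WINDOW:
            problems.append(f"{name}: runbook not reviewed in the last 90 days")
    return problems


alerts = [
    {"alert": "payment-service-error-rate-high",
     "annotations": {"runbook_url": "https://wiki.example.com/runbooks/payments"}},
    {"alert": "checkout-latency-high", "annotations": {}},
]
reviews = {"https://wiki.example.com/runbooks/payments": date(2024, 1, 10)}

for problem in lint_alerts(alerts, reviews, today=date(2024, 6, 1)):
    print(problem)
# payment-service-error-rate-high: runbook not reviewed in the last 90 days
# checkout-latency-high: no runbook linked
```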

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
