Site Reliability Engineering Runbook Framework
How to build production SRE runbooks. Covers incident response procedures, SLO-based alerting, error budget policies, and operational playbooks for common failure modes.
Runbooks are the difference between a 5-minute incident and a 5-hour outage. When your production system fails at 3 AM, the on-call engineer shouldn’t be debugging from scratch — they should be following a tested, documented procedure that leads to resolution. Every minute spent searching for context during an incident is a minute of customer impact.
Yet most organizations treat runbooks as an afterthought. They’re outdated Word documents in a forgotten SharePoint folder, written by someone who left the company two years ago. Production-grade SRE requires living runbooks that are maintained alongside the code they support.
Runbook Structure
Every runbook follows the same skeleton:
# [Service Name] — [Failure Mode]
## Severity: P1/P2/P3/P4
## Symptoms
- What the alert looks like
- What users are experiencing
- What dashboards show
## Immediate Actions (First 5 Minutes)
1. Step-by-step diagnostic commands
2. Common quick fixes
3. Escalation criteria
## Root Cause Investigation
- Diagnostic queries and commands
- Log locations and search patterns
- Common root causes in order of likelihood
## Resolution Steps
- Detailed fix procedures for each root cause
- Rollback instructions
- Verification steps
## Post-Incident
- What to check after resolution
- Metrics to monitor for recurrence
- Follow-up actions
SLO-Based Alerting
Traditional alerting (CPU > 90%, disk > 80%) generates noise. SLO-based alerting tells you when users are actually impacted.
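In practice, "SLO-based" means alerting on error budget burn rate rather than raw resource metrics. A minimal sketch, assuming you can pull failed and total request counts for the last hour from your metrics store (the SLO value, request counts, and the 14x threshold are illustrative):

```python
# Burn-rate check sketch: page on budget consumption speed, not on CPU/disk.
SLO = 0.999          # 99.9% success-rate target
BUDGET = 1 - SLO     # allowed failure ratio over the SLO period

def burn_rate(failed: int, total: int) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    if total == 0:
        return 0.0
    return (failed / total) / BUDGET

# Common heuristic: a 1-hour burn rate above ~14x consumes roughly 2% of a
# 30-day budget in a single hour, which is worth paging a human for.
if burn_rate(failed=2_000, total=120_000) > 14:
    print("PAGE: error budget burning fast, users are impacted")
else:
    print("OK: within budget for this window")
```

Running the same calculation over a longer window with a lower threshold (for example, 6 hours) catches slow burns that a CPU alert would never surface.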
Defining SLOs
| Service | SLI | SLO | Error Budget (30 days) |
|---|---|---|---|
| API Gateway | Request success rate | 99.9% | 43.2 minutes downtime |
| Search | P99 latency | < 500ms | 1% of requests can be slow |
| Payments | Transaction success rate | 99.99% | 4.3 minutes downtime |
| Dashboard | Page load time | < 3s (P95) | 5% of loads can be slow |
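The last column is straightforward arithmetic: a 30-day window has 43,200 minutes, so the budget in minutes is (1 - SLO) x 43,200. A quick sketch of the conversion:

```python
# Convert an availability SLO into a 30-day error budget, expressed in minutes.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the window

def budget_minutes(slo: float) -> float:
    """Minutes of full downtime allowed per 30 days at a given availability SLO."""
    return (1 - slo) * MINUTES_PER_30_DAYS

print(round(budget_minutes(0.999), 2))   # 43.2  (API Gateway row)
print(round(budget_minutes(0.9999), 2))  # 4.32  (Payments row)
```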
Error Budget Policy
The error budget is your innovation currency. When the budget is healthy, ship aggressively. When it’s depleted, freeze deployments and fix reliability.
- Budget remaining > 50%: Ship normally, deploy daily
- Budget remaining 25-50%: Slow deployments, extra review
- Budget remaining 10-25%: Critical fixes only, no new features
- Budget remaining < 10%: Deployment freeze, all hands on reliability
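To keep the policy from becoming a mid-incident negotiation, the tiers above can be encoded directly in deploy tooling. A minimal sketch (the function name and return values are illustrative, not any specific tool's API):

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map remaining error budget (fraction 0.0-1.0) to the posture defined above."""
    if budget_remaining > 0.50:
        return "normal"          # ship normally, deploy daily
    if budget_remaining > 0.25:
        return "slow"            # slow deployments, extra review
    if budget_remaining > 0.10:
        return "critical-only"   # critical fixes only, no new features
    return "freeze"              # deployment freeze, all hands on reliability

assert deployment_policy(0.62) == "normal"
assert deployment_policy(0.08) == "freeze"
```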
Incident Response Procedure
Phase 1: Detect & Triage (0-5 minutes)
- Acknowledge the alert
- Assess impact scope (how many users, which regions)
- Assign severity level
- Open incident channel (#inc-YYYYMMDD-brief-description)
- Post initial status update
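A small helper that produces the channel name removes one decision from a stressed responder. A sketch, following the #inc-YYYYMMDD-brief-description convention above (the slug rules and length cap are assumptions; adjust to your chat tool's limits):

```python
import re
from datetime import datetime, timezone

def incident_channel(description: str) -> str:
    """Build a channel name in the #inc-YYYYMMDD-brief-description format."""
    date = datetime.now(timezone.utc).strftime("%Y%m%d")
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")[:40]
    return f"#inc-{date}-{slug}"

print(incident_channel("API gateway 5xx spike"))  # e.g. #inc-20250301-api-gateway-5xx-spike
```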
Phase 2: Mitigate (5-30 minutes)
- Apply runbook immediate actions
- If no runbook exists: check recent deployments, check dependent services
- Communicate every 15 minutes, even if no progress
- Escalate if mitigation isn’t working by minute 20
Phase 3: Resolve (30+ minutes)
- Root cause investigation
- Implement fix or rollback
- Verify resolution with metrics
- Stand down incident
Phase 4: Follow-up (Next business day)
- Write incident report (blameless)
- Identify prevention actions
- Update or create runbooks
- Track action items to completion
Common Failure Mode Playbooks
Database Connection Pool Exhaustion
Symptoms: Increasing error rate, connection timeout errors, slow queries
Immediate Actions:
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Find long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND (now() - pg_stat_activity.query_start) > interval '5 minutes';
-- Kill problematic queries (anything non-idle running longer than 5 minutes)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND (now() - query_start) > interval '5 minutes'
  AND pid <> pg_backend_pid();
Root Causes (in order of likelihood):
- Connection leak in application code (missing connection.close(); see the sketch after this list)
- Spike in traffic beyond pool capacity
- Deadlock causing connections to hang
- Database failover not updating connection strings
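The connection-leak case is worth showing concretely, because the leaky and safe versions look nearly identical in code review. A minimal sketch, assuming psycopg2 as the driver; the DSN and the orders table are placeholders:

```python
import psycopg2

DSN = "dbname=app user=app host=db.internal"  # placeholder connection string

# Leaky: if execute() raises, close() is never reached and the connection
# stays checked out until the pool (or Postgres) runs out.
def order_count_leaky() -> int:
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM orders")  # 'orders' is a placeholder table
    count = cur.fetchone()[0]
    conn.close()
    return count

# Safe: the connection is released on every path, including exceptions.
def order_count_safe() -> int:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM orders")
            return cur.fetchone()[0]
    finally:
        conn.close()
```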
Memory Pressure / OOM Kills
Symptoms: Pod restarts, exit code 137 (container OOM-killed), increasing memory usage trend
Immediate Actions:
# Check OOM events
dmesg | grep -i "out of memory" | tail -20
# Current memory usage by process
ps aux --sort=-%mem | head -20
# Kubernetes: check resource limits
kubectl describe pod <pod-name> | grep -A5 "Limits\|Requests"
Root Causes:
- Memory leak (gradual increase over hours/days)
- Cache unbounded growth (see the bounded-cache sketch after this list)
- Large payload processing without streaming
- Resource limits set too low for workload
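Unbounded cache growth is the easiest of these to catch in review. A minimal sketch of the failure pattern and a bounded alternative, assuming a simple in-process cache:

```python
from collections import OrderedDict

# Unbounded: every unique key stays resident until the process is OOM-killed.
unbounded_cache = {}

def lookup_unbounded(key, load):
    if key not in unbounded_cache:
        unbounded_cache[key] = load(key)
    return unbounded_cache[key]

# Bounded: a simple LRU with a hard cap. functools.lru_cache(maxsize=...) is the
# stdlib equivalent when the cached function is pure.
class LRUCache:
    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, load):
        if key in self._data:
            self._data.move_to_end(key)      # mark as recently used
            return self._data[key]
        value = load(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict the least recently used entry
        return value
```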
Runbook Maintenance Protocol
Dead runbooks are worse than no runbooks — they create false confidence.
- Review quarterly: Every runbook gets a scheduled review at least once a quarter, and sooner when its service changes
- Test twice a year: Actually execute the runbook steps in staging
- Update after incidents: Every incident that uses a runbook should update it
- Assign ownership: Every runbook has a team owner, not an individual owner
- Automate where possible: If a runbook step is always the same, script it
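On the last point, even a thin wrapper around diagnostics already in this document pays off by removing copy-paste errors under pressure. A rough sketch, reusing the OOM playbook commands above (the check list and output handling are assumptions):

```python
import subprocess

# Diagnostic commands lifted from the OOM playbook above.
CHECKS = {
    "oom_events": "dmesg | grep -i 'out of memory' | tail -20",
    "top_memory_processes": "ps aux --sort=-%mem | head -20",
}

def run_checks() -> None:
    for name, command in CHECKS.items():
        result = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
        print(f"=== {name} ===")
        print(result.stdout or result.stderr or "(no output)")

if __name__ == "__main__":
    run_checks()
```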
The gold standard: runbook steps that can be executed by a junior engineer at 3 AM with minimal prior context. If it requires tribal knowledge, it’s not a runbook — it’s a wish.