Site Reliability Engineering Runbook Framework
How to build production SRE runbooks. Covers incident response procedures, SLO-based alerting, error budget policies, and operational playbooks for common failure modes.
Runbooks are the difference between a 5-minute incident and a 5-hour outage. When your production system fails at 3 AM, the on-call engineer shouldn’t be debugging from scratch — they should be following a tested, documented procedure that leads to resolution. Every minute spent searching for context during an incident is a minute of customer impact.
Yet most organizations treat runbooks as an afterthought. They’re outdated Word documents in a forgotten SharePoint folder, written by someone who left the company two years ago. Production-grade SRE requires living runbooks that are maintained alongside the code they support.
Runbook Structure
Every runbook follows the same skeleton:
# [Service Name] — [Failure Mode]
## Severity: P1/P2/P3/P4
## Symptoms
- What the alert looks like
- What users are experiencing
- What dashboards show
## Immediate Actions (First 5 Minutes)
1. Step-by-step diagnostic commands
2. Common quick fixes
3. Escalation criteria
## Root Cause Investigation
- Diagnostic queries and commands
- Log locations and search patterns
- Common root causes in order of likelihood
## Resolution Steps
- Detailed fix procedures for each root cause
- Rollback instructions
- Verification steps
## Post-Incident
- What to check after resolution
- Metrics to monitor for recurrence
- Follow-up actions
SLO-Based Alerting
Traditional alerting (CPU > 90%, disk > 80%) generates noise. SLO-based alerting tells you when users are actually impacted.
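In practice, "SLO-based" means alerting on error budget burn rate rather than raw resource metrics. A minimal sketch, assuming you can pull failed and total request counts for the last hour from your metrics store (the SLO value, request counts, and the 14x threshold are illustrative):

```python
# Burn-rate check sketch: page on budget consumption speed, not on CPU/disk.
SLO = 0.999          # 99.9% success-rate target
BUDGET = 1 - SLO     # allowed failure ratio over the SLO period

def burn_rate(failed: int, total: int) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    if total == 0:
        return 0.0
    return (failed / total) / BUDGET

# Common heuristic: a 1-hour burn rate above ~14x consumes roughly 2% of a
# 30-day budget in a single hour, which is worth paging a human for.
if burn_rate(failed=2_000, total=120_000) > 14:
    print("PAGE: error budget burning fast, users are impacted")
else:
    print("OK: within budget for this window")
```

Running the same calculation over a longer window with a lower threshold (for example, 6 hours) catches slow burns that a CPU alert would never surface.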
Defining SLOs
| Service | SLI | SLO | Error Budget (30 days) |
|---|---|---|---|
| API Gateway | Request success rate | 99.9% | 43.2 minutes downtime |
| Search | P99 latency | < 500ms | 1% of requests can be slow |
| Payments | Transaction success rate | 99.99% | 4.3 minutes downtime |
| Dashboard | Page load time | < 3s (P95) | 5% of loads can be slow |
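The last column is straightforward arithmetic: a 30-day window has 43,200 minutes, so the budget in minutes is (1 - SLO) x 43,200. A quick sketch of the conversion:

```python
# Convert an availability SLO into a 30-day error budget, expressed in minutes.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the window

def budget_minutes(slo: float) -> float:
    """Minutes of full downtime allowed per 30 days at a given availability SLO."""
    return (1 - slo) * MINUTES_PER_30_DAYS

print(round(budget_minutes(0.999), 2))   # 43.2  (API Gateway row)
print(round(budget_minutes(0.9999), 2))  # 4.32  (Payments row)
```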
Error Budget Policy
The error budget is your innovation currency. When the budget is healthy, ship aggressively. When it’s depleted, freeze deployments and fix reliability.
- Budget remaining > 50%: Ship normally, deploy daily
- Budget remaining 25-50%: Slow deployments, extra review
- Budget remaining 10-25%: Critical fixes only, no new features
- Budget remaining < 10%: Deployment freeze, all hands on reliability
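To keep the policy from becoming a mid-incident negotiation, the tiers above can be encoded directly in deploy tooling. A minimal sketch (the function name and return values are illustrative, not any specific tool's API):

```python
def deployment_policy(budget_remaining: float) -> str:
    """Map remaining error budget (fraction 0.0-1.0) to the posture defined above."""
    if budget_remaining > 0.50:
        return "normal"          # ship normally, deploy daily
    if budget_remaining > 0.25:
        return "slow"            # slow deployments, extra review
    if budget_remaining > 0.10:
        return "critical-only"   # critical fixes only, no new features
    return "freeze"              # deployment freeze, all hands on reliability

assert deployment_policy(0.62) == "normal"
assert deployment_policy(0.08) == "freeze"
```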
Incident Response Procedure
Phase 1: Detect & Triage (0-5 minutes)
- Acknowledge the alert
- Assess impact scope (how many users, which regions)
- Assign severity level
- Open incident channel (#inc-YYYYMMDD-brief-description)
- Post initial status update
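A small helper that produces the channel name removes one decision from a stressed responder. A sketch, following the #inc-YYYYMMDD-brief-description convention above (the slug rules and length cap are assumptions; adjust to your chat tool's limits):

```python
import re
from datetime import datetime, timezone

def incident_channel(description: str) -> str:
    """Build a channel name in the #inc-YYYYMMDD-brief-description format."""
    date = datetime.now(timezone.utc).strftime("%Y%m%d")
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")[:40]
    return f"#inc-{date}-{slug}"

print(incident_channel("API gateway 5xx spike"))  # e.g. #inc-20250301-api-gateway-5xx-spike
```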
Phase 2: Mitigate (5-30 minutes)
- Apply runbook immediate actions
- If no runbook exists: check recent deployments, check dependent services
- Communicate every 15 minutes, even if no progress
- Escalate if mitigation isn’t working by minute 20
Phase 3: Resolve (30+ minutes)
- Root cause investigation
- Implement fix or rollback
- Verify resolution with metrics
- Stand down incident
Phase 4: Follow-up (Next business day)
- Write incident report (blameless)
- Identify prevention actions
- Update or create runbooks
- Track action items to completion
Common Failure Mode Playbooks
Database Connection Pool Exhaustion
Symptoms: Increasing error rate, connection timeout errors, slow queries
Immediate Actions:
-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
-- Find long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND (now() - pg_stat_activity.query_start) > interval '5 minutes';
-- Kill problematic queries (anything non-idle running longer than 5 minutes)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND (now() - query_start) > interval '5 minutes'
  AND pid <> pg_backend_pid();
Root Causes (in order of likelihood):
- Connection leak in application code (missing connection.close(); see the sketch after this list)
- Spike in traffic beyond pool capacity
- Deadlock causing connections to hang
- Database failover not updating connection strings
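The connection-leak case is worth showing concretely, because the leaky and safe versions look nearly identical in code review. A minimal sketch, assuming psycopg2 as the driver; the DSN and the orders table are placeholders:

```python
import psycopg2

DSN = "dbname=app user=app host=db.internal"  # placeholder connection string

# Leaky: if execute() raises, close() is never reached and the connection
# stays checked out until the pool (or Postgres) runs out.
def order_count_leaky() -> int:
    conn = psycopg2.connect(DSN)
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM orders")  # 'orders' is a placeholder table
    count = cur.fetchone()[0]
    conn.close()
    return count

# Safe: the connection is released on every path, including exceptions.
def order_count_safe() -> int:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM orders")
            return cur.fetchone()[0]
    finally:
        conn.close()
```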
Memory Pressure / OOM Kills
Symptoms: Pod restarts, exit code 137 (container OOM-killed), increasing memory usage trend
Immediate Actions:
# Check OOM events
dmesg | grep -i "out of memory" | tail -20
# Current memory usage by process
ps aux --sort=-%mem | head -20
# Kubernetes: check resource limits
kubectl describe pod <pod-name> | grep -A5 "Limits\|Requests"
Root Causes:
- Memory leak (gradual increase over hours/days)
- Cache unbounded growth (see the bounded-cache sketch after this list)
- Large payload processing without streaming
- Resource limits set too low for workload
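Unbounded cache growth is the easiest of these to catch in review. A minimal sketch of the failure pattern and a bounded alternative, assuming a simple in-process cache:

```python
from collections import OrderedDict

# Unbounded: every unique key stays resident until the process is OOM-killed.
unbounded_cache = {}

def lookup_unbounded(key, load):
    if key not in unbounded_cache:
        unbounded_cache[key] = load(key)
    return unbounded_cache[key]

# Bounded: a simple LRU with a hard cap. functools.lru_cache(maxsize=...) is the
# stdlib equivalent when the cached function is pure.
class LRUCache:
    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, load):
        if key in self._data:
            self._data.move_to_end(key)      # mark as recently used
            return self._data[key]
        value = load(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict the least recently used entry
        return value
```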
Runbook Maintenance Protocol
Dead runbooks are worse than no runbooks — they create false confidence.
- Review quarterly: Every runbook gets a scheduled review at least once a quarter, and sooner when its service changes
- Test twice a year: Actually execute the runbook steps in staging
- Update after incidents: Every incident that uses a runbook should update it
- Assign ownership: Every runbook has a team owner, not an individual owner
- Automate where possible: If a runbook step is always the same, script it
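On the last point, even a thin wrapper around diagnostics already in this document pays off by removing copy-paste errors under pressure. A rough sketch, reusing the OOM playbook commands above (the check list and output handling are assumptions):

```python
import subprocess

# Diagnostic commands lifted from the OOM playbook above.
CHECKS = {
    "oom_events": "dmesg | grep -i 'out of memory' | tail -20",
    "top_memory_processes": "ps aux --sort=-%mem | head -20",
}

def run_checks() -> None:
    for name, command in CHECKS.items():
        result = subprocess.run(["sh", "-c", command], capture_output=True, text=True)
        print(f"=== {name} ===")
        print(result.stdout or result.stderr or "(no output)")

if __name__ == "__main__":
    run_checks()
```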
The gold standard: runbook steps that can be executed by a junior engineer at 3 AM with minimal prior context. If it requires tribal knowledge, it’s not a runbook — it’s a wish.