
Site Reliability Engineering Runbook Framework

How to build production SRE runbooks. Covers incident response procedures, SLO-based alerting, error budget policies, and operational playbooks for common failure modes.

Runbooks are the difference between a 5-minute incident and a 5-hour outage. When your production system fails at 3 AM, the on-call engineer shouldn’t be debugging from scratch — they should be following a tested, documented procedure that leads to resolution. Every minute spent searching for context during an incident is a minute of customer impact.

Yet most organizations treat runbooks as an afterthought. They’re outdated Word documents in a forgotten SharePoint folder, written by someone who left the company two years ago. Production-grade SRE requires living runbooks that are maintained alongside the code they support.


Runbook Structure

Every runbook follows the same skeleton:

# [Service Name] — [Failure Mode]

## Severity: P1/P2/P3/P4

## Symptoms
- What the alert looks like
- What users are experiencing
- What dashboards show

## Immediate Actions (First 5 Minutes)
1. Step-by-step diagnostic commands
2. Common quick fixes
3. Escalation criteria

## Root Cause Investigation
- Diagnostic queries and commands
- Log locations and search patterns
- Common root causes in order of likelihood

## Resolution Steps
- Detailed fix procedures for each root cause
- Rollback instructions
- Verification steps

## Post-Incident
- What to check after resolution
- Metrics to monitor for recurrence
- Follow-up actions

SLO-Based Alerting

Traditional alerting (CPU > 90%, disk > 80%) generates noise. SLO-based alerting tells you when users are actually impacted.

Defining SLOs

| Service | SLI | SLO | Error Budget (30 days) |
| --- | --- | --- | --- |
| API Gateway | Request success rate | 99.9% | 43.2 minutes of downtime |
| Search | P99 latency | < 500ms | 1% of requests can be slow |
| Payments | Transaction success rate | 99.99% | 4.3 minutes of downtime |
| Dashboard | Page load time | < 3s (P95) | 5% of loads can be slow |
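
The downtime figures follow directly from the SLO target and the window length. A minimal sketch of that arithmetic (the function name is illustrative, not part of any library):

# Convert an availability SLO into an error budget over a rolling window.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes

print(error_budget_minutes(0.999))   # 43.2  -> API Gateway
print(error_budget_minutes(0.9999))  # 4.32  -> Payments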

Error Budget Policy

The error budget is your innovation currency. When the budget is healthy, ship aggressively. When it’s depleted, freeze deployments and fix reliability.

Budget Remaining > 50%: Ship normally, deploy daily
Budget Remaining 25-50%: Slow deployments, extra review
Budget Remaining 10-25%: Critical fixes only, no new features
Budget Remaining < 10%: Deployment freeze, all hands on reliability
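
These tiers only work if they are enforced mechanically rather than by memory. A sketch of how the policy could be encoded so a deploy pipeline can check it; the thresholds mirror the tiers above, and the function name is illustrative:

# Map remaining error budget (fraction of the 30-day budget) to a deploy policy.
def deploy_policy(budget_remaining: float) -> str:
    if budget_remaining > 0.50:
        return "normal"          # ship normally, deploy daily
    if budget_remaining > 0.25:
        return "slow"            # slow deployments, extra review
    if budget_remaining > 0.10:
        return "critical-only"   # critical fixes only, no new features
    return "freeze"              # deployment freeze, all hands on reliability

assert deploy_policy(0.42) == "slow"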

Incident Response Procedure

Phase 1: Detect & Triage (0-5 minutes)

  1. Acknowledge the alert
  2. Assess impact scope (how many users, which regions)
  3. Assign severity level
  4. Open incident channel (#inc-YYYYMMDD-brief-description)
  5. Post initial status update

Phase 2: Mitigate (5-30 minutes)

  1. Apply runbook immediate actions
  2. If no runbook exists: check recent deployments, check dependent services
  3. Communicate every 15 minutes, even if no progress
  4. Escalate if mitigation isn’t working by minute 20

Phase 3: Resolve (30+ minutes)

  1. Root cause investigation
  2. Implement fix or rollback
  3. Verify resolution with metrics
  4. Stand down incident

Phase 4: Follow-up (Next business day)

  1. Write incident report (blameless)
  2. Identify prevention actions
  3. Update or create runbooks
  4. Track action items to completion

Common Failure Mode Playbooks

Database Connection Pool Exhaustion

Symptoms: Increasing error rate, connection timeout errors, slow queries

Immediate Actions:

-- Check active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Find long-running queries
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle' AND (now() - query_start) > interval '5 minutes';

-- Kill a problematic query (substitute a pid from the query above)
SELECT pg_terminate_backend(<pid>);

Root Causes (in order of likelihood):

  1. Connection leak in application code (missing connection.close())
  2. Spike in traffic beyond pool capacity
  3. Deadlock causing connections to hang
  4. Database failover not updating connection strings
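
The most common root cause, a connection leak, is also the easiest to prevent in code. A minimal sketch assuming psycopg2 with a threaded pool (pool sizes and the DSN are placeholders) that returns connections on every path, including errors:

# Always return connections to the pool, even when the query raises.
from contextlib import contextmanager
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=2, maxconn=20,
                              dsn="postgresql://app@db-host/app")  # placeholder DSN

@contextmanager
def get_conn():
    conn = pool.getconn()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        pool.putconn(conn)   # returned on every path, so the pool never leaks

def fetch_email(user_id):
    with get_conn() as conn, conn.cursor() as cur:
        cur.execute("SELECT email FROM users WHERE id = %s", (user_id,))
        return cur.fetchone()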

Memory Pressure / OOM Kills

Symptoms: Pod restarts, exit code 137 (OOM kill), steadily increasing memory usage trend

Immediate Actions:

# Check OOM events
dmesg | grep -i "out of memory" | tail -20

# Current memory usage by process
ps aux --sort=-%mem | head -20

# Kubernetes: check resource limits
kubectl describe pod <pod-name> | grep -A5 "Limits\|Requests"

Root Causes:

  1. Memory leak (gradual increase over hours/days)
  2. Cache unbounded growth
  3. Large payload processing without streaming
  4. Resource limits set too low for workload
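
Root cause 2 above is usually better fixed with a size cap than with more memory. A minimal sketch using the standard library; the cached function and the 10k cap are illustrative and should be tuned against the container's memory limit:

# Bound in-process caches so steady-state memory stays under the limit.
from functools import lru_cache

@lru_cache(maxsize=10_000)        # LRU eviction once the cache holds 10k entries
def load_profile(user_id: int) -> dict:
    # placeholder for the expensive lookup the cache protects
    return {"user_id": user_id}

# load_profile.cache_info() reports hits, misses, and current size, useful when sizing the cap.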

Runbook Maintenance Protocol

Dead runbooks are worse than no runbooks — they create false confidence.

  1. Review quarterly: Every runbook gets a scheduled review, plus an ad-hoc review whenever its service changes significantly
  2. Test twice a year: Actually execute the runbook steps in staging
  3. Update after incidents: Every incident that uses a runbook should update it
  4. Assign ownership: Every runbook has a team owner, not an individual owner
  5. Automate where possible: If a runbook step is always the same, script it

The gold standard: runbook steps that can be executed by a junior engineer at 3 AM with minimal prior context. If it requires tribal knowledge, it’s not a runbook — it’s a wish.
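
As an example of point 5 above, the long-running-query check from the connection pool playbook never changes, so it can ship as a script instead of prose. A sketch assuming psycopg2 and a read-only DSN exposed to on-call via an environment variable (both are assumptions, not part of any existing tooling):

# check_long_queries.py: list queries running longer than 5 minutes.
import os
import psycopg2

QUERY = """
SELECT pid, now() - query_start AS duration, left(query, 80) AS query
FROM pg_stat_activity
WHERE state != 'idle' AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
"""

def main(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        for pid, duration, query in cur.fetchall():
            print(f"pid={pid} duration={duration} query={query!r}")

if __name__ == "__main__":
    main(os.environ["READONLY_DSN"])   # assumption: read-only DSN provided via env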

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
