# How to Build an Effective Incident Response Playbook
Build and test incident response playbooks for your team. Covers severity classification, communication templates, war room procedures, and post-mortem frameworks.
Organizations with tested incident response plans save an average of $2.66M per breach compared to those without (IBM Cost of a Data Breach Report, 2024). The plan isn’t the document — it’s the rehearsal. A perfectly written playbook that nobody has practiced is marginally better than no playbook at all.
This guide covers everything you need to build, implement, and test an incident response program: severity classification, on-call structure, playbook templates, communication protocols, post-mortem frameworks, and the tabletop exercises that turn paper plans into muscle memory.
## Step 1: Severity Classification
Clear severity definitions prevent the two most common incident response failures: under-reacting to critical issues and over-reacting to non-events. Every team member should be able to classify an incident without asking a manager.
| Severity | Definition | Response Time | Escalation | Example |
|---|---|---|---|---|
| SEV-1 (Critical) | Revenue-impacting, data breach, full outage, or customer data at risk | 15 minutes | VP Engineering + CISO immediately | Payment system down, data exfiltrated, full site outage |
| SEV-2 (High) | Partial outage, significant degraded performance, one service down | 30 minutes | Engineering Manager | API error rate > 10%, one region down, auth failures |
| SEV-3 (Medium) | Minor impact, workaround available, performance degraded | 2 hours | Team lead | Slow queries, non-critical service degraded, UI glitches |
| SEV-4 (Low) | No user impact, internal observation only | Next business day | Log to backlog (no page) | Monitoring alert anomaly, non-prod issue, cosmetic bug |
### Severity Decision Tree

```
Is customer data compromised or at risk?
├── YES → SEV-1 (immediately)
└── NO  → Continue
Is the service completely unavailable?
├── YES → SEV-1
└── NO  → Continue
Are > 10% of users affected?
├── YES → SEV-2
└── NO  → Continue
Is there a workaround?
├── NO               → SEV-2
├── YES, but painful → SEV-3
└── YES, trivial     → SEV-4
```
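To keep classification consistent when it is automated (for example, in an alert-triage bot), the tree above translates directly into a small function. This is a sketch; the argument names and yes/no conventions are assumptions, not part of any tooling mentioned here.

```shell
# Sketch of the decision tree above. Hypothetical arguments:
#   $1 data_at_risk (yes/no)     $2 full_outage (yes/no)
#   $3 pct_users_affected (int)  $4 workaround (none/painful/trivial)
classify_severity() {
  local data="$1" outage="$2" pct="$3" workaround="$4"
  if [ "$data" = "yes" ] || [ "$outage" = "yes" ]; then
    echo "SEV-1"; return            # data at risk or full outage: always SEV-1
  fi
  if [ "$pct" -gt 10 ]; then
    echo "SEV-2"; return            # broad user impact
  fi
  case "$workaround" in
    none)    echo "SEV-2" ;;        # no workaround: still high
    painful) echo "SEV-3" ;;        # workaround exists but hurts
    *)       echo "SEV-4" ;;        # trivial workaround: low
  esac
}

classify_severity no no 5 painful   # → SEV-3
```

Because the function is pure (inputs in, severity out), it doubles as an executable spec: if the decision tree changes, the team updates one function and every consumer agrees.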
## Step 2: On-Call Structure
### Rotation Design
```yaml
# PagerDuty / Opsgenie schedule structure
on_call_rotation:
  primary:
    schedule: weekly_rotation
    members: [engineer_a, engineer_b, engineer_c, engineer_d]
    handoff: monday_10am_local
  escalation:
    - wait: 5_minutes
      target: secondary_on_call
    - wait: 10_minutes
      target: engineering_manager
    - wait: 15_minutes
      target: vp_engineering
  secondary:
    schedule: weekly_rotation  # offset by 1 week from primary
    purpose: backup if primary doesn't acknowledge
  incident_commander_pool:
    - senior_engineer_1
    - senior_engineer_2
    - engineering_manager_1
    - engineering_manager_2

roles:
  incident_commander: "Coordinates response, makes decisions, owns timeline"
  communications_lead: "Updates stakeholders, customers, status page"
  technical_lead: "Directs debugging and remediation, assigns tasks"
  scribe: "Documents timeline, decisions, and action items in real time"
```
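With a strict weekly rotation like the schedule above, "who is primary this week" is just the week number modulo the roster size, which is handy for ad-hoc scripts when the paging tool is unreachable. A sketch; the function name and the week-number convention are assumptions:

```shell
# Return the primary on-call for a given week number, assuming the roster
# rotates weekly in the order listed. Inputs: week number, then member names.
on_call_for_week() {
  local week="$1"; shift
  local members=("$@")
  echo "${members[$(( week % ${#members[@]} ))]}"
}

# Week 0 -> engineer_a; week 5 -> engineer_b (5 mod 4 = 1)
on_call_for_week 5 engineer_a engineer_b engineer_c engineer_d   # → engineer_b
```

The secondary rotation, offset by one week as in the config above, is the same call with `week + 1`.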
### On-Call Best Practices
| Practice | Why |
|---|---|
| Maximum 1 week on-call, 3 weeks off | Prevents burnout — on-call is stressful |
| Compensate on-call (stipend or comp time) | People who are paid for on-call take it seriously |
| Provide a company laptop with VPN | Don’t require personal devices for incident response |
| No on-call during vacations | Override scheduling tools to respect PTO |
| Maximum 2 pages per on-call shift (goal) | More than 2/week = fix the root cause, not the alert |
| Shadow shifts for new team members | Pair with experienced on-call before going solo |
## Step 3: Response Playbook Template
Every common incident type should have a specific playbook. Start with your top 5 failure modes.
## Incident Playbook: [Incident Type — e.g., Database Connection Exhaustion]
### Detection
- Alert source: [PagerDuty alert name / Datadog monitor name]
- Initial indicators: [what metrics or errors trigger this playbook]
- Common causes: [deployment, traffic spike, connection leak, dependency failure]
### Triage (First 15 Minutes)
1. Acknowledge alert in PagerDuty (stops escalation timer)
2. Join war room: [#incident-response Slack channel / Zoom link]
3. Assign roles: IC, Comms Lead, Tech Lead, Scribe
4. Assess severity using classification matrix
5. Start incident document from template: [link to template]
6. Post initial update in #incidents channel
### Diagnosis
1. Check dashboards: [specific links to Grafana/Datadog dashboards]
2. Check recent deployments: [CI/CD deployment history link]
3. Check infrastructure health: [CloudWatch / Kubernetes dashboard]
4. Run diagnostic commands:
```bash
# Check pod status and recent restarts
kubectl get pods -n production -o wide
kubectl top pods -n production

# Check application logs for errors
kubectl logs -n production -l app=api --tail=200 --since=15m | grep -i error

# Check API health endpoint
curl -s https://api.company.com/health | jq .

# Check database connections
kubectl exec -n production deployment/api -- \
  psql -c "SELECT count(*) FROM pg_stat_activity;"
```

### Remediation Options

**Option A: Rollback last deployment** (if the issue started after a deploy)

```bash
kubectl rollout undo deployment/api -n production
# Verify rollback
kubectl rollout status deployment/api -n production --timeout=120s
```

**Option B: Scale up to handle load** (if a traffic spike is the cause)

```bash
kubectl scale deployment/api --replicas=10 -n production
# Monitor: HPA will scale down automatically when load drops
```

**Option C: Restart pods** (if a connection leak or memory issue is suspected)

```bash
kubectl rollout restart deployment/api -n production
```

**Option D: Failover to disaster recovery** (SEV-1 only, when primary is unrecoverable)

```bash
./scripts/failover-to-dr.sh --region us-west-2 --confirm
```
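Option A is the most common fix, and the easy mistake is firing off the undo without verifying it. A small wrapper, sketched here, refuses to report success until the rollout stabilizes; the function name is an assumption, and it shells out to the same kubectl commands shown above:

```shell
# Roll back a deployment and block until the rollout is verified healthy.
# Prints a confirmation on success; returns non-zero if it does not stabilize.
safe_rollback() {
  local deploy="$1" ns="${2:-production}"
  kubectl rollout undo "deployment/$deploy" -n "$ns" || return 1
  if kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=120s; then
    echo "rollback of $deploy verified"
  else
    echo "rollback of $deploy did NOT stabilize; escalate" >&2
    return 1
  fi
}

# Usage during an incident:
#   safe_rollback api
```

Returning non-zero on a failed verification matters: it lets the playbook chain the next option (scale up, restart, failover) instead of declaring victory on an unverified undo.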
### Communication Templates

**Internal (Slack #incidents):**

> 🔴 SEV-[X] Incident: [Title]\
> Impact: [what users are experiencing]\
> IC: @[name] | Tech Lead: @[name]\
> Status: Investigating | Identified | Monitoring | Resolved\
> Next update: [time — commit to an update cadence, e.g., every 15 min for SEV-1]

**External (Status Page — statuspage.io / Atlassian):**

> We are currently experiencing [issue description]. This impacts [affected services/features]. Our team is actively working on resolution. We will provide an update by [time].

**Executive Update (Email / Slack DM — SEV-1 only):**

> Subject: SEV-1 Incident Update — [Title]\
> Current status: [Investigating / Identified / Monitoring / Resolved]\
> Customer impact: [X users affected, Y% error rate, $Z revenue impact]\
> ETA to resolution: [best estimate or "investigating"]\
> Next update: [time]
### Resolution
- Verify service is healthy (all dashboard metrics green for 15 minutes)
- Monitor for regression for 30 minutes post-fix
- Update status page → Resolved
- Post final update in #incidents with summary
- Schedule post-mortem within 48 hours (non-negotiable for SEV-1/SEV-2)
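The "green for 15 minutes" rule in the resolution steps above is easy to script so nobody has to watch a clock. A generic sketch, assuming the health check can be expressed as any command that exits non-zero on failure (the function name and the health URL are assumptions):

```shell
# Block until `cmd` succeeds `needed` times in a row, polling every `interval`
# seconds. A single failure resets the streak, matching the "green for 15
# consecutive minutes" rule rather than "15 green samples total".
require_consecutive_successes() {
  local needed="$1" interval="$2"; shift 2
  local streak=0
  while [ "$streak" -lt "$needed" ]; do
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
    else
      streak=0
    fi
    [ "$streak" -lt "$needed" ] && sleep "$interval"
  done
  return 0
}

# Usage: 15 consecutive green checks, one per minute, against the health endpoint
#   require_consecutive_successes 15 60 curl -fsS https://api.company.com/health
```

Because the check command is an argument, the same helper verifies anything from an HTTP endpoint to a `kubectl rollout status` call.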
---
## Step 4: Post-Mortem Framework
Post-mortems are blameless. The goal is to prevent recurrence, not assign blame. If your culture punishes people for incidents, people will hide incidents.
```markdown
## Post-Mortem: [Incident Title]
**Date:** [date of incident]
**Duration:** [start time – end time (total: X minutes)]
**Severity:** SEV-[X]
**Incident Commander:** [name]
**Author:** [who wrote this post-mortem]
### Summary
[2-3 sentence description of what happened, what was impacted, and how it was resolved]
### Timeline
| Time (UTC) | Event |
|---|---|
| 14:02 | Alert triggered: API error rate > 5% (PagerDuty) |
| 14:05 | On-call acknowledged, created #inc-20250215 Slack channel |
| 14:08 | IC assigned, roles distributed |
| 14:12 | Root cause identified: database connection pool exhausted after deploy |
| 14:15 | Rollback initiated: kubectl rollout undo |
| 14:18 | Fix applied: connection pool increased from 20 to 100 |
| 14:25 | Service recovered, monitoring period started |
| 14:45 | Monitoring clear — all metrics nominal |
| 14:55 | Incident resolved, status page updated |
### Root Cause
[Detailed technical explanation — what specifically broke and why. Include code references if relevant]
### Contributing Factors
[What conditions made this incident possible? Missing tests? Missing monitoring? Configuration gap?]
### Impact
- [X] customers affected (Y% of total traffic)
- [Z] minutes of degraded service
- $[amount] estimated revenue impact
- [X] support tickets generated
### What Went Well
- Alert fired within 2 minutes of impact starting
- War room assembled in 5 minutes (under our 15-min target)
- Root cause identified in 10 minutes (team knew where to look)
- Fix applied in 15 minutes
### What Could Be Improved
- Connection pool limits weren't monitored (no alert existed)
- Runbook didn't cover this specific failure mode
- Customer communication was delayed by 20 minutes
- The deploy that caused this had no canary phase
### Action Items
| Action | Owner | Due Date | Priority |
|---|---|---|---|
| Add connection pool monitoring alert | @engineer | [date] | P1 |
| Update runbook with DB pool exhaustion playbook | @sre | [date] | P1 |
| Implement canary deployments for API service | @platform | [date] | P2 |
| Load test with 2x normal traffic | @qa | [date] | P2 |
| Automate customer status page notification | @platform | [date] | P3 |
```
## Step 5: Tabletop Exercises
Run simulated incidents quarterly. This is the single highest-value activity in incident response.
| Exercise Type | Frequency | Duration | Participants |
|---|---|---|---|
| Walk-through (review playbook verbally) | Monthly | 30 minutes | On-call rotation |
| Tabletop (simulated scenario, discuss response) | Quarterly | 1-2 hours | Engineering + management |
| Game Day (inject real failure in staging) | Semi-annually | Half day | Full engineering team |
| Chaos Engineering (automated failure injection) | Ongoing | Automated | Production (with safeguards) |
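Game days are much easier to run when the failure injection is a single command. Below is a minimal sketch for a "kill one random pod" scenario; the function name and the `staging` namespace are assumptions, and it should never be pointed at production without the safeguards the table mentions:

```shell
# Delete one randomly chosen pod in the given namespace (default: staging).
# Kubernetes reschedules the pod; the exercise is watching whether alerts fire
# and how quickly the team follows the playbook.
kill_random_pod() {
  local ns="${1:-staging}"
  local pod
  pod=$(kubectl get pods -n "$ns" -o name | shuf -n 1)
  [ -n "$pod" ] || { echo "no pods found in $ns" >&2; return 1; }
  echo "injecting failure: deleting $pod in $ns"
  kubectl delete "$pod" -n "$ns"
}

# Game-day usage (staging only):
#   kill_random_pod staging
```

Announcing the injection on stdout is deliberate: the game-day facilitator keeps the line as ground truth when grading the team's detection time afterward.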
## Incident Response Checklist
- Severity classification defined, documented, and trained (all engineers can classify)
- On-call rotation configured with escalation paths (PagerDuty/Opsgenie)
- War room channel and bridge set up (Slack channel + Zoom link)
- Playbooks written for top 5 incident types
- Communication templates drafted (internal, external, executive)
- Status page configured (statuspage.io, Atlassian, Instatus)
- Post-mortem template standardized and blameless culture established
- Tabletop exercises scheduled quarterly (with scenarios pre-written)
- Action items from post-mortems tracked to completion (not abandoned)
- Metrics tracked: MTTA, MTTR, incidents per month, action item close rate
- On-call compensation established (stipend, comp time, or both)
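Two of those metrics fall straight out of a post-mortem timeline: MTTA is alert-to-acknowledge and MTTR is alert-to-resolve. A sketch for same-day HH:MM timestamps like those in the Step 4 timeline (the function name is an assumption):

```shell
# Minutes between two same-day HH:MM timestamps (UTC), e.g. from a timeline
# table. The 10# prefix forces base 10 so "08"/"09" are not parsed as octal.
minutes_between() {
  local start="$1" end="$2"
  local s=$(( 10#${start%:*} * 60 + 10#${start#*:} ))
  local e=$(( 10#${end%:*} * 60 + 10#${end#*:} ))
  echo $(( e - s ))
}

minutes_between 14:02 14:05   # alert → acknowledged (MTTA) → 3
minutes_between 14:02 14:55   # alert → resolved (MTTR)     → 53
```

Averaging these per month across incidents gives the MTTA/MTTR trend lines; incidents that cross midnight need full date-time arithmetic instead.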
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::