Verified by Garnet Grid

How to Build an Effective Incident Response Playbook

Build and test incident response playbooks for your team. Covers severity classification, communication templates, war room procedures, and post-mortem frameworks.

Organizations with tested incident response plans save an average of $2.66M per breach compared to those without (IBM Cost of a Data Breach Report, 2024). The plan isn’t the document — it’s the rehearsal. A perfectly written playbook that nobody has practiced is marginally better than no playbook at all.

This guide covers everything you need to build, implement, and test an incident response program: severity classification, on-call structure, playbook templates, communication protocols, post-mortem frameworks, and the tabletop exercises that turn paper plans into muscle memory.


Step 1: Severity Classification

Clear severity definitions prevent the two most common incident response failures: under-reacting to critical issues and over-reacting to non-events. Every team member should be able to classify an incident without asking a manager.

| Severity | Definition | Response Time | Escalation | Example |
|---|---|---|---|---|
| SEV-1 (Critical) | Revenue-impacting, data breach, full outage, or customer data at risk | 15 minutes | VP Engineering + CISO immediately | Payment system down, data exfiltrated, full site outage |
| SEV-2 (High) | Partial outage, significantly degraded performance, one service down | 30 minutes | Engineering Manager | API error rate > 10%, one region down, auth failures |
| SEV-3 (Medium) | Minor impact, workaround available, performance degraded | 2 hours | Team lead | Slow queries, non-critical service degraded, UI glitches |
| SEV-4 (Low) | No user impact, internal observation only | Next business day | Log in backlog | Monitoring alert anomaly, non-prod issue, cosmetic bug |

Severity Decision Tree

Is customer data compromised or at risk?
├── YES → SEV-1 (immediately)
└── NO → Continue

Is the service completely unavailable?
├── YES → SEV-1
└── NO → Continue

Are > 10% of users affected?
├── YES → SEV-2
└── NO → Continue

Is there a workaround?
├── NO → SEV-2
├── YES, but painful → SEV-3
└── YES, trivial → SEV-4
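The decision tree above can be encoded as a small helper so that any engineer (or a chat bot) classifies the same incident the same way. A minimal sketch; the function name and the yes/no/painful/trivial answer vocabulary are illustrative, not part of any tool:

```shell
#!/usr/bin/env bash
# Walk the severity decision tree from four answers.
# Arguments: data_at_risk full_outage over_10pct_users workaround
classify() {
  local data_at_risk=$1 full_outage=$2 over_10pct=$3 workaround=$4
  if [ "$data_at_risk" = yes ] || [ "$full_outage" = yes ]; then
    echo SEV-1
  elif [ "$over_10pct" = yes ] || [ "$workaround" = no ]; then
    echo SEV-2
  elif [ "$workaround" = painful ]; then
    echo SEV-3
  else
    echo SEV-4
  fi
}

classify no no yes trivial   # prints SEV-2
```

Checking the answers in tree order matters: customer data at risk always wins, even if a workaround exists.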

Step 2: On-Call Structure

Rotation Design

```yaml
# PagerDuty / Opsgenie schedule structure
on_call_rotation:
  primary:
    schedule: weekly_rotation
    members: [engineer_a, engineer_b, engineer_c, engineer_d]
    handoff: monday_10am_local
    escalation:
      - wait: 5_minutes
        target: secondary_on_call
      - wait: 10_minutes
        target: engineering_manager
      - wait: 15_minutes
        target: vp_engineering

  secondary:
    schedule: weekly_rotation  # offset by 1 week from primary
    purpose: backup if primary doesn't acknowledge

  incident_commander_pool:
    - senior_engineer_1
    - senior_engineer_2
    - engineering_manager_1
    - engineering_manager_2

  roles:
    incident_commander: "Coordinates response, makes decisions, owns timeline"
    communications_lead: "Updates stakeholders, customers, status page"
    technical_lead: "Directs debugging and remediation, assigns tasks"
    scribe: "Documents timeline, decisions, and action items in real-time"
```
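A quick way to sanity-check the schedule outside the paging tool is to derive the current primary from the week number. A rough sketch: the member names match the config above, but the week boundary here follows the Unix epoch rather than the Monday 10am handoff, so treat it as an approximation:

```shell
#!/usr/bin/env bash
# Derive this week's primary from a 4-person weekly rotation.
members=(engineer_a engineer_b engineer_c engineer_d)

week=$(( $(date +%s) / 604800 ))                 # whole weeks since the Unix epoch
primary=${members[$(( week % ${#members[@]} ))]} # rotate through the list
echo "primary on-call: $primary"
```

The same modulo trick, offset by one, gives the secondary rotation described in the config.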

On-Call Best Practices

| Practice | Why |
|---|---|
| Maximum 1 week on-call, 3 weeks off | Prevents burnout — on-call is stressful |
| Compensate on-call (stipend or comp time) | People who are paid for on-call take it seriously |
| Provide a company laptop with VPN | Don’t require personal devices for incident response |
| No on-call during vacations | Override scheduling tools to respect PTO |
| Maximum 2 pages per on-call shift (goal) | More than 2/week = fix the root cause, not the alert |
| Shadow shifts for new team members | Pair with experienced on-call before going solo |

Step 3: Response Playbook Template

Every common incident type should have a specific playbook. Start with your top 5 failure modes.

## Incident Playbook: [Incident Type — e.g., Database Connection Exhaustion]

### Detection
- Alert source: [PagerDuty alert name / Datadog monitor name]
- Initial indicators: [what metrics or errors trigger this playbook]
- Common causes: [deployment, traffic spike, connection leak, dependency failure]

### Triage (First 15 Minutes)
1. Acknowledge alert in PagerDuty (stops escalation timer)
2. Join war room: [#incident-response Slack channel / Zoom link]
3. Assign roles: IC, Comms Lead, Tech Lead, Scribe
4. Assess severity using classification matrix
5. Start incident document from template: [link to template]
6. Post initial update in #incidents channel
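Part of triage step 2 can be pre-wired. A sketch that derives a dated war-room channel name (the `inc-YYYYMMDD-slug` convention matches the `#inc-20250215` channel in the post-mortem timeline later in this guide) and, commented out, creates it via the Slack Web API; `conversations.create` is a real Slack method, but the bot-token setup is an assumption:

```shell
#!/usr/bin/env bash
# Build a dated incident channel name, e.g. inc-20250215-api-errors.
incident_channel_name() {
  echo "inc-$(date -u +%Y%m%d)-$1"
}

# Create the channel (requires a Slack bot token with channels:manage):
# curl -s -X POST https://slack.com/api/conversations.create \
#   -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
#   --data-urlencode "name=$(incident_channel_name api-errors)"

incident_channel_name api-errors
```

Keeping the naming convention in one helper means the scribe, the status page, and the post-mortem all reference the same channel.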

### Diagnosis
1. Check dashboards: [specific links to Grafana/Datadog dashboards]
2. Check recent deployments: [CI/CD deployment history link]
3. Check infrastructure health: [CloudWatch / Kubernetes dashboard]
4. Run diagnostic commands:
   ```bash
   # Check pod status and recent restarts
   kubectl get pods -n production -o wide
   kubectl top pods -n production

   # Check application logs for errors
   kubectl logs -n production -l app=api --tail=200 --since=15m | grep -i error

   # Check API health endpoint
   curl -s https://api.company.com/health | jq .

   # Check database connections
   kubectl exec -n production deployment/api -- \
     psql -c "SELECT count(*) FROM pg_stat_activity;"
   ```

### Remediation Options

Option A: Rollback last deployment (if issue started after a deploy)

```bash
kubectl rollout undo deployment/api -n production
# Verify rollback
kubectl rollout status deployment/api -n production --timeout=120s
```

Option B: Scale up to handle load (if traffic spike is the cause)

```bash
kubectl scale deployment/api --replicas=10 -n production
# Monitor: HPA will scale down automatically when load drops
```

Option C: Restart pods (if connection leak or memory issue)

```bash
kubectl rollout restart deployment/api -n production
```

Option D: Failover to disaster recovery (SEV-1 only, when primary is unrecoverable)

```bash
./scripts/failover-to-dr.sh --region us-west-2 --confirm
```

### Communication Templates

Internal (Slack #incidents):

🔴 SEV-[X] Incident: [Title]
Impact: [what users are experiencing]
IC: @[name] | Tech Lead: @[name]
Status: Investigating | Identified | Monitoring | Resolved
Next update: [time — commit to an update cadence, e.g., every 15 min for SEV-1]

External (Status Page — statuspage.io / Atlassian):

We are currently experiencing [issue description]. This impacts [affected services/features]. Our team is actively working on resolution. We will provide an update by [time].

Executive Update (Email / Slack DM — SEV-1 only):

Subject: SEV-1 Incident Update — [Title]
Current status: [Investigating/Identified/Monitoring/Resolved]
Customer impact: [X users affected, Y% error rate, $Z revenue impact]
ETA to resolution: [best estimate or “investigating”]
Next update: [time]
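Updates stay consistent under pressure if the comms lead fills a function instead of retyping the template. A sketch that renders the internal Slack update from parameters; the function name and argument order are illustrative:

```shell
#!/usr/bin/env bash
# Render the internal incident update from its fields.
# Arguments: sev title impact ic status next_update
incident_update() {
  local sev=$1 title=$2 impact=$3 ic=$4 status=$5 next=$6
  printf '🔴 SEV-%s Incident: %s\nImpact: %s\nIC: @%s\nStatus: %s\nNext update: %s\n' \
    "$sev" "$title" "$impact" "$ic" "$status" "$next"
}

incident_update 1 "API outage" "checkout failing for all users" alice Investigating "14:30 UTC"
```

The same pattern extends to the status-page and executive templates, each as its own renderer.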

### Resolution

1. Verify service is healthy (all dashboard metrics green for 15 minutes)
2. Monitor for regression for 30 minutes post-fix
3. Update status page → Resolved
4. Post final update in #incidents with summary
5. Schedule post-mortem within 48 hours (non-negotiable for SEV-1/SEV-2)
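The "green for 15 minutes" check in step 1 can be scripted rather than eyeballed. A sketch where the health-check command, required consecutive passes, and polling interval (in seconds) are parameters; the curl example in the comment uses a placeholder URL:

```shell
#!/usr/bin/env bash
# Declare recovery only after the check passes N times in a row.
# Arguments: check_command consecutive_passes interval_seconds
wait_for_healthy() {
  local check_cmd=$1 needed=$2 interval=$3 streak=0
  while [ "$streak" -lt "$needed" ]; do
    if $check_cmd; then
      streak=$((streak + 1))   # consecutive pass
    else
      streak=0                 # any failure resets the streak
    fi
    sleep "$interval"
  done
  echo "healthy ${needed}x in a row"
}

# Example (15 minutes at 60s intervals):
# wait_for_healthy "curl -sf https://api.company.com/health" 15 60
```

Resetting the streak on any failure is the point: a flapping service never satisfies the check, which is exactly when you should not mark the incident resolved.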

Step 4: Post-Mortem Framework

Post-mortems are blameless. The goal is to prevent recurrence, not assign blame. If your culture punishes people for incidents, people will hide incidents.

```markdown
## Post-Mortem: [Incident Title]

**Date:** [date of incident]
**Duration:** [start time – end time (total: X minutes)]
**Severity:** SEV-[X]
**Incident Commander:** [name]
**Author:** [who wrote this post-mortem]

### Summary
[2-3 sentence description of what happened, what was impacted, and how it was resolved]

### Timeline
| Time (UTC) | Event |
|---|---|
| 14:02 | Alert triggered: API error rate > 5% (PagerDuty) |
| 14:05 | On-call acknowledged, created #inc-20250215 Slack channel |
| 14:08 | IC assigned, roles distributed |
| 14:12 | Root cause identified: database connection pool exhausted after deploy |
| 14:15 | Rollback initiated: kubectl rollout undo |
| 14:18 | Fix applied: connection pool increased from 20 to 100 |
| 14:25 | Service recovered, monitoring period started |
| 14:45 | Monitoring clear — all metrics nominal |
| 14:55 | Incident resolved, status page updated |

### Root Cause
[Detailed technical explanation — what specifically broke and why. Include code references if relevant]

### Contributing Factors
[What conditions made this incident possible? Missing tests? Missing monitoring? Configuration gap?]

### Impact
- [X] customers affected (Y% of total traffic)
- [Z] minutes of degraded service
- $[amount] estimated revenue impact
- [X] support tickets generated

### What Went Well
- Alert fired within 2 minutes of impact starting
- War room assembled in 5 minutes (under our 15-min target)
- Root cause identified in 10 minutes (team knew where to look)
- Fix applied in 15 minutes

### What Could Be Improved
- Connection pool limits weren't monitored (no alert existed)
- Runbook didn't cover this specific failure mode
- Customer communication was delayed by 20 minutes
- The deploy that caused this had no canary phase

### Action Items
| Action | Owner | Due Date | Priority |
|---|---|---|---|
| Add connection pool monitoring alert | @engineer | [date] | P1 |
| Update runbook with DB pool exhaustion playbook | @sre | [date] | P1 |
| Implement canary deployments for API service | @platform | [date] | P2 |
| Load test with 2x normal traffic | @qa | [date] | P2 |
| Automate customer status page notification | @platform | [date] | P3 |

```

Step 5: Tabletop Exercises

Run simulated incidents quarterly. This is the single highest-value activity in incident response.

| Exercise Type | Frequency | Duration | Participants |
|---|---|---|---|
| Walk-through (review playbook verbally) | Monthly | 30 minutes | On-call rotation |
| Tabletop (simulated scenario, discuss response) | Quarterly | 1-2 hours | Engineering + management |
| Game Day (inject real failure in staging) | Semi-annually | Half day | Full engineering team |
| Chaos Engineering (automated failure injection) | Ongoing | Automated | Production (with safeguards) |

Incident Response Checklist

- Severity classification defined, documented, and trained (all engineers can classify)
- On-call rotation configured with escalation paths (PagerDuty/Opsgenie)
- War room channel and bridge set up (Slack channel + Zoom link)
- Playbooks written for top 5 incident types
- Communication templates drafted (internal, external, executive)
- Status page configured (statuspage.io, Atlassian, Instatus)
- Post-mortem template standardized and blameless culture established
- Tabletop exercises scheduled quarterly (with scenarios pre-written)
- Action items from post-mortems tracked to completion (not abandoned)
- Metrics tracked: MTTA, MTTR, incidents per month, action item close rate
- On-call compensation established (stipend, comp time, or both)
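The MTTA/MTTR metrics in the checklist are straightforward to compute from incident records. A sketch assuming a simple CSV export of `id,detected_epoch,acknowledged_epoch,resolved_epoch`; the field layout is an illustration, not any tool's native format:

```shell
#!/usr/bin/env bash
# Compute mean time to acknowledge (MTTA) and mean time to resolve
# (MTTR), in minutes, from incident records on stdin.
compute_mttr() {
  awk -F, '{ tta += ($3 - $2); ttr += ($4 - $2); n++ }
           END { printf "MTTA=%.1f min MTTR=%.1f min\n", tta/n/60, ttr/n/60 }'
}

# Two sample incidents: acked after 2 min / 1 min, resolved after 30 min / 20 min.
printf '1,1000,1120,2800\n2,2000,2060,3200\n' | compute_mttr
# prints MTTA=1.5 min MTTR=25.0 min
```

Tracking these per month makes it obvious whether post-mortem action items are actually reducing response time.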

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
