Verified by Garnet Grid

How to Build an Effective Incident Response Playbook

Build and test incident response playbooks for your team. Covers severity classification, communication templates, war room procedures, and post-mortem frameworks.

Organizations with tested incident response plans save an average of $2.66M per breach compared to those without (IBM Cost of a Data Breach Report, 2024). The plan isn’t the document — it’s the rehearsal. A perfectly written playbook that nobody has practiced is marginally better than no playbook at all.

This guide covers everything you need to build, implement, and test an incident response program: severity classification, on-call structure, playbook templates, communication protocols, post-mortem frameworks, and the tabletop exercises that turn paper plans into muscle memory.


Step 1: Severity Classification

Clear severity definitions prevent the two most common incident response failures: under-reacting to critical issues and over-reacting to non-events. Every team member should be able to classify an incident without asking a manager.

| Severity | Definition | Response Time | Escalation | Example |
|---|---|---|---|---|
| SEV-1 (Critical) | Revenue-impacting, data breach, full outage, or customer data at risk | 15 minutes | VP Engineering + CISO immediately | Payment system down, data exfiltrated, full site outage |
| SEV-2 (High) | Partial outage, significantly degraded performance, one service down | 30 minutes | Engineering Manager | API error rate > 10%, one region down, auth failures |
| SEV-3 (Medium) | Minor impact, workaround available, performance degraded | 2 hours | Team lead | Slow queries, non-critical service degraded, UI glitches |
| SEV-4 (Low) | No user impact, internal observation only | Next business day | Log in backlog | Monitoring alert anomaly, non-prod issue, cosmetic bug |

Severity Decision Tree

Is customer data compromised or at risk?
├── YES → SEV-1 (immediately)
└── NO → Continue

Is the service completely unavailable?
├── YES → SEV-1
└── NO → Continue

Are > 10% of users affected?
├── YES → SEV-2
└── NO → Continue

Is there a workaround?
├── NO → SEV-2
├── YES, but painful → SEV-3
└── YES, trivial → SEV-4
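The decision tree above can be encoded as a small helper so that any engineer (or a chat bot) classifies the same incident the same way. A minimal sketch; the function name and the yes/no/painful/trivial answer vocabulary are illustrative, not part of any tool:

```shell
#!/usr/bin/env bash
# Walk the severity decision tree from four answers.
# Arguments: data_at_risk full_outage over_10pct_users workaround
classify() {
  local data_at_risk=$1 full_outage=$2 over_10pct=$3 workaround=$4
  if [ "$data_at_risk" = yes ] || [ "$full_outage" = yes ]; then
    echo SEV-1
  elif [ "$over_10pct" = yes ] || [ "$workaround" = no ]; then
    echo SEV-2
  elif [ "$workaround" = painful ]; then
    echo SEV-3
  else
    echo SEV-4
  fi
}

classify no no yes trivial   # prints SEV-2
```

Checking the answers in tree order matters: customer data at risk always wins, even if a workaround exists.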

Step 2: On-Call Structure

Rotation Design

```yaml
# PagerDuty / Opsgenie schedule structure
on_call_rotation:
  primary:
    schedule: weekly_rotation
    members: [engineer_a, engineer_b, engineer_c, engineer_d]
    handoff: monday_10am_local
    escalation:
      - wait: 5_minutes
        target: secondary_on_call
      - wait: 10_minutes
        target: engineering_manager
      - wait: 15_minutes
        target: vp_engineering

  secondary:
    schedule: weekly_rotation  # offset by 1 week from primary
    purpose: backup if primary doesn't acknowledge

  incident_commander_pool:
    - senior_engineer_1
    - senior_engineer_2
    - engineering_manager_1
    - engineering_manager_2

  roles:
    incident_commander: "Coordinates response, makes decisions, owns timeline"
    communications_lead: "Updates stakeholders, customers, status page"
    technical_lead: "Directs debugging and remediation, assigns tasks"
    scribe: "Documents timeline, decisions, and action items in real-time"
```
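A quick way to sanity-check the schedule outside the paging tool is to derive the current primary from the week number. A rough sketch: the member names match the config above, but the week boundary here follows the Unix epoch rather than the Monday 10am handoff, so treat it as an approximation:

```shell
#!/usr/bin/env bash
# Derive this week's primary from a 4-person weekly rotation.
members=(engineer_a engineer_b engineer_c engineer_d)

week=$(( $(date +%s) / 604800 ))                 # whole weeks since the Unix epoch
primary=${members[$(( week % ${#members[@]} ))]} # rotate through the list
echo "primary on-call: $primary"
```

The same modulo trick, offset by one, gives the secondary rotation described in the config.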

On-Call Best Practices

| Practice | Why |
|---|---|
| Maximum 1 week on-call, 3 weeks off | Prevents burnout — on-call is stressful |
| Compensate on-call (stipend or comp time) | People who are paid for on-call take it seriously |
| Provide a company laptop with VPN | Don’t require personal devices for incident response |
| No on-call during vacations | Override scheduling tools to respect PTO |
| Maximum 2 pages per on-call shift (goal) | More than 2/week = fix the root cause, not the alert |
| Shadow shifts for new team members | Pair with experienced on-call before going solo |

Step 3: Response Playbook Template

Every common incident type should have a specific playbook. Start with your top 5 failure modes.

## Incident Playbook: [Incident Type — e.g., Database Connection Exhaustion]

### Detection
- Alert source: [PagerDuty alert name / Datadog monitor name]
- Initial indicators: [what metrics or errors trigger this playbook]
- Common causes: [deployment, traffic spike, connection leak, dependency failure]

### Triage (First 15 Minutes)
1. Acknowledge alert in PagerDuty (stops escalation timer)
2. Join war room: [#incident-response Slack channel / Zoom link]
3. Assign roles: IC, Comms Lead, Tech Lead, Scribe
4. Assess severity using classification matrix
5. Start incident document from template: [link to template]
6. Post initial update in #incidents channel
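Part of triage step 2 can be pre-wired. A sketch that derives a dated war-room channel name (the `inc-YYYYMMDD-slug` convention matches the `#inc-20250215` channel in the post-mortem timeline later in this guide) and, commented out, creates it via the Slack Web API; `conversations.create` is a real Slack method, but the bot-token setup is an assumption:

```shell
#!/usr/bin/env bash
# Build a dated incident channel name, e.g. inc-20250215-api-errors.
incident_channel_name() {
  echo "inc-$(date -u +%Y%m%d)-$1"
}

# Create the channel (requires a Slack bot token with channels:manage):
# curl -s -X POST https://slack.com/api/conversations.create \
#   -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
#   --data-urlencode "name=$(incident_channel_name api-errors)"

incident_channel_name api-errors
```

Keeping the naming convention in one helper means the scribe, the status page, and the post-mortem all reference the same channel.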

### Diagnosis
1. Check dashboards: [specific links to Grafana/Datadog dashboards]
2. Check recent deployments: [CI/CD deployment history link]
3. Check infrastructure health: [CloudWatch / Kubernetes dashboard]
4. Run diagnostic commands:
   ```bash
   # Check pod status and recent restarts
   kubectl get pods -n production -o wide
   kubectl top pods -n production

   # Check application logs for errors
   kubectl logs -n production -l app=api --tail=200 --since=15m | grep -i error

   # Check API health endpoint
   curl -s https://api.company.com/health | jq .

   # Check database connections
   kubectl exec -n production deployment/api -- \
     psql -c "SELECT count(*) FROM pg_stat_activity;"
   ```

### Remediation Options

Option A: Rollback last deployment (if issue started after a deploy)

```bash
kubectl rollout undo deployment/api -n production
# Verify rollback
kubectl rollout status deployment/api -n production --timeout=120s
```

Option B: Scale up to handle load (if traffic spike is the cause)

```bash
kubectl scale deployment/api --replicas=10 -n production
# Monitor: HPA will scale down automatically when load drops
```

Option C: Restart pods (if connection leak or memory issue)

```bash
kubectl rollout restart deployment/api -n production
```

Option D: Failover to disaster recovery (SEV-1 only, when primary is unrecoverable)

```bash
./scripts/failover-to-dr.sh --region us-west-2 --confirm
```

### Communication Templates

Internal (Slack #incidents):

🔴 SEV-[X] Incident: [Title]
Impact: [what users are experiencing]
IC: @[name] | Tech Lead: @[name]
Status: Investigating | Identified | Monitoring | Resolved
Next update: [time — commit to an update cadence, e.g., every 15 min for SEV-1]

External (Status Page — statuspage.io / Atlassian):

We are currently experiencing [issue description]. This impacts [affected services/features]. Our team is actively working on resolution. We will provide an update by [time].

Executive Update (Email / Slack DM — SEV-1 only):

Subject: SEV-1 Incident Update — [Title]
Current status: [Investigating/Identified/Monitoring/Resolved]
Customer impact: [X users affected, Y% error rate, $Z revenue impact]
ETA to resolution: [best estimate or “investigating”]
Next update: [time]
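Updates stay consistent under pressure if the comms lead fills a function instead of retyping the template. A sketch that renders the internal Slack update from parameters; the function name and argument order are illustrative:

```shell
#!/usr/bin/env bash
# Render the internal incident update from its fields.
# Arguments: sev title impact ic status next_update
incident_update() {
  local sev=$1 title=$2 impact=$3 ic=$4 status=$5 next=$6
  printf '🔴 SEV-%s Incident: %s\nImpact: %s\nIC: @%s\nStatus: %s\nNext update: %s\n' \
    "$sev" "$title" "$impact" "$ic" "$status" "$next"
}

incident_update 1 "API outage" "checkout failing for all users" alice Investigating "14:30 UTC"
```

The same pattern extends to the status-page and executive templates, each as its own renderer.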

### Resolution

1. Verify service is healthy (all dashboard metrics green for 15 minutes)
2. Monitor for regression for 30 minutes post-fix
3. Update status page → Resolved
4. Post final update in #incidents with summary
5. Schedule post-mortem within 48 hours (non-negotiable for SEV-1/SEV-2)
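The "green for 15 minutes" check in step 1 can be scripted rather than eyeballed. A sketch where the health-check command, required consecutive passes, and polling interval (in seconds) are parameters; the curl example in the comment uses a placeholder URL:

```shell
#!/usr/bin/env bash
# Declare recovery only after the check passes N times in a row.
# Arguments: check_command consecutive_passes interval_seconds
wait_for_healthy() {
  local check_cmd=$1 needed=$2 interval=$3 streak=0
  while [ "$streak" -lt "$needed" ]; do
    if $check_cmd; then
      streak=$((streak + 1))   # consecutive pass
    else
      streak=0                 # any failure resets the streak
    fi
    sleep "$interval"
  done
  echo "healthy ${needed}x in a row"
}

# Example (15 minutes at 60s intervals):
# wait_for_healthy "curl -sf https://api.company.com/health" 15 60
```

Resetting the streak on any failure is the point: a flapping service never satisfies the check, which is exactly when you should not mark the incident resolved.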

Step 4: Post-Mortem Framework

Post-mortems are blameless. The goal is to prevent recurrence, not assign blame. If your culture punishes people for incidents, people will hide incidents.

```markdown
## Post-Mortem: [Incident Title]

**Date:** [date of incident]
**Duration:** [start time – end time (total: X minutes)]
**Severity:** SEV-[X]
**Incident Commander:** [name]
**Author:** [who wrote this post-mortem]

### Summary
[2-3 sentence description of what happened, what was impacted, and how it was resolved]

### Timeline
| Time (UTC) | Event |
|---|---|
| 14:02 | Alert triggered: API error rate > 5% (PagerDuty) |
| 14:05 | On-call acknowledged, created #inc-20250215 Slack channel |
| 14:08 | IC assigned, roles distributed |
| 14:12 | Root cause identified: database connection pool exhausted after deploy |
| 14:15 | Rollback initiated: kubectl rollout undo |
| 14:18 | Fix applied: connection pool increased from 20 to 100 |
| 14:25 | Service recovered, monitoring period started |
| 14:45 | Monitoring clear — all metrics nominal |
| 14:55 | Incident resolved, status page updated |

### Root Cause
[Detailed technical explanation — what specifically broke and why. Include code references if relevant]

### Contributing Factors
[What conditions made this incident possible? Missing tests? Missing monitoring? Configuration gap?]

### Impact
- [X] customers affected (Y% of total traffic)
- [Z] minutes of degraded service
- $[amount] estimated revenue impact
- [X] support tickets generated

### What Went Well
- Alert fired within 2 minutes of impact starting
- War room assembled in 5 minutes (under our 15-min target)
- Root cause identified in 10 minutes (team knew where to look)
- Fix applied in 15 minutes

### What Could Be Improved
- Connection pool limits weren't monitored (no alert existed)
- Runbook didn't cover this specific failure mode
- Customer communication was delayed by 20 minutes
- The deploy that caused this had no canary phase

### Action Items
| Action | Owner | Due Date | Priority |
|---|---|---|---|
| Add connection pool monitoring alert | @engineer | [date] | P1 |
| Update runbook with DB pool exhaustion playbook | @sre | [date] | P1 |
| Implement canary deployments for API service | @platform | [date] | P2 |
| Load test with 2x normal traffic | @qa | [date] | P2 |
| Automate customer status page notification | @platform | [date] | P3 |

```

Step 5: Tabletop Exercises

Run simulated incidents quarterly. This is the single highest-value activity in incident response.

| Exercise Type | Frequency | Duration | Participants |
|---|---|---|---|
| Walk-through (review playbook verbally) | Monthly | 30 minutes | On-call rotation |
| Tabletop (simulated scenario, discuss response) | Quarterly | 1-2 hours | Engineering + management |
| Game Day (inject real failure in staging) | Semi-annually | Half day | Full engineering team |
| Chaos Engineering (automated failure injection) | Ongoing | Automated | Production (with safeguards) |

Incident Response Checklist

- Severity classification defined, documented, and trained (all engineers can classify)
- On-call rotation configured with escalation paths (PagerDuty/Opsgenie)
- War room channel and bridge set up (Slack channel + Zoom link)
- Playbooks written for top 5 incident types
- Communication templates drafted (internal, external, executive)
- Status page configured (statuspage.io, Atlassian, Instatus)
- Post-mortem template standardized and blameless culture established
- Tabletop exercises scheduled quarterly (with scenarios pre-written)
- Action items from post-mortems tracked to completion (not abandoned)
- Metrics tracked: MTTA, MTTR, incidents per month, action item close rate
- On-call compensation established (stipend, comp time, or both)
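The MTTA/MTTR metrics in the checklist are straightforward to compute from incident records. A sketch assuming a simple CSV export of `id,detected_epoch,acknowledged_epoch,resolved_epoch`; the field layout is an illustration, not any tool's native format:

```shell
#!/usr/bin/env bash
# Compute mean time to acknowledge (MTTA) and mean time to resolve
# (MTTR), in minutes, from incident records on stdin.
compute_mttr() {
  awk -F, '{ tta += ($3 - $2); ttr += ($4 - $2); n++ }
           END { printf "MTTA=%.1f min MTTR=%.1f min\n", tta/n/60, ttr/n/60 }'
}

# Two sample incidents: acked after 2 min / 1 min, resolved after 30 min / 20 min.
printf '1,1000,1120,2800\n2,2000,2060,3200\n' | compute_mttr
# prints MTTA=1.5 min MTTR=25.0 min
```

Tracking these per month makes it obvious whether post-mortem action items are actually reducing response time.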

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For SRE consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
