DevOps Incident Management & On-Call
Build sustainable on-call practices. Covers incident response frameworks, escalation policies, on-call rotations, runbooks, communication protocols, and reducing alert fatigue.
On-call is often the most dreaded part of engineering work, largely because many organizations run it badly. Pages at 3 AM for non-urgent issues. Alerts with no runbooks. Escalation paths that dead-end. Weeks-long rotations that burn out engineers. Good on-call practices make the difference between a sustainable engineering culture and one that hemorrhages talent.
Incident Response Framework
DETECT → TRIAGE → RESPOND → RESOLVE → REVIEW
1. DETECT (automated)
   Alert fires via monitoring/observability
2. TRIAGE (< 5 min)
   On-call acknowledges the alert and assesses severity
3. RESPOND (severity-dependent)
   SEV-1: Incident commander, war room, all-hands
   SEV-2: On-call + relevant team
   SEV-3: On-call handles solo
4. RESOLVE
   Mitigate impact first, then fix the root cause
5. REVIEW
   Blameless postmortem within 72 hours
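The RESPOND and REVIEW steps above can be sketched as a small lookup. This is a minimal illustration, not a prescribed implementation: the names `RESPONDERS`, `ResponsePlan`, and `plan_response` are hypothetical, and the 72-hour window is assumed to start at resolution, which the framework does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Responders per severity, copied from the RESPOND step of the framework.
RESPONDERS = {
    "SEV-1": ["incident commander", "war room", "all-hands"],
    "SEV-2": ["on-call", "relevant team"],
    "SEV-3": ["on-call"],
}

# REVIEW step: blameless postmortem within 72 hours.
# Assumption: the clock starts when the incident is resolved.
POSTMORTEM_WINDOW = timedelta(hours=72)


@dataclass
class ResponsePlan:
    severity: str
    responders: list[str]
    postmortem_due: datetime


def plan_response(severity: str, resolved_at: datetime) -> ResponsePlan:
    """Return who responds at this severity and when the postmortem is due."""
    if severity not in RESPONDERS:
        raise ValueError(f"unknown severity: {severity!r}")
    return ResponsePlan(
        severity=severity,
        responders=RESPONDERS[severity],
        postmortem_due=resolved_at + POSTMORTEM_WINDOW,
    )
```

Keeping the mapping in one table means a change to the severity definitions only has to be made in one place.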
Roles During Incidents
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates response, makes decisions, communicates status |
| Technical Lead | Leads investigation and mitigation |
| Communications | Updates status page, Slack channels, customers |
| Scribe | Documents timeline as events happen |
| Subject Matter Expert | Called in for specific system expertise |
On-Call Rotation
on_call:
  rotation:
    type: weekly
    handoff: Monday 10:00 AM (during work hours)
    team_size_minimum: 4 # No one on-call more than 25%
  compensation:
    on_call_pay: $500/week flat
    page_bonus: $50 per page outside business hours
  expectations:
    response_time: 15 minutes
    laptop_and_connectivity: required
    alcohol_policy: "stay capable of clear-headed decisions"
  wellbeing:
    max_consecutive_weeks: 1
    post_incident_cooldown: "If paged 3+ times overnight, take next morning off"
    quarterly_review: "Review page frequency, reduce unnecessary alerts"
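To make the rotation rules above concrete, here is a rough sketch of a round-robin weekly schedule, the weekly pay arithmetic, and the cooldown rule. The function names and engineer names are hypothetical, and how a page is classified as "outside business hours" or "overnight" is left to the caller because the config does not define those windows.

```python
from datetime import date, timedelta

ON_CALL_PAY = 500    # flat rate per week, from the config above
PAGE_BONUS = 50      # per page outside business hours
MIN_TEAM_SIZE = 4    # keeps each person's on-call share at or below 25%


def weekly_schedule(engineers: list[str], first_monday: date, weeks: int) -> list[tuple[date, str]]:
    """Round-robin rotation: one engineer per week, handoff every Monday."""
    if len(engineers) < MIN_TEAM_SIZE:
        raise ValueError(f"need at least {MIN_TEAM_SIZE} engineers in the rotation")
    if first_monday.weekday() != 0:  # Monday is 0
        raise ValueError("handoff date must be a Monday")
    return [
        (first_monday + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def weekly_pay(off_hours_pages: int) -> int:
    """Flat weekly pay plus the bonus for each page outside business hours."""
    return ON_CALL_PAY + PAGE_BONUS * off_hours_pages


def needs_cooldown(overnight_pages: int) -> bool:
    """Cooldown rule: 3 or more overnight pages earns the next morning off."""
    return overnight_pages >= 3
```

With at least four engineers, a round-robin order also satisfies `max_consecutive_weeks: 1`, since nobody is scheduled for two weeks in a row.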
Communication Templates
## Incident Declared (Slack/Status Page)
🔴 **INCIDENT: [Service Name] — [Brief Description]**
**Severity:** SEV-1
**Impact:** [What users are experiencing]
**Start time:** [Timestamp]
**Incident Commander:** @name
**Status:** Investigating
Updates every 30 minutes.
## Incident Resolved
🟢 **RESOLVED: [Service Name] — [Brief Description]**
**Duration:** [Total duration, e.g. 47 minutes]
**Root Cause:** [One-line summary]
**Impact:** [Number of users affected]
**Postmortem:** Scheduled for [date]
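Filling in the resolution template by hand is error-prone during an incident, so one option is to render it from the incident timestamps. The sketch below is illustrative only: `format_duration` and `resolved_message` are hypothetical helpers, and the title separator is simplified to a plain hyphen.

```python
from datetime import datetime


def format_duration(started_at: datetime, resolved_at: datetime) -> str:
    """Whole minutes between incident start and resolution."""
    minutes = int((resolved_at - started_at).total_seconds() // 60)
    return f"{minutes} minutes"


def resolved_message(
    service: str,
    description: str,
    started_at: datetime,
    resolved_at: datetime,
    root_cause: str,
    users_affected: int,
    postmortem_date: str,
) -> str:
    """Render the 'Incident Resolved' template as a single message string."""
    lines = [
        f"🟢 **RESOLVED: {service} - {description}**",
        f"**Duration:** {format_duration(started_at, resolved_at)}",
        f"**Root Cause:** {root_cause}",
        f"**Impact:** {users_affected} users affected",
        f"**Postmortem:** Scheduled for {postmortem_date}",
    ]
    return "\n".join(lines)
```

Computing the duration from the recorded start and end times keeps the message consistent with the scribe's timeline.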
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No on-call compensation | Engineers resent on-call duty | Pay flat rate + per-page bonus |
| Pages for non-actionable alerts | Alert fatigue, pages ignored | Every page must have an action |
| Solo on-call for complex systems | Burnout, knowledge gap | Primary + secondary on-call |
| No handoff notes | New on-call unaware of state | Structured handoff with active issues |
| Postmortems only for SEV-1 | Miss learning from smaller incidents | Postmortem for SEV-1 and SEV-2 |
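The fix for non-actionable pages can be checked mechanically. The sketch below assumes a hypothetical `Alert` record with a runbook link and a stated action; it is not tied to any particular monitoring tool's alert format.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    pages: bool            # True if the alert pages the on-call engineer
    runbook_url: str = ""  # empty string means no runbook is linked
    action: str = ""       # what the responder is expected to do


def non_actionable_pages(alerts: list[Alert]) -> list[str]:
    """Names of paging alerts that lack a runbook or a defined action."""
    return [
        alert.name
        for alert in alerts
        if alert.pages and not (alert.runbook_url and alert.action)
    ]
```

Running a check like this during the quarterly alert review gives a concrete list of pages to fix, downgrade, or delete.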
Checklist
- On-call rotation: minimum 4 people, weekly rotation
- Escalation policy: documented, tested
- Response time SLA: 15 min for SEV-1, 30 min for SEV-2
- Runbooks: every page alert has a runbook
- Status page: public, updated during incidents
- Communication: templates for declaration, updates, resolution
- Compensation: on-call pay + per-page bonus
- Postmortems: blameless, within 72 hours, action items tracked
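One way to make an escalation policy testable is to express it as a pure function. The sketch below uses the response-time SLAs from the checklist and assumes, hypothetically, that an unacknowledged page moves from the primary to the secondary on-call once the SLA has passed. The checklist gives no SLA for SEV-3, so the function rejects it rather than guessing.

```python
from datetime import datetime, timedelta

# Response-time SLAs from the checklist above.
ACK_SLA = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(minutes=30),
}


def escalation_target(
    severity: str,
    paged_at: datetime,
    now: datetime,
    primary: str,
    secondary: str,
) -> str:
    """Who should hold an unacknowledged page at time `now`.

    Assumption: escalate to the secondary once the SLA has elapsed.
    """
    if severity not in ACK_SLA:
        raise ValueError(f"no response-time SLA defined for {severity!r}")
    if now - paged_at > ACK_SLA[severity]:
        return secondary
    return primary
```

Because the function takes `now` as an argument instead of reading the clock, the policy can be exercised in unit tests with fixed timestamps.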
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For incident management consulting, visit garnetgrid.com.
:::