DevOps Incident Management & On-Call

On-call is often the most dreaded part of engineering work, largely because so many organizations run it badly: pages at 3 AM for non-urgent issues, alerts with no runbooks, escalation paths that dead-end, and weeks-long rotations that burn engineers out. Good on-call practices make the difference between a sustainable engineering culture and one that hemorrhages talent.

Incident Response Framework

DETECT → TRIAGE → RESPOND → RESOLVE → REVIEW

1. DETECT (automated)
   Alert fires via monitoring/observability
   
2. TRIAGE (< 5 min)
   On-call acknowledges the page and assesses severity
   
3. RESPOND (severity-dependent)
   SEV-1: Incident commander, war room, all-hands
   SEV-2: On-call + relevant team
   SEV-3: On-call handles solo
   
4. RESOLVE
   Mitigate impact first, then fix the root cause
   
5. REVIEW
   Blameless postmortem within 72 hours
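
The framework above can be sketched as a small lookup plus two deadline helpers. This is a minimal illustration, not a reference implementation: the names `RESPONDERS`, `responders_for`, `triage_deadline`, and `postmortem_deadline` are hypothetical, and the starting points of the 5-minute and 72-hour windows (alert fire time and resolution time) are assumptions, since the framework does not say when each clock starts.

```python
from datetime import datetime, timedelta

# Responders per severity, copied from the RESPOND step of the framework.
RESPONDERS = {
    "SEV-1": ["incident commander", "war room", "all-hands"],
    "SEV-2": ["on-call", "relevant team"],
    "SEV-3": ["on-call"],
}

TRIAGE_WINDOW = timedelta(minutes=5)     # TRIAGE (< 5 min)
POSTMORTEM_WINDOW = timedelta(hours=72)  # REVIEW within 72 hours


def responders_for(severity: str) -> list[str]:
    """Return who is engaged for a given severity label."""
    try:
        return list(RESPONDERS[severity])
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


def triage_deadline(alert_fired_at: datetime) -> datetime:
    # Assumption: the 5-minute triage window starts when the alert fires.
    return alert_fired_at + TRIAGE_WINDOW


def postmortem_deadline(resolved_at: datetime) -> datetime:
    # Assumption: the 72-hour review window starts at resolution.
    return resolved_at + POSTMORTEM_WINDOW
```

Keeping the mapping in one table means a severity change during an incident is a single lookup, and an unknown label fails loudly instead of silently paging nobody.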

Roles During Incidents

Role	Responsibility
Incident Commander	Coordinates response, makes decisions, communicates status
Technical Lead	Leads investigation and mitigation
Communications	Updates status page, Slack channels, customers
Scribe	Documents timeline as events happen
Subject Matter Expert	Called in for specific system expertise

On-Call Rotation

on_call:
  rotation:
    type: weekly
    handoff: Monday 10:00 AM (during work hours)
    
  team_size_minimum: 4  # With a weekly rotation, no one is on call more than 25% of the time
  
  compensation:
    on_call_pay: $500/week flat
    page_bonus: $50 per page outside business hours
    
  expectations:
    response_time: 15 minutes
    laptop_and_connectivity: required
    alcohol_policy: "stay capable of clear-headed decisions"
    
  wellbeing:
    max_consecutive_weeks: 1
    post_incident_cooldown: "If paged 3+ times overnight, take next morning off"
    quarterly_review: "Review page frequency, reduce unnecessary alerts"
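
The arithmetic behind this config is simple enough to write down. The sketch below assumes one person on call per week and an even round-robin, and the function names are illustrative only; the dollar amounts and the three-page threshold come from the config above.

```python
ON_CALL_PAY_PER_WEEK = 500  # USD, flat weekly rate from the config
PAGE_BONUS = 50             # USD per page outside business hours
OVERNIGHT_PAGE_LIMIT = 3    # "If paged 3+ times overnight, take next morning off"


def on_call_share(team_size: int) -> float:
    """Fraction of weeks each engineer is on call in an even weekly rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 1 / team_size


def weekly_pay(off_hours_pages: int) -> int:
    """Flat weekly pay plus the per-page bonus for off-hours pages."""
    if off_hours_pages < 0:
        raise ValueError("off_hours_pages cannot be negative")
    return ON_CALL_PAY_PER_WEEK + PAGE_BONUS * off_hours_pages


def needs_morning_off(overnight_pages: int) -> bool:
    """Apply the post-incident cooldown rule."""
    return overnight_pages >= OVERNIGHT_PAGE_LIMIT
```

With four engineers, `on_call_share(4)` is 0.25, which is where the 25% ceiling comes from; a week with three off-hours pages pays `weekly_pay(3)`, or $650.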

Communication Templates

## Incident Declared (Slack/Status Page)

🔴 **INCIDENT: [Service Name] — [Brief Description]**
**Severity:** [SEV-1, SEV-2, or SEV-3]
**Impact:** [What users are experiencing]
**Start time:** [Timestamp]
**Incident Commander:** @name
**Status:** Investigating

Updates every 30 minutes.

## Incident Resolved

🟢 **RESOLVED: [Service Name] — [Brief Description]**
**Duration:** [Total time to resolution, e.g. 47 minutes]
**Root Cause:** [One-line summary]
**Impact:** [Number of users affected]
**Postmortem:** Scheduled for [date]
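
Filling in the resolution template by hand under pressure invites mistakes, especially in the duration field. A minimal sketch of rendering it from incident data follows; `format_duration` and `render_resolved` are hypothetical helpers, not part of any chat or status page API, and the title line uses a plain hyphen as its separator.

```python
from datetime import datetime


def format_duration(started_at: datetime, resolved_at: datetime) -> str:
    """Whole minutes between start and resolution, e.g. '47 minutes'."""
    if resolved_at < started_at:
        raise ValueError("resolved_at is before started_at")
    minutes = int((resolved_at - started_at).total_seconds() // 60)
    return f"{minutes} minute" + ("" if minutes == 1 else "s")


def render_resolved(service: str, description: str,
                    started_at: datetime, resolved_at: datetime,
                    root_cause: str, users_affected: int,
                    postmortem_date: str) -> str:
    """Render the 'Incident Resolved' message, one field per line."""
    return "\n".join([
        # Plain hyphen used as the title separator in this sketch.
        f"🟢 **RESOLVED: {service} - {description}**",
        f"**Duration:** {format_duration(started_at, resolved_at)}",
        f"**Root Cause:** {root_cause}",
        f"**Impact:** {users_affected} users affected",
        f"**Postmortem:** Scheduled for {postmortem_date}",
    ])
```

Computing the duration from the recorded start and resolution timestamps keeps the message consistent with the scribe's timeline.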

Anti-Patterns

Anti-Pattern	Problem	Fix
No on-call compensation	Engineers resent on-call duty	Pay flat rate + per-page bonus
Pages for non-actionable alerts	Alert fatigue, pages ignored	Every page must have an action
Solo on-call for complex systems	Burnout, knowledge gap	Primary + secondary on-call
No handoff notes	New on-call unaware of state	Structured handoff with active issues
Postmortems only for SEV-1	Miss learning from smaller incidents	Postmortem for SEV-1 and SEV-2
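
One way to implement the primary + secondary fix is a round-robin in which this week's secondary becomes next week's primary, so the incoming primary has already seen the active issues. That pairing rule is an assumption made for illustration, and `on_call_pair` is a hypothetical helper; the table only asks for two people on call.

```python
def on_call_pair(team: list[str], week_index: int) -> tuple[str, str]:
    """Return (primary, secondary) for a given week of the rotation.

    Assumption: the secondary is the engineer who takes over as
    primary the following week.
    """
    if len(team) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    primary = team[week_index % len(team)]
    secondary = team[(week_index + 1) % len(team)]
    return primary, secondary
```

Note the trade-off: with a team of four, each engineer is primary one week in four but carries some on-call duty two weeks in four.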

Checklist

On-call rotation: minimum 4 people, weekly rotation
Escalation policy: documented, tested
Response time SLA: 15 min for SEV-1, 30 min for SEV-2
Runbooks: every page alert has a runbook
Status page: public, updated during incidents
Communication: templates for declaration, updates, resolution
Compensation: on-call pay + per-page bonus
Postmortems: blameless, within 72 hours, action items tracked
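
Two checklist items lend themselves to automated checks: every paging alert has a runbook, and responses land inside the stated windows. The sketch below assumes alert rules can be exported into a simple structure; `AlertRule`, `alerts_missing_runbooks`, and `within_sla` are hypothetical names, and no specific monitoring tool's API is implied.

```python
from dataclasses import dataclass
from typing import Optional

# Response windows in minutes, taken from the checklist.
RESPONSE_SLA_MINUTES = {"SEV-1": 15, "SEV-2": 30}


@dataclass
class AlertRule:
    name: str
    pages: bool                        # True if the alert pages a human
    runbook_url: Optional[str] = None  # link to the runbook, if any


def alerts_missing_runbooks(rules: list[AlertRule]) -> list[str]:
    """Names of paging alerts that have no runbook attached."""
    return [r.name for r in rules if r.pages and not r.runbook_url]


def within_sla(severity: str, minutes_to_respond: float) -> Optional[bool]:
    """True/False against the checklist SLA, or None if no SLA is defined."""
    sla = RESPONSE_SLA_MINUTES.get(severity)
    if sla is None:
        return None
    return minutes_to_respond <= sla
```

Running `alerts_missing_runbooks` in CI against the alert definitions would turn the runbook item from a policy into a failing build whenever a new paging alert ships without one.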

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For incident management consulting, visit garnetgrid.com.
:::
