
DevOps Incident Management & On-Call

Build sustainable on-call practices. Covers incident response frameworks, escalation policies, on-call rotations, runbooks, communication protocols, and reducing alert fatigue.

On-call is often the most dreaded part of engineering work, largely because many organizations run it badly. Pages at 3 AM for non-urgent issues. Alerts with no runbooks. Escalation paths that dead-end. Weeks-long rotations that burn out engineers. Good on-call practices make the difference between a sustainable engineering culture and one that hemorrhages talent.


Incident Response Framework

DETECT → TRIAGE → RESPOND → RESOLVE → REVIEW

1. DETECT (automated)
   Alert fires via monitoring/observability
   
2. TRIAGE (< 5 min)
   On-call acknowledges and assesses severity
   
3. RESPOND (severity-dependent)
   SEV-1: Incident commander, war room, all-hands
   SEV-2: On-call + relevant team
   SEV-3: On-call handles solo
   
4. RESOLVE
   Mitigate impact first, then fix the root cause
   
5. REVIEW
   Blameless postmortem within 72 hours
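
The five phases and the severity-dependent response above can be sketched as a small state tracker. This is a minimal illustration, not a prescribed implementation: the `Incident` class, the `Severity` enum, and the example incident title are hypothetical names introduced here, and the responder lists simply restate the RESPOND step.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Phases in the order the framework lists them.
PHASES = ["DETECT", "TRIAGE", "RESPOND", "RESOLVE", "REVIEW"]

# Who gets pulled in at the RESPOND step, restating the framework above.
RESPONDERS = {
    Severity.SEV1: ["incident commander", "war room", "all hands"],
    Severity.SEV2: ["on-call", "relevant team"],
    Severity.SEV3: ["on-call"],
}


class Incident:
    """Tracks the current phase of an incident and refuses to skip phases."""

    def __init__(self, title: str, severity: Severity):
        self.title = title
        self.severity = severity
        self.phase = PHASES[0]

    def advance(self) -> str:
        """Move to the next phase; REVIEW is the last one."""
        index = PHASES.index(self.phase)
        if index == len(PHASES) - 1:
            raise ValueError("incident is already in REVIEW")
        self.phase = PHASES[index + 1]
        return self.phase

    def responders(self) -> list[str]:
        return RESPONDERS[self.severity]


# Hypothetical example incident.
incident = Incident("checkout errors", Severity.SEV2)
incident.advance()  # DETECT -> TRIAGE
incident.advance()  # TRIAGE -> RESPOND
print(incident.phase, incident.responders())
```

Modelling the phases as an ordered list keeps the "no skipped steps" rule in one place, so a REVIEW cannot be recorded before the incident has passed through RESOLVE.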

Roles During Incidents

| Role | Responsibility |
| --- | --- |
| Incident Commander | Coordinates response, makes decisions, communicates status |
| Technical Lead | Leads investigation and mitigation |
| Communications | Updates status page, Slack channels, customers |
| Scribe | Documents timeline as events happen |
| Subject Matter Expert | Called in for specific system expertise |

On-Call Rotation

on_call:
  rotation:
    type: weekly
    handoff: "Monday 10:00 AM"  # during work hours
    
  team_size_minimum: 4  # No one on-call more than 25%
  
  compensation:
    on_call_pay: $500/week flat
    page_bonus: $50 per page outside business hours
    
  expectations:
    response_time: 15 minutes
    laptop_and_connectivity: required
    alcohol_policy: "stay capable of clear-headed decisions"
    
  wellbeing:
    max_consecutive_weeks: 1
    post_incident_cooldown: "If paged 3+ times overnight, take next morning off"
    quarterly_review: "Review page frequency, reduce unnecessary alerts"
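
As a rough sketch of how the policy above could be checked in code, the following builds a weekly round-robin schedule and computes one week's pay. The function names, the engineer names, and the start date are assumptions made for illustration; the 4-person minimum, Monday handoff, $500 flat rate, and $50 per after-hours page come from the policy.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], first_monday: date,
                    weeks: int) -> list[tuple[date, str]]:
    """Round-robin schedule: one engineer per week, handoff each Monday."""
    if len(engineers) < 4:
        raise ValueError("need at least 4 engineers to keep each at or under 25%")
    if first_monday.weekday() != 0:  # Monday is 0
        raise ValueError("handoff date must be a Monday")
    return [
        (first_monday + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def weekly_pay(after_hours_pages: int, flat: int = 500, per_page: int = 50) -> int:
    """Flat on-call pay plus a bonus for each page outside business hours."""
    return flat + per_page * after_hours_pages


# Hypothetical team and start date (2024-01-01 is a Monday).
schedule = weekly_rotation(["ana", "ben", "chi", "dev"], date(2024, 1, 1), 8)
for start, engineer in schedule:
    print(start.isoformat(), engineer)
print(weekly_pay(after_hours_pages=3))  # 500 + 3 * 50 = 650
```

A strict round-robin also satisfies `max_consecutive_weeks: 1` automatically, because no one appears in two adjacent weeks when the team has more than one member.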

Communication Templates

## Incident Declared (Slack/Status Page)

🔴 **INCIDENT: [Service Name] — [Brief Description]**
**Severity:** SEV-1
**Impact:** [What users are experiencing]
**Start time:** [Timestamp]
**Incident Commander:** @name
**Status:** Investigating

Updates every 30 minutes.

## Incident Resolved

🟢 **RESOLVED: [Service Name] — [Brief Description]**
**Duration:** [Total duration, e.g. 47 minutes]
**Root Cause:** [One-line summary]
**Impact:** [Number of users affected]
**Postmortem:** Scheduled for [date]
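
One way to keep unfilled placeholders out of a posted message is to fill the bracketed fields programmatically and fail loudly when a value is missing. The sketch below assumes the `[Placeholder]` convention used in the templates; `fill_template`, `duration_minutes`, and the sample values are hypothetical.

```python
import re
from datetime import datetime

# Matches a bracketed placeholder such as [Timestamp].
PLACEHOLDER = re.compile(r"\[([^\]]+)\]")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each [Placeholder] with its value; fail if one is missing."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder [{key}]")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


def duration_minutes(start: datetime, end: datetime) -> int:
    """Whole minutes between incident start and resolution."""
    return int((end - start).total_seconds() // 60)


# Two lines excerpted from the declaration template; values are made up.
template = "**Impact:** [What users are experiencing]\n**Start time:** [Timestamp]"
start = datetime(2024, 1, 1, 3, 0)
end = datetime(2024, 1, 1, 3, 47)

print(fill_template(template, {
    "What users are experiencing": "Checkout requests failing",
    "Timestamp": start.isoformat(),
}))
print(f"**Duration:** {duration_minutes(start, end)} minutes")
```

Raising on a missing value is a deliberate choice here: a status update that still contains a literal `[Timestamp]` is worse than a short delay while someone supplies it.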

Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| No on-call compensation | Engineers resent on-call duty | Pay flat rate + per-page bonus |
| Pages for non-actionable alerts | Alert fatigue, pages ignored | Every page must have an action |
| Solo on-call for complex systems | Burnout, knowledge gap | Primary + secondary on-call |
| No handoff notes | New on-call unaware of state | Structured handoff with active issues |
| Postmortems only for SEV-1 | Miss learning from smaller incidents | Postmortem for SEV-1 and SEV-2 |
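
The "every page must have an action" fix lends itself to an automated audit of alert definitions. The following is a sketch under assumptions: the `AlertRule` fields, the alert names, and the `runbooks.example.com` URL are hypothetical and do not correspond to any particular monitoring tool's schema.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    pages: bool            # True if the alert pages the on-call engineer
    action: str = ""       # what the responder is expected to do
    runbook_url: str = ""  # where the steps are written down


def audit(rules: list[AlertRule]) -> list[str]:
    """Return one finding per paging alert that lacks an action or a runbook."""
    findings = []
    for rule in rules:
        if not rule.pages:
            continue  # non-paging alerts are out of scope for this check
        if not rule.action.strip():
            findings.append(f"{rule.name}: pages but has no action")
        if not rule.runbook_url.strip():
            findings.append(f"{rule.name}: pages but has no runbook")
    return findings


# Hypothetical alert definitions.
rules = [
    AlertRule("api-5xx-rate", pages=True, action="roll back last deploy",
              runbook_url="https://runbooks.example.com/api-5xx"),
    AlertRule("disk-70-percent", pages=True),
    AlertRule("nightly-job-slow", pages=False),
]
for finding in audit(rules):
    print(finding)
```

Running a check like this in the quarterly alert review, or in CI for alert configuration, gives a concrete list of pages to either document or downgrade.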

Checklist

  • On-call rotation: minimum 4 people, weekly rotation
  • Escalation policy: documented, tested
  • Response time SLA: 15 min for SEV-1, 30 min for SEV-2
  • Runbooks: every page alert has a runbook
  • Status page: public, updated during incidents
  • Communication: templates for declaration, updates, resolution
  • Compensation: on-call pay + per-page bonus
  • Postmortems: blameless, within 72 hours, action items tracked
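
Two of the checklist items are time limits that can be verified mechanically. The sketch below uses the 15 and 30 minute response times and the 72-hour postmortem window from the checklist; measuring the 72 hours from the moment of resolution is an assumption, since the guide does not say when the clock starts, and the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Response-time targets from the checklist; SEV-3 has none listed.
ACK_SLA = {"SEV-1": timedelta(minutes=15), "SEV-2": timedelta(minutes=30)}
POSTMORTEM_WINDOW = timedelta(hours=72)


def ack_within_sla(severity: str, paged_at: datetime, acked_at: datetime) -> bool:
    """True if the page was acknowledged within the target for its severity."""
    if severity not in ACK_SLA:
        raise ValueError(f"no response-time target defined for {severity}")
    return acked_at - paged_at <= ACK_SLA[severity]


def postmortem_due(resolved_at: datetime) -> datetime:
    """Latest time for the postmortem, assuming the window starts at resolution."""
    return resolved_at + POSTMORTEM_WINDOW


# Hypothetical timestamps.
paged = datetime(2024, 1, 1, 3, 0)
acked = datetime(2024, 1, 1, 3, 12)
resolved = datetime(2024, 1, 1, 3, 47)

print(ack_within_sla("SEV-1", paged, acked))
print(postmortem_due(resolved).isoformat())
```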

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For incident management consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
