
On-Call Engineering: Building Sustainable Incident Response Programs

How to design on-call rotations, escalation policies, and incident response workflows that are sustainable, fair, and effective at keeping services reliable.

On-call is the operational backbone of reliable systems. Done well, it enables rapid incident response without burning out engineers. Done poorly, it drives attrition and creates a toxic feedback loop where the most experienced engineers leave because they’re paged the most.

On-Call Program Design

Rotation Structures

Primary/Secondary Model: The most common pattern. The primary responder handles alerts first; the secondary steps in if the primary doesn't acknowledge within the SLA.

Follow-the-Sun: Distributed teams rotate coverage with the clock so no one is paged outside business hours. Requires teams in at least three time zones.

Specialist Rotation: Domain experts cover specific subsystems (database, network, application). Works for large organizations but creates knowledge silos.

Rotation Cadence

| Cadence | Pros | Cons |
|---|---|---|
| Weekly | Simple, predictable | Fatiguing if alert volume is high |
| Bi-weekly | Less frequent rotation overhead | Can feel long during bad weeks |
| Daily | Distributes pain evenly | Constant context switching |
| Split shift | Nobody on-call overnight | Requires more people |
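
For teams that generate schedules programmatically, a weekly primary/secondary rotation falls out of the roster directly. A minimal sketch in Python, assuming a flat roster list (the names below are placeholders, not a real team or any scheduling tool's API):

```python
from datetime import date, timedelta

# Hypothetical roster; names are placeholders.
ENGINEERS = ["alice", "bob", "carol", "dave", "erin"]

def weekly_rotation(start: date, weeks: int):
    """Yield (week_start, primary, secondary) for a weekly
    primary/secondary rotation that cycles through the roster."""
    for week in range(weeks):
        primary = ENGINEERS[week % len(ENGINEERS)]
        secondary = ENGINEERS[(week + 1) % len(ENGINEERS)]
        yield start + timedelta(weeks=week), primary, secondary

for week_start, primary, secondary in weekly_rotation(date(2025, 1, 6), 6):
    print(f"{week_start}: primary={primary}, secondary={secondary}")
```

Staggering the secondary one position ahead also means each engineer serves as secondary the week before their primary week, which doubles as a warm-up.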

Compensation Models

On-call work is real work and should be compensated:

  • Flat on-call stipend — Fixed amount per on-call shift (e.g., $500/week)
  • Per-incident bonus — Additional pay per page received
  • Comp time — Time off equivalent to time spent responding
  • Shift differential — Higher multiplier for nights/weekends
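
These models combine naturally. A quick sketch of the arithmetic for a stipend plus per-incident bonus with a night/weekend differential; every rate and multiplier below is an illustrative assumption, not a recommendation:

```python
# Illustrative on-call pay for one shift. All rates are assumptions.
BASE_STIPEND = 500.00           # flat amount per weekly shift
PER_INCIDENT_BONUS = 25.00      # paid per acknowledged page
NIGHT_WEEKEND_MULTIPLIER = 2.0  # shift differential for after-hours pages

def shift_pay(daytime_pages: int, after_hours_pages: int) -> float:
    """Return total compensation for one on-call shift."""
    return (
        BASE_STIPEND
        + daytime_pages * PER_INCIDENT_BONUS
        + after_hours_pages * PER_INCIDENT_BONUS * NIGHT_WEEKEND_MULTIPLIER
    )

print(shift_pay(daytime_pages=3, after_hours_pages=2))  # 500 + 75 + 100 = 675.0
```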

Escalation Policies

Alert fires
  → Page primary on-call (5 min ack SLA)
    → No ack? Page secondary (5 min ack SLA)
      → No ack? Page engineering manager + Slack #incident channel
        → Still no ack? Page VP of Engineering + auto-bridge
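
Paging tools usually express this chain as a policy with per-step acknowledgement timeouts. The sketch below models the same chain in plain Python; the targets and five-minute timeouts mirror the diagram above, but the data structure itself is only an assumption, not any vendor's API:

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    target: str            # who (or what channel) gets paged
    ack_timeout_min: int   # minutes to wait for an ack before escalating

# Mirrors the escalation chain above; values are illustrative.
ESCALATION_POLICY = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary on-call", 5),
    EscalationStep("engineering manager + #incident channel", 5),
    EscalationStep("VP of Engineering + auto-bridge", 5),
]

def next_step(minutes_since_alert: int) -> EscalationStep:
    """Return the step that should currently hold the page,
    assuming no one has acknowledged yet."""
    elapsed = 0
    for step in ESCALATION_POLICY:
        elapsed += step.ack_timeout_min
        if minutes_since_alert < elapsed:
            return step
    return ESCALATION_POLICY[-1]  # chain exhausted; stay at the top

print(next_step(12).target)  # 12 minutes with no ack -> third step
```

Encoding the timers in the policy rather than in people's heads is exactly what rule 4 below ("auto-escalation timers") is about.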

Escalation Rules

  1. Acknowledge ≠ Resolve — Acking means “I’m looking at it,” not “it’s fixed”
  2. Escalation is not failure — Encourage early escalation for complex issues
  3. Skip-level escalation — Allow responders to escalate directly to specialists
  4. Auto-escalation timers — Never rely on humans to manually escalate

Alert Quality

The number one predictor of on-call burnout is alert noise. Poor alert quality means engineers get paged for things that don’t matter, which teaches them to ignore alerts — including the ones that do matter.

Alert Hygiene

  • Actionable — Every alert should have a clear remediation path
  • Urgent — If it can wait until morning, it shouldn’t page at night
  • Unique — Deduplicate related alerts into a single incident
  • Contextualized — Include relevant dashboards, runbooks, and recent changes
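
One way to make these properties hard to skip is to encode them in the alert definition itself. The structure below is a hypothetical sketch (not any monitoring system's schema) that refuses to page unless a runbook and dashboard are attached:

```python
from dataclasses import dataclass

@dataclass
class PagingAlert:
    """Hypothetical alert definition that bakes in the hygiene checklist."""
    name: str
    runbook_url: str        # actionable: clear remediation path
    dashboard_url: str      # contextualized: where to look first
    dedup_key: str          # unique: related firings collapse into one incident
    pages_at_night: bool    # urgent: only true if it cannot wait until morning

    def validate(self) -> None:
        if not self.runbook_url or not self.dashboard_url:
            raise ValueError(f"{self.name}: refusing to page without runbook and dashboard")

# Example values are placeholders.
alert = PagingAlert(
    name="checkout-error-rate-high",
    runbook_url="https://wiki.example.com/runbooks/checkout-errors",
    dashboard_url="https://grafana.example.com/d/checkout",
    dedup_key="checkout-error-rate",
    pages_at_night=True,
)
alert.validate()
```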

Alert Fatigue Metrics

| Metric | Healthy | Concerning | Critical |
|---|---|---|---|
| Pages per on-call shift | < 5 | 5-15 | > 15 |
| False positive rate | < 10% | 10-30% | > 30% |
| After-hours pages | < 2 | 2-5 | > 5 |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
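
These numbers are easy to compute from a paging export. A minimal sketch, assuming a list of page records with the fields shown; the field names are placeholders, not any vendor's export format:

```python
from statistics import mean

# Hypothetical page records for one on-call shift.
pages = [
    {"actionable": True,  "after_hours": False, "minutes_to_ack": 3},
    {"actionable": False, "after_hours": True,  "minutes_to_ack": 12},
    {"actionable": True,  "after_hours": True,  "minutes_to_ack": 4},
]

pages_per_shift = len(pages)
false_positive_rate = sum(not p["actionable"] for p in pages) / len(pages)
after_hours_pages = sum(p["after_hours"] for p in pages)
mean_time_to_ack = mean(p["minutes_to_ack"] for p in pages)

print(f"pages: {pages_per_shift}, false positives: {false_positive_rate:.0%}, "
      f"after-hours: {after_hours_pages}, MTTA: {mean_time_to_ack:.1f} min")
```

Tracking these per shift (not just per quarter) also makes uneven load across the rotation visible.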

Incident Classification

| Severity | Impact | Response | Example |
|---|---|---|---|
| SEV1 | Complete outage, all users affected | All-hands bridge, exec notification | Payment processing down |
| SEV2 | Degraded service, many users affected | Primary + secondary + specialist | Elevated error rates (> 5%) |
| SEV3 | Partial degradation, some users affected | Primary on-call investigates | Single-region latency spike |
| SEV4 | Minor issue, minimal user impact | Next business day | Dashboard rendering glitch |
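
Classification works best when it is mechanical rather than debated mid-incident. A small sketch, under the assumption that severity is derived from measured user impact; the thresholds and function below are illustrative, not a standard:

```python
from enum import Enum

class Severity(Enum):
    SEV1 = "all-hands bridge, exec notification"
    SEV2 = "primary + secondary + specialist"
    SEV3 = "primary on-call investigates"
    SEV4 = "next business day"

def classify(fraction_of_users_affected: float, fully_down: bool) -> Severity:
    """Illustrative mapping from measured impact to severity."""
    if fully_down:
        return Severity.SEV1
    if fraction_of_users_affected > 0.05:   # e.g. elevated error rates (> 5%)
        return Severity.SEV2
    if fraction_of_users_affected > 0.0:
        return Severity.SEV3
    return Severity.SEV4

print(classify(0.08, fully_down=False))  # Severity.SEV2
```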

Post-Incident Reviews

Every SEV1 and SEV2 incident deserves a blameless post-incident review:

Template

## Incident Summary
- **Duration**: [Start time] — [End time]  
- **Severity**: SEV[1-4]
- **Impact**: [User-facing impact description]
- **Detection**: [How was it detected? Alert? User report?]

## Timeline
- HH:MM — Alert fired
- HH:MM — Acknowledged by [name]
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Verified resolved

## Root Cause
[Technical explanation]

## Contributing Factors
1. [Factor 1]
2. [Factor 2]

## Action Items
- [ ] [Preventive action] — Owner: [name] — Due: [date]
- [ ] [Detective action] — Owner: [name] — Due: [date]

Key Principles

  • Blameless — Focus on system failures, not individual mistakes
  • Actionable — Every review produces concrete action items with owners and deadlines
  • Shared — Published broadly so the entire organization learns
  • Tracked — Action items are tracked to completion (not filed and forgotten)

Tools and Platforms

| Category | Tools |
|---|---|
| Alerting | PagerDuty, Opsgenie, Grafana OnCall, VictorOps |
| Communication | Slack, Microsoft Teams (incident channels) |
| Incident Tracking | Jira, Linear, incident.io, Rootly |
| Status Pages | Statuspage.io, Cachet, Instatus |
| Post-Mortems | Confluence, Notion, Blameless |

Building a Sustainable Program

Signs of a Healthy On-Call Program

  • Engineers volunteer for on-call rotations
  • Alert-to-action ratio is high (> 90% actionable)
  • MTTR is consistently improving
  • Post-incident actions are completed on schedule
  • On-call load is evenly distributed across the team

Signs of a Toxic On-Call Program

  • Engineers dread their on-call weeks
  • Same people are always on-call (knowledge silo problem)
  • Alerts fire constantly but rarely require action
  • Post-incident reviews don’t happen or don’t produce results
  • Compensation doesn’t reflect the burden

Recovery Strategies

If your on-call program is already toxic:

  1. Declare alert bankruptcy — Silence every alert, then re-enable only the ones that are truly actionable
  2. Invest in reliability — Fix the top 5 sources of pages
  3. Redistribute the load — Ensure every team member participates equally
  4. Compensate fairly — Match compensation to the actual burden
  5. Measure and share — Make alert quality metrics visible to leadership

The goal is a program where being on-call is a manageable responsibility, not a punishment.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
