On-Call Engineering: Building Sustainable Incident Response Programs
How to design on-call rotations, escalation policies, and incident response workflows that are sustainable, fair, and effective at keeping services reliable.
On-call is the operational backbone of reliable systems. Done well, it enables rapid incident response without burning out engineers. Done poorly, it drives attrition and creates a toxic feedback loop where the most experienced engineers leave because they’re paged the most.
On-Call Program Design
Rotation Structures
**Primary/Secondary Model.** The most common pattern. The primary responder handles alerts first; the secondary acts as backup if the primary doesn't acknowledge within the SLA.
**Follow-the-Sun.** Distributed teams rotate coverage by time zone so that no one is paged outside business hours. Requires teams in at least three time zones.
**Specialist Rotation.** Domain experts cover specific subsystems (database, network, application). Works for large organizations but creates knowledge silos.
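To make the primary/secondary model concrete, here is a minimal sketch of a weekly round-robin schedule generator. The team names, start date, and shift length are placeholder assumptions; in practice the schedule usually lives in the paging tool.

```python
from datetime import date, timedelta

def build_rotation(engineers, start, weeks, shift_days=7):
    """Round-robin primary/secondary schedule.

    The secondary is simply the next engineer in the rotation,
    so everyone alternates between the two roles.
    """
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(days=week * shift_days)
        schedule.append({
            "start": shift_start,
            "end": shift_start + timedelta(days=shift_days),
            "primary": engineers[week % len(engineers)],
            "secondary": engineers[(week + 1) % len(engineers)],
        })
    return schedule

# Hypothetical team and start date
team = ["alice", "bob", "carol", "dev"]
for shift in build_rotation(team, date(2024, 1, 1), weeks=4):
    print(shift["start"], shift["primary"], "->", shift["secondary"])
```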
Rotation Cadence
| Cadence | Pros | Cons |
|---|---|---|
| Weekly | Simple, predictable | Fatiguing if alert volume is high |
| Bi-weekly | Less frequent rotation overhead | Can feel long during bad weeks |
| Daily | Distributes pain evenly | Constant context switching |
| Split shift | Nobody on-call overnight | Requires more people |
Compensation Models
On-call work is real work and should be compensated:
- Flat on-call stipend — Fixed amount per on-call shift (e.g., $500/week)
- Per-incident bonus — Additional pay per page received
- Comp time — Time off equivalent to time spent responding
- Shift differential — Higher multiplier for nights/weekends
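These models can also be combined. As an illustration, the sketch below blends a flat stipend with a per-incident bonus and a night/weekend differential; the dollar amounts and multipliers are placeholders, not recommendations.

```python
def shift_compensation(base_stipend, pages, after_hours_pages,
                       per_page_bonus=25.0, night_weekend_multiplier=2.0):
    """Blend a flat stipend with per-incident bonuses.

    After-hours pages earn a higher multiplier to reflect the extra
    burden of being woken up or interrupted on a weekend.
    """
    daytime_pages = pages - after_hours_pages
    bonus = (daytime_pages * per_page_bonus
             + after_hours_pages * per_page_bonus * night_weekend_multiplier)
    return base_stipend + bonus

# A week with 6 pages, 2 of them overnight, on a $500 stipend
print(shift_compensation(500.0, pages=6, after_hours_pages=2))  # 700.0
```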
Escalation Policies
Alert fires
→ Page primary on-call (5 min ack SLA)
→ No ack? Page secondary (5 min ack SLA)
→ No ack? Page engineering manager + Slack #incident channel
→ Still no ack? Page VP of Engineering + auto-bridge
Escalation Rules
- Acknowledge ≠ Resolve — Acking means “I’m looking at it,” not “it’s fixed”
- Escalation is not failure — Encourage early escalation for complex issues
- Skip-level escalation — Allow responders to escalate directly to specialists
- Auto-escalation timers — Never rely on humans to manually escalate
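A sketch of how an auto-escalation timer might work, independent of any particular paging product. The chain, the five-minute ack SLA, and the notify/ack helpers are assumptions for illustration; real tools implement this server-side.

```python
import time

# Escalation chain mirroring the policy above; the names are hypothetical.
ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall",
                    "engineering-manager", "vp-engineering"]
ACK_SLA_SECONDS = 5 * 60

def notify(target, incident_id):
    """Placeholder for a real paging integration."""
    print(f"paging {target} for incident {incident_id}")

def wait_for_ack(incident_id, timeout):
    """Placeholder: in a real system, poll the incident store until acked or timeout."""
    time.sleep(timeout)
    return False  # pretend nobody acknowledged

def escalate(incident_id):
    """Walk the chain until someone acknowledges; never rely on a human to escalate."""
    for target in ESCALATION_CHAIN:
        notify(target, incident_id)
        if wait_for_ack(incident_id, ACK_SLA_SECONDS):
            return target
    return None  # chain exhausted: open a bridge and keep paging
```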
Alert Quality
The number one predictor of on-call burnout is alert noise. Poor alert quality means engineers get paged for things that don’t matter, which teaches them to ignore alerts — including the ones that do matter.
Alert Hygiene
- Actionable — Every alert should have a clear remediation path
- Urgent — If it can wait until morning, it shouldn’t page at night
- Unique — Deduplicate related alerts into a single incident
- Contextualized — Include relevant dashboards, runbooks, and recent changes
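One way to enforce these properties is to lint alert definitions before they ship. The sketch below assumes alerts are described as dictionaries with hypothetical fields (runbook_url, page_at_night, dedupe_key, dashboard_url); adapt the field names to whatever your alerting pipeline actually uses.

```python
def lint_alert(alert):
    """Return a list of hygiene violations for one alert definition."""
    problems = []
    if not alert.get("runbook_url"):
        problems.append("not actionable: missing runbook link")
    if alert.get("severity") == "info" and alert.get("page_at_night"):
        problems.append("not urgent: info-level alert pages after hours")
    if not alert.get("dedupe_key"):
        problems.append("not unique: missing dedupe key, related alerts will fan out")
    if not alert.get("dashboard_url"):
        problems.append("not contextualized: missing dashboard link")
    return problems

# Hypothetical alert definition
alert = {"name": "HighErrorRate", "severity": "critical",
         "runbook_url": "https://runbooks.example/high-error-rate",
         "page_at_night": True, "dedupe_key": "checkout-errors"}
print(lint_alert(alert))  # ['not contextualized: missing dashboard link']
```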
Alert Fatigue Metrics
| Metric | Healthy | Concerning | Critical |
|---|---|---|---|
| Pages per on-call shift | < 5 | 5-15 | > 15 |
| False positive rate | < 10% | 10-30% | > 30% |
| After-hours pages (per shift) | < 2 | 2-5 | > 5 |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
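These metrics are straightforward to compute from the paging tool's export. A minimal sketch, assuming each page record carries an ack time, an actionable flag, and an after-hours flag:

```python
from statistics import mean

def fatigue_metrics(pages, shifts):
    """Compute alert-fatigue metrics from a list of page records.

    Each record is assumed to look like:
    {"ack_seconds": 240, "actionable": True, "after_hours": False}
    """
    if not pages:
        return {}
    return {
        "pages_per_shift": len(pages) / shifts,
        "false_positive_rate": sum(not p["actionable"] for p in pages) / len(pages),
        "after_hours_per_shift": sum(p["after_hours"] for p in pages) / shifts,
        "mean_time_to_ack_min": mean(p["ack_seconds"] for p in pages) / 60,
    }

pages = [
    {"ack_seconds": 180, "actionable": True, "after_hours": False},
    {"ack_seconds": 600, "actionable": False, "after_hours": True},
    {"ack_seconds": 300, "actionable": True, "after_hours": False},
]
print(fatigue_metrics(pages, shifts=1))
```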
Incident Classification
| Severity | Impact | Response | Example |
|---|---|---|---|
| SEV1 | Complete outage, all users affected | All-hands bridge, exec notification | Payment processing down |
| SEV2 | Degraded service, many users affected | Primary + secondary + specialist | Elevated error rates (>5%) |
| SEV3 | Partial degradation, some users affected | Primary on-call investigates | Single region latency spike |
| SEV4 | Minor issue, minimal user impact | Next business day | Dashboard rendering glitch |
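Classification is only useful if it changes who gets paged. A small sketch mapping severity to the responses in the table above; the role and channel names are hypothetical.

```python
# Map severity to who gets paged and how; role names are hypothetical.
RESPONSE_POLICY = {
    "SEV1": {"page": ["primary", "secondary", "incident-commander"],
             "notify": ["exec-oncall"], "bridge": True},
    "SEV2": {"page": ["primary", "secondary", "specialist"],
             "notify": [], "bridge": False},
    "SEV3": {"page": ["primary"], "notify": [], "bridge": False},
    "SEV4": {"page": [], "notify": ["team-triage-queue"], "bridge": False},
}

def respond(severity):
    """Look up the response policy for a classified incident."""
    policy = RESPONSE_POLICY.get(severity)
    if policy is None:
        raise ValueError(f"unknown severity: {severity}")
    return policy

print(respond("SEV2")["page"])  # ['primary', 'secondary', 'specialist']
```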
Post-Incident Reviews
Every SEV1 and SEV2 incident deserves a blameless post-incident review:
Template
## Incident Summary
- **Duration**: [Start time] — [End time]
- **Severity**: SEV[1-4]
- **Impact**: [User-facing impact description]
- **Detection**: [How was it detected? Alert? User report?]
## Timeline
- HH:MM — Alert fired
- HH:MM — Acknowledged by [name]
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Verified resolved
## Root Cause
[Technical explanation]
## Contributing Factors
1. [Factor 1]
2. [Factor 2]
## Action Items
- [ ] [Preventive action] — Owner: [name] — Due: [date]
- [ ] [Detective action] — Owner: [name] — Due: [date]
Key Principles
- Blameless — Focus on system failures, not individual mistakes
- Actionable — Every review produces concrete action items with owners and deadlines
- Shared — Published broadly so the entire organization learns
- Tracked — Action items are tracked to completion (not filed and forgotten)
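Tracking is the part that most often slips. Below is a sketch of an overdue-action-item check that could run on a schedule and post to the team channel; the item structure is an assumption.

```python
from datetime import date

def overdue_action_items(items, today=None):
    """Return open post-incident action items that are past their due date.

    Each item is assumed to look like:
    {"title": "...", "owner": "alice", "due": date(2024, 2, 1), "done": False}
    """
    today = today or date.today()
    return [i for i in items if not i["done"] and i["due"] < today]

items = [
    {"title": "Add canary deploys to payments", "owner": "alice",
     "due": date(2024, 2, 1), "done": False},
    {"title": "Alert on queue depth", "owner": "bob",
     "due": date(2024, 3, 1), "done": True},
]
for item in overdue_action_items(items, today=date(2024, 2, 15)):
    print(f"OVERDUE: {item['title']} (owner: {item['owner']}, due {item['due']})")
```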
Tools and Platforms
| Category | Tools |
|---|---|
| Alerting | PagerDuty, Opsgenie, Grafana OnCall, VictorOps (now Splunk On-Call) |
| Communication | Slack, Microsoft Teams (incident channels) |
| Incident Tracking | Jira, Linear, incident.io, Rootly |
| Status Pages | Statuspage.io, Cachet, Instatus |
| Post-Mortems | Confluence, Notion, Blameless |
Building a Sustainable Program
Signs of a Healthy On-Call Program
- Engineers volunteer for on-call rotations
- Alert-to-action ratio is high (> 90% actionable)
- MTTR is consistently improving
- Post-incident actions are completed on schedule
- On-call load is evenly distributed across the team
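"Evenly distributed" is easy to assert and easy to check. A minimal sketch that flags engineers absorbing a disproportionate share of pages, assuming a per-engineer page count exported from the paging tool; the threshold is an arbitrary illustration.

```python
def load_imbalance(pages_by_engineer, threshold=2.0):
    """Flag engineers whose page load exceeds `threshold` times the team average."""
    if not pages_by_engineer:
        return []
    average = sum(pages_by_engineer.values()) / len(pages_by_engineer)
    return [name for name, count in pages_by_engineer.items()
            if count > threshold * average]

# Hypothetical quarter of pages per engineer
pages = {"alice": 48, "bob": 11, "carol": 9, "dev": 12}
print(load_imbalance(pages))  # ['alice']
```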
Signs of a Toxic On-Call Program
- Engineers dread their on-call weeks
- Same people are always on-call (knowledge silo problem)
- Alerts fire constantly but rarely require action
- Post-incident reviews don’t happen or don’t produce results
- Compensation doesn’t reflect the burden
Recovery Strategies
If your on-call program is already toxic:
- Declare alert bankruptcy — Silence every alert, then re-enable only the ones that are truly actionable
- Invest in reliability — Fix the top 5 sources of pages
- Redistribute the load — Ensure every team member participates equally
- Compensate fairly — Match compensation to the actual burden
- Measure and share — Make alert quality metrics visible to leadership
The goal is a program where being on-call is a manageable responsibility, not a punishment.