On-Call Engineering: Building Sustainable Incident Response Programs
How to design on-call rotations, escalation policies, and incident response workflows that are sustainable, fair, and effective at keeping services reliable.
On-call is the operational backbone of reliable systems. Done well, it enables rapid incident response without burning out engineers. Done poorly, it drives attrition and creates a toxic feedback loop where the most experienced engineers leave because they’re paged the most.
On-Call Program Design
Rotation Structures
**Primary/Secondary Model.** The most common pattern. The primary responder handles alerts first; the secondary acts as backup if the primary doesn't acknowledge within the SLA.
**Follow-the-Sun.** Distributed teams rotate coverage by time zone so that no one is paged outside business hours. Requires teams in at least three time zones.
**Specialist Rotation.** Domain experts cover specific subsystems (database, network, application). Works for large organizations but creates knowledge silos.
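To make the primary/secondary model concrete, here is a minimal sketch of a weekly round-robin schedule generator. The team names, start date, and shift length are placeholder assumptions; in practice the schedule usually lives in the paging tool.

```python
from datetime import date, timedelta

def build_rotation(engineers, start, weeks, shift_days=7):
    """Round-robin primary/secondary schedule.

    The secondary is simply the next engineer in the rotation,
    so everyone alternates between the two roles.
    """
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(days=week * shift_days)
        schedule.append({
            "start": shift_start,
            "end": shift_start + timedelta(days=shift_days),
            "primary": engineers[week % len(engineers)],
            "secondary": engineers[(week + 1) % len(engineers)],
        })
    return schedule

# Hypothetical team and start date
team = ["alice", "bob", "carol", "dev"]
for shift in build_rotation(team, date(2024, 1, 1), weeks=4):
    print(shift["start"], shift["primary"], "->", shift["secondary"])
```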
Rotation Cadence
| Cadence | Pros | Cons |
|---|---|---|
| Weekly | Simple, predictable | Fatiguing if alert volume is high |
| Bi-weekly | Less frequent rotation overhead | Can feel long during bad weeks |
| Daily | Distributes pain evenly | Constant context switching |
| Split shift | Nobody on-call overnight | Requires more people |
Compensation Models
On-call work is real work and should be compensated:
- Flat on-call stipend — Fixed amount per on-call shift (e.g., $500/week)
- Per-incident bonus — Additional pay per page received
- Comp time — Time off equivalent to time spent responding
- Shift differential — Higher multiplier for nights/weekends
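These models can also be combined. As an illustration, the sketch below blends a flat stipend with a per-incident bonus and a night/weekend differential; the dollar amounts and multipliers are placeholders, not recommendations.

```python
def shift_compensation(base_stipend, pages, after_hours_pages,
                       per_page_bonus=25.0, night_weekend_multiplier=2.0):
    """Blend a flat stipend with per-incident bonuses.

    After-hours pages earn a higher multiplier to reflect the extra
    burden of being woken up or interrupted on a weekend.
    """
    daytime_pages = pages - after_hours_pages
    bonus = (daytime_pages * per_page_bonus
             + after_hours_pages * per_page_bonus * night_weekend_multiplier)
    return base_stipend + bonus

# A week with 6 pages, 2 of them overnight, on a $500 stipend
print(shift_compensation(500.0, pages=6, after_hours_pages=2))  # 700.0
```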
Escalation Policies
Alert fires
→ Page primary on-call (5 min ack SLA)
→ No ack? Page secondary (5 min ack SLA)
→ No ack? Page engineering manager + Slack #incident channel
→ Still no ack? Page VP of Engineering + auto-bridge
Escalation Rules
- Acknowledge ≠ Resolve — Acking means “I’m looking at it,” not “it’s fixed”
- Escalation is not failure — Encourage early escalation for complex issues
- Skip-level escalation — Allow responders to escalate directly to specialists
- Auto-escalation timers — Never rely on humans to manually escalate
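A sketch of how an auto-escalation timer might work, independent of any particular paging product. The chain, the five-minute ack SLA, and the notify/ack helpers are assumptions for illustration; real tools implement this server-side.

```python
import time

# Escalation chain mirroring the policy above; the names are hypothetical.
ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall",
                    "engineering-manager", "vp-engineering"]
ACK_SLA_SECONDS = 5 * 60

def notify(target, incident_id):
    """Placeholder for a real paging integration."""
    print(f"paging {target} for incident {incident_id}")

def wait_for_ack(incident_id, timeout):
    """Placeholder: in a real system, poll the incident store until acked or timeout."""
    time.sleep(timeout)
    return False  # pretend nobody acknowledged

def escalate(incident_id):
    """Walk the chain until someone acknowledges; never rely on a human to escalate."""
    for target in ESCALATION_CHAIN:
        notify(target, incident_id)
        if wait_for_ack(incident_id, ACK_SLA_SECONDS):
            return target
    return None  # chain exhausted: open a bridge and keep paging
```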
Alert Quality
The number one predictor of on-call burnout is alert noise. Poor alert quality means engineers get paged for things that don’t matter, which teaches them to ignore alerts — including the ones that do matter.
Alert Hygiene
- Actionable — Every alert should have a clear remediation path
- Urgent — If it can wait until morning, it shouldn’t page at night
- Unique — Deduplicate related alerts into a single incident
- Contextualized — Include relevant dashboards, runbooks, and recent changes
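One way to enforce these properties is to lint alert definitions before they ship. The sketch below assumes alerts are described as dictionaries with hypothetical fields (runbook_url, page_at_night, dedupe_key, dashboard_url); adapt the field names to whatever your alerting pipeline actually uses.

```python
def lint_alert(alert):
    """Return a list of hygiene violations for one alert definition."""
    problems = []
    if not alert.get("runbook_url"):
        problems.append("not actionable: missing runbook link")
    if alert.get("severity") == "info" and alert.get("page_at_night"):
        problems.append("not urgent: info-level alert pages after hours")
    if not alert.get("dedupe_key"):
        problems.append("not unique: missing dedupe key, related alerts will fan out")
    if not alert.get("dashboard_url"):
        problems.append("not contextualized: missing dashboard link")
    return problems

# Hypothetical alert definition
alert = {"name": "HighErrorRate", "severity": "critical",
         "runbook_url": "https://runbooks.example/high-error-rate",
         "page_at_night": True, "dedupe_key": "checkout-errors"}
print(lint_alert(alert))  # ['not contextualized: missing dashboard link']
```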
Alert Fatigue Metrics
| Metric | Healthy | Concerning | Critical |
|---|---|---|---|
| Pages per on-call shift | < 5 | 5-15 | > 15 |
| False positive rate | < 10% | 10-30% | > 30% |
| After-hours pages (per shift) | < 2 | 2-5 | > 5 |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
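These metrics are straightforward to compute from the paging tool's export. A minimal sketch, assuming each page record carries an ack time, an actionable flag, and an after-hours flag:

```python
from statistics import mean

def fatigue_metrics(pages, shifts):
    """Compute alert-fatigue metrics from a list of page records.

    Each record is assumed to look like:
    {"ack_seconds": 240, "actionable": True, "after_hours": False}
    """
    if not pages:
        return {}
    return {
        "pages_per_shift": len(pages) / shifts,
        "false_positive_rate": sum(not p["actionable"] for p in pages) / len(pages),
        "after_hours_per_shift": sum(p["after_hours"] for p in pages) / shifts,
        "mean_time_to_ack_min": mean(p["ack_seconds"] for p in pages) / 60,
    }

pages = [
    {"ack_seconds": 180, "actionable": True, "after_hours": False},
    {"ack_seconds": 600, "actionable": False, "after_hours": True},
    {"ack_seconds": 300, "actionable": True, "after_hours": False},
]
print(fatigue_metrics(pages, shifts=1))
```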
Incident Classification
| Severity | Impact | Response | Example |
|---|---|---|---|
| SEV1 | Complete outage, all users affected | All-hands bridge, exec notification | Payment processing down |
| SEV2 | Degraded service, many users affected | Primary + secondary + specialist | Elevated error rates (>5%) |
| SEV3 | Partial degradation, some users affected | Primary on-call investigates | Single region latency spike |
| SEV4 | Minor issue, minimal user impact | Next business day | Dashboard rendering glitch |
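Classification is only useful if it changes who gets paged. A small sketch mapping severity to the responses in the table above; the role and channel names are hypothetical.

```python
# Map severity to who gets paged and how; role names are hypothetical.
RESPONSE_POLICY = {
    "SEV1": {"page": ["primary", "secondary", "incident-commander"],
             "notify": ["exec-oncall"], "bridge": True},
    "SEV2": {"page": ["primary", "secondary", "specialist"],
             "notify": [], "bridge": False},
    "SEV3": {"page": ["primary"], "notify": [], "bridge": False},
    "SEV4": {"page": [], "notify": ["team-triage-queue"], "bridge": False},
}

def respond(severity):
    """Look up the response policy for a classified incident."""
    policy = RESPONSE_POLICY.get(severity)
    if policy is None:
        raise ValueError(f"unknown severity: {severity}")
    return policy

print(respond("SEV2")["page"])  # ['primary', 'secondary', 'specialist']
```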
Post-Incident Reviews
Every SEV1 and SEV2 incident deserves a blameless post-incident review:
Template
## Incident Summary
- **Duration**: [Start time] — [End time]
- **Severity**: SEV[1-4]
- **Impact**: [User-facing impact description]
- **Detection**: [How was it detected? Alert? User report?]
## Timeline
- HH:MM — Alert fired
- HH:MM — Acknowledged by [name]
- HH:MM — Root cause identified
- HH:MM — Fix deployed
- HH:MM — Verified resolved
## Root Cause
[Technical explanation]
## Contributing Factors
1. [Factor 1]
2. [Factor 2]
## Action Items
- [ ] [Preventive action] — Owner: [name] — Due: [date]
- [ ] [Detective action] — Owner: [name] — Due: [date]
Key Principles
- Blameless — Focus on system failures, not individual mistakes
- Actionable — Every review produces concrete action items with owners and deadlines
- Shared — Published broadly so the entire organization learns
- Tracked — Action items are tracked to completion (not filed and forgotten)
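Tracking is the part that most often slips. Below is a sketch of an overdue-action-item check that could run on a schedule and post to the team channel; the item structure is an assumption.

```python
from datetime import date

def overdue_action_items(items, today=None):
    """Return open post-incident action items that are past their due date.

    Each item is assumed to look like:
    {"title": "...", "owner": "alice", "due": date(2024, 2, 1), "done": False}
    """
    today = today or date.today()
    return [i for i in items if not i["done"] and i["due"] < today]

items = [
    {"title": "Add canary deploys to payments", "owner": "alice",
     "due": date(2024, 2, 1), "done": False},
    {"title": "Alert on queue depth", "owner": "bob",
     "due": date(2024, 3, 1), "done": True},
]
for item in overdue_action_items(items, today=date(2024, 2, 15)):
    print(f"OVERDUE: {item['title']} (owner: {item['owner']}, due {item['due']})")
```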
Tools and Platforms
| Category | Tools |
|---|---|
| Alerting | PagerDuty, Opsgenie, Grafana OnCall, VictorOps (now Splunk On-Call) |
| Communication | Slack, Microsoft Teams (incident channels) |
| Incident Tracking | Jira, Linear, incident.io, Rootly |
| Status Pages | Statuspage.io, Cachet, Instatus |
| Post-Mortems | Confluence, Notion, Blameless |
Building a Sustainable Program
Signs of a Healthy On-Call Program
- Engineers volunteer for on-call rotations
- Alert-to-action ratio is high (> 90% actionable)
- MTTR is consistently improving
- Post-incident actions are completed on schedule
- On-call load is evenly distributed across the team
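"Evenly distributed" is easy to assert and easy to check. A minimal sketch that flags engineers absorbing a disproportionate share of pages, assuming a per-engineer page count exported from the paging tool; the threshold is an arbitrary illustration.

```python
def load_imbalance(pages_by_engineer, threshold=2.0):
    """Flag engineers whose page load exceeds `threshold` times the team average."""
    if not pages_by_engineer:
        return []
    average = sum(pages_by_engineer.values()) / len(pages_by_engineer)
    return [name for name, count in pages_by_engineer.items()
            if count > threshold * average]

# Hypothetical quarter of pages per engineer
pages = {"alice": 48, "bob": 11, "carol": 9, "dev": 12}
print(load_imbalance(pages))  # ['alice']
```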
Signs of a Toxic On-Call Program
- Engineers dread their on-call weeks
- Same people are always on-call (knowledge silo problem)
- Alerts fire constantly but rarely require action
- Post-incident reviews don’t happen or don’t produce results
- Compensation doesn’t reflect the burden
Recovery Strategies
If your on-call program is already toxic:
- Declare alert bankruptcy — Silence every alert, then re-enable only the ones that are truly actionable
- Invest in reliability — Fix the top 5 sources of pages
- Redistribute the load — Ensure every team member participates equally
- Compensate fairly — Match compensation to the actual burden
- Measure and share — Make alert quality metrics visible to leadership
The goal is a program where being on-call is a manageable responsibility, not a punishment.