Sustainable Incident On-Call
Design on-call rotations that protect engineer health while maintaining service reliability. This guide covers rotation design, escalation policies, compensation models, burnout prevention, alert quality, and the organizational practices that make on-call sustainable.
On-call should not be a punishment. Engineers who dread on-call tend to produce worse code, miss more incidents, and eventually leave the company. Sustainable on-call requires deliberate design of rotations, alerts, escalation paths, and, most importantly, organizational incentives.
Rotation Design
Minimum viable rotation:
- 6-8 engineers in rotation, with 6 as the floor
- 1 week on-call shifts (not 2+)
- Follow-the-sun for global teams
- Clear primary + secondary escalation
Schedule:
Week 1: Alice (primary), Bob (secondary)
Week 2: Bob (primary), Carol (secondary)
Week 3: Carol (primary), Dave (secondary)
...
Secondary is paged only if the primary doesn't acknowledge within 10 minutes by default; the per-severity timeouts under Escalation Policies override this
After-hours: Primary handles Sev1/Sev2 only
Business hours: Primary handles all severities
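The weekly hand-off pattern above, where each week's secondary becomes the next week's primary, can be generated mechanically. The following is a minimal sketch, not a prescribed tool: the function name `build_rotation`, the roster beyond Alice through Dave, and the start date are all assumptions made for illustration.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    The secondary for a week is the engineer who becomes primary the
    following week, matching the Alice/Bob/Carol pattern in the schedule.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical six-person roster and start date, for illustration only.
team = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]
schedule = build_rotation(team, date(2024, 1, 1), weeks=6)
for week_start, primary, secondary in schedule:
    print(f"{week_start}: {primary} (primary), {secondary} (secondary)")
```

With six engineers, each person is primary one week in six, which sits at the low end of the 6-8 week cycle recommended here.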
Alert Quality Metrics
Signal-to-noise ratio:
Total alerts per week: 40
Actionable alerts: 15 (37.5%)
False positives: 25 (62.5%) ← THIS IS THE PROBLEM
Target:
Actionable rate: > 80%
False positive rate: < 20%
Alerts per on-call shift: < 20 (including business hours)
Night pages (11PM-7AM): < 2 per week
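The arithmetic behind the example week and the 80% target can be checked with a few lines of Python. This is a sketch under one simplifying assumption taken from the example itself: every alert that is not actionable counts as a false positive. The function name `alert_quality` is illustrative.

```python
def alert_quality(total_alerts, actionable_alerts):
    """Compute actionable and false positive rates for one week of alerts.

    Assumes every non-actionable alert is a false positive, as in the
    40-alert example week.
    """
    if total_alerts <= 0:
        raise ValueError("total_alerts must be positive")
    if not 0 <= actionable_alerts <= total_alerts:
        raise ValueError("actionable_alerts must be between 0 and total_alerts")
    actionable_rate = actionable_alerts / total_alerts
    return {
        "actionable_rate": actionable_rate,
        "false_positive_rate": 1 - actionable_rate,
        "meets_target": actionable_rate > 0.80,  # target: actionable > 80%
    }


week = alert_quality(total_alerts=40, actionable_alerts=15)
print(f"actionable: {week['actionable_rate']:.1%}")          # actionable: 37.5%
print(f"false positives: {week['false_positive_rate']:.1%}")  # false positives: 62.5%
print(f"meets target: {week['meets_target']}")                # meets target: False
```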
Metrics to track:
- Alerts per shift (total and after-hours)
- Time-to-acknowledge (< 5 min target)
- Time-to-resolve per severity
- False positive rate
- Escalation rate
- Sleep disruption count
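Sleep disruption can be counted directly from page timestamps. The sketch below assumes timestamps are already in the on-call engineer's local time and treats the 11PM-7AM window from the targets as the night window. The sample `pages` list is made up for illustration and does not come from any real paging system.

```python
from datetime import datetime, time

NIGHT_START = time(23, 0)  # 11 PM
NIGHT_END = time(7, 0)     # 7 AM


def is_night_page(paged_at):
    """True if the page fired at or after 11 PM or before 7 AM local time."""
    t = paged_at.time()
    return t >= NIGHT_START or t < NIGHT_END


# Hypothetical pages from one on-call week (engineer's local time).
pages = [
    datetime(2024, 1, 2, 14, 30),  # afternoon
    datetime(2024, 1, 3, 2, 15),   # night
    datetime(2024, 1, 4, 23, 5),   # night
    datetime(2024, 1, 5, 7, 0),    # 7:00 sharp falls outside the window
]

night_pages = [p for p in pages if is_night_page(p)]
within_target = len(night_pages) < 2  # target: fewer than 2 night pages per week
print(f"night pages: {len(night_pages)}, within target: {within_target}")
```

In this sample week there are two night pages, so the week misses the target of fewer than 2.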
Escalation Policies
escalation_policy:
  sev1_critical:  # Revenue-impacting, full outage
    timeout: 5 min
    chain:
      - primary_on_call
      - secondary_on_call (after 5 min)
      - engineering_manager (after 15 min)
      - vp_engineering (after 30 min)
    channels: phone, sms, slack
  sev2_high:  # Degraded service, partial impact
    timeout: 15 min
    chain:
      - primary_on_call
      - secondary_on_call (after 15 min)
    channels: push_notification, slack
  sev3_medium:  # Non-urgent, business hours response
    timeout: 4 hours
    chain:
      - primary_on_call
    channels: slack_only
    restriction: business_hours_only
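To make the chain semantics concrete, here is a small sketch of how the policy above might be evaluated: given a severity and the minutes an alert has gone unacknowledged, it returns everyone who should have been paged so far. The dictionary layout and the function name `responders_paged` are assumptions; the thresholds are copied from the policy. The business-hours restriction on Sev3 and the notification channels are not modeled.

```python
# Minutes without acknowledgement after which each responder is paged.
# Thresholds come from the policy above; this data layout is an assumption.
ESCALATION_CHAINS = {
    "sev1_critical": [
        (0, "primary_on_call"),
        (5, "secondary_on_call"),
        (15, "engineering_manager"),
        (30, "vp_engineering"),
    ],
    "sev2_high": [
        (0, "primary_on_call"),
        (15, "secondary_on_call"),
    ],
    "sev3_medium": [
        (0, "primary_on_call"),
    ],
}


def responders_paged(severity, minutes_unacknowledged):
    """Return everyone paged so far for an alert that is still unacknowledged."""
    chain = ESCALATION_CHAINS[severity]
    return [who for after, who in chain if minutes_unacknowledged >= after]


print(responders_paged("sev1_critical", 16))
# ['primary_on_call', 'secondary_on_call', 'engineering_manager']
print(responders_paged("sev2_high", 10))
# ['primary_on_call']
```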
Burnout Prevention
Organizational practices:
☐ No more than 1 week on-call per 6-8 week cycle
☐ Comp time: a day off after any on-call week that included pages
☐ Monetary compensation for after-hours pages
☐ Blameless postmortems (learn, don't punish)
☐ Toil budget: 20% of sprint for alert cleanup
☐ Regular alert review: delete noisy alerts
Individual protections:
☐ Right to silence phone during protected sleep hours
☐ Secondary on-call handles overflow
☐ Escalation to management for extended incidents
☐ No on-call during PTO, sick leave, or parental leave
Red flags:
⚠ Same person always on-call (team too small)
⚠ Alerts that page but require no action
⚠ Incident postmortems result in more alerts (not fewer)
⚠ Engineers refuse to join on-call rotation
Compensation Models
| Model | How It Works | When to Use |
|---|---|---|
| Flat stipend | $500-$1,500/month for being on-call | Small teams, simple rotation |
| Per-page payment | $50-$200 per after-hours page | High-reliability teams |
| Comp time | 1 day off per on-call week | Organizations that value time |
| Hybrid | Stipend + per-page + comp time | Best practice for large orgs |
| Nothing | No compensation | Fastest way to lose engineers |
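As a worked example of the hybrid model, the sketch below adds a flat stipend to per-page payments and reports comp days separately. The function name and the sample figures are illustrative assumptions: the stipend and per-page rate are the low ends of the ranges in the table, and the comp-time rule is the table's 1 day off per on-call week.

```python
def hybrid_compensation(monthly_stipend, per_page_rate, after_hours_pages, on_call_weeks):
    """Combine stipend, per-page pay, and comp time for one month.

    Comp time follows the table's rule of 1 day off per on-call week.
    """
    return {
        "cash": monthly_stipend + per_page_rate * after_hours_pages,
        "comp_days": on_call_weeks,
    }


# Illustrative month: $500 stipend, $50 per page, 3 after-hours pages, 1 on-call week.
month = hybrid_compensation(
    monthly_stipend=500, per_page_rate=50, after_hours_pages=3, on_call_weeks=1
)
print(month)  # {'cash': 650, 'comp_days': 1}
```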
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No on-call compensation | Engineers resent on-call, attrition | Stipend + comp time minimum |
| Alert fatigue (50+ alerts/week) | Real incidents missed | Aggressive alert cleanup |
| Same 2-3 people always on-call | Burnout, knowledge concentration | Grow rotation to 6-8 minimum |
| No follow-the-sun | Night pages for everyone | Distribute across time zones |
| Heroic culture (“just deal with it”) | Unsustainable, leads to burnout | Systematic, measured on-call health |
On-call is a team responsibility, not an individual burden. If your on-call rotation is burning people out, the problem is the system, not the people.