On-Call Compensation and Burnout Prevention
Design on-call programs that compensate engineers fairly, prevent burnout, and maintain team morale. Covers compensation models, workload balancing, schedule design, burnout indicators, escalation policies, and the organizational practices that make on-call sustainable long-term.
On-call is a tax on engineers’ personal time. When that tax goes unacknowledged (no extra pay, no time off, no reduction in sprint work), engineers burn out or leave. Teams with the worst on-call programs tend to have the highest attrition, and replacing an experienced engineer can cost 6-12 months of productivity.
This guide covers how to build an on-call program that is sustainable, fair, and does not require heroism.
Compensation Models
| Model | How It Works | Pros | Cons |
|---|---|---|---|
| Flat stipend | Fixed amount per on-call week | Simple, predictable | Does not reflect actual workload |
| Hourly rate | Extra pay for hours actively responding | Directly proportional to effort | Complex to track |
| Comp time | Paid time off after on-call week | Addresses fatigue directly | Team needs coverage during comp time |
| Combination | Stipend + overtime for incidents | Fair and predictable | Most complex to administer |
| Nothing | “It’s part of the job” | ❌ | Attrition, burnout, resentment |
Recommended: Combination Model
Base on-call stipend: $500/week
Covers: carrying the pager, being available, responding to pages
Incident compensation:
During business hours: included in salary (no extra)
After hours (6 PM - 9 AM): $100/hour for active incident work
Weekends/holidays: $150/hour for active incident work
Comp time:
After every on-call shift: 1 comp day (taken within 2 weeks)
After a major incident (> 2 hours after-hours): additional comp day
Sprint load:
On-call week = 50% sprint capacity (not 100%)
The other 50% is reserved for incident response and toil
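The arithmetic of the combination model can be sketched in a few lines of Python. This is a minimal illustration rather than a payroll implementation. The `Incident` type, the `period` labels, and the `weekly_compensation` function are hypothetical names. The sketch also assumes that weekend and holiday work counts as after-hours for the major-incident rule, and that each major incident earns one extra comp day.

```python
from dataclasses import dataclass

# Figures taken from the recommended model above (USD).
STIPEND_PER_WEEK = 500
HOURLY_RATES = {
    "business": 0,            # included in salary
    "after_hours": 100,       # weeknights, 6 PM - 9 AM
    "weekend_holiday": 150,
}
MAJOR_INCIDENT_HOURS = 2.0    # more than this outside business hours earns a comp day


@dataclass
class Incident:
    hours: float   # time spent actively working the incident
    period: str    # one of the HOURLY_RATES keys (hypothetical labels)


def weekly_compensation(incidents):
    """Return pay and comp days owed for one on-call week."""
    incident_pay = sum(i.hours * HOURLY_RATES[i.period] for i in incidents)
    # One comp day for the shift itself, plus one per major incident.
    # Assumption: weekend/holiday incidents count as after-hours here.
    major = sum(
        1 for i in incidents
        if i.period != "business" and i.hours > MAJOR_INCIDENT_HOURS
    )
    return {"pay": STIPEND_PER_WEEK + incident_pay, "comp_days": 1 + major}


week = [
    Incident(hours=1.5, period="after_hours"),      # 1.5 h x $100 = $150
    Incident(hours=3.0, period="weekend_holiday"),  # 3 h x $150 = $450, major
    Incident(hours=2.0, period="business"),         # no extra pay
]
print(weekly_compensation(week))  # {'pay': 1100.0, 'comp_days': 2}
```

Keeping the rates in one table makes it easy to adjust the figures to local pay bands without touching the logic.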
Burnout Indicators
| Indicator | Healthy | Warning | Crisis |
|---|---|---|---|
| Pages per shift | < 5 per week (business hours) | 5-15 per week | > 15 per week |
| After-hours pages | < 1 per week | 1-3 per week | > 3 per week |
| Mean time to resolve | < 30 minutes | 30-60 minutes | > 60 minutes |
| On-call frequency | Once every 4-6 weeks | Once every 2-3 weeks | Every other week or more |
| Team sentiment | “On-call is manageable” | “On-call is annoying but OK” | “I dread my on-call week” |
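The numeric rows of the table translate directly into threshold checks. The sketch below is one possible encoding. The metric names and the `classify` and `assess` functions are assumptions made for illustration. On-call frequency and team sentiment are left out because they do not map onto a single "higher is worse" number.

```python
# (warning_from, crisis_above) per metric, copied from the table above.
# Metric names are hypothetical.
THRESHOLDS = {
    "pages_per_week": (5, 15),
    "after_hours_pages_per_week": (1, 3),
    "mean_minutes_to_resolve": (30, 60),
}


def classify(value, warning_from, crisis_above):
    """Healthy below warning_from, crisis above crisis_above, warning otherwise."""
    if value > crisis_above:
        return "crisis"
    if value >= warning_from:
        return "warning"
    return "healthy"


def assess(metrics):
    """Classify every metric that has a known threshold."""
    return {
        name: classify(metrics[name], *THRESHOLDS[name])
        for name in THRESHOLDS
        if name in metrics
    }


print(assess({
    "pages_per_week": 5.75,
    "after_hours_pages_per_week": 1,
    "mean_minutes_to_resolve": 25,
}))
# {'pages_per_week': 'warning', 'after_hours_pages_per_week': 'warning',
#  'mean_minutes_to_resolve': 'healthy'}
```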
On-Call Health Dashboard
On-Call Health Report — March 2024
Team: Checkout
Rotation size: 6 engineers
On-call frequency: once every 6 weeks ✅
This month's paging load:
Total pages: 23 (avg 5.75/week) ⚠️ above the < 5/week target
After-hours pages: 4 (avg 1/week) ⚠️
False alarms: 8 (35%) ❌ — needs alert tuning
Top paging sources:
1. checkout-api timeout alerts: 12 (52%) — investigate root cause
2. payment webhook failures: 5 (22%) — upstream issue, add retry
3. database connection pool: 4 (17%) — tune pool settings
4. false alarms: 2 (9%) — remove or tune these alerts
Action items:
□ Investigate checkout-api timeout root cause (reduces 52% of pages)
□ Add retry logic for payment webhooks
□ Tune database connection pool alert threshold
□ Remove 2 false alarm alerts
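The percentages in a report like this one can be derived from a raw list of pages. The following sketch assumes each page has already been tagged with a source label. The `paging_report` function and its input shape are hypothetical, and the data simply replays the counts from the report above.

```python
from collections import Counter


def paging_report(page_sources, false_alarm_count):
    """Summarize pages by source, with each source's share of the total."""
    counts = Counter(page_sources)
    total = sum(counts.values())
    top_sources = [
        (source, n, round(100 * n / total))
        for source, n in counts.most_common()  # highest count first
    ]
    return {
        "total": total,
        "false_alarm_pct": round(100 * false_alarm_count / total),
        "top_sources": top_sources,
    }


# Counts replayed from the report above.
pages = (
    ["checkout-api timeout"] * 12
    + ["payment webhook failure"] * 5
    + ["database connection pool"] * 4
    + ["false alarm"] * 2
)
report = paging_report(pages, false_alarm_count=8)
print(report["total"], report["false_alarm_pct"])  # 23 35
for source, n, pct in report["top_sources"]:
    print(f"{source}: {n} ({pct}%)")
```

Sorting by count puts the alert whose fix would remove the most pages at the top of the action list.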
Schedule Design
6-person rotation (recommended minimum):
Week 1: Alice (primary), Bob (secondary)
Week 2: Bob (primary), Carol (secondary)
Week 3: Carol (primary), Dave (secondary)
Week 4: Dave (primary), Eve (secondary)
Week 5: Eve (primary), Frank (secondary)
Week 6: Frank (primary), Alice (secondary)
Result: each person is primary once every 6 weeks, and secondary the week before
Secondary on-call:
- Backup if primary is unavailable
- Paged automatically if the primary has not acknowledged within 5 minutes (see Escalation Policy)
- NOT expected to wake up for every page
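The rotation above follows a simple pattern: each week's secondary becomes the next week's primary. A minimal sketch of that pattern is shown here. The `build_rotation` function is a hypothetical helper, not part of any scheduling tool.

```python
def build_rotation(engineers, weeks):
    """Return (week number, primary, secondary) tuples for a round-robin rotation."""
    n = len(engineers)
    if n < 2:
        raise ValueError("a primary/secondary rotation needs at least 2 engineers")
    return [
        (week + 1, engineers[week % n], engineers[(week + 1) % n])
        for week in range(weeks)
    ]


team = ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"]
for week, primary, secondary in build_rotation(team, weeks=6):
    print(f"Week {week}: {primary} (primary), {secondary} (secondary)")
```

Because the list wraps around, asking for more weeks than there are engineers continues the same cycle.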
Protected Time Rules
| Rule | Why |
|---|---|
| No on-call during PTO | On-call and vacation are incompatible |
| No back-to-back on-call weeks | Consecutive weeks cause acute burnout |
| Swap requests honored within 48 hours | People have lives — emergencies happen |
| New hires exempt for first 3 months | Need ramp-up time and shadow shifts first |
| On-call week = reduced sprint work | Cannot do 100% project work AND respond to incidents |
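Several of these rules can be checked mechanically before a schedule is published. The sketch below covers the PTO, back-to-back, and new-hire rules. It assumes the schedule is a list of consecutive weeks sorted by start date, and it approximates "3 months" as 90 days. The `check_schedule` function and its input shapes are hypothetical.

```python
from datetime import date, timedelta

NEW_HIRE_EXEMPTION = timedelta(days=90)  # assumption: 3 months is roughly 90 days


def check_schedule(schedule, pto, hire_dates):
    """Return (week_start, engineer, reason) for every rule violation.

    schedule:   list of (week_start, engineer), consecutive weeks, sorted
    pto:        dict of engineer -> set of PTO dates
    hire_dates: dict of engineer -> hire date
    """
    violations = []
    for idx, (week_start, engineer) in enumerate(schedule):
        week_days = {week_start + timedelta(days=d) for d in range(7)}
        if week_days & pto.get(engineer, set()):
            violations.append((week_start, engineer, "on-call overlaps PTO"))
        if idx > 0 and schedule[idx - 1][1] == engineer:
            violations.append((week_start, engineer, "back-to-back on-call weeks"))
        hired = hire_dates.get(engineer)
        if hired is not None and week_start < hired + NEW_HIRE_EXEMPTION:
            violations.append((week_start, engineer, "new hire in first 3 months"))
    return violations


schedule = [
    (date(2024, 3, 4), "Alice"),
    (date(2024, 3, 11), "Alice"),  # back-to-back with the previous week
    (date(2024, 3, 18), "Bob"),    # on PTO on the 20th, hired 1 Feb
]
pto = {"Bob": {date(2024, 3, 20)}}
hire_dates = {"Alice": date(2023, 1, 1), "Bob": date(2024, 2, 1)}
for violation in check_schedule(schedule, pto, hire_dates):
    print(violation)
```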
Escalation Policy
Incident occurs:
T+0: Primary on-call paged (PagerDuty, phone call)
T+5min: If no acknowledgment → escalate to secondary
T+15min: If no acknowledgment → escalate to engineering manager
T+30min: If unresolved → engage senior engineer or architect
T+60min: If still unresolved → incident commander, broader team
At any point: on-call engineer can request help without stigma
"I need help" is a positive signal, not a failure
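The timeline above can be expressed as a small function that returns who should have been engaged at a given point in an incident. This is an illustrative sketch only. Paging tools such as PagerDuty configure escalation through their own policy settings, and the `escalation_targets` function here is hypothetical.

```python
def escalation_targets(minutes_elapsed, acknowledged, resolved):
    """List everyone who should have been engaged by this point in the incident."""
    targets = ["primary on-call"]  # paged at T+0
    # Acknowledgment-based steps stop applying once someone acknowledges.
    if not acknowledged:
        if minutes_elapsed >= 5:
            targets.append("secondary on-call")
        if minutes_elapsed >= 15:
            targets.append("engineering manager")
    # Resolution-based steps apply even after acknowledgment.
    if not resolved:
        if minutes_elapsed >= 30:
            targets.append("senior engineer or architect")
        if minutes_elapsed >= 60:
            targets.append("incident commander and broader team")
    return targets


print(escalation_targets(20, acknowledged=False, resolved=False))
# ['primary on-call', 'secondary on-call', 'engineering manager']
print(escalation_targets(45, acknowledged=True, resolved=False))
# ['primary on-call', 'senior engineer or architect']
```

Separating the acknowledgment steps from the resolution steps mirrors the policy: the first two escalations are about reaching someone, the later ones are about getting more help.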
Implementation Checklist
- Compensate on-call: stipend + hourly rate for after-hours incidents
- Give comp time after every on-call shift (1 day off within 2 weeks)
- Reduce sprint capacity to 50% during on-call weeks
- Maintain a rotation of 6+ people; nobody should be on call more often than once every 4 weeks
- Track paging load: target < 5 pages/week, < 1 after-hours page/week
- Audit false alarms monthly and eliminate them aggressively
- Have new engineers shadow 2 on-call shifts before adding them to the rotation
- Never schedule on-call during PTO or back-to-back weeks
- Run quarterly on-call retrospectives: what caused the most pain?
- Make “asking for help” explicitly encouraged and non-penalized