On-Call That Does Not Destroy Your Team: Designing Sustainable Incident Response
Build an on-call rotation that engineers do not dread. Covers compensation, escalation design, runbook architecture, alert fatigue elimination, and the cultural patterns that separate healthy on-call from burnout factories.
On-call is the tax you pay for running software in production. Like actual taxes, it can be structured fairly or it can be predatory. Most engineering organizations land on predatory — not out of malice, but out of neglect. Nobody designs the on-call experience. It just happens, and by the time anyone notices it is broken, three engineers have quit and the remaining ones have the thousand-yard stare of people who have been woken up at 3 AM too many times.
This guide covers how to design on-call that respects human beings while still keeping production alive.
The On-Call Health Assessment
Before redesigning anything, diagnose where you are:
| Metric | Healthy | Unhealthy | Critical |
|---|---|---|---|
| Pages per on-call shift | 0-2 | 3-10 | 10+ |
| Pages outside business hours | 0-1 | 2-5 | 5+ |
| False positive rate | < 10% | 10-50% | > 50% |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Mean time to resolve | < 30 min | 30 min - 2 hr | > 2 hr |
| On-call engineer satisfaction | "Manageable" | "Stressful" | "Looking for a new job" |
| Follow-up action items completed | > 80% | 50-80% | < 50% |
The single most important metric is pages-per-shift. If your on-call engineers are getting paged more than twice per shift, your alerts are broken, not your engineers. Fix the alerts.
Rotation Design
The Math of Fair Rotations
Team size: 6 engineers
Rotation length: 1 week
On-call frequency: Every 6 weeks
Annual on-call burden: ~8.7 weeks per engineer

Team size: 4 engineers (common for startups)
Rotation length: 1 week
On-call frequency: Every 4 weeks
Annual on-call burden: ~13 weeks per engineer ← This burns people out
Minimum viable team for sustainable on-call: 5 engineers
Below 5, consider shared on-call across teams or follow-the-sun.
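Where your scheduling tool allows it, encode the rotation as configuration rather than a shared calendar. A minimal sketch of the six-person weekly rotation above, assuming a generic YAML schedule format (the field names and engineer names are illustrative, not any specific tool's schema):

rotation:
  name: platform-primary
  members: [alice, bob, carol, dana, evan, farah]   # hypothetical names; 6 engineers
  shift_length: 1w                                  # weekly handoff
  handoff: "Monday 10:00 local"                     # hand off during business hours, never at midnight
  # 52 weeks / 6 engineers ≈ 8.7 on-call weeks per engineer per year

Handing off mid-morning on a weekday means the outgoing and incoming engineers can talk through anything still open.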
Rotation Structures
| Structure | Best For | Tradeoff |
|---|---|---|
| Weekly rotation | Most teams | Predictable, but a bad week is painful |
| Daily rotation | Small teams | Spreads burden, but constant handoffs |
| Follow-the-sun | Distributed teams | Nobody gets paged at night. Requires 3+ timezones |
| Primary/secondary | Critical services | Always has backup, but doubles the on-call pool needed |
| Tiered by severity | Large organizations | P1 goes to SRE, P2 to the team's on-call. Reduces noise per person (see the escalation sketch below) |
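Primary/secondary and severity tiering combine naturally into an escalation policy. A sketch in the same spirit, assuming a generic paging-tool configuration (the structure and names are illustrative):

escalation_policy:
  P1_critical:
    - notify: team_primary            # immediate page
    - after: 15m
      notify: team_secondary          # backup picks up if the primary does not acknowledge
    - after: 30m
      notify: sre_on_call             # tiered: unresolved P1s escalate to the SRE rotation
  P2_high:
    - notify: team_primary            # page during business hours; Slack off-hours
    - after: 60m
      notify: engineering_manager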
Follow-the-Sun Configuration
If you have engineers in enough timezones, nobody should ever be paged at 3 AM:
US Pacific (UTC-8): 06:00 - 14:00 local = 14:00 - 22:00 UTC
US Eastern (UTC-5): 14:00 - 22:00 local = 19:00 - 03:00 UTC
Europe (UTC+1): 08:00 - 16:00 local = 07:00 - 15:00 UTC
Asia Pacific (UTC+9): 09:00 - 17:00 local = 00:00 - 08:00 UTC
Result: 24-hour coverage, nobody works outside normal hours.
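Expressed as configuration, follow-the-sun is simply one rotation layer per region with explicit UTC windows. A sketch of the handoffs above (region and field names are illustrative):

follow_the_sun:
  - region: asia_pacific    # UTC+9, 09:00-17:00 local
    covers_utc: "00:00-08:00"
  - region: europe          # UTC+1, 08:00-16:00 local
    covers_utc: "07:00-15:00"
  - region: us_pacific      # UTC-8, 06:00-14:00 local
    covers_utc: "14:00-22:00"
  - region: us_eastern      # UTC-5, 14:00-22:00 local
    covers_utc: "19:00-03:00"
  # Adjacent windows overlap, giving each region a handoff buffer instead of a hard cutover.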
Alert Design: The War on False Positives
Alert fatigue is the number one cause of failed incident response. When engineers see 50 alerts a day, they treat them all like noise — including the one that matters.
Alert Classification Framework
# Every alert must answer: "What should a human do RIGHT NOW?"
# If the answer is "nothing" — it is not an alert, it is a log entry.
severity_levels:
  P1_critical:
    definition: "Customer-facing impact. Revenue loss or data risk."
    response: "Immediate page. Wake people up."
    examples:
      - "Payment processing is down"
      - "Data loss detected"
      - "Security breach indicator"
    sla: "Acknowledge in 5 min. Mitigate in 30 min."
  P2_high:
    definition: "Degraded experience. Customers affected but service functional."
    response: "Page during business hours. Slack + escalation off-hours."
    examples:
      - "API latency > 2x normal for 10 minutes"
      - "Error rate > 5% for 5 minutes"
      - "Database connection pool > 80% for 15 minutes"
    sla: "Acknowledge in 15 min. Mitigate in 2 hours."
  P3_warning:
    definition: "Potential issue. No customer impact yet."
    response: "Slack notification. Address next business day."
    examples:
      - "Disk usage > 70%"
      - "Certificate expires in 14 days"
      - "Dependency deprecation warning"
    sla: "Address within 1 business week."
  informational:
    definition: "Good to know. No action required."
    response: "Dashboard only. Never page or Slack."
    examples:
      - "Deploy completed successfully"
      - "Nightly backup completed"
      - "Auto-scaler added 2 nodes"
Alert Quality Rules
| Rule | Why |
|---|---|
| Every alert must have a runbook link | So the engineer knows what to do, not just what is broken |
| Every alert must fire on symptoms, not causes | "Users cannot check out", not "CPU is at 90%" |
| Every alert that fires without action gets deleted | If nobody does anything, it is noise |
| Alert thresholds must be based on SLOs, not gut feel | "Error rate > 0.1% of error budget", not "error rate > 1%" |
| Alerts must have a minimum duration before firing | Prevent transient spikes from paging: `for: 5m` minimum (see the example rule below) |
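These rules translate directly into alert definitions. A sketch assuming a Prometheus-style rule file (the metric names, labels, threshold, and runbook URL are placeholders; adapt to your monitoring stack): it fires on a symptom, waits `for: 5m` to suppress transient spikes, and carries a runbook link in its annotations.

groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorRateHigh
        # Symptom, not cause: failed checkout requests, not CPU or memory
        expr: |
          sum(rate(http_requests_total{route="/checkout", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{route="/checkout"}[5m])) > 0.05
        for: 5m                        # minimum duration: a transient spike never pages
        labels:
          severity: P2
        annotations:
          summary: "Checkout error rate above 5% for 5 minutes"
          runbook_url: "https://runbooks.example.com/checkout/error-rate"   # placeholder link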
Runbook Architecture
A runbook is the difference between a 5-minute resolution and a 2-hour one. Every alert should link to a runbook that tells the engineer exactly what to do.
Runbook Template
# [Service Name]: [Alert Name]
## What is happening
One sentence: what is broken and who is affected.
## Severity and Impact
- Customer impact: [None / Degraded / Down]
- Revenue impact: [None / Estimated $/minute]
- Blast radius: [Single user / Segment / All users]
## Immediate Actions (Do These First)
1. Check dashboard: [link]
2. Check recent deploys: [link]
3. If recent deploy, rollback: `kubectl rollout undo deployment/[service]`
## Diagnosis
- Check logs: `kubectl logs -l app=[service] --since=15m | grep ERROR`
- Check database: [query to run]
- Check dependencies: [health check URLs]
## Common Root Causes
| Symptom | Likely Cause | Fix |
|---|---|---|
| 5xx errors spike after deploy | Bad code deploy | Rollback |
| Latency spike, no deploy | Database slow query | Kill query, add index |
| Connection refused | Pod OOM killed | Increase memory limit |
## Escalation
- Primary on-call cannot resolve in 30 min → escalate to [secondary]
- Secondary cannot resolve in 1 hour → escalate to [engineering manager]
- Data breach suspected → immediately notify [security team + legal]
## Post-Incident
- File incident report: [link]
- Schedule post-mortem if P1 or P2
Compensation and Recognition
On-call engineers are doing work outside normal hours to keep your business running. If you do not compensate them, you do not value their time, and they will go somewhere that does.
Compensation Models
| Model | Structure | Fairness |
|---|---|---|
| Flat stipend | $500-$1,000/week on-call | Simple, predictable |
| Per-page bonus | $50-$200 per off-hours page | Incentivizes actually responding |
| Comp time | Day off after every on-call shift | Prevents burnout accumulation |
| Combined | Stipend + comp time | Best of both worlds |
| Nothing | "It is part of the job" | Engineers will leave |
The bare minimum: If you cannot pay a stipend, give comp time. One day off for every week of on-call duty. This is not generous — it is basic respect.
Implementation Checklist
- Measure current on-call health: pages per shift, false positive rate, engineer satisfaction
- Audit every alert: can the engineer take action? If not, delete the alert
- Write runbooks for every P1 and P2 alert (start here — P3 can wait)
- Design rotation with minimum 5 engineers; go follow-the-sun if distributed
- Implement primary/secondary for critical services
- Establish compensation: stipend, comp time, or both
- Hold monthly on-call retrospectives: what alerts were noise? what runbooks were wrong?
- Track and publish on-call health metrics publicly within the team