On-Call That Does Not Destroy Your Team: Designing Sustainable Incident Response
Build an on-call rotation that engineers do not dread. Covers compensation, escalation design, runbook architecture, alert fatigue elimination, and the cultural patterns that separate healthy on-call from burnout factories.
On-call is the tax you pay for running software in production. Like actual taxes, it can be structured fairly or it can be predatory. Most engineering organizations land on predatory — not out of malice, but out of neglect. Nobody designs the on-call experience. It just happens, and by the time anyone notices it is broken, three engineers have quit and the remaining ones have the thousand-yard stare of people who have been woken up at 3 AM too many times.
This guide covers how to design on-call that respects human beings while still keeping production alive.
The On-Call Health Assessment
Before redesigning anything, diagnose where you are:
| Metric | Healthy | Unhealthy | Critical |
|---|---|---|---|
| Pages per on-call shift | 0-2 | 3-10 | 10+ |
| Pages outside business hours | 0-1 | 2-5 | 5+ |
| False positive rate | < 10% | 10-50% | > 50% |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Mean time to resolve | < 30 min | 30 min - 2 hr | > 2 hr |
| On-call engineer satisfaction | "Manageable" | "Stressful" | "Looking for a new job" |
| Follow-up action items completed | > 80% | 50-80% | < 50% |
The single most important metric is pages-per-shift. If your on-call engineers are getting paged more than twice per shift, your alerts are broken, not your engineers. Fix the alerts.
Rotation Design
The Math of Fair Rotations
Team size: 6 engineers
Rotation length: 1 week
On-call frequency: Every 6 weeks
Annual on-call burden: ~8.7 weeks per engineer

Team size: 4 engineers (common for startups)
Rotation length: 1 week
On-call frequency: Every 4 weeks
Annual on-call burden: ~13 weeks per engineer ← This burns people out
Minimum viable team for sustainable on-call: 5 engineers
Below 5, consider shared on-call across teams or follow-the-sun.
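Where your scheduling tool allows it, encode the rotation as configuration rather than a shared calendar. A minimal sketch of the six-person weekly rotation above, assuming a generic YAML schedule format (the field names and engineer names are illustrative, not any specific tool's schema):

rotation:
  name: platform-primary
  members: [alice, bob, carol, dana, evan, farah]   # hypothetical names; 6 engineers
  shift_length: 1w                                  # weekly handoff
  handoff: "Monday 10:00 local"                     # hand off during business hours, never at midnight
  # 52 weeks / 6 engineers ≈ 8.7 on-call weeks per engineer per year

Handing off mid-morning on a weekday means the outgoing and incoming engineers can talk through anything still open.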
Rotation Structures
| Structure | Best For | Tradeoff |
|---|---|---|
| Weekly rotation | Most teams | Predictable, but a bad week is painful |
| Daily rotation | Small teams | Spreads burden, but constant handoffs |
| Follow-the-sun | Distributed teams | Nobody gets paged at night. Requires 3+ timezones |
| Primary/secondary | Critical services | Always has backup, but doubles the on-call pool needed |
| Tiered by severity | Large organizations | P1 goes to SRE, P2 to the team's on-call. Reduces noise per person (see the escalation sketch below) |
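Primary/secondary and severity tiering combine naturally into an escalation policy. A sketch in the same spirit, assuming a generic paging-tool configuration (the structure and names are illustrative):

escalation_policy:
  P1_critical:
    - notify: team_primary            # immediate page
    - after: 15m
      notify: team_secondary          # backup picks up if the primary does not acknowledge
    - after: 30m
      notify: sre_on_call             # tiered: unresolved P1s escalate to the SRE rotation
  P2_high:
    - notify: team_primary            # page during business hours; Slack off-hours
    - after: 60m
      notify: engineering_manager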
Follow-the-Sun Configuration
If you have engineers in enough timezones, nobody should ever be paged at 3 AM:
US Pacific (UTC-8): 06:00 - 14:00 local = 14:00 - 22:00 UTC
US Eastern (UTC-5): 14:00 - 22:00 local = 19:00 - 03:00 UTC
Europe (UTC+1): 08:00 - 16:00 local = 07:00 - 15:00 UTC
Asia Pacific (UTC+9): 09:00 - 17:00 local = 00:00 - 08:00 UTC
Result: 24-hour coverage, nobody works outside normal hours.
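Expressed as configuration, follow-the-sun is simply one rotation layer per region with explicit UTC windows. A sketch of the handoffs above (region and field names are illustrative):

follow_the_sun:
  - region: asia_pacific    # UTC+9, 09:00-17:00 local
    covers_utc: "00:00-08:00"
  - region: europe          # UTC+1, 08:00-16:00 local
    covers_utc: "07:00-15:00"
  - region: us_pacific      # UTC-8, 06:00-14:00 local
    covers_utc: "14:00-22:00"
  - region: us_eastern      # UTC-5, 14:00-22:00 local
    covers_utc: "19:00-03:00"
  # Adjacent windows overlap, giving each region a handoff buffer instead of a hard cutover.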
Alert Design: The War on False Positives
Alert fatigue is the number one cause of failed incident response. When engineers see 50 alerts a day, they treat them all like noise — including the one that matters.
Alert Classification Framework
# Every alert must answer: "What should a human do RIGHT NOW?"
# If the answer is "nothing" — it is not an alert, it is a log entry.
severity_levels:
  P1_critical:
    definition: "Customer-facing impact. Revenue loss or data risk."
    response: "Immediate page. Wake people up."
    examples:
      - "Payment processing is down"
      - "Data loss detected"
      - "Security breach indicator"
    sla: "Acknowledge in 5 min. Mitigate in 30 min."
  P2_high:
    definition: "Degraded experience. Customers affected but service functional."
    response: "Page during business hours. Slack + escalation off-hours."
    examples:
      - "API latency > 2x normal for 10 minutes"
      - "Error rate > 5% for 5 minutes"
      - "Database connection pool > 80% for 15 minutes"
    sla: "Acknowledge in 15 min. Mitigate in 2 hours."
  P3_warning:
    definition: "Potential issue. No customer impact yet."
    response: "Slack notification. Address next business day."
    examples:
      - "Disk usage > 70%"
      - "Certificate expires in 14 days"
      - "Dependency deprecation warning"
    sla: "Address within 1 business week."
  informational:
    definition: "Good to know. No action required."
    response: "Dashboard only. Never page or Slack."
    examples:
      - "Deploy completed successfully"
      - "Nightly backup completed"
      - "Auto-scaler added 2 nodes"
Alert Quality Rules
| Rule | Why |
|---|---|
| Every alert must have a runbook link | So the engineer knows what to do, not just what is broken |
| Every alert must fire on symptoms, not causes | "Users cannot check out", not "CPU is at 90%" |
| Every alert that fires without action gets deleted | If nobody does anything, it is noise |
| Alert thresholds must be based on SLOs, not gut feel | "Error rate > 0.1% of error budget", not "error rate > 1%" |
| Alerts must have a minimum duration before firing | Prevent transient spikes from paging: `for: 5m` minimum (see the example rule below) |
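These rules translate directly into alert definitions. A sketch assuming a Prometheus-style rule file (the metric names, labels, threshold, and runbook URL are placeholders; adapt to your monitoring stack): it fires on a symptom, waits `for: 5m` to suppress transient spikes, and carries a runbook link in its annotations.

groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorRateHigh
        # Symptom, not cause: failed checkout requests, not CPU or memory
        expr: |
          sum(rate(http_requests_total{route="/checkout", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{route="/checkout"}[5m])) > 0.05
        for: 5m                        # minimum duration: a transient spike never pages
        labels:
          severity: P2
        annotations:
          summary: "Checkout error rate above 5% for 5 minutes"
          runbook_url: "https://runbooks.example.com/checkout/error-rate"   # placeholder link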
Runbook Architecture
A runbook is the difference between a 5-minute resolution and a 2-hour one. Every alert should link to a runbook that tells the engineer exactly what to do.
Runbook Template
# [Service Name]: [Alert Name]
## What is happening
One sentence: what is broken and who is affected.
## Severity and Impact
- Customer impact: [None / Degraded / Down]
- Revenue impact: [None / Estimated $/minute]
- Blast radius: [Single user / Segment / All users]
## Immediate Actions (Do These First)
1. Check dashboard: [link]
2. Check recent deploys: [link]
3. If recent deploy, rollback: `kubectl rollout undo deployment/[service]`
## Diagnosis
- Check logs: `kubectl logs -l app=[service] --since=15m | grep ERROR`
- Check database: [query to run]
- Check dependencies: [health check URLs]
## Common Root Causes
| Symptom | Likely Cause | Fix |
|---|---|---|
| 5xx errors spike after deploy | Bad code deploy | Rollback |
| Latency spike, no deploy | Database slow query | Kill query, add index |
| Connection refused | Pod OOM killed | Increase memory limit |
## Escalation
- Primary on-call cannot resolve in 30 min → escalate to [secondary]
- Secondary cannot resolve in 1 hour → escalate to [engineering manager]
- Data breach suspected → immediately notify [security team + legal]
## Post-Incident
- File incident report: [link]
- Schedule post-mortem if P1 or P2
Compensation and Recognition
On-call engineers are doing work outside normal hours to keep your business running. If you do not compensate them, you do not value their time, and they will go somewhere that does.
Compensation Models
| Model | Structure | Fairness |
|---|---|---|
| Flat stipend | $500-$1,000/week on-call | Simple, predictable |
| Per-page bonus | $50-$200 per off-hours page | Incentivizes actually responding |
| Comp time | Day off after every on-call shift | Prevents burnout accumulation |
| Combined | Stipend + comp time | Best of both worlds |
| Nothing | "It is part of the job" | Engineers will leave |
The bare minimum: If you cannot pay a stipend, give comp time. One day off for every week of on-call duty. This is not generous — it is basic respect.
Implementation Checklist
- Measure current on-call health: pages per shift, false positive rate, engineer satisfaction
- Audit every alert: can the engineer take action? If not, delete the alert
- Write runbooks for every P1 and P2 alert (start here — P3 can wait)
- Design rotation with minimum 5 engineers; go follow-the-sun if distributed
- Implement primary/secondary for critical services
- Establish compensation: stipend, comp time, or both
- Hold monthly on-call retrospectives: what alerts were noise? what runbooks were wrong?
- Track and publish on-call health metrics publicly within the team