On-Call Engineering
Design on-call rotations that are sustainable, effective, and respectful of engineers' time. Covers rotation design, escalation policies, runbook standards, compensation models, and preventing on-call burnout.
On-call is the commitment that production is someone’s explicit responsibility at every moment. Without on-call, incidents are handled by whoever happens to notice, so a critical alert at 3 AM can go unnoticed until customers complain at 9 AM.
Done well, on-call is a manageable responsibility that develops deep system understanding. Done poorly, it is a burnout machine that drives engineers to quit.
Rotation Design
Team Size Requirements
Minimum viable on-call: 4 engineers
- Each engineer is on-call 1 week per month
- Allows for PTO, sick days, swaps
Healthy on-call: 6-8 engineers
- Each engineer is on-call 1 week every 6-8 weeks
- Sustainable long-term
- Buffer for coverage gaps
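The arithmetic behind these team sizes can be sketched as follows. This is a minimal illustration that assumes one-week shifts, a single primary per week, and an evenly shared rotation; the function name `oncall_weeks_per_year` is hypothetical.

```python
def oncall_weeks_per_year(team_size: int, weeks_in_year: int = 52) -> float:
    """Average primary on-call weeks per engineer per year.

    Assumes one-week shifts shared equally across the team.
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return weeks_in_year / team_size


# Compare an unhealthy two-person rotation with the sizes discussed above.
for size in (2, 4, 6, 8):
    print(f"{size} engineers: {oncall_weeks_per_year(size):.1f} weeks/year each")
```

The number ignores PTO and swaps, so real load per engineer is somewhat higher than the average.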
Rotation Patterns
Primary + Secondary Model:
Primary: First responder, handles all pages
Secondary: Backup if primary doesn't acknowledge within the escalation timeout (5 min in the policy below), or for escalation
Week 1: Alice (P), Bob (S)
Week 2: Bob (P), Carol (S)
Week 3: Carol (P), Dave (S)
Week 4: Dave (P), Alice (S)
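The pattern above, where each week's secondary becomes the next week's primary, can be generated with a short script. This is a sketch rather than a replacement for a scheduling tool; `build_rotation` is a hypothetical helper name.

```python
def build_rotation(engineers: list[str], weeks: int) -> list[tuple[int, str, str]]:
    """Return (week_number, primary, secondary) tuples.

    The secondary for a week is the next engineer in the list, so each
    secondary becomes primary the following week.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((week + 1, primary, secondary))
    return schedule


for week, primary, secondary in build_rotation(["Alice", "Bob", "Carol", "Dave"], 4):
    print(f"Week {week}: {primary} (P), {secondary} (S)")
```

One benefit of this ordering is that the incoming primary already shadowed the previous week as secondary, so context carries over.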
Handoff Protocol
End of shift checklist:
1. Document any ongoing issues in the incident channel
2. List any alerts that fired and their resolution
3. Note any alerts that need investigation but are not urgent
4. Flag any upcoming maintenance or deployments
5. Update on-call tool (PagerDuty schedule)
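One way to make the checklist hard to skip is to capture it as a structured note that is posted at every handoff. The sketch below is an assumption about how a team might do this; the `HandoffNote` class and its field names are hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffNote:
    outgoing: str
    incoming: str
    ongoing_issues: list[str] = field(default_factory=list)
    alerts_fired: list[str] = field(default_factory=list)
    needs_investigation: list[str] = field(default_factory=list)
    upcoming_changes: list[str] = field(default_factory=list)
    schedule_updated: bool = False

    def render(self) -> str:
        """Format the note as plain text, one section per checklist item."""
        sections = [
            ("Ongoing issues", self.ongoing_issues),
            ("Alerts fired and resolution", self.alerts_fired),
            ("Needs investigation (not urgent)", self.needs_investigation),
            ("Upcoming maintenance or deployments", self.upcoming_changes),
        ]
        lines = [f"On-call handoff: {self.outgoing} -> {self.incoming}"]
        for title, items in sections:
            lines.append(f"{title}:")
            # Print an explicit "none" so an empty section is a statement, not an omission.
            lines.extend(f"  - {item}" for item in (items or ["none"]))
        status = "yes" if self.schedule_updated else "NO"
        lines.append(f"On-call schedule updated: {status}")
        return "\n".join(lines)


note = HandoffNote(
    outgoing="Alice",
    incoming="Bob",
    alerts_fired=["Disk usage alert on a worker node, cleared after log rotation"],
    schedule_updated=True,
)
print(note.render())
```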
Escalation Policy
escalation_policy:
  - level: 1
    target: primary_oncall
    timeout: 5_minutes
  - level: 2
    target: secondary_oncall
    timeout: 10_minutes
  - level: 3
    target: engineering_manager
    timeout: 15_minutes
  - level: 4
    target: vp_engineering
    timeout: 30_minutes
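To see how this policy plays out over time, the sketch below computes who has been paged for an alert that nobody has acknowledged. It assumes each `timeout` is how long that level has to acknowledge before the next level is paged, so the timeouts add up (secondary at 5 minutes, manager at 15, VP at 30). Paging tools may interpret timeouts differently, and `paged_targets` is a hypothetical helper.

```python
# (target, minutes this level has to acknowledge before escalating further)
ESCALATION_POLICY = [
    ("primary_oncall", 5),
    ("secondary_oncall", 10),
    ("engineering_manager", 15),
    ("vp_engineering", 30),
]


def paged_targets(minutes_unacknowledged: float) -> list[str]:
    """Everyone who has been paged so far for an unacknowledged alert."""
    paged = []
    level_starts_at = 0.0
    for target, timeout in ESCALATION_POLICY:
        if minutes_unacknowledged < level_starts_at:
            break
        paged.append(target)
        level_starts_at += timeout  # the next level starts when this one times out
    return paged


for minutes in (0, 5, 15, 30):
    print(minutes, paged_targets(minutes))
```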
Alert Quality
Alert Classification
Page (wake someone up):
- Service is down
- Error rate > 5% for 5+ minutes
- Customer-facing latency > 5s
- Data pipeline SLA breach imminent
Notify (Slack, can wait):
- Disk usage > 80%
- Elevated error rate (1-5%)
- Certificate expiring in 14 days
- Non-critical job failure
Log (dashboard only):
- CPU spike (no customer impact)
- Cache hit rate below optimal
- Non-critical dependency degraded
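The error-rate thresholds in this classification can be expressed as a small routing function. This is a sketch under two assumptions not stated above: a spike over 5% that has not yet lasted 5 minutes is treated as a notification, and anything under 1% is only logged. `Severity` and `classify_error_rate` are hypothetical names.

```python
from enum import Enum


class Severity(Enum):
    PAGE = "page"      # wake someone up
    NOTIFY = "notify"  # Slack, can wait
    LOG = "log"        # dashboard only


def classify_error_rate(error_rate_pct: float, sustained_minutes: float) -> Severity:
    """Route an error-rate signal to page, notify, or log."""
    if error_rate_pct > 5 and sustained_minutes >= 5:
        return Severity.PAGE
    if error_rate_pct >= 1:
        # Covers the elevated 1-5% band and (by assumption) short spikes above 5%.
        return Severity.NOTIFY
    return Severity.LOG


print(classify_error_rate(7.5, sustained_minutes=6))
print(classify_error_rate(3.0, sustained_minutes=60))
```

Keeping the rule in one place makes the monthly review easier: changing a threshold is a one-line diff rather than a hunt through alert definitions.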
Alert Hygiene
Review alerts monthly:
Last Month Alert Report:
Total pages: 47
Actionable: 31 (66%) ← target: >80%
False positives: 12 (26%) ← target: <10%
Noisy (duplicate): 4 (8%) ← target: 0%
Actions:
- Tune threshold on memory alert (caused 8 false positives)
- Deduplicate disk space alerts
- Add alert for the issue that required customer escalation
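The report above is simple enough to compute from raw counts. The following sketch derives the percentages and checks them against the stated targets; the function names are hypothetical, and it assumes every page falls into exactly one of the three categories.

```python
def alert_report(actionable: int, false_positives: int, duplicates: int) -> dict[str, float]:
    """Summarize one month of pages as a total and percentage shares."""
    total = actionable + false_positives + duplicates
    if total == 0:
        raise ValueError("no pages to report on")
    return {
        "total": total,
        "actionable_pct": 100 * actionable / total,
        "false_positive_pct": 100 * false_positives / total,
        "duplicate_pct": 100 * duplicates / total,
    }


def missed_targets(report: dict[str, float]) -> list[str]:
    """Names of the targets that were not met."""
    missed = []
    if report["actionable_pct"] <= 80:      # target: >80%
        missed.append("actionable")
    if report["false_positive_pct"] >= 10:  # target: <10%
        missed.append("false_positives")
    if report["duplicate_pct"] > 0:         # target: 0%
        missed.append("duplicates")
    return missed


report = alert_report(actionable=31, false_positives=12, duplicates=4)
print(report["total"], missed_targets(report))
```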
Runbooks
Every alert should have a linked runbook:
# Runbook: High Error Rate on Order Service
## Alert
`order_service_error_rate > 5% for 5 minutes`
## Impact
Customers may see errors during checkout
## Diagnosis Steps
1. Check error logs: `kubectl logs -l app=order-service --tail=100`
2. Check dependent services:
   - Payment service: https://grafana.internal/d/payment
   - Database: https://grafana.internal/d/postgres
3. Check recent deploys: `kubectl rollout history deploy/order-service`
## Common Causes
### Bad Deploy
Symptoms: Error spike immediately after deploy
Fix: `kubectl rollout undo deploy/order-service`
### Database Connection Exhaustion
Symptoms: "connection pool exhausted" in logs
Fix: Restart service: `kubectl rollout restart deploy/order-service`
Root cause: Investigate the connection leak on the next business day
### Downstream Service Failure
Symptoms: Timeout errors to payment-service or inventory-service
Fix: Check downstream service status, activate circuit breaker if needed
## Escalation
If unable to resolve within 15 minutes, escalate to secondary on-call.
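The rule that every alert links to a runbook is easy to enforce automatically, for example in CI. The sketch below assumes alert definitions are available as dictionaries with `name` and `runbook_url` fields; both field names and the example URL are hypothetical, so adapt them to however your alerts are actually defined.

```python
def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts with no (or an empty) runbook link."""
    return [
        alert["name"]
        for alert in alerts
        if not str(alert.get("runbook_url") or "").strip()
    ]


alerts = [
    {
        "name": "order_service_error_rate",
        # Hypothetical URL, for illustration only.
        "runbook_url": "https://wiki.internal/runbooks/order-service-error-rate",
    },
    {"name": "disk_usage_high", "runbook_url": ""},
    {"name": "cert_expiring"},
]

missing = alerts_missing_runbook(alerts)
if missing:
    print("Alerts without a runbook:", ", ".join(missing))
```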
Compensation and Sustainability
On-Call Compensation Models
| Model | Details |
|---|---|
| Flat weekly rate | $500-1000 per on-call week |
| Per-incident bonus | Base rate + $X per page responded to |
| Time-off in lieu | 1 comp day per on-call week |
| Higher base salary | On-call expectation factored into total comp |
Burnout Prevention
- Maximum consecutive days: 7 (no multi-week rotations)
- Follow-the-sun: For global teams, hand off to a team in another timezone
- No-page improvements: Track and eliminate recurring pages
- Post-incident recovery: If paged after midnight, no meetings before noon
- Opt-out periods: Engineers can block on-call during exam weeks, moves, etc.
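The seven-day limit can be checked against a schedule before it is published. This sketch assumes shifts are given as `(engineer, first_day, last_day)` tuples with inclusive dates and treats back-to-back or overlapping shifts as one continuous stretch; `overlong_stretches` is a hypothetical helper.

```python
from datetime import date, timedelta

MAX_CONSECUTIVE_DAYS = 7


def overlong_stretches(shifts: list[tuple[str, date, date]]) -> list[tuple[str, int]]:
    """Return (engineer, days) for each on-call stretch longer than the limit."""
    spans_by_engineer: dict[str, list[tuple[date, date]]] = {}
    for engineer, first, last in shifts:
        spans_by_engineer.setdefault(engineer, []).append((first, last))

    violations = []
    for engineer, spans in spans_by_engineer.items():
        spans.sort()
        runs = []
        run_start, run_end = spans[0]
        for first, last in spans[1:]:
            if first <= run_end + timedelta(days=1):
                run_end = max(run_end, last)  # adjacent or overlapping: same stretch
            else:
                runs.append((run_start, run_end))
                run_start, run_end = first, last
        runs.append((run_start, run_end))
        for start, end in runs:
            days = (end - start).days + 1
            if days > MAX_CONSECUTIVE_DAYS:
                violations.append((engineer, days))
    return violations


schedule = [
    ("Alice", date(2025, 1, 6), date(2025, 1, 12)),
    ("Alice", date(2025, 1, 13), date(2025, 1, 19)),  # back-to-back week
    ("Bob", date(2025, 1, 20), date(2025, 1, 26)),
]
print(overlong_stretches(schedule))
```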
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Same 2 engineers always on-call | Burnout, knowledge hoarding | Minimum 4-person rotation |
| No runbooks | On-call relies on tribal knowledge | Every alert links to a runbook |
| > 5 pages per week | Alert fatigue, pages get ignored | Tune alerts, fix noisy ones |
| No compensation | Resentment, people avoid on-call | Compensate fairly |
| No feedback loop | Same alerts page forever | Monthly alert review, action items |
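The "> 5 pages per week" anti-pattern in the table above is straightforward to detect from page timestamps. The sketch below groups pages by ISO week and flags the weeks over the limit; it assumes you can export page times from your paging tool, and `noisy_weeks` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime

MAX_PAGES_PER_WEEK = 5


def noisy_weeks(page_times: list[datetime]) -> dict[tuple[int, int], int]:
    """Map (ISO year, ISO week) to page count for weeks over the limit."""
    counts = Counter(tuple(t.isocalendar()[:2]) for t in page_times)
    return {week: n for week, n in counts.items() if n > MAX_PAGES_PER_WEEK}


# Six pages in one week (Mon 6 Jan to Sat 11 Jan 2025), two in the next.
pages = [datetime(2025, 1, day, 3, 0) for day in (6, 7, 8, 9, 10, 11, 13, 14)]
print(noisy_weeks(pages))
```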
On-call is not a punishment. It is an investment in production reliability and in engineers’ deep system understanding. But it must be designed with the same care as any other engineering system.