
On-Call That Does Not Destroy Your Team: Designing Sustainable Incident Response

Build an on-call rotation that engineers do not dread. Covers compensation, escalation design, runbook architecture, alert fatigue elimination, and the cultural patterns that separate healthy on-call from burnout factories.

On-call is the tax you pay for running software in production. Like actual taxes, it can be structured fairly or it can be predatory. Most engineering organizations land on predatory — not out of malice, but out of neglect. Nobody designs the on-call experience. It just happens, and by the time anyone notices it is broken, three engineers have quit and the remaining ones have the thousand-yard stare of people who have been woken up at 3 AM too many times.

This guide covers how to design on-call that respects human beings while still keeping production alive.


The On-Call Health Assessment

Before redesigning anything, diagnose where you are:

| Metric | Healthy | Unhealthy | Critical |
|---|---|---|---|
| Pages per on-call shift | 0-2 | 3-10 | 10+ |
| Pages outside business hours | 0-1 | 2-5 | 5+ |
| False positive rate | < 10% | 10-50% | > 50% |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Mean time to resolve | < 30 min | 30 min - 2 hr | > 2 hr |
| On-call engineer satisfaction | "Manageable" | "Stressful" | "Looking for a new job" |
| Follow-up action items completed | > 80% | 50-80% | < 50% |

The single most important metric is pages-per-shift. If your on-call engineers are getting paged more than twice per shift, your alerts are broken, not your engineers. Fix the alerts.
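These metrics are easy to compute once pages are logged somewhere queryable. A minimal sketch in Python, assuming a hypothetical page log of (timestamp, shift, actionable) records — the field names and log shape are illustrative, not a real pager API:

```python
from collections import Counter

# Hypothetical page log: (timestamp, shift_id, actionable) records.
# "actionable" is False when the responder did nothing (a false positive).
pages = [
    ("2024-03-04T02:15:00Z", "week-10", False),
    ("2024-03-04T14:30:00Z", "week-10", True),
    ("2024-03-11T09:00:00Z", "week-11", True),
]

def pages_per_shift(pages):
    """Average number of pages per on-call shift."""
    by_shift = Counter(shift for _, shift, _ in pages)
    return sum(by_shift.values()) / len(by_shift)

def false_positive_rate(pages):
    """Fraction of pages that required no action."""
    noise = sum(1 for _, _, actionable in pages if not actionable)
    return noise / len(pages)

print(pages_per_shift(pages))      # 1.5
print(false_positive_rate(pages))  # ~0.33
```

Run this against a month of real pager exports rather than a single shift, so one bad week does not skew the picture.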


Rotation Design

The Math of Fair Rotations

Team size: 6 engineers
Rotation length: 1 week
On-call frequency: Every 6 weeks
Annual on-call burden: ~8.7 weeks per engineer

Team size: 4 engineers (common for startups)
Rotation length: 1 week
On-call frequency: Every 4 weeks
Annual on-call burden: ~13 weeks per engineer ← This burns people out

Minimum viable team for sustainable on-call: 5 engineers
Below 5, consider shared on-call across teams or follow-the-sun.
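The arithmetic above generalizes: for a weekly rotation with one engineer on call at a time, annual burden is simply 52 divided by team size. A quick sketch:

```python
def annual_on_call_weeks(team_size: int, weeks_per_year: int = 52) -> float:
    """Weeks per year each engineer is on call, assuming a weekly rotation."""
    return weeks_per_year / team_size

for size in (6, 5, 4):
    flag = "  <- burns people out" if size < 5 else ""
    print(f"{size} engineers: ~{annual_on_call_weeks(size):.1f} weeks/year{flag}")
```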

Rotation Structures

| Structure | Best For | Tradeoff |
|---|---|---|
| Weekly rotation | Most teams | Predictable, but a bad week is painful |
| Daily rotation | Small teams | Spreads burden, but constant handoffs |
| Follow-the-sun | Distributed teams | Nobody gets paged at night, but requires 3+ timezones |
| Primary/secondary | Critical services | Always has backup, but doubles the on-call pool needed |
| Tiered by severity | Large organizations | P1 goes to SRE, P2 to team on-call; reduces noise per person |

Follow-the-Sun Configuration

If you have engineers in enough timezones, nobody should ever be paged at 3 AM:

US Pacific (UTC-8):   10:00 - 16:00 local = 18:00 - 00:00 UTC
US Eastern (UTC-5):   07:00 - 13:00 local = 12:00 - 18:00 UTC
Europe (UTC+1):       07:00 - 13:00 local = 06:00 - 12:00 UTC
Asia Pacific (UTC+9): 09:00 - 15:00 local = 00:00 - 06:00 UTC

Result: continuous 24-hour coverage in six-hour shifts, and nobody works outside their local daytime.
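Handoff schedules like this are easy to get subtly wrong, so it is worth checking coverage mechanically rather than by eyeball. A minimal sketch in Python, using hypothetical UTC hour ranges — substitute the shift boundaries your team actually uses:

```python
# Hypothetical follow-the-sun schedule expressed as UTC hour ranges.
# Substitute your own shift boundaries before trusting the result.
SHIFTS = {
    "Asia Pacific": (0, 6),
    "Europe": (6, 12),
    "US Eastern": (12, 18),
    "US Pacific": (18, 24),
}

def coverage_gaps(shifts):
    """Return the UTC hours no region covers (empty list means full coverage)."""
    covered = set()
    for start, end in shifts.values():
        covered.update(range(start, end))
    return sorted(set(range(24)) - covered)

assert coverage_gaps(SHIFTS) == [], "schedule has uncovered hours"
```

Running this in CI against the schedule config catches the gap before the 3 AM page does.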

Alert Design: The War on False Positives

Alert fatigue is the number one cause of failed incident response. When engineers see 50 alerts a day, they treat them all like noise — including the one that matters.

Alert Classification Framework

# Every alert must answer: "What should a human do RIGHT NOW?"
# If the answer is "nothing" — it is not an alert, it is a log entry.

severity_levels:
  P1_critical:
    definition: "Customer-facing impact. Revenue loss or data risk."
    response: "Immediate page. Wake people up."
    examples:
      - "Payment processing is down"
      - "Data loss detected"
      - "Security breach indicator"
    sla: "Acknowledge in 5 min. Mitigate in 30 min."

  P2_high:
    definition: "Degraded experience. Customers affected but service functional."
    response: "Page during business hours. Slack + escalation off-hours."
    examples:
      - "API latency > 2x normal for 10 minutes"
      - "Error rate > 5% for 5 minutes"
      - "Database connection pool > 80% for 15 minutes"
    sla: "Acknowledge in 15 min. Mitigate in 2 hours."

  P3_warning:
    definition: "Potential issue. No customer impact yet."
    response: "Slack notification. Address next business day."
    examples:
      - "Disk usage > 70%"
      - "Certificate expires in 14 days"
      - "Dependency deprecation warning"
    sla: "Address within 1 business week."

  informational:
    definition: "Good to know. No action required."
    response: "Dashboard only. Never page or Slack."
    examples:
      - "Deploy completed successfully"
      - "Nightly backup completed"
      - "Auto-scaler added 2 nodes"
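The routing rules in the framework above reduce to a small dispatch function. A sketch, assuming the severity keys used above and hypothetical channel names — real routing would live in your paging tool's configuration, not application code:

```python
def route_alert(severity: str, business_hours: bool) -> str:
    """Map an alert severity to a notification channel.

    Channel names ("page", "slack", etc.) are illustrative placeholders.
    """
    if severity == "P1_critical":
        return "page"                       # always wake someone up
    if severity == "P2_high":
        return "page" if business_hours else "slack+escalation"
    if severity == "P3_warning":
        return "slack"                      # address next business day
    return "dashboard"                      # informational: never page or Slack

assert route_alert("P1_critical", business_hours=False) == "page"
assert route_alert("P2_high", business_hours=False) == "slack+escalation"
```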

Alert Quality Rules

| Rule | Why |
|---|---|
| Every alert must have a runbook link | So the engineer knows what to do, not just what is broken |
| Every alert must fire on symptoms, not causes | "Users cannot check out," not "CPU is at 90%" |
| Every alert that fires without action gets deleted | If nobody does anything, it is noise |
| Alert thresholds must be based on SLOs, not gut feel | "Error rate consuming > 0.1% of error budget," not an arbitrary "error rate > 1%" |
| Alerts must have a minimum duration before firing | Prevent transient spikes from paging: `for: 5m` minimum |
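Putting the last rules together, a symptom-based, SLO-driven alert with a minimum firing duration might look like this as a Prometheus alerting rule. The metric names, threshold, and runbook URL are illustrative placeholders, not a prescription:

```yaml
# Illustrative Prometheus alerting rule: fires on a customer-visible
# symptom, uses a threshold derived from the SLO, and waits 5 minutes
# before paging so transient spikes do not wake anyone.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutErrorRateHigh
        expr: |
          sum(rate(checkout_requests_total{status=~"5.."}[5m]))
            / sum(rate(checkout_requests_total[5m])) > 0.01
        for: 5m            # minimum duration before firing
        labels:
          severity: P1
        annotations:
          summary: "Checkout error rate above SLO threshold for 5 minutes"
          runbook: https://runbooks.example.com/checkout-errors
```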

Runbook Architecture

A runbook is the difference between a 5-minute resolution and a 2-hour one. Every alert should link to a runbook that tells the engineer exactly what to do.

Runbook Template

# [Service Name]: [Alert Name]

## What is happening
One sentence: what is broken and who is affected.

## Severity and Impact
- Customer impact: [None / Degraded / Down]
- Revenue impact: [None / Estimated $/minute]
- Blast radius: [Single user / Segment / All users]

## Immediate Actions (Do These First)
1. Check dashboard: [link]
2. Check recent deploys: [link]
3. If recent deploy, rollback: `kubectl rollout undo deployment/[service]`

## Diagnosis
- Check logs: `kubectl logs -l app=[service] --since=15m | grep ERROR`
- Check database: [query to run]
- Check dependencies: [health check URLs]

## Common Root Causes
| Symptom | Likely Cause | Fix |
|---|---|---|
| 5xx errors spike after deploy | Bad code deploy | Rollback |
| Latency spike, no deploy | Database slow query | Kill query, add index |
| Connection refused | Pod OOM killed | Increase memory limit |

## Escalation
- Primary on-call cannot resolve in 30 min → escalate to [secondary]
- Secondary cannot resolve in 1 hour → escalate to [engineering manager]
- Data breach suspected → immediately notify [security team + legal]

## Post-Incident
- File incident report: [link]
- Schedule post-mortem if P1 or P2

Compensation and Recognition

On-call engineers are doing work outside normal hours to keep your business running. If you do not compensate them, you do not value their time, and they will go somewhere that does.

Compensation Models

| Model | Structure | Tradeoff |
|---|---|---|
| Flat stipend | $500-$1,000/week on-call | Simple, predictable |
| Per-page bonus | $50-$200 per off-hours page | Incentivizes actually responding |
| Comp time | Day off after every on-call shift | Prevents burnout accumulation |
| Combined | Stipend + comp time | Best of both worlds |
| Nothing | "It is part of the job" | Engineers will leave |

The bare minimum: If you cannot pay a stipend, give comp time. One day off for every week of on-call duty. This is not generous — it is basic respect.
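Budgeting for this is simple arithmetic. A rough sketch of the annual cost of a combined model, assuming one engineer on call per week; the loaded day rate used to price comp time is a placeholder figure, not a benchmark:

```python
def annual_on_call_cost(weekly_stipend: float,
                        comp_days_per_shift: int = 1,
                        loaded_day_rate: float = 800.0) -> float:
    """Rough annual cost of compensating one weekly on-call rotation.

    Assumes 52 shifts per year with one engineer on call at a time.
    loaded_day_rate is a placeholder; use your own fully-loaded cost.
    """
    stipend = 52 * weekly_stipend
    comp_time = 52 * comp_days_per_shift * loaded_day_rate
    return stipend + comp_time

# Example: $750/week stipend plus one comp day per shift.
print(f"${annual_on_call_cost(750):,.0f}/year")  # $80,600/year
```

Compare that number to the cost of replacing one burned-out senior engineer; the stipend usually wins.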


Implementation Checklist

  • Measure current on-call health: pages per shift, false positive rate, engineer satisfaction
  • Audit every alert: can the engineer take action? If not, delete the alert
  • Write runbooks for every P1 and P2 alert (start here — P3 can wait)
  • Design rotation with minimum 5 engineers; go follow-the-sun if distributed
  • Implement primary/secondary for critical services
  • Establish compensation: stipend, comp time, or both
  • Hold monthly on-call retrospectives: what alerts were noise? what runbooks were wrong?
  • Track and publish on-call health metrics publicly within the team
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
