
On-Call Engineering

Design on-call rotations that are sustainable, effective, and respectful of engineers' time. Covers rotation design, escalation policies, runbook standards, compensation models, and preventing on-call burnout.

On-call is the commitment that production is someone’s explicit responsibility at every moment. Without on-call, incidents get handled by whoever happens to notice them, which means a critical alert at 3 AM can go unnoticed until customers complain at 9 AM.

Done well, on-call is a manageable responsibility that develops deep system understanding. Done poorly, it is a burnout machine that drives engineers to quit.


Rotation Design

Team Size Requirements

Minimum viable on-call: 4 engineers
  - Each engineer is on-call 1 week per month
  - Allows for PTO, sick days, swaps

Healthy on-call: 6-8 engineers
  - Each engineer is on-call 1 week every 6-8 weeks
  - Sustainable long-term
  - Buffer for coverage gaps

Rotation Patterns

Primary + Secondary Model:
  Primary:   First responder, handles all pages
  Secondary: Backup if primary doesn't acknowledge within 5 min (the level 1 timeout in the escalation policy), or for escalation
  
  Week 1: Alice (P), Bob (S)
  Week 2: Bob (P), Carol (S)
  Week 3: Carol (P), Dave (S)
  Week 4: Dave (P), Alice (S)
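
The pattern above, where each week's secondary becomes the next week's primary, can be generated rather than maintained by hand. The following is a minimal sketch, not a replacement for a scheduling tool; the function name `primary_secondary_schedule` is hypothetical, and it assumes one-week shifts and a fixed team order.

```python
def primary_secondary_schedule(engineers, weeks):
    """Return a list of (week_number, primary, secondary) tuples.

    Each week's secondary is the next engineer in the list, so they
    become primary the following week (assumed one-week shifts).
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    n = len(engineers)
    return [
        (week + 1, engineers[week % n], engineers[(week + 1) % n])
        for week in range(weeks)
    ]


team = ["Alice", "Bob", "Carol", "Dave"]
for week, primary, secondary in primary_secondary_schedule(team, 4):
    print(f"Week {week}: {primary} (P), {secondary} (S)")
```

With the four-person team above, this reproduces the Week 1 to Week 4 schedule shown. Asking for more weeks wraps around to the start of the list.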

Handoff Protocol

End of shift checklist:
  1. Document any ongoing issues in the incident channel
  2. List any alerts that fired and their resolution
  3. Note any alerts that need investigation but are not urgent
  4. Flag any upcoming maintenance or deployments
  5. Update the on-call tool (for example, the PagerDuty schedule)

Escalation Policy

escalation_policy:
  - level: 1
    target: primary_oncall
    timeout: 5_minutes
    
  - level: 2
    target: secondary_oncall
    timeout: 10_minutes
    
  - level: 3
    target: engineering_manager
    timeout: 15_minutes
    
  - level: 4
    target: vp_engineering
    timeout: 30_minutes
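
To make the timing concrete, here is a small sketch of how such a policy could be evaluated. It assumes each `timeout` is how long that level has to acknowledge before the next level is paged, which is one common reading but is not stated in the policy itself. The names `Level`, `POLICY`, and `current_target` are hypothetical and not tied to any paging tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    target: str
    timeout_minutes: int  # assumed: time to acknowledge before escalating


# Mirrors the four levels of the escalation policy above.
POLICY = [
    Level("primary_oncall", 5),
    Level("secondary_oncall", 10),
    Level("engineering_manager", 15),
    Level("vp_engineering", 30),
]


def current_target(minutes_unacknowledged, policy=POLICY):
    """Return who should hold the page after the given unacknowledged time."""
    elapsed = 0
    for level in policy:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.target
    # Past the final timeout: keep paging the last level rather than dropping the page.
    return policy[-1].target


for minutes in (0, 5, 15, 30):
    print(minutes, current_target(minutes))
```

Under this reading, the secondary is paged 5 minutes after an unacknowledged alert, the engineering manager at 15 minutes, and the VP at 30 minutes.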

Alert Quality

Alert Classification

Page (wake someone up):
  - Service is down
  - Error rate > 5% for 5+ minutes
  - Customer-facing latency > 5s
  - Data pipeline SLA breach imminent

Notify (Slack, can wait):
  - Disk usage > 80%
  - Elevated error rate (1-5%)
  - Certificate expiring in 14 days
  - Non-critical job failure

Log (dashboard only):
  - CPU spike (no customer impact)
  - Cache hit rate below optimal
  - Non-critical dependency degraded
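
The three tiers above can be expressed as a routing function. This is a simplified sketch covering only the numeric thresholds listed; the `Signal` fields and the `classify` function are hypothetical names, and a real setup would usually express these rules in the alerting system itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    service_down: bool = False
    error_rate_pct: float = 0.0
    error_rate_minutes: float = 0.0  # how long the error rate has been sustained
    latency_seconds: float = 0.0     # customer-facing latency
    disk_usage_pct: float = 0.0
    cert_days_remaining: Optional[int] = None


def classify(s: Signal) -> str:
    """Return 'page', 'notify', or 'log' using the thresholds listed above."""
    if (
        s.service_down
        or (s.error_rate_pct > 5 and s.error_rate_minutes >= 5)
        or s.latency_seconds > 5
    ):
        return "page"
    if (
        s.disk_usage_pct > 80
        or s.error_rate_pct >= 1  # includes >5% spikes not yet sustained for 5 minutes
        or (s.cert_days_remaining is not None and s.cert_days_remaining <= 14)
    ):
        return "notify"
    return "log"


print(classify(Signal(error_rate_pct=6, error_rate_minutes=5)))
print(classify(Signal(disk_usage_pct=85)))
print(classify(Signal()))
```

One design choice here is an assumption rather than something the list specifies: an error rate above 5% that has not yet lasted 5 minutes is routed to "notify" instead of being dropped.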

Alert Hygiene

Review alerts monthly:

Last Month Alert Report:
  Total pages:           47
  Actionable:            31 (66%)   ← target: >80%
  False positives:       12 (26%)   ← target: <10%
  Noisy (duplicate):      4 (9%)    ← target: 0%
  
  Actions:
  - Tune threshold on memory alert (caused 8 false positives)
  - Deduplicate disk space alerts
  - Add alert for the issue that required customer escalation
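
A report like this can be computed from a list of page outcomes. The sketch below assumes each page has already been labeled by hand as `actionable`, `false_positive`, or `duplicate`; those labels, the `TARGETS` table, and the `alert_report` function are illustrative names, with the targets taken from the report above.

```python
from collections import Counter

# Target checks from the report: >80% actionable, <10% false positives, 0% duplicates.
TARGETS = {
    "actionable": lambda pct: pct > 80,
    "false_positive": lambda pct: pct < 10,
    "duplicate": lambda pct: pct == 0,
}


def alert_report(outcomes):
    """Summarize labeled page outcomes as counts, percentages, and target status."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pages to report on")
    report = {}
    for label, meets_target in TARGETS.items():
        pct = 100 * counts[label] / total
        report[label] = {
            "count": counts[label],
            "pct": pct,
            "on_target": meets_target(pct),
        }
    return report


pages = ["actionable"] * 31 + ["false_positive"] * 12 + ["duplicate"] * 4
for label, row in alert_report(pages).items():
    status = "ok" if row["on_target"] else "MISSED"
    print(f"{label:15} {row['count']:3d} ({row['pct']:.1f}%) {status}")
```

Run against last month's 47 pages, all three categories miss their targets, which is what drives the action items listed.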

Runbooks

Every alert should have a linked runbook:

# Runbook: High Error Rate on Order Service

## Alert
`order_service_error_rate > 5% for 5 minutes`

## Impact
Customers may see errors during checkout

## Diagnosis Steps
1. Check error logs: `kubectl logs -l app=order-service --tail=100`
2. Check dependent services:
   - Payment service: https://grafana.internal/d/payment
   - Database: https://grafana.internal/d/postgres
3. Check recent deploys: `kubectl rollout history deploy/order-service`

## Common Causes

### Bad Deploy
Symptoms: Error spike immediately after deploy
Fix: `kubectl rollout undo deploy/order-service`

### Database Connection Exhaustion
Symptoms: "connection pool exhausted" in logs
Fix: Restart service: `kubectl rollout restart deploy/order-service`
Root cause: Investigate the connection leak on the next business day

### Downstream Service Failure
Symptoms: Timeout errors to payment-service or inventory-service
Fix: Check downstream service status, activate circuit breaker if needed

## Escalation
If unable to resolve within 15 minutes, escalate to secondary on-call.
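
The rule that every alert links to a runbook is easy to check automatically. The following sketch assumes alert definitions are available as dictionaries with `name` and `runbook_url` keys; both keys, the example URL, and the `alerts_missing_runbooks` function are hypothetical and not the schema of any particular alerting tool.

```python
def alerts_missing_runbooks(alerts):
    """Return the names of alerts with no runbook link (missing, None, or blank)."""
    return [
        alert["name"]
        for alert in alerts
        if not (alert.get("runbook_url") or "").strip()
    ]


# Hypothetical alert definitions for illustration only.
alerts = [
    {"name": "order_service_error_rate",
     "runbook_url": "https://wiki.example/runbooks/order-service-errors"},
    {"name": "disk_usage_high", "runbook_url": ""},
    {"name": "cert_expiry"},
]

missing = alerts_missing_runbooks(alerts)
if missing:
    print("Alerts without a runbook:", ", ".join(missing))
```

A check like this could run in CI against the alert configuration so that a new alert cannot ship without a runbook.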

Compensation and Sustainability

On-Call Compensation Models

Model               Details
Flat weekly rate    $500-1000 per on-call week
Per-incident bonus  Base rate + $X per page responded to
Time-off in lieu    1 comp day per on-call week
Higher base salary  On-call expectation factored into total comp

Burnout Prevention

  • Maximum consecutive days: 7 (no multi-week rotations)
  • Follow-the-sun: For global teams, hand off to a team in another timezone
  • No-page improvements: Track and eliminate recurring pages
  • Post-incident recovery: If paged after midnight, no meetings before noon
  • Opt-out periods: Engineers can block on-call during exam weeks, moves, etc.

Anti-Patterns

Anti-Pattern                     Consequence                         Fix
Same 2 engineers always on-call  Burnout, knowledge hoarding         Minimum 4-person rotation
No runbooks                      On-call relies on tribal knowledge  Every alert links to a runbook
> 5 pages per week               Alert fatigue, pages get ignored    Tune alerts, fix noisy ones
No compensation                  Resentment, people avoid on-call    Compensate fairly
No feedback loop                 Same alerts page forever            Monthly alert review, action items

On-call is not a punishment. It is an investment in production reliability and in engineers’ deep system understanding. But it must be designed with the same care as any other engineering system.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
