Verified by Garnet Grid

Toil Budgets and Elimination

Measure, budget, and systematically eliminate toil in SRE organizations. Covers toil taxonomy, measurement frameworks, automation ROI calculation, toil elimination strategies, and the organizational patterns that prevent toil from consuming engineering capacity.

Google's SRE practice defines toil as manual, repetitive, automatable work that has no enduring value and scales linearly with the size of the service. If your team spends 60% of its time on toil, it has only 40% left for engineering work. Google's SRE guidance caps toil at 50% of each engineer's time; a stricter target, such as under 30%, is better where the team can sustain it. Every percentage point of toil reduction is reclaimed engineering capacity.


Toil Taxonomy

Characteristics of toil:
  Manual:              Requires a human to act, even if only to run a script or click a button
  Repetitive:          Done over and over, not a novel task
  Automatable:         A machine could do it (if someone built the automation)
  Reactive:            Triggered by an external event, not a proactive choice
  No enduring value:   Does not improve the service permanently
  Scales with service: More users = more toil

NOT toil:
  ✅ Architecture design (creative, non-repetitive)
  ✅ Code review (valuable human judgment)
  ✅ Incident postmortems (permanent improvement)
  ✅ Building automation (eliminates future toil)
  ✅ On-call with novel incidents (requires engineering thinking)
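
The checklist above can be turned into a rough triage helper. This is a minimal sketch: the `Task` fields mirror the six characteristics, while the "four of six" threshold and the example scores are illustrative assumptions, not part of any official definition.

```python
from dataclasses import dataclass

# The six toil characteristics from the taxonomy above.
TRAITS = (
    "manual",
    "repetitive",
    "automatable",
    "reactive",
    "no_enduring_value",
    "scales_with_service",
)


@dataclass
class Task:
    """One recurring piece of work, scored against the six traits."""
    name: str
    manual: bool
    repetitive: bool
    automatable: bool
    reactive: bool
    no_enduring_value: bool
    scales_with_service: bool


def toil_score(task: Task) -> int:
    """Count how many toil characteristics the task exhibits (0 to 6)."""
    return sum(bool(getattr(task, trait)) for trait in TRAITS)


def is_toil(task: Task, threshold: int = 4) -> bool:
    """Treat a task as toil when it meets the (assumed) trait threshold."""
    return toil_score(task) >= threshold


# Hypothetical scores for two tasks mentioned in this guide.
cert_rotation = Task("Manual certificate rotation", True, True, True, True, True, True)
postmortem = Task("Incident postmortem", True, False, False, True, False, False)

print(is_toil(cert_rotation), is_toil(postmortem))
```

A score is only a conversation starter; borderline tasks still need a human judgment call.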

Measuring Toil

Toil Tracking

# Track toil weekly per team member
weekly_toil_log:
  engineer: "Alice"
  week: "2026-W10"
  total_hours: 40
  toil_hours: 14
  toil_percentage: 35%
  
  toil_breakdown:  # hours are weekly totals per activity and sum to toil_hours
    - activity: "Manual certificate rotation"
      hours: 3
      frequency: weekly
      automatable: true
      automation_effort: "2 days"
      
    - activity: "Scaling database replicas"
      hours: 2
      frequency: twice_weekly
      automatable: true
      automation_effort: "1 day"
      
    - activity: "Answering capacity requests via ticket"
      hours: 4
      frequency: daily
      automatable: true
      automation_effort: "1 week"
      
    - activity: "Debugging user-reported 500 errors"
      hours: 5
      frequency: daily
      automatable: partially
      automation_effort: "2 weeks (better error handling)"
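
A log like this can be summarized with a few lines of code. The sketch below is one possible approach, assuming the `hours` values are weekly totals and converting the effort estimates to working days (1 week = 5 days, 2 weeks = 10 days); the `ToilItem` type and the ranking heuristic are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    activity: str
    hours: float            # hours spent on this activity during the week
    automation_days: float  # rough automation effort, in working days


def toil_percentage(toil_hours: float, total_hours: float) -> float:
    """Share of the working week spent on toil, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * toil_hours / total_hours


items = [
    ToilItem("Manual certificate rotation", 3, 2),
    ToilItem("Scaling database replicas", 2, 1),
    ToilItem("Answering capacity requests via ticket", 4, 5),
    ToilItem("Debugging user-reported 500 errors", 5, 10),
]

toil_hours = sum(item.hours for item in items)
pct = toil_percentage(toil_hours, total_hours=40)

# Rank by weekly hours reclaimed per day of automation effort (assumed heuristic).
ranked = sorted(items, key=lambda i: i.hours / i.automation_days, reverse=True)

print(f"{toil_hours} toil hours = {pct:.0f}% of the week")
for item in ranked:
    print(f"{item.hours / item.automation_days:.1f} h/week per day: {item.activity}")
```

Ranking by hours reclaimed per day of effort favors quick wins; ranking by raw hours would put the 500-error debugging first instead.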

Automation ROI Calculation

Toil: Manual certificate rotation
  Time per occurrence: 45 minutes
  Frequency: 12 times per month
  Monthly cost: 12 × 45 min = 540 min = 9 hours
  Annual cost: 9 hours × 12 months = 108 hours; 108 hours × $150/hr = $16,200

Automation cost:
  Development: 16 hours = $2,400
  Testing: 4 hours = $600
  Maintenance: 2 hours/month = $3,600/year
  Total first year: $6,600

ROI:
  First-year net savings: $16,200 - $6,600 = $9,600 (145% ROI on the $6,600 spent)
  Later years: $16,200 - $3,600 = $12,600 net savings per year
  Payback period: ~3 months ($3,000 build cost ÷ $1,050 net monthly savings)
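
The same arithmetic can be captured in a reusable function. This is a sketch under stated assumptions: ROI is taken as first-year net savings divided by first-year automation cost, and payback as the one-time build cost divided by net monthly savings (toil cost avoided minus maintenance). The function and parameter names are illustrative.

```python
import math


def automation_roi(minutes_per_run: float, runs_per_month: float,
                   hourly_rate: float, build_hours: float,
                   maintenance_hours_per_month: float) -> dict:
    """Estimate first-year ROI and payback for automating a recurring task."""
    monthly_toil_cost = minutes_per_run / 60 * runs_per_month * hourly_rate
    annual_toil_cost = monthly_toil_cost * 12

    build_cost = build_hours * hourly_rate  # development + testing
    monthly_maintenance = maintenance_hours_per_month * hourly_rate
    first_year_cost = build_cost + monthly_maintenance * 12

    net_first_year = annual_toil_cost - first_year_cost
    net_monthly = monthly_toil_cost - monthly_maintenance

    return {
        "annual_toil_cost": annual_toil_cost,
        "first_year_cost": first_year_cost,
        "first_year_net_savings": net_first_year,
        "first_year_roi": net_first_year / first_year_cost,
        # Automation that costs more to maintain than it saves never pays back.
        "payback_months": build_cost / net_monthly if net_monthly > 0 else math.inf,
    }


# Certificate rotation example: 45 min x 12/month, $150/hr,
# 16 h development + 4 h testing, 2 h/month maintenance.
result = automation_roi(45, 12, 150, build_hours=20, maintenance_hours_per_month=2)
print(f"ROI: {result['first_year_roi']:.0%}, "
      f"payback: {result['payback_months']:.1f} months")
```

With these inputs the net first-year savings are $9,600 on $6,600 spent, and the build cost is recovered in just under three months.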

XKCD rule of thumb ("Is It Worth the Time?"):
  If automation saves 30 min on a task you do weekly:
  That is 26 hours per year, so you can spend up to 130 hours automating it (over 5 years)
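
The break-even budget behind this rule of thumb is plain multiplication: time saved per run, times runs per year, times the horizon. A minimal sketch, assuming a five-year horizon and 52 runs per year for a weekly task (the function name is illustrative):

```python
def automation_time_budget_hours(minutes_saved_per_run: float,
                                 runs_per_year: float,
                                 horizon_years: float = 5) -> float:
    """Maximum hours worth spending on automation before it costs more than it saves."""
    return minutes_saved_per_run * runs_per_year * horizon_years / 60


per_year = automation_time_budget_hours(30, 52, horizon_years=1)
five_years = automation_time_budget_hours(30, 52)
print(per_year, five_years)  # 26.0 130.0
```

Thirty minutes saved weekly comes to 26 hours in one year and 130 hours over five, so the horizon you pick changes the budget by a factor of five.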

Elimination Strategies

Automation Ladder

Level 0: Fully manual
  Human does everything, every time
  
Level 1: Documented procedure
  Human follows a runbook step by step
  
Level 2: Partially automated
  Script handles most steps, human monitors
  
Level 3: Fully automated with human trigger
  Human runs script, it does everything
  
Level 4: Fully automated with machine trigger
  Event triggers automation, human reviews result
  
Level 5: Fully autonomous
  Event triggers automation, no human involved
  Human alerted only on failure

Goal: Move every toil item up the ladder.
Most items should be Level 3+ within 6 months.
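
One way to track progress against the ladder is to record each toil item's level and flag anything below the target. The sketch below assumes a simple name-to-level inventory; the enum member names and the example inventory are hypothetical.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    MANUAL = 0               # human does everything, every time
    DOCUMENTED = 1           # human follows a runbook
    PARTIALLY_AUTOMATED = 2  # script handles most steps, human monitors
    HUMAN_TRIGGERED = 3      # human runs script, it does everything
    MACHINE_TRIGGERED = 4    # event triggers automation, human reviews
    AUTONOMOUS = 5           # no human involved unless it fails


TARGET = AutomationLevel.HUMAN_TRIGGERED  # "Level 3+ within 6 months"


def below_target(inventory: dict[str, AutomationLevel],
                 target: AutomationLevel = TARGET) -> list[str]:
    """Return the names of toil items that have not reached the target level."""
    return sorted(name for name, level in inventory.items() if level < target)


def next_rung(level: AutomationLevel) -> AutomationLevel:
    """The next step up the ladder, capped at fully autonomous."""
    return AutomationLevel(min(level + 1, AutomationLevel.AUTONOMOUS))


# Hypothetical inventory for illustration only.
inventory = {
    "certificate rotation": AutomationLevel.DOCUMENTED,
    "replica scaling": AutomationLevel.HUMAN_TRIGGERED,
    "capacity requests": AutomationLevel.MANUAL,
}

for name in below_target(inventory):
    print(f"{name}: move to {next_rung(inventory[name]).name}")
```

Moving one rung at a time keeps each step small enough to fit alongside regular sprint work.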

Self-Service Platforms

# Do not stop at automating the work behind a ticket queue.
# Build self-service so requesters can complete the task themselves.

self_service_catalog:
  - name: "Create new database"
    old_process: "File ticket → DBA reviews → DBA provisions → 3 days"
    new_process: "Click form → automated provisioning → 5 minutes"
    toil_eliminated: 4 hours/week (DBA team)
    
  - name: "Request cloud IAM role"
    old_process: "File ticket → security review → manual creation → 2 days"
    new_process: "PR to IAM repo → automated review + Terraform apply → 30 min"
    toil_eliminated: 6 hours/week (security team)
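
The core of a self-service flow is a policy check that auto-approves routine requests and routes only the exceptions to a human. The sketch below illustrates that shape for the database example; `DatabaseRequest`, the size limit, and `fake_provision` are hypothetical stand-ins for whatever form and provisioning backend a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DatabaseRequest:
    team: str
    name: str
    size_gb: int


MAX_SELF_SERVICE_GB = 500  # hypothetical policy limit for auto-approval


def handle_request(req: DatabaseRequest,
                   provision: Callable[[DatabaseRequest], str]) -> str:
    """Auto-provision requests within policy; send the rest to human review."""
    if not req.name.isidentifier():
        return "rejected: invalid database name"
    if req.size_gb > MAX_SELF_SERVICE_GB:
        return "needs review: exceeds self-service size limit"
    return "provisioned: " + provision(req)


def fake_provision(req: DatabaseRequest) -> str:
    """Stand-in for a real provisioning call; returns the new resource name."""
    return f"{req.team}-{req.name}"


print(handle_request(DatabaseRequest("payments", "orders", 50), fake_provision))
print(handle_request(DatabaseRequest("analytics", "warehouse", 2000), fake_provision))
```

Passing the provisioning function in as a parameter keeps the policy logic testable without touching real infrastructure.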

Anti-Patterns

Anti-Pattern | Consequence | Fix
Not measuring toil | Cannot improve what you don’t measure | Weekly toil tracking per engineer
Automating only easy toil | High-impact toil remains | Prioritize by ROI, not difficulty
Toil > 50% accepted as normal | No engineering capacity for improvement | SRE contract: toil must stay < 50%
Band-aid automation | Automates symptoms, not root cause | Fix underlying system issues
No toil budget in sprint planning | Toil elimination never prioritized | Dedicated automation sprint capacity

Toil is the tax on engineering capacity. Every hour spent on toil is an hour not spent on reliability improvements, feature work, or career development. Measure it, budget it, and systematically eliminate it.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
