Toil Budgets and Elimination

Google's SRE practice defines toil as manual, repetitive, automatable, tactical work that has no enduring value and scales linearly as the service grows. If your team spends 60% of its time on toil, only 40% is left for engineering work. The SRE target is to keep toil below 50% of each engineer's time, and ideally under 30%. Every percentage point of toil removed is engineering capacity reclaimed.

Toil Taxonomy

Characteristics of toil:
  Manual:              Requires a human to act, such as running a script or clicking a button
  Repetitive:          Done over and over, not a novel task
  Automatable:         A machine could do it (if someone built the automation)
  Reactive:            Triggered by an external event or interrupt, not a proactive choice
  No enduring value:   Leaves the service no better than it was before
  Scales with service: More users = more toil

NOT toil:
  ✅ Architecture design (creative, non-repetitive)
  ✅ Code review (valuable human judgment)
  ✅ Incident postmortems (permanent improvement)
  ✅ Building automation (eliminates future toil)
  ✅ On-call with novel incidents (requires engineering thinking)

Measuring Toil

Toil Tracking

# Track toil weekly per team member
weekly_toil_log:
  engineer: "Alice"
  week: "2026-W10"
  total_hours: 40
  toil_hours: 14
  toil_percentage: 35%
  
  toil_breakdown:
    - activity: "Manual certificate rotation"
      hours: 3
      frequency: weekly
      automatable: true
      automation_effort: "2 days"
      
    - activity: "Scaling database replicas"
      hours: 2
      frequency: twice_weekly
      automatable: true
      automation_effort: "1 day"
      
    - activity: "Answering capacity requests via ticket"
      hours: 4
      frequency: daily
      automatable: true
      automation_effort: "1 week"
      
    - activity: "Debugging user-reported 500 errors"
      hours: 5
      frequency: daily
      automatable: partially
      automation_effort: "2 weeks (better error handling)"
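
The log above can be checked against the toil budget with a few lines of code. The following is a minimal sketch, not a prescribed tool: `ToilEntry`, `toil_fraction`, and `TOIL_BUDGET` are hypothetical names introduced for illustration, and the hours are copied from Alice's log.

```python
from dataclasses import dataclass

TOIL_BUDGET = 0.50  # the "below 50%" target; name is an assumption for this sketch


@dataclass
class ToilEntry:
    activity: str
    hours: float


def toil_fraction(entries: list[ToilEntry], total_hours: float) -> float:
    """Return toil hours as a fraction of total hours worked."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return sum(e.hours for e in entries) / total_hours


# Alice's week, copied from the log above.
alice_week = [
    ToilEntry("Manual certificate rotation", 3),
    ToilEntry("Scaling database replicas", 2),
    ToilEntry("Answering capacity requests via ticket", 4),
    ToilEntry("Debugging user-reported 500 errors", 5),
]

fraction = toil_fraction(alice_week, total_hours=40)
print(f"toil: {fraction:.0%}, over budget: {fraction > TOIL_BUDGET}")
# prints "toil: 35%, over budget: False"
```

Summing the breakdown rather than trusting a hand-entered percentage keeps the log self-consistent: the four activities add up to the 14 toil hours recorded for the week.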

Automation ROI Calculation

Toil: Manual certificate rotation
  Time per occurrence: 45 minutes
  Frequency: 12 times per month
  Monthly cost: 12 × 45 min = 9 hours
  Annual cost: 9 hours × 12 months = 108 hours; 108 hours × $150/hr = $16,200

Automation cost:
  Development: 16 hours = $2,400
  Testing: 4 hours = $600
  Maintenance: 2 hours/month = $3,600/year
  Total first year: $6,600

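ROI:
  First-year net savings: $16,200 - $6,600 = $9,600
  First-year ROI: $9,600 / $6,600 ≈ 145% (a 59% reduction in what the task costs)
  Later years: $16,200 - $3,600 = $12,600 saved per year
  Payback period: ~3 months ($3,000 build cost / $1,050 net monthly savings)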
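
The same arithmetic can be kept in a small function so every toil item is evaluated the same way. This is a sketch under stated assumptions: `automation_roi` is a hypothetical helper, ROI is taken to mean first-year net savings divided by first-year cost, and payback is the one-time build cost divided by net monthly savings (manual cost minus maintenance).

```python
def automation_roi(manual_annual_cost: float, build_cost: float,
                   annual_maintenance: float) -> dict[str, float]:
    """First-year net savings, ROI, and payback period for automating a task."""
    first_year_cost = build_cost + annual_maintenance
    first_year_net = manual_annual_cost - first_year_cost
    # Net monthly savings once the automation is live; maintenance still costs money.
    monthly_net = (manual_annual_cost - annual_maintenance) / 12
    return {
        "first_year_net": first_year_net,
        "first_year_roi": first_year_net / first_year_cost,
        "payback_months": build_cost / monthly_net if monthly_net > 0 else float("inf"),
    }


HOURLY_RATE = 150                                   # dollars per hour, from the example
manual_annual_cost = 0.75 * 12 * 12 * HOURLY_RATE   # 45 min x 12 per month x 12 months
build_cost = (16 + 4) * HOURLY_RATE                 # development + testing hours
annual_maintenance = 2 * 12 * HOURLY_RATE           # 2 hours per month

result = automation_roi(manual_annual_cost, build_cost, annual_maintenance)
print(f"net: ${result['first_year_net']:,.0f}, "
      f"ROI: {result['first_year_roi']:.0%}, "
      f"payback: {result['payback_months']:.1f} months")
# prints "net: $9,600, ROI: 145%, payback: 2.9 months"
```

Separating the one-time build cost from recurring maintenance matters: maintenance reduces the savings every month, while the build cost is what the savings have to pay back.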

XKCD rule of thumb:
  If a task takes 30 min and you do it weekly:
  It costs 26 hours per year, so you can spend up to 130 hours automating it and still break even over 5 years
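
The break-even budget is simply time per run × runs per year × time horizon. A minimal sketch, where `break_even_hours` is a hypothetical helper and the 5-year default horizon is the one used in the rule of thumb:

```python
def break_even_hours(minutes_per_run: float, runs_per_year: float,
                     horizon_years: float = 5) -> float:
    """Hours you can spend on automation before it costs more than the manual work."""
    return minutes_per_run / 60 * runs_per_year * horizon_years


weekly_30_min = break_even_hours(minutes_per_run=30, runs_per_year=52)
one_year_only = break_even_hours(minutes_per_run=30, runs_per_year=52, horizon_years=1)
print(weekly_30_min, one_year_only)  # prints "130.0 26.0"
```

The result is an upper bound on effort, not a target: it ignores maintenance and assumes the task keeps its current frequency for the whole horizon.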

Elimination Strategies

Automation Ladder

Level 0: Fully manual
  Human does everything, every time
  
Level 1: Documented procedure
  Human follows a runbook step by step
  
Level 2: Partially automated
  Script handles most steps, human monitors
  
Level 3: Fully automated with human trigger
  Human runs script, it does everything
  
Level 4: Fully automated with machine trigger
  Event triggers automation, human reviews result
  
Level 5: Fully autonomous
  Event triggers automation, no human involved
  Human alerted only on failure

Goal: Move every toil item up the ladder.
Most items should be Level 3+ within 6 months.
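
To track progress up the ladder, each toil item can be tagged with its current level and compared against the Level 3 goal. The sketch below is illustrative only: `AutomationLevel` and `below_target` are hypothetical names, and the levels assigned to each activity are invented for the example, not taken from the log.

```python
from enum import IntEnum


class AutomationLevel(IntEnum):
    FULLY_MANUAL = 0
    DOCUMENTED_PROCEDURE = 1
    PARTIALLY_AUTOMATED = 2
    HUMAN_TRIGGERED = 3
    MACHINE_TRIGGERED = 4
    FULLY_AUTONOMOUS = 5


TARGET = AutomationLevel.HUMAN_TRIGGERED  # "Level 3+ within 6 months"

# Hypothetical current levels, for illustration only.
inventory = {
    "Manual certificate rotation": AutomationLevel.DOCUMENTED_PROCEDURE,
    "Scaling database replicas": AutomationLevel.PARTIALLY_AUTOMATED,
    "Answering capacity requests via ticket": AutomationLevel.FULLY_MANUAL,
    "Nightly backup verification": AutomationLevel.MACHINE_TRIGGERED,
}


def below_target(items: dict[str, AutomationLevel],
                 target: AutomationLevel = TARGET) -> list[str]:
    """Return items still under the target level, lowest level first."""
    lagging = [name for name, level in items.items() if level < target]
    return sorted(lagging, key=lambda name: items[name])


for name in below_target(inventory):
    print(f"Level {int(inventory[name])}: {name}")
```

Using `IntEnum` keeps the levels ordered, so "Level 3+" becomes a plain comparison instead of a lookup table.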

Self-Service Platforms

# Automating the work behind a ticket queue still leaves the team in the loop for every request.
# A self-service platform lets engineers complete the request themselves.

self_service_catalog:
  - name: "Create new database"
    old_process: "File ticket → DBA reviews → DBA provisions → 3 days"
    new_process: "Click form → automated provisioning → 5 minutes"
    toil_eliminated: 4 hours/week (DBA team)
    
  - name: "Request cloud IAM role"
    old_process: "File ticket → security review → manual creation → 2 days"
    new_process: "PR to IAM repo → automated review + Terraform apply → 30 min"
    toil_eliminated: 6 hours/week (security team)

Anti-Patterns

Anti-Pattern	Consequence	Fix
Not measuring toil	Cannot improve what you don’t measure	Weekly toil tracking per engineer
Automating only easy toil	High-impact toil remains	Prioritize by ROI, not difficulty
Toil > 50% accepted as normal	No engineering capacity for improvement	SRE contract: toil must stay < 50%
Band-aid automation	Automates symptoms, not root cause	Fix underlying system issues
No toil budget in sprint planning	Toil elimination never prioritized	Dedicated automation sprint capacity
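
Prioritizing by ROI rather than difficulty can be made concrete by ranking toil items on hours saved per hour of automation effort. This is a sketch under several assumptions: `ToilItem` and `payoff` are hypothetical names, effort estimates are converted at 8 hours per day and 40 hours per week, and the "partially" automatable item is assumed to be 50% recoverable, a figure chosen purely for illustration.

```python
from dataclasses import dataclass

WEEKS_PER_YEAR = 52


@dataclass
class ToilItem:
    activity: str
    weekly_hours: float
    effort_hours: float        # automation effort, converted to hours
    recoverable: float = 1.0   # share of the toil automation would remove (assumed)

    @property
    def payoff(self) -> float:
        """Annual hours saved per hour of automation effort."""
        return self.weekly_hours * WEEKS_PER_YEAR * self.recoverable / self.effort_hours


# Hours and effort estimates from the weekly toil log; conversions are assumptions.
items = [
    ToilItem("Manual certificate rotation", 3, effort_hours=16),             # 2 days
    ToilItem("Scaling database replicas", 2, effort_hours=8),                # 1 day
    ToilItem("Answering capacity requests via ticket", 4, effort_hours=40),  # 1 week
    ToilItem("Debugging user-reported 500 errors", 5, effort_hours=80,       # 2 weeks
             recoverable=0.5),
]

ranked = sorted(items, key=lambda item: item.payoff, reverse=True)
for item in ranked:
    print(f"{item.payoff:5.1f}  {item.activity}")
```

Under these assumptions the largest time sink (debugging 500 errors) ranks last on first-year payoff, while the smallest (scaling replicas) ranks first. A ranking like this is a starting point for sprint planning, not a substitute for judgment about root causes.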

Toil is the tax on engineering capacity. Every hour spent on toil is an hour not spent on reliability improvements, feature work, or career development. Measure it, budget it, and systematically eliminate it.