SLO Engineering | The Garnet Wiki

An SLO (Service Level Objective) is a target for the reliability of a service, expressed as a percentage. “99.9% of requests will complete successfully within 500ms” is an SLO. SLOs bridge the gap between business expectations and engineering implementation by making reliability a measurable, actionable target.

The SLI → SLO → SLA Stack

SLA (Service Level Agreement)
  External contract with customers
  "99.95% uptime per month, or we credit your account"
  
SLO (Service Level Objective)
  Internal target, stricter than SLA
  "99.99% availability, 99.9% of requests < 500ms"
  
SLI (Service Level Indicator)
  The measurement that feeds the SLO
  "Successful requests / total requests, measured at the load balancer"

SLO Must Be Stricter Than SLA

SLA: 99.9% (43.8 min downtime/month)
SLO: 99.95% (21.9 min downtime/month)

The gap is your safety margin. If the SLO equals the SLA, every SLO miss is also an SLA breach, and there is no room to react before customers are owed credits.
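
The downtime figures above follow from the target percentage and the length of the month. A minimal sketch, assuming an average month of 365.25 / 12 days (the assumption that reproduces 43.8 and 21.9 minutes); the function and constant names are illustrative only:

```python
# Assumption: an "average" month of 365.25 / 12 days (43,830 minutes).
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60


def allowed_downtime_minutes(target_percent: float) -> float:
    """Minutes of full downtime per month permitted by an availability target."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    return MINUTES_PER_MONTH * (100 - target_percent) / 100


sla_minutes = allowed_downtime_minutes(99.9)
slo_minutes = allowed_downtime_minutes(99.95)
safety_margin = sla_minutes - slo_minutes  # downtime the SLO forbids but the SLA still tolerates

print(f"SLA: {sla_minutes:.1f} min, SLO: {slo_minutes:.1f} min, margin: {safety_margin:.1f} min")
```

With a fixed 30-day month the numbers shift slightly (43.2 and 21.6 minutes), so state the window length wherever a downtime figure is quoted.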

Choosing SLIs

Availability

SLI:  Successful requests / Total requests
      (measured at the load balancer, excludes health checks)

Good: server_errors / total_requests < 0.001 (that is, a success ratio above 99.9%)

Bad:  uptime (binary: up/down doesn't capture partial failures)
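
The availability ratio above can be computed from per-window request counters. This is a sketch under assumptions: the field names are hypothetical, and health-check requests are assumed never to be counted among `server_errors`:

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    total: int          # all requests seen at the load balancer
    health_checks: int  # synthetic probes, excluded from the SLI
    server_errors: int  # user requests counted as failures (e.g. HTTP 5xx)


def availability_sli(counts: RequestCounts) -> float:
    """Fraction of real (non-health-check) requests that succeeded."""
    eligible = counts.total - counts.health_checks
    if eligible <= 0:
        return 1.0  # assumption: a window with no user traffic counts as good
    return (eligible - counts.server_errors) / eligible


window = RequestCounts(total=1_050_000, health_checks=50_000, server_errors=800)
sli = availability_sli(window)
meets_slo = sli >= 0.999  # 99.9% availability target
```

Excluding health checks matters because they inflate the denominator with requests that almost always succeed, which makes the service look more available than users experience it.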

Latency

SLI:  P99 response time < 500ms
      P50 response time < 100ms

Good: Request duration at the 99th percentile
Bad:  Average response time (hides tail latency)
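
To see why an average hides tail latency, compare it with percentiles over the same samples. The sketch below uses the nearest-rank percentile definition (one of several in common use) and a made-up latency window:

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical window: 97 fast requests and 3 very slow ones (milliseconds).
latencies_ms = [80.0] * 97 + [2000.0] * 3

avg = mean(latencies_ms)             # 137.6 ms: looks healthy
p50 = percentile(latencies_ms, 50)   # 80.0 ms
p99 = percentile(latencies_ms, 99)   # 2000.0 ms: violates a 500 ms P99 target
```

The average sits comfortably under 500 ms even though 3% of requests take two seconds, which is exactly the failure the P99 SLI is meant to expose.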

Correctness

SLI:  Correct responses / Total responses
      (validated by periodic canary tests)

Good: End-to-end validation of response content
Bad:  HTTP 200 status only (response could be wrong)
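
A canary-based correctness SLI can be sketched as a set of requests with known answers, scored by comparing the response content rather than the status code. Everything here is hypothetical: `fake_service` stands in for a real client call, and the cases are invented for illustration:

```python
from typing import Callable

# Hypothetical canary case: (request, expected response body).
CanaryCase = tuple[str, str]


def correctness_sli(call_service: Callable[[str], str], cases: list[CanaryCase]) -> float:
    """Fraction of canary requests whose response content matches the expected value."""
    if not cases:
        raise ValueError("at least one canary case is required")
    correct = 0
    for request, expected in cases:
        try:
            if call_service(request) == expected:
                correct += 1
        except Exception:
            # A failed call is not a correct response; count it as incorrect.
            pass
    return correct / len(cases)


def fake_service(request: str) -> str:
    """Stand-in for the real service: upper-cases input but mishandles 'c'."""
    return "X" if request == "c" else request.upper()


cases = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]
sli = correctness_sli(fake_service, cases)  # 3 of 4 canaries correct
```

Every call to `fake_service` here would have returned a successful status, yet the SLI is 0.75, which is the gap a status-only check cannot see.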

Error Budgets

The error budget is the complement of the SLO (100% minus the target), expressed as the amount of failure you are allowed:

SLO: 99.9% availability
Error budget: 0.1% = 43.8 minutes of downtime per month

Budget consumption:
  Week 1: 5-minute outage  → 5/43.8 = 11.4% consumed
  Week 2: No outages       → 11.4% consumed (cumulative)
  Week 3: 15-minute outage → 15/43.8 = 34.2% → 45.7% consumed
  Week 4: 3-minute outage  → 3/43.8 = 6.8%  → 52.5% consumed
  
  Remaining budget: 47.5% (20.8 minutes)
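
The running totals can be reproduced with a few lines of arithmetic. This sketch divides the cumulative outage minutes by the budget once per week; summing already-rounded weekly percentages instead can shift the last digit. The incident list mirrors the example above and the variable names are illustrative:

```python
MONTHLY_BUDGET_MIN = 43.8  # 0.1% of an average month, as in the example above

# (label, outage minutes) for each week of the example.
outages = [("Week 1", 5), ("Week 2", 0), ("Week 3", 15), ("Week 4", 3)]

consumed_min = 0.0
history = []  # (label, cumulative fraction of the budget consumed)
for label, minutes in outages:
    consumed_min += minutes
    history.append((label, consumed_min / MONTHLY_BUDGET_MIN))

remaining_min = MONTHLY_BUDGET_MIN - consumed_min
remaining_fraction = remaining_min / MONTHLY_BUDGET_MIN

for label, fraction in history:
    print(f"{label}: {fraction:.1%} consumed")
print(f"Remaining: {remaining_fraction:.1%} ({remaining_min:.1f} minutes)")
```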

Error Budget Policy

error_budget_policy:
  healthy: # > 50% budget remaining
    - Deploy normally
    - Experiment with new features
    - Take calculated risks
    
  warning: # 20-50% budget remaining
    - Require rollback plans for all deploys
    - Prioritize reliability work
    - Reduce deployment frequency
    
  critical: # < 20% budget remaining
    - Freeze non-critical features
    - All engineering on reliability
    - Postmortem for every incident
    
  exhausted: # 0% budget remaining
    - Complete feature freeze
    - Dedicated reliability sprint
    - Exec review required to resume features
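
A policy like this is easier to enforce when tooling can look up the current tier. A minimal sketch, assuming the remaining budget is tracked as a fraction of the monthly budget; the boundary handling (exactly 20% and 50% count as warning, an overspent budget counts as exhausted) is an assumption, since the policy does not specify it:

```python
def budget_state(remaining: float) -> str:
    """Map the remaining error budget (fraction of the monthly budget) to a policy tier."""
    if remaining <= 0:
        return "exhausted"  # 0% left, or overspent
    if remaining < 0.20:
        return "critical"
    if remaining <= 0.50:
        return "warning"
    return "healthy"


# Example: gate a deploy pipeline on the current tier (hypothetical usage).
state = budget_state(0.476)
deploys_need_rollback_plan = state in {"warning", "critical", "exhausted"}
```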

SLO-Based Alerting

Burn Rate Alerts

Instead of alerting on every error, alert on the rate of error budget consumption:

# Multi-window burn rate alert
alerts:
  - name: high_burn_rate_fast
    # 2% of monthly budget consumed in 1 hour
    condition: budget_consumed_1h > (monthly_budget * 0.02)
    severity: page
    message: "At this rate, the monthly error budget will be exhausted in about 2 days"
    
  - name: high_burn_rate_slow
    # 5% of monthly budget consumed in 6 hours
    condition: budget_consumed_6h > (monthly_budget * 0.05)
    severity: ticket
    message: "Elevated error rate, investigate during business hours"
    
  - name: budget_warning
    condition: remaining_budget < 0.30
    severity: notification
    message: "Error budget below 30%, consider slowing deployments"
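
To evaluate these thresholds from raw error rates, the error rate first has to be converted into a share of the monthly budget. The sketch below does that via the burn rate (observed error rate divided by the allowed error rate). It assumes a 99.9% SLO, an average month of 730.5 hours, and roughly uniform traffic across the month; the function names are hypothetical:

```python
SLO_TARGET = 0.999
PERIOD_HOURS = 730.5  # assumption: average month, 365.25 * 24 / 12


def budget_fraction_consumed(error_rate: float, window_hours: float) -> float:
    """Share of the monthly error budget spent by `error_rate` sustained over the window."""
    burn_rate = error_rate / (1 - SLO_TARGET)  # 1.0 means spending exactly on budget
    return burn_rate * window_hours / PERIOD_HOURS


def evaluate(error_rate_1h: float, error_rate_6h: float, remaining_budget: float) -> list[str]:
    """Return the names of the alerts defined above that would fire."""
    fired = []
    if budget_fraction_consumed(error_rate_1h, 1) > 0.02:
        fired.append("high_burn_rate_fast")
    if budget_fraction_consumed(error_rate_6h, 6) > 0.05:
        fired.append("high_burn_rate_slow")
    if remaining_budget < 0.30:
        fired.append("budget_warning")
    return fired


# 2% of requests failing over the last hour is a burn rate of about 20: page.
fast = evaluate(error_rate_1h=0.02, error_rate_6h=0.005, remaining_budget=0.60)
# Error rate exactly on budget, but little budget left: notification only.
low = evaluate(error_rate_1h=0.001, error_rate_6h=0.001, remaining_budget=0.25)
```

Under these assumptions the 1-hour alert corresponds to a burn rate of about 14.6 and the 6-hour alert to about 6.1, so a brief spike pages only if it is severe, while a milder but sustained problem opens a ticket.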

SLO Review Process

Monthly SLO Review

Agenda (30 minutes):
  1. SLO performance last month (5 min)
     - Were SLOs met?
     - Error budget remaining
  
  2. Incident review (10 min)
     - Budget-consuming incidents
     - Root causes and fixes
  
  3. SLO appropriateness (5 min)
     - Are SLOs too tight? (constant firefighting)
     - Too loose? (not catching user-impacting issues)
  
  4. Action items (10 min)
     - Reliability improvements
     - Monitoring gaps
     - SLO adjustments

Anti-Patterns

Anti-Pattern	Consequence	Fix
SLO = 100%	No error budget, innovation paralysis	Accept reasonable failure rate
SLO based on averages	Tail latency ignored	Use percentiles (P95, P99)
SLO without error budget policy	SLO is aspirational, not actionable	Define actions at each budget level
No SLO review process	SLOs become stale	Monthly review with stakeholders
Same SLO for all services	Critical services under-protected	Tier services, higher SLOs for critical

SLOs are the working agreement between your team and your users, distinct from the legal contract of an SLA. They make implicit expectations explicit, the unmeasured measurable, and the unactionable actionable.