SLO Engineering
Define, measure, and manage Service Level Objectives that align engineering priorities with user expectations. Covers SLI selection, error budget policy, SLO-based alerting, and the organizational process that makes SLOs actionable.
An SLO (Service Level Objective) is a target for the reliability of a service, expressed as a percentage. “99.9% of requests will complete successfully within 500ms” is an SLO. SLOs bridge the gap between business expectations and engineering implementation by making reliability a measurable, actionable target.
The SLI → SLO → SLA Stack
SLA (Service Level Agreement)
External contract with customers
"99.95% uptime per month, or we credit your account"
SLO (Service Level Objective)
Internal target, stricter than SLA
"99.99% availability, 99.9% of requests < 500ms"
SLI (Service Level Indicator)
The measurement that feeds the SLO
"Successful requests / total requests, measured at the load balancer"
SLO Must Be Stricter Than SLA
SLA: 99.9% (43.8 min downtime/month)
SLO: 99.95% (21.9 min downtime/month)
The gap is your safety margin. If the SLO equals the SLA, every SLO miss is also an SLA breach, so you have no room to react before credits are owed.
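The downtime figures above follow directly from the target percentage. A minimal sketch of the arithmetic, assuming an average month of 365.25 / 12 days (the function name is illustrative):

```python
# Average month length in minutes: 365.25 days / 12 months * 24 h * 60 min.
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # 43,830 minutes


def allowed_downtime_minutes(target_percent: float) -> float:
    """Return the monthly downtime that an availability target permits."""
    return (1 - target_percent / 100) * MINUTES_PER_MONTH


sla_minutes = allowed_downtime_minutes(99.9)
slo_minutes = allowed_downtime_minutes(99.95)
safety_margin = sla_minutes - slo_minutes

print(f"SLA 99.9%:  {sla_minutes:.1f} min/month")
print(f"SLO 99.95%: {slo_minutes:.1f} min/month")
print(f"Margin:     {safety_margin:.1f} min/month")
```

The margin is the amount of downtime you can absorb after missing the SLO and before breaching the SLA.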
Choosing SLIs
Availability
SLI: Successful requests / Total requests
(measured at the load balancer, excludes health checks)
Good: request-based ratio, e.g. server_errors / total_requests < 0.001 (equivalent to 99.9% success)
Bad: uptime (binary: up/down doesn't capture partial failures)
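As a rough illustration of the availability SLI, the sketch below counts requests seen at the load balancer and drops health checks before computing the ratio. The `Request` type, the `/healthz` path, and the choice to treat only 5xx responses as failures are assumptions for the example, not part of any specific load balancer's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    path: str
    status: int


def availability_sli(requests: list[Request], health_path: str = "/healthz") -> float:
    """Fraction of non-health-check requests that did not fail with a 5xx."""
    counted = [r for r in requests if r.path != health_path]
    if not counted:
        return 1.0  # no user traffic means nothing failed
    good = sum(1 for r in counted if r.status < 500)
    return good / len(counted)


sample = [
    Request("/api/orders", 200),
    Request("/api/orders", 503),  # server error: counts against the SLI
    Request("/healthz", 200),     # health check: excluded entirely
    Request("/api/users", 404),   # client error: treated as successful here
    Request("/api/users", 200),
]
sli = availability_sli(sample)
print(sli)  # 3 of the 4 counted requests succeeded
```

Unlike a binary up/down signal, this ratio degrades gradually when only some requests fail.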
Latency
SLI: P99 response time < 500ms
P50 response time < 100ms
Good: Request duration at the 99th percentile
Bad: Average response time (hides tail latency)
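To see why an average hides tail latency, consider a made-up sample where 3% of requests are very slow. The sketch uses a simple nearest-rank percentile; production systems typically compute percentiles from histograms, so treat this as an illustration only.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 97 fast requests and 3 slow ones (milliseconds, invented for illustration).
durations_ms = [80.0] * 97 + [2000.0] * 3

mean_ms = sum(durations_ms) / len(durations_ms)
p50_ms = percentile(durations_ms, 50)
p99_ms = percentile(durations_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms:.0f}ms p99={p99_ms:.0f}ms")
```

Here the mean stays well under 500ms while the P99 is 2000ms, so a latency target based on the average would report healthy even though some users wait two seconds.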
Correctness
SLI: Correct responses / Total responses
(validated by periodic canary tests)
Good: End-to-end validation of response content
Bad: HTTP 200 status only (response could be wrong)
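A correctness SLI can be sketched as the share of canary probes whose response matches a known-good answer. The `ProbeResult` structure and the sample payloads below are hypothetical; a real canary would issue requests and compare against fixtures you control.

```python
from typing import NamedTuple


class ProbeResult(NamedTuple):
    status: int
    body: dict
    expected: dict


def correctness_sli(probes: list[ProbeResult]) -> float:
    """Fraction of canary probes that returned HTTP 200 with the expected body."""
    if not probes:
        return 1.0
    correct = sum(1 for p in probes if p.status == 200 and p.body == p.expected)
    return correct / len(probes)


probes = [
    ProbeResult(200, {"total": 42}, {"total": 42}),
    ProbeResult(200, {"total": 0}, {"total": 42}),  # HTTP 200 but wrong content
    ProbeResult(500, {}, {"total": 42}),
    ProbeResult(200, {"total": 42}, {"total": 42}),
]

content_based = correctness_sli(probes)
status_only = sum(1 for p in probes if p.status == 200) / len(probes)
print(content_based, status_only)
```

The status-only ratio overstates correctness because it counts the wrong-but-200 response as good.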
Error Budgets
The error budget is the complement of the SLO (100% minus the target), expressed as allowed failure:
SLO: 99.9% availability
Error budget: 0.1% = 43.8 minutes of downtime per month
Budget consumption:
Week 1: 5-minute outage → 5/43.8 = 11.4% consumed
Week 2: No outages → 11.4% consumed (cumulative)
Week 3: 15-minute outage → 15/43.8 = 34.2% → 45.7% consumed (20/43.8)
Week 4: 3-minute outage → 3/43.8 = 6.8% → 52.5% consumed (23/43.8)
Remaining budget: 47.5% (20.8 minutes)
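The running totals can be reproduced with a few lines of arithmetic. This sketch computes each cumulative percentage from the running total of outage minutes rather than by summing rounded per-week percentages, which avoids drift of a tenth of a percent.

```python
BUDGET_MINUTES = 43.8  # monthly error budget for a 99.9% SLO

# Outage minutes per week, taken from the example above.
outages_by_week = {1: 5, 2: 0, 3: 15, 4: 3}

consumed_minutes = 0.0
cumulative_pct = {}
for week, outage in outages_by_week.items():
    consumed_minutes += outage
    cumulative_pct[week] = consumed_minutes / BUDGET_MINUTES * 100
    print(f"Week {week}: {cumulative_pct[week]:.1f}% consumed")

remaining_minutes = BUDGET_MINUTES - consumed_minutes
remaining_pct = remaining_minutes / BUDGET_MINUTES * 100
print(f"Remaining: {remaining_pct:.1f}% ({remaining_minutes:.1f} minutes)")
```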
Error Budget Policy
error_budget_policy:
  healthy:    # > 50% budget remaining
    - Deploy normally
    - Experiment with new features
    - Take calculated risks
  warning:    # 20-50% budget remaining
    - Require rollback plans for all deploys
    - Prioritize reliability work
    - Reduce deployment frequency
  critical:   # < 20% budget remaining
    - Freeze non-critical features
    - All engineering on reliability
    - Postmortem for every incident
  exhausted:  # 0% budget remaining
    - Complete feature freeze
    - Dedicated reliability sprint
    - Exec review required to resume features
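One way to make the policy mechanical is to map the remaining budget to a tier in code, so dashboards and deploy tooling agree on the current state. The function below is a sketch; the boundary handling (for example, treating exactly 50% as `warning`) is an assumption, since the policy above does not say which tier owns the boundary values.

```python
def policy_tier(remaining_fraction: float) -> str:
    """Map the remaining error budget (0.0 to 1.0) to a policy tier name."""
    if remaining_fraction <= 0:
        return "exhausted"
    if remaining_fraction < 0.20:
        return "critical"
    if remaining_fraction <= 0.50:
        return "warning"
    return "healthy"


for remaining in (0.80, 0.475, 0.10, 0.0):
    print(f"{remaining:.1%} remaining -> {policy_tier(remaining)}")
```

A deploy pipeline could call such a function before each release and require a rollback plan or block the deploy depending on the tier returned.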
SLO-Based Alerting
Burn Rate Alerts
Instead of alerting on every error, alert on the rate of error budget consumption:
# Multi-window burn rate alerts
alerts:
  - name: high_burn_rate_fast
    # 2% of monthly budget consumed in 1 hour
    condition: budget_consumed_1h > 0.02
    severity: page
    message: "At this rate, the monthly error budget will be exhausted in about 2 days"
  - name: high_burn_rate_slow
    # 5% of monthly budget consumed in 6 hours
    condition: budget_consumed_6h > 0.05
    severity: ticket
    message: "Elevated error rate, investigate during business hours"
  - name: budget_warning
    # Fraction of the monthly budget still unspent
    condition: remaining_budget < 0.30
    severity: notification
    message: "Error budget below 30%, consider slowing deployments"
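The thresholds above are stated as a fraction of the monthly budget consumed within a window. The sketch below shows one way to derive that fraction from an observed error rate. It assumes traffic is roughly uniform over the month and uses a 730.5-hour average month; all function names are illustrative.

```python
from typing import Optional

HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5 hours


def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_rate / (1 - slo_target)


def budget_fraction_consumed(error_rate: float, slo_target: float, window_hours: float) -> float:
    """Fraction of the monthly budget spent in the window, assuming uniform traffic."""
    return burn_rate(error_rate, slo_target) * window_hours / HOURS_PER_MONTH


def evaluate(error_rate_1h: float, error_rate_6h: float, slo_target: float) -> Optional[str]:
    """Return the severity of the first burn rate alert that fires, or None."""
    if budget_fraction_consumed(error_rate_1h, slo_target, 1) > 0.02:
        return "page"
    if budget_fraction_consumed(error_rate_6h, slo_target, 6) > 0.05:
        return "ticket"
    return None


# With a 99.9% SLO, a 2% error rate over the last hour is a burn rate of about 20.
print(evaluate(error_rate_1h=0.02, error_rate_6h=0.001, slo_target=0.999))
print(evaluate(error_rate_1h=0.001, error_rate_6h=0.007, slo_target=0.999))
print(evaluate(error_rate_1h=0.0005, error_rate_6h=0.0005, slo_target=0.999))
```

Checking the fast window first means a sudden spike pages immediately, while a slower sustained leak only opens a ticket.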
SLO Review Process
Monthly SLO Review
Agenda (30 minutes):
1. SLO performance last month (5 min)
   - Were SLOs met?
   - Error budget remaining
2. Incident review (10 min)
   - Budget-consuming incidents
   - Root causes and fixes
3. SLO appropriateness (5 min)
   - Are SLOs too tight? (constant firefighting)
   - Too loose? (not catching user-impacting issues)
4. Action items (10 min)
   - Reliability improvements
   - Monitoring gaps
   - SLO adjustments
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| SLO = 100% | No error budget, innovation paralysis | Accept reasonable failure rate |
| SLO based on averages | Tail latency ignored | Use percentiles (P95, P99) |
| SLO without error budget policy | SLO is aspirational, not actionable | Define actions at each budget level |
| No SLO review process | SLOs become stale | Monthly review with stakeholders |
| Same SLO for all services | Critical services under-protected, low-priority services over-invested | Tier services; set stricter SLOs for critical tiers |
SLOs are your team's internal commitment to your users, distinct from the contractual SLA. They make implicit expectations explicit, the unmeasured measurable, and the unactionable actionable.