SLOs That Drive Real Reliability: From Error Budgets to Engineering Decisions
Implement Service Level Objectives that actually improve reliability. Covers SLI selection, error budget policies, burn rate alerting, and the organizational negotiations that make SLOs work in practice.
Here is what happens without SLOs: an engineer asks “is our service reliable enough?” and the answer is always either “I think so” or “probably not.” SLOs replace opinion with math. They give you a number — this is how reliable we promise to be — and a budget — this is how much unreliability we can tolerate before we stop shipping features and fix things.
The concept is simple. The implementation is not. Most teams fail at SLOs not because they choose the wrong metrics, but because they fail to connect SLOs to engineering decisions. An SLO that nobody acts on when it is violated is just a dashboard.
The Hierarchy: SLI → SLO → SLA → Error Budget
SLI (Service Level Indicator)
A measurement of service behavior.
Example: "The proportion of HTTP requests that return in < 200ms"
SLO (Service Level Objective)
A target for an SLI over a time window.
Example: "99.9% of requests return in < 200ms over 30 days"
SLA (Service Level Agreement)
A contractual commitment with consequences for violation.
Example: "If availability drops below 99.9%, customer gets credits"
Error Budget
The allowed unreliability within the SLO.
Example: 99.9% availability = 0.1% error budget = 43.2 min/month of downtime
| SLO Target | Error Budget (30 days) | What This Means |
|---|---|---|
| 99% | 7.2 hours | Generous. Multiple incidents tolerated. |
| 99.5% | 3.6 hours | Moderate. Standard for internal services. |
| 99.9% | 43.2 minutes | Tight. Most production-facing services. |
| 99.95% | 21.6 minutes | Very tight. Critical path services only. |
| 99.99% | 4.3 minutes | Extreme. Requires redundancy investment. |
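The arithmetic behind this table is worth internalizing: error budget = (1 - SLO target) × window. A minimal sketch that reproduces the numbers above (the helper name is made up for illustration):

# Error budget as wall-clock time: allowed downtime = (1 - SLO target) * window.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given SLO target over the window."""
    return (1 - slo_target) * window_days * 24 * 60

for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    minutes = error_budget_minutes(target)
    print(f"{target:.2%} SLO -> {minutes:.1f} min ({minutes / 60:.2f} h) per 30 days")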
The most important insight about SLOs: Your SLO should NOT be as high as possible. It should be as high as your users need and your team can sustain. Setting 99.99% when your team can deliver 99.9% means perpetual failure and demoralization.
Choosing the Right SLIs
The Four Golden Signals
| Signal | What It Measures | SLI Formula |
|---|---|---|
| Availability | Is the service responding? | successful_requests / total_requests |
| Latency | How fast is it responding? | requests_under_threshold / total_requests |
| Error rate | How often does it fail? | error_requests / total_requests |
| Throughput | How much can it handle? | requests_processed / time_window |
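To make the formulas concrete, a small worked example over a single measurement window (the request counts are invented for illustration):

# Worked example of the golden-signal SLI ratios over one window.
total_requests = 1_000_000      # all requests in the window
error_requests = 400            # failed responses (e.g. 5xx)
slow_requests = 8_000           # responses slower than the latency threshold
window_seconds = 3_600          # length of the window

availability = (total_requests - error_requests) / total_requests   # 0.9996
latency_sli = (total_requests - slow_requests) / total_requests     # 0.9920
error_rate = error_requests / total_requests                        # 0.0004
throughput = total_requests / window_seconds                        # ~277.8 req/s

print(f"availability={availability:.2%}  latency SLI={latency_sli:.2%}  "
      f"error rate={error_rate:.2%}  throughput={throughput:.1f} req/s")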
SLI Selection by Service Type
| Service Type | Primary SLI | Secondary SLI |
|---|---|---|
| API gateway | Availability + latency | Error rate |
| Database | Query latency | Connection availability |
| Message queue | Delivery latency | Message loss rate |
| Batch pipeline | Completion on time | Data freshness |
| Authentication | Availability | Latency (login time) |
| CDN/static assets | Cache hit rate | TTFB latency |
Prometheus SLI Queries
# Availability SLI: proportion of non-5xx responses
sum(rate(http_requests_total{status!~"5.*"}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency SLI: proportion of requests under 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
# Error budget remaining (30-day window)
1 - (
  (1 - (sum(increase(http_requests_total{status!~"5.*"}[30d])) / sum(increase(http_requests_total[30d]))))
  /
  (1 - 0.999)  # SLO target
)
Error Budget Policy
The error budget policy is where SLOs become powerful. It is an agreement between engineering and product about what happens when reliability degrades.
error_budget_policy:
  budget_remaining_100_to_75:
    name: "Healthy"
    actions:
      - "Ship features at normal velocity"
      - "Run experiments and A/B tests"
      - "Approve risky but valuable changes"
  budget_remaining_75_to_50:
    name: "Caution"
    actions:
      - "Continue feature work but increase review rigor"
      - "Prioritize reliability improvements in next sprint"
      - "Review recent incidents for patterns"
  budget_remaining_50_to_25:
    name: "Warning"
    actions:
      - "Pause non-critical feature work"
      - "Dedicate 50% of sprint to reliability"
      - "Daily error budget review"
      - "Escalate to engineering leadership"
  budget_remaining_25_to_0:
    name: "Critical"
    actions:
      - "Freeze all feature deployments"
      - "100% focus on reliability"
      - "All hands on root cause analysis"
      - "Post-mortem for every incident"
  budget_exhausted:
    name: "Frozen"
    actions:
      - "No production changes except reliability fixes"
      - "Remains frozen until budget rebuilds to 25%"
      - "Executive escalation required to override"
The political reality: The most important conversation is with your product manager. When the error budget is exhausted and you freeze feature deployments, product will push back. Having a documented, pre-agreed error budget policy makes this a process decision, not a political fight.
Burn Rate Alerting
Traditional threshold alerts (“error rate > 1%”) fire too late or too often. Burn rate alerting answers: “At the current rate of errors, when will we exhaust our error budget?”
Burn rate = actual error rate / allowed error rate
Example:
SLO: 99.9% (allowed 0.1% errors)
Current error rate: 0.5%
Burn rate: 0.5% / 0.1% = 5x
At 5x burn rate, a 30-day error budget will be consumed in 6 days.
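The same arithmetic in a few lines, handy for sanity-checking alert thresholds (names are illustrative):

# Burn rate = observed error rate / error rate allowed by the SLO.
# Days to exhaustion = budget window / burn rate.
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    return observed_error_rate / (1 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    return window_days / rate

rate = burn_rate(observed_error_rate=0.005, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")                              # 5.0x
print(f"budget gone in {days_to_exhaustion(rate):.0f} days")  # 6 days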
Multi-Window Burn Rate Alerts
| Alert | Burn Rate | Long Window | Short Window | Meaning |
|---|---|---|---|---|
| Page (P1) | 14.4x | 1 hour | 5 min | Burning budget in ~2 days. Immediate action. |
| Page (P2) | 6x | 6 hours | 30 min | Burning budget in ~5 days. Act within hours. |
| Ticket | 3x | 3 days | 6 hours | Budget pressure. Address within days. |
| Ticket | 1x | 7 days | 1 day | Gradual burn. Plan remediation. |
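The paging thresholds in this table are not arbitrary: each corresponds to "this much of the 30-day budget consumed within the long window". A quick derivation (the 2% and 5% consumption targets follow the common SRE-workbook convention; treat them as an assumption you can tune):

# A burn-rate threshold is derived from how much budget may be consumed
# within the long window before someone should be told.
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        budget_window_hours: float = 30 * 24) -> float:
    return budget_fraction * budget_window_hours / window_hours

print(f"{burn_rate_threshold(0.02, 1):.1f}x")   # 2% of budget in 1 hour  -> 14.4x
print(f"{burn_rate_threshold(0.05, 6):.1f}x")   # 5% of budget in 6 hours -> 6.0x

The alerting rule below implements the first row (14.4x), requiring both the long and short windows to exceed the threshold before paging.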
# Prometheus alerting rule: 14.4x burn rate
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.*",job="api"}[1h]))
                 / sum(rate(http_requests_total{job="api"}[1h])))
          )
          / (1 - 0.999) > 14.4
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.*",job="api"}[5m]))
                 / sum(rate(http_requests_total{job="api"}[5m])))
          )
          / (1 - 0.999) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than allowed"
          description: "At current rate, 30-day error budget will be exhausted in ~2 days"
SLO Review Process
Monthly SLO review with stakeholders:
| Agenda Item | Duration | Participants |
|---|---|---|
| Error budget status for each service | 10 min | Eng lead, PM, SRE |
| Incidents that consumed budget | 15 min | On-call engineer |
| Actions taken / not taken based on policy | 10 min | Eng manager |
| SLO target appropriateness | 10 min | PM, Eng lead |
| Next month priorities | 15 min | All |
Implementation Checklist
- Identify your 3-5 most critical services that need SLOs
- For each service, choose 1-2 SLIs (availability and latency are almost always relevant)
- Set initial SLO targets based on historical data (what you actually achieve, not what you aspire to); see the sketch after this checklist
- Build Prometheus/Grafana dashboards showing SLI measurements and error budget remaining
- Implement multi-window burn rate alerting (replaces threshold-based alerts)
- Write an error budget policy and get sign-off from product management
- Practice the error budget policy: next time budget hits 50%, follow the documented actions
- Conduct monthly SLO reviews with engineering, product, and SRE stakeholders
- Revisit SLO targets quarterly: too easy? Tighten. Too hard? Loosen. Never perfect on the first try.
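For the historical-data step, something like the sketch below is enough to anchor the first target. It assumes you can pull good/total request counts for the last few weeks; the data values and function name are placeholders:

# Pick an initial SLO target from what the service has actually achieved,
# rounded down to the nearest standard target rather than up to an aspiration.
STANDARD_TARGETS = [0.99, 0.995, 0.999, 0.9995, 0.9999]

def initial_slo_target(good_requests: int, total_requests: int) -> float:
    achieved = good_requests / total_requests
    candidates = [t for t in STANDARD_TARGETS if t <= achieved]
    return max(candidates) if candidates else min(STANDARD_TARGETS)

# Example: 90 days of history with 99.93% measured availability -> 99.9% SLO.
print(initial_slo_target(good_requests=99_930_000, total_requests=100_000_000))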