SLOs That Drive Real Reliability: From Error Budgets to Engineering Decisions
Implement Service Level Objectives that actually improve reliability. Covers SLI selection, error budget policies, burn rate alerting, and the organizational negotiations that make SLOs work in practice.
Here is what happens without SLOs: an engineer asks “is our service reliable enough?” and the answer is always either “I think so” or “probably not.” SLOs replace opinion with math. They give you a number — this is how reliable we promise to be — and a budget — this is how much unreliability we can tolerate before we stop shipping features and fix things.
The concept is simple. The implementation is not. Most teams fail at SLOs not because they choose the wrong metrics, but because they fail to connect SLOs to engineering decisions. An SLO that nobody acts on when it is violated is just a dashboard.
The Hierarchy: SLI → SLO → SLA → Error Budget
SLI (Service Level Indicator)
A measurement of service behavior.
Example: "The proportion of HTTP requests that return in < 200ms"
SLO (Service Level Objective)
A target for an SLI over a time window.
Example: "99.9% of requests return in < 200ms over 30 days"
SLA (Service Level Agreement)
A contractual commitment with consequences for violation.
Example: "If availability drops below 99.9%, customer gets credits"
Error Budget
The allowed unreliability within the SLO.
Example: 99.9% availability = 0.1% error budget = 43.2 min/month of downtime
| SLO Target | Error Budget (30 days) | What This Means |
|---|---|---|
| 99% | 7.2 hours | Generous. Multiple incidents tolerated. |
| 99.5% | 3.6 hours | Moderate. Standard for internal services. |
| 99.9% | 43.2 minutes | Tight. Most production-facing services. |
| 99.95% | 21.6 minutes | Very tight. Critical path services only. |
| 99.99% | 4.3 minutes | Extreme. Requires redundancy investment. |
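The arithmetic behind this table is worth internalizing: error budget = (1 - SLO target) × window. A minimal sketch that reproduces the numbers above (the helper name is made up for illustration):

# Error budget as wall-clock time: allowed downtime = (1 - SLO target) * window.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given SLO target over the window."""
    return (1 - slo_target) * window_days * 24 * 60

for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    minutes = error_budget_minutes(target)
    print(f"{target:.2%} SLO -> {minutes:.1f} min ({minutes / 60:.2f} h) per 30 days")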
The most important insight about SLOs: Your SLO should NOT be as high as possible. It should be as high as your users need and your team can sustain. Setting 99.99% when your team can deliver 99.9% means perpetual failure and demoralization.
Choosing the Right SLIs
The Four Golden Signals
| Signal | What It Measures | SLI Formula |
|---|---|---|
| Availability | Is the service responding? | successful_requests / total_requests |
| Latency | How fast is it responding? | requests_under_threshold / total_requests |
| Error rate | How often does it fail? | error_requests / total_requests |
| Throughput | How much can it handle? | requests_processed / time_window |
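To make the formulas concrete, a small worked example over a single measurement window (the request counts are invented for illustration):

# Worked example of the golden-signal SLI ratios over one window.
total_requests = 1_000_000      # all requests in the window
error_requests = 400            # failed responses (e.g. 5xx)
slow_requests = 8_000           # responses slower than the latency threshold
window_seconds = 3_600          # length of the window

availability = (total_requests - error_requests) / total_requests   # 0.9996
latency_sli = (total_requests - slow_requests) / total_requests     # 0.9920
error_rate = error_requests / total_requests                        # 0.0004
throughput = total_requests / window_seconds                        # ~277.8 req/s

print(f"availability={availability:.2%}  latency SLI={latency_sli:.2%}  "
      f"error rate={error_rate:.2%}  throughput={throughput:.1f} req/s")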
SLI Selection by Service Type
| Service Type | Primary SLI | Secondary SLI |
|---|---|---|
| API gateway | Availability + latency | Error rate |
| Database | Query latency | Connection availability |
| Message queue | Delivery latency | Message loss rate |
| Batch pipeline | Completion on time | Data freshness |
| Authentication | Availability | Latency (login time) |
| CDN/static assets | Cache hit rate | TTFB latency |
Prometheus SLI Queries
# Availability SLI: proportion of non-5xx responses
sum(rate(http_requests_total{status!~"5.*"}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency SLI: proportion of requests under 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
# Error budget remaining (30-day window)
1 - (
  (1 - (sum(increase(http_requests_total{status!~"5.*"}[30d])) / sum(increase(http_requests_total[30d]))))
  /
  (1 - 0.999)  # SLO target
)
Error Budget Policy
The error budget policy is where SLOs become powerful. It is an agreement between engineering and product about what happens when reliability degrades.
error_budget_policy:
  budget_remaining_100_to_75:
    name: "Healthy"
    actions:
      - "Ship features at normal velocity"
      - "Run experiments and A/B tests"
      - "Approve risky but valuable changes"
  budget_remaining_75_to_50:
    name: "Caution"
    actions:
      - "Continue feature work but increase review rigor"
      - "Prioritize reliability improvements in next sprint"
      - "Review recent incidents for patterns"
  budget_remaining_50_to_25:
    name: "Warning"
    actions:
      - "Pause non-critical feature work"
      - "Dedicate 50% of sprint to reliability"
      - "Daily error budget review"
      - "Escalate to engineering leadership"
  budget_remaining_25_to_0:
    name: "Critical"
    actions:
      - "Freeze all feature deployments"
      - "100% focus on reliability"
      - "All hands on root cause analysis"
      - "Post-mortem for every incident"
  budget_exhausted:
    name: "Frozen"
    actions:
      - "No production changes except reliability fixes"
      - "Remains frozen until budget rebuilds to 25%"
      - "Executive escalation required to override"
The political reality: The most important conversation is with your product manager. When the error budget is exhausted and you freeze feature deployments, product will push back. Having a documented, pre-agreed error budget policy makes this a process decision, not a political fight.
Burn Rate Alerting
Traditional threshold alerts (“error rate > 1%”) fire too late or too often. Burn rate alerting answers: “At the current rate of errors, when will we exhaust our error budget?”
Burn rate = actual error rate / allowed error rate
Example:
SLO: 99.9% (allowed 0.1% errors)
Current error rate: 0.5%
Burn rate: 0.5% / 0.1% = 5x
At 5x burn rate, a 30-day error budget will be consumed in 6 days.
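The same arithmetic in a few lines, handy for sanity-checking alert thresholds (names are illustrative):

# Burn rate = observed error rate / error rate allowed by the SLO.
# Days to exhaustion = budget window / burn rate.
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    return observed_error_rate / (1 - slo_target)

def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    return window_days / rate

rate = burn_rate(observed_error_rate=0.005, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")                              # 5.0x
print(f"budget gone in {days_to_exhaustion(rate):.0f} days")  # 6 days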
Multi-Window Burn Rate Alerts
| Alert | Burn Rate | Long Window | Short Window | Meaning |
|---|---|---|---|---|
| Page (P1) | 14.4x | 1 hour | 5 min | Burning budget in ~2 days. Immediate action. |
| Page (P2) | 6x | 6 hours | 30 min | Burning budget in ~5 days. Act within hours. |
| Ticket | 3x | 3 days | 6 hours | Budget pressure. Address within days. |
| Ticket | 1x | 7 days | 1 day | Gradual burn. Plan remediation. |
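The paging thresholds in this table are not arbitrary: each corresponds to "this much of the 30-day budget consumed within the long window". A quick derivation (the 2% and 5% consumption targets follow the common SRE-workbook convention; treat them as an assumption you can tune):

# A burn-rate threshold is derived from how much budget may be consumed
# within the long window before someone should be told.
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        budget_window_hours: float = 30 * 24) -> float:
    return budget_fraction * budget_window_hours / window_hours

print(f"{burn_rate_threshold(0.02, 1):.1f}x")   # 2% of budget in 1 hour  -> 14.4x
print(f"{burn_rate_threshold(0.05, 6):.1f}x")   # 5% of budget in 6 hours -> 6.0x

The alerting rule below implements the first row (14.4x), requiring both the long and short windows to exceed the threshold before paging.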
# Prometheus alerting rule: 14.4x burn rate
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.*",job="api"}[1h]))
                 / sum(rate(http_requests_total{job="api"}[1h])))
          )
          / (1 - 0.999) > 14.4
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.*",job="api"}[5m]))
                 / sum(rate(http_requests_total{job="api"}[5m])))
          )
          / (1 - 0.999) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than allowed"
          description: "At current rate, 30-day error budget will be exhausted in ~2 days"
SLO Review Process
Monthly SLO review with stakeholders:
| Agenda Item | Duration | Participants |
|---|---|---|
| Error budget status for each service | 10 min | Eng lead, PM, SRE |
| Incidents that consumed budget | 15 min | On-call engineer |
| Actions taken / not taken based on policy | 10 min | Eng manager |
| SLO target appropriateness | 10 min | PM, Eng lead |
| Next month priorities | 15 min | All |
Implementation Checklist
- Identify your 3-5 most critical services that need SLOs
- For each service, choose 1-2 SLIs (availability and latency are almost always relevant)
- Set initial SLO targets based on historical data (what you actually achieve, not what you aspire to); see the sketch after this checklist
- Build Prometheus/Grafana dashboards showing SLI measurements and error budget remaining
- Implement multi-window burn rate alerting (replaces threshold-based alerts)
- Write an error budget policy and get sign-off from product management
- Practice the error budget policy: next time budget hits 50%, follow the documented actions
- Conduct monthly SLO reviews with engineering, product, and SRE stakeholders
- Revisit SLO targets quarterly: too easy? Tighten. Too hard? Loosen. Never perfect on the first try.
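For the historical-data step, something like the sketch below is enough to anchor the first target. It assumes you can pull good/total request counts for the last few weeks; the data values and function name are placeholders:

# Pick an initial SLO target from what the service has actually achieved,
# rounded down to the nearest standard target rather than up to an aspiration.
STANDARD_TARGETS = [0.99, 0.995, 0.999, 0.9995, 0.9999]

def initial_slo_target(good_requests: int, total_requests: int) -> float:
    achieved = good_requests / total_requests
    candidates = [t for t in STANDARD_TARGETS if t <= achieved]
    return max(candidates) if candidates else min(STANDARD_TARGETS)

# Example: 90 days of history with 99.93% measured availability -> 99.9% SLO.
print(initial_slo_target(good_requests=99_930_000, total_requests=100_000_000))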