
SLOs That Drive Real Reliability: From Error Budgets to Engineering Decisions

Implement Service Level Objectives that actually improve reliability. Covers SLI selection, error budget policies, burn rate alerting, and the organizational negotiations that make SLOs work in practice.

Here is what happens without SLOs: an engineer asks “is our service reliable enough?” and the answer is always either “I think so” or “probably not.” SLOs replace opinion with math. They give you a number — this is how reliable we promise to be — and a budget — this is how much unreliability we can tolerate before we stop shipping features and fix things.

The concept is simple. The implementation is not. Most teams fail at SLOs not because they choose the wrong metrics, but because they fail to connect SLOs to engineering decisions. An SLO that nobody acts on when it is violated is just a dashboard.


The Hierarchy: SLI → SLO → SLA → Error Budget

SLI (Service Level Indicator)
  A measurement of service behavior.
  Example: "The proportion of HTTP requests that return in < 200ms"

SLO (Service Level Objective)
  A target for an SLI over a time window.
  Example: "99.9% of requests return in < 200ms over 30 days"

SLA (Service Level Agreement)
  A contractual commitment with consequences for violation.
  Example: "If availability drops below 99.9%, customer gets credits"

Error Budget
  The allowed unreliability within the SLO.
  Example: 99.9% availability = 0.1% error budget = 43.2 min/month of downtime

SLO Target | Error Budget (30 days) | What This Means
99%        | 7.2 hours              | Generous. Multiple incidents tolerated.
99.5%      | 3.6 hours              | Moderate. Standard for internal services.
99.9%      | 43.2 minutes           | Tight. Most production-facing services.
99.95%     | 21.6 minutes           | Very tight. Critical path services only.
99.99%     | 4.3 minutes            | Extreme. Requires redundancy investment.
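
The budget column above is straightforward arithmetic. A quick sketch in Python (the function name is illustrative, not from any library):

```python
# Derive allowed downtime from an SLO target over a rolling window.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of full downtime for a given SLO over the window."""
    total_minutes = window_days * 24 * 60  # 43,200 for 30 days
    return (1 - slo_target) * total_minutes

for target in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.1f} min/month")
```

Running this reproduces the table: 99.9% yields 43.2 minutes, 99.99% yields 4.3.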

The most important insight about SLOs: Your SLO should NOT be as high as possible. It should be as high as your users need and your team can sustain. Setting 99.99% when your team can deliver 99.9% means perpetual failure and demoralization.


Choosing the Right SLIs

The Four Golden Signals

Signal       | What It Measures           | SLI Formula
Availability | Is the service responding? | successful_requests / total_requests
Latency      | How fast is it responding? | requests_under_threshold / total_requests
Error rate   | How often does it fail?    | error_requests / total_requests
Throughput   | How much can it handle?    | requests_processed / time_window

SLI Selection by Service Type

Service Type      | Primary SLI            | Secondary SLI
API gateway       | Availability + latency | Error rate
Database          | Query latency          | Connection availability
Message queue     | Delivery latency       | Message loss rate
Batch pipeline    | Completion on time     | Data freshness
Authentication    | Availability           | Latency (login time)
CDN/static assets | Cache hit rate         | TTFB latency

Prometheus SLI Queries

# Availability SLI: proportion of non-5xx responses
sum(rate(http_requests_total{status!~"5.*"}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: proportion of requests under 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# Error budget remaining (30-day window)
1 - (
  (1 - (sum(increase(http_requests_total{status!~"5.*"}[30d])) / sum(increase(http_requests_total[30d]))))
  /
  (1 - 0.999)  # SLO target
)
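
The budget-remaining expression above can be sanity-checked in plain Python. A minimal sketch, assuming hypothetical request counts and the 99.9% target (function and parameter names are illustrative):

```python
# Fraction of the error budget still unspent, mirroring the PromQL formula:
# 1 - (actual error ratio / allowed error ratio). Can go negative if the
# budget is overspent.
def budget_remaining(good: int, total: int, slo_target: float = 0.999) -> float:
    error_ratio = 1 - good / total      # observed error rate over the window
    allowed = 1 - slo_target            # allowed error rate (0.001 for 99.9%)
    return 1 - error_ratio / allowed

# Hypothetical 30-day window: 10M requests, 5,000 of them errors.
print(budget_remaining(good=9_995_000, total=10_000_000))  # ~0.5: half spent
```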

Error Budget Policy

The error budget policy is where SLOs become powerful. It is an agreement between engineering and product about what happens when reliability degrades.

error_budget_policy:
  budget_remaining_100_to_75:
    name: "Healthy"
    actions:
      - "Ship features at normal velocity"
      - "Run experiments and A/B tests"
      - "Approve risky but valuable changes"

  budget_remaining_75_to_50:
    name: "Caution"
    actions:
      - "Continue feature work but increase review rigor"
      - "Prioritize reliability improvements in next sprint"
      - "Review recent incidents for patterns"

  budget_remaining_50_to_25:
    name: "Warning"
    actions:
      - "Pause non-critical feature work"
      - "Dedicate 50% of sprint to reliability"
      - "Daily error budget review"
      - "Escalate to engineering leadership"

  budget_remaining_25_to_0:
    name: "Critical"
    actions:
      - "Freeze all feature deployments"
      - "100% focus on reliability"
      - "All hands on root cause analysis"
      - "Post-mortem for every incident"

  budget_exhausted:
    name: "Frozen"
    actions:
      - "No production changes except reliability fixes"
      - "Remains frozen until budget rebuilds to 25%"
      - "Executive escalation required to override"
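
The tier boundaries in the policy above can be encoded as a simple lookup, e.g. for a reporting script. A sketch; the function name and the 0-100 percentage convention are assumptions, not part of the policy:

```python
# Map remaining error budget (as a percentage) to the policy tier names above.
def policy_tier(budget_remaining_pct: float) -> str:
    """Return the policy tier for a remaining-budget percentage (0-100)."""
    if budget_remaining_pct > 75:
        return "Healthy"
    if budget_remaining_pct > 50:
        return "Caution"
    if budget_remaining_pct > 25:
        return "Warning"
    if budget_remaining_pct > 0:
        return "Critical"
    return "Frozen"

print(policy_tier(80))  # Healthy
print(policy_tier(10))  # Critical
```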

The political reality: The most important conversation is with your product manager. When the error budget is exhausted and you freeze feature deployments, product will push back. Having a documented, pre-agreed error budget policy makes this a process decision, not a political fight.


Burn Rate Alerting

Traditional threshold alerts (“error rate > 1%”) fire too late or too often. Burn rate alerting answers: “At the current rate of errors, when will we exhaust our error budget?”

Burn rate = actual error rate / allowed error rate

Example:
  SLO: 99.9% (allowed 0.1% errors)
  Current error rate: 0.5%
  Burn rate: 0.5% / 0.1% = 5x

  At 5x burn rate, a 30-day error budget will be consumed in 6 days.
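
The same arithmetic as a sketch, with illustrative function names:

```python
# Burn rate: how many times faster than allowed the budget is being consumed.
def burn_rate(error_rate: float, slo_target: float) -> float:
    allowed_error_rate = 1 - slo_target
    return error_rate / allowed_error_rate

# Days until a full window's budget is exhausted at a sustained burn rate.
def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    return window_days / rate

rate = burn_rate(error_rate=0.005, slo_target=0.999)
print(round(rate, 2))                      # 5.0
print(round(days_to_exhaustion(rate), 2))  # 6.0
```

The 14.4x paging threshold in the next section falls out of the same formula: 30 days / 14.4 is roughly 2 days to exhaustion.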

Multi-Window Burn Rate Alerts

Alert     | Burn Rate | Long Window | Short Window | Meaning
Page (P1) | 14.4x     | 1 hour      | 5 min        | Burning budget in ~2 days. Immediate action.
Page (P2) | 6x        | 6 hours     | 30 min       | Burning budget in ~5 days. Act within hours.
Ticket    | 3x        | 3 days      | 6 hours      | Budget pressure. Address within days.
Ticket    | 1x        | 7 days      | 1 day        | Gradual burn. Plan remediation.

# Prometheus alerting rule: 14.4x burn rate
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.*",job="api"}[1h]))
            / sum(rate(http_requests_total{job="api"}[1h])))
          )
          /
          (1 - 0.999)
          > 14.4
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.*",job="api"}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])))
          )
          /
          (1 - 0.999)
          > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than allowed"
          description: "At current rate, 30-day error budget will be exhausted in ~2 days"

SLO Review Process

Monthly SLO review with stakeholders:

Agenda Item                               | Duration | Participants
Error budget status for each service      | 10 min   | Eng lead, PM, SRE
Incidents that consumed budget            | 15 min   | On-call engineer
Actions taken / not taken based on policy | 10 min   | Eng manager
SLO target appropriateness                | 10 min   | PM, Eng lead
Next month priorities                     | 15 min   | All

Implementation Checklist

  • Identify your 3-5 most critical services that need SLOs
  • For each service, choose 1-2 SLIs (availability and latency are almost always relevant)
  • Set initial SLO targets based on historical data (what you actually achieve, not what you aspire to)
  • Build Prometheus/Grafana dashboards showing SLI measurements and error budget remaining
  • Implement multi-window burn rate alerting (replaces threshold-based alerts)
  • Write an error budget policy and get sign-off from product management
  • Practice the error budget policy: next time budget hits 50%, follow the documented actions
  • Conduct monthly SLO reviews with engineering, product, and SRE stakeholders
  • Revisit SLO targets quarterly: too easy? Tighten. Too hard? Loosen. Never perfect on the first try.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
