Verified by Garnet Grid

SLO Engineering

Define, measure, and manage Service Level Objectives that align engineering priorities with user expectations. Covers SLI selection, error budget policy, SLO-based alerting, and the organizational process that makes SLOs actionable.

An SLO (Service Level Objective) is a target for the reliability of a service, expressed as a percentage. “99.9% of requests will complete successfully within 500ms” is an SLO. SLOs bridge the gap between business expectations and engineering implementation by making reliability a measurable, actionable target.


The SLI → SLO → SLA Stack

SLA (Service Level Agreement)
  External contract with customers
  "99.95% uptime per month, or we credit your account"
  
SLO (Service Level Objective)
  Internal target, stricter than SLA
  "99.99% availability, 99.9% of requests < 500ms"
  
SLI (Service Level Indicator)
  The measurement that feeds the SLO
  "Successful requests / total requests, measured at the load balancer"

SLO Must Be Stricter Than SLA

SLA: 99.9% (43.8 min downtime/month)
SLO: 99.95% (21.9 min downtime/month)

The gap is your safety margin. If the SLO equals the SLA, every missed internal target is also a contract breach, with no room to react first.
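
The downtime figures above follow directly from the target percentage. A minimal sketch of that arithmetic, assuming an average month of 365.25 / 12 days (the function name and constant are illustrative, not part of any library):

```python
# Average month length in minutes (365.25 days / 12). This is an assumption
# chosen because it reproduces the 43.8 and 21.9 minute figures.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # 43,830 minutes


def allowed_downtime_minutes(target_percent: float) -> float:
    """Return the monthly downtime (minutes) that an availability target permits."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    return (100 - target_percent) / 100 * MINUTES_PER_MONTH


sla_minutes = allowed_downtime_minutes(99.9)
slo_minutes = allowed_downtime_minutes(99.95)

print(f"SLA 99.9%:  {sla_minutes:.1f} min/month")
print(f"SLO 99.95%: {slo_minutes:.1f} min/month")
print(f"Margin:     {sla_minutes - slo_minutes:.1f} min/month")
```

The margin line shows how much downtime can accumulate past the internal target before the external contract is at risk.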

Choosing SLIs

Availability

SLI:  Successful requests / Total requests
      (measured at the load balancer, excludes health checks)

Good: request-based ratio, e.g. server_errors / total_requests < 0.001 for a 99.9% target

Bad:  uptime (a binary up/down signal doesn't capture partial failures)
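
As a rough illustration, the availability SLI can be computed from load balancer request records. This sketch assumes a hypothetical `RequestRecord` shape and a hypothetical `/healthz` health-check path, and it treats only 5xx responses as failures; all of these are assumptions to adapt to your own logs.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    path: str
    status: int


# Hypothetical health-check endpoint to exclude from the SLI.
HEALTH_CHECK_PATHS = {"/healthz"}


def availability_sli(records):
    """Successful requests / total requests, ignoring health checks.

    Only 5xx responses count as failures (an assumption; 4xx responses are
    treated as client errors). Returns None when there is no eligible traffic.
    """
    eligible = [r for r in records if r.path not in HEALTH_CHECK_PATHS]
    if not eligible:
        return None
    good = sum(1 for r in eligible if r.status < 500)
    return good / len(eligible)


records = (
    [RequestRecord("/api/orders", 200)] * 998
    + [RequestRecord("/api/orders", 503)] * 2
    + [RequestRecord("/healthz", 200)] * 50
)
sli = availability_sli(records)
print(f"Availability SLI: {sli:.3%}")
```

Excluding the 50 health checks matters here: including them would inflate the ratio with traffic that no user ever sees.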

Latency

SLI:  P99 response time (target: < 500ms)
      P50 response time (target: < 100ms)

Good: Request duration at the 99th percentile
Bad:  Average response time (hides tail latency)
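
To see why averages hide tail latency, consider a synthetic sample in which 3% of requests are very slow. The sketch below uses a simple nearest-rank percentile; the `percentile` helper and the sample data are illustrative assumptions, and production systems typically compute percentiles from histograms instead.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 100 synthetic samples in milliseconds: 97 fast requests and 3 slow ones.
durations_ms = [80] * 97 + [2000] * 3

print("mean:", mean(durations_ms))            # 137.6, looks healthy
print("P50: ", percentile(durations_ms, 50))  # 80
print("P99: ", percentile(durations_ms, 99))  # 2000, far above a 500ms target
```

The mean sits comfortably under 500ms even though 3 in every 100 requests take two seconds.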

Correctness

SLI:  Correct responses / Total responses
      (validated by periodic canary tests)

Good: End-to-end validation of response content
Bad:  HTTP 200 status only (response could be wrong)
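
A small sketch of the difference between status-only and content-validated correctness. It assumes canary probes are recorded as hypothetical `(status, actual_body, expected_body)` tuples; how the probes are issued is out of scope here.

```python
def status_only_sli(probes):
    """Fraction of probes that returned HTTP 200, regardless of content."""
    probes = list(probes)
    if not probes:
        return None
    return sum(1 for status, _, _ in probes if status == 200) / len(probes)


def correctness_sli(probes):
    """Fraction of probes that returned HTTP 200 AND the expected body."""
    probes = list(probes)
    if not probes:
        return None
    correct = sum(
        1 for status, actual, expected in probes
        if status == 200 and actual == expected
    )
    return correct / len(probes)


# Four synthetic canary results; the third returns 200 with a wrong total.
probes = [
    (200, '{"total": 42}', '{"total": 42}'),
    (200, '{"total": 7}', '{"total": 7}'),
    (200, '{"total": 0}', '{"total": 19}'),
    (200, '{"total": 3}', '{"total": 3}'),
]

print("status-only:", status_only_sli(probes))
print("validated:  ", correctness_sli(probes))
```

The status-only measure reports a perfect score, while the validated measure exposes the one wrong response.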

Error Budgets

The error budget is the complement of the SLO (100% minus the target), expressed as allowed failure:

SLO: 99.9% availability
Error budget: 0.1% = 43.8 minutes of downtime per month

Budget consumption:
  Week 1: 5-minute outage  → 5/43.8 = 11.4% consumed
  Week 2: No outages       → 11.4% consumed (cumulative)
  Week 3: 15-minute outage → 15/43.8 = 34.2% → 45.7% consumed (20/43.8)
  Week 4: 3-minute outage  → 3/43.8 = 6.8%  → 52.5% consumed (23/43.8)
  
  Remaining budget: 47.5% (20.8 minutes)
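
The running totals can be tracked with a few lines of code. This sketch computes each cumulative percentage from the exact minutes consumed rather than by summing rounded weekly percentages, so its output can differ from hand-rounded figures by a tenth of a percent. The function name is illustrative.

```python
def cumulative_consumption(budget_minutes, outages_minutes):
    """Return the cumulative fraction of budget consumed after each period."""
    consumed = 0.0
    fractions = []
    for outage in outages_minutes:
        consumed += outage
        fractions.append(consumed / budget_minutes)
    return fractions


BUDGET_MINUTES = 43.8            # monthly budget for a 99.9% SLO
weekly_outages = [5, 0, 15, 3]   # minutes of downtime in weeks 1 to 4

fractions = cumulative_consumption(BUDGET_MINUTES, weekly_outages)
for week, fraction in enumerate(fractions, start=1):
    print(f"Week {week}: {fraction:.1%} consumed")

remaining_minutes = BUDGET_MINUTES - sum(weekly_outages)
print(f"Remaining: {remaining_minutes:.1f} min "
      f"({remaining_minutes / BUDGET_MINUTES:.1%})")
```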

Error Budget Policy

error_budget_policy:
  healthy: # > 50% budget remaining
    - Deploy normally
    - Experiment with new features
    - Take calculated risks
    
  warning: # 20-50% budget remaining
    - Require rollback plans for all deploys
    - Prioritize reliability work
    - Reduce deployment frequency
    
  critical: # < 20% budget remaining
    - Freeze non-critical features
    - All engineering on reliability
    - Postmortem for every incident
    
  exhausted: # 0% budget remaining
    - Complete feature freeze
    - Dedicated reliability sprint
    - Exec review required to resume features
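
One way to make the policy mechanical is to map the remaining budget to a tier in code, for example in a deploy gate. This is a sketch only: the function name is hypothetical, and the boundary handling (exactly 50% counts as warning, exactly 20% counts as warning) is an assumption the policy above leaves open.

```python
def budget_tier(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0 to 1.0) to a policy tier.

    A negative value means the budget is overspent and is treated as exhausted.
    """
    if remaining_fraction <= 0:
        return "exhausted"
    if remaining_fraction < 0.20:
        return "critical"
    if remaining_fraction <= 0.50:
        return "warning"
    return "healthy"


for remaining in (0.80, 0.475, 0.10, 0.0):
    print(f"{remaining:.1%} remaining -> {budget_tier(remaining)}")
```

A deploy pipeline could call a function like this and, for instance, require a rollback plan whenever the tier is anything other than healthy.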

SLO-Based Alerting

Burn Rate Alerts

Instead of alerting on every error, alert on the rate of error budget consumption:

# Multi-burn-rate alerts (thresholds are fractions of the monthly error budget)
alerts:
  - name: high_burn_rate_fast
    # 2% of monthly budget consumed in 1 hour
    condition: budget_consumed_1h > 0.02
    severity: page
    message: "At this rate, the error budget will be exhausted in about 2 days"
    
  - name: high_burn_rate_slow
    # 5% of monthly budget consumed in 6 hours
    condition: budget_consumed_6h > 0.05
    severity: ticket
    message: "Elevated error rate, investigate during business hours"
    
  - name: budget_warning
    condition: remaining_budget < 0.30
    severity: notification
    message: "Error budget below 30%, consider slowing deployments"
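
The relationship between an observed error ratio, the burn rate, and the share of budget consumed in a window can be sketched as follows. The helper names are illustrative, and the 30-day (720-hour) SLO window is an assumption; adjust `period_hours` to match your own window.

```python
import math

HOURS_PER_PERIOD = 30 * 24  # assumed 30-day SLO window


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_ratio / budget


def budget_consumed(error_ratio, slo_target, window_hours,
                    period_hours=HOURS_PER_PERIOD):
    """Fraction of the period's error budget consumed during the window."""
    return burn_rate(error_ratio, slo_target) * window_hours / period_hours


def hours_to_exhaustion(error_ratio, slo_target, period_hours=HOURS_PER_PERIOD):
    """Hours until a full budget is gone if the error ratio is sustained."""
    rate = burn_rate(error_ratio, slo_target)
    return math.inf if rate == 0 else period_hours / rate


# A 2% error ratio sustained for one hour against a 99.9% SLO.
consumed_1h = budget_consumed(0.02, 0.999, window_hours=1)
print(f"Budget consumed in 1h: {consumed_1h:.1%}")
print(f"Page on-call: {consumed_1h > 0.02}")
print(f"Hours to exhaustion: {hours_to_exhaustion(0.02, 0.999):.0f}")
```

Under these assumptions, consuming 2% of the budget in one hour corresponds to a burn rate of 14.4, which empties a full budget in 50 hours, or roughly two days.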

SLO Review Process

Monthly SLO Review

Agenda (30 minutes):
  1. SLO performance last month (5 min)
     - Were SLOs met?
     - Error budget remaining
  
  2. Incident review (10 min)
     - Budget-consuming incidents
     - Root causes and fixes
  
  3. SLO appropriateness (5 min)
     - Are SLOs too tight? (constant firefighting)
     - Too loose? (not catching user-impacting issues)
  
  4. Action items (10 min)
     - Reliability improvements
     - Monitoring gaps
     - SLO adjustments

Anti-Patterns

Anti-Pattern                    | Consequence                           | Fix
--------------------------------|---------------------------------------|----------------------------------------
SLO = 100%                      | No error budget, innovation paralysis | Accept reasonable failure rate
SLO based on averages           | Tail latency ignored                  | Use percentiles (P95, P99)
SLO without error budget policy | SLO is aspirational, not actionable   | Define actions at each budget level
No SLO review process           | SLOs become stale                     | Monthly review with stakeholders
Same SLO for all services       | Critical services under-protected     | Tier services, higher SLOs for critical

SLOs are the contract between your team and your users. They make the implicit expectations explicit, the unmeasured measurable, and the unactionable actionable.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
