SLO-Based Alerting
Replace threshold-based alerts with SLO-driven alerting that reduces noise and focuses on user impact. Covers error budgets, burn rate alerts, multi-window strategies, alert routing, and the patterns that eliminate alert fatigue while catching real incidents.
Traditional alerting fires when a metric crosses a threshold: “CPU > 80% → page the on-call.” This produces noise because high CPU might not affect users, and low CPU does not mean everything is fine. SLO-based alerting fires when the error budget is burning too fast — meaning real users are experiencing real problems.
From Threshold to SLO
Threshold Alerting (Old):
IF cpu_utilization > 80% THEN page
IF response_time > 500ms THEN page
IF error_rate > 1% THEN page
Problems:
- High CPU with happy users = unnecessary page
- 500ms latency might be fine for batch jobs
- 1% error rate: is that 10 errors or 10,000?
SLO-Based Alerting (New):
SLO: 99.9% of requests succeed within 300ms
Error budget: 0.1% = 43.2 minutes/month
Alert: IF burning error budget 10x faster than sustainable THEN page
Benefits:
- Only alerts when users are impacted
- Severity proportional to user impact
- Error budget provides decision framework
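To make the burn-rate condition concrete, here is a minimal sketch of the arithmetic behind the example SLO above (the 10x multiplier and the variable names are illustrative, not taken from any particular tool):

```python
# Turning "burning the budget 10x faster than sustainable" into an error-rate threshold.
slo_target = 0.999
allowed_error_rate = 1 - slo_target        # 0.001: at most 0.1% of requests may fail
burn_rate_threshold = 10                   # the multiplier from the alert condition above
alert_error_rate = burn_rate_threshold * allowed_error_rate
# 10 * 0.001 = 0.01 -> page once more than 1% of requests are failing
```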
Error Budget Math
# SLO: 99.9% availability per month
slo_target = 0.999

# Error budget per month
total_minutes_per_month = 30 * 24 * 60  # 43,200 minutes
error_budget_minutes = total_minutes_per_month * (1 - slo_target)
# 43,200 * 0.001 = 43.2 minutes/month

# Sustainable burn rate: consume budget evenly over 30 days
sustainable_burn_rate = 1.0  # 1x = budget consumed exactly over the month

# Alert thresholds based on burn rate:
# 14.4x burn rate: budget consumed in ~2 days (fast burn, high urgency)
# 6x burn rate: budget consumed in ~5 days (moderate burn)
# 3x burn rate: budget consumed in ~10 days (slow burn)
# 1x burn rate: budget consumed exactly at month end (normal)

class ErrorBudgetCalculator:
    def __init__(self, slo_target, window_days=30):
        self.slo_target = slo_target
        self.error_budget = (1 - slo_target) * window_days * 24 * 60  # minutes

    def current_burn_rate(self, errors_last_hour, total_requests_last_hour):
        """Calculate the current burn rate as a multiple of the sustainable rate."""
        error_rate = errors_last_hour / total_requests_last_hour
        allowed_error_rate = 1 - self.slo_target
        return error_rate / allowed_error_rate

    def budget_remaining(self, errors_this_month, total_requests_this_month):
        """Calculate the remaining error budget as a fraction (0.0 to 1.0)."""
        error_rate = errors_this_month / total_requests_this_month
        budget_used = error_rate / (1 - self.slo_target)
        return max(0, 1 - budget_used)
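A quick usage sketch of the calculator above; the request and error counts are invented for illustration and would normally come from your metrics backend:

```python
calc = ErrorBudgetCalculator(slo_target=0.999)

# Last hour: 120 failed requests out of 20,000 (0.6% error rate)
burn = calc.current_burn_rate(errors_last_hour=120, total_requests_last_hour=20_000)
print(f"Current burn rate: {burn:.1f}x sustainable")   # 6.0x -> moderate burn

# Month to date: 9,000 failures out of 30,000,000 requests (0.03% error rate)
remaining = calc.budget_remaining(errors_this_month=9_000, total_requests_this_month=30_000_000)
print(f"Error budget remaining: {remaining:.0%}")       # 70%
```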
Multi-Window Burn Rate Alerts
# Fast burn: Page immediately
- alert: HighBurnRate_Page
  expr: |
    (
      1 - (sum(rate(http_requests_total{code!~"5.."}[1h])) / sum(rate(http_requests_total[1h])))
    ) > (14.4 * (1 - 0.999))
    and
    (
      1 - (sum(rate(http_requests_total{code!~"5.."}[5m])) / sum(rate(http_requests_total[5m])))
    ) > (14.4 * (1 - 0.999))
  labels:
    severity: page
  annotations:
    summary: "Burning error budget 14.4x faster than sustainable"
    impact: "SLO will be breached within ~2 days at this rate"

# Slow burn: Ticket for review
- alert: SlowBurnRate_Ticket
  expr: |
    (
      1 - (sum(rate(http_requests_total{code!~"5.."}[6h])) / sum(rate(http_requests_total[6h])))
    ) > (3 * (1 - 0.999))
    and
    (
      1 - (sum(rate(http_requests_total{code!~"5.."}[30m])) / sum(rate(http_requests_total[30m])))
    ) > (3 * (1 - 0.999))
  labels:
    severity: ticket
  annotations:
    summary: "Burning error budget 3x faster than sustainable"
    impact: "SLO at risk if the trend continues for 10 days"
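The same multi-window logic can be sketched outside Prometheus, for example in a custom checker. This is an illustration only; `error_ratio` is a placeholder for whatever query layer you use, not a real API:

```python
def should_alert(error_ratio, slo_target, burn_rate, long_window, short_window):
    """Multi-window check: alert only if BOTH windows exceed the burn-rate threshold.

    The long window catches sustained burn; the short window confirms it is still
    happening now, so a spike that has already recovered does not page anyone.
    """
    threshold = burn_rate * (1 - slo_target)
    return error_ratio(long_window) > threshold and error_ratio(short_window) > threshold

def error_ratio(window):
    # Stand-in for a real metrics query; the ratios below are made up for illustration.
    observed = {"1h": 0.02, "5m": 0.03, "6h": 0.004, "30m": 0.002}
    return observed[window]

# Mirrors the two Prometheus rules above.
page = should_alert(error_ratio, 0.999, burn_rate=14.4, long_window="1h", short_window="5m")
ticket = should_alert(error_ratio, 0.999, burn_rate=3, long_window="6h", short_window="30m")
print(page, ticket)  # True False: fast burn pages; the slow-burn short window has recovered
```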
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Threshold alerts for everything | Alert fatigue, pages ignored | Migrate to SLO-based alerting |
| No error budget tracking | Cannot make trade-off decisions | Dashboard + monthly reviews |
| SLO without consequences | SLO is meaningless | Freeze launches when budget exhausted |
| Single window burn rate | False positives from spikes | Multi-window (1h AND 5m) |
| Same severity for all alerts | Everything is urgent = nothing is | Tiered: page (fast burn) vs ticket (slow burn) |
SLO-based alerting is not about having fewer alerts — it is about having the right alerts. When your pager goes off, it should mean real users are having real problems.