SRE Capacity Forecasting | The Garnet Wiki

Capacity planning fails when it is reactive. By the time a database runs out of disk space or an application server hits 100% CPU, you have an outage. Capacity forecasting is the discipline of predicting future resource needs based on historical growth, seasonal patterns, and planned business events — so infrastructure scales proactively, not in response to pages.

Forecasting Framework

Inputs for Capacity Forecasting:

Historical Data:
  ├── CPU utilization (last 6-12 months)
  ├── Memory usage trends
  ├── Storage growth rate (GB/month)
  ├── Network throughput trends
  ├── Request rate (RPS) growth
  └── Database connection counts

Business Context:
  ├── Expected user growth (marketing campaigns)
  ├── Feature launches (new features = new load patterns)
  ├── Seasonal patterns (Black Friday, end of quarter)
  ├── Acquisition/merger (sudden traffic increase)
  └── Geographic expansion (new regions)

Output → Capacity Runway:
  "At current growth rate:
   CPU headroom: 4.2 months before needing to scale
   Storage: 2.1 months before 80% utilization
   Database connections: 6.8 months before pool exhaustion
   
   With Black Friday event:
   CPU headroom: 1.3 months — ACTION REQUIRED"
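
The runway figures above come down to remaining headroom divided by growth rate. The following is a minimal sketch of that arithmetic, not a description of any particular tool. It assumes utilization is a fraction between 0 and 1, growth is measured in utilization per month, and a planned event can be modeled as a one-off additive jump. The function name `runway_months`, the `event_uplift` parameter, and all input numbers are hypothetical.

```python
def runway_months(current: float, growth_per_month: float,
                  threshold: float = 0.8, event_uplift: float = 0.0) -> float:
    """Months until `current` utilization reaches `threshold`.

    event_uplift is a hypothetical one-off jump in utilization caused by a
    planned event; it is subtracted from the available headroom.
    """
    headroom = threshold - (current + event_uplift)
    if headroom <= 0:
        return 0.0  # already at or past the threshold
    if growth_per_month <= 0:
        return float("inf")  # flat or shrinking usage never reaches it
    return headroom / growth_per_month


# Illustrative inputs only, not measurements from a real system.
baseline = runway_months(current=0.59, growth_per_month=0.05)
with_event = runway_months(current=0.59, growth_per_month=0.05,
                           event_uplift=0.145)
print(f"CPU headroom: {baseline:.1f} months")
print(f"With event:   {with_event:.1f} months")
```

Treating the event as a step change keeps the sketch simple. An event that also changes the growth rate would need a different model.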

Forecasting Implementation

import numpy as np
from datetime import datetime, timedelta

class CapacityForecaster:
    """Predict when resources will hit capacity thresholds."""
    
    def forecast_resource(self, metric_name: str, 
                          history: list, threshold: float = 0.8):
        """Project when a resource will hit threshold."""
        
        # Extract daily averages
        days = np.array([h["day_number"] for h in history])
        values = np.array([h["utilization"] for h in history])
        
        # Fit linear regression (simple but effective for most resources)
        slope, intercept = np.polyfit(days, values, 1)
        
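        # Use the latest observation as the starting point for the projection
        current_value = float(values[-1])

        # Already at or above the threshold: no runway left, regardless of trend
        if current_value >= threshold:
            return {
                "metric": metric_name,
                "current": round(current_value, 3),
                "runway_days": 0.0,
                "urgency": "critical",
                "action": f"Scale {metric_name} immediately",
            }

        # Flat or declining usage never reaches the threshold
        if slope <= 0:
            return {
                "metric": metric_name,
                "current": round(current_value, 3),
                "trend": "flat_or_declining",
                "runway_days": float('inf'),
                "urgency": "healthy",
                "action": "No scaling needed",
            }

        # Days until the threshold, assuming growth continues at the fitted slope
        days_to_threshold = float((threshold - current_value) / slope)
        threshold_date = datetime.now() + timedelta(days=days_to_threshold)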
        
        # Classify urgency
        if days_to_threshold < 30:
            urgency = "critical"
            action = f"Scale {metric_name} immediately"
        elif days_to_threshold < 90:
            urgency = "warning"
            action = f"Plan {metric_name} scaling for next quarter"
        else:
            urgency = "healthy"
            action = f"{metric_name} has sufficient runway"
        
        return {
            "metric": metric_name,
            "current": round(current_value, 3),
            "growth_rate_per_day": round(slope, 5),
            "runway_days": round(days_to_threshold, 1),
            "threshold_date": threshold_date.isoformat(),
            "urgency": urgency,
            "action": action,
        }
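
`forecast_resource` reads `history` as a list of dicts with `day_number` and `utilization` keys, but the code above does not show how that list is produced. The following is a minimal sketch of one way to build it, assuming raw samples arrive as `(timestamp, utilization)` tuples. The helper name `build_daily_history` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def build_daily_history(samples):
    """Collapse (timestamp, utilization) samples into one average per day.

    Returns dicts with the "day_number" and "utilization" keys that
    forecast_resource reads. day_number counts days since the first
    sample's calendar date.
    """
    by_date = defaultdict(list)
    for ts, utilization in samples:
        by_date[ts.date()].append(utilization)

    if not by_date:
        return []

    first_day = min(by_date)
    return [
        {"day_number": (day - first_day).days, "utilization": mean(values)}
        for day, values in sorted(by_date.items())
    ]


# Illustrative samples only.
samples = [
    (datetime(2024, 1, 1, 9), 0.50),
    (datetime(2024, 1, 1, 21), 0.54),
    (datetime(2024, 1, 2, 9), 0.53),
    (datetime(2024, 1, 4, 9), 0.56),  # Jan 3 has no samples; the gap is kept
]
history = build_daily_history(samples)
```

Using real day offsets rather than list indices means a gap in the data does not compress the time axis and inflate the fitted slope.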

Anti-Patterns

Anti-Pattern	Consequence	Fix
No forecasting, only monitoring	Outages from resource exhaustion	Project utilization trends weekly
Linear forecasting only	Miss seasonal spikes	Seasonal decomposition, factor in business events
Forecast compute, ignore storage	Storage fills silently	Forecast all resource dimensions (CPU, memory, storage, network, connections)
No buffer in forecasts	Scale exactly at limit, no headroom	Target 80% threshold, not 100%
Annual capacity review only	12 months between assessments	Monthly automated forecasting
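
The "linear forecasting only" row can be illustrated with a small extension of the linear fit. This is a sketch under stated assumptions, not a full seasonal decomposition: it assumes one sample per day, an additive seasonal pattern with a fixed period (weekly by default), and at least two full periods of history. The function name `seasonal_runway_days` and the synthetic data are hypothetical.

```python
import numpy as np


def seasonal_runway_days(values, threshold=0.8, period=7):
    """Days until the seasonal peak of a linear trend reaches `threshold`."""
    values = np.asarray(values, dtype=float)
    if len(values) < 2 * period:
        raise ValueError("need at least two full periods of history")

    days = np.arange(len(values))
    slope, intercept = np.polyfit(days, values, 1)

    # Average residual for each position in the cycle (e.g. each weekday).
    residuals = values - (slope * days + intercept)
    seasonal = np.array([residuals[i::period].mean() for i in range(period)])
    peak = seasonal.max()

    # Trend value on the last observed day, lifted by the worst seasonal offset.
    current_peak = slope * days[-1] + intercept + peak
    if current_peak >= threshold:
        return 0.0  # the seasonal peak already reaches the threshold
    if slope <= 0:
        return float("inf")  # flat or declining trend never reaches it
    return float((threshold - current_peak) / slope)


# Synthetic data: steady growth plus a spike on one day of each week.
d = np.arange(56)
values = 0.5 + 0.002 * d + 0.05 * (d % 7 == 5)
print(f"Runway at seasonal peak: {seasonal_runway_days(values):.0f} days")
```

Because the peak offset is added on top of the trend, this estimate is never longer than a trend-only projection from the same fitted line. Planned business events still have to be factored in separately.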

Capacity forecasting is the difference between scaling proactively (“we need to add capacity next month”) and scaling reactively (“production is down because the disk is full”). The former is engineering; the latter is firefighting.
