Capacity Planning
Forecast infrastructure capacity needs to prevent outages from resource exhaustion and avoid waste from over-provisioning. Covers demand modeling, load testing for capacity, resource saturation signals, and building capacity planning into your engineering process.
Capacity planning answers a question Engineering and Finance both care about: “How much infrastructure do we need, and when do we need more?” Under-provision and you get outages during traffic peaks. Over-provision and you waste money on idle resources. The goal is accurate forecasting that keeps you in the sweet spot.
Capacity Planning Process
1. Measure current usage
   - CPU, memory, disk, network per service
   - Request rate, error rate, latency
   - Database connections, query throughput
2. Model demand growth
   - Historical growth rates
   - Business forecasts (marketing campaigns, launches)
   - Seasonal patterns (holidays, end-of-quarter)
3. Determine capacity limits
   - Load test to find breaking points
   - Identify bottlenecks at each load level
   - Define saturation thresholds (e.g., 70% CPU)
4. Forecast when limits will be reached
   - Extrapolate growth against limits
   - Account for peak-to-average ratio
5. Plan provisioning
   - Lead time for procurement/scaling
   - Budget approval timeline
   - Implementation and testing window
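Steps 4 and 5 can be combined into a single "order by" date. The following is a minimal sketch, not a prescribed method: it assumes linear growth, assumes peak demand scales in proportion to average demand, and uses hypothetical names (`provisioning_deadline`, `peak_to_avg`, `lead_time_days`) and made-up numbers.

```python
from datetime import date, timedelta

def provisioning_deadline(current_avg, daily_growth, capacity_limit,
                          peak_to_avg=2.0, lead_time_days=30, today=None):
    """Return (days_until_limit, start_by_date) for one resource.

    Assumptions (illustrative only): usage and limit share one unit
    (e.g. requests/second), growth is linear, and peak = average * peak_to_avg.
    """
    today = today or date.today()
    current_peak = current_avg * peak_to_avg    # plan against peak, not average
    peak_growth = daily_growth * peak_to_avg    # peak grows with the average
    if current_peak >= capacity_limit:
        return 0, today                         # already at or past the limit
    if peak_growth <= 0:
        return float("inf"), None               # flat or shrinking demand
    days_until_limit = (capacity_limit - current_peak) / peak_growth
    # Provisioning work has to start lead_time_days before the limit is reached.
    start_in = max(0, int(days_until_limit) - lead_time_days)
    return days_until_limit, today + timedelta(days=start_in)

# Hypothetical service: 400 req/s average, growing 2 req/s per day,
# load-tested limit of 1200 req/s, 30 days of procurement lead time.
days, start_by = provisioning_deadline(
    current_avg=400, daily_growth=2, capacity_limit=1200,
    peak_to_avg=2.0, lead_time_days=30, today=date(2024, 1, 1),
)
print(f"Limit reached in {days:.0f} days; start provisioning by {start_by}")
```

The lead time here stands in for procurement, budget approval, and the testing window added together; in practice each may need its own estimate.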
Demand Modeling
Linear Growth
    import numpy as np

    def forecast_linear(history, threshold, days_ahead=90):
        """Simple linear regression forecast over daily usage samples."""
        x = np.arange(len(history))
        y = np.array(history, dtype=float)
        # Linear fit: slope is the average growth per day
        slope, intercept = np.polyfit(x, y, 1)
        # Forecast the next days_ahead days
        future_x = np.arange(len(history), len(history) + days_ahead)
        forecast = slope * future_x + intercept
        return {
            'daily_growth': slope,
            # Days until the latest observation reaches the threshold at the fitted rate
            'days_to_threshold': (threshold - history[-1]) / slope if slope > 0 else float('inf'),
            # Value at the end of the forecast window (90 days by default)
            'forecast_90d': forecast[-1],
        }
Seasonal Adjustment
| Month | Traffic Multiplier |
|---|---|
| January | 0.85 (post-holiday dip) |
| February | 0.90 |
| March | 1.00 (baseline) |
| April | 1.05 |
| May | 1.10 |
| June | 1.00 |
| July | 0.95 |
| August | 0.95 |
| September | 1.10 |
| October | 1.15 |
| November | 1.40 (pre-holiday) |
| December | 1.60 (holiday peak) |
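One way to use these multipliers is to scale a trend baseline up or down for the target month, and to divide them out of observed traffic before fitting a trend. The sketch below copies the values from the table; the function names and the 1,000 req/s baseline are hypothetical.

```python
# Multipliers copied from the table above; March is the baseline (1.00).
SEASONAL_MULTIPLIER = {
    1: 0.85, 2: 0.90, 3: 1.00, 4: 1.05, 5: 1.10, 6: 1.00,
    7: 0.95, 8: 0.95, 9: 1.10, 10: 1.15, 11: 1.40, 12: 1.60,
}

def seasonal_forecast(baseline, month):
    """Scale a deseasonalized baseline by the multiplier for month (1-12)."""
    return baseline * SEASONAL_MULTIPLIER[month]

def deseasonalize(observed, month):
    """Remove the seasonal effect so different months can be trended together."""
    return observed / SEASONAL_MULTIPLIER[month]

# Hypothetical example: a 1,000 req/s trend baseline projected into December.
december_peak = seasonal_forecast(1000, 12)
print(f"Expected December traffic: {december_peak:.0f} req/s")
```

Deseasonalizing first matters when the history feeds a linear fit: otherwise a November and December spike at the end of the window inflates the fitted growth rate.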
Resource Saturation Signals
| Resource | Warning Threshold | Critical Threshold | Action |
|---|---|---|---|
| CPU | > 70% sustained | > 85% sustained | Scale horizontally |
| Memory | > 75% | > 90% | Investigate leaks, scale up |
| Disk | > 70% | > 85% | Expand storage, archive old data |
| Network | > 60% of bandwidth | > 80% of bandwidth | Upgrade network, add CDN |
| DB connections | > 70% of pool | > 85% of pool | Increase pool, optimize queries |
| DB IOPS | > 70% of provisioned | > 85% of provisioned | Upgrade tier, optimize queries |
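The thresholds in the table translate directly into a small classifier. This is a sketch under assumptions: utilization is passed as a fraction of capacity, the resource keys are hypothetical names, and the "sustained" qualifier for CPU is left to the caller (for example, by passing an average over a time window rather than an instantaneous sample).

```python
# (warning, critical) thresholds as fractions of capacity, from the table above.
THRESHOLDS = {
    "cpu": (0.70, 0.85),
    "memory": (0.75, 0.90),
    "disk": (0.70, 0.85),
    "network": (0.60, 0.80),
    "db_connections": (0.70, 0.85),
    "db_iops": (0.70, 0.85),
}

def classify(resource, utilization):
    """Return 'ok', 'warning', or 'critical' for one utilization sample."""
    warning, critical = THRESHOLDS[resource]
    if utilization > critical:
        return "critical"
    if utilization > warning:
        return "warning"
    return "ok"

def saturation_report(samples):
    """Map {resource: utilization} to {resource: status}, omitting healthy ones."""
    statuses = {name: classify(name, value) for name, value in samples.items()}
    return {name: status for name, status in statuses.items() if status != "ok"}

# Hypothetical snapshot of one service.
print(saturation_report({"cpu": 0.72, "memory": 0.95, "network": 0.50}))
```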
Load Testing for Capacity
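    // k6 capacity test: gradually increase load to find the breaking point
    import http from 'k6/http';
    import { sleep } from 'k6';

    export const options = {
      // target is the number of virtual users at the end of each stage
      stages: [
        { duration: '5m', target: 100 },   // Warm up
        { duration: '10m', target: 500 },  // Normal load
        { duration: '10m', target: 1000 }, // 2x normal
        { duration: '10m', target: 2000 }, // 4x normal (expected peak)
        { duration: '10m', target: 5000 }, // 10x normal (find the ceiling)
        { duration: '5m', target: 0 },     // Cool down
      ],
      thresholds: {
        http_req_duration: ['p(99)<500'],  // p99 latency under 500 ms
        http_req_failed: ['rate<0.01'],    // error rate under 1%
      },
    };

    export default function () {
      // TARGET_URL is a placeholder; pass it with: k6 run -e TARGET_URL=... script.js
      http.get(__ENV.TARGET_URL);
      sleep(1);
    }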
What to Measure During Load Tests
At each load level, record:
- Response time (p50, p95, p99)
- Error rate
- Throughput (requests/second)
- CPU utilization per service
- Memory utilization per service
- Database query latency
- Queue depth
- Connection pool utilization
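The lowest load level at which any metric breaches its threshold is your capacity limit. The highest level that stayed within every threshold is the load you can safely plan around.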
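Finding that limit from recorded results is a simple scan over the load levels in increasing order. The sketch below assumes each level was summarized into one dictionary; the metric names, limits, and numbers are hypothetical.

```python
def find_capacity_limit(results, limits):
    """Scan load levels in increasing order and stop at the first breach.

    results: list of dicts sorted by 'load', each holding metric values.
    limits:  {metric_name: maximum_allowed_value}.
    Returns (last_safe_load, breaching_load, breached_metrics).
    """
    last_safe = None
    for level in results:
        breached = [m for m, limit in limits.items() if level.get(m, 0) > limit]
        if breached:
            return last_safe, level["load"], breached
        last_safe = level["load"]
    return last_safe, None, []  # no breach found: the ceiling is above the tested range

# Hypothetical measurements, one row per load stage.
results = [
    {"load": 500,  "p99_ms": 180,  "error_rate": 0.001, "cpu": 0.45},
    {"load": 1000, "p99_ms": 260,  "error_rate": 0.002, "cpu": 0.62},
    {"load": 2000, "p99_ms": 410,  "error_rate": 0.004, "cpu": 0.81},
    {"load": 5000, "p99_ms": 1900, "error_rate": 0.060, "cpu": 0.97},
]
limits = {"p99_ms": 500, "error_rate": 0.01, "cpu": 0.70}

safe, limit, why = find_capacity_limit(results, limits)
print(f"Safe up to {safe}, limit at {limit}, breached: {why}")
```

In this made-up data, latency and error rate still look healthy at 2000, but CPU has already crossed its threshold, which is why every metric is checked at every level rather than latency alone.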
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No capacity monitoring | Surprised by resource exhaustion | Monitor saturation signals |
| Planning from averages only | Peaks cause outages | Plan for peak (2-3x average) |
| Annual capacity review | 11 months of drift | Monthly review, quarterly deep-dive |
| No load testing | Unknown breaking points | Quarterly load tests |
| Over-provisioning 3x everywhere | Wasted spend | Right-size based on actual peak + buffer |
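The "actual peak + buffer" fix in the last row can be made concrete with a few lines of arithmetic. This is an illustrative sketch only: the 30% buffer, the minimum of two instances, and the per-instance capacity are assumptions, not recommendations from this article.

```python
import math

def right_size(peak_load, per_instance_capacity, buffer=0.3, min_instances=2):
    """Instances needed to serve peak_load plus a safety buffer.

    per_instance_capacity should be the load one instance handles while
    staying under its saturation thresholds (e.g. from a load test).
    """
    required = peak_load * (1 + buffer)
    return max(min_instances, math.ceil(required / per_instance_capacity))

# Hypothetical service: 1,800 req/s observed peak, 250 req/s per instance.
sized = right_size(1800, 250)             # peak + 30% buffer
blanket = math.ceil(3 * 1800 / 250)       # "3x everywhere" for comparison
print(f"Right-sized: {sized} instances, blanket 3x: {blanket} instances")
```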
Capacity planning is not about predicting the future perfectly — it is about reducing surprise. Accurate demand modeling, regular load testing, and saturation monitoring give you the lead time to provision before you run out.