Capacity Planning: Scaling Infrastructure Before You Need To
Predict and provision infrastructure capacity before demand outpaces supply. Covers load modeling, bottleneck identification, scaling strategies, cost-capacity tradeoffs, and the planning process that prevents both outages and over-provisioning.
Capacity planning is the discipline of having enough infrastructure to handle your traffic — tomorrow, next month, and during the annual spike — without paying for resources you do not need today. It sits at the intersection of engineering and finance: too little capacity causes outages, too much wastes money.
Most teams do capacity planning reactively: they add more servers after an outage. This guide covers how to plan proactively so you never have the “we ran out of capacity” conversation with your CEO.
The Capacity Planning Process
1. Understand current usage
   "We serve 10,000 requests/second with 40% CPU utilization
   on 20 servers. Database handles 5,000 queries/second."
2. Model growth
   "Traffic grows 15% month-over-month. Black Friday is 3× normal.
   Marketing campaign in October expected to add 25%."
3. Identify bottlenecks
   "At 15,000 req/s, the database connection pool saturates.
   At 20,000 req/s, we run out of API server CPU."
4. Plan capacity additions
   "Need 10 more API servers by October.
   Need database upgrade (or read replicas) by November."
5. Budget and approve
   "Additional infrastructure costs $X/month.
   Prevents an outage that costs $Y/hour."
6. Execute and verify
   "Deploy additional capacity. Load test to verify."
Resource Utilization Tracking
| Resource | Measure | Healthy Range | Danger Zone |
|---|---|---|---|
| CPU | Average and p99 utilization | 40-60% average | > 80% sustained |
| Memory | Used / Available | 50-70% | > 85% |
| Disk I/O | IOPS and throughput | < 60% of provisioned | > 80% |
| Network | Bandwidth utilization | < 50% of link capacity | > 70% |
| Database connections | Active / Max | < 60% of pool | > 80% of pool |
| Queue depth | Messages waiting | < 100 messages | Growing consistently |
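The healthy/danger thresholds above are easy to encode as a first-pass automated check. A minimal sketch in Python (the threshold values mirror the table; the resource names and current readings are illustrative, and in practice the inputs would come from your metrics system):

```python
# Flag resources whose utilization exceeds the danger-zone thresholds above.
# Thresholds mirror the table; wire the inputs to your metrics system in practice.
DANGER_THRESHOLDS = {
    "cpu": 0.80,              # > 80% sustained
    "memory": 0.85,           # > 85%
    "disk_io": 0.80,          # > 80% of provisioned IOPS
    "network": 0.70,          # > 70% of link capacity
    "db_connections": 0.80,   # > 80% of pool
}

def danger_zone_report(utilization: dict[str, float]) -> list[str]:
    """Return the resources whose utilization exceeds their danger threshold."""
    return [
        f"{name}: {value:.0%} (threshold {DANGER_THRESHOLDS[name]:.0%})"
        for name, value in utilization.items()
        if value > DANGER_THRESHOLDS.get(name, 1.0)
    ]

current = {"cpu": 0.55, "memory": 0.88, "db_connections": 0.82}
for line in danger_zone_report(current):
    print("DANGER:", line)   # memory and db_connections are flagged
```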
Utilization over time — identify the trend:
100% ────────────────────────── Capacity limit
90% ─────────────────────╱──── Danger zone
80% ──────────────────╱──────
70% ───────────────╱─────────
60% ────────────╱──────────── ← Current utilization trend
50% ─────────╱───────────────
40% ──────╱──────────────────
30% ───╱─────────────────────
0% ╱────────────────────────
Jan Feb Mar Apr May Jun Jul Aug
At 15% monthly growth starting from 40% in January:
Jan: 40% → Apr: ~61% → Jun: ~80% → Aug: ~107% (outage)
The 60% planning threshold is crossed in April;
the 80% danger zone arrives in June.
Lead time for new infrastructure: 2-4 weeks.
Therefore: start procurement in April, when utilization
crosses 60%, so new capacity is live before June.
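The same projection can be scripted rather than read off the chart. A minimal sketch, assuming compounding month-over-month growth and using the example figures above (40% starting utilization, 15% growth):

```python
import math

def months_until(limit: float, current: float, monthly_growth: float) -> float:
    """Months until utilization reaches `limit`: solves current * (1+g)^n = limit."""
    return math.log(limit / current) / math.log(1.0 + monthly_growth)

current, growth = 0.40, 0.15  # January: 40% utilization, growing 15%/month
for label, limit in [("plan at 60%", 0.60),
                     ("danger at 80%", 0.80),
                     ("outage at 100%", 1.00)]:
    print(f"{label}: {months_until(limit, current, growth):.1f} months of runway")
# plan at 60%: 2.9 months; danger at 80%: 5.0 months; outage at 100%: 6.6 months
```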
Load Modeling
Calculating Capacity Requirements
Current state:
10,000 requests/second
20 servers
40% CPU utilization per server
→ Each server handles 500 req/s at 40% CPU
→ Each server max capacity ≈ 1,250 req/s (at 100% CPU)
→ Comfortable capacity (at 70% CPU): 875 req/s per server
Target state (Black Friday, 3× traffic):
30,000 requests/second
At 875 req/s per server (70% target utilization):
→ Need 30,000 / 875 ≈ 34.3, round up to 35 servers
→ Currently have 20
→ Need 15 additional servers
Plus buffer (20% for unexpected spikes):
→ 35 × 1.2 = 42 servers total
→ Need 22 additional servers
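The worked example generalizes to a small helper. A sketch under the same assumption the example makes, that request throughput scales linearly with CPU (verify that with a load test before trusting the output); the function name and defaults are illustrative:

```python
import math

def servers_needed(target_rps: float,
                   current_rps: float,
                   current_servers: int,
                   current_cpu: float,
                   target_cpu: float = 0.70,
                   buffer: float = 0.20) -> int:
    """Servers required for target_rps, assuming throughput scales linearly with CPU."""
    per_server_rps = current_rps / current_servers   # 500 req/s at 40% CPU
    max_rps = per_server_rps / current_cpu           # ≈ 1,250 req/s at 100% CPU
    comfortable_rps = max_rps * target_cpu           # ≈ 875 req/s at 70% CPU
    return math.ceil(target_rps / comfortable_rps * (1.0 + buffer))

total = servers_needed(target_rps=30_000, current_rps=10_000,
                       current_servers=20, current_cpu=0.40)
print(total, "servers total,", total - 20, "additional")  # 42 total, 22 additional
```

Here the 20% buffer is applied before rounding; the worked example rounds to 35 first and then buffers, which lands on the same 42 in this case.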
The Bottleneck Hierarchy
Typical bottleneck order (what breaks first):
1. Database connections (pool exhaustion)
2. Database query latency (CPU or I/O bound)
3. Application server memory (GC pressure, memory leaks)
4. Application server CPU (compute-bound workloads)
5. Network bandwidth (large payloads, media serving)
6. External API rate limits (third-party dependencies)
7. DNS resolution (often overlooked, cache TTL issues)
Start investigation at #1 and work down.
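To see which layer gives out first for your own numbers, rank each resource by the traffic multiple at which it saturates. A minimal sketch, assuming load on each resource scales linearly with traffic; the loads and limits shown are the illustrative figures from this guide's example system:

```python
# Rank resources by the traffic multiple at which they saturate.
# Loads and limits are illustrative; measure your own system under load,
# and note the linear-scaling assumption.
resources = {
    # name: (load at current traffic, hard limit)
    "db_connection_pool": (133, 200),        # saturates near 1.5x (~15,000 req/s)
    "api_server_cpu_rps": (10_000, 20_000),  # practical ceiling near 2.0x
    "network_gbps": (2.0, 8.0),              # plenty of headroom (4.0x)
}

def saturation_multiplier(load: float, limit: float) -> float:
    """Traffic multiple at which this resource hits its limit."""
    return limit / load

for name, (load, limit) in sorted(resources.items(),
                                  key=lambda kv: saturation_multiplier(*kv[1])):
    print(f"{name}: saturates at {saturation_multiplier(load, limit):.1f}x current traffic")
# db_connection_pool first (1.5x), then api_server_cpu_rps (2.0x), then network (4.0x)
```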
Each bottleneck has a different scaling strategy.
Scaling Strategies
| Strategy | Best For | Limitation |
|---|---|---|
| Vertical scaling (bigger instance) | Database, single-threaded workloads | Physical limits, expensive at top tier |
| Horizontal scaling (more instances) | Stateless services, web servers | Requires stateless design |
| Caching | Read-heavy workloads | Cache invalidation complexity |
| Read replicas | Read-heavy database workloads | Replication lag |
| CDN | Static assets, media | Limited benefit for dynamic, per-user content |
| Async processing | Background jobs, batch operations | Increased system complexity |
| Sharding | Very large datasets | Application complexity, cross-shard queries |
Auto-Scaling Configuration
# Kubernetes HPA (Horizontal Pod Autoscaler)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 5    # Never go below 5 (night time traffic)
  maxReplicas: 50   # Never go above 50 (budget control)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # Scale up when CPU > 60%
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "500"      # Scale when > 500 req/s per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up
      policies:
        - type: Percent
          value: 50                     # Add up to 50% more pods
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 10                     # Remove max 10% at a time
          periodSeconds: 120
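One caveat on the second metric: CPU utilization works with the standard metrics-server, but a per-pod metric like `http_requests_per_second` only resolves if a custom metrics adapter (for example, Prometheus Adapter) exposes it, so treat that metric name as illustrative. Once applied, `kubectl get hpa checkout-api --watch` shows the live replica count and current metric values against their targets.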
Implementation Checklist
- Track utilization for all critical resources: CPU, memory, disk, network, connections
- Model traffic growth: monthly growth rate + known events (campaigns, launches)
- Identify your bottleneck hierarchy: what breaks first as traffic increases?
- Set utilization thresholds: alert at 70%, plan at 60% sustained
- Calculate capacity runway: “At current growth, we hit limits in N weeks”
- Configure auto-scaling for stateless services with sensible min/max limits
- Slow down scale-down (5 min stabilization) to prevent oscillation
- Load test quarterly at 2× current peak traffic
- Budget capacity additions quarterly with 20% buffer
- Document capacity assumptions and review them when architecture changes