Kubernetes Resource Management: Stop Wasting Money on Over-Provisioned Pods
Master Kubernetes resource requests, limits, QoS classes, and horizontal/vertical autoscaling. Covers the real cost of over-provisioning, how to rightsize workloads, and the resource management patterns that save money without risking stability.
The average Kubernetes cluster is 60-70% over-provisioned. Engineers set CPU requests to 500m and memory to 512Mi because that is what the template said — regardless of whether the pod actually uses 50m of CPU and 128Mi of memory. Multiply this by 200 pods and you are paying for 10 nodes when 4 would suffice.
The reason this happens is fear. Nobody wants to be the person who under-provisioned a pod and caused an OOM kill in production. So they over-provision, the cluster is stable, and the cloud bill silently grows. This guide covers how to rightsize workloads based on data, not fear.
Requests vs Limits: What They Actually Do
| Setting | What It Does | What Happens If Wrong |
|---|---|---|
| CPU request | Guaranteed CPU share. The scheduler uses this to place pods. | Too high → wasted capacity and unschedulable pods. Too low → the pod gets a small CPU share and is starved under node contention. |
| CPU limit | Maximum CPU. The pod is throttled (not killed) above this. | Too high → noisy-neighbor risk. Too low → artificial performance ceiling. |
| Memory request | Guaranteed memory. The scheduler uses this to place pods. | Too high → wasted capacity. Too low → pods packed onto nodes without enough memory, leading to evictions and OOM kills under load. |
| Memory limit | Maximum memory. The pod is OOM-killed above this. | Too high → risk of node memory pressure. Too low → spurious OOM kills. |
apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  containers:
  - name: api
    image: nginx:1.27        # placeholder image for illustration
    resources:
      requests:
        cpu: "100m"          # 0.1 CPU core guaranteed
        memory: "256Mi"      # 256 MiB guaranteed
      limits:
        cpu: "500m"          # throttled above 0.5 cores
        memory: "512Mi"      # OOM-killed above 512 MiB
QoS Classes
Kubernetes assigns Quality of Service classes based on how you set requests and limits:
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | Every container sets CPU and memory requests and limits, with requests == limits | Last to be evicted (safest) |
| Burstable | At least one request or limit is set, but the pod does not meet the Guaranteed criteria (typically requests < limits) | Evicted after BestEffort pods |
| BestEffort | No requests or limits set | First to be evicted (most risky) |
Production recommendation: set requests for ALL pods (never BestEffort in production) and always set memory limits. CPU limits are debatable; some teams remove them to avoid throttling (see the next section).
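A LimitRange can enforce this namespace-wide by injecting default requests and limits into any container that omits them, so nothing lands as BestEffort. A minimal sketch, assuming a production namespace; the default values are illustrative, not recommendations:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:     # applied when a container sets no requests
      cpu: "100m"
      memory: "128Mi"
    default:            # applied when a container sets no limits
      cpu: "500m"
      memory: "256Mi"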
The CPU Limit Debate
| Approach | Argument |
|---|---|
| Set CPU limits | Prevents noisy neighbors. Predictable performance. |
| Remove CPU limits | Avoids artificial throttling. Pods use idle CPU. Better utilization. |
| Compromise | Set generous CPU limits (3x request). Throttling only under extreme load. |
Google’s internal research suggests removing CPU limits and relying on requests for scheduling gives better overall cluster utilization. However, this only works if your cluster has headroom and your monitoring catches CPU contention.
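If you go the no-CPU-limit route, the container's resources block keeps a CPU request (which still drives scheduling and CFS weighting) and a memory limit, and simply omits the CPU limit. A sketch of a container spec fragment; values are illustrative, and the commented-out line shows the compromise variant from the table:

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    memory: "512Mi"   # always keep a memory cap
    # cpu: "300m"     # compromise variant: ~3x the request, throttles only under extreme load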
Rightsizing Workloads
Step 1: Measure Actual Usage
# CPU: P95 usage over 14 days (per pod)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="production"}[5m])[14d:5m]
)

# Memory: P95 usage over 14 days (per pod)
quantile_over_time(0.95,
  container_memory_working_set_bytes{namespace="production"}[14d]
)
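If the Prometheus Operator is available, the 5-minute CPU rate can be pre-computed as a recording rule so the 14-day P95 query stays cheap. A sketch assuming the monitoring.coreos.com PrometheusRule CRD is installed; the rule name and namespace are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rightsizing-recording-rules
  namespace: monitoring
spec:
  groups:
  - name: rightsizing.rules
    interval: 1m
    rules:
    - record: namespace_pod_container:cpu_usage:rate5m
      expr: rate(container_cpu_usage_seconds_total{namespace="production"}[5m])

The 14-day P95 then becomes quantile_over_time(0.95, namespace_pod_container:cpu_usage:rate5m[14d]).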
Step 2: Compare to Requests
Pod: checkout-api
CPU request: 500m | CPU P95 usage: 80m | Over-provisioned: 84%
Memory request: 512Mi | Memory P95 usage: 180Mi | Over-provisioned: 65%
Recommendation:
CPU request: 150m (P95 × 1.5 ≈ 120m, rounded up to the next 50m)
Memory request: 270Mi (P95 × 1.5 safety margin)
Memory limit: 400Mi (P99 × 1.5 safety margin)
Monthly savings per replica: ~$12
× 5 replicas = $60/month
× 40 similar pods = $2,400/month
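If you want to apply these numbers by hand first (before handing control to VPA), they go straight into the Deployment's pod template. A sketch for the checkout-api example; the image and labels are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  replicas: 5
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
      - name: api
        image: registry.example.com/checkout-api:1.0   # placeholder image
        resources:
          requests:
            cpu: "150m"       # was 500m
            memory: "270Mi"   # was 512Mi
          limits:
            memory: "400Mi"   # P99 × 1.5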
Step 3: Apply with VPA
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Off"   # Start with "Off" — recommendations only
    # Switch to "Auto" after validating recommendations
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: "50m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"
Horizontal Pod Autoscaling (HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3    # Never go below 3 for HA
  maxReplicas: 20   # Cost ceiling
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1 min before scaling up
      policies:
      - type: Pods
        value: 4                        # Add max 4 pods at a time
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 25                       # Remove max 25% of pods at a time
        periodSeconds: 120
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70          # Scale when CPU > 70% of request
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80          # Scale when memory > 80% of request
HPA Common Mistakes
| Mistake | What Happens | Fix |
|---|---|---|
| minReplicas = 1 | Single point of failure. No HA. | Set minReplicas ≥ 2 (3 for critical services) |
| maxReplicas too high | Auto-scaler provisions 100 pods, $10K bill | Set realistic max based on actual load patterns |
| Scaling on CPU only | Memory-bound apps never scale | Add memory metric alongside CPU |
| No stabilization window | Rapid scale up/down (flapping) | Set scaleDown stabilization ≥ 300s |
| HPA + VPA on same metric | They fight each other | Use VPA for vertical, HPA for horizontal on different metrics |
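On the last row: a common way to keep the two autoscalers out of each other's way is to let HPA own CPU-based scaling while restricting VPA to memory via controlledResources. A sketch reusing the checkout-api VPA from above; treat it as one possible split, not the only one:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: api
      controlledResources: ["memory"]   # VPA adjusts memory requests only; HPA scales replicas on CPU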
Node Sizing Strategy
| Approach | Pros | Cons |
|---|---|---|
| Few large nodes | Better bin-packing, less overhead | Bigger blast radius per node failure |
| Many small nodes | Smaller blast radius, faster replacement | More scheduling overhead, worse utilization |
| Mixed sizes | Flexible, optimize for different workloads | More complex management |
General guidance:
- Production: 4-16 vCPU, 16-64GB nodes (balance of flexibility and utilization)
- Batch/ML: Large nodes (32+ vCPU) for GPU or memory-intensive workloads
- System pods: Small dedicated nodes (2 vCPU) for kube-system, monitoring
Never run production on fewer than 3 nodes (HA requirement).
Target 60-75% node utilization (leave headroom for bursts and pod scheduling).
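Keeping at least 3 nodes only pays off if replicas actually land on different nodes. A topologySpreadConstraint in the pod template enforces that; a sketch of the relevant fragment, where the app label is a placeholder:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # spread replicas across nodes
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: checkout-api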
Implementation Checklist
- Audit all pods: compare resource requests to actual P95 usage over 14 days
- Rightsize top 20 most over-provisioned pods (biggest savings, lowest risk)
- Set memory limits on all production pods (prevent unbounded memory growth)
- Deploy VPA in recommendation mode to get ongoing rightsizing suggestions
- Configure HPA for stateless services with scaleDown stabilization ≥ 300 seconds
- Set minReplicas ≥ 3 for all critical services
- Never run BestEffort pods in production (always set resource requests)
- Monitor cluster utilization: target 60-75% node CPU/memory usage
- Review resource usage monthly and apply rightsizing recommendations
- Calculate monthly savings from rightsizing and report to leadership