Cloud Cost Optimization Playbook
Systematic cloud cost optimization for AWS, Azure, and GCP. Covers rightsizing, reserved capacity, spot instances, storage tiering, and building a FinOps practice.
Cloud costs are the new infrastructure debt. By common industry estimates, enterprises waste roughly 30-35% of cloud spending on over-provisioned resources, forgotten instances, and misconfigured storage tiers. Unlike on-premises hardware, cloud costs compound silently: a misconfigured auto-scaling policy can burn thousands of dollars before anyone notices.
FinOps isn’t about cutting costs. It’s about maximizing business value per cloud dollar. Sometimes that means spending more on the right resources and less on the wrong ones. The goal is informed spending, not austerity.
The FinOps Maturity Model
| Level | Capabilities | Typical Waste |
|---|---|---|
| Crawl | Basic cost visibility, no optimization | 40-60% |
| Walk | Rightsizing, reserved instances, basic tagging | 20-35% |
| Run | Automated optimization, showback/chargeback, forecasting | 10-15% |
Most teams are at “Crawl” and think they’re at “Walk” because they have a cost dashboard.
The Big Three Cost Levers
1. Rightsizing (Save 20-40%)
The single highest-impact optimization. Most instances run at 10-30% average CPU utilization.
Protocol:
- Pull 14 days of CPU/memory metrics for all compute instances
- Flag instances below 40% average utilization
- Recommend one size down (or spot/serverless for < 20%)
- Exclude: burst workloads, batch processing, stateful databases
Watch out: Don’t rightsize based on average alone. Check P95 utilization — an instance at 15% average but 95% peak during business hours is correctly sized. Use percentile-based analysis, not averages.
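The protocol and the P95 caveat above can be combined into one check. The following is a minimal sketch, not a definitive implementation: the `InstanceMetrics` type, the function names, and the 80% peak cutoff are assumptions made for illustration, and the samples are assumed to be CPU percentages already exported from your monitoring system.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class InstanceMetrics:
    instance_id: str
    cpu_samples: list[float]  # CPU utilization (%) over the 14-day window


def p95(samples: list[float]) -> float:
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return quantiles(samples, n=20, method="inclusive")[-1]


def rightsizing_recommendation(
    metrics: InstanceMetrics,
    avg_threshold: float = 40.0,   # from the protocol: flag below 40% average
    peak_threshold: float = 80.0,  # assumed cutoff for "peak is too high to shrink"
) -> str:
    avg = mean(metrics.cpu_samples)
    peak = p95(metrics.cpu_samples)
    # A low average with a high P95 means the instance is sized for its peak.
    if avg >= avg_threshold or peak >= peak_threshold:
        return "keep"
    if avg < 20.0:
        return "consider spot/serverless"
    return "one size down"
```

An instance that idles at 5% but spends 15% of its samples at 95% has an 18.5% average, yet this check still returns "keep" because its P95 is 95%.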
2. Reserved Capacity (Save 30-60%)
For steady-state workloads, reserved instances or savings plans reduce costs 30-60% compared to on-demand pricing.
Decision Framework:
| Workload Pattern | Recommendation | Savings |
|---|---|---|
| 24/7 steady-state | 1-year reserved (all upfront) | 40-50% |
| Business hours only | Scheduled reserved + spot | 30-40% |
| Seasonal peaks | On-demand + spot for burst | 20-30% |
| Development/Test | Spot instances | 60-70% |
Critical rule: Never reserve more than 70% of your baseline usage. The remaining 30% should be on-demand to absorb growth and changes. Over-committed reservations are worse than no reservations.
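As a rough illustration of the 70% rule, the sketch below derives a reservation target from hourly instance counts. Treating the lowest observed hourly count as the "baseline" is an assumption; a low percentile may be a better fit if your data contains outages. The function name and parameters are hypothetical.

```python
def reservation_target(
    hourly_instance_counts: list[int],
    max_reserved_percent: int = 70,  # the cap from the rule above
) -> int:
    """Return how many instances to cover with reservations."""
    if not hourly_instance_counts:
        raise ValueError("need at least one usage sample")
    # Assumption: baseline = the lowest number of instances ever running.
    baseline = min(hourly_instance_counts)
    # Integer arithmetic rounds down, so the cap is never exceeded.
    return baseline * max_reserved_percent // 100
```

With a fleet that never drops below 10 instances, this suggests reserving 7 and leaving the rest on-demand or spot.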
3. Storage Tiering (Save 40-70%)
Storage costs grow monotonically — data accumulates but rarely gets deleted. Implement lifecycle policies from day one.
Hot (frequent access): S3 Standard / Azure Hot
Warm (monthly access): S3 IA / Azure Cool → 40% savings
Cold (quarterly access): S3 Glacier IR / Azure Cold → 60% savings
Archive (compliance only): S3 Glacier Deep Archive / Azure Archive → 90% savings
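Automated lifecycle policy: Move objects to cheaper tiers based on age or last-access date. Most cloud providers support this natively: S3 lifecycle rules transition objects by age, S3 Intelligent-Tiering moves them by access pattern, and Azure Blob Storage lifecycle management can use last-access time when access tracking is enabled. The rule of thumb: if data hasn’t been accessed in 90 days, it should move to cold storage.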
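The tiering logic can be sketched as a simple lookup on idle time. This is an illustration only: the 90-day cold threshold comes from the rule of thumb above, while the 30-day warm threshold and the `compliance_only` flag are assumptions, and the tier names are generic rather than any provider's API values.

```python
from datetime import date

# (minimum days since last access, tier), checked from coldest to hottest.
TIER_THRESHOLDS = [
    (90, "cold"),  # rule of thumb from the text
    (30, "warm"),  # assumed threshold for "monthly access"
]


def target_tier(last_accessed: date, today: date, compliance_only: bool = False) -> str:
    # Data kept purely for compliance goes straight to the archive tier.
    if compliance_only:
        return "archive"
    idle_days = (today - last_accessed).days
    for min_idle_days, tier in TIER_THRESHOLDS:
        if idle_days >= min_idle_days:
            return tier
    return "hot"
```

In practice the provider's native lifecycle rules would do the moving; a function like this is mainly useful for auditing which objects are sitting in the wrong tier.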
Tagging Strategy
You can’t optimize what you can’t attribute. Tags are the foundation of cloud cost management.
Mandatory Tags (enforce via policy):
- cost-center: Who pays for this resource
- environment: prod / staging / dev / sandbox
- service: Which application or service owns this
- owner: Team or individual responsible
Enforcement: Block resource creation without mandatory tags. Every major cloud provider supports tag policies. Resources without proper tags are invisible to cost analysis.
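Native tag policies do the blocking, but a small audit script helps find resources that predate the policy. A minimal sketch, assuming tags have already been exported as a plain dictionary per resource; the function name and message format are hypothetical.

```python
MANDATORY_TAGS = ("cost-center", "environment", "service", "owner")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "sandbox"}


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means compliant."""
    violations = [
        f"missing tag: {key}"
        for key in MANDATORY_TAGS
        if not tags.get(key, "").strip()  # absent or blank both count as missing
    ]
    env = tags.get("environment", "").strip()
    if env and env not in ALLOWED_ENVIRONMENTS:
        violations.append(f"invalid environment: {env}")
    return violations
```

Running this across an inventory export gives a quick count of spend that is still invisible to cost attribution.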
Automated Cost Controls
Budget Alerts
Set alerts at 50%, 80%, and 100% of monthly budget per cost center. Alert at 50% to catch early trends, not just overruns.
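The threshold check itself is a few lines. The sketch below also adds a naive linear projection, which is one way to read "catch early trends"; both function names are hypothetical, and the projection assumes spend accrues evenly through the month.

```python
ALERT_THRESHOLDS = (0.50, 0.80, 1.00)  # fractions of the monthly budget


def crossed_thresholds(month_to_date_spend: float, monthly_budget: float) -> list[float]:
    """Return every alert threshold the cost center has already reached."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    used = month_to_date_spend / monthly_budget
    return [t for t in ALERT_THRESHOLDS if used >= t]


def projected_month_end_spend(month_to_date_spend: float, day_of_month: int,
                              days_in_month: int) -> float:
    # Naive linear extrapolation: assumes each remaining day costs the daily average so far.
    return month_to_date_spend / day_of_month * days_in_month
```

Hitting the 50% threshold on day 10 of a 30-day month projects to 150% of budget, which is the kind of trend the early alert is meant to surface.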
Anomaly Detection
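Cloud providers offer built-in anomaly detection (AWS Cost Anomaly Detection, anomaly alerts in Azure Cost Management). For more control, a custom check can flag any day whose cost exceeds the rolling 30-day mean by more than two standard deviations: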
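import statistics

class CostAnomalyDetector:
    def get_rolling_30d(self, service: str) -> list[float]:
        # Placeholder: return the last 30 daily costs for this service from your billing data
        raise NotImplementedError

    def check(self, service: str, current_daily_cost: float) -> bool:
        historical = self.get_rolling_30d(service)
        if len(historical) < 2:
            return False  # stdev needs at least two data points
        mean = statistics.mean(historical)
        std = statistics.stdev(historical)
        # Alert if the current day exceeds the mean by more than 2 standard deviations
        threshold = mean + (2 * std)
        return current_daily_cost > threshold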
Scheduled Shutdowns
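Development and staging environments don’t need to run 24/7. Schedule them to shut down at 7 PM and start at 7 AM on weekdays, and keep them off over the weekend. That leaves them running 60 of 168 hours a week, which saves roughly 65% on non-production compute. Nightly shutdowns alone save 50%.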
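The savings from a schedule are easy to check with arithmetic. A small sketch, assuming compute is billed only while running and that the environment runs the same hours on each active day; the function name is hypothetical.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def weekly_savings_fraction(start_hour: int, stop_hour: int, running_days_per_week: int) -> float:
    """Fraction of always-on compute cost avoided by the schedule."""
    hours_on = (stop_hour - start_hour) * running_days_per_week
    return 1 - hours_on / HOURS_PER_WEEK
```

Running 7 AM to 7 PM every day (`weekly_savings_fraction(7, 19, 7)`) saves 50%. Limiting that to five weekdays (`weekly_savings_fraction(7, 19, 5)`) leaves 60 running hours and saves about 64%, so a figure near 65% depends on weekend shutdowns as well as nightly ones.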
Building the FinOps Practice
- Week 1: Enable cost explorer and implement mandatory tagging
- Week 2: Generate first rightsizing report
- Week 4: Purchase first reserved capacity (conservative, 50% of baseline)
- Month 2: Implement storage lifecycle policies and scheduled shutdowns
- Month 3: Build showback reports for each team
- Ongoing: Monthly cost review with engineering leads
The teams that treat cloud cost as an engineering metric — alongside latency and availability — consistently spend 30-50% less than teams that treat it as a finance problem.