Verified by Garnet Grid

Cloud Cost Optimization Playbook

Systematic cloud cost optimization for AWS, Azure, and GCP. Covers rightsizing, reserved capacity, spot instances, storage tiering, and building a FinOps practice.

Cloud costs are the new infrastructure debt. Enterprises are commonly estimated to waste 30-35% of cloud spending on over-provisioned resources, forgotten instances, and misconfigured storage tiers. Unlike on-premises hardware, cloud costs compound silently: a misconfigured auto-scaling policy can burn thousands of dollars before anyone notices.

FinOps isn’t about cutting costs. It’s about maximizing business value per cloud dollar. Sometimes that means spending more on the right resources and less on the wrong ones. The goal is informed spending, not austerity.


The FinOps Maturity Model

Level   Capabilities                                               Typical Waste
Crawl   Basic cost visibility, no optimization                     40-60%
Walk    Rightsizing, reserved instances, basic tagging             20-35%
Run     Automated optimization, showback/chargeback, forecasting   10-15%

Most teams are at “Crawl” and think they’re at “Walk” because they have a cost dashboard.


The Big Three Cost Levers

1. Rightsizing (Save 20-40%)

Rightsizing is usually the single highest-impact optimization. Most instances run at 10-30% average CPU utilization.

Protocol:

  1. Pull 14 days of CPU/memory metrics for all compute instances
  2. Flag instances below 40% average utilization
  3. Recommend one size down (or spot/serverless for < 20%)
  4. Exclude: burst workloads, batch processing, stateful databases

Watch out: Don’t rightsize based on average alone. Check P95 utilization — an instance at 15% average but 95% peak during business hours is correctly sized. Use percentile-based analysis, not averages.
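
The protocol and the P95 caveat can be combined into one filter. The following is a minimal sketch, not a definitive implementation: `InstanceMetrics`, `rightsizing_candidates`, and the 80% P95 ceiling are hypothetical names and values chosen for illustration, and the CPU samples are assumed to have been exported from your monitoring system already.

```python
import math
from dataclasses import dataclass


@dataclass
class InstanceMetrics:
    instance_id: str
    cpu_samples: list  # CPU utilization percentages over the 14-day window


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rightsizing_candidates(fleet, avg_threshold=40.0, p95_threshold=80.0):
    """Return IDs of instances that are low on average AND low at P95.

    avg_threshold follows the 40% rule above; p95_threshold is an assumed
    ceiling and should be tuned per workload.
    """
    candidates = []
    for inst in fleet:
        if not inst.cpu_samples:
            continue  # no data, no recommendation
        avg = sum(inst.cpu_samples) / len(inst.cpu_samples)
        p95 = percentile(inst.cpu_samples, 95)
        if avg < avg_threshold and p95 < p95_threshold:
            candidates.append(inst.instance_id)
    return candidates


fleet = [
    InstanceMetrics("idle", [10.0] * 100),
    # 14% average, but 95% at peak: correctly sized, so not flagged
    InstanceMetrics("peaky", [5.0] * 90 + [95.0] * 10),
]
flagged = rightsizing_candidates(fleet)
```

Excluded workload types (burst, batch, stateful databases) would still need to be filtered out before or after this step, for example by tag.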

2. Reserved Capacity (Save 30-60%)

For steady-state workloads, reserved instances or savings plans reduce costs 30-60% compared to on-demand pricing.

Decision Framework:

Workload Pattern      Recommendation                  Savings
24/7 steady-state     1-year reserved (all upfront)   40-50%
Business hours only   Scheduled reserved + spot       30-40%
Seasonal peaks        On-demand + spot for burst      20-30%
Development/Test      Spot instances                  60-70%

Critical rule: Never reserve more than 70% of your baseline usage. The remaining 30% should be on-demand to absorb growth and changes. Over-committed reservations are worse than no reservations.
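
The 70% cap is simple arithmetic once a baseline is defined. The sketch below assumes that "baseline" means the minimum concurrent instance count observed over the sampling window; that definition, the function name `reservation_plan`, and the sample numbers are illustrative assumptions rather than part of any provider's tooling.

```python
def reservation_plan(hourly_instance_counts, cap_percent=70):
    """Suggest a reservation quantity from hourly running-instance counts.

    Baseline is taken as the minimum count in the window (an assumption);
    the reserved quantity is rounded down so the cap is never exceeded.
    """
    if not hourly_instance_counts:
        raise ValueError("hourly_instance_counts must not be empty")
    baseline = min(hourly_instance_counts)
    reserved = baseline * cap_percent // 100  # integer math avoids float rounding
    return {
        "baseline": baseline,
        "reserved": reserved,
        "baseline_left_on_demand": baseline - reserved,
    }


plan = reservation_plan([10, 12, 15, 10, 20])
```

With a baseline of 10 instances, this reserves 7 and leaves 3 on-demand to absorb growth and changes.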

3. Storage Tiering (Save 40-70%)

Storage costs grow monotonically — data accumulates but rarely gets deleted. Implement lifecycle policies from day one.

Hot (frequent access):     S3 Standard / Azure Hot
Warm (monthly access):     S3 Standard-IA / Azure Cool                 → 40% savings
Cold (quarterly access):   S3 Glacier Instant Retrieval / Azure Cold   → 60% savings
Archive (compliance only): S3 Glacier Deep Archive / Azure Archive     → 90% savings

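Automated lifecycle policy: Move objects to cheaper tiers automatically as they age or go unaccessed. Most cloud providers support this natively: S3 Lifecycle rules transition objects by age, while S3 Intelligent-Tiering and Azure Blob Storage lifecycle management (with last-access-time tracking enabled) can act on access patterns. The rule of thumb: if data hasn’t been accessed in 90 days, it should move to cold storage.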
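
To make the tiering rule concrete, here is a small sketch that maps idle time to a target tier. Only the 90-day cold threshold comes from the rule of thumb above; the 30-day and 365-day cut-offs, the tier labels, and the function name `target_tier` are assumptions for illustration. In practice the provider's native lifecycle rules would do this work, and this logic is mainly useful for auditing what such rules would move.

```python
from datetime import date

# (minimum idle days, tier), checked from coldest to hottest.
# 90 days follows the rule of thumb; 30 and 365 are assumed values.
TIER_RULES = [
    (365, "archive"),
    (90, "cold"),
    (30, "warm"),
    (0, "hot"),
]


def target_tier(last_accessed: date, today: date) -> str:
    """Return the tier an object should live in, given its last access date."""
    idle_days = (today - last_accessed).days
    for min_idle_days, tier in TIER_RULES:
        if idle_days >= min_idle_days:
            return tier
    return "hot"  # last_accessed is in the future; treat as hot


tier_old = target_tier(date(2024, 1, 1), date(2024, 4, 1))     # idle for 91 days
tier_recent = target_tier(date(2024, 3, 25), date(2024, 4, 1))  # idle for 7 days
```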


Tagging Strategy

You can’t optimize what you can’t attribute. Tags are the foundation of cloud cost management.

Mandatory Tags (enforce via policy):

  • cost-center: Who pays for this resource
  • environment: prod / staging / dev / sandbox
  • service: Which application or service owns this
  • owner: Team or individual responsible

Enforcement: Block resource creation without mandatory tags. Every major cloud provider supports tag policies. Resources without proper tags are invisible to cost analysis.
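
Provider tag policies are the right enforcement point, but the same check is easy to run in CI or in an audit script. The sketch below validates a resource's tags against the mandatory list above; `tag_violations` is a hypothetical helper, and tags are assumed to arrive as a plain dictionary of strings.

```python
MANDATORY_TAGS = ("cost-center", "environment", "service", "owner")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "sandbox"}


def tag_violations(tags: dict) -> list:
    """Return a list of human-readable problems; empty means compliant."""
    problems = [
        f"missing tag: {key}"
        for key in MANDATORY_TAGS
        if not tags.get(key, "").strip()
    ]
    env = tags.get("environment", "").strip()
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid environment: {env}")
    return problems


ok = tag_violations(
    {"cost-center": "cc-1", "environment": "prod", "service": "api", "owner": "team-a"}
)
bad = tag_violations({"environment": "qa"})
```

A deployment pipeline could refuse to apply any resource for which the returned list is non-empty.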


Automated Cost Controls

Budget Alerts

Set alerts at 50%, 80%, and 100% of monthly budget per cost center. Alert at 50% to catch early trends, not just overruns.
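
The threshold logic itself is small. A minimal sketch, assuming month-to-date spend and the monthly budget are already available per cost center; `crossed_thresholds` and the `already_sent` bookkeeping are hypothetical and stand in for whatever alerting state your tooling keeps.

```python
ALERT_THRESHOLDS = (50, 80, 100)  # percent of monthly budget


def crossed_thresholds(spend: float, budget: float, already_sent=()):
    """Return thresholds reached by current spend that have not been alerted yet."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = spend / budget * 100
    return [
        t for t in ALERT_THRESHOLDS
        if percent_used >= t and t not in already_sent
    ]


first_run = crossed_thresholds(850, 1000)                      # 85% of budget used
second_run = crossed_thresholds(850, 1000, already_sent=(50,))  # 50% alert already fired
```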

Anomaly Detection

Cloud providers offer built-in anomaly detection (AWS Cost Anomaly Detection, Azure Cost Alerts). Custom implementation for more control:

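import statistics

class CostAnomalyDetector:
    def __init__(self, get_rolling_30d):
        # Callable returning the last 30 daily costs for a service
        self.get_rolling_30d = get_rolling_30d

    def check(self, service: str, current_daily_cost: float) -> bool:
        historical = self.get_rolling_30d(service)
        if len(historical) < 2:
            return False  # stdev needs at least two data points
        mean = statistics.mean(historical)
        std = statistics.stdev(historical)

        # Alert if the current day exceeds the mean by more than 2 standard deviations
        threshold = mean + (2 * std)
        return current_daily_cost > threshold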

Scheduled Shutdowns

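Development and staging environments don’t need to run 24/7. Schedule them to shut down at 7 PM and start at 7 AM on weekdays, and keep them off over weekends. That leaves 60 running hours out of 168 per week, which saves roughly 64% on non-production compute. (A nightly shutdown alone, with weekends still running, saves 50%.)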
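
The savings from a shutdown schedule follow directly from the fraction of the week the environment stays off. The sketch below works through that arithmetic; `shutdown_savings` is a hypothetical helper, and it assumes compute is billed only while running (attached storage and similar charges typically continue).

```python
HOURS_PER_WEEK = 24 * 7  # 168


def shutdown_savings(on_hours_per_day: int, days_per_week: int) -> float:
    """Fraction of always-on compute cost avoided by a running schedule."""
    running_hours = on_hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# 7 AM to 7 PM every day: 84 of 168 hours running
nightly_only = shutdown_savings(12, 7)

# 7 AM to 7 PM on weekdays, off on weekends: 60 of 168 hours running
weekdays_only = shutdown_savings(12, 5)
```

A nightly shutdown alone saves half of the compute bill; reaching the mid-60% range requires keeping environments off over the weekend as well.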


Building the FinOps Practice

  1. Week 1: Enable cost explorer and implement mandatory tagging
  2. Week 2: Generate first rightsizing report
  3. Week 4: Purchase first reserved capacity (conservative, 50% of baseline)
  4. Month 2: Implement storage lifecycle policies and scheduled shutdowns
  5. Month 3: Build showback reports for each team
  6. Ongoing: Monthly cost review with engineering leads

The teams that treat cloud cost as an engineering metric — alongside latency and availability — consistently spend 30-50% less than teams that treat it as a finance problem.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
