
FinOps: Cloud Cost Engineering

Master cloud financial operations. Covers cost allocation, tagging strategies, reservation management, rightsizing, showback/chargeback, forecasting, and building a FinOps practice.

Cloud spending is the fastest-growing line item in most enterprise IT budgets, and the least understood. FinOps (a portmanteau of Finance and DevOps) is the discipline of bringing financial accountability to cloud spending. It’s not about cutting costs; it’s about making informed spending decisions that maximize value. The difference between a well-run FinOps practice and an ad-hoc approach is typically 20-35% of total cloud spend.

This guide covers the engineering and organizational practices needed to build a mature FinOps capability: from tagging strategies and cost allocation to reservation optimization, rightsizing, and FinOps tooling.


The FinOps Framework

┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│    INFORM    │────▶│    OPTIMIZE    │────▶│   OPERATE    │
│              │     │                │     │              │
│ • Visibility │     │ • Rightsizing  │     │ • Governance │
│ • Allocation │     │ • Reservations │     │ • Automation │
│ • Reporting  │     │ • Waste elim.  │     │ • Alerts     │
└──────────────┘     └────────────────┘     └──────────────┘
        ▲                                          │
        └──────────────────────────────────────────┘
                     Continuous cycle

Cost Allocation & Tagging

Tagging Strategy

Tags are the foundation of cost visibility. Without consistent tagging, you can’t answer “who is spending what, and why?”

# Required tags (enforce via policy)
required_tags:
  - key: "team"
    description: "Owning team"
    values: ["platform", "data", "frontend", "ml", "security"]
    enforcement: "deny_creation_without"
    
  - key: "environment"
    values: ["production", "staging", "development", "sandbox"]
    enforcement: "deny_creation_without"
    
  - key: "service"
    description: "Application or microservice name"
    enforcement: "deny_creation_without"
    
  - key: "cost-center"
    description: "Finance cost center code"
    enforcement: "warn_if_missing"

# Recommended tags (encouraged)
recommended_tags:
  - key: "project"
    description: "Project or initiative name"
  - key: "data-classification"
    values: ["public", "internal", "confidential", "restricted"]
  - key: "managed-by"
    values: ["terraform", "pulumi", "manual", "cdk"]

Tag Compliance Enforcement

# Evaluation logic for a custom AWS Config rule that flags resources missing required tags
def evaluate_compliance(configuration_item):
    required_tags = ["team", "environment", "service", "cost-center"]
    resource_tags = configuration_item.get("tags", {})
    
    missing = [tag for tag in required_tags if tag not in resource_tags]
    
    if missing:
        return {
            "compliance_type": "NON_COMPLIANT",
            "annotation": f"Missing required tags: {', '.join(missing)}",
        }
    return {"compliance_type": "COMPLIANT"}
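The rule above only checks that tag keys are present. The tagging strategy also restricts some tags to a fixed set of values, so a value check is a natural extension. The following is a minimal sketch under assumptions: `TAG_POLICY` and `find_tag_violations` are hypothetical names, and the policy is hard-coded here rather than loaded from the YAML definition.

```python
from typing import Dict, List, Optional, Set

# Hypothetical in-memory copy of the required-tag policy.
# None means "any non-empty value is accepted".
TAG_POLICY: Dict[str, Optional[Set[str]]] = {
    "team": {"platform", "data", "frontend", "ml", "security"},
    "environment": {"production", "staging", "development", "sandbox"},
    "service": None,
    "cost-center": None,
}


def find_tag_violations(resource_tags: Dict[str, str]) -> List[str]:
    """Return human-readable violations for one resource's tags."""
    violations = []
    for key, allowed in TAG_POLICY.items():
        value = resource_tags.get(key, "").strip()
        if not value:
            # Missing keys and empty values are treated the same way
            violations.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            violations.append(f"invalid value for {key}: {value!r}")
    return violations


if __name__ == "__main__":
    tags = {"team": "sales", "environment": "production", "service": "api"}
    for problem in find_tag_violations(tags):
        print(problem)
```

Returning a list of messages rather than a boolean makes it straightforward to join them into the compliance annotation.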

Rightsizing

Instance Rightsizing Analysis

def analyze_instance_rightsizing(cloudwatch_metrics, instance_type, days=14):
    """Identify over- and under-provisioned instances from CloudWatch statistics.

    Memory metrics are only present when the CloudWatch agent publishes them.
    get_smaller_instance and get_larger_instance are assumed to be defined elsewhere.
    """
    cpu_avg = cloudwatch_metrics["CPUUtilization"]["average"]
    cpu_max = cloudwatch_metrics["CPUUtilization"]["maximum"]
    mem_avg = cloudwatch_metrics.get("MemoryUtilization", {}).get("average")

    recommendation = None

    # Check the idle case first; otherwise the broader downsize test would shadow it.
    if cpu_max < 10 and days > 7:
        recommendation = {
            "action": "terminate_or_schedule",
            "reason": f"Consistently idle (max CPU: {cpu_max}%)",
            "estimated_savings_pct": 100,
        }
    elif cpu_max < 40 and (mem_avg is None or mem_avg < 40):
        recommendation = {
            "action": "downsize",
            "reason": f"Max CPU: {cpu_max}%, Avg: {cpu_avg}%",
            "suggested": get_smaller_instance(instance_type),
            "estimated_savings_pct": 40,
        }
    elif cpu_avg > 80:
        recommendation = {
            "action": "upsize_or_scale",
            "reason": f"Avg CPU: {cpu_avg}%, risk of performance issues",
            "suggested": get_larger_instance(instance_type),
        }

    return recommendation

Rightsizing Priority Matrix

| CPU Avg | Memory Avg | Recommendation | Savings Potential |
| --- | --- | --- | --- |
| < 10% | < 20% | Terminate or schedule | 100% |
| < 30% | < 30% | Downsize 2 tiers | 50-60% |
| < 50% | < 50% | Downsize 1 tier | 25-40% |
| 50-80% | 50-80% | Correct size | 0% |
| > 80% | > 80% | Upsize or auto-scale | -20% (but prevents outages) |
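The matrix can be applied mechanically by checking rows from top to bottom. The sketch below is one possible reading, not a definitive rule set: `classify_by_matrix` is a hypothetical name, and it assumes that a downsize row applies only when both CPU and memory are under the row's threshold, and that either dimension above 80% is enough to recommend upsizing.

```python
def classify_by_matrix(cpu_avg: float, mem_avg: float) -> dict:
    """Map average CPU and memory utilization (percent) to a matrix row."""
    # The busier of the two dimensions decides how far it is safe to downsize
    busiest = max(cpu_avg, mem_avg)

    if cpu_avg < 10 and mem_avg < 20:
        return {"recommendation": "terminate_or_schedule", "savings": "100%"}
    if busiest < 30:
        return {"recommendation": "downsize_2_tiers", "savings": "50-60%"}
    if busiest < 50:
        return {"recommendation": "downsize_1_tier", "savings": "25-40%"}
    if busiest <= 80:
        return {"recommendation": "correct_size", "savings": "0%"}
    # Assumption: one saturated dimension is enough to justify more capacity
    return {"recommendation": "upsize_or_autoscale", "savings": "-20%"}


if __name__ == "__main__":
    print(classify_by_matrix(cpu_avg=25, mem_avg=45))
```

Using the busier dimension avoids downsizing a memory-bound instance just because its CPU is quiet.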

Reservation & Savings Plans

Commitment Strategy

| Commitment Type | Discount | Flexibility | Best For |
| --- | --- | --- | --- |
| On-Demand | 0% | Full | Unpredictable workloads, testing |
| Savings Plans (1yr) | 20-30% | High (any instance family) | Baseline compute |
| Savings Plans (3yr) | 35-50% | High | Long-term stable workloads |
| Reserved Instances (1yr) | 30-40% | Low (specific instance) | Databases, known workloads |
| Reserved Instances (3yr) | 50-60% | Low | Production databases |
| Spot Instances | 60-90% | None (can be interrupted) | Batch processing, CI/CD, ML training |
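To make the trade-off in the table concrete, the sketch below compares paying on-demand for everything with committing only the stable baseline. It also computes the break-even utilization of a commitment: because a commitment is paid whether or not it is used, a discount of d only pays off when more than 1 - d of the committed capacity is actually consumed. The function names and the 70/30 split are illustrative assumptions, not figures from a real bill.

```python
def blended_cost(baseline_hourly: float, burst_hourly: float, discount: float) -> dict:
    """Compare all-on-demand spend with committing only the baseline.

    baseline_hourly and burst_hourly are on-demand-priced dollars per hour;
    discount is the commitment discount as a fraction (0.25 means 25%).
    """
    all_on_demand = baseline_hourly + burst_hourly
    with_commitment = baseline_hourly * (1 - discount) + burst_hourly
    saved = all_on_demand - with_commitment
    savings_pct = saved / all_on_demand * 100 if all_on_demand else 0.0
    return {
        "all_on_demand": all_on_demand,
        "with_commitment": with_commitment,
        "savings_pct": savings_pct,
    }


def break_even_utilization(discount: float) -> float:
    """Fraction of a commitment that must be used for it to beat on-demand."""
    return 1 - discount


if __name__ == "__main__":
    # Hypothetical workload: $70/hr stable baseline, $30/hr bursty, 25% discount
    print(blended_cost(70.0, 30.0, 0.25))
    print(break_even_utilization(0.25))
```

The overall saving is smaller than the headline discount because only the committed portion is discounted.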

Coverage Analysis

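def calculate_commitment_coverage(total_spend, commitments):
    """Analyze how well commitments cover actual spending.

    Assumes total_spend["on_demand_hourly"] is the hourly cost of all eligible
    usage priced at on-demand rates, and that each commitment's
    "hourly_commitment" is expressed on the same basis.
    """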
    
    committed_coverage = sum(c["hourly_commitment"] for c in commitments)
    actual_on_demand = total_spend["on_demand_hourly"]
    
    coverage_pct = min(committed_coverage / actual_on_demand * 100, 100) if actual_on_demand > 0 else 0
    
    waste = max(0, committed_coverage - actual_on_demand)
    waste_pct = waste / committed_coverage * 100 if committed_coverage > 0 else 0
    
    return {
        "coverage_pct": round(coverage_pct, 1),
        "waste_pct": round(waste_pct, 1),
        "target_coverage": 70,  # Industry best practice: 60-80%
        "recommendation": (
            "Increase commitments" if coverage_pct < 60
            else "Good coverage" if coverage_pct < 80
            else "Risk of over-commitment" if waste_pct > 10
            else "Optimal"
        ),
    }

Waste Elimination

Common Waste Categories

| Waste Type | Detection | Typical Savings |
| --- | --- | --- |
| Unattached EBS volumes | No EC2 attachment | $50-500/month per volume |
| Idle load balancers | < 100 requests/day | $20-200/month each |
| Unused elastic IPs | Not attached to running instance | $4/month each |
| Over-provisioned RDS | CPU < 20% consistently | 40-60% per instance |
| Orphaned snapshots | No associated volume/AMI | $10-100/month each |
| Idle NAT gateways | < 1GB data processed/month | $32+/month each |
| Dev/staging running 24/7 | Running outside business hours | 65% (run 8hrs vs 24) |
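The scheduling figure in the last row follows from simple arithmetic on running hours. The sketch below shows the calculation; `schedule_savings_pct` is a hypothetical helper, and it assumes cost scales linearly with running hours, which holds for compute billed by the hour or second but not for attached storage that persists while instances are stopped.

```python
HOURS_PER_WEEK = 24 * 7


def schedule_savings_pct(hours_per_day: float, days_per_week: int = 7) -> float:
    """Percent of always-on compute cost saved by running only on a schedule."""
    running_hours = hours_per_day * days_per_week
    return round((1 - running_hours / HOURS_PER_WEEK) * 100, 1)


if __name__ == "__main__":
    print(schedule_savings_pct(8))      # 8 hours every day
    print(schedule_savings_pct(8, 5))   # 8 hours on weekdays only
```

Running 8 hours every day saves roughly two thirds of the always-on cost, in line with the table. Restricting the schedule to weekdays saves about three quarters.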

Automated Waste Detection

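from datetime import datetime, timezone


def scan_for_waste(aws_session):
    """Scan one region for unattached EBS volumes and idle load balancers."""
    findings = []
    now = datetime.now(timezone.utc)

    # Unattached EBS volumes ("available" means not attached to any instance).
    # This reads only the first page of results; use a paginator for large accounts.
    ec2 = aws_session.client('ec2')
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    for vol in volumes["Volumes"]:
        # Rough estimate at $0.10/GB-month (about the gp2 rate; gp3 is closer
        # to $0.08). Rates vary by region and volume type.
        monthly_cost = vol["Size"] * 0.10
        findings.append({
            "type": "unattached_ebs",
            "resource_id": vol["VolumeId"],
            "monthly_cost": monthly_cost,
            # boto3 returns CreateTime as a timezone-aware datetime
            "age_days": (now - vol["CreateTime"]).days,
            "action": "delete" if monthly_cost > 5 else "review",
        })

    # Idle load balancers. get_alb_requests is assumed to be defined elsewhere
    # and to return the total request count over the given number of days.
    elbv2 = aws_session.client('elbv2')
    albs = elbv2.describe_load_balancers()
    for alb in albs["LoadBalancers"]:
        request_count = get_alb_requests(alb["LoadBalancerArn"], days=7)
        if request_count < 700:  # < 100/day
            findings.append({
                "type": "idle_alb",
                "resource_id": alb["LoadBalancerName"],
                "monthly_cost": 22.50,  # Approximate base ALB cost
                "weekly_requests": request_count,
                "action": "delete_or_consolidate",
            })

    total_waste = sum(f["monthly_cost"] for f in findings)
    return {"findings": findings, "total_monthly_waste": total_waste}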

Showback & Chargeback

| Model | How It Works | Best For |
| --- | --- | --- |
| Showback | Show teams their costs, no billing | Building cost awareness |
| Chargeback | Charge teams from their budget | Mature orgs with defined budgets |
| Hybrid | Showback for shared services, chargeback for dedicated | Most enterprises |
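In the hybrid model, each team is charged for its dedicated resources and shown a share of shared services. How the shared portion is split is a policy decision. The sketch below assumes a proportional split by direct spend, which is only one common option; `allocate_costs` and the team figures are hypothetical.

```python
from typing import Dict


def allocate_costs(direct_spend: Dict[str, float], shared_total: float) -> Dict[str, dict]:
    """Split shared_total across teams in proportion to their direct spend."""
    if not direct_spend:
        return {}
    total_direct = sum(direct_spend.values())
    allocation = {}
    for team, amount in direct_spend.items():
        if total_direct > 0:
            share = shared_total * amount / total_direct
        else:
            # No direct spend anywhere: fall back to an even split
            share = shared_total / len(direct_spend)
        allocation[team] = {
            "chargeback": amount,               # billed to the team's budget
            "showback_shared": round(share, 2)  # reported for awareness only
        }
    return allocation


if __name__ == "__main__":
    teams = {"platform": 6000.0, "data": 3000.0, "ml": 1000.0}
    for team, costs in allocate_costs(teams, shared_total=2000.0).items():
        print(team, costs)
```

Other split keys, such as request volume or headcount, fit the same structure by replacing the weights.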

Monthly Cost Report Template

Team: Platform Engineering
Period: March 2025
Budget: $45,000

┌─────────────────┬──────────┬────────────┬──────┐
│ Service         │ Spend    │ % of Total │ Δ    │
├─────────────────┼──────────┼────────────┼──────┤
│ EC2             │ $18,200  │ 45%        │ +3%  │
│ RDS             │ $8,400   │ 21%        │ -1%  │
│ S3              │ $3,100   │ 8%         │ +5%  │
│ Lambda          │ $2,800   │ 7%         │ +12% │
│ CloudWatch      │ $1,900   │ 5%         │ 0%   │
│ Other           │ $5,600   │ 14%        │ -2%  │
├─────────────────┼──────────┼────────────┼──────┤
│ TOTAL           │ $40,000  │ 100%       │ +2%  │
│ vs Budget       │ -$5,000  │ 89%        │      │
└─────────────────┴──────────┴────────────┴──────┘

Recommendations:
1. Lambda spend up 12% — review new function deployments
2. S3 lifecycle policies could save ~$800/mo
3. 3 idle EC2 instances identified ($450/mo waste)
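The totals and percentages in a report like this can be derived from per-service spend. The sketch below reproduces the sample figures; `build_cost_report` is a hypothetical helper, and it leaves out the month-over-month change column because that needs the previous period's data.

```python
from typing import Dict


def build_cost_report(spend_by_service: Dict[str, float], budget: float) -> dict:
    """Compute each service's share of total spend and the budget variance."""
    total = sum(spend_by_service.values())
    rows = [
        {
            "service": service,
            "spend": amount,
            "pct_of_total": round(amount / total * 100, 1) if total else 0.0,
        }
        for service, amount in spend_by_service.items()
    ]
    return {
        "rows": rows,
        "total": total,
        "vs_budget": total - budget,  # negative means under budget
        "budget_used_pct": round(total / budget * 100, 1) if budget else None,
    }


# Per-service figures taken from the sample report
report = build_cost_report(
    {"EC2": 18200, "RDS": 8400, "S3": 3100, "Lambda": 2800,
     "CloudWatch": 1900, "Other": 5600},
    budget=45000,
)

if __name__ == "__main__":
    print(report["total"], report["vs_budget"], report["budget_used_pct"])
```

The report keeps one decimal place, so shares may differ slightly from the whole-number percentages in the template.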

Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| No tagging strategy | Can’t allocate costs to teams | Enforce tags at resource creation |
| 100% on-demand | Paying list price for predictable workloads | 60-80% commitment coverage for stable baseline |
| Annual cost review | Problems discovered 11 months too late | Weekly automated reports, daily anomaly alerts |
| Centralized cost ownership | Platform team blamed for all spending | Showback/chargeback makes teams accountable |
| Over-committing reservations | Locked into resources you no longer need | Start with Savings Plans (flexible), only use RIs for databases |
| Ignoring data transfer | Egress costs surprise at scale | Monitor data transfer costs, use VPC endpoints |
Checklist

  • Tagging strategy defined and enforced via policy
  • Tag compliance > 95% across all resources
  • Cost allocation configured by team, service, and environment
  • Rightsizing analysis automated (weekly scan)
  • Commitment coverage at 60-80% for stable workloads
  • Waste detection running weekly with automated cleanup
  • Showback/chargeback reports distributed monthly
  • Anomaly alerting: daily spend > 120% of trailing average
  • Forecasting: 3-month cost projections updated monthly
  • Dev/staging environments scheduled (off outside business hours)
  • Data transfer costs monitored and optimized
  • FinOps review meeting held monthly with engineering leads
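
The anomaly alerting item in the checklist (daily spend above 120% of the trailing average) can be sketched as a rolling comparison. This is a minimal illustration, not a production detector: `find_spend_anomalies` is a hypothetical name, and the 7-day window is an assumption because the checklist does not specify one.

```python
from statistics import mean
from typing import List, Tuple


def find_spend_anomalies(
    daily_spend: List[float], window: int = 7, threshold: float = 1.2
) -> List[Tuple[int, float, float]]:
    """Return (day_index, spend, trailing_avg) for days above threshold x trailing mean."""
    anomalies = []
    # The first `window` days have no full trailing history, so they are skipped
    for i in range(window, len(daily_spend)):
        trailing_avg = mean(daily_spend[i - window:i])
        if trailing_avg > 0 and daily_spend[i] > threshold * trailing_avg:
            anomalies.append((i, daily_spend[i], round(trailing_avg, 2)))
    return anomalies


if __name__ == "__main__":
    spend = [100, 100, 100, 100, 100, 100, 100, 110, 130, 100]
    print(find_spend_anomalies(spend))
```

A simple trailing mean will flag predictable weekly peaks; comparing against the same weekday in prior weeks is one way to reduce that noise.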

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For FinOps consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
