Cloud Cost Anomaly Detection
Detect and respond to unexpected cloud spending spikes automatically. Covers anomaly detection algorithms, budget alerts, spend forecasting, root cause analysis, and the patterns that catch cloud cost explosions before they drain the budget.
A single misconfigured auto-scaling policy, a forgotten GPU instance, or a runaway Lambda function can generate thousands of dollars in unexpected charges overnight. Cost anomaly detection catches these spikes early — before they become line items on a painful invoice.
Anomaly Detection Approaches
    Threshold-Based (Simple):
      Rule: Alert if daily spend exceeds $500
      Problem: $500 is normal on Tuesday (big batch job) but abnormal on Sunday

      Rule: Alert if spend exceeds 120% of the previous day
      Problem: Ignores seasonal patterns (month-end processing)

    Statistical (Better):
      Method: Moving average + standard deviation
      Alert: Spend deviates more than 2σ from the 30-day moving average
      Handles: Daily fluctuation, but not seasonal patterns

    ML-Based (Best):
      Method: Time-series forecasting (Prophet, ARIMA)
      Alert: Actual spend falls outside the 95% confidence interval of the forecast
      Handles: Daily patterns, weekly patterns, seasonal patterns
      Example: AWS Cost Anomaly Detection is a managed service that takes a machine-learning approach
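To make the limits of the first two approaches concrete, here is a minimal sketch that runs a fixed $500 threshold and a 30-day moving-average 2σ rule over the same series. The data is synthetic and the function names are illustrative assumptions: six weeks of $300 days, a routine $550 batch job every Tuesday, and one unusual $480 Sunday.

```python
from datetime import date, timedelta
from statistics import mean, pstdev


def static_threshold_flags(daily_costs, limit=500.0):
    """Flag every day whose spend exceeds a fixed dollar limit."""
    return [d["date"] for d in daily_costs if d["cost"] > limit]


def rolling_sigma_flags(daily_costs, lookback_days=30, sensitivity=2.0):
    """Flag days more than `sensitivity` standard deviations from the trailing mean."""
    flagged = []
    for i in range(lookback_days, len(daily_costs)):
        # Trailing window, excluding the day being scored
        history = [d["cost"] for d in daily_costs[i - lookback_days:i]]
        mu, sigma = mean(history), pstdev(history)
        if sigma > 0 and abs(daily_costs[i]["cost"] - mu) / sigma > sensitivity:
            flagged.append(daily_costs[i]["date"])
    return flagged


# Synthetic data (an assumption for illustration): 42 days from Monday 2024-01-01
start = date(2024, 1, 1)
daily_costs = []
for offset in range(42):
    day = start + timedelta(days=offset)
    cost = 550.0 if day.weekday() == 1 else 300.0  # weekday() == 1 is Tuesday
    daily_costs.append({"date": day, "cost": cost})
daily_costs[-1]["cost"] = 480.0  # Sunday 2024-02-11: 60% above a normal Sunday

static_flags = static_threshold_flags(daily_costs)
rolling_flags = rolling_sigma_flags(daily_costs)

print("static threshold:", static_flags)
print("rolling 2-sigma: ", rolling_flags)
```

On this series the fixed limit fires on all six routine Tuesdays, the moving-average rule still fires on the one Tuesday it evaluates, and neither flags the Sunday spike. Both rules compare a day against all other days rather than against comparable days, which is the gap that day-of-week baselines and forecasting models are meant to close.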
Implementation
    import numpy as np
    from datetime import timedelta


    class CostAnomalyDetector:
        def __init__(self, lookback_days=30, sensitivity=2.0):
            self.lookback_days = lookback_days
            self.sensitivity = sensitivity  # Number of standard deviations

        def detect_anomalies(self, daily_costs: list[dict]) -> list[dict]:
            """Detect cost anomalies using statistical methods.

            Expects one dict per day, sorted by date, with a "date" (date or
            datetime) and a numeric "cost".
            """
            anomalies = []
            for i in range(self.lookback_days, len(daily_costs)):
                # Window of recent history, excluding the day being scored
                window = daily_costs[i - self.lookback_days:i]
                current = daily_costs[i]

                # Calculate statistics by day-of-week for seasonality
                dow = current["date"].weekday()
                same_dow_costs = [d["cost"] for d in window if d["date"].weekday() == dow]
                if len(same_dow_costs) < 3:
                    continue  # Too little history for this weekday

                mean = float(np.mean(same_dow_costs))
                # Sample standard deviation: a 30-day window holds only 4-5 same-weekday points
                std = float(np.std(same_dow_costs, ddof=1))
                if std == 0:
                    continue  # Flat history: the z-score is undefined

                z_score = (current["cost"] - mean) / std
                if abs(z_score) > self.sensitivity:
                    deviation = current["cost"] - mean
                    anomalies.append({
                        "date": current["date"],
                        "actual_cost": current["cost"],
                        "expected_cost": mean,
                        # Guard against a zero baseline
                        "deviation_pct": (deviation / mean * 100) if mean else float("inf"),
                        "z_score": z_score,
                        "severity": self.classify_severity(z_score),
                    })
            return anomalies

        def classify_severity(self, z_score):
            abs_z = abs(z_score)
            if abs_z > 4:
                return "CRITICAL"
            if abs_z > 3:
                return "HIGH"
            if abs_z > 2:
                return "MEDIUM"
            return "LOW"

        def root_cause_analysis(self, anomaly_date, cost_data):
            """Identify which service/resource caused the anomaly."""
            # cost_data must expose get_by_service(date) -> {service: cost}
            today = cost_data.get_by_service(anomaly_date)
            yesterday = cost_data.get_by_service(anomaly_date - timedelta(days=1))

            deltas = []
            for service in today:
                prev_cost = yesterday.get(service, 0)
                curr_cost = today[service]
                delta = curr_cost - prev_cost
                if delta > 0:
                    deltas.append({
                        "service": service,
                        "previous_cost": prev_cost,
                        "current_cost": curr_cost,
                        "increase": delta,
                        # A service with no spend the day before has no finite percentage
                        "increase_pct": (delta / prev_cost * 100) if prev_cost > 0 else float("inf"),
                    })

            # Top five contributors by absolute increase
            return sorted(deltas, key=lambda x: x["increase"], reverse=True)[:5]
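The detector above leaves `cost_data` and the source of `daily_costs` undefined. One possible way to supply them, sketched under the assumption that the spend lives in AWS and is read through the Cost Explorer `GetCostAndUsage` API grouped by service, is shown below. `parse_cost_explorer`, `InMemoryCostData`, and `fetch_by_service` are hypothetical names introduced for illustration, and the service names in the sample are illustrative data rather than real billing output.

```python
from datetime import date


def parse_cost_explorer(response: dict) -> dict[date, dict[str, float]]:
    """Convert a GetCostAndUsage response grouped by SERVICE into {day: {service: cost}}."""
    by_day = {}
    for result in response["ResultsByTime"]:
        day = date.fromisoformat(result["TimePeriod"]["Start"])
        by_day[day] = {
            group["Keys"][0]: float(group["Metrics"]["UnblendedCost"]["Amount"])
            for group in result["Groups"]
        }
    return by_day


class InMemoryCostData:
    """Hypothetical adapter exposing the get_by_service() method the detector calls."""

    def __init__(self, by_day: dict[date, dict[str, float]]):
        self._by_day = by_day

    def get_by_service(self, day: date) -> dict[str, float]:
        # Days with no data yield an empty breakdown rather than raising
        return self._by_day.get(day, {})

    def daily_totals(self) -> list[dict]:
        """Build the sorted [{"date", "cost"}] list that detect_anomalies() expects."""
        return [
            {"date": day, "cost": sum(self._by_day[day].values())}
            for day in sorted(self._by_day)
        ]


def fetch_by_service(start: date, end: date) -> dict[date, dict[str, float]]:
    """Query Cost Explorer for daily cost per service (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parsing code runs without boto3 installed

    client = boto3.client("ce")
    # The End date is exclusive. Pagination via NextPageToken is omitted in this sketch.
    response = client.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return parse_cost_explorer(response)


# Hand-written sample in the shape of a GetCostAndUsage response (illustrative values)
sample_response = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2024-02-10", "End": "2024-02-11"},
            "Groups": [
                {"Keys": ["AWS Lambda"], "Metrics": {"UnblendedCost": {"Amount": "30.0", "Unit": "USD"}}},
                {"Keys": ["Amazon Simple Storage Service"], "Metrics": {"UnblendedCost": {"Amount": "120.5", "Unit": "USD"}}},
            ],
        },
        {
            "TimePeriod": {"Start": "2024-02-11", "End": "2024-02-12"},
            "Groups": [
                {"Keys": ["AWS Lambda"], "Metrics": {"UnblendedCost": {"Amount": "410.0", "Unit": "USD"}}},
            ],
        },
    ]
}

cost_data = InMemoryCostData(parse_cost_explorer(sample_response))
daily_costs = cost_data.daily_totals()
```

With real data, `daily_costs` would be passed to `detect_anomalies` and `cost_data` to `root_cause_analysis`. Other providers and billing exports would need their own parser in place of `parse_cost_explorer`.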
Alert Configuration
    # Multi-tier cost alerting
    alerts:
      budget_threshold:
        - name: "Monthly budget 50% warning"
          threshold: 50%
          channel: slack
          message: "📊 Monthly spend has reached 50% of budget"

        - name: "Monthly budget 80% warning"
          threshold: 80%
          channel: [slack, email]
          message: "⚠️ Monthly spend at 80% of budget"

        - name: "Monthly budget exceeded"
          threshold: 100%
          channel: [slack, email, pagerduty]
          message: "🚨 Monthly budget EXCEEDED"

      anomaly:
        - name: "Daily cost spike"
          condition: "z_score > 2.5"
          channel: [slack, email]
          include: "root_cause_analysis"

        - name: "Critical cost explosion"
          condition: "z_score > 4.0"
          channel: [slack, pagerduty]
          action: "auto_investigate"
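The configuration above does not say how overlapping tiers resolve. A minimal sketch of one reasonable policy follows: only the most severe matching tier fires, so a z-score of 4.5 pages once instead of also posting the lower-tier alert. The tier tables mirror the YAML by hand, the function names are hypothetical, and treating every budget tier as "at or above the threshold" is an assumption.

```python
# Tiers copied from the YAML config, ordered from least to most severe
BUDGET_TIERS = [
    (50, ["slack"]),
    (80, ["slack", "email"]),
    (100, ["slack", "email", "pagerduty"]),
]

ANOMALY_TIERS = [
    (2.5, "Daily cost spike", ["slack", "email"]),
    (4.0, "Critical cost explosion", ["slack", "pagerduty"]),
]


def budget_channels(month_to_date: float, monthly_budget: float) -> list[str]:
    """Return the channels for the highest budget tier reached, or [] if none."""
    pct_used = month_to_date / monthly_budget * 100
    reached = [channels for threshold, channels in BUDGET_TIERS if pct_used >= threshold]
    return reached[-1] if reached else []


def anomaly_rule(z_score: float):
    """Return (name, channels) for the most severe anomaly tier matched, or None."""
    matched = [
        (name, channels)
        for threshold, name, channels in ANOMALY_TIERS
        if z_score > threshold
    ]
    return matched[-1] if matched else None


print(budget_channels(850, 1000))
print(anomaly_rule(3.1))
print(anomaly_rule(4.5))
```

Firing only the top tier is one way to apply the "severity tiers" fix for alert fatigue. A setup that wants a full audit trail could instead send every matched tier.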
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Monthly bill review only | 30 days of undetected anomalies | Daily anomaly detection |
| Static thresholds only | Miss context-dependent anomalies | Statistical or ML-based detection |
| Alert without root cause | "Costs are high," but why? | Automated root cause analysis |
| No per-service breakdown | Cannot identify the offending resource | Cost breakdown by service, account, tag |
| Alert fatigue | Critical alerts ignored | Severity tiers, actionable alerts only |
Cost anomaly detection is insurance against surprise bills. Spending on the order of $50/month on detection can prevent a $50,000 bill from a misconfigured resource that runs unnoticed for weeks.