Cloud Cost Anomaly Detection
Detect and respond to unexpected cloud cost spikes before they become billing surprises. Covers anomaly detection algorithms, alerting thresholds, root cause analysis workflows, and building automated cost guardrails.
A misconfigured auto-scaler. A developer spinning up GPU instances for testing and forgetting to tear them down. A log pipeline that starts ingesting 10x normal volume. These are not hypothetical scenarios — they are the cost anomalies that hit every cloud-native organization. The question is whether you detect them in minutes or discover them on next month’s invoice.
Detection Approaches
Threshold-Based
alerts:
  - name: daily_spend_spike
    condition: daily_spend > (30_day_average * 1.5)
    severity: warning
  - name: service_cost_spike
    condition: service_daily_spend > (service_7_day_average * 2.0)
    severity: critical
  - name: new_expensive_resource
    condition: new_resource_hourly_cost > 10
    severity: warning
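The first two rules above compare today's spend with a multiple of a trailing average. A minimal sketch of that check, assuming daily spend is already available as a list of numbers ordered oldest first; the function name `spend_spike` and its parameters are illustrative and not part of any alerting product:

```python
from statistics import mean

def spend_spike(history, today, multiplier, window):
    """Return True when today's spend exceeds multiplier x the trailing average.

    history: daily spend for previous days, oldest first (excludes today).
    window: number of trailing days to average (30 and 7 in the rules above).
    """
    recent = history[-window:]
    if not recent:
        return False  # no baseline yet, so nothing to compare against
    return today > mean(recent) * multiplier

# daily_spend_spike: warn when spend passes 1.5x the 30-day average
history = [100.0] * 30
print(spend_spike(history, today=140.0, multiplier=1.5, window=30))  # False
print(spend_spike(history, today=160.0, multiplier=1.5, window=30))  # True
```

Fixed multipliers are easy to reason about, but they fire too often on noisy services and too rarely on stable ones, which is the gap the statistical approach tries to close.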
Statistical (Z-Score)
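import numpy as np

def detect_anomaly(daily_costs, threshold=2.5):
    """Flag the latest day's cost if it sits far outside the trailing 30-day baseline."""
    if len(daily_costs) < 2:
        raise ValueError("need at least one baseline day plus today")
    # Baseline is the 30 days before today; including today would dampen its own z-score.
    baseline = np.asarray(daily_costs[-31:-1], dtype=float)
    today = float(daily_costs[-1])
    mean = baseline.mean()
    std = baseline.std()
    # A perfectly flat baseline has std == 0; z is reported as 0 rather than dividing by zero.
    z_score = (today - mean) / std if std > 0 else 0.0
    return {
        'is_anomaly': abs(z_score) > threshold,
        'z_score': z_score,
        'expected_range': (mean - threshold * std, mean + threshold * std),
        'actual': today,
        # Guard against a zero mean, e.g. a new service with no prior spend.
        'deviation_pct': ((today - mean) / mean) * 100 if mean != 0 else None,
    }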
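To make the z-score arithmetic concrete, here is a small worked example using only the standard library. The spend figures are invented for illustration, and the baseline deliberately excludes the day being tested.

```python
from statistics import mean, pstdev

# Hypothetical daily spend in dollars: a 30-day baseline alternating
# between 95 and 105, followed by a single day at 130.
baseline = [95.0, 105.0] * 15
today = 130.0

mu = mean(baseline)       # 100.0
sigma = pstdev(baseline)  # 5.0 (population std, the same default as np.std)
z_score = (today - mu) / sigma

print(f"mean={mu}, std={sigma}, z={z_score}")  # mean=100.0, std=5.0, z=6.0
print("anomaly" if abs(z_score) > 2.5 else "normal")  # anomaly
```

A 30% jump is six standard deviations away from this stable baseline. The same $130 day against a noisier baseline with a standard deviation of $20 would score z = 1.5 and pass unflagged, which is why the z-score adapts better than a fixed multiplier.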
Machine Learning
Cloud providers offer built-in anomaly detection:
AWS Cost Anomaly Detection:
- Monitors AWS services, accounts, tags, cost categories
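- ML-based: learns a baseline from historical spend instead of relying on hand-set detection rules (you still define monitors and alert subscriptions)
- Alerts via email or Amazon SNS; Slack is reached through an SNS integration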
GCP Budgets with Anomaly Detection:
- Forecasting-based alerts
- Custom threshold percentages
Azure Cost Management Anomaly Alerts:
- ML-based anomaly detection
- Integration with Action Groups
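Detected anomalies can also be pulled programmatically. The sketch below reads them from AWS Cost Anomaly Detection through the Cost Explorer `get_anomalies` API. It assumes boto3 is installed, credentials with the `ce:GetAnomalies` permission are configured, and at least one anomaly monitor already exists; the helper name `recent_anomalies` and the $100 default are illustrative choices.

```python
from datetime import date, timedelta

def recent_anomalies(ce_client, days=7, min_impact=100.0):
    """List recent anomalies reported by AWS Cost Anomaly Detection.

    ce_client: a boto3 Cost Explorer client, e.g. boto3.client("ce").
    min_impact: keep anomalies whose total impact is at least this many dollars.
    """
    start = date.today() - timedelta(days=days)
    params = {
        "DateInterval": {"StartDate": start.isoformat()},
        "TotalImpact": {
            "NumericOperator": "GREATER_THAN_OR_EQUAL",
            "StartValue": min_impact,
        },
    }
    results = []
    while True:
        page = ce_client.get_anomalies(**params)
        for anomaly in page.get("Anomalies", []):
            # RootCauses may be empty while AWS is still analysing the anomaly.
            causes = anomaly.get("RootCauses") or [{}]
            results.append({
                "id": anomaly["AnomalyId"],
                "total_impact": anomaly["Impact"].get("TotalImpact"),
                "service": causes[0].get("Service"),
                "region": causes[0].get("Region"),
                "account": causes[0].get("LinkedAccount"),
            })
        token = page.get("NextPageToken")
        if not token:
            return results
        params["NextPageToken"] = token  # fetch the next page of results

# Usage (requires AWS credentials):
#   import boto3
#   for a in recent_anomalies(boto3.client("ce")):
#       print(a)
```

Passing the client in as an argument keeps the function easy to test with a stub and leaves region and credential handling to the caller.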
Root Cause Analysis Workflow
When an anomaly is detected:
Step 1: Identify WHAT changed
- Which service/resource increased?
- Which account/team?
- Which region?
Step 2: Identify WHEN it changed
- Exact start time of increase
- Gradual or sudden?
- Correlate with deployments, config changes
Step 3: Identify WHO/WHAT caused it
- CloudTrail/Activity Log: who created the resource?
- Deployment log: was code deployed?
- Auto-scaling events: did scaling trigger?
Step 4: Determine impact
- Total additional cost so far
- Projected cost if unchanged
- Is it expected growth or waste?
Step 5: Remediate
- Terminate unused resources
- Fix configuration
- Adjust auto-scaling limits
- Update guardrails to prevent recurrence
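Steps 1 and 4 lend themselves to automation. The following sketch ranks cost dimensions by how much they grew and estimates the extra spend if nothing changes. It assumes you can export spend per service (or per team, or per region) as a dictionary for two comparable periods; the function names and all figures are hypothetical.

```python
def top_cost_drivers(baseline, current, limit=3):
    """Step 1: rank dimensions by how much their spend grew between two periods.

    baseline, current: dicts mapping a dimension key (service, team, region)
    to spend over a comparable period, e.g. one day.
    """
    keys = set(baseline) | set(current)
    deltas = [
        {
            "key": key,
            "before": baseline.get(key, 0.0),
            "after": current.get(key, 0.0),
            "delta": current.get(key, 0.0) - baseline.get(key, 0.0),
        }
        for key in keys
    ]
    increases = [d for d in deltas if d["delta"] > 0]
    # Largest increase first; ties broken by name so output is deterministic.
    increases.sort(key=lambda d: (-d["delta"], d["key"]))
    return increases[:limit]

def projected_extra_cost(daily_delta, days_so_far, days_remaining):
    """Step 4: extra cost already incurred, and still to come if unchanged."""
    return {
        "incurred": daily_delta * days_so_far,
        "projected_additional": daily_delta * days_remaining,
    }

# Hypothetical per-service daily spend, in dollars
baseline = {"EC2": 400.0, "S3": 50.0, "CloudWatch Logs": 30.0}
current = {"EC2": 410.0, "S3": 50.0, "CloudWatch Logs": 330.0, "SageMaker": 120.0}

drivers = top_cost_drivers(baseline, current)
for d in drivers:
    print(f'{d["key"]}: +${d["delta"]:.2f}/day')
print(projected_extra_cost(drivers[0]["delta"], days_so_far=3, days_remaining=20))
```

Services that are new in the current period (SageMaker here) are treated as growing from zero, so freshly launched resources show up alongside existing ones that spiked.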
Automated Guardrails
Budget Enforcement
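def enforce_budget(team, monthly_budget):
    # The get_*, block_* and notify_* helpers are placeholders for your own
    # billing and notification integrations.
    current_spend = get_current_month_spend(team)
    days_elapsed = max(get_days_elapsed_in_month(), 1)  # avoid dividing by zero on day 1
    days_in_month = get_days_in_month()

    # Linear projection: assume the month-to-date daily burn rate continues.
    burn_rate = current_spend / days_elapsed
    projected = burn_rate * days_in_month

    if projected > monthly_budget * 1.2:
        # Projected more than 20% over budget
        block_new_resource_creation(team)
        notify_team_lead(team, projected, monthly_budget)
    elif projected > monthly_budget:
        # Projected over budget, but by 20% or less
        notify_team(team, projected, monthly_budget)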
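The projection logic can be exercised without any billing integration. This is a self-contained sketch under the same linear burn-rate assumption; the function names and the dollar figures are illustrative only.

```python
import calendar
from datetime import date

def project_month_end_spend(current_spend, today):
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    days_elapsed = today.day  # counts today as a full day, so it is always >= 1
    return current_spend / days_elapsed * days_in_month

def budget_action(projected, monthly_budget, block_ratio=1.2):
    """Map a projection onto the two enforcement tiers described above."""
    if projected > monthly_budget * block_ratio:
        return "block_and_notify_lead"
    if projected > monthly_budget:
        return "notify_team"
    return "ok"

# $6,000 spent by April 10 projects to $18,000 for a 30-day month.
projected = project_month_end_spend(6000.0, date(2024, 4, 10))
print(projected)  # 18000.0
print(budget_action(projected, monthly_budget=14000.0))  # block_and_notify_lead
```

A linear projection is volatile in the first few days of a month, when a single large charge dominates the burn rate, so it may be worth delaying the blocking tier until several days of data exist.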
Resource Limits
{
  "policy": "max_instance_cost",
  "rules": [
    {
      "max_hourly_cost_per_instance": 5.00,
      "exceptions": ["ml-training"],
      "action": "deny_with_approval_link"
    },
    {
      "max_ebs_volume_size_gb": 1000,
      "action": "require_justification"
    },
    {
      "max_idle_hours_before_stop": 8,
      "applies_to": ["non-production"],
      "action": "auto_stop"
    }
  ]
}
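How such a policy gets enforced depends on your provisioning pipeline. As a rough sketch, the instance-cost and idle rules could be evaluated as below. This assumes `exceptions` lists workload tags that are exempt from the cost cap and that `applies_to` names environments; both interpretations, and the function names, are assumptions rather than a defined schema.

```python
# The first rule from the policy above, as a Python dict.
INSTANCE_COST_RULE = {
    "max_hourly_cost_per_instance": 5.00,
    "exceptions": ["ml-training"],
    "action": "deny_with_approval_link",
}

def check_instance_request(hourly_cost, workload_tag, rule=INSTANCE_COST_RULE):
    """Return the action to take for a requested instance, or "allow"."""
    if workload_tag in rule["exceptions"]:
        return "allow"  # assumed meaning: exempt workloads skip the cost cap
    if hourly_cost > rule["max_hourly_cost_per_instance"]:
        return rule["action"]
    return "allow"

def should_auto_stop(idle_hours, environment, max_idle_hours=8,
                     applies_to=("non-production",)):
    """Idle rule: stop resources in covered environments once idle too long."""
    return environment in applies_to and idle_hours > max_idle_hours

print(check_instance_request(3.06, "web"))           # allow
print(check_instance_request(32.77, "web"))          # deny_with_approval_link
print(check_instance_request(32.77, "ml-training"))  # allow
print(should_auto_stop(12, "non-production"))        # True
print(should_auto_stop(12, "production"))            # False
```

Returning an action name instead of raising an error lets the caller decide how to present the denial, for example by attaching the approval link.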
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Monthly bill review only | Anomalies can run for up to a month before anyone notices | Daily anomaly detection |
| Alert on total spend only | Cannot identify which service spiked | Per-service, per-team monitoring |
| No automated remediation | Waste continues until human intervenes | Auto-stop idle resources |
| Ignoring small anomalies | Small leaks accumulate to large waste | Alert on percentage change, not just absolute |
| No post-incident review | Same anomaly type recurs | Root cause analysis + guardrail for each incident |
Cost anomaly detection is your financial fire alarm. The faster you detect unexpected spend, the less money you burn.