# Anomaly Detection at Scale
Detect unusual patterns in high-volume data streams. Covers statistical anomaly detection, isolation forests, time-series anomaly detection, and the patterns that find needles in the haystack of millions of data points per second.
Anomaly detection answers the most operationally critical question: “Is something abnormal happening?” In time-series data, anomalies indicate system failures, security breaches, or fraud. In user behavior data, anomalies reveal bot activity, account compromise, or emerging trends. The challenge is detecting genuine anomalies while minimizing false positives in high-volume, noisy data streams.
## Detection Methods
**Statistical Methods** (best for univariate time series):

- **Z-Score**: flag a value more than N standard deviations from the mean.
  - Simple, fast; works for roughly Gaussian distributions.
  - Fails on: non-Gaussian data, seasonal patterns.
- **IQR (Interquartile Range)**: flag values outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR].
  - Robust to outliers; no distribution assumption.
  - Fails on: multimodal distributions, trending data.
- **EWMA (Exponentially Weighted Moving Average)**:
  - Adapts to recent trends, detects level shifts.
  - Good for: monitoring metrics with natural drift.
- **Seasonal Decomposition**:
  - Decompose the series into trend + seasonal + residual components.
  - Flag anomalies in the residual component.
  - Good for: metrics with daily/weekly patterns.
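As an illustration of the EWMA approach, the following sketch maintains an exponentially weighted mean and variance and flags the latest point when it deviates too far. `ewma_detect`, `alpha`, and `k` are hypothetical names and illustrative defaults, not part of any production implementation.

```python
def ewma_detect(values: list, alpha: float = 0.3, k: float = 3.0) -> bool:
    """Flag the latest point if it lies more than k EWMA-stddevs from the
    exponentially weighted mean of the preceding points (illustrative sketch)."""
    mean = float(values[0])
    var = 0.0
    for x in values[1:-1]:  # exclude the point under test from the estimates
        diff = x - mean
        mean += alpha * diff                              # EWMA of the mean
        var = (1 - alpha) * (var + alpha * diff * diff)   # EWMA of the variance
    std = var ** 0.5
    return std > 0 and abs(values[-1] - mean) > k * std
```

Because recent points dominate the estimates, the detector tracks slow drift while still reacting to sudden level shifts.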
**Machine Learning Methods** (multivariate, complex patterns):

- **Isolation Forest**
  - Concept: anomalies are easier to isolate (fewer random splits needed).
  - Pros: no distribution assumption; works on high-dimensional data.
  - Cons: the contamination parameter needs tuning.
- **Local Outlier Factor (LOF)**
  - Concept: compare the local density of a point to that of its neighbors.
  - Pros: detects local anomalies (clusters of different densities).
  - Cons: slow on large datasets.
- **Autoencoders**
  - Concept: train to reconstruct normal data; high reconstruction error = anomaly.
  - Pros: captures complex patterns; works on any data type.
  - Cons: requires training data; slow inference.
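A minimal Isolation Forest sketch using scikit-learn, with synthetic data and illustrative parameters (the dataset shapes and `contamination` value here are assumptions for demonstration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "normal" behavior: a dense 2-D Gaussian cluster.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])  # obvious anomalies

# contamination = expected fraction of anomalies; it usually needs tuning
# against labeled incidents or a target false positive rate.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(normal)

labels = model.predict(outliers)  # 1 = normal, -1 = anomaly
```

`predict` returns -1 for points that the random trees isolate in unusually few splits, which is why no distributional assumption is required.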
## Implementation

```python
import numpy as np


class AnomalyDetector:
    """Production anomaly detection for time-series metrics."""

    def __init__(self, window_size: int = 168):
        """Initialize with a rolling window (168 hours = 1 week)."""
        self.window_size = window_size

    def detect_zscore(self, values: list, threshold: float = 3.0) -> dict:
        """Simple Z-score anomaly detection on the latest value."""
        window = values[-self.window_size:]
        mean = np.mean(window)
        std = np.std(window)
        if std == 0:
            return {"is_anomaly": False, "reason": "zero_variance"}
        current = values[-1]
        z_score = abs(current - mean) / std
        return {
            "is_anomaly": z_score > threshold,
            "z_score": round(z_score, 2),
            "current_value": current,
            "expected_range": [
                round(mean - threshold * std, 2),
                round(mean + threshold * std, 2),
            ],
            "severity": (
                "critical" if z_score > 5 else
                "warning" if z_score > threshold else
                "normal"
            ),
        }

    def detect_seasonal(self, values: list, period: int = 24) -> dict:
        """Detect anomalies accounting for seasonal patterns."""
        # Same-hour values from the previous 7 days, excluding the current point.
        # (Stepping from the end by whole periods keeps each sample aligned
        # with the hour of the value under test.)
        hour_values = values[-period * 7 - 1:-1:period]
        if len(hour_values) < 3:
            return self.detect_zscore(values)
        seasonal_mean = np.mean(hour_values)
        seasonal_std = np.std(hour_values)
        current = values[-1]
        deviation = abs(current - seasonal_mean)
        return {
            "is_anomaly": deviation > 3 * seasonal_std,
            "current_value": current,
            "seasonal_expected": round(seasonal_mean, 2),
            "deviation_ratio": round(deviation / max(seasonal_std, 0.001), 2),
        }
```
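The IQR method listed under Detection Methods has no implementation above; a standalone sketch might look like the following (`detect_iqr` is a hypothetical helper, not part of the original class):

```python
import numpy as np


def detect_iqr(values: list, k: float = 1.5) -> dict:
    """IQR-based outlier test on the latest value (illustrative sketch)."""
    window = np.asarray(values[:-1], dtype=float)  # history, excluding current
    q1, q3 = np.percentile(window, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    current = values[-1]
    return {
        "is_anomaly": bool(current < lower or current > upper),
        "current_value": current,
        "expected_range": [round(lower, 2), round(upper, 2)],
    }
```

Because quartiles ignore the tails, a handful of past outliers in the window barely moves the bounds, which is the robustness advantage noted in the method list.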
## Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Static thresholds | Cannot adapt to trend changes | Dynamic thresholds based on rolling statistics |
| Ignore seasonal patterns | Normal weekend dips flagged as anomalies | Seasonal decomposition before anomaly detection |
| No false positive tracking | Alert fatigue, real anomalies ignored | Track and optimize false positive rate |
| Single detection method | Misses some anomaly types | Ensemble: combine multiple detection methods |
| No root cause context | Anomaly detected but no debugging information | Include related metrics and recent changes in alerts |
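The ensemble fix in the table can be sketched as a simple vote over the boolean verdicts of independent detectors; `ensemble_detect` and `min_votes` are hypothetical names chosen for illustration:

```python
def ensemble_detect(flags: list, min_votes: int = 2) -> dict:
    """Combine is_anomaly verdicts from several detectors by voting.

    Requiring agreement from at least `min_votes` methods suppresses
    false positives that only one detector produces.
    """
    votes = sum(1 for f in flags if f)
    return {
        "is_anomaly": votes >= min_votes,
        "votes": votes,
        "total": len(flags),
    }
```

A higher `min_votes` trades recall for precision, which is exactly the alert-trust tradeoff discussed below.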
Anomaly detection is the first line of defense in operational intelligence. The goal is not zero false positives — it is tuning the system so that when an alert fires, engineers trust it and act on it. Trust is built through precision and destroyed by noise.