AI Model Monitoring and Drift Detection
Monitor deployed ML models for performance degradation and data drift. Covers feature drift detection, prediction monitoring, model staleness indicators, automated retraining triggers, and the patterns that ensure AI systems stay accurate after deployment.
An ML model that is 95% accurate at deployment can degrade to 70% within months, and nobody notices because there are no alerts. Model monitoring fills the gap between deployment and retraining by continuously measuring whether the model's real-world performance matches its training performance. When performance drifts, you find out immediately instead of months later.
Types of Drift
**Data Drift** (feature distribution changes):
- Training data: user age distribution centered at 25-35
- Production data: new market segment, ages 45-65
- Result: the model has never seen this demographic, so accuracy drops
- Detection: compare feature distributions over time
- Metric: Population Stability Index (PSI), KL divergence (a minimal PSI sketch follows this list)
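Because PSI does the heavy lifting here, it is worth seeing in full. A minimal sketch using numpy; the function name, bin count, and epsilon are illustrative choices, not something prescribed above:

```python
import numpy as np

def population_stability_index(reference, production, bins=10):
    """PSI between a reference and a production sample of one feature.

    Rule of thumb: < 0.1 no drift, 0.1-0.25 moderate, > 0.25 significant.
    """
    # Fix bin edges on the reference (training-time) distribution so both
    # samples are compared on the same grid. Note: production values outside
    # the reference range fall out of the comparison entirely; production-grade
    # implementations usually add open-ended outer bins.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    prod_counts, _ = np.histogram(production, bins=edges)

    # Proportions per bin; a small epsilon guards against log(0)
    eps = 1e-6
    ref_pct = ref_counts / max(ref_counts.sum(), 1) + eps
    prod_pct = prod_counts / max(prod_counts.sum(), 1) + eps

    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))
```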
**Concept Drift** (the relationship between features and target changes):
- Training: high price → low conversion
- Production (after inflation): high price → same conversion
- Result: the model's learned relationship is wrong
- Detection: monitor prediction accuracy against ground truth
- Metric: accuracy, F1, AUC over time windows (see the windowed sketch after this list)
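Ground truth often arrives with a delay, but once labels land the check is a simple windowed aggregate. A sketch using pandas, where the DataFrame and its timestamp/prediction/actual columns are hypothetical names for a prediction log joined with labels:

```python
import pandas as pd

def windowed_accuracy(df: pd.DataFrame, freq: str = "7D") -> pd.Series:
    """Accuracy per time window; a sustained downward trend is the
    classic signature of concept drift."""
    # Index by event time so resample can form the windows
    indexed = df.set_index(pd.to_datetime(df["timestamp"]))
    # Fraction of correct predictions in each window
    return (indexed["prediction"] == indexed["actual"]).resample(freq).mean()
```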
**Prediction Drift** (model output distribution changes):
- Training: 10% of predictions are "high risk"
- Production: 30% of predictions are "high risk"
- Result: something changed, in either the data or real-world conditions
- Detection: monitor the prediction distribution
- Metric: output histogram comparison, KS test (a KS-test sketch follows this list)
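For the histogram comparison, scipy's two-sample Kolmogorov-Smirnov test is one concrete option; the wrapper below and its alpha threshold are illustrative assumptions:

```python
from scipy.stats import ks_2samp

def prediction_drift(baseline_scores, production_scores, alpha=0.01):
    """Flag drift when the KS test rejects 'same output distribution'."""
    statistic, p_value = ks_2samp(baseline_scores, production_scores)
    return {
        "ks_statistic": statistic,
        "p_value": p_value,
        # On large samples even tiny shifts become 'significant', so the
        # statistic itself is worth monitoring alongside the p-value
        "drifted": p_value < alpha,
    }
```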
**Performance Drift** (speed or resource degradation):
- Deployment: 50ms inference latency
- After 6 months: 200ms inference latency
- Result: the model or its feature computation has become inefficient
- Detection: monitor latency, throughput, memory
- Metric: P50/P99 latency, throughput (a percentile sketch follows this list)
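Latency percentiles are cheap to compute from a sample of request timings. A minimal numpy sketch, with the P99 budget as an assumed parameter:

```python
import numpy as np

def check_latency(latencies_ms, p99_budget_ms=100.0):
    """Summarize a window of inference timings and flag budget breaches."""
    p50, p99 = np.percentile(latencies_ms, [50, 99])
    return {"p50_ms": float(p50), "p99_ms": float(p99),
            "breach": p99 > p99_budget_ms}
```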
Monitoring Implementation
```python
class ModelMonitor:
    """Production ML model monitoring."""

    def check_data_drift(self, reference_data, production_data):
        """Detect feature distribution shifts."""
        drift_results = {}
        for feature in reference_data.columns:
            ref_dist = reference_data[feature]
            prod_dist = production_data[feature]

            # Population Stability Index (see the standalone sketch above)
            psi = self.calculate_psi(ref_dist, prod_dist)
            drift_results[feature] = {
                "psi": round(psi, 4),
                "status": (
                    "no_drift" if psi < 0.1 else
                    "moderate_drift" if psi < 0.25 else
                    "significant_drift"
                ),
            }

        drifted = [f for f, r in drift_results.items()
                   if r["status"] == "significant_drift"]
        if drifted:
            self.alert(
                severity="warning",
                message=f"Data drift detected in features: {drifted}",
                action="Investigate and consider retraining",
            )
        return drift_results

    def check_prediction_quality(self, window_days: int = 7):
        """Monitor model accuracy against ground truth."""
        # Data-access helpers (predictions, labels, baseline) are elided
        predictions = self.get_recent_predictions(window_days)
        actuals = self.get_ground_truth(window_days)

        current_accuracy = self.calculate_accuracy(predictions, actuals)
        baseline_accuracy = self.get_baseline_accuracy()
        degradation = baseline_accuracy - current_accuracy

        if degradation > 0.05:  # >5% accuracy drop
            self.trigger_retraining(
                reason=f"Accuracy degraded by {degradation:.1%}",
                current=current_accuracy,
                baseline=baseline_accuracy,
            )
```
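The class above leaves its helpers abstract: calculate_psi could delegate to the standalone PSI function sketched earlier, and the data-access methods depend on your serving stack. A hypothetical wiring for a scheduled check, where ModelMonitor's constructor and the load_* helpers are assumptions, not part of the original:

```python
# Hypothetical glue code: run both checks on a daily schedule.
monitor = ModelMonitor()

reference = load_training_features()       # snapshot from training time (assumed helper)
production = load_recent_features(days=7)  # last week of serving traffic (assumed helper)

monitor.check_data_drift(reference, production)
monitor.check_prediction_quality(window_days=7)
```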
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Deploy model without monitoring | Silent degradation, wrong predictions | Monitor from day 1: drift, accuracy, latency |
| No ground truth collection | Cannot measure real accuracy | Collect labels continuously, even if delayed |
| Alert on every minor drift | Alert fatigue, ignore real issues | Thresholds: PSI > 0.25 = significant, investigate |
| Retrain on schedule only | Miss sudden drift, waste compute on stable models | Event-driven retraining triggered by drift detection |
| Monitor predictions but not features | Cannot diagnose WHY accuracy dropped | Monitor both input distributions and outputs |
Model monitoring is the operational equivalent of testing in software. You would not deploy code without tests; you should not deploy models without monitoring. The cost of a degraded model is measured in wrong decisions, and those decisions compound silently.