AI Model Monitoring and Drift Detection
Monitor deployed ML models for performance degradation and data drift. Covers feature drift detection, prediction monitoring, model staleness indicators, automated retraining triggers, and the patterns that ensure AI systems stay accurate after deployment.
An ML model that is 95% accurate at deployment can degrade to 70% within months, and nobody notices because there are no alerts. Model monitoring fills the gap between deployment and retraining by continuously measuring whether the model's real-world performance matches its training performance. When performance drifts, you find out immediately instead of months later.
Types of Drift
**Data Drift** (feature distribution changes):
- Training data: user age distribution centered at 25-35
- Production data: new market segment, ages 45-65
- Result: the model has never seen this demographic, so accuracy drops
- Detection: compare feature distributions over time
- Metric: Population Stability Index (PSI), KL divergence (a minimal PSI sketch follows this list)
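Because PSI does the heavy lifting here, it is worth seeing in full. A minimal sketch using numpy; the function name, bin count, and epsilon are illustrative choices, not something prescribed above:

```python
import numpy as np

def population_stability_index(reference, production, bins=10):
    """PSI between a reference and a production sample of one feature.

    Rule of thumb: < 0.1 no drift, 0.1-0.25 moderate, > 0.25 significant.
    """
    # Fix bin edges on the reference (training-time) distribution so both
    # samples are compared on the same grid. Note: production values outside
    # the reference range fall out of the comparison entirely; production-grade
    # implementations usually add open-ended outer bins.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    prod_counts, _ = np.histogram(production, bins=edges)

    # Proportions per bin; a small epsilon guards against log(0)
    eps = 1e-6
    ref_pct = ref_counts / max(ref_counts.sum(), 1) + eps
    prod_pct = prod_counts / max(prod_counts.sum(), 1) + eps

    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))
```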
**Concept Drift** (the relationship between features and target changes):
- Training: high price → low conversion
- Production (after inflation): high price → same conversion
- Result: the model's learned relationship is wrong
- Detection: monitor prediction accuracy against ground truth
- Metric: accuracy, F1, AUC over time windows (see the windowed sketch after this list)
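Ground truth often arrives with a delay, but once labels land the check is a simple windowed aggregate. A sketch using pandas, where the DataFrame and its timestamp/prediction/actual columns are hypothetical names for a prediction log joined with labels:

```python
import pandas as pd

def windowed_accuracy(df: pd.DataFrame, freq: str = "7D") -> pd.Series:
    """Accuracy per time window; a sustained downward trend is the
    classic signature of concept drift."""
    # Index by event time so resample can form the windows
    indexed = df.set_index(pd.to_datetime(df["timestamp"]))
    # Fraction of correct predictions in each window
    return (indexed["prediction"] == indexed["actual"]).resample(freq).mean()
```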
**Prediction Drift** (model output distribution changes):
- Training: 10% of predictions are "high risk"
- Production: 30% of predictions are "high risk"
- Result: something changed, in either the data or real-world conditions
- Detection: monitor the prediction distribution
- Metric: output histogram comparison, KS test (a KS-test sketch follows this list)
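For the histogram comparison, scipy's two-sample Kolmogorov-Smirnov test is one concrete option; the wrapper below and its alpha threshold are illustrative assumptions:

```python
from scipy.stats import ks_2samp

def prediction_drift(baseline_scores, production_scores, alpha=0.01):
    """Flag drift when the KS test rejects 'same output distribution'."""
    statistic, p_value = ks_2samp(baseline_scores, production_scores)
    return {
        "ks_statistic": statistic,
        "p_value": p_value,
        # On large samples even tiny shifts become 'significant', so the
        # statistic itself is worth monitoring alongside the p-value
        "drifted": p_value < alpha,
    }
```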
**Performance Drift** (speed or resource degradation):
- Deployment: 50ms inference latency
- After 6 months: 200ms inference latency
- Result: the model or its feature computation has become inefficient
- Detection: monitor latency, throughput, memory
- Metric: P50/P99 latency, throughput (a percentile sketch follows this list)
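Latency percentiles are cheap to compute from a sample of request timings. A minimal numpy sketch, with the P99 budget as an assumed parameter:

```python
import numpy as np

def check_latency(latencies_ms, p99_budget_ms=100.0):
    """Summarize a window of inference timings and flag budget breaches."""
    p50, p99 = np.percentile(latencies_ms, [50, 99])
    return {"p50_ms": float(p50), "p99_ms": float(p99),
            "breach": p99 > p99_budget_ms}
```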
Monitoring Implementation
```python
class ModelMonitor:
    """Production ML model monitoring."""

    def check_data_drift(self, reference_data, production_data):
        """Detect feature distribution shifts."""
        drift_results = {}
        for feature in reference_data.columns:
            ref_dist = reference_data[feature]
            prod_dist = production_data[feature]

            # Population Stability Index (see the standalone sketch above)
            psi = self.calculate_psi(ref_dist, prod_dist)
            drift_results[feature] = {
                "psi": round(psi, 4),
                "status": (
                    "no_drift" if psi < 0.1 else
                    "moderate_drift" if psi < 0.25 else
                    "significant_drift"
                ),
            }

        drifted = [f for f, r in drift_results.items()
                   if r["status"] == "significant_drift"]
        if drifted:
            self.alert(
                severity="warning",
                message=f"Data drift detected in features: {drifted}",
                action="Investigate and consider retraining",
            )
        return drift_results

    def check_prediction_quality(self, window_days: int = 7):
        """Monitor model accuracy against ground truth."""
        # Data-access helpers (predictions, labels, baseline) are elided
        predictions = self.get_recent_predictions(window_days)
        actuals = self.get_ground_truth(window_days)

        current_accuracy = self.calculate_accuracy(predictions, actuals)
        baseline_accuracy = self.get_baseline_accuracy()
        degradation = baseline_accuracy - current_accuracy

        if degradation > 0.05:  # >5% accuracy drop
            self.trigger_retraining(
                reason=f"Accuracy degraded by {degradation:.1%}",
                current=current_accuracy,
                baseline=baseline_accuracy,
            )
```
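The class above leaves its helpers abstract: calculate_psi could delegate to the standalone PSI function sketched earlier, and the data-access methods depend on your serving stack. A hypothetical wiring for a scheduled check, where ModelMonitor's constructor and the load_* helpers are assumptions, not part of the original:

```python
# Hypothetical glue code: run both checks on a daily schedule.
monitor = ModelMonitor()

reference = load_training_features()       # snapshot from training time (assumed helper)
production = load_recent_features(days=7)  # last week of serving traffic (assumed helper)

monitor.check_data_drift(reference, production)
monitor.check_prediction_quality(window_days=7)
```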
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Deploy model without monitoring | Silent degradation, wrong predictions | Monitor from day 1: drift, accuracy, latency |
| No ground truth collection | Cannot measure real accuracy | Collect labels continuously, even if delayed |
| Alert on every minor drift | Alert fatigue, ignore real issues | Thresholds: PSI > 0.25 = significant, investigate |
| Retrain on schedule only | Miss sudden drift, waste compute on stable models | Event-driven retraining triggered by drift detection |
| Monitor predictions but not features | Cannot diagnose WHY accuracy dropped | Monitor both input distributions and outputs |
Model monitoring is the operational equivalent of testing in software. You would not deploy code without tests; you should not deploy models without monitoring. The cost of a degraded model is measured in wrong decisions, and those decisions compound silently.