# AI Observability & Model Monitoring
Monitor AI/ML models in production with drift detection, performance tracking, prediction logging, alerting, and MLOps dashboards.
Deploying an ML model is easy. Keeping it working — reliably, without silent degradation — is the hard part. Models decay as feature distributions shift, upstream data pipelines break, and user behavior evolves. Without observability, you won’t know until customers complain. This guide covers how to build monitoring that catches problems before they reach users.
## What to Monitor
| Layer | What Can Go Wrong | Detection Method |
|---|---|---|
| Data | Feature drift, schema changes, missing values | Statistical tests on input distributions |
| Model | Accuracy degradation, bias drift | Ground truth comparison, fairness metrics |
| Infrastructure | Latency spikes, OOM, GPU failures | System metrics, health checks |
| Business | Conversion drops, user complaints | Business KPI tracking |
## Data Drift Detection

Your model was trained on historical data. When production distributions shift, the model extrapolates and predictions degrade silently. The detector below pairs a two-sample Kolmogorov-Smirnov test with the Population Stability Index (PSI), a de facto industry standard for quantifying distribution shift:
```python
from scipy import stats
import numpy as np


class DriftDetector:
    """Compares production feature distributions against a training-time reference.

    Expects pandas DataFrames (relies on .dtype, .unique).
    """

    def __init__(self, reference_data, features):
        self.reference = reference_data
        self.features = features

    def detect_drift(self, production_data, significance=0.05):
        report = {"features": {}, "overall_drift": False}
        for feature in self.features:
            ref = self.reference[feature]
            prod = production_data[feature]
            if ref.dtype in ("float64", "int64"):
                # Numeric features: KS test for any distribution change,
                # PSI for a magnitude score with actionable thresholds
                stat, p_value = stats.ks_2samp(ref, prod)
                psi = self._calculate_psi(ref, prod)
                drifted = p_value < significance or psi > 0.2
                report["features"][feature] = {
                    "ks_p_value": round(p_value, 6),
                    "psi": round(psi, 4),
                    "drifted": drifted,
                    "severity": "critical" if psi > 0.25 else "warning" if psi > 0.1 else "ok",
                }
            else:
                # Categorical features: flag categories never seen during training
                new_cats = sorted(set(prod.unique()) - set(ref.unique()))
                report["features"][feature] = {
                    "drifted": len(new_cats) > 0,
                    "new_categories": new_cats,
                }
        report["overall_drift"] = any(f["drifted"] for f in report["features"].values())
        return report

    @staticmethod
    def _calculate_psi(reference, production, bins=10):
        # Bin production data on the reference's bin edges, then compare proportions
        ref_hist, bin_edges = np.histogram(reference, bins=bins)
        prod_hist, _ = np.histogram(production, bins=bin_edges)
        # Add-one smoothing avoids division by zero in empty bins
        ref_pct = (ref_hist + 1) / (len(reference) + bins)
        prod_pct = (prod_hist + 1) / (len(production) + bins)
        return np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct))
```
| PSI Value | Interpretation | Action |
|---|---|---|
| < 0.1 | No significant drift | Continue monitoring |
| 0.1 - 0.2 | Moderate drift | Investigate, increase frequency |
| 0.2 - 0.25 | Significant drift | Evaluate retraining |
| > 0.25 | Critical drift | Retrain immediately |
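To make these thresholds concrete, here is a dependency-free sketch of the same smoothed binned-proportion calculation. It uses fixed-width bins over the reference range and clamps out-of-range values (the class above uses `np.histogram`, which drops values outside the edges, so results differ slightly at the tails); the synthetic data is illustrative only.

```python
import math
import random


def psi(reference, production, bins=10):
    """PSI over fixed-width bins spanning the reference range,
    with the same add-one smoothing as _calculate_psi above."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def counts(sample):
        hist = [0] * bins
        for x in sample:
            idx = min(max(int((x - lo) / width), 0), bins - 1)  # clamp out-of-range
            hist[idx] += 1
        return hist

    ref_hist, prod_hist = counts(reference), counts(production)
    n_ref, n_prod = len(reference) + bins, len(production) + bins
    return sum(
        ((p + 1) / n_prod - (r + 1) / n_ref)
        * math.log(((p + 1) / n_prod) / ((r + 1) / n_ref))
        for r, p in zip(ref_hist, prod_hist)
    )


random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]
stable = [random.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [random.gauss(0.8, 1.0) for _ in range(5000)]  # mean shifted by 0.8 sigma

print(round(psi(reference, stable), 3))   # well under 0.1: continue monitoring
print(round(psi(reference, shifted), 3))  # over 0.25: retrain
```

A mean shift of 0.8 standard deviations lands firmly in the "critical" band, while a fresh sample from the training distribution stays near zero.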
## Prediction Logging

Log every prediction with enough context to reconstruct model behavior later:

```python
from datetime import datetime, timezone


class PredictionLogger:
    def __init__(self, storage):
        self.storage = storage

    def log(self, request_id, model_version, prediction, confidence, latency_ms):
        self.storage.write({
            "request_id": request_id,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "prediction": prediction,
            "confidence": confidence,
            "latency_ms": latency_ms,
        })

    def log_ground_truth(self, request_id, actual_value):
        """Log ground truth when available (often delayed)."""
        self.storage.update(request_id, {"actual": actual_value})
```
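Ground truth often arrives hours or days after the prediction (chargebacks, churn, returns). A toy sketch of joining delayed labels back onto logged predictions to compute realized accuracy, with storage reduced to a plain dict; the record fields mirror `log()` above, and the ids and labels are made up:

```python
# Logged predictions keyed by request_id (what the logger would have written)
logged = {
    "r1": {"prediction": "fraud", "confidence": 0.91},
    "r2": {"prediction": "ok",    "confidence": 0.55},
    "r3": {"prediction": "fraud", "confidence": 0.78},
}

# Ground truth trickling in later, e.g. from chargeback reports
ground_truth = {"r1": "fraud", "r2": "fraud"}  # r3 still unlabeled

for request_id, actual in ground_truth.items():
    logged[request_id]["actual"] = actual

# Realized accuracy is only computable over the labeled subset,
# so report label coverage alongside it
labeled = [r for r in logged.values() if "actual" in r]
accuracy = sum(r["prediction"] == r["actual"] for r in labeled) / len(labeled)
print(f"coverage: {len(labeled)}/{len(logged)}, realized accuracy: {accuracy:.2f}")
# → coverage: 2/3, realized accuracy: 0.50
```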
## Alerting Framework

```python
ALERT_RULES = [
    {"name": "accuracy_drop", "condition": "accuracy < baseline - 0.05", "severity": "critical"},
    {"name": "latency_spike", "condition": "p95_latency > baseline * 2", "severity": "warning"},
    {"name": "drift_detected", "condition": "psi_max > 0.2", "severity": "warning"},
    {"name": "low_confidence", "condition": "low_conf_pct > baseline * 1.5", "severity": "warning"},
    {"name": "volume_anomaly", "condition": "hourly_preds outside 0.5x-3x baseline", "severity": "info"},
]
```
For each alert type, define:
- Detection: What metric triggers it, what threshold
- Notification: Who gets paged (PagerDuty/Slack), at what severity
- Runbook: Step-by-step response procedure
- Resolution: How to confirm the issue is fixed
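The condition strings above are pseudocode; each rule ultimately needs an executable predicate. A minimal sketch of an evaluator, with metric names and baseline values assumed for illustration:

```python
# Rules as predicates over (current metrics, baseline); names mirror ALERT_RULES above
RULES = [
    ("accuracy_drop",  "critical", lambda m, b: m["accuracy"] < b["accuracy"] - 0.05),
    ("latency_spike",  "warning",  lambda m, b: m["p95_latency"] > b["p95_latency"] * 2),
    ("drift_detected", "warning",  lambda m, b: m["psi_max"] > 0.2),
]


def evaluate(metrics, baseline):
    """Return the subset of rules whose predicate fires on this snapshot."""
    return [
        {"name": name, "severity": severity}
        for name, severity, check in RULES
        if check(metrics, baseline)
    ]


baseline = {"accuracy": 0.935, "p95_latency": 450, "psi_max": 0.05}
current = {"accuracy": 0.912, "p95_latency": 950, "psi_max": 0.18}

print(evaluate(current, baseline))
# accuracy is down but above the -0.05 floor, PSI is below 0.2,
# only the latency rule fires: [{'name': 'latency_spike', 'severity': 'warning'}]
```

Keeping predicates as plain functions makes each rule unit-testable against recorded metric snapshots before it can page anyone.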
## A/B Testing Models

```python
import hashlib


class ModelABTest:
    def __init__(self, model_a, model_b, traffic_split=0.1):
        self.model_a = model_a  # Control (production)
        self.model_b = model_b  # Treatment (candidate)
        self.split = traffic_split

    def predict(self, features, request_id):
        # Use a stable digest: Python's built-in hash() is salted per process,
        # so it would reshuffle assignments on every restart
        bucket = int(hashlib.sha256(str(request_id).encode()).hexdigest(), 16) % 100
        in_treatment = bucket < self.split * 100
        model = self.model_b if in_treatment else self.model_a
        variant = "treatment" if in_treatment else "control"
        prediction = model.predict(features)
        log_ab_result(request_id, variant, prediction)  # logging hook, defined elsewhere
        return prediction
```
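Two properties are worth verifying for any splitter: the same request id always lands in the same arm, even across restarts, and the realized traffic fraction matches the configured split. A quick standalone check, using a SHA-256 bucket (the assignment scheme here is an assumption for illustration):

```python
import hashlib


def assign(request_id, split=0.1):
    """Deterministic bucket in [0, 100) from a stable digest of the id."""
    bucket = int(hashlib.sha256(str(request_id).encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < split * 100 else "control"


ids = [f"req-{i}" for i in range(10_000)]
treated = sum(assign(rid) == "treatment" for rid in ids)
print(treated / len(ids))  # close to the configured 0.10 split

# Same id, same arm, every time - a property the built-in hash() cannot
# guarantee across processes, since it is salted per interpreter run
assert assign("req-42") == assign("req-42")
```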
## Performance Report Template
| Metric | Current | Baseline | Status |
|---|---|---|---|
| Accuracy | 91.2% | 93.5% | ⚠️ -2.3% |
| Latency p50 | 145ms | 130ms | ✅ Within range |
| Latency p95 | 890ms | 450ms | 🔴 +98% |
| PSI (max feature) | 0.18 | < 0.1 | ⚠️ Moderate drift |
| Low confidence (< 0.5) | 8.2% | 5.1% | ⚠️ Increasing |
| Prediction volume | 12.4K/hr | 11.8K/hr | ✅ Normal |
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No ground truth collection | Can’t measure actual production accuracy | Build delayed feedback pipeline |
| Monitoring only accuracy | Miss drift, latency, and cost issues | Monitor all 4 layers: data, model, infra, business |
| Alert fatigue | Too many non-actionable alerts | Tune thresholds, use severity levels |
| No baseline | Can’t detect degradation without “normal” | Establish baseline during validation, refresh quarterly |
| Ignoring confidence | Treating all predictions equally | Track confidence distribution and calibration |
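The calibration point deserves a concrete check: bucket predictions by confidence and compare each bucket's average confidence to its empirical accuracy (expected calibration error). A minimal sketch on made-up data:

```python
def expected_calibration_error(confidences, correct, bins=10):
    """Weighted mean |avg confidence - empirical accuracy| over confidence buckets."""
    buckets = [[] for _ in range(bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * bins), bins - 1)  # e.g. 0.95 lands in bucket 9
        buckets[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece


# Four high-confidence predictions (all correct) and two mid-confidence
# predictions (one correct): each bucket is off by 0.05
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hits = [True, True, True, True, True, False]
print(round(expected_calibration_error(confs, hits), 4))  # → 0.05
```

A well-calibrated model keeps this number near zero; a rising ECE means confidence scores can no longer be trusted for downstream thresholding.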
## Checklist
- Prediction logging: inputs, outputs, confidence, latency captured
- Data drift: KS test + PSI on all input features, automated
- Model performance: accuracy tracked against delayed ground truth
- Latency: p50/p95/p99 with alerting thresholds
- Confidence calibration validated
- Cost tracking: per-prediction and monthly dashboards
- Alert rules: accuracy drop, drift, latency, volume anomaly
- A/B testing framework for safe model rollouts
- Centralized dashboard across data/model/infra/business
- Runbooks documented for each alert type
- Retention policy for prediction logs
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For ML monitoring consulting, visit garnetgrid.com. :::