AI Observability & Model Monitoring

Monitor AI/ML models in production with drift detection, performance tracking, prediction logging, alerting, and MLOps dashboards.

Deploying an ML model is easy. Keeping it working — reliably, without silent degradation — is the hard part. Models decay as feature distributions shift, upstream data pipelines break, and user behavior evolves. Without observability, you won’t know until customers complain. This guide covers how to build monitoring that catches problems before they reach users.


What to Monitor

| Layer | What Can Go Wrong | Detection Method |
|---|---|---|
| Data | Feature drift, schema changes, missing values | Statistical tests on input distributions |
| Model | Accuracy degradation, bias drift | Ground truth comparison, fairness metrics |
| Infrastructure | Latency spikes, OOM, GPU failures | System metrics, health checks |
| Business | Conversion drops, user complaints | Business KPI tracking |

Data Drift Detection

Your model was trained on historical data. When production distributions shift, the model extrapolates — predictions degrade silently. Two complementary checks are the two-sample Kolmogorov–Smirnov (KS) test and the Population Stability Index (PSI), a widely used industry standard:

```python
from scipy import stats
import numpy as np

class DriftDetector:
    def __init__(self, reference_data, features):
        self.reference = reference_data
        self.features = features

    def detect_drift(self, production_data, significance=0.05):
        report = {"features": {}, "overall_drift": False}

        for feature in self.features:
            ref = self.reference[feature]
            prod = production_data[feature]

            if ref.dtype in ['float64', 'int64']:
                # Numeric features: two-sample KS test plus PSI.
                stat, p_value = stats.ks_2samp(ref, prod)
                psi = self._calculate_psi(ref, prod)
                drifted = p_value < significance or psi > 0.2

                report["features"][feature] = {
                    "ks_p_value": round(p_value, 6),
                    "psi": round(psi, 4),
                    "drifted": drifted,
                    "severity": "critical" if psi > 0.25 else "warning" if psi > 0.1 else "ok",
                }
            else:
                # Categorical features: flag categories never seen in training.
                new_cats = sorted(set(prod.unique()) - set(ref.unique()))
                report["features"][feature] = {
                    "drifted": len(new_cats) > 0,
                    "new_categories": new_cats,
                }

        report["overall_drift"] = any(f["drifted"] for f in report["features"].values())
        return report

    @staticmethod
    def _calculate_psi(reference, production, bins=10):
        # Bin both samples on the reference bin edges; the +1 (Laplace)
        # smoothing keeps empty bins from producing log(0) or division by zero.
        ref_hist, bin_edges = np.histogram(reference, bins=bins)
        prod_hist, _ = np.histogram(production, bins=bin_edges)
        ref_pct = (ref_hist + 1) / (len(reference) + bins)
        prod_pct = (prod_hist + 1) / (len(production) + bins)
        return np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct))
```
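
A minimal usage sketch, with `train_df` and `prod_df` as placeholder DataFrames standing in for your training snapshot and a recent production window:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
train_df = pd.DataFrame({"income": rng.normal(50_000, 10_000, 5_000)})
prod_df = pd.DataFrame({"income": rng.normal(58_000, 10_000, 5_000)})  # simulated shift

detector = DriftDetector(reference_data=train_df, features=["income"])
report = detector.detect_drift(prod_df)
print(report["features"]["income"])  # expect drifted=True with a large PSI
```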

| PSI Value | Interpretation | Action |
|---|---|---|
| < 0.1 | No significant drift | Continue monitoring |
| 0.1 - 0.2 | Moderate drift | Investigate, increase monitoring frequency |
| 0.2 - 0.25 | Significant drift | Evaluate retraining |
| > 0.25 | Critical drift | Retrain immediately |

Prediction Logging

Log every prediction with enough context to debug it later and to join it with ground truth once labels arrive:

```python
from datetime import datetime, timezone

class PredictionLogger:
    def __init__(self, storage):
        self.storage = storage

    def log(self, request_id, model_version, prediction, confidence, latency_ms):
        self.storage.write({
            "request_id": request_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model_version": model_version,
            "prediction": prediction,
            "confidence": confidence,
            "latency_ms": latency_ms,
        })

    def log_ground_truth(self, request_id, actual_value):
        """Log ground truth when available (often delayed)."""
        self.storage.update(request_id, {"actual": actual_value})
```
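
Once labels arrive, you can compute realized accuracy over a trailing window. A sketch assuming the storage backend exposes a `query` method returning logged records as dicts (that method is hypothetical, not part of the logger above):

```python
def delayed_accuracy(storage, model_version, since_iso):
    """Accuracy over records that have received ground truth (exact-match tasks)."""
    # storage.query is an assumed API; adapt to your warehouse or log store.
    records = storage.query(model_version=model_version, since=since_iso)
    scored = [r for r in records if "actual" in r]  # only rows with labels so far
    if not scored:
        return None  # no ground truth yet for this window
    correct = sum(1 for r in scored if r["prediction"] == r["actual"])
    return correct / len(scored)
```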

Alerting Framework

Codify alert rules as data so thresholds live in one reviewable, versioned place:

```python
ALERT_RULES = [
    {"name": "accuracy_drop", "condition": "accuracy < baseline - 0.05", "severity": "critical"},
    {"name": "latency_spike", "condition": "p95_latency > baseline * 2", "severity": "warning"},
    {"name": "drift_detected", "condition": "psi_max > 0.2", "severity": "warning"},
    {"name": "low_confidence", "condition": "low_conf_pct > baseline * 1.5", "severity": "warning"},
    {"name": "volume_anomaly", "condition": "hourly_preds outside 0.5x-3x baseline", "severity": "info"},
]
```

For each alert type, define:

  1. Detection: What metric triggers it, what threshold
  2. Notification: Who gets paged (PagerDuty/Slack), at what severity
  3. Runbook: Step-by-step response procedure
  4. Resolution: How to confirm the issue is fixed

A/B Testing Models

Route a small slice of traffic to the candidate model before a full rollout. Bucketing must be deterministic per request, so the version below uses a stable digest rather than Python's built-in `hash`, which is salted per process:

```python
import hashlib

class ModelABTest:
    def __init__(self, model_a, model_b, traffic_split=0.1):
        self.model_a = model_a   # Control (production)
        self.model_b = model_b   # Treatment (candidate)
        self.split = traffic_split

    def predict(self, features, request_id):
        # Stable 0-99 bucket per request_id (hash() varies across processes).
        bucket = int(hashlib.md5(str(request_id).encode()).hexdigest(), 16) % 100
        in_treatment = bucket < self.split * 100
        model = self.model_b if in_treatment else self.model_a
        variant = "treatment" if in_treatment else "control"

        prediction = model.predict(features)
        log_ab_result(request_id, variant, prediction)  # logging helper, defined elsewhere
        return prediction
```
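
To decide whether the treatment actually wins, compare outcomes between variants with a significance test. A sketch using a two-proportion z-test on binary success counts; the counts below are illustrative placeholders:

```python
from math import sqrt

from scipy import stats

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in success rates between variants."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * stats.norm.sf(abs(z))  # two-sided p-value

# Illustrative: control converted 1,180/11,600; treatment 148/1,290.
z, p = two_proportion_z(1180, 11_600, 148, 1_290)
print(f"z={z:.2f}, p={p:.4f}")  # treat as significant only if p < 0.05
```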

Performance Report Template

| Metric | Current | Baseline | Status |
|---|---|---|---|
| Accuracy | 91.2% | 93.5% | ⚠️ -2.3% |
| Latency p50 | 145ms | 130ms | ✅ Within range |
| Latency p95 | 890ms | 450ms | 🔴 +97% |
| PSI (max feature) | 0.18 | < 0.1 | ⚠️ Moderate drift |
| Low confidence (< 0.5) | 8.2% | 5.1% | ⚠️ Increasing |
| Prediction volume | 12.4K/hr | 11.8K/hr | ✅ Normal |
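
The status column can be derived automatically from current vs. baseline values. A sketch for higher-is-worse metrics such as latency; the 20% warn and 50% critical cutoffs are illustrative assumptions, not part of the template:

```python
def status_flag(current, baseline, warn=0.20, critical=0.50):
    """Map relative regression vs. baseline to a report status."""
    change = (current - baseline) / baseline
    if change >= critical:
        return f"🔴 {change:+.0%}"
    if change >= warn:
        return f"⚠️ {change:+.0%}"
    return "✅ Within range"

print(status_flag(890, 450))  # p95 latency -> "🔴 +98%"
print(status_flag(145, 130))  # p50 latency -> "✅ Within range"
```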

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| No ground truth collection | Can’t measure actual production accuracy | Build delayed feedback pipeline |
| Monitoring only accuracy | Miss drift, latency, and cost issues | Monitor all 4 layers: data, model, infra, business |
| Alert fatigue | Too many non-actionable alerts | Tune thresholds, use severity levels |
| No baseline | Can’t detect degradation without “normal” | Establish baseline during validation, refresh quarterly |
| Ignoring confidence | Treating all predictions equally | Track confidence distribution and calibration (see the sketch below) |
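
One concrete way to validate calibration is expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its observed accuracy. A minimal sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)  # 1.0 if the prediction was right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the bin's share of samples
    return ece

# A well-calibrated model scores near 0; large values mean over/underconfidence.
```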

Checklist

  • Prediction logging: inputs, outputs, confidence, latency captured
  • Data drift: KS test + PSI on all input features, automated
  • Model performance: accuracy tracked against delayed ground truth
  • Latency: p50/p95/p99 with alerting thresholds
  • Confidence calibration validated
  • Cost tracking: per-prediction and monthly dashboards
  • Alert rules: accuracy drop, drift, latency, volume anomaly
  • A/B testing framework for safe model rollouts
  • Centralized dashboard across data/model/infra/business
  • Runbooks documented for each alert type
  • Retention policy for prediction logs

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For ML monitoring consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
