ML Model Deployment Patterns
How to deploy ML models to production: serving architectures, model versioning, A/B testing of models, canary deployments, batch vs. real-time inference, and model rollback strategies.
Training an ML model is 20% of the work. Deploying it reliably, monitoring its performance, and updating it safely is the other 80%. Most ML projects fail not because the model is bad, but because the team can’t get it into production and keep it running. This guide covers practical deployment patterns for production ML.
Deployment Architecture Patterns
| Pattern | Latency | Cost | Best For |
|---|---|---|---|
| REST API | Medium (10-100ms) | Per-request compute | General-purpose, moderate traffic |
| gRPC | Low (1-10ms) | Per-request compute | High-throughput, internal services |
| Batch inference | High (hours) | Cost-efficient (spot instances) | Recommendations, reports |
| Streaming | Low (continuous) | Always-on | Real-time fraud detection, anomaly detection |
| Edge | Very low (local) | Device compute | Mobile, IoT, offline capability |
| Embedded | Zero network | Library size | Client-side ML, browser-based |
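For the REST API pattern, the serving endpoint reduces to a function that validates a request, runs the model, and returns a structured response. A framework-free sketch (the `score` model, field names, and response shape are hypothetical illustrations, not a standard):

```python
# Minimal sketch of a REST inference handler, independent of any web framework.
# The model below is a hypothetical stand-in; swap in your real predictor.

def score(features: dict) -> float:
    """Hypothetical fraud model: larger amounts score as riskier."""
    return min(1.0, features["amount"] / 10_000)

def handle_predict(request: dict) -> dict:
    """Validate input, run inference, return a structured response."""
    missing = [f for f in ("amount", "merchant_id") if f not in request]
    if missing:
        return {"status": 400, "error": f"missing fields: {missing}"}
    return {
        "status": 200,
        "model_version": "3.2.1",  # echo the serving version for traceability
        "fraud_score": round(score(request), 4),
    }
```

Echoing the model version in every response is cheap and makes incidents far easier to debug, since logs tie each prediction to the exact artifact that produced it.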
Model Serving Architecture
```
┌──────────┐     ┌──────────────┐     ┌──────────────┐
│   API    │────▶│ Model Router │────▶│   Model v2   │  90% traffic
│ Gateway  │     │              │     │ (production) │
│          │     │ • A/B test   │     └──────────────┘
│          │     │ • Canary     │
│          │     │ • Shadow     │     ┌──────────────┐
└──────────┘     │              │────▶│   Model v3   │  10% traffic
                 └──────────────┘     │   (canary)   │
                                      └──────────────┘
```
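The router in the diagram can be implemented as a sticky, hash-based traffic split: hashing a stable request key (user id, session id) means the same caller always lands on the same model version, which keeps A/B comparisons clean. A minimal sketch (the version names mirror the diagram; the 90/10 split is configurable):

```python
import hashlib

def route(request_id: str, canary_percent: int = 10) -> str:
    """Deterministically assign a request to production or canary.

    Hashing the request id (rather than random sampling) makes routing
    sticky: the same id always routes to the same model version.
    """
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "model_v3_canary" if bucket < canary_percent else "model_v2_production"
```

Stickiness matters for model A/B tests: a user who flips between versions mid-session produces inconsistent behavior and contaminates the comparison.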
Deployment Strategies
Canary Deployment
```yaml
model_deployment:
  strategy: canary
  stages:
    - name: shadow
      traffic: 0%        # Run model, don't serve results
      duration: 24h
      validation:
        - "latency_p99 < 200ms"
        - "error_rate < 0.1%"
    - name: canary
      traffic: 5%
      duration: 48h
      validation:
        - "accuracy >= baseline - 0.02"
        - "latency_p99 < 200ms"
        - "business_metric >= baseline"
    - name: partial
      traffic: 50%
      duration: 72h
      validation:
        - "all previous + revenue impact neutral"
    - name: full
      traffic: 100%
  rollback:
    automatic: true
    trigger: "any validation fails"
    target: "previous_stable_version"
```
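The stage gates above can be enforced in code: evaluate every validation rule against live metrics and trigger rollback on the first failure. A hand-rolled sketch of the canary-stage checks (metric names follow the config; the baseline values are illustrative):

```python
def validate_stage(metrics: dict, checks: list) -> list:
    """Return the names of failed checks; an empty list means the stage passes.

    Each check is a (metric_name, predicate) pair encoding one rule
    from the deployment config.
    """
    return [name for name, ok in checks if not ok(metrics[name])]

def decide(metrics: dict, checks: list) -> str:
    # Per the config: any failed validation triggers automatic rollback.
    failures = validate_stage(metrics, checks)
    return "rollback:previous_stable_version" if failures else "promote"

# Illustrative canary-stage rules, assuming a baseline accuracy of 0.94
# and business metrics normalized so that 1.0 == baseline.
canary_checks = [
    ("latency_p99_ms", lambda v: v < 200),
    ("accuracy", lambda v: v >= 0.94 - 0.02),
    ("business_metric", lambda v: v >= 1.0),
]
```

The point of expressing gates as data rather than ad-hoc `if` statements is that the same evaluator runs every stage, and the config stays the single source of truth.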
Model Versioning
| Component | Versioned? | How |
|---|---|---|
| Training data | Yes | DVC, LakeFS, or S3 versioned bucket |
| Feature pipeline | Yes | Git (code) + data version |
| Model artifact | Yes | MLflow, W&B, or model registry |
| Serving config | Yes | Git (inference config, preprocessing) |
| API contract | Yes | Semver for breaking input/output changes |
A model registry entry records metrics, lineage, and promotion history:

```json
{
  "model_name": "fraud_detector",
  "version": "3.2.1",
  "stage": "production",
  "metrics": {
    "auc_roc": 0.94,
    "precision_at_95_recall": 0.87,
    "inference_latency_p99_ms": 45
  },
  "training_data": "s3://data/fraud/v2024-03/",
  "trained_at": "2025-03-01T10:00:00Z",
  "promoted_at": "2025-03-05T14:00:00Z",
  "promoted_by": "ml-ci-pipeline"
}
```
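A registry with stage pointers is what makes fast rollback possible: promotion records the previous production version, so rolling back is a pointer swap rather than a redeploy or retrain. A toy in-memory sketch of the mechanics (a real deployment would use MLflow, W&B, or a similar registry):

```python
class ModelRegistry:
    """Toy in-memory model registry tracking the production pointer."""

    def __init__(self):
        self.entries = {}        # version -> metadata
        self.production = None   # currently serving version
        self.previous = None     # last known-good version, for rollback

    def register(self, version: str, metadata: dict):
        self.entries[version] = {**metadata, "stage": "staging"}

    def promote(self, version: str):
        if self.production:
            self.previous = self.production
            self.entries[self.production]["stage"] = "archived"
        self.entries[version]["stage"] = "production"
        self.production = version

    def rollback(self) -> str:
        """Swap the production pointer back: O(1), no retraining."""
        assert self.previous, "no previous version to roll back to"
        self.promote(self.previous)
        return self.production
```

Because rollback is just repromoting the previous entry, it satisfies the sub-5-minute rollback target in the checklist below as long as the old artifact is still loadable by the serving layer.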
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Big bang model swap | If new model is worse, all users affected | Canary deployment with gradual rollout |
| No model versioning | Can’t reproduce or rollback | Model registry with full lineage |
| Training on laptop, serving in cloud | Environment mismatch, “works on my machine” | Containerized training + serving |
| No shadow testing | First users hit bugs | Shadow mode: run new model, compare to production |
| Batch model applied to real-time | Stale predictions, high latency | Match serving pattern to latency requirements |
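The shadow-testing fix from the table works like this: run the candidate model on live traffic, serve only the production result, and log disagreements for offline review. A sketch with hypothetical stand-in models:

```python
def shadow_predict(request: dict, prod_model, shadow_model, log: list) -> float:
    """Serve the production prediction; run the shadow model on the side.

    The shadow result never reaches the user -- it is only logged so the
    two models can be compared on identical real-world inputs.
    """
    served = prod_model(request)
    try:
        candidate = shadow_model(request)
        log.append({"prod": served, "shadow": candidate,
                    "diff": abs(served - candidate)})
    except Exception as exc:  # a shadow failure must never break serving
        log.append({"prod": served, "shadow_error": str(exc)})
    return served
```

Swallowing shadow-model exceptions is deliberate: the candidate is unproven by definition, so its failures belong in the log, not in the user's response path. In a real system the shadow call would also run asynchronously so it cannot add latency.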
Checklist
- Serving pattern selected (API, batch, streaming, edge)
- Model registry with versioning and lineage
- Canary deployment with automated validation
- Shadow testing before any production traffic
- Rollback: automated, < 5 minutes to previous version
- Monitoring: prediction distribution, latency, data drift
- A/B testing framework for model comparison
- Resource scaling: auto-scale based on inference load
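The data-drift item in the checklist can be made concrete with a population stability index (PSI) over binned prediction scores. As a common rule of thumb (not a standard), PSI below about 0.1 is treated as stable and above about 0.25 as drift worth investigating. A dependency-free sketch, assuming scores in [0, 1]:

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population stability index between two samples of scores in [0, 1]."""
    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int(x * bins), bins - 1)] += 1
        # Floor empty bins at a small value so the log term is always defined.
        return [max(c / len(sample), 1e-4) for c in counts]
    p, q = fractions(expected), fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

`expected` would typically be the score distribution at deployment time and `actual` a recent window of production scores; a PSI alert is a signal to inspect inputs and consider retraining, not an automatic rollback trigger.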
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For ML deployment consulting, visit garnetgrid.com. :::