ML Model Deployment Patterns
How to deploy ML models to production: serving architectures, model versioning, A/B testing of models, canary deployments, batch vs. real-time inference, and model rollback strategies.
Training an ML model is 20% of the work. Deploying it reliably, monitoring its performance, and updating it safely is the other 80%. Most ML projects fail not because the model is bad, but because the team can’t get it into production and keep it running. This guide covers practical deployment patterns for production ML.
Deployment Architecture Patterns
| Pattern | Latency | Cost | Best For |
|---|---|---|---|
| REST API | Medium (10-100ms) | Per-request compute | General-purpose, moderate traffic |
| gRPC | Low (1-10ms) | Per-request compute | High-throughput, internal services |
| Batch inference | High (hours) | Cost-efficient (spot instances) | Recommendations, reports |
| Streaming | Low (continuous) | Always-on | Real-time fraud detection, anomaly detection |
| Edge | Very low (local) | Device compute | Mobile, IoT, offline capability |
| Embedded | Zero network | Library size | Client-side ML, browser-based |
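For the REST API pattern, the serving endpoint reduces to a function that validates a request, runs the model, and returns a structured response. A framework-free sketch (the `score` model, field names, and response shape are hypothetical illustrations, not a standard):

```python
# Minimal sketch of a REST inference handler, independent of any web framework.
# The model below is a hypothetical stand-in; swap in your real predictor.

def score(features: dict) -> float:
    """Hypothetical fraud model: larger amounts score as riskier."""
    return min(1.0, features["amount"] / 10_000)

def handle_predict(request: dict) -> dict:
    """Validate input, run inference, return a structured response."""
    missing = [f for f in ("amount", "merchant_id") if f not in request]
    if missing:
        return {"status": 400, "error": f"missing fields: {missing}"}
    return {
        "status": 200,
        "model_version": "3.2.1",  # echo the serving version for traceability
        "fraud_score": round(score(request), 4),
    }
```

Echoing the model version in every response is cheap and makes incidents far easier to debug, since logs tie each prediction to the exact artifact that produced it.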
Model Serving Architecture
```
┌──────────┐     ┌──────────────┐     ┌──────────────┐
│   API    │────▶│ Model Router │────▶│   Model v2   │  90% traffic
│ Gateway  │     │              │     │ (production) │
│          │     │ • A/B test   │     └──────────────┘
│          │     │ • Canary     │
│          │     │ • Shadow     │     ┌──────────────┐
└──────────┘     │              │────▶│   Model v3   │  10% traffic
                 └──────────────┘     │   (canary)   │
                                      └──────────────┘
```
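The router in the diagram can be implemented as a sticky, hash-based traffic split: hashing a stable request key (user id, session id) means the same caller always lands on the same model version, which keeps A/B comparisons clean. A minimal sketch (the version names mirror the diagram; the 90/10 split is configurable):

```python
import hashlib

def route(request_id: str, canary_percent: int = 10) -> str:
    """Deterministically assign a request to production or canary.

    Hashing the request id (rather than random sampling) makes routing
    sticky: the same id always routes to the same model version.
    """
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "model_v3_canary" if bucket < canary_percent else "model_v2_production"
```

Stickiness matters for model A/B tests: a user who flips between versions mid-session produces inconsistent behavior and contaminates the comparison.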
Deployment Strategies
Canary Deployment
```yaml
model_deployment:
  strategy: canary
  stages:
    - name: shadow
      traffic: 0%        # Run model, don't serve results
      duration: 24h
      validation:
        - "latency_p99 < 200ms"
        - "error_rate < 0.1%"
    - name: canary
      traffic: 5%
      duration: 48h
      validation:
        - "accuracy >= baseline - 0.02"
        - "latency_p99 < 200ms"
        - "business_metric >= baseline"
    - name: partial
      traffic: 50%
      duration: 72h
      validation:
        - "all previous + revenue impact neutral"
    - name: full
      traffic: 100%
  rollback:
    automatic: true
    trigger: "any validation fails"
    target: "previous_stable_version"
```
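The stage gates above can be enforced in code: evaluate every validation rule against live metrics and trigger rollback on the first failure. A hand-rolled sketch of the canary-stage checks (metric names follow the config; the baseline values are illustrative):

```python
def validate_stage(metrics: dict, checks: list) -> list:
    """Return the names of failed checks; an empty list means the stage passes.

    Each check is a (metric_name, predicate) pair encoding one rule
    from the deployment config.
    """
    return [name for name, ok in checks if not ok(metrics[name])]

def decide(metrics: dict, checks: list) -> str:
    # Per the config: any failed validation triggers automatic rollback.
    failures = validate_stage(metrics, checks)
    return "rollback:previous_stable_version" if failures else "promote"

# Illustrative canary-stage rules, assuming a baseline accuracy of 0.94
# and business metrics normalized so that 1.0 == baseline.
canary_checks = [
    ("latency_p99_ms", lambda v: v < 200),
    ("accuracy", lambda v: v >= 0.94 - 0.02),
    ("business_metric", lambda v: v >= 1.0),
]
```

The point of expressing gates as data rather than ad-hoc `if` statements is that the same evaluator runs every stage, and the config stays the single source of truth.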
Model Versioning
| Component | Versioned? | How |
|---|---|---|
| Training data | Yes | DVC, LakeFS, or S3 versioned bucket |
| Feature pipeline | Yes | Git (code) + data version |
| Model artifact | Yes | MLflow, W&B, or model registry |
| Serving config | Yes | Git (inference config, preprocessing) |
| API contract | Yes | Semver for breaking input/output changes |
A model registry entry records metrics, lineage, and promotion history:

```json
{
  "model_name": "fraud_detector",
  "version": "3.2.1",
  "stage": "production",
  "metrics": {
    "auc_roc": 0.94,
    "precision_at_95_recall": 0.87,
    "inference_latency_p99_ms": 45
  },
  "training_data": "s3://data/fraud/v2024-03/",
  "trained_at": "2025-03-01T10:00:00Z",
  "promoted_at": "2025-03-05T14:00:00Z",
  "promoted_by": "ml-ci-pipeline"
}
```
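A registry with stage pointers is what makes fast rollback possible: promotion records the previous production version, so rolling back is a pointer swap rather than a redeploy or retrain. A toy in-memory sketch of the mechanics (a real deployment would use MLflow, W&B, or a similar registry):

```python
class ModelRegistry:
    """Toy in-memory model registry tracking the production pointer."""

    def __init__(self):
        self.entries = {}        # version -> metadata
        self.production = None   # currently serving version
        self.previous = None     # last known-good version, for rollback

    def register(self, version: str, metadata: dict):
        self.entries[version] = {**metadata, "stage": "staging"}

    def promote(self, version: str):
        if self.production:
            self.previous = self.production
            self.entries[self.production]["stage"] = "archived"
        self.entries[version]["stage"] = "production"
        self.production = version

    def rollback(self) -> str:
        """Swap the production pointer back: O(1), no retraining."""
        assert self.previous, "no previous version to roll back to"
        self.promote(self.previous)
        return self.production
```

Because rollback is just repromoting the previous entry, it satisfies the sub-5-minute rollback target in the checklist below as long as the old artifact is still loadable by the serving layer.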
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Big bang model swap | If new model is worse, all users affected | Canary deployment with gradual rollout |
| No model versioning | Can’t reproduce or rollback | Model registry with full lineage |
| Training on laptop, serving in cloud | Environment mismatch, “works on my machine” | Containerized training + serving |
| No shadow testing | First users hit bugs | Shadow mode: run new model, compare to production |
| Batch model applied to real-time | Stale predictions, high latency | Match serving pattern to latency requirements |
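The shadow-testing fix from the table works like this: run the candidate model on live traffic, serve only the production result, and log disagreements for offline review. A sketch with hypothetical stand-in models:

```python
def shadow_predict(request: dict, prod_model, shadow_model, log: list) -> float:
    """Serve the production prediction; run the shadow model on the side.

    The shadow result never reaches the user -- it is only logged so the
    two models can be compared on identical real-world inputs.
    """
    served = prod_model(request)
    try:
        candidate = shadow_model(request)
        log.append({"prod": served, "shadow": candidate,
                    "diff": abs(served - candidate)})
    except Exception as exc:  # a shadow failure must never break serving
        log.append({"prod": served, "shadow_error": str(exc)})
    return served
```

Swallowing shadow-model exceptions is deliberate: the candidate is unproven by definition, so its failures belong in the log, not in the user's response path. In a real system the shadow call would also run asynchronously so it cannot add latency.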
Checklist
- Serving pattern selected (API, batch, streaming, edge)
- Model registry with versioning and lineage
- Canary deployment with automated validation
- Shadow testing before any production traffic
- Rollback: automated, < 5 minutes to previous version
- Monitoring: prediction distribution, latency, data drift
- A/B testing framework for model comparison
- Resource scaling: auto-scale based on inference load
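The data-drift item in the checklist can be made concrete with a population stability index (PSI) over binned prediction scores. As a common rule of thumb (not a standard), PSI below about 0.1 is treated as stable and above about 0.25 as drift worth investigating. A dependency-free sketch, assuming scores in [0, 1]:

```python
import math

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """Population stability index between two samples of scores in [0, 1]."""
    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int(x * bins), bins - 1)] += 1
        # Floor empty bins at a small value so the log term is always defined.
        return [max(c / len(sample), 1e-4) for c in counts]
    p, q = fractions(expected), fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

`expected` would typically be the score distribution at deployment time and `actual` a recent window of production scores; a PSI alert is a signal to inspect inputs and consider retraining, not an automatic rollback trigger.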
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For ML deployment consulting, visit garnetgrid.com. :::