Model Serving Infrastructure at Scale
How to build reliable model serving infrastructure for production AI. Covers inference optimization, GPU orchestration, batching strategies, model routing, and cost management.
Serving ML models in production is fundamentally different from running inference in a notebook. In production, you’re managing GPU resources that cost $2-8/hour per card, handling bursty traffic that can spike 10x in minutes, maintaining latency SLAs while maximizing throughput, and operating across multiple model versions simultaneously. The infrastructure decisions you make here directly determine your AI product’s unit economics.
Most teams discover the hard way that model serving is 80% infrastructure engineering and 20% ML. The model itself is a small piece of a much larger system — and the infrastructure around it determines whether your AI product is profitable or a money pit.
The Serving Architecture
```
            ┌──────────────┐
            │     Load     │
            │   Balancer   │
            └──────┬───────┘
                   │
            ┌──────▼───────┐
            │    Model     │
            │    Router    │
            └──────┬───────┘
          ┌────────┴────────┐
    ┌─────▼──────┐    ┌─────▼──────┐
    │  Model A   │    │  Model B   │
    │ (Primary)  │    │ (Fallback) │
    └─────┬──────┘    └─────┬──────┘
    ┌─────▼──────┐    ┌─────▼──────┐
    │  GPU Pool  │    │  CPU Pool  │
    │ (T4/A10G)  │    │ (ARM/x86)  │
    └────────────┘    └────────────┘
```
Inference Optimization
Before scaling horizontally (more GPUs), optimize vertically (faster inference per GPU):
Quantization
Reduce model precision from FP32 → FP16 → INT8 → INT4. Each step roughly halves memory usage with modest accuracy trade-offs.
| Precision | Memory | Speed | Accuracy Loss |
|---|---|---|---|
| FP32 | 1x | 1x | Baseline |
| FP16 | 0.5x | 1.5-2x | Negligible |
| INT8 | 0.25x | 2-3x | < 1% |
| INT4 | 0.125x | 3-4x | 1-3% |
For most production workloads, INT8 quantization hits the sweet spot: meaningful speedup with negligible quality impact.
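To make this concrete, here is a minimal sketch of post-training dynamic quantization in PyTorch. The toy model is illustrative; for LLM-scale models you would more likely reach for a weight-only scheme such as GPTQ or AWQ, but the core mechanic — swapping FP32 weights for INT8 — is the same idea:

```python
import torch

# Minimal sketch: post-training dynamic quantization with PyTorch.
# The model below is a toy stand-in; any nn.Module with Linear layers works.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)
model.eval()

# Linear weights are stored as INT8; activations stay FP32 and are
# quantized on the fly, so no calibration dataset is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized)
```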
Dynamic Batching
Individual inference requests are GPU-inefficient. Dynamic batching groups incoming requests and processes them together.
```python
import asyncio
import time


class DynamicBatcher:
    def __init__(self, model, max_batch_size=32, max_wait_ms=50):
        self.model = model  # must expose an async batch_infer(inputs) method
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = asyncio.Queue()
        self.processing = False

    async def infer(self, input_data):
        # Enqueue the request with a future the caller can await.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((input_data, future))
        if not self.processing:
            asyncio.create_task(self._process_batch())
        return await future

    async def _process_batch(self):
        self.processing = True
        try:
            batch = []
            # Collect requests until the batch is full or the wait window closes.
            deadline = time.monotonic() + self.max_wait_ms / 1000
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), timeout=timeout)
                    batch.append(item)
                except asyncio.TimeoutError:
                    break
            if batch:
                inputs = [item[0] for item in batch]
                results = await self.model.batch_infer(inputs)
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
        finally:
            self.processing = False
            # Requests that arrived after the deadline would otherwise sit
            # stranded until the next infer() call; start another batch.
            if not self.queue.empty():
                asyncio.create_task(self._process_batch())
```
The trade-off: batching adds latency (up to max_wait_ms) but dramatically improves throughput. Tune max_wait_ms based on your SLA — 50ms is a good starting point for most applications.
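Usage looks like this — a sketch with a hypothetical EchoModel standing in for anything that exposes the async batch_infer method the batcher expects:

```python
import asyncio


class EchoModel:
    # Hypothetical stand-in model: batch_infer just echoes its inputs.
    async def batch_infer(self, inputs):
        await asyncio.sleep(0.01)  # simulate one GPU forward pass
        return [f"result:{x}" for x in inputs]


async def main():
    batcher = DynamicBatcher(EchoModel(), max_batch_size=32, max_wait_ms=50)
    # 100 concurrent requests are coalesced into a handful of batches.
    results = await asyncio.gather(*(batcher.infer(i) for i in range(100)))
    print(results[:3])


asyncio.run(main())
```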
KV Cache Optimization
For autoregressive models (GPT-style), KV cache management is the primary bottleneck at scale. Techniques like PagedAttention (used in vLLM) can improve throughput 2-4x by managing KV cache memory like virtual memory pages.
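If you would rather not build this yourself, vLLM packages PagedAttention behind a simple interface. A minimal offline-batching sketch (the model name is chosen only for illustration):

```python
from vllm import LLM, SamplingParams

# Minimal sketch of vLLM's offline batch interface. PagedAttention is
# used internally; no extra configuration is needed to benefit from it.
llm = LLM(model="facebook/opt-125m")  # small model for illustration
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Explain dynamic batching in one sentence.",
    "What is a KV cache?",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```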
Model Routing
Not every request needs your most expensive model. Route intelligently based on complexity, cost budget, and latency requirements.
```python
class ModelRouter:
    def __init__(self):
        self.models = {
            "fast": SmallModel(),       # 7B, INT4, < 200ms
            "balanced": MediumModel(),  # 70B, INT8, < 2s
            "premium": LargeModel(),    # API call, < 5s
        }

    async def route(self, request: InferenceRequest) -> str:
        # Simple requests → fast model
        if request.estimated_complexity < 0.3:
            return "fast"
        # Cost-sensitive requests → balanced model
        if request.cost_budget == "standard":
            return "balanced"
        # Complex or high-value requests → premium model
        return "premium"
```
At scale, intelligent routing can reduce inference costs by 40-60% while maintaining quality where it matters. The key metric: cost per quality-adjusted response.
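A sketch of that metric, assuming your eval pipeline already produces a per-response quality score in [0, 1] (human ratings, task success rate, or similar):

```python
def cost_per_quality_adjusted_response(total_cost_usd, quality_scores):
    # quality_scores: one score in [0, 1] per response, from your eval pipeline.
    # Dividing cost by summed quality penalizes cheap-but-bad routing:
    # a router that halves cost but also halves quality gains nothing.
    quality_mass = sum(quality_scores)
    if quality_mass == 0:
        return float("inf")
    return total_cost_usd / quality_mass


# Example: premium-only vs. routed traffic over the same 1,000 requests.
premium = cost_per_quality_adjusted_response(50.0, [0.95] * 1000)  # ~$0.053
routed = cost_per_quality_adjusted_response(22.0, [0.90] * 1000)   # ~$0.024
print(premium, routed)
```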
GPU Resource Management
GPUs are expensive and scarce. Manage them like the premium resource they are.
Right-sizing: T4 GPUs ($0.50/hr) handle most INT8 inference workloads. Don’t default to A100s ($3/hr+) unless you need the memory or compute.
Auto-scaling: Scale GPU instances based on queue depth, not CPU metrics. A GPU instance at 30% utilization is wasting money; at 95% utilization, you’re risking latency SLA breaches. (A scaling sketch follows this list.)
Spot/Preemptible Instances: For batch inference (non-real-time), use spot instances at 60-70% discount. Build your batch pipeline to handle preemption gracefully.
Multi-tenancy: Run multiple small models on a single GPU using model multiplexing. A T4 with 16GB VRAM can serve 2-3 quantized 7B models simultaneously.
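Here is a minimal sketch of the queue-depth scaling policy referenced above; the thresholds and damping are illustrative defaults, not tuned values:

```python
import math


def desired_gpu_replicas(queue_depth, current_replicas,
                         target_depth_per_replica=20,
                         min_replicas=1, max_replicas=16):
    # Scale on queue depth: how many replicas would keep the per-replica
    # backlog at the target? CPU metrics say nothing about GPU saturation.
    desired = math.ceil(queue_depth / target_depth_per_replica)
    # Damp scale-down: drop at most one replica per evaluation cycle
    # to avoid thrashing on bursty traffic.
    if desired < current_replicas:
        desired = current_replicas - 1
    return max(min_replicas, min(max_replicas, desired))


# 250 queued requests with 4 replicas -> scale to 13 (ceil(250 / 20)).
print(desired_gpu_replicas(250, 4))
```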
Monitoring and Alerting
Essential metrics for model serving infrastructure:
| Metric | Alert Threshold | What It Indicates |
|---|---|---|
| P99 Latency | > 2x SLA | Infrastructure bottleneck |
| GPU Utilization | < 30% or > 90% | Over/under-provisioned |
| Queue Depth | > 100 | Need more instances |
| Error Rate | > 1% | Model or infrastructure failure |
| Cost per Request | > budget | Routing or scaling issue |
| Batch Size (avg) | < 4 | Batching not effective |
Build dashboards that show cost and quality side by side. The goal isn’t maximum quality or minimum cost — it’s the optimal trade-off for your business.
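A minimal instrumentation sketch using the prometheus_client library; the metric names are illustrative and should follow your existing conventions:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Illustrative metric names; align them with your naming conventions.
LATENCY = Histogram("inference_latency_seconds", "End-to-end request latency")
QUEUE_DEPTH = Gauge("inference_queue_depth", "Requests waiting for a batch")
BATCH_SIZE = Histogram("inference_batch_size", "Requests per executed batch")
COST = Counter("inference_cost_usd_total", "Accumulated serving cost")

start_http_server(9090)  # expose /metrics for Prometheus to scrape


def record_request(latency_s, batch_size, cost_usd):
    LATENCY.observe(latency_s)
    BATCH_SIZE.observe(batch_size)
    COST.inc(cost_usd)
```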
Deployment Checklist
- Quantize models to INT8 minimum for production serving
- Implement dynamic batching with tuned wait times
- Build a model router for cost-aware request steering
- Use spot instances for batch workloads
- Right-size GPU instances based on actual memory and compute needs
- Monitor GPU utilization and auto-scale on queue depth
- Cache frequent requests — embedding lookups and common queries
- Load test at 2x expected peak before launch
- Plan for graceful degradation — fallback to smaller models under load
- Track cost per request as a first-class metric alongside latency and accuracy