Model Serving Infrastructure at Scale
How to build reliable model serving infrastructure for production AI. Covers inference optimization, GPU orchestration, batching strategies, model routing, and cost management.
Serving ML models in production is fundamentally different from running inference in a notebook. In production, you’re managing GPU resources that cost $2-8/hour per card, handling bursty traffic that can spike 10x in minutes, maintaining latency SLAs while maximizing throughput, and operating across multiple model versions simultaneously. The infrastructure decisions you make here directly determine your AI product’s unit economics.
Most teams discover the hard way that model serving is 80% infrastructure engineering and 20% ML. The model itself is a small piece of a much larger system — and the infrastructure around it determines whether your AI product is profitable or a money pit.
The Serving Architecture
```
            ┌──────────────┐
            │     Load     │
            │   Balancer   │
            └──────┬───────┘
                   │
            ┌──────▼───────┐
            │    Model     │
            │    Router    │
            └──────┬───────┘
          ┌────────┴────────┐
    ┌─────▼──────┐    ┌─────▼──────┐
    │  Model A   │    │  Model B   │
    │ (Primary)  │    │ (Fallback) │
    └─────┬──────┘    └─────┬──────┘
    ┌─────▼──────┐    ┌─────▼──────┐
    │  GPU Pool  │    │  CPU Pool  │
    │ (T4/A10G)  │    │ (ARM/x86)  │
    └────────────┘    └────────────┘
```
Inference Optimization
Before scaling horizontally (more GPUs), optimize vertically (faster inference per GPU):
Quantization
Reduce model precision from FP32 → FP16 → INT8 → INT4. Each step roughly halves memory usage with modest accuracy trade-offs.
| Precision | Memory | Speed | Accuracy Loss |
|---|---|---|---|
| FP32 | 1x | 1x | Baseline |
| FP16 | 0.5x | 1.5-2x | Negligible |
| INT8 | 0.25x | 2-3x | < 1% |
| INT4 | 0.125x | 3-4x | 1-3% |
For most production workloads, INT8 quantization hits the sweet spot: meaningful speedup with negligible quality impact.
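To make this concrete, here is a minimal sketch of post-training dynamic quantization in PyTorch. The toy model is illustrative; for LLM-scale models you would more likely reach for a weight-only scheme such as GPTQ or AWQ, but the core mechanic — swapping FP32 weights for INT8 — is the same idea:

```python
import torch

# Minimal sketch: post-training dynamic quantization with PyTorch.
# The model below is a toy stand-in; any nn.Module with Linear layers works.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)
model.eval()

# Linear weights are stored as INT8; activations stay FP32 and are
# quantized on the fly, so no calibration dataset is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized)
```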
Dynamic Batching
Individual inference requests are GPU-inefficient. Dynamic batching groups incoming requests and processes them together.
```python
import asyncio
import time


class DynamicBatcher:
    def __init__(self, model, max_batch_size=32, max_wait_ms=50):
        self.model = model  # must expose an async batch_infer(inputs) method
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = asyncio.Queue()
        self.processing = False

    async def infer(self, input_data):
        # Enqueue the request with a future the caller can await.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((input_data, future))
        if not self.processing:
            asyncio.create_task(self._process_batch())
        return await future

    async def _process_batch(self):
        self.processing = True
        try:
            batch = []
            # Collect requests until the batch is full or the wait window closes.
            deadline = time.monotonic() + self.max_wait_ms / 1000
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self.queue.get(), timeout=timeout)
                    batch.append(item)
                except asyncio.TimeoutError:
                    break
            if batch:
                inputs = [item[0] for item in batch]
                results = await self.model.batch_infer(inputs)
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
        finally:
            self.processing = False
            # Requests that arrived after the deadline would otherwise sit
            # stranded until the next infer() call; start another batch.
            if not self.queue.empty():
                asyncio.create_task(self._process_batch())
```
The trade-off: batching adds latency (up to max_wait_ms) but dramatically improves throughput. Tune max_wait_ms based on your SLA — 50ms is a good starting point for most applications.
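Usage looks like this — a sketch with a hypothetical EchoModel standing in for anything that exposes the async batch_infer method the batcher expects:

```python
import asyncio


class EchoModel:
    # Hypothetical stand-in model: batch_infer just echoes its inputs.
    async def batch_infer(self, inputs):
        await asyncio.sleep(0.01)  # simulate one GPU forward pass
        return [f"result:{x}" for x in inputs]


async def main():
    batcher = DynamicBatcher(EchoModel(), max_batch_size=32, max_wait_ms=50)
    # 100 concurrent requests are coalesced into a handful of batches.
    results = await asyncio.gather(*(batcher.infer(i) for i in range(100)))
    print(results[:3])


asyncio.run(main())
```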
KV Cache Optimization
For autoregressive models (GPT-style), KV cache management is the primary bottleneck at scale. Techniques like PagedAttention (used in vLLM) can improve throughput 2-4x by managing KV cache memory like virtual memory pages.
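If you would rather not build this yourself, vLLM packages PagedAttention behind a simple interface. A minimal offline-batching sketch (the model name is chosen only for illustration):

```python
from vllm import LLM, SamplingParams

# Minimal sketch of vLLM's offline batch interface. PagedAttention is
# used internally; no extra configuration is needed to benefit from it.
llm = LLM(model="facebook/opt-125m")  # small model for illustration
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Explain dynamic batching in one sentence.",
    "What is a KV cache?",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```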
Model Routing
Not every request needs your most expensive model. Route intelligently based on complexity, cost budget, and latency requirements.
```python
class ModelRouter:
    def __init__(self):
        self.models = {
            "fast": SmallModel(),       # 7B, INT4, < 200ms
            "balanced": MediumModel(),  # 70B, INT8, < 2s
            "premium": LargeModel(),    # API call, < 5s
        }

    async def route(self, request: InferenceRequest) -> str:
        # Simple requests → fast model
        if request.estimated_complexity < 0.3:
            return "fast"
        # Cost-sensitive requests → balanced model
        if request.cost_budget == "standard":
            return "balanced"
        # Complex or high-value requests → premium model
        return "premium"
```
At scale, intelligent routing can reduce inference costs by 40-60% while maintaining quality where it matters. The key metric: cost per quality-adjusted response.
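A sketch of that metric, assuming your eval pipeline already produces a per-response quality score in [0, 1] (human ratings, task success rate, or similar):

```python
def cost_per_quality_adjusted_response(total_cost_usd, quality_scores):
    # quality_scores: one score in [0, 1] per response, from your eval pipeline.
    # Dividing cost by summed quality penalizes cheap-but-bad routing:
    # a router that halves cost but also halves quality gains nothing.
    quality_mass = sum(quality_scores)
    if quality_mass == 0:
        return float("inf")
    return total_cost_usd / quality_mass


# Example: premium-only vs. routed traffic over the same 1,000 requests.
premium = cost_per_quality_adjusted_response(50.0, [0.95] * 1000)  # ~$0.053
routed = cost_per_quality_adjusted_response(22.0, [0.90] * 1000)   # ~$0.024
print(premium, routed)
```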
GPU Resource Management
GPUs are expensive and scarce. Manage them like the premium resource they are.
Right-sizing: T4 GPUs ($0.50/hr) handle most INT8 inference workloads. Don’t default to A100s ($3/hr+) unless you need the memory or compute.
Auto-scaling: Scale GPU instances based on queue depth, not CPU metrics. A GPU instance at 30% utilization is wasting money; at 95% utilization, you’re risking latency SLA breaches. (A scaling sketch follows this list.)
Spot/Preemptible Instances: For batch inference (non-real-time), use spot instances at 60-70% discount. Build your batch pipeline to handle preemption gracefully.
Multi-tenancy: Run multiple small models on a single GPU using model multiplexing. A T4 with 16GB VRAM can serve 2-3 quantized 7B models simultaneously.
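Here is a minimal sketch of the queue-depth scaling policy referenced above; the thresholds and damping are illustrative defaults, not tuned values:

```python
import math


def desired_gpu_replicas(queue_depth, current_replicas,
                         target_depth_per_replica=20,
                         min_replicas=1, max_replicas=16):
    # Scale on queue depth: how many replicas would keep the per-replica
    # backlog at the target? CPU metrics say nothing about GPU saturation.
    desired = math.ceil(queue_depth / target_depth_per_replica)
    # Damp scale-down: drop at most one replica per evaluation cycle
    # to avoid thrashing on bursty traffic.
    if desired < current_replicas:
        desired = current_replicas - 1
    return max(min_replicas, min(max_replicas, desired))


# 250 queued requests with 4 replicas -> scale to 13 (ceil(250 / 20)).
print(desired_gpu_replicas(250, 4))
```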
Monitoring and Alerting
Essential metrics for model serving infrastructure:
| Metric | Alert Threshold | What It Indicates |
|---|---|---|
| P99 Latency | > 2x SLA | Infrastructure bottleneck |
| GPU Utilization | < 30% or > 90% | Over/under-provisioned |
| Queue Depth | > 100 | Need more instances |
| Error Rate | > 1% | Model or infrastructure failure |
| Cost per Request | > budget | Routing or scaling issue |
| Batch Size (avg) | < 4 | Batching not effective |
Build dashboards that show cost and quality side by side. The goal isn’t maximum quality or minimum cost — it’s the optimal trade-off for your business.
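A minimal instrumentation sketch using the prometheus_client library; the metric names are illustrative and should follow your existing conventions:

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Illustrative metric names; align them with your naming conventions.
LATENCY = Histogram("inference_latency_seconds", "End-to-end request latency")
QUEUE_DEPTH = Gauge("inference_queue_depth", "Requests waiting for a batch")
BATCH_SIZE = Histogram("inference_batch_size", "Requests per executed batch")
COST = Counter("inference_cost_usd_total", "Accumulated serving cost")

start_http_server(9090)  # expose /metrics for Prometheus to scrape


def record_request(latency_s, batch_size, cost_usd):
    LATENCY.observe(latency_s)
    BATCH_SIZE.observe(batch_size)
    COST.inc(cost_usd)
```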
Deployment Checklist
- Quantize models to INT8 minimum for production serving
- Implement dynamic batching with tuned wait times
- Build a model router for cost-aware request steering
- Use spot instances for batch workloads
- Right-size GPU instances based on actual memory and compute needs
- Monitor GPU utilization and auto-scale on queue depth
- Cache frequent requests — embedding lookups and common queries
- Load test at 2x expected peak before launch
- Plan for graceful degradation — fallback to smaller models under load
- Track cost per request as a first-class metric alongside latency and accuracy