Model Serving Infrastructure at Scale

How to build reliable model serving infrastructure for production AI. Covers inference optimization, GPU orchestration, batching strategies, model routing, and cost management.

Serving ML models in production is fundamentally different from running inference in a notebook. In production, you’re managing GPU resources that cost $2-8/hour per card, handling bursty traffic that can spike 10x in minutes, maintaining latency SLAs while maximizing throughput, and operating across multiple model versions simultaneously. The infrastructure decisions you make here directly determine your AI product’s unit economics.

Most teams discover the hard way that model serving is 80% infrastructure engineering and 20% ML. The model itself is a small piece of a much larger system — and the infrastructure around it determines whether your AI product is profitable or a money pit.


The Serving Architecture

                    ┌──────────────┐
                    │     Load     │
                    │   Balancer   │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │    Model     │
                    │    Router    │
                    └──────┬───────┘
                ┌──────────┴──────────┐
          ┌─────▼──────┐        ┌─────▼──────┐
          │  Model A   │        │  Model B   │
          │ (Primary)  │        │ (Fallback) │
          └─────┬──────┘        └─────┬──────┘
          ┌─────▼──────┐        ┌─────▼──────┐
          │  GPU Pool  │        │  CPU Pool  │
          │ (T4/A10G)  │        │ (ARM/x86)  │
          └────────────┘        └────────────┘

Inference Optimization

Before scaling horizontally (more GPUs), optimize vertically (faster inference per GPU):

Quantization

Reduce model precision from FP32 → FP16 → INT8 → INT4. Each step roughly halves memory usage with modest accuracy trade-offs.

Precision   Memory   Speed    Accuracy Loss
FP32        1x       1x       Baseline
FP16        0.5x     1.5-2x   Negligible
INT8        0.25x    2-3x     < 1%
INT4        0.125x   3-4x     1-3%

For most production workloads, INT8 quantization hits the sweet spot: meaningful speedup with negligible quality impact.
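The memory column above is easy to sanity-check: weight footprint is roughly parameter count times bytes per parameter. A quick sketch for a hypothetical 7B-parameter model (weights only — activations and KV cache come on top):

```python
# Approximate weight memory at each precision. Illustrative only:
# real quantized formats add some overhead for scales and zero-points.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gib(num_params: int, precision: str) -> float:
    """Return approximate weight memory in GiB for the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 3)

params_7b = 7_000_000_000
for precision in ("FP32", "FP16", "INT8", "INT4"):
    print(f"{precision}: {weight_memory_gib(params_7b, precision):.1f} GiB")
```

This is also why a 16 GB T4 can hold a quantized 7B model (INT8 lands around 6.5 GiB) but not the FP32 version.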

Dynamic Batching

Individual inference requests are GPU-inefficient. Dynamic batching groups incoming requests and processes them together.

import asyncio
import time

class DynamicBatcher:
    """Groups concurrent inference requests into GPU-friendly batches."""

    def __init__(self, model, max_batch_size=32, max_wait_ms=50):
        self.model = model  # must expose an async batch_infer(inputs) method
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = asyncio.Queue()
        self.processing = False

    async def infer(self, input_data):
        # Each caller gets a future resolved when its batch completes.
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((input_data, future))

        if not self.processing:
            asyncio.create_task(self._process_batch())

        return await future

    async def _process_batch(self):
        self.processing = True
        batch = []
        deadline = time.monotonic() + self.max_wait_ms / 1000

        # Collect requests until the batch is full or the deadline passes.
        while len(batch) < self.max_batch_size and time.monotonic() < deadline:
            try:
                item = await asyncio.wait_for(
                    self.queue.get(),
                    timeout=max(0, deadline - time.monotonic())
                )
                batch.append(item)
            except asyncio.TimeoutError:
                break

        if batch:
            inputs = [item[0] for item in batch]
            results = await self.model.batch_infer(inputs)
            # Resolve each caller's future with its own result.
            for (_, future), result in zip(batch, results):
                future.set_result(result)

        self.processing = False

The trade-off: batching adds latency (up to max_wait_ms) but dramatically improves throughput. Tune max_wait_ms based on your SLA — 50ms is a good starting point for most applications.
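A back-of-envelope sketch makes the trade-off concrete. The cost model below (a fixed per-batch overhead plus a linear per-item cost) is an assumption for illustration, not a measurement — profile your own model:

```python
# Batching trade-off sketch: throughput vs. batch size under an
# assumed cost model of fixed overhead + linear per-item cost.
def throughput_rps(batch_size: int, overhead_ms: float = 20.0,
                   per_item_ms: float = 1.0) -> float:
    """Requests/second when batches are processed back to back."""
    batch_ms = overhead_ms + batch_size * per_item_ms
    return batch_size / (batch_ms / 1000.0)

print(f"batch=1:  {throughput_rps(1):.0f} rps")   # every request pays full overhead
print(f"batch=32: {throughput_rps(32):.0f} rps")  # overhead amortized across the batch
```

Under these assumed numbers, batching buys an order-of-magnitude throughput gain for at most 50 ms of added queueing latency per request.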

KV Cache Optimization

For autoregressive models (GPT-style), KV cache management is the primary bottleneck at scale. Techniques like PagedAttention (used in vLLM) can improve throughput 2-4x by managing KV cache memory like virtual memory pages.
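The scale of the problem is easy to estimate: per generated token, the cache stores one key and one value vector for every layer. A sketch with hypothetical Llama-2-7B-like dimensions (32 layers, 32 heads, head dim 128, FP16):

```python
# Estimate KV cache size for one sequence in an autoregressive
# transformer. Dimensions are hypothetical 7B-class values.
def kv_cache_bytes(seq_len: int, num_layers: int = 32, num_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache: a key and a value per layer, per head, per token."""
    return 2 * num_layers * num_heads * head_dim * dtype_bytes * seq_len

print(f"per token: {kv_cache_bytes(1) / 2**20:.2f} MiB")          # ~0.5 MiB
print(f"4096-token context: {kv_cache_bytes(4096) / 2**30:.1f} GiB")
```

At roughly half a megabyte per token, a few dozen concurrent 4K-context sequences exhaust even a large card — which is why fragmentation-free paged allocation pays off.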


Model Routing

Not every request needs your most expensive model. Route intelligently based on complexity, cost budget, and latency requirements.

class ModelRouter:
    def __init__(self):
        self.models = {
            "fast": SmallModel(),       # 7B, INT4, < 200ms
            "balanced": MediumModel(),  # 70B, INT8, < 2s  
            "premium": LargeModel(),    # API call, < 5s
        }
    
    async def route(self, request: InferenceRequest) -> str:
        # Simple requests → fast model
        if request.estimated_complexity < 0.3:
            return "fast"
        
        # Cost-sensitive requests → balanced model
        if request.cost_budget == "standard":
            return "balanced"
        
        # Complex or high-value requests → premium model
        return "premium"

At scale, intelligent routing can reduce inference costs by 40-60% while maintaining quality where it matters. The key metric: cost per quality-adjusted response.
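A quick sketch of where a figure in that range can come from. The per-tier costs and traffic shares below are made-up assumptions for illustration, not benchmarks:

```python
# Blended cost per request under routing vs. sending everything
# to the premium tier. All numbers are illustrative assumptions.
COST_PER_1K = {"fast": 0.20, "balanced": 1.00, "premium": 2.00}  # $ per 1K requests
TRAFFIC_SHARE = {"fast": 0.40, "balanced": 0.40, "premium": 0.20}

blended = sum(COST_PER_1K[t] * TRAFFIC_SHARE[t] for t in COST_PER_1K)
savings = 1 - blended / COST_PER_1K["premium"]
print(f"blended: ${blended:.2f}/1K, savings vs. all-premium: {savings:.0%}")
```

The savings come almost entirely from the traffic mix: the more requests the router can confidently keep on the cheap tier, the better the blended rate.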


GPU Resource Management

GPUs are expensive and scarce. Manage them like the premium resource they are.

Right-sizing: T4 GPUs ($0.50/hr) handle most INT8 inference workloads. Don’t default to A100s ($3/hr+) unless you need the memory or compute.

Auto-scaling: Scale GPU instances based on queue depth, not CPU metrics. A GPU instance at 30% utilization is wasting money; at 95% utilization, you’re risking latency SLA breaches.

Spot/Preemptible Instances: For batch inference (non-real-time), use spot instances at 60-70% discount. Build your batch pipeline to handle preemption gracefully.

Multi-tenancy: Run multiple small models on a single GPU using model multiplexing. A T4 with 16GB VRAM can serve 2-3 quantized 7B models simultaneously.
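The queue-depth scaling rule above can be sketched as a simple controller; the target of N queued requests per replica is an assumed tuning knob you would set from load tests:

```python
# Desired replica count from queue depth, clamped to a floor/ceiling.
# target_queue_per_replica is a tuning knob, not a universal constant.
import math

def desired_replicas(queue_depth: int, target_queue_per_replica: int = 8,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Scale so each replica sees about target_queue_per_replica queued requests."""
    want = math.ceil(queue_depth / target_queue_per_replica) if queue_depth else min_replicas
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(queue_depth=100))  # deep queue -> scale out
print(desired_replicas(queue_depth=0))    # idle -> scale to the floor
```

A real controller would also add hysteresis (scale down slowly, scale up fast) to avoid thrashing on bursty traffic.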


Monitoring and Alerting

Essential metrics for model serving infrastructure:

Metric             Alert Threshold    What It Indicates
P99 Latency        > 2x SLA           Infrastructure bottleneck
GPU Utilization    < 30% or > 90%     Over/under-provisioned
Queue Depth        > 100              Need more instances
Error Rate         > 1%               Model or infrastructure failure
Cost per Request   > budget           Routing or scaling issue
Batch Size (avg)   < 4                Batching not effective

Build dashboards that show cost and quality side by side. The goal isn’t maximum quality or minimum cost — it’s the optimal trade-off for your business.
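The thresholds above translate directly into alert rules. A minimal sketch — the metric names mirror the table, and the 500 ms SLA is an assumption:

```python
# Evaluate serving metrics against the alert thresholds above.
SLA_MS = 500  # assumed latency SLA

def alerts(metrics: dict) -> list[str]:
    """Return the names of metrics breaching their thresholds."""
    fired = []
    if metrics["p99_latency_ms"] > 2 * SLA_MS:
        fired.append("p99_latency")
    if not 0.30 <= metrics["gpu_utilization"] <= 0.90:
        fired.append("gpu_utilization")
    if metrics["queue_depth"] > 100:
        fired.append("queue_depth")
    if metrics["error_rate"] > 0.01:
        fired.append("error_rate")
    if metrics["avg_batch_size"] < 4:
        fired.append("batch_size")
    return fired

healthy = {"p99_latency_ms": 400, "gpu_utilization": 0.70,
           "queue_depth": 12, "error_rate": 0.002, "avg_batch_size": 16}
print(alerts(healthy))  # []
```

In practice these checks would live in your monitoring stack rather than application code, but the threshold logic is the same.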


Deployment Checklist

  1. Quantize models to INT8 minimum for production serving
  2. Implement dynamic batching with tuned wait times
  3. Build a model router for cost-aware request steering
  4. Use spot instances for batch workloads
  5. Right-size GPU instances based on actual memory and compute needs
  6. Monitor GPU utilization and auto-scale on queue depth
  7. Cache frequent requests — embedding lookups and common queries
  8. Load test at 2x expected peak before launch
  9. Plan for graceful degradation — fallback to smaller models under load
  10. Track cost per request as a first-class metric alongside latency and accuracy
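Item 7 on the checklist can start as simply as an LRU cache keyed on the request payload. A sketch using Python's stdlib — `embed` here is a hypothetical stand-in for a real model call:

```python
# Cache repeated embedding lookups with a stdlib LRU cache.
# `embed` is a hypothetical stand-in for a real model-server call.
from functools import lru_cache

CALLS = 0  # counts actual (non-cached) model invocations

@lru_cache(maxsize=10_000)
def embed(text: str) -> tuple:
    """Pretend embedding: in production this would hit the model server."""
    global CALLS
    CALLS += 1
    return tuple(float(ord(c)) for c in text[:4])  # dummy vector

embed("hello world")
embed("hello world")  # second call is served from cache
print(CALLS)  # 1
```

For real workloads you would use a shared cache (e.g. Redis) with a TTL so all replicas benefit, but the principle is identical: never pay GPU time twice for the same input.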
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
