LLM Application Architecture: Beyond the API Call
Design production LLM applications that are reliable, cost-efficient, and maintainable. Covers prompt engineering patterns, model routing, caching strategies, evaluation frameworks, and the operational patterns for running LLMs at scale.
Building an LLM application looks easy: call the API, get a response, show it to the user. Ship it in a weekend. Then production happens. Latency spikes to 8 seconds. Costs hit $5,000/month. The model hallucinates critical information. Users find prompt injection vulnerabilities. The model provider changes their API and your application breaks.
This guide covers the architecture patterns for building LLM applications that survive contact with production.
The Production LLM Stack
```
┌─────────────────────────────────────────────┐
│ USER INTERFACE                              │
│ (Streaming response, loading states)        │
├─────────────────────────────────────────────┤
│ APPLICATION LAYER                           │
│ ├─ Input validation & guardrails            │
│ ├─ Prompt construction & templating         │
│ ├─ Context injection (RAG, memory)          │
│ └─ Output parsing & validation              │
├─────────────────────────────────────────────┤
│ ROUTING & ORCHESTRATION                     │
│ ├─ Model selection (GPT-4 vs Claude vs local)│
│ ├─ Fallback chains                          │
│ ├─ Rate limiting & queuing                  │
│ └─ Caching layer                            │
├─────────────────────────────────────────────┤
│ MODEL PROVIDERS                             │
│ ├─ OpenAI ├─ Anthropic ├─ Local (Ollama)    │
│ └─ Fallback provider                        │
├─────────────────────────────────────────────┤
│ OBSERVABILITY                               │
│ ├─ Prompt/response logging                  │
│ ├─ Token usage & cost tracking              │
│ ├─ Latency monitoring                       │
│ └─ Quality evaluation                       │
└─────────────────────────────────────────────┘
```
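The interface layer above lists streaming as a first concern: showing tokens as they arrive hides latency that would otherwise leave users staring at a spinner for the full generation time. A minimal sketch against the OpenAI Python SDK's `stream=True` mode; `on_token` is a hypothetical callback that pushes each delta to the client (e.g. over SSE or a websocket):

```python
def stream_reply(client, model: str, messages: list, on_token) -> str:
    """Stream a completion token-by-token, returning the full text at the end."""
    chunks = []
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (role updates, finish signals) carry no content
            chunks.append(delta)
            on_token(delta)  # push to the UI as it arrives
    return "".join(chunks)
```

The same function serves both paths: the caller gets the complete text for logging and caching, while the UI gets incremental updates.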
Prompt Engineering Patterns
Template Architecture
```python
# Bad: prompts as inline strings scattered across code
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Summarize this: {text}"}]
)

# Good: prompts as versioned templates with clear structure
class PromptTemplate:
    def __init__(self, template: str, version: str):
        self.template = template
        self.version = version

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

SUMMARIZE_V2 = PromptTemplate(
    version="2.1",
    template="""You are a technical writer for a software engineering audience.

Summarize the following text in 3-5 bullet points.
Focus on actionable information and technical details.
Omit marketing language and filler.

Text to summarize:
---
{text}
---

Format each bullet point as: "• [Key point]: [Detail]"
""",
)

# Usage
prompt = SUMMARIZE_V2.render(text=user_input)
```
System Prompt Design
| Element | Purpose | Example |
|---|---|---|
| Role | Who the model is | "You are a senior infrastructure engineer" |
| Constraints | What it must/must not do | "Never recommend deleting production data" |
| Format | Output structure | "Respond in JSON with keys: answer, confidence, sources" |
| Examples | Few-shot demonstrations | "Example input: … Example output: …" |
| Guardrails | Safety boundaries | "If unsure, say 'I don't know' rather than guessing" |
Model Routing: Right-Sizing Cost and Quality
Not every request needs GPT-4. Most requests can be handled by cheaper, faster models.
```python
class ModelRouter:
    """Route requests to appropriate models based on complexity."""

    def route(self, request: dict) -> str:
        # Long document processing → large-context model
        # (checked first, so it isn't shadowed by the task checks below)
        if request.get('input_tokens', 0) > 50000:
            return 'claude-3-5-sonnet'  # 200K context

        # Simple classification/extraction → cheap model
        if request['task'] in ['classify', 'extract', 'format']:
            return 'gpt-4o-mini'  # ~$0.15/1M input tokens

        # Standard generation → balanced model
        if request['task'] in ['summarize', 'explain', 'translate']:
            return 'gpt-4o'  # ~$2.50/1M input tokens

        # Complex reasoning, code gen → premium model
        if request['task'] in ['analyze', 'code_review', 'architecture']:
            return 'o1'  # ~$15/1M input tokens

        return 'gpt-4o-mini'  # Default to cheapest
```
| Model Tier | Use Cases | Cost (approx) | Latency |
|---|---|---|---|
| Small (GPT-4o-mini, Haiku) | Classification, extraction, formatting | $0.10-0.25/1M tokens | Fast |
| Medium (GPT-4o, Sonnet) | Summarization, generation, analysis | $2-5/1M tokens | Moderate |
| Large (o1, Opus) | Complex reasoning, multi-step planning | $10-15/1M tokens | Slow |
| Local (Llama, Mistral) | High-volume, privacy-sensitive | Infrastructure cost only | Variable |
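The routing layer in the stack diagram also owns fallback: when the primary provider times out or rate-limits, the request should degrade to a backup rather than fail outright. A minimal sketch; the provider names and call signatures are placeholders for however you wrap each SDK:

```python
class FallbackChain:
    """Try providers in order; return the first successful response.

    `providers` is a list of (name, call_fn) pairs, where call_fn takes the
    messages list and returns a response, or raises on failure.
    """

    def __init__(self, providers):
        self.providers = providers

    def complete(self, messages):
        errors = []
        for name, call in self.providers:
            try:
                return call(messages)
            except Exception as exc:  # rate limits, timeouts, 5xx responses
                errors.append(f"{name}: {exc}")
        raise RuntimeError("All providers failed: " + "; ".join(errors))
```

A production version would distinguish retryable errors (429, timeout) from permanent ones (invalid request), but the shape is the same: an ordered list of providers and a loop.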
Caching: The Easiest Cost Reduction
Identical or semantically similar queries should not hit the API twice.
```python
import hashlib
import json

import redis

class LLMCache:
    def __init__(self, redis_client: redis.Redis, ttl: int = 3600):
        self.redis = redis_client
        self.ttl = ttl

    def _cache_key(self, model: str, messages: list) -> str:
        content = json.dumps({'model': model, 'messages': messages}, sort_keys=True)
        return f"llm:cache:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, model: str, messages: list) -> dict | None:
        key = self._cache_key(model, messages)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set(self, model: str, messages: list, response: dict):
        key = self._cache_key(model, messages)
        self.redis.setex(key, self.ttl, json.dumps(response))

# Usage
def cached_completion(client, cache: LLMCache, model: str, messages: list) -> dict:
    cached = cache.get(model, messages)
    if cached:
        return cached  # Free, instant

    response = client.chat.completions.create(model=model, messages=messages)
    data = response.model_dump()  # SDK response objects aren't directly JSON-serializable
    cache.set(model, messages, data)
    return data
```
Caching Strategies
| Strategy | Hit Rate | Implementation |
|---|---|---|
| Exact match | Low-medium | Hash the full prompt |
| Semantic cache | High | Embed prompts, find similar cached responses |
| Prefix cache (provider-side) | Automatic | Long system prompts cached by provider |
| Response cache | High for repeated queries | Cache full response by input hash |
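Exact-match caching misses paraphrases ("summarize this article" vs "give me a summary of this article"). A semantic cache, as in the table above, embeds each prompt and returns a cached response when a new prompt is close enough. A minimal in-memory sketch; `embed_fn` (any text-to-vector function, e.g. a provider's embeddings endpoint) and the 0.92 threshold are assumptions to tune against your own traffic:

```python
import math

class SemanticCache:
    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed = embed_fn          # assumed: text -> list[float]
        self.threshold = threshold     # minimum cosine similarity for a hit
        self.entries = []              # list of (embedding, response)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, prompt: str):
        """Return the cached response for the most similar prompt, or None."""
        query = self.embed(prompt)
        best, best_sim = None, 0.0
        for emb, resp in self.entries:
            sim = self._cosine(query, emb)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= self.threshold else None

    def set(self, prompt: str, response):
        self.entries.append((self.embed(prompt), response))
```

The linear scan is fine for small caches; past a few thousand entries you would swap the list for a vector index. Set the threshold too low and users get answers to someone else's question, which is worse than a cache miss.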
Guardrails and Safety
```python
import json
import re
from urllib.parse import urlparse

class InputGuardrails:
    """Validate and sanitize user input before sending to LLM."""

    MAX_INPUT_LENGTH = 10000  # characters
    BLOCKED_PATTERNS = [
        r"ignore previous instructions",
        r"you are now",
        r"system prompt",
        r"forget your instructions",
    ]

    def validate(self, user_input: str) -> tuple[bool, str]:
        # Length check
        if len(user_input) > self.MAX_INPUT_LENGTH:
            return False, "Input too long"

        # Prompt injection detection
        lower = user_input.lower()
        for pattern in self.BLOCKED_PATTERNS:
            if re.search(pattern, lower):
                return False, "Input contains disallowed patterns"

        # PII detection (basic)
        if re.search(r'\b\d{3}-\d{2}-\d{4}\b', user_input):  # SSN
            return False, "Input may contain sensitive information"

        return True, "OK"

class OutputGuardrails:
    """Validate LLM output before returning to user."""

    KNOWN_DOMAINS = {'docs.example.com'}  # replace with your application's allowlist

    def is_known_domain(self, url: str) -> bool:
        return urlparse(url).netloc in self.KNOWN_DOMAINS

    def validate(self, output: str, context: dict) -> str:
        # Check for hallucinated URLs
        urls = re.findall(r'https?://\S+', output)
        for url in urls:
            if not self.is_known_domain(url):
                output = output.replace(url, '[link removed - unverified]')

        # Enforce output format if expected
        if context.get('expected_format') == 'json':
            try:
                json.loads(output)
            except json.JSONDecodeError:
                return '{"error": "Model output was not valid JSON"}'

        return output
```
Cost Monitoring
```python
# Track cost per request
def track_llm_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    PRICING = {
        'gpt-4o': {'input': 2.50, 'output': 10.00},  # USD per 1M tokens
        'gpt-4o-mini': {'input': 0.15, 'output': 0.60},
        'claude-3-5-sonnet': {'input': 3.00, 'output': 15.00},
    }
    prices = PRICING.get(model, {'input': 5.0, 'output': 15.0})  # conservative default
    cost = (input_tokens * prices['input'] + output_tokens * prices['output']) / 1_000_000

    # `metrics` is your application's metrics client (StatsD, Datadog, etc.)
    metrics.track('llm.cost', cost, tags={'model': model})
    metrics.track('llm.tokens.input', input_tokens, tags={'model': model})
    metrics.track('llm.tokens.output', output_tokens, tags={'model': model})
    return cost
```
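Per-request tracking only pays off if something reacts to the totals. A minimal sketch of a budget threshold check; a real version would reset counters on a schedule and page someone instead of just returning a flag, and both the class name and the daily granularity are illustrative:

```python
class BudgetGuard:
    """Accumulate spend and flag when a budget threshold is crossed."""

    def __init__(self, daily_budget_usd: float):
        self.daily_budget = daily_budget_usd
        self.spent_today = 0.0

    def record(self, cost_usd: float) -> bool:
        """Add one request's cost; return True while still under budget."""
        self.spent_today += cost_usd
        return self.spent_today <= self.daily_budget
```

Callers can use the boolean to shed load gracefully, for example by routing over-budget traffic to a cheaper model tier instead of rejecting requests.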
Implementation Checklist
- Centralize prompts as versioned templates, never inline strings
- Implement model routing: use cheap models for simple tasks, premium for complex
- Add response caching for identical queries (exact match first, then semantic)
- Build input guardrails: length limits, prompt injection detection, PII filtering
- Build output guardrails: format validation, hallucination detection
- Track cost per request: model, input tokens, output tokens, total cost
- Implement fallback chains: if primary model fails, route to backup
- Log every prompt/response pair for debugging and evaluation
- Set cost alerts: daily and monthly budget thresholds
- Build an evaluation set: 50+ test cases with expected outputs for regression testing
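The last checklist item can start as a simple loop over recorded cases. A minimal regression-harness sketch; `generate` wraps your LLM call and `score` is any comparison you choose (exact match, embedding similarity, an LLM judge), both supplied by the caller:

```python
def run_eval(cases: list[dict], generate, score, threshold: float = 0.5):
    """Run every eval case and report the pass rate.

    cases: [{'input': ..., 'expected': ...}, ...]
    generate(input) -> model output
    score(output, expected) -> float in [0, 1]
    """
    results = []
    for case in cases:
        output = generate(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "score": score(output, case["expected"]),
        })
    pass_rate = sum(1 for r in results if r["score"] >= threshold) / len(results)
    return pass_rate, results
```

Run it in CI on every prompt-template change: a drop in pass rate turns a silent quality regression into a failed build.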