LLM Application Architecture: Beyond the API Call
Design production LLM applications that are reliable, cost-efficient, and maintainable. Covers prompt engineering patterns, model routing, caching strategies, evaluation frameworks, and the operational patterns for running LLMs at scale.
Building an LLM application looks easy: call the API, get a response, show it to the user. Ship it in a weekend. Then production happens. Latency spikes to 8 seconds. Costs hit $5,000/month. The model hallucinates critical information. Users find prompt injection vulnerabilities. The model provider changes their API and your application breaks.
This guide covers the architecture patterns for building LLM applications that survive contact with production.
The Production LLM Stack
```
┌─────────────────────────────────────────────┐
│ USER INTERFACE                              │
│ (Streaming response, loading states)        │
├─────────────────────────────────────────────┤
│ APPLICATION LAYER                           │
│ ├─ Input validation & guardrails            │
│ ├─ Prompt construction & templating         │
│ ├─ Context injection (RAG, memory)          │
│ └─ Output parsing & validation              │
├─────────────────────────────────────────────┤
│ ROUTING & ORCHESTRATION                     │
│ ├─ Model selection (GPT-4 vs Claude vs local)│
│ ├─ Fallback chains                          │
│ ├─ Rate limiting & queuing                  │
│ └─ Caching layer                            │
├─────────────────────────────────────────────┤
│ MODEL PROVIDERS                             │
│ ├─ OpenAI ├─ Anthropic ├─ Local (Ollama)    │
│ └─ Fallback provider                        │
├─────────────────────────────────────────────┤
│ OBSERVABILITY                               │
│ ├─ Prompt/response logging                  │
│ ├─ Token usage & cost tracking              │
│ ├─ Latency monitoring                       │
│ └─ Quality evaluation                       │
└─────────────────────────────────────────────┘
```
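The interface layer above lists streaming as a first concern: showing tokens as they arrive hides latency that would otherwise leave users staring at a spinner for the full generation time. A minimal sketch against the OpenAI Python SDK's `stream=True` mode; `on_token` is a hypothetical callback that pushes each delta to the client (e.g. over SSE or a websocket):

```python
def stream_reply(client, model: str, messages: list, on_token) -> str:
    """Stream a completion token-by-token, returning the full text at the end."""
    chunks = []
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (role updates, finish signals) carry no content
            chunks.append(delta)
            on_token(delta)  # push to the UI as it arrives
    return "".join(chunks)
```

The same function serves both paths: the caller gets the complete text for logging and caching, while the UI gets incremental updates.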
Prompt Engineering Patterns
Template Architecture
```python
# Bad: prompts as inline strings scattered across code
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Summarize this: {text}"}]
)

# Good: prompts as versioned templates with clear structure
class PromptTemplate:
    def __init__(self, template: str, version: str):
        self.template = template
        self.version = version

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

SUMMARIZE_V2 = PromptTemplate(
    version="2.1",
    template="""You are a technical writer for a software engineering audience.

Summarize the following text in 3-5 bullet points.
Focus on actionable information and technical details.
Omit marketing language and filler.

Text to summarize:
---
{text}
---

Format each bullet point as: "• [Key point]: [Detail]"
""",
)

# Usage
prompt = SUMMARIZE_V2.render(text=user_input)
```
System Prompt Design
| Element | Purpose | Example |
|---|---|---|
| Role | Who the model is | "You are a senior infrastructure engineer" |
| Constraints | What it must/must not do | "Never recommend deleting production data" |
| Format | Output structure | "Respond in JSON with keys: answer, confidence, sources" |
| Examples | Few-shot demonstrations | "Example input: … Example output: …" |
| Guardrails | Safety boundaries | "If unsure, say 'I don't know' rather than guessing" |
Model Routing: Right-Sizing Cost and Quality
Not every request needs GPT-4. Most requests can be handled by cheaper, faster models.
```python
class ModelRouter:
    """Route requests to appropriate models based on complexity."""

    def route(self, request: dict) -> str:
        # Long document processing → large-context model
        # (checked first, so it isn't shadowed by the task checks below)
        if request.get('input_tokens', 0) > 50000:
            return 'claude-3-5-sonnet'  # 200K context

        # Simple classification/extraction → cheap model
        if request['task'] in ['classify', 'extract', 'format']:
            return 'gpt-4o-mini'  # ~$0.15/1M input tokens

        # Standard generation → balanced model
        if request['task'] in ['summarize', 'explain', 'translate']:
            return 'gpt-4o'  # ~$2.50/1M input tokens

        # Complex reasoning, code gen → premium model
        if request['task'] in ['analyze', 'code_review', 'architecture']:
            return 'o1'  # ~$15/1M input tokens

        return 'gpt-4o-mini'  # Default to cheapest
```
| Model Tier | Use Cases | Cost (approx) | Latency |
|---|---|---|---|
| Small (GPT-4o-mini, Haiku) | Classification, extraction, formatting | $0.10-0.25/1M tokens | Fast |
| Medium (GPT-4o, Sonnet) | Summarization, generation, analysis | $2-5/1M tokens | Moderate |
| Large (o1, Opus) | Complex reasoning, multi-step planning | $10-15/1M tokens | Slow |
| Local (Llama, Mistral) | High-volume, privacy-sensitive | Infrastructure cost only | Variable |
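The routing layer in the stack diagram also owns fallback: when the primary provider times out or rate-limits, the request should degrade to a backup rather than fail outright. A minimal sketch; the provider names and call signatures are placeholders for however you wrap each SDK:

```python
class FallbackChain:
    """Try providers in order; return the first successful response.

    `providers` is a list of (name, call_fn) pairs, where call_fn takes the
    messages list and returns a response, or raises on failure.
    """

    def __init__(self, providers):
        self.providers = providers

    def complete(self, messages):
        errors = []
        for name, call in self.providers:
            try:
                return call(messages)
            except Exception as exc:  # rate limits, timeouts, 5xx responses
                errors.append(f"{name}: {exc}")
        raise RuntimeError("All providers failed: " + "; ".join(errors))
```

A production version would distinguish retryable errors (429, timeout) from permanent ones (invalid request), but the shape is the same: an ordered list of providers and a loop.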
Caching: The Easiest Cost Reduction
Identical or semantically similar queries should not hit the API twice.
```python
import hashlib
import json

import redis

class LLMCache:
    def __init__(self, redis_client: redis.Redis, ttl: int = 3600):
        self.redis = redis_client
        self.ttl = ttl

    def _cache_key(self, model: str, messages: list) -> str:
        content = json.dumps({'model': model, 'messages': messages}, sort_keys=True)
        return f"llm:cache:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, model: str, messages: list) -> dict | None:
        key = self._cache_key(model, messages)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set(self, model: str, messages: list, response: dict):
        key = self._cache_key(model, messages)
        self.redis.setex(key, self.ttl, json.dumps(response))

# Usage
def cached_completion(client, cache: LLMCache, model: str, messages: list) -> dict:
    cached = cache.get(model, messages)
    if cached:
        return cached  # Free, instant

    response = client.chat.completions.create(model=model, messages=messages)
    data = response.model_dump()  # SDK response objects aren't directly JSON-serializable
    cache.set(model, messages, data)
    return data
```
Caching Strategies
| Strategy | Hit Rate | Implementation |
|---|---|---|
| Exact match | Low-medium | Hash the full prompt |
| Semantic cache | High | Embed prompts, find similar cached responses |
| Prefix cache (provider-side) | Automatic | Long system prompts cached by provider |
| Response cache | High for repeated queries | Cache full response by input hash |
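Exact-match caching misses paraphrases ("summarize this article" vs "give me a summary of this article"). A semantic cache, as in the table above, embeds each prompt and returns a cached response when a new prompt is close enough. A minimal in-memory sketch; `embed_fn` (any text-to-vector function, e.g. a provider's embeddings endpoint) and the 0.92 threshold are assumptions to tune against your own traffic:

```python
import math

class SemanticCache:
    def __init__(self, embed_fn, threshold: float = 0.92):
        self.embed = embed_fn          # assumed: text -> list[float]
        self.threshold = threshold     # minimum cosine similarity for a hit
        self.entries = []              # list of (embedding, response)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, prompt: str):
        """Return the cached response for the most similar prompt, or None."""
        query = self.embed(prompt)
        best, best_sim = None, 0.0
        for emb, resp in self.entries:
            sim = self._cosine(query, emb)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= self.threshold else None

    def set(self, prompt: str, response):
        self.entries.append((self.embed(prompt), response))
```

The linear scan is fine for small caches; past a few thousand entries you would swap the list for a vector index. Set the threshold too low and users get answers to someone else's question, which is worse than a cache miss.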
Guardrails and Safety
```python
import json
import re
from urllib.parse import urlparse

class InputGuardrails:
    """Validate and sanitize user input before sending to LLM."""

    MAX_INPUT_LENGTH = 10000  # characters
    BLOCKED_PATTERNS = [
        r"ignore previous instructions",
        r"you are now",
        r"system prompt",
        r"forget your instructions",
    ]

    def validate(self, user_input: str) -> tuple[bool, str]:
        # Length check
        if len(user_input) > self.MAX_INPUT_LENGTH:
            return False, "Input too long"

        # Prompt injection detection
        lower = user_input.lower()
        for pattern in self.BLOCKED_PATTERNS:
            if re.search(pattern, lower):
                return False, "Input contains disallowed patterns"

        # PII detection (basic)
        if re.search(r'\b\d{3}-\d{2}-\d{4}\b', user_input):  # SSN
            return False, "Input may contain sensitive information"

        return True, "OK"

class OutputGuardrails:
    """Validate LLM output before returning to user."""

    KNOWN_DOMAINS = {'docs.example.com'}  # replace with your application's allowlist

    def is_known_domain(self, url: str) -> bool:
        return urlparse(url).netloc in self.KNOWN_DOMAINS

    def validate(self, output: str, context: dict) -> str:
        # Check for hallucinated URLs
        urls = re.findall(r'https?://\S+', output)
        for url in urls:
            if not self.is_known_domain(url):
                output = output.replace(url, '[link removed - unverified]')

        # Enforce output format if expected
        if context.get('expected_format') == 'json':
            try:
                json.loads(output)
            except json.JSONDecodeError:
                return '{"error": "Model output was not valid JSON"}'

        return output
```
Cost Monitoring
```python
# Track cost per request
def track_llm_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    PRICING = {
        'gpt-4o': {'input': 2.50, 'output': 10.00},  # USD per 1M tokens
        'gpt-4o-mini': {'input': 0.15, 'output': 0.60},
        'claude-3-5-sonnet': {'input': 3.00, 'output': 15.00},
    }
    prices = PRICING.get(model, {'input': 5.0, 'output': 15.0})  # conservative default
    cost = (input_tokens * prices['input'] + output_tokens * prices['output']) / 1_000_000

    # `metrics` is your application's metrics client (StatsD, Datadog, etc.)
    metrics.track('llm.cost', cost, tags={'model': model})
    metrics.track('llm.tokens.input', input_tokens, tags={'model': model})
    metrics.track('llm.tokens.output', output_tokens, tags={'model': model})
    return cost
```
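Per-request tracking only pays off if something reacts to the totals. A minimal sketch of a budget threshold check; a real version would reset counters on a schedule and page someone instead of just returning a flag, and both the class name and the daily granularity are illustrative:

```python
class BudgetGuard:
    """Accumulate spend and flag when a budget threshold is crossed."""

    def __init__(self, daily_budget_usd: float):
        self.daily_budget = daily_budget_usd
        self.spent_today = 0.0

    def record(self, cost_usd: float) -> bool:
        """Add one request's cost; return True while still under budget."""
        self.spent_today += cost_usd
        return self.spent_today <= self.daily_budget
```

Callers can use the boolean to shed load gracefully, for example by routing over-budget traffic to a cheaper model tier instead of rejecting requests.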
Implementation Checklist
- Centralize prompts as versioned templates, never inline strings
- Implement model routing: use cheap models for simple tasks, premium for complex
- Add response caching for identical queries (exact match first, then semantic)
- Build input guardrails: length limits, prompt injection detection, PII filtering
- Build output guardrails: format validation, hallucination detection
- Track cost per request: model, input tokens, output tokens, total cost
- Implement fallback chains: if primary model fails, route to backup
- Log every prompt/response pair for debugging and evaluation
- Set cost alerts: daily and monthly budget thresholds
- Build an evaluation set: 50+ test cases with expected outputs for regression testing
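The last checklist item can start as a simple loop over recorded cases. A minimal regression-harness sketch; `generate` wraps your LLM call and `score` is any comparison you choose (exact match, embedding similarity, an LLM judge), both supplied by the caller:

```python
def run_eval(cases: list[dict], generate, score, threshold: float = 0.5):
    """Run every eval case and report the pass rate.

    cases: [{'input': ..., 'expected': ...}, ...]
    generate(input) -> model output
    score(output, expected) -> float in [0, 1]
    """
    results = []
    for case in cases:
        output = generate(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "score": score(output, case["expected"]),
        })
    pass_rate = sum(1 for r in results if r["score"] >= threshold) / len(results)
    return pass_rate, results
```

Run it in CI on every prompt-template change: a drop in pass rate turns a silent quality regression into a failed build.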