LLM Application Architecture: Beyond the API Call

Design production LLM applications that are reliable, cost-efficient, and maintainable. Covers prompt engineering patterns, model routing, caching strategies, evaluation frameworks, and the operational patterns for running LLMs at scale.

Building an LLM application looks easy: call the API, get a response, show it to the user. Ship it in a weekend. Then production happens. Latency spikes to 8 seconds. Costs hit $5,000/month. The model hallucinates critical information. Users find prompt injection vulnerabilities. The model provider changes their API and your application breaks.

This guide covers the architecture patterns for building LLM applications that survive contact with production.


The Production LLM Stack

┌─────────────────────────────────────────────┐
│  USER INTERFACE                              │
│  (Streaming response, loading states)        │
├─────────────────────────────────────────────┤
│  APPLICATION LAYER                           │
│  ├─ Input validation & guardrails            │
│  ├─ Prompt construction & templating         │
│  ├─ Context injection (RAG, memory)          │
│  └─ Output parsing & validation              │
├─────────────────────────────────────────────┤
│  ROUTING & ORCHESTRATION                     │
│  ├─ Model selection (GPT-4 vs Claude vs local)│
│  ├─ Fallback chains                          │
│  ├─ Rate limiting & queuing                  │
│  └─ Caching layer                            │
├─────────────────────────────────────────────┤
│  MODEL PROVIDERS                             │
│  ├─ OpenAI  ├─ Anthropic  ├─ Local (Ollama)  │
│  └─ Fallback provider                        │
├─────────────────────────────────────────────┤
│  OBSERVABILITY                               │
│  ├─ Prompt/response logging                  │
│  ├─ Token usage & cost tracking              │
│  ├─ Latency monitoring                       │
│  └─ Quality evaluation                       │
└─────────────────────────────────────────────┘

Prompt Engineering Patterns

Template Architecture

# Bad: prompts as inline strings scattered across code
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Summarize this: {text}"}]
)

# Good: prompts as versioned templates with clear structure
class PromptTemplate:
    def __init__(self, template: str, version: str):
        self.template = template
        self.version = version

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

SUMMARIZE_V2 = PromptTemplate(
    version="2.1",
    template="""You are a technical writer for a software engineering audience.

Summarize the following text in 3-5 bullet points.
Focus on actionable information and technical details.
Omit marketing language and filler.

Text to summarize:
---
{text}
---

Format each bullet point as: "• [Key point]: [Detail]"
"""
)

# Usage
prompt = SUMMARIZE_V2.render(text=user_input)
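
The version attribute pays off when it travels with each request. A minimal logging sketch, assuming a standard-library logger and the PromptTemplate class above (the function and field names are illustrative):

import logging

logger = logging.getLogger("llm")

def log_llm_call(template: PromptTemplate, rendered_prompt: str, response_text: str):
    # Record which prompt version produced which response, for debugging and A/B comparison
    logger.info(
        "llm_call",
        extra={
            "prompt_version": template.version,
            "prompt": rendered_prompt,
            "response": response_text,
        },
    )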

System Prompt Design

Element     | Purpose                   | Example
Role        | Who the model is          | "You are a senior infrastructure engineer"
Constraints | What it must/must not do  | "Never recommend deleting production data"
Format      | Output structure          | "Respond in JSON with keys: answer, confidence, sources"
Examples    | Few-shot demonstrations   | "Example input: … Example output: …"
Guardrails  | Safety boundaries         | "If unsure, say 'I don't know' rather than guessing"
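
As a sketch of how these elements compose into one system prompt (the SystemPrompt dataclass and its field names are illustrative, not a library API):

from dataclasses import dataclass, field

@dataclass
class SystemPrompt:
    role: str
    constraints: list[str]
    output_format: str
    examples: list[str] = field(default_factory=list)
    guardrails: list[str] = field(default_factory=list)

    def render(self) -> str:
        # Assemble sections in a fixed order: role, constraints, format, examples, guardrails
        sections = [self.role]
        if self.constraints:
            sections.append("Constraints:\n" + "\n".join(f"- {c}" for c in self.constraints))
        sections.append(f"Output format: {self.output_format}")
        if self.examples:
            sections.append("Examples:\n" + "\n\n".join(self.examples))
        if self.guardrails:
            sections.append("\n".join(self.guardrails))
        return "\n\n".join(sections)

INFRA_ASSISTANT = SystemPrompt(
    role="You are a senior infrastructure engineer.",
    constraints=["Never recommend deleting production data"],
    output_format="Respond in JSON with keys: answer, confidence, sources",
    guardrails=["If unsure, say 'I don't know' rather than guessing"],
)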

Model Routing: Right-Sizing Cost and Quality

Not every request needs GPT-4. Most requests can be handled by cheaper, faster models.

class ModelRouter:
    """Route requests to appropriate models based on complexity."""

    def route(self, request: dict) -> str:
        # Long document processing → large-context model, regardless of task
        if request.get('input_tokens', 0) > 50000:
            return 'claude-3-5-sonnet'  # 200K context

        # Simple classification/extraction → cheap model
        if request['task'] in ['classify', 'extract', 'format']:
            return 'gpt-4o-mini'  # ~$0.15/1M input tokens

        # Standard generation → balanced model
        if request['task'] in ['summarize', 'explain', 'translate']:
            return 'gpt-4o'       # ~$2.50/1M input tokens

        # Complex reasoning, code gen → premium model
        if request['task'] in ['analyze', 'code_review', 'architecture']:
            return 'o1'           # ~$10-15/1M input tokens

        return 'gpt-4o-mini'  # Default to cheapest

Model Tier                 | Use Cases                               | Cost (approx)            | Latency
Small (GPT-4o-mini, Haiku) | Classification, extraction, formatting  | $0.10-0.25/1M tokens     | Fast
Medium (GPT-4o, Sonnet)    | Summarization, generation, analysis     | $2-5/1M tokens           | Moderate
Large (o1, Opus)           | Complex reasoning, multi-step planning  | $10-15/1M tokens         | Slow
Local (Llama, Mistral)     | High-volume, privacy-sensitive          | Infrastructure cost only | Variable
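
The stack diagram above also lists fallback chains under routing and orchestration. A minimal sketch of that pattern, assuming a call_model() helper that wraps the provider SDK and raises on errors or timeouts (both the helper and the chain order are illustrative):

import time

def call_model(model: str, messages: list) -> dict:
    # Placeholder: wrap the provider SDK call here and raise on failure
    ...

FALLBACK_CHAIN = ['gpt-4o', 'claude-3-5-sonnet', 'gpt-4o-mini']

def complete_with_fallback(messages: list, retries_per_model: int = 2) -> dict:
    last_error = None
    for model in FALLBACK_CHAIN:
        for attempt in range(retries_per_model):
            try:
                return call_model(model, messages)
            except Exception as exc:  # narrow this to provider-specific errors in practice
                last_error = exc
                time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    raise RuntimeError("All models in the fallback chain failed") from last_error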

Caching: The Easiest Cost Reduction

Identical or semantically similar queries should not hit the API twice.

import hashlib
import redis
import json

class LLMCache:
    def __init__(self, redis_client: redis.Redis, ttl: int = 3600):
        self.redis = redis_client
        self.ttl = ttl

    def _cache_key(self, model: str, messages: list) -> str:
        content = json.dumps({'model': model, 'messages': messages}, sort_keys=True)
        return f"llm:cache:{hashlib.sha256(content.encode()).hexdigest()}"

    def get(self, model: str, messages: list) -> dict | None:
        key = self._cache_key(model, messages)
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def set(self, model: str, messages: list, response: dict):
        key = self._cache_key(model, messages)
        self.redis.setex(key, self.ttl, json.dumps(response))

# Usage: wrap the provider call so cache hits skip the API entirely
cache = LLMCache(redis_client)

def cached_completion(model: str, messages: list) -> dict:
    cached = cache.get(model, messages)
    if cached:
        return cached  # Free, instant

    response = client.chat.completions.create(model=model, messages=messages)
    payload = response.model_dump()  # convert the SDK object to a JSON-serializable dict
    cache.set(model, messages, payload)
    return payload

Caching Strategies

Strategy                     | Hit Rate                   | Implementation
Exact match                  | Low-medium                 | Hash the full prompt
Semantic cache               | High                       | Embed prompts, find similar cached responses
Prefix cache (provider-side) | Automatic                  | Long system prompts cached by provider
Response cache               | High for repeated queries  | Cache full response by input hash
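
The LLMCache above only catches exact repeats. A semantic cache sketch, assuming an embed() helper that returns a vector for a prompt (for example, a call to an embeddings API) and an in-memory index for illustration; a production system would use a vector store instead of a linear scan:

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: return an embedding vector for the text (e.g., via an embeddings API)
    ...

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, dict]] = []  # (prompt embedding, cached response)

    def get(self, prompt: str) -> dict | None:
        query = embed(prompt)
        for vec, response in self.entries:
            similarity = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
            if similarity >= self.threshold:
                return response  # close enough to a previously answered prompt
        return None

    def set(self, prompt: str, response: dict):
        self.entries.append((embed(prompt), response))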

Guardrails and Safety

import json
import re
from urllib.parse import urlparse

class InputGuardrails:
    """Validate and sanitize user input before sending to LLM."""

    MAX_INPUT_LENGTH = 10000  # characters

    BLOCKED_PATTERNS = [
        r"ignore previous instructions",
        r"you are now",
        r"system prompt",
        r"forget your instructions",
    ]

    def validate(self, user_input: str) -> tuple[bool, str]:
        # Length check
        if len(user_input) > self.MAX_INPUT_LENGTH:
            return False, "Input too long"

        # Prompt injection detection
        lower = user_input.lower()
        for pattern in self.BLOCKED_PATTERNS:
            if re.search(pattern, lower):
                return False, "Input contains disallowed patterns"

        # PII detection (basic)
        if re.search(r'\b\d{3}-\d{2}-\d{4}\b', user_input):  # SSN
            return False, "Input may contain sensitive information"

        return True, "OK"

class OutputGuardrails:
    """Validate LLM output before returning to user."""

    # Domains allowed to appear as links in model output; extend with the hosts you trust
    ALLOWED_DOMAINS = {'docs.example.com'}

    def validate(self, output: str, context: dict) -> str:
        # Check for hallucinated URLs
        urls = re.findall(r'https?://\S+', output)
        for url in urls:
            if not self.is_known_domain(url):
                output = output.replace(url, '[link removed - unverified]')

        # Enforce output format if expected
        if context.get('expected_format') == 'json':
            try:
                json.loads(output)
            except json.JSONDecodeError:
                return '{"error": "Model output was not valid JSON"}'

        return output

    def is_known_domain(self, url: str) -> bool:
        # Allowlist check on the link's host (used above to strip unverified URLs)
        return urlparse(url).netloc in self.ALLOWED_DOMAINS
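
Wiring both guardrail classes around the model call might look like the following; the handle_request() function and the rejection message are illustrative, and client is the provider SDK client from the earlier examples:

input_guard = InputGuardrails()
output_guard = OutputGuardrails()

def handle_request(user_input: str, context: dict) -> str:
    # Reject bad input before spending tokens on it
    ok, reason = input_guard.validate(user_input)
    if not ok:
        return f"Request rejected: {reason}"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_input}],
    )
    raw_output = response.choices[0].message.content

    # Scrub unverified links and enforce the expected format before returning
    return output_guard.validate(raw_output, context)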

Cost Monitoring

# Track cost per request; metrics is the application's metrics client (StatsD, Datadog, etc.)
def track_llm_cost(model: str, input_tokens: int, output_tokens: int):
    PRICING = {
        'gpt-4o':       {'input': 2.50, 'output': 10.00},  # per 1M tokens
        'gpt-4o-mini':  {'input': 0.15, 'output': 0.60},
        'claude-3-5-sonnet': {'input': 3.00, 'output': 15.00},
    }

    prices = PRICING.get(model, {'input': 5.0, 'output': 15.0})
    cost = (input_tokens * prices['input'] + output_tokens * prices['output']) / 1_000_000

    metrics.track('llm.cost', cost, tags={'model': model})
    metrics.track('llm.tokens.input', input_tokens, tags={'model': model})
    metrics.track('llm.tokens.output', output_tokens, tags={'model': model})

    return cost
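
The checklist below calls for cost alerts; one minimal way to act on the tracked cost is a running budget check (the threshold and the alert() hook are placeholders):

DAILY_BUDGET_USD = 50.0
daily_spend = 0.0  # reset by a scheduled job at the start of each day

def alert(message: str):
    # Placeholder: notify the team (Slack, PagerDuty, email, ...)
    ...

def record_cost(cost: float):
    global daily_spend
    daily_spend += cost
    if daily_spend > DAILY_BUDGET_USD:
        alert(f"LLM spend exceeded daily budget: ${daily_spend:.2f}")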

Implementation Checklist

  • Centralize prompts as versioned templates, never inline strings
  • Implement model routing: use cheap models for simple tasks, premium for complex
  • Add response caching for identical queries (exact match first, then semantic)
  • Build input guardrails: length limits, prompt injection detection, PII filtering
  • Build output guardrails: format validation, hallucination detection
  • Track cost per request: model, input tokens, output tokens, total cost
  • Implement fallback chains: if primary model fails, route to backup
  • Log every prompt/response pair for debugging and evaluation
  • Set cost alerts: daily and monthly budget thresholds
  • Build an evaluation set: 50+ test cases with expected outputs for regression testing (a minimal sketch follows below)
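
A minimal sketch of such an evaluation harness, using a substring check as a stand-in for whatever scoring method fits your tasks (the case shape and the example case are illustrative):

EVAL_CASES = [
    {"input": "Summarize: The deploy failed because the TLS certificate expired.",
     "must_contain": "certificate"},
    # ... 50+ cases covering your core tasks
]

def run_evals(generate) -> float:
    """generate: callable that takes an input string and returns the model's output."""
    passed = 0
    for case in EVAL_CASES:
        output = generate(case["input"])
        if case["must_contain"].lower() in output.lower():
            passed += 1
    return passed / len(EVAL_CASES)  # pass rate; fail the build if it drops below a threshold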

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
