Enterprise AI deployment fails 87% of the time — not because the models are bad, but because the surrounding architecture is missing. This guide covers the engineering you need around the LLM to make it production-ready: retrieval pipelines, guardrails, monitoring, cost control, and the deployment patterns that separate prototypes from systems your compliance team will approve.
The core mistake: teams spend 90% of effort on the model and 10% on everything else. Production AI is the opposite — the model is the easy part. The hard part is data pipelines, guardrails, monitoring, and cost control.
## Step 1: Choose Your LLM Strategy

### Decision Matrix

| Factor | Self-Hosted (Ollama/vLLM) | API (OpenAI/Claude) | Fine-Tuned |
|---|---|---|---|
| Data Privacy | ✅ Full control | ⚠️ Data leaves premises | ✅ Full control |
| Latency | ✅ Low (local) | ⚠️ Network dependent | ✅ Low (if self-hosted) |
| Cost at Scale | ✅ Fixed hardware cost | ⚠️ Per-token billing | ✅ Fixed after training |
| Model Quality | ⚠️ Smaller models | ✅ Frontier models | ✅ Domain-optimized |
| Setup Effort | Medium | Low | High |
| Maintenance | High | Low | Medium |
### When to Use Each

| Scenario | Best Option | Why |
|---|---|---|
| Prototype / MVP | API (OpenAI/Claude) | Fastest to market, best quality |
| Regulated industry (healthcare, finance) | Self-hosted | Data never leaves your network |
| High-volume, predictable queries | Self-hosted or fine-tuned | Per-token API costs become prohibitive |
| Need latest frontier capabilities | API | Self-hosted models lag behind |
| Domain-specific terminology (legal, medical) | Fine-tuned | Base models don't understand your jargon |
| Cost-sensitive, simple tasks | Small self-hosted (7B-13B) | $0 per token after hardware investment |
### 1.1 Self-Hosted with Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1:70b

# Serve via API
ollama serve   # Listens on http://localhost:11434

# Test
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Explain Kubernetes pod scheduling in 3 sentences.",
  "stream": false
}'
```
#### Hardware Requirements for Self-Hosting

| Model Size | VRAM Required | Example GPU | Inference Speed |
|---|---|---|---|
| 7B params | 8 GB | RTX 4070, Apple M2 16GB | 30-50 tokens/sec |
| 13B params | 16 GB | RTX 4090, Apple M2 Pro 32GB | 20-35 tokens/sec |
| 70B params | 48 GB | 2x RTX 4090, A6000 | 10-20 tokens/sec |
| 70B (quantized 4-bit) | 32 GB | RTX 4090, Apple M3 Max | 15-25 tokens/sec |
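The VRAM column follows roughly from parameter count × bytes per parameter (2 for fp16, 1 for 8-bit, 0.5 for 4-bit) plus runtime headroom for the KV cache and activations. A back-of-envelope estimator, where the 1.2× headroom factor is an assumption and real usage varies with context length and quantization scheme:

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """Weights (params x bytes/param) plus ~20% headroom for KV cache
    and activations. A rule of thumb, not a measurement."""
    return round(params_billion * bytes_per_param * overhead, 1)

# 7B model in 8-bit (1 byte/param): ~8 GB, in line with the table above
print(estimate_vram_gb(7, 1.0))
# 70B model in fp16 (2 bytes/param): far beyond a single consumer GPU
print(estimate_vram_gb(70, 2.0))
```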
### 1.2 API Integration Pattern

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def query_llm(prompt: str, system: str = "", model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
        max_tokens=2000,
    )
    return response.choices[0].message.content
```
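API calls fail transiently (rate limits, timeouts, brief outages), so production callers should retry with exponential backoff rather than surface every blip to users. A minimal sketch; the attempt count and delays are illustrative, and libraries like tenacity or the SDK's built-in retries are alternatives:

```python
import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(); on exception, retry with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retry storms
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))

# Usage (query_llm as defined above):
# answer = with_retries(lambda: query_llm("Summarize our SLA policy"))
```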
## Step 2: Build the RAG Pipeline
Retrieval-Augmented Generation (RAG) grounds LLM responses in your actual data instead of relying on training data alone.
### 2.1 Document Ingestion

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
docs = loader.load()

# Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} documents")
```
#### Chunking Strategy Comparison

| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed-size | 500-1000 tokens | 10-20% | General purpose, fast |
| Recursive (by separators) | 500-1000 tokens | 10-20% | Structured docs (headers, paragraphs) |
| Semantic (by meaning) | Variable | None | Academic papers, dense content |
| Parent-child | Parent: 2000, Child: 500 | None | Need both context and precision |
| Sentence-level | 1-3 sentences | None | FAQ, Q&A datasets |
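The fixed-size strategy in the first row is simple enough to sketch without a framework. A character-based version with overlap; characters stand in for tokens here, which is an approximation:

```python
def chunk_fixed(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; each chunk starts `overlap`
    characters before the previous one ended, preserving local context."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    step = size - overlap  # each window advances by size minus overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, at the cost of storing roughly 20% more text.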
### 2.2 Vector Store Setup

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="enterprise_kb",
    persist_directory="./chroma_db",
)
```
### 2.3 RAG Query Chain

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 5, "fetch_k": 20},
    ),
    return_source_documents=True,
)

result = qa_chain({"query": "What is our SLA for P1 incidents?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
```
## Step 3: Implement Guardrails

Production AI needs safety nets. Without guardrails, your agent will eventually generate something that causes a compliance incident.

### 3.1 Input Validation

```python
import re

BLOCKED_PATTERNS = [
    r'\b(password|secret|api.?key|ssn|credit.?card)\b',
    r'\b\d{3}-\d{2}-\d{4}\b',                       # SSN pattern
    r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',  # Credit card
]

def validate_input(user_input: str) -> tuple[bool, str]:
    """Check user input for PII and blocked content."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, user_input, re.IGNORECASE):
            return False, "Input contains potentially sensitive data."
    if len(user_input) > 10000:
        return False, "Input exceeds maximum length."
    return True, "OK"
```
### 3.2 Output Filtering

```python
def filter_output(response: str) -> str:
    """Remove any PII that the model might hallucinate."""
    # Redact phone numbers
    response = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
                      '[REDACTED-PHONE]', response)
    # Redact email addresses
    response = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b',
                      '[REDACTED-EMAIL]', response)
    # Redact SSNs
    response = re.sub(r'\b\d{3}-\d{2}-\d{4}\b',
                      '[REDACTED-SSN]', response)
    return response
```
### 3.3 Guardrail Categories

| Guardrail Type | What It Prevents | Implementation |
|---|---|---|
| Input PII detection | Users sending sensitive data to LLM | Regex + NER model pre-processing |
| Output PII filtering | Model hallucinating real PII | Regex post-processing |
| Topic restriction | Off-topic usage (politics, medical advice) | System prompt + classifier |
| Prompt injection defense | Users trying to override system prompt | Input sanitization + instruction hierarchy |
| Hallucination detection | Model fabricating facts | Source attribution + confidence scoring |
| Rate limiting | Abuse, cost overrun | Per-user/per-minute token limits |
| Content moderation | Harmful or inappropriate output | OpenAI moderation API or custom classifier |
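The rate-limiting row can be implemented as a per-user token bucket. A minimal in-memory sketch; the capacity and refill rate are illustrative, and a production deployment would typically back this with Redis so limits hold across replicas:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-user rate limiter: `capacity` requests per burst,
    refilled at `rate` requests per second."""

    def __init__(self, capacity: int = 10, rate: float = 0.5):
        self.capacity = capacity
        self.rate = rate
        self.buckets: dict[str, dict] = defaultdict(
            lambda: {"tokens": float(capacity), "last": time.monotonic()}
        )

    def allow(self, user_id: str) -> bool:
        bucket = self.buckets[user_id]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - bucket["last"]
        bucket["tokens"] = min(self.capacity, bucket["tokens"] + elapsed * self.rate)
        bucket["last"] = now
        if bucket["tokens"] >= 1:
            bucket["tokens"] -= 1
            return True
        return False
```

Check `limiter.allow(user_id)` before each LLM call and return an HTTP 429 when it fails; this caps both abuse and runaway per-user cost.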
### 3.4 Confidence Scoring

```python
def check_confidence(result: dict) -> dict:
    """Add confidence metadata to responses."""
    sources = result.get("source_documents", [])
    if len(sources) >= 3:
        confidence = "high"
    elif sources:
        confidence = "medium"
    else:
        confidence = "low"
    result["confidence"] = confidence
    result["disclaimer"] = (
        "" if confidence == "high" else
        "⚠️ This response has limited source backing. Verify independently."
    )
    return result
```
## Step 4: Monitor and Observe

### 4.1 Logging Pipeline

```python
import hashlib
import json
from datetime import datetime, timezone

def log_interaction(query, response, sources, latency, model):
    """Log every AI interaction for audit and improvement."""
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hash instead of logging the query itself, so PII stays out of logs.
        # hashlib is stable across runs; built-in hash() is randomized per process.
        "query_hash": hashlib.sha256(query.encode()).hexdigest(),
        "response_length": len(response),
        "sources_count": len(sources),
        "model": model,
        "latency_ms": round(latency * 1000, 2),
        "confidence": check_confidence({"source_documents": sources})["confidence"],
    }
    # Append to JSONL log
    with open("ai_interactions.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
```
#### Key Metrics to Track

| Metric | Target | Alert If |
|---|---|---|
| Response latency (P50) | < 2 seconds | > 5 seconds |
| Response latency (P99) | < 10 seconds | > 30 seconds |
| Confidence: high % | > 70% | < 50% |
| Source documents per query | ≥ 3 average | < 1 average |
| User satisfaction (thumbs up %) | > 80% | < 60% |
| Cost per query | < $0.05 | > $0.15 |
| Error rate | < 1% | > 5% |
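The latency percentiles above can be computed straight from the `latency_ms` values in the JSONL log. A sketch using the nearest-rank method; the sample latencies are made up:

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# In practice, load from the log written by log_interaction:
# latencies_ms = [json.loads(line)["latency_ms"] for line in open("ai_interactions.jsonl")]
latencies_ms = [120, 200, 150, 3200, 180, 90, 210, 160, 140, 9800]
print("P50:", percentile(latencies_ms, 50))  # typical request
print("P99:", percentile(latencies_ms, 99))  # tail latency
```

Note how two slow outliers barely move the P50 but dominate the P99, which is why both targets appear in the table.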
### 4.2 Cost Tracking

```python
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3.5-sonnet": {"input": 0.003, "output": 0.015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0.01, "output": 0.03})
    return (input_tokens / 1000 * rates["input"] +
            output_tokens / 1000 * rates["output"])
```
#### Cost Optimization Strategies

| Strategy | Savings | Trade-off |
|---|---|---|
| Use smaller model for simple queries (routing) | 50-90% | Slightly lower quality on edge cases |
| Cache frequent query results | 30-60% | Stale answers if data changes |
| Batch similar queries | 20-40% | Higher latency |
| Reduce context window (fewer retrieved chunks) | 20-50% | Lower recall |
| Quantized self-hosted models | 70-95% | Slightly lower quality, setup effort |
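The first two strategies (routing and caching) compose naturally. A sketch where a cheap heuristic picks the model and results are memoized by prompt hash; the length-based heuristic is illustrative (real routers often use a small classifier), and `llm_fn` is a stand-in for the actual API call:

```python
import hashlib

_cache: dict[str, str] = {}

def route_model(prompt: str) -> str:
    """Illustrative heuristic: short question-style prompts go to the cheap model."""
    return "gpt-4o-mini" if len(prompt) < 200 and "?" in prompt else "gpt-4o"

def cached_query(prompt: str, llm_fn) -> str:
    """Memoize responses by prompt hash; llm_fn(prompt, model) performs the call."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = llm_fn(prompt, route_model(prompt))
    return _cache[key]
```

A real cache needs a TTL so answers expire when the knowledge base changes, which is the stale-answer trade-off noted in the table.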
## Step 5: Production Deployment

### Containerized Deployment

```dockerfile
FROM python:3.12-slim
WORKDIR /app

# The health check below uses curl, which the slim image does not ship with
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
```yaml
# docker-compose.yml
services:
  ai-agent:
    build: .
    ports: ["8000:8000"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - CHROMA_DB_PATH=/data/chroma
    volumes:
      - chroma_data:/data/chroma
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
    restart: unless-stopped

  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

# Named volumes must be declared at the top level
volumes:
  chroma_data:
  ollama_models:
```
### Deployment Checklist

- [ ] LLM strategy chosen (self-hosted vs API vs fine-tuned) with cost model
- [ ] Hardware sized for self-hosted (VRAM, CPU, storage)
- [ ] RAG pipeline: ingestion, chunking strategy, embedding, vector store
- [ ] Chunking strategy validated (test retrieval quality before launch)
- [ ] Input validation (PII detection, length limits, prompt injection defense)
- [ ] Output filtering (PII redaction, content moderation)
- [ ] Guardrails: topic restriction, hallucination detection, confidence scoring
- [ ] Interaction logging (audit trail with non-PII query hashes)
- [ ] Cost tracking per query with daily/weekly budget alerts
- [ ] Key metrics dashboard (latency, confidence, satisfaction, cost)
- [ ] Rate limiting and per-user quotas configured
- [ ] Health checks and monitoring with alerting
- [ ] Containerized deployment with resource limits
- [ ] Rollback plan for model updates (canary deployment)
- [ ] Compliance review signed off (data handling, privacy, retention)
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI readiness assessments, visit garnetgrid.com.
:::