Enterprise AI deployment fails 87% of the time — not because the models are bad, but because the surrounding architecture is missing. This guide covers the engineering you need around the LLM to make it production-ready: retrieval pipelines, guardrails, monitoring, cost control, and the deployment patterns that separate prototypes from systems your compliance team will approve.
The core mistake: teams spend 90% of effort on the model and 10% on everything else. Production AI is the opposite — the model is the easy part. The hard part is data pipelines, guardrails, monitoring, and cost control.
## Step 1: Choose Your LLM Strategy

### Decision Matrix

| Factor | Self-Hosted (Ollama/vLLM) | API (OpenAI/Claude) | Fine-Tuned |
|---|---|---|---|
| Data Privacy | ✅ Full control | ⚠️ Data leaves premises | ✅ Full control |
| Latency | ✅ Low (local) | ⚠️ Network dependent | ✅ Low (if self-hosted) |
| Cost at Scale | ✅ Fixed hardware cost | ⚠️ Per-token billing | ✅ Fixed after training |
| Model Quality | ⚠️ Smaller models | ✅ Frontier models | ✅ Domain-optimized |
| Setup Effort | Medium | Low | High |
| Maintenance | High | Low | Medium |
### When to Use Each

| Scenario | Best Option | Why |
|---|---|---|
| Prototype / MVP | API (OpenAI/Claude) | Fastest to market, best quality |
| Regulated industry (healthcare, finance) | Self-hosted | Data never leaves your network |
| High-volume, predictable queries | Self-hosted or fine-tuned | Per-token API costs become prohibitive |
| Need latest frontier capabilities | API | Self-hosted models lag behind |
| Domain-specific terminology (legal, medical) | Fine-tuned | Base models don't understand your jargon |
| Cost-sensitive, simple tasks | Small self-hosted (7B-13B) | $0 per token after hardware investment |
### 1.1 Self-Hosted with Ollama

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.1:70b

# Serve via API
ollama serve   # Listens on http://localhost:11434

# Test
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Explain Kubernetes pod scheduling in 3 sentences.",
  "stream": false
}'
```
#### Hardware Requirements for Self-Hosting

| Model Size | VRAM Required | Example GPU | Inference Speed |
|---|---|---|---|
| 7B params | 8 GB | RTX 4070, Apple M2 16GB | 30-50 tokens/sec |
| 13B params | 16 GB | RTX 4090, Apple M2 Pro 32GB | 20-35 tokens/sec |
| 70B params | 48 GB | 2x RTX 4090, A6000 | 10-20 tokens/sec |
| 70B (quantized 4-bit) | 32 GB | RTX 4090, Apple M3 Max | 15-25 tokens/sec |
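The VRAM column follows roughly from parameter count × bytes per parameter (2 for fp16, 1 for 8-bit, 0.5 for 4-bit) plus runtime headroom for the KV cache and activations. A back-of-envelope estimator, where the 1.2× headroom factor is an assumption and real usage varies with context length and quantization scheme:

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """Weights (params x bytes/param) plus ~20% headroom for KV cache
    and activations. A rule of thumb, not a measurement."""
    return round(params_billion * bytes_per_param * overhead, 1)

# 7B model in 8-bit (1 byte/param): ~8 GB, in line with the table above
print(estimate_vram_gb(7, 1.0))
# 70B model in fp16 (2 bytes/param): far beyond a single consumer GPU
print(estimate_vram_gb(70, 2.0))
```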
### 1.2 API Integration Pattern

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def query_llm(prompt: str, system: str = "", model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
        max_tokens=2000,
    )
    return response.choices[0].message.content
```
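API calls fail transiently (rate limits, timeouts, brief outages), so production callers should retry with exponential backoff rather than surface every blip to users. A minimal sketch; the attempt count and delays are illustrative, and libraries like tenacity or the SDK's built-in retries are alternatives:

```python
import random
import time

def with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(); on exception, retry with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retry storms
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))

# Usage (query_llm as defined above):
# answer = with_retries(lambda: query_llm("Summarize our SLA policy"))
```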
## Step 2: Build the RAG Pipeline
Retrieval-Augmented Generation (RAG) grounds LLM responses in your actual data instead of relying on training data alone.
### 2.1 Document Ingestion

```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents
loader = DirectoryLoader("./knowledge_base/", glob="**/*.md")
docs = loader.load()

# Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} documents")
```
#### Chunking Strategy Comparison

| Strategy | Chunk Size | Overlap | Best For |
|---|---|---|---|
| Fixed-size | 500-1000 tokens | 10-20% | General purpose, fast |
| Recursive (by separators) | 500-1000 tokens | 10-20% | Structured docs (headers, paragraphs) |
| Semantic (by meaning) | Variable | None | Academic papers, dense content |
| Parent-child | Parent: 2000, Child: 500 | None | Need both context and precision |
| Sentence-level | 1-3 sentences | None | FAQ, Q&A datasets |
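The fixed-size strategy in the first row is simple enough to sketch without a framework. A character-based version with overlap; characters stand in for tokens here, which is an approximation:

```python
def chunk_fixed(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; each chunk starts `overlap`
    characters before the previous one ended, preserving local context."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    step = size - overlap  # each window advances by size minus overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, at the cost of storing roughly 20% more text.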
### 2.2 Vector Store Setup

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="enterprise_kb",
    persist_directory="./chroma_db",
)
```
### 2.3 RAG Query Chain

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_type="mmr",
        search_kwargs={"k": 5, "fetch_k": 20},
    ),
    return_source_documents=True,
)

result = qa_chain({"query": "What is our SLA for P1 incidents?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
```
## Step 3: Implement Guardrails

Production AI needs safety nets. Without guardrails, your agent will eventually generate something that causes a compliance incident.

### 3.1 Input Validation

```python
import re

BLOCKED_PATTERNS = [
    r'\b(password|secret|api.?key|ssn|credit.?card)\b',
    r'\b\d{3}-\d{2}-\d{4}\b',                       # SSN pattern
    r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',  # Credit card
]

def validate_input(user_input: str) -> tuple[bool, str]:
    """Check user input for PII and blocked content."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, user_input, re.IGNORECASE):
            return False, "Input contains potentially sensitive data."
    if len(user_input) > 10000:
        return False, "Input exceeds maximum length."
    return True, "OK"
```
### 3.2 Output Filtering

```python
def filter_output(response: str) -> str:
    """Remove any PII that the model might hallucinate."""
    # Redact phone numbers
    response = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
                      '[REDACTED-PHONE]', response)
    # Redact email addresses
    response = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b',
                      '[REDACTED-EMAIL]', response)
    # Redact SSNs
    response = re.sub(r'\b\d{3}-\d{2}-\d{4}\b',
                      '[REDACTED-SSN]', response)
    return response
```
### 3.3 Guardrail Categories

| Guardrail Type | What It Prevents | Implementation |
|---|---|---|
| Input PII detection | Users sending sensitive data to LLM | Regex + NER model pre-processing |
| Output PII filtering | Model hallucinating real PII | Regex post-processing |
| Topic restriction | Off-topic usage (politics, medical advice) | System prompt + classifier |
| Prompt injection defense | Users trying to override system prompt | Input sanitization + instruction hierarchy |
| Hallucination detection | Model fabricating facts | Source attribution + confidence scoring |
| Rate limiting | Abuse, cost overrun | Per-user/per-minute token limits |
| Content moderation | Harmful or inappropriate output | OpenAI moderation API or custom classifier |
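The rate-limiting row can be implemented as a per-user token bucket. A minimal in-memory sketch; the capacity and refill rate are illustrative, and a production deployment would typically back this with Redis so limits hold across replicas:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-user rate limiter: `capacity` requests per burst,
    refilled at `rate` requests per second."""

    def __init__(self, capacity: int = 10, rate: float = 0.5):
        self.capacity = capacity
        self.rate = rate
        self.buckets: dict[str, dict] = defaultdict(
            lambda: {"tokens": float(capacity), "last": time.monotonic()}
        )

    def allow(self, user_id: str) -> bool:
        bucket = self.buckets[user_id]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        elapsed = now - bucket["last"]
        bucket["tokens"] = min(self.capacity, bucket["tokens"] + elapsed * self.rate)
        bucket["last"] = now
        if bucket["tokens"] >= 1:
            bucket["tokens"] -= 1
            return True
        return False
```

Check `limiter.allow(user_id)` before each LLM call and return an HTTP 429 when it fails; this caps both abuse and runaway per-user cost.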
### 3.4 Confidence Scoring

```python
def check_confidence(result: dict) -> dict:
    """Add confidence metadata to responses."""
    sources = result.get("source_documents", [])
    if len(sources) >= 3:
        confidence = "high"
    elif sources:
        confidence = "medium"
    else:
        confidence = "low"
    result["confidence"] = confidence
    result["disclaimer"] = (
        "" if confidence == "high" else
        "⚠️ This response has limited source backing. Verify independently."
    )
    return result
```
## Step 4: Monitor and Observe

### 4.1 Logging Pipeline

```python
import hashlib
import json
from datetime import datetime, timezone

def log_interaction(query, response, sources, latency, model):
    """Log every AI interaction for audit and improvement."""
    log_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Hash instead of logging the query itself, so PII stays out of logs.
        # hashlib is stable across runs; built-in hash() is randomized per process.
        "query_hash": hashlib.sha256(query.encode()).hexdigest(),
        "response_length": len(response),
        "sources_count": len(sources),
        "model": model,
        "latency_ms": round(latency * 1000, 2),
        "confidence": check_confidence({"source_documents": sources})["confidence"],
    }
    # Append to JSONL log
    with open("ai_interactions.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
```
#### Key Metrics to Track

| Metric | Target | Alert If |
|---|---|---|
| Response latency (P50) | < 2 seconds | > 5 seconds |
| Response latency (P99) | < 10 seconds | > 30 seconds |
| Confidence: high % | > 70% | < 50% |
| Source documents per query | ≥ 3 average | < 1 average |
| User satisfaction (thumbs up %) | > 80% | < 60% |
| Cost per query | < $0.05 | > $0.15 |
| Error rate | < 1% | > 5% |
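The latency percentiles above can be computed straight from the `latency_ms` values in the JSONL log. A sketch using the nearest-rank method; the sample latencies are made up:

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# In practice, load from the log written by log_interaction:
# latencies_ms = [json.loads(line)["latency_ms"] for line in open("ai_interactions.jsonl")]
latencies_ms = [120, 200, 150, 3200, 180, 90, 210, 160, 140, 9800]
print("P50:", percentile(latencies_ms, 50))  # typical request
print("P99:", percentile(latencies_ms, 99))  # tail latency
```

Note how two slow outliers barely move the P50 but dominate the P99, which is why both targets appear in the table.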
### 4.2 Cost Tracking

```python
COST_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3.5-sonnet": {"input": 0.003, "output": 0.015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0.01, "output": 0.03})
    return (input_tokens / 1000 * rates["input"] +
            output_tokens / 1000 * rates["output"])
```
#### Cost Optimization Strategies

| Strategy | Savings | Trade-off |
|---|---|---|
| Use smaller model for simple queries (routing) | 50-90% | Slightly lower quality on edge cases |
| Cache frequent query results | 30-60% | Stale answers if data changes |
| Batch similar queries | 20-40% | Higher latency |
| Reduce context window (fewer retrieved chunks) | 20-50% | Lower recall |
| Quantized self-hosted models | 70-95% | Slightly lower quality, setup effort |
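The first two strategies (routing and caching) compose naturally. A sketch where a cheap heuristic picks the model and results are memoized by prompt hash; the length-based heuristic is illustrative (real routers often use a small classifier), and `llm_fn` is a stand-in for the actual API call:

```python
import hashlib

_cache: dict[str, str] = {}

def route_model(prompt: str) -> str:
    """Illustrative heuristic: short question-style prompts go to the cheap model."""
    return "gpt-4o-mini" if len(prompt) < 200 and "?" in prompt else "gpt-4o"

def cached_query(prompt: str, llm_fn) -> str:
    """Memoize responses by prompt hash; llm_fn(prompt, model) performs the call."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = llm_fn(prompt, route_model(prompt))
    return _cache[key]
```

A real cache needs a TTL so answers expire when the knowledge base changes, which is the stale-answer trade-off noted in the table.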
## Step 5: Production Deployment

### Containerized Deployment

```dockerfile
FROM python:3.12-slim
WORKDIR /app

# The health check below uses curl, which the slim image does not ship with
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Health check endpoint
HEALTHCHECK --interval=30s --timeout=10s \
    CMD curl -f http://localhost:8000/health || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
```yaml
# docker-compose.yml
services:
  ai-agent:
    build: .
    ports: ["8000:8000"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - CHROMA_DB_PATH=/data/chroma
    volumes:
      - chroma_data:/data/chroma
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
    restart: unless-stopped

  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

# Named volumes must be declared at the top level
volumes:
  chroma_data:
  ollama_models:
```
### Deployment Checklist

- [ ] LLM strategy chosen (self-hosted vs API vs fine-tuned) with cost model
- [ ] Hardware sized for self-hosted (VRAM, CPU, storage)
- [ ] RAG pipeline: ingestion, chunking strategy, embedding, vector store
- [ ] Chunking strategy validated (test retrieval quality before launch)
- [ ] Input validation (PII detection, length limits, prompt injection defense)
- [ ] Output filtering (PII redaction, content moderation)
- [ ] Guardrails: topic restriction, hallucination detection, confidence scoring
- [ ] Interaction logging (audit trail with non-PII query hashes)
- [ ] Cost tracking per query with daily/weekly budget alerts
- [ ] Key metrics dashboard (latency, confidence, satisfaction, cost)
- [ ] Rate limiting and per-user quotas configured
- [ ] Health checks and monitoring with alerting
- [ ] Containerized deployment with resource limits
- [ ] Rollback plan for model updates (canary deployment)
- [ ] Compliance review signed off (data handling, privacy, retention)
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI readiness assessments, visit garnetgrid.com.
:::