How to Implement RAG (Retrieval-Augmented Generation)
Build production RAG pipelines. Covers chunking strategies, embedding models, vector stores, retrieval techniques, evaluation, and common failure modes.
RAG is the highest-ROI pattern for enterprise AI right now. It lets you ground LLM responses in your own data without the cost and complexity of fine-tuning. But naive implementations fail badly — wrong chunks retrieved, hallucinated answers, stale data. This guide covers the complete RAG pipeline from document ingestion to production evaluation, with practical techniques that separate working systems from demos.
The core idea: instead of training the model on your data, you retrieve relevant context at query time and include it in the prompt. This keeps the model general-purpose while grounding answers in your specific documents, policies, and knowledge base.
Architecture Overview
Offline Pipeline (Index Time):

```
Documents → Chunk → Embed → Store (Vector DB)
```

Online Pipeline (Query Time):

```
User Query → Embed → Search → Top-K Results → Rerank
                                                 ↓
                      Prompt (Context + Query) → LLM → Response
```
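Before diving into the real components, the two pipelines above can be sketched end to end in plain Python. This toy uses keyword overlap in place of real embeddings, and the document strings and function names are illustrative only:

```python
# Toy end-to-end sketch of the offline and online pipelines, using token
# overlap instead of real embeddings (all names and data are illustrative).
docs = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday through Friday.",
]

# Offline: "embed" each document as a set of lowercase tokens
store = [(doc, set(doc.lower().split())) for doc in docs]

# Online: score by token overlap, take top-k, build a grounded prompt
def answer(query: str, top_k: int = 1) -> str:
    q = set(query.lower().split())
    ranked = sorted(store, key=lambda item: len(q & item[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

prompt = answer("How long do refunds take?")
```

A real system swaps the token sets for embedding vectors and the overlap score for cosine similarity, but the shape of the flow is identical.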
Step 1: Document Ingestion and Chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken

# Chunking strategy matters more than model choice
encoding = tiktoken.get_encoding("cl100k_base")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # Tokens, as counted by length_function below
    chunk_overlap=50,  # Overlap prevents context loss at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=lambda text: len(encoding.encode(text)),
)

# Process documents
chunks = splitter.split_documents(documents)

# Add metadata for filtering
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": chunk.metadata.get("filename", "unknown"),
        "chunk_index": i,
        "doc_type": "policy",          # Enables filtered search
        "department": "engineering",
        "last_updated": "2025-11-15",  # Track freshness
    })
```
Chunking Decision Matrix
| Document Type | Chunk Size | Overlap | Strategy | Why |
|---|---|---|---|---|
| Technical docs | 512 tokens | 50 | Recursive (headers → paragraphs) | Balance detail vs context |
| Legal / policy | 1024 tokens | 100 | Paragraph-level | Need full clause context |
| Code files | Function-level | 0 | AST-based splitting | Semantic boundaries |
| Q&A / FAQ | 256 tokens | 0 | One chunk per question | Self-contained answers |
| Emails / chat | 512 tokens | 25 | Message-level | Conversation context |
| Tables / CSV | Row-group | 0 | Row-based with headers | Keep header → row mapping |
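The AST-based strategy for code files in the table above can be sketched with Python's standard-library `ast` module. This minimal version emits one chunk per top-level function or class and skips decorators and module-level statements for brevity; the function name is hypothetical:

```python
import ast

# Split Python source into one chunk per top-level function/class:
# semantic boundaries beat fixed-size chunks for code retrieval.
def split_code_by_ast(source: str) -> list[str]:
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-indexed and inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

source = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
chunks = split_code_by_ast(source)
```

For other languages, tree-sitter grammars give the same function-level boundaries.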
Advanced: Parent-Child Chunking
```python
import uuid

# Small chunks for retrieval precision, large chunks for LLM context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)

parent_chunks = parent_splitter.split_documents(documents)
for parent in parent_chunks:
    parent.metadata["id"] = str(uuid.uuid4())  # ID links children back to parents
    children = child_splitter.split_documents([parent])
    for child in children:
        child.metadata["parent_id"] = parent.metadata["id"]

# At query time: retrieve child chunks, but pass parent chunks to the LLM.
# This gives you precise retrieval AND sufficient context.
```
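The query-time half of parent-child chunking is a lookup-and-deduplicate step. A sketch with hypothetical data shapes (in practice the hits come from your vector store and the parent texts from a document store):

```python
# Hypothetical query-time flow: search over small child chunks, then swap
# in their parents (deduplicated, ranking order preserved) before prompting.
parents_by_id = {
    "p1": "FULL PARENT SECTION 1 ...",
    "p2": "FULL PARENT SECTION 2 ...",
}
# Each hit: (child_text, parent_id, score), already sorted by score
child_hits = [
    ("precise sentence from section 2", "p2", 0.91),
    ("another sentence from section 2", "p2", 0.88),
    ("sentence from section 1", "p1", 0.85),
]

seen, llm_context = set(), []
for _, parent_id, _ in child_hits:
    if parent_id not in seen:
        seen.add(parent_id)
        llm_context.append(parents_by_id[parent_id])
```

Deduplication matters here: several children of the same parent often rank highly for one query, and sending the parent twice wastes context window.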
Step 2: Embedding and Vector Storage
```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
```
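Under the hood, vector search scores chunks by cosine similarity between the query embedding and each stored embedding. A self-contained sketch with toy 3-dimensional vectors (real embeddings have 1536+ dimensions):

```python
import math

# Cosine similarity: the scoring function behind dense vector search
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 0.9, 0.2]
chunk_a = [0.1, 0.8, 0.3]  # Points in a similar direction to the query
chunk_b = [0.9, 0.1, 0.1]  # Points elsewhere
```

Vector databases don't compute this exhaustively; they use approximate nearest-neighbor indexes (HNSW, IVF) to make the search sublinear.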
Embedding Model Comparison
| Model | Dimensions | Cost (per 1M tokens) | Quality | Best For |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | Good | Cost-sensitive, general use |
| text-embedding-3-large | 3072 | $0.13 | Best | High-quality retrieval |
| text-embedding-ada-002 | 1536 | $0.10 | Legacy | Backward compatibility only |
| Cohere embed-v3 | 1024 | $0.10 | Excellent | Multilingual, compression |
| Open source (e5-large-v2) | 1024 | Free (self-hosted) | Good | Privacy, no API dependency |
Pinecone Integration
```python
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("knowledge-base")

# Embed in batches rather than one API call per chunk
texts = [chunk.page_content for chunk in chunks]
embeddings = []
for i in range(0, len(texts), 100):
    embeddings.extend(embed_texts(texts[i:i + 100]))

# Build vectors with metadata
vectors = []
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    vectors.append({
        "id": f"doc-{chunk.metadata['source']}-{i}",
        "values": embedding,
        "metadata": {
            "text": chunk.page_content,
            "source": chunk.metadata["source"],
            "doc_type": chunk.metadata["doc_type"],
            "last_updated": chunk.metadata["last_updated"],
        },
    })

# Batch upsert (100 vectors at a time for performance)
for i in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[i:i + 100], namespace="docs")
```
Step 3: Retrieval
```python
def retrieve(query: str, top_k: int = 5, filters: dict | None = None) -> list[dict]:
    query_embedding = embed_texts([query])[0]
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter=filters,  # e.g., {"doc_type": "policy"}
        namespace="docs",
    )
    return [
        {
            "text": match.metadata["text"],
            "source": match.metadata["source"],
            "score": match.score,
        }
        for match in results.matches
    ]
```
Hybrid Search (Dense + Sparse)
Combine semantic understanding (dense embeddings) with exact keyword matching (sparse/BM25). This handles both conceptual queries (“how do I handle errors?”) and specific terms (“CORS configuration”).
```python
# Combine semantic search with keyword matching
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)  # corpus: list of chunk texts

def hybrid_search(query: str, alpha: float = 0.7):
    """alpha=1.0 is pure semantic, alpha=0.0 is pure keyword."""
    dense = embed_texts([query])[0]
    sparse = bm25.encode_queries(query)
    # Apply the convex weighting -- without this, alpha has no effect
    dense = [v * alpha for v in dense]
    sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return index.query(
        vector=dense,
        sparse_vector=sparse,
        top_k=10,
        include_metadata=True,
    )
```
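If your vector store lacks native sparse+dense support, you can run the two searches separately and merge them client-side. Reciprocal Rank Fusion (RRF) is a common choice; a minimal sketch with illustrative document IDs:

```python
# Reciprocal Rank Fusion: merge multiple ranked lists without tuning weights.
# k=60 is the conventional default; it dampens the dominance of top ranks.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc3", "doc1", "doc7"]  # Dense-retrieval ranking
keyword = ["doc1", "doc9", "doc3"]   # BM25 ranking
fused = rrf_fuse([semantic, keyword])
```

RRF only needs ranks, not scores, so it sidesteps the problem that dense and BM25 scores live on incomparable scales.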
Reranking (Precision Boost)
Retrieve more results than needed with vector search, then rerank for precision:
```python
import cohere

cohere_client = cohere.Client(api_key="your-cohere-key")

# Retrieve 20 candidates, rerank down to the top 5
initial_results = retrieve(query, top_k=20)
reranked = cohere_client.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[r["text"] for r in initial_results],
    top_n=5,
)
# Map reranked indices back to the original result objects
top_results = [initial_results[r.index] for r in reranked.results]
```
Step 4: Generation with Context
```python
def rag_query(user_question: str) -> dict:
    # Retrieve relevant chunks
    context_chunks = retrieve(user_question, top_k=5)
    context = "\n\n---\n\n".join(c["text"] for c in context_chunks)
    sources = list({c["source"] for c in context_chunks})

    # Build prompt with grounding instructions
    prompt = f"""Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer that."
Always cite which source document supports your answer.

Context:
{context}

Question: {user_question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # Low temperature for factual responses
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": sources,
    }
```
Step 5: Evaluation
| Metric | What It Measures | How to Calculate | Target |
|---|---|---|---|
| Retrieval Precision | Are retrieved chunks relevant? | Manual review of top-K | > 80% |
| Retrieval Recall | Are all relevant chunks found? | Compare to ground truth set | > 70% |
| Answer Faithfulness | Does the answer match the retrieved context? | LLM-as-judge evaluation | > 90% |
| Answer Relevance | Does the answer address the user’s question? | LLM-as-judge evaluation | > 85% |
| Hallucination Rate | Claims not supported by context? | Manual + LLM check | < 5% |
| Latency | End-to-end response time | Measure retrieve + generate | < 3 seconds |
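The precision and recall rows above reduce to simple set arithmetic once you have a labeled ground-truth set. A sketch with illustrative chunk IDs (in practice, build 50+ labeled query-to-chunk pairs):

```python
# Retrieval precision/recall against a hand-labeled ground-truth set.
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"chunk-12", "chunk-7", "chunk-33", "chunk-2", "chunk-9"}
relevant = {"chunk-12", "chunk-7", "chunk-50"}  # Labeled by a human reviewer
p, r = precision_recall(retrieved, relevant)
# Here precision is 2/5 and recall 2/3 -- both below the targets, so iterate
```

Run this over every query in the ground-truth set and average; single-query numbers are too noisy to act on.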
Common Failure Modes
| Failure | Cause | Fix |
|---|---|---|
| Irrelevant retrieval | Chunks too large, wrong embedding model | Reduce chunk size, try different embedding, add reranking |
| Hallucinations | No context match, model fills gaps confidently | Add “I don’t know” instruction, lower temperature, add citation requirement |
| Missing context | Important info split across chunk boundaries | Increase overlap, use parent-child chunking |
| Stale answers | Source documents not re-indexed after updates | Automated re-indexing pipeline (daily or on-change) |
| Slow response | Large context window, too many chunks | Reduce top-K, use reranking to filter, stream response |
| Wrong document type retrieved | No metadata filtering | Add doc_type metadata, filter at query time |
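The stale-answers fix can be driven by content hashing: re-embed only documents whose hash changed since the last index run. A minimal sketch (the document names and texts are illustrative; in practice the hash map lives in a database next to the vector IDs):

```python
import hashlib

# Detect changed documents for incremental re-indexing
def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

previously_indexed = {"policy.md": content_hash("Old refund policy.")}
current_docs = {"policy.md": "New refund policy.", "faq.md": "Q: ..."}

to_reindex = [
    name for name, text in current_docs.items()
    if previously_indexed.get(name) != content_hash(text)
]
```

Remember to also delete vectors for documents that disappeared from the source, or retrieval will keep surfacing them.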
When NOT to Use RAG
RAG isn’t always the right pattern. Avoid RAG when:
- Data changes faster than you can index — Real-time stock prices, live scores, or sub-minute data freshness requirements are better served by direct API calls.
- The knowledge base is tiny — If your entire corpus fits in a single prompt (under 100K tokens), just include it as context. No vector store needed.
- You need deterministic answers — RAG introduces variability based on which chunks are retrieved. For tax calculations, regulatory compliance outputs, or structured data lookups, use traditional code.
- Your queries are purely structured — “What was our Q3 revenue?” is a SQL query, not a RAG query. Don’t force natural language retrieval where structured queries are cleaner.
RAG vs Fine-Tuning Quick Decision
| Factor | Use RAG | Use Fine-Tuning |
|---|---|---|
| Data changes frequently | ✅ | ❌ |
| Need source attribution | ✅ | ❌ |
| Private data, can’t send to API | Consider self-hosted | ✅ Self-hosted model |
| Need specific tone/style | ❌ | ✅ |
| Large knowledge base (10K+ docs) | ✅ | ❌ (context limits) |
| Real-time web data | ❌ (use tool calling) | ❌ |
| Budget-constrained | ✅ (cheaper) | ❌ (training costs) |
RAG Checklist
- Chunking strategy defined and tested per document type
- Parent-child chunking evaluated for precision vs context trade-off
- Embedding model selected and benchmarked on representative queries
- Vector store deployed with metadata filtering configured
- Hybrid search (dense + sparse) evaluated for keyword-heavy queries
- Reranking tested for precision improvement
- Retrieval tested (precision > 80% on sample queries)
- Prompt template includes grounding instructions and citation requirements
- Hallucination mitigation in place (low temperature, “I don’t know”)
- Re-indexing pipeline for source document updates (automated)
- Evaluation framework with ground truth dataset (50+ Q&A pairs)
- Monitoring: track retrieval scores, latency, and user feedback
- Cost projection: embedding + LLM costs per query at projected volume
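The cost-projection item in the checklist is quick arithmetic. A sketch with assumed prices and token counts (check your provider's current pricing; every constant here is an assumption):

```python
# Back-of-envelope per-query cost: one embedding call plus one LLM call.
EMBED_PRICE_PER_M = 0.02  # assumed embedding price, $ per 1M tokens
LLM_IN_PER_M = 2.50       # assumed LLM input price, $ per 1M tokens
LLM_OUT_PER_M = 10.00     # assumed LLM output price, $ per 1M tokens

def cost_per_query(query_tokens=30, context_tokens=2500, answer_tokens=300):
    embed_cost = query_tokens / 1e6 * EMBED_PRICE_PER_M
    llm_cost = ((query_tokens + context_tokens) / 1e6 * LLM_IN_PER_M
                + answer_tokens / 1e6 * LLM_OUT_PER_M)
    return embed_cost + llm_cost

monthly = cost_per_query() * 100_000  # projected 100K queries/month
```

Note that retrieved context dominates the bill, which is another reason reranking down to fewer, better chunks pays off.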
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI engineering consulting, visit garnetgrid.com.
:::