How to Implement RAG (Retrieval-Augmented Generation)
Build production RAG pipelines. Covers chunking strategies, embedding models, vector stores, retrieval techniques, evaluation, and common failure modes.
RAG is the highest-ROI pattern for enterprise AI right now. It lets you ground LLM responses in your own data without the cost and complexity of fine-tuning. But naive implementations fail badly — wrong chunks retrieved, hallucinated answers, stale data. This guide covers the complete RAG pipeline from document ingestion to production evaluation, with practical techniques that separate working systems from demos.
The core idea: instead of training the model on your data, you retrieve relevant context at query time and include it in the prompt. This keeps the model general-purpose while grounding answers in your specific documents, policies, and knowledge base.
Architecture Overview
Offline Pipeline (Index Time):

```
Documents → Chunk → Embed → Store (Vector DB)
```

Online Pipeline (Query Time):

```
User Query → Embed → Search → Top-K Results → Rerank
                                                 ↓
                      Prompt (Context + Query) → LLM → Response
```
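Before diving into the real components, the two pipelines above can be sketched end to end in plain Python. This toy uses keyword overlap in place of real embeddings, and the document strings and function names are illustrative only:

```python
# Toy end-to-end sketch of the offline and online pipelines, using token
# overlap instead of real embeddings (all names and data are illustrative).
docs = [
    "Refunds are processed within 5 business days.",
    "Support is available Monday through Friday.",
]

# Offline: "embed" each document as a set of lowercase tokens
store = [(doc, set(doc.lower().split())) for doc in docs]

# Online: score by token overlap, take top-k, build a grounded prompt
def answer(query: str, top_k: int = 1) -> str:
    q = set(query.lower().split())
    ranked = sorted(store, key=lambda item: len(q & item[1]), reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

prompt = answer("How long do refunds take?")
```

A real system swaps the token sets for embedding vectors and the overlap score for cosine similarity, but the shape of the flow is identical.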
Step 1: Document Ingestion and Chunking
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken

# Chunking strategy matters more than model choice
encoding = tiktoken.get_encoding("cl100k_base")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # Tokens, as counted by length_function below
    chunk_overlap=50,  # Overlap prevents context loss at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=lambda text: len(encoding.encode(text)),
)

# Process documents
chunks = splitter.split_documents(documents)

# Add metadata for filtering
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "source": chunk.metadata.get("filename", "unknown"),
        "chunk_index": i,
        "doc_type": "policy",          # Enables filtered search
        "department": "engineering",
        "last_updated": "2025-11-15",  # Track freshness
    })
```
Chunking Decision Matrix
| Document Type | Chunk Size | Overlap | Strategy | Why |
|---|---|---|---|---|
| Technical docs | 512 tokens | 50 | Recursive (headers → paragraphs) | Balance detail vs context |
| Legal / policy | 1024 tokens | 100 | Paragraph-level | Need full clause context |
| Code files | Function-level | 0 | AST-based splitting | Semantic boundaries |
| Q&A / FAQ | 256 tokens | 0 | One chunk per question | Self-contained answers |
| Emails / chat | 512 tokens | 25 | Message-level | Conversation context |
| Tables / CSV | Row-group | 0 | Row-based with headers | Keep header → row mapping |
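The AST-based strategy for code files in the table above can be sketched with Python's standard-library `ast` module. This minimal version emits one chunk per top-level function or class and skips decorators and module-level statements for brevity; the function name is hypothetical:

```python
import ast

# Split Python source into one chunk per top-level function/class:
# semantic boundaries beat fixed-size chunks for code retrieval.
def split_code_by_ast(source: str) -> list[str]:
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-indexed and inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

source = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
chunks = split_code_by_ast(source)
```

For other languages, tree-sitter grammars give the same function-level boundaries.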
Advanced: Parent-Child Chunking
```python
import uuid

# Small chunks for retrieval precision, large chunks for LLM context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)

parent_chunks = parent_splitter.split_documents(documents)
for parent in parent_chunks:
    parent.metadata["id"] = str(uuid.uuid4())  # ID links children back to parents
    children = child_splitter.split_documents([parent])
    for child in children:
        child.metadata["parent_id"] = parent.metadata["id"]

# At query time: retrieve child chunks, but pass parent chunks to the LLM.
# This gives you precise retrieval AND sufficient context.
```
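The query-time half of parent-child chunking is a lookup-and-deduplicate step. A sketch with hypothetical data shapes (in practice the hits come from your vector store and the parent texts from a document store):

```python
# Hypothetical query-time flow: search over small child chunks, then swap
# in their parents (deduplicated, ranking order preserved) before prompting.
parents_by_id = {
    "p1": "FULL PARENT SECTION 1 ...",
    "p2": "FULL PARENT SECTION 2 ...",
}
# Each hit: (child_text, parent_id, score), already sorted by score
child_hits = [
    ("precise sentence from section 2", "p2", 0.91),
    ("another sentence from section 2", "p2", 0.88),
    ("sentence from section 1", "p1", 0.85),
]

seen, llm_context = set(), []
for _, parent_id, _ in child_hits:
    if parent_id not in seen:
        seen.add(parent_id)
        llm_context.append(parents_by_id[parent_id])
```

Deduplication matters here: several children of the same parent often rank highly for one query, and sending the parent twice wastes context window.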
Step 2: Embedding and Vector Storage
```python
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
```
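Under the hood, vector search scores chunks by cosine similarity between the query embedding and each stored embedding. A self-contained sketch with toy 3-dimensional vectors (real embeddings have 1536+ dimensions):

```python
import math

# Cosine similarity: the scoring function behind dense vector search
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.1, 0.9, 0.2]
chunk_a = [0.1, 0.8, 0.3]  # Points in a similar direction to the query
chunk_b = [0.9, 0.1, 0.1]  # Points elsewhere
```

Vector databases don't compute this exhaustively; they use approximate nearest-neighbor indexes (HNSW, IVF) to make the search sublinear.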
Embedding Model Comparison
| Model | Dimensions | Cost (per 1M tokens) | Quality | Best For |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | Good | Cost-sensitive, general use |
| text-embedding-3-large | 3072 | $0.13 | Best | High-quality retrieval |
| text-embedding-ada-002 | 1536 | $0.10 | Legacy | Backward compatibility only |
| Cohere embed-v3 | 1024 | $0.10 | Excellent | Multilingual, compression |
| Open source (e5-large-v2) | 1024 | Free (self-hosted) | Good | Privacy, no API dependency |
Pinecone Integration
```python
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("knowledge-base")

# Embed in batches rather than one API call per chunk
texts = [chunk.page_content for chunk in chunks]
embeddings = []
for i in range(0, len(texts), 100):
    embeddings.extend(embed_texts(texts[i:i + 100]))

# Build vectors with metadata
vectors = []
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    vectors.append({
        "id": f"doc-{chunk.metadata['source']}-{i}",
        "values": embedding,
        "metadata": {
            "text": chunk.page_content,
            "source": chunk.metadata["source"],
            "doc_type": chunk.metadata["doc_type"],
            "last_updated": chunk.metadata["last_updated"],
        },
    })

# Batch upsert (100 vectors at a time for performance)
for i in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[i:i + 100], namespace="docs")
```
Step 3: Retrieval
```python
def retrieve(query: str, top_k: int = 5, filters: dict | None = None) -> list[dict]:
    query_embedding = embed_texts([query])[0]
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter=filters,  # e.g., {"doc_type": "policy"}
        namespace="docs",
    )
    return [
        {
            "text": match.metadata["text"],
            "source": match.metadata["source"],
            "score": match.score,
        }
        for match in results.matches
    ]
```
Hybrid Search (Dense + Sparse)
Combine semantic understanding (dense embeddings) with exact keyword matching (sparse/BM25). This handles both conceptual queries (“how do I handle errors?”) and specific terms (“CORS configuration”).
```python
# Combine semantic search with keyword matching
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus)  # corpus: list of chunk texts

def hybrid_search(query: str, alpha: float = 0.7):
    """alpha=1.0 is pure semantic, alpha=0.0 is pure keyword."""
    dense = embed_texts([query])[0]
    sparse = bm25.encode_queries(query)
    # Apply the convex weighting -- without this, alpha has no effect
    dense = [v * alpha for v in dense]
    sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    return index.query(
        vector=dense,
        sparse_vector=sparse,
        top_k=10,
        include_metadata=True,
    )
```
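If your vector store lacks native sparse+dense support, you can run the two searches separately and merge them client-side. Reciprocal Rank Fusion (RRF) is a common choice; a minimal sketch with illustrative document IDs:

```python
# Reciprocal Rank Fusion: merge multiple ranked lists without tuning weights.
# k=60 is the conventional default; it dampens the dominance of top ranks.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc3", "doc1", "doc7"]  # Dense-retrieval ranking
keyword = ["doc1", "doc9", "doc3"]   # BM25 ranking
fused = rrf_fuse([semantic, keyword])
```

RRF only needs ranks, not scores, so it sidesteps the problem that dense and BM25 scores live on incomparable scales.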
Reranking (Precision Boost)
Retrieve more results than needed with vector search, then rerank for precision:
```python
import cohere

cohere_client = cohere.Client(api_key="your-cohere-key")

# Retrieve 20 candidates, rerank down to the top 5
initial_results = retrieve(query, top_k=20)
reranked = cohere_client.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[r["text"] for r in initial_results],
    top_n=5,
)
# Map reranked indices back to the original result objects
top_results = [initial_results[r.index] for r in reranked.results]
```
Step 4: Generation with Context
```python
def rag_query(user_question: str) -> dict:
    # Retrieve relevant chunks
    context_chunks = retrieve(user_question, top_k=5)
    context = "\n\n---\n\n".join(c["text"] for c in context_chunks)
    sources = list({c["source"] for c in context_chunks})

    # Build prompt with grounding instructions
    prompt = f"""Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer that."
Always cite which source document supports your answer.

Context:
{context}

Question: {user_question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # Low temperature for factual responses
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": sources,
    }
```
Step 5: Evaluation
| Metric | What It Measures | How to Calculate | Target |
|---|---|---|---|
| Retrieval Precision | Are retrieved chunks relevant? | Manual review of top-K | > 80% |
| Retrieval Recall | Are all relevant chunks found? | Compare to ground truth set | > 70% |
| Answer Faithfulness | Does the answer match the retrieved context? | LLM-as-judge evaluation | > 90% |
| Answer Relevance | Does the answer address the user’s question? | LLM-as-judge evaluation | > 85% |
| Hallucination Rate | Claims not supported by context? | Manual + LLM check | < 5% |
| Latency | End-to-end response time | Measure retrieve + generate | < 3 seconds |
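The precision and recall rows above reduce to simple set arithmetic once you have a labeled ground-truth set. A sketch with illustrative chunk IDs (in practice, build 50+ labeled query-to-chunk pairs):

```python
# Retrieval precision/recall against a hand-labeled ground-truth set.
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"chunk-12", "chunk-7", "chunk-33", "chunk-2", "chunk-9"}
relevant = {"chunk-12", "chunk-7", "chunk-50"}  # Labeled by a human reviewer
p, r = precision_recall(retrieved, relevant)
# Here precision is 2/5 and recall 2/3 -- both below the targets, so iterate
```

Run this over every query in the ground-truth set and average; single-query numbers are too noisy to act on.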
Common Failure Modes
| Failure | Cause | Fix |
|---|---|---|
| Irrelevant retrieval | Chunks too large, wrong embedding model | Reduce chunk size, try different embedding, add reranking |
| Hallucinations | No context match, model fills gaps confidently | Add “I don’t know” instruction, lower temperature, add citation requirement |
| Missing context | Important info split across chunk boundaries | Increase overlap, use parent-child chunking |
| Stale answers | Source documents not re-indexed after updates | Automated re-indexing pipeline (daily or on-change) |
| Slow response | Large context window, too many chunks | Reduce top-K, use reranking to filter, stream response |
| Wrong document type retrieved | No metadata filtering | Add doc_type metadata, filter at query time |
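The stale-answers fix can be driven by content hashing: re-embed only documents whose hash changed since the last index run. A minimal sketch (the document names and texts are illustrative; in practice the hash map lives in a database next to the vector IDs):

```python
import hashlib

# Detect changed documents for incremental re-indexing
def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

previously_indexed = {"policy.md": content_hash("Old refund policy.")}
current_docs = {"policy.md": "New refund policy.", "faq.md": "Q: ..."}

to_reindex = [
    name for name, text in current_docs.items()
    if previously_indexed.get(name) != content_hash(text)
]
```

Remember to also delete vectors for documents that disappeared from the source, or retrieval will keep surfacing them.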
When NOT to Use RAG
RAG isn’t always the right pattern. Avoid RAG when:
- Data changes faster than you can index — Real-time stock prices, live scores, or sub-minute data freshness requirements are better served by direct API calls.
- The knowledge base is tiny — If your entire corpus fits in a single prompt (under 100K tokens), just include it as context. No vector store needed.
- You need deterministic answers — RAG introduces variability based on which chunks are retrieved. For tax calculations, regulatory compliance outputs, or structured data lookups, use traditional code.
- Your queries are purely structured — “What was our Q3 revenue?” is a SQL query, not a RAG query. Don’t force natural language retrieval where structured queries are cleaner.
RAG vs Fine-Tuning Quick Decision
| Factor | Use RAG | Use Fine-Tuning |
|---|---|---|
| Data changes frequently | ✅ | ❌ |
| Need source attribution | ✅ | ❌ |
| Private data, can’t send to API | Consider self-hosted | ✅ Self-hosted model |
| Need specific tone/style | ❌ | ✅ |
| Large knowledge base (10K+ docs) | ✅ | ❌ (context limits) |
| Real-time web data | ❌ (use tool calling) | ❌ |
| Budget-constrained | ✅ (cheaper) | ❌ (training costs) |
RAG Checklist
- Chunking strategy defined and tested per document type
- Parent-child chunking evaluated for precision vs context trade-off
- Embedding model selected and benchmarked on representative queries
- Vector store deployed with metadata filtering configured
- Hybrid search (dense + sparse) evaluated for keyword-heavy queries
- Reranking tested for precision improvement
- Retrieval tested (precision > 80% on sample queries)
- Prompt template includes grounding instructions and citation requirements
- Hallucination mitigation in place (low temperature, “I don’t know”)
- Re-indexing pipeline for source document updates (automated)
- Evaluation framework with ground truth dataset (50+ Q&A pairs)
- Monitoring: track retrieval scores, latency, and user feedback
- Cost projection: embedding + LLM costs per query at projected volume
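The cost-projection item in the checklist is quick arithmetic. A sketch with assumed prices and token counts (check your provider's current pricing; every constant here is an assumption):

```python
# Back-of-envelope per-query cost: one embedding call plus one LLM call.
EMBED_PRICE_PER_M = 0.02  # assumed embedding price, $ per 1M tokens
LLM_IN_PER_M = 2.50       # assumed LLM input price, $ per 1M tokens
LLM_OUT_PER_M = 10.00     # assumed LLM output price, $ per 1M tokens

def cost_per_query(query_tokens=30, context_tokens=2500, answer_tokens=300):
    embed_cost = query_tokens / 1e6 * EMBED_PRICE_PER_M
    llm_cost = ((query_tokens + context_tokens) / 1e6 * LLM_IN_PER_M
                + answer_tokens / 1e6 * LLM_OUT_PER_M)
    return embed_cost + llm_cost

monthly = cost_per_query() * 100_000  # projected 100K queries/month
```

Note that retrieved context dominates the bill, which is another reason reranking down to fewer, better chunks pays off.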
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI engineering consulting, visit garnetgrid.com.
:::