
How to Implement RAG (Retrieval-Augmented Generation)

Build production RAG pipelines. Covers chunking strategies, embedding models, vector stores, retrieval techniques, evaluation, and common failure modes.

RAG is the highest-ROI pattern for enterprise AI right now. It lets you ground LLM responses in your own data without the cost and complexity of fine-tuning. But naive implementations fail badly — wrong chunks retrieved, hallucinated answers, stale data. This guide covers the complete RAG pipeline from document ingestion to production evaluation, with practical techniques that separate working systems from demos.

The core idea: instead of training the model on your data, you retrieve relevant context at query time and include it in the prompt. This keeps the model general-purpose while grounding answers in your specific documents, policies, and knowledge base.


Architecture Overview

Offline Pipeline (Index Time):
Documents → Chunk → Embed → Store (Vector DB)

Online Pipeline (Query Time):
User Query → Embed → Search → Top-K Results → Rerank →
Prompt (Context + Query) → LLM → Response

Step 1: Document Ingestion and Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunking strategy matters more than model choice
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,           # Characters with length_function=len; pass a token counter for token-based sizing
    chunk_overlap=50,         # Overlap prevents context loss at boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

# Process documents
chunks = splitter.split_documents(documents)

# Add metadata for filtering (split_documents already carries each parent
# document's metadata, e.g. "source", onto its chunks)
for i, chunk in enumerate(chunks):
    chunk.metadata.update({
        "chunk_index": i,
        "doc_type": "policy",      # Enables filtered search
        "department": "engineering",
        "last_updated": "2025-11-15",  # Track freshness
    })

Chunking Decision Matrix

| Document Type | Chunk Size | Overlap | Strategy | Why |
|---|---|---|---|---|
| Technical docs | 512 tokens | 50 | Recursive (headers → paragraphs) | Balance detail vs context |
| Legal / policy | 1024 tokens | 100 | Paragraph-level | Need full clause context |
| Code files | Function-level | 0 | AST-based splitting | Semantic boundaries |
| Q&A / FAQ | 256 tokens | 0 | One chunk per question | Self-contained answers |
| Emails / chat | 512 tokens | 25 | Message-level | Conversation context |
| Tables / CSV | Row-group | 0 | Row-based with headers | Keep header → row mapping |
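The "AST-based splitting" row above can be sketched with the standard-library `ast` module: one chunk per top-level function or class, so chunk boundaries fall on semantic units instead of arbitrary character counts. This is a minimal illustration, not a full code splitter (nested defs stay inside their parent chunk, and module-level statements are skipped):

```python
# AST-based splitting for Python source: one chunk per top-level
# function or class, so chunks end on semantic boundaries.
import ast

def split_python_source(source: str) -> list[dict]:
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "text": ast.get_source_segment(source, node),
                "start_line": node.lineno,
            })
    return chunks

code = '''
def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''
chunks = split_python_source(code)
# One chunk for `add`, one for `Greeter` (methods stay inside their class chunk)
```

For other languages, tree-sitter grammars give the same function-level boundaries.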

Advanced: Parent-Child Chunking

# Small chunks for retrieval precision, large chunks for LLM context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=0)

parent_chunks = parent_splitter.split_documents(documents)
for parent_id, parent in enumerate(parent_chunks):
    parent.metadata["id"] = str(parent_id)   # Assign an id before linking children
    children = child_splitter.split_documents([parent])
    for child in children:
        child.metadata["parent_id"] = parent.metadata["id"]

# At query time: retrieve child chunks, but pass parent chunks to LLM
# This gives you precise retrieval AND sufficient context
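The query-time half of parent-child chunking can be sketched as follows. `child_matches` stands in for vector-store results, and `parent_store` is a hypothetical id → parent-text mapping built at index time (both names are illustrative, not a library API):

```python
# Search over small child chunks, then deduplicate up to their parents
# and pass the parent texts to the LLM.
def expand_to_parents(child_matches: list[dict], parent_store: dict[str, str]) -> list[str]:
    seen: set[str] = set()
    parents: list[str] = []
    for match in child_matches:            # matches arrive sorted by score
        pid = match["metadata"]["parent_id"]
        if pid not in seen:                # keep best-score order, drop duplicates
            seen.add(pid)
            parents.append(parent_store[pid])
    return parents

# Example: two children of parent "p1" and one of "p2"
store = {"p1": "full parent text 1", "p2": "full parent text 2"}
matches = [
    {"metadata": {"parent_id": "p1"}},
    {"metadata": {"parent_id": "p2"}},
    {"metadata": {"parent_id": "p1"}},
]
contexts = expand_to_parents(matches, store)   # two unique parents, p1 first
```

Deduplication matters here: several children of the same parent often rank highly for one query, and sending the same parent twice wastes context window.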

Step 2: Embedding and Vector Storage

from openai import OpenAI
import numpy as np

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
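Retrieval then boils down to comparing embedding vectors, almost always by cosine similarity. A small numpy sketch, with toy 3-d vectors standing in for real 1536-d embeddings:

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between the vectors: 1.0 = same direction,
    # 0.0 = orthogonal (unrelated), -1.0 = opposite.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q = [1.0, 0.0, 1.0]
d = [1.0, 0.0, 1.0]
cosine_similarity(q, d)   # identical direction → 1.0
```

In production the vector store computes this for you (Pinecone's `metric="cosine"`); the sketch is just to make the scoring concrete.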

Embedding Model Comparison

| Model | Dimensions | Cost (per 1M tokens) | Quality | Best For |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | Good | Cost-sensitive, general use |
| text-embedding-3-large | 3072 | $0.13 | Best | High-quality retrieval |
| text-embedding-ada-002 | 1536 | $0.10 | Legacy | Backward compatibility only |
| Cohere embed-v3 | 1024 | $0.10 | Excellent | Multilingual, compression |
| Open source (e5-large-v2) | 1024 | Free (self-hosted) | Good | Privacy, no API dependency |

Pinecone Integration

from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("knowledge-base")

# Embed all chunks in one API call (batch further if you exceed the
# API's input limit), then build vector records for upsert
embeddings = embed_texts([chunk.page_content for chunk in chunks])

vectors = []
for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
    vectors.append({
        "id": f"doc-{chunk.metadata['source']}-{i}",
        "values": embedding,
        "metadata": {
            "text": chunk.page_content,
            "source": chunk.metadata["source"],
            "doc_type": chunk.metadata["doc_type"],
            "last_updated": chunk.metadata["last_updated"],
        }
    })

# Batch upsert (100 vectors at a time for performance)
for i in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[i:i+100], namespace="docs")
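The loop-counter IDs above (`doc-{source}-{i}`) work, but re-indexing is safer when IDs are fully deterministic: if the same document and position always hash to the same ID, re-running ingestion overwrites stale vectors instead of duplicating them. A stdlib sketch (the helper name is illustrative):

```python
import hashlib

def chunk_id(source: str, chunk_index: int) -> str:
    """Stable ID: the same document + position maps to the same vector
    on every ingestion run, making upserts idempotent."""
    raw = f"{source}:{chunk_index}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

chunk_id("handbook.pdf", 0)   # same inputs → same id on every run
```

Hashing the chunk *content* instead of its position is an alternative; it detects unchanged chunks, but leaves orphaned vectors behind when content moves, so it needs a cleanup pass.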

Step 3: Retrieval

def retrieve(query: str, top_k: int = 5, filters: dict | None = None) -> list:
    query_embedding = embed_texts([query])[0]

    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter=filters,          # e.g., {"doc_type": "policy"}
        namespace="docs",
    )

    return [
        {
            "text": match.metadata["text"],
            "source": match.metadata["source"],
            "score": match.score,
        }
        for match in results.matches
    ]

Hybrid Search (Dense + Sparse)

Combine semantic understanding (dense embeddings) with exact keyword matching (sparse/BM25). This handles both conceptual queries (“how do I handle errors?”) and specific terms (“CORS configuration”).

# Combine semantic search with keyword matching
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit([chunk.page_content for chunk in chunks])   # Fit on the indexed corpus

def hybrid_search(query, alpha=0.7):
    """alpha=1.0 is pure semantic, alpha=0.0 is pure keyword"""
    dense = embed_texts([query])[0]
    sparse = bm25.encode_queries(query)

    # Apply the convex weighting — without scaling, alpha has no effect
    dense = [v * alpha for v in dense]
    sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }

    results = index.query(
        vector=dense,
        sparse_vector=sparse,
        top_k=10,
        include_metadata=True,
    )
    return results

Reranking (Precision Boost)

Retrieve more results than needed with vector search, then rerank for precision:

import cohere

cohere_client = cohere.Client(api_key="your-cohere-key")

# Retrieve 20, rerank to top 5
initial_results = retrieve(query, top_k=20)

reranked = cohere_client.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[r["text"] for r in initial_results],
    top_n=5,
)
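Reranker responses typically reference documents by their position in the submitted list, so the ranked positions still need to be mapped back onto the original result dicts (text, source, retrieval score). A sketch, where `ranked` stands in for reranker output as (index, relevance_score) pairs:

```python
# Map reranker output (positions + relevance scores) back onto the
# original retrieval results, preserving their metadata.
def apply_rerank(initial_results: list[dict], ranked: list[tuple[int, float]]) -> list[dict]:
    final = []
    for idx, score in ranked:
        item = dict(initial_results[idx])   # copy so the original list is untouched
        item["rerank_score"] = score
        final.append(item)
    return final

results = [
    {"text": "chunk A", "source": "a.md"},
    {"text": "chunk B", "source": "b.md"},
]
# Suppose the reranker decided B is more relevant than A
top = apply_rerank(results, [(1, 0.92), (0, 0.31)])
```

Keeping both the vector-search score and the rerank score in each item is useful later for debugging retrieval quality.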

Step 4: Generation with Context

def rag_query(user_question: str) -> dict:
    # Retrieve relevant chunks
    context_chunks = retrieve(user_question, top_k=5)
    context = "\n\n---\n\n".join([c["text"] for c in context_chunks])
    sources = list(set([c["source"] for c in context_chunks]))

    # Build prompt with grounding instructions
    prompt = f"""Answer the question based ONLY on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer that."
Always cite which source document supports your answer.

Context:
{context}

Question: {user_question}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # Low temperature for factual responses
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": sources,
    }

Step 5: Evaluation

| Metric | What It Measures | How to Calculate | Target |
|---|---|---|---|
| Retrieval Precision | Are retrieved chunks relevant? | Manual review of top-K | > 80% |
| Retrieval Recall | Are all relevant chunks found? | Compare to ground truth set | > 70% |
| Answer Faithfulness | Does the answer match the retrieved context? | LLM-as-judge evaluation | > 90% |
| Answer Relevance | Does the answer address the user's question? | LLM-as-judge evaluation | > 85% |
| Hallucination Rate | Claims not supported by context? | Manual + LLM check | < 5% |
| Latency | End-to-end response time | Measure retrieve + generate | < 3 seconds |
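The LLM-as-judge metrics can be scored by asking a judge model, claim by claim, whether the retrieved context supports each claim, then taking the supported fraction. A sketch of the scoring side; the `judge` callable is a stand-in for an actual LLM call returning "yes"/"no" (the toy judge below is only for illustration):

```python
# Faithfulness as the fraction of answer claims the judge deems
# supported by the retrieved context.
def faithfulness_score(claims: list[str], context: str, judge) -> float:
    if not claims:
        return 1.0   # nothing asserted, nothing unfaithful
    supported = sum(
        1 for claim in claims
        if judge(f"Context:\n{context}\n\nClaim: {claim}\nSupported? yes/no") == "yes"
    )
    return supported / len(claims)

# Toy judge: "supported" iff the claim text mentions "30 days"
context = "The refund window is 30 days."
toy_judge = lambda prompt: "yes" if "30 days" in prompt.split("Claim:")[1] else "no"

score = faithfulness_score(
    ["Refunds are allowed within 30 days", "Refunds require a receipt"],
    context,
    toy_judge,
)
# 1 of 2 claims supported → 0.5
```

In practice the claims themselves are extracted by an LLM from the generated answer, and the judge call uses a low temperature so verdicts are reproducible.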

Common Failure Modes

| Failure | Cause | Fix |
|---|---|---|
| Irrelevant retrieval | Chunks too large, wrong embedding model | Reduce chunk size, try different embedding, add reranking |
| Hallucinations | No context match, model fills gaps confidently | Add "I don't know" instruction, lower temperature, add citation requirement |
| Missing context | Important info split across chunk boundaries | Increase overlap, use parent-child chunking |
| Stale answers | Source documents not re-indexed after updates | Automated re-indexing pipeline (daily or on-change) |
| Slow response | Large context window, too many chunks | Reduce top-K, use reranking to filter, stream response |
| Wrong document type retrieved | No metadata filtering | Add doc_type metadata, filter at query time |

When NOT to Use RAG

RAG isn’t always the right pattern. Avoid RAG when:

  • Data changes faster than you can index — Real-time stock prices, live scores, or sub-minute data freshness requirements are better served by direct API calls.
  • The knowledge base is tiny — If your entire corpus fits in a single prompt (under 100K tokens), just include it as context. No vector store needed.
  • You need deterministic answers — RAG introduces variability based on which chunks are retrieved. For tax calculations, regulatory compliance outputs, or structured data lookups, use traditional code.
  • Your queries are purely structured — “What was our Q3 revenue?” is a SQL query, not a RAG query. Don’t force natural language retrieval where structured queries are cleaner.

RAG vs Fine-Tuning Quick Decision

| Factor | Use RAG | Use Fine-Tuning |
|---|---|---|
| Data changes frequently | ✅ | ❌ (requires retraining) |
| Need source attribution | ✅ | ❌ |
| Private data, can't send to API | Consider self-hosted | ✅ Self-hosted model |
| Need specific tone/style | ❌ | ✅ |
| Large knowledge base (10K+ docs) | ✅ | ❌ (context limits) |
| Real-time web data | ❌ (use tool calling) | ❌ |
| Budget-constrained | ✅ (cheaper) | ❌ (training costs) |

RAG Checklist

  • Chunking strategy defined and tested per document type
  • Parent-child chunking evaluated for precision vs context trade-off
  • Embedding model selected and benchmarked on representative queries
  • Vector store deployed with metadata filtering configured
  • Hybrid search (dense + sparse) evaluated for keyword-heavy queries
  • Reranking tested for precision improvement
  • Retrieval tested (precision > 80% on sample queries)
  • Prompt template includes grounding instructions and citation requirements
  • Hallucination mitigation in place (low temperature, “I don’t know”)
  • Re-indexing pipeline for source document updates (automated)
  • Evaluation framework with ground truth dataset (50+ Q&A pairs)
  • Monitoring: track retrieval scores, latency, and user feedback
  • Cost projection: embedding + LLM costs per query at projected volume

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For AI engineering consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
