# RAG Architecture Patterns: When Vector Search Is Not Enough
Design Retrieval-Augmented Generation systems that actually work in production. Covers chunking strategies, embedding models, hybrid search, reranking, evaluation metrics, and the failure modes that textbook RAG implementations ignore.
RAG — Retrieval-Augmented Generation — has become the default answer to “how do I make an LLM use my data?” The basic idea is simple: search a knowledge base, stuff the relevant chunks into the prompt, and let the model answer. Tutorials make this look like 20 lines of code.
In production, it is 20 lines of code plus 6 months of debugging why the system confidently gives wrong answers, misses relevant documents, hallucinates citations, and costs $50,000/month in API calls.
This guide covers the architecture decisions that separate demo-quality RAG from production-quality RAG.
## The Naive RAG Pipeline (And Why It Fails)
```
Document → Chunk into 500 tokens → Embed with OpenAI → Store in vector DB
Query → Embed → Find top-5 similar chunks → Stuff into prompt → LLM answers
```
This works for demos. Here is why it breaks in production:
| Failure Mode | Why It Happens | How Often |
|---|---|---|
| Retrieves wrong chunks | Embedding similarity ≠ semantic relevance | Very common |
| Misses relevant info | Answer spans multiple chunks, none retrieved | Common |
| Hallucinates citations | Model references chunks that do not support its answer | Common |
| Exceeds context window | Too many chunks + long query = truncation | Moderate |
| Slow retrieval | Large vector index + no filtering = high latency | At scale |
| Stale data | Documents updated but embeddings not re-indexed | Common |
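The naive pipeline is easy to sketch end to end. The version below stubs out the embedding model with a bag-of-words overlap score so it runs standalone — in a real system, `embed` calls an embedding API and `similarity` is cosine distance over vectors:

```python
# Naive RAG pipeline sketch. The embedding step is stubbed out with a
# bag-of-words set; production code calls an embedding model instead.

def embed(text: str) -> set[str]:
    # Stand-in "embedding": lowercase word set (real: a dense vector).
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard overlap as a stand-in for cosine similarity.
    return len(a & b) / max(len(a | b), 1)

def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Rank every chunk against the query, keep the top_k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # "Stuff into prompt": concatenate retrieved chunks as context.
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The payment API SLA is 99.95% availability.",
    "Deploys run every Tuesday at noon.",
    "Our office is in Berlin.",
]
query = "What is the payment API SLA?"
prompt = build_prompt(query, retrieve(query, chunks, top_k=1))
```

Every failure mode in the table lives somewhere in these few functions: `retrieve` picks the wrong chunks, `build_prompt` overflows the context window, and nothing here detects stale source documents.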
## Chunking: The Foundation Everyone Gets Wrong
How you split documents determines everything downstream. Bad chunking = bad retrieval = bad answers.
### Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simplest; reasonable baseline |
| Sentence-based | Split at sentence boundaries | Preserving complete thoughts |
| Paragraph-based | Split at paragraph boundaries | Structured documents |
| Semantic | Split when topic changes (embedding similarity) | Long documents with varied topics |
| Recursive | Try paragraph → sentence → fixed as fallback | General purpose |
| Document-aware | Use document structure (headers, sections) | Technical docs, wikis, manuals |
### Practical Chunking Configuration
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Production-tested defaults:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # characters by default; see length_function below
    chunk_overlap=50,     # ~10% overlap prevents splitting mid-thought
    separators=[
        "\n## ",   # Markdown H2 headers
        "\n### ",  # Markdown H3 headers
        "\n\n",    # Paragraph breaks
        "\n",      # Line breaks
        ". ",      # Sentences
        " ",       # Words (last resort)
    ],
    length_function=len,  # character count; swap in a tiktoken-based
                          # function to make chunk_size mean tokens
)
```
The chunk size tradeoff:
- Smaller chunks (256 tokens): More precise retrieval, but answers may need information from multiple chunks. Increases retrieval complexity.
- Larger chunks (1024 tokens): More context per chunk, but retrieval is less precise and you fit fewer chunks in the prompt.
- Sweet spot for most cases: 400-600 tokens with 10% overlap.
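If you want to think in tokens without pulling in a tokenizer dependency, the common rule of thumb of roughly 4 characters per token for English text gets you close enough for sizing; for exact counts, swap in `tiktoken`. A minimal sketch of the heuristic:

```python
# Approximate token counting without a tokenizer dependency.
# Rule of thumb: ~4 characters per token for typical English text.
# For exact counts, use a tiktoken encoding as length_function instead.

def approx_token_len(text: str) -> int:
    # Integer division; never return 0 for non-empty text.
    return max(1, len(text) // 4)

# A 512-token chunk target is therefore roughly 2048 characters:
target_chars = 512 * 4
```

The approximation drifts for code-heavy or non-English text, so treat it as a sizing heuristic, not an accounting tool.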
## Beyond Naive Vector Search: Hybrid Retrieval
Pure vector search compares embedding similarity, which captures semantic meaning but misses exact matches. Keyword search finds exact terms but misses synonyms and paraphrases. Production RAG uses both.
### Hybrid Search Architecture
```
Query: "What is the SLA for our payment API?"

Vector search (semantic):
  → Finds chunks about "service level agreements" and "uptime guarantees"
  → Misses: chunks that literally say "payment API SLA: 99.95%"

Keyword search (BM25):
  → Finds chunks containing "SLA" and "payment API"
  → Misses: chunks about "availability commitments" (same concept, different words)

Hybrid search (both):
  → Combines results using Reciprocal Rank Fusion (RRF)
  → Gets both semantic matches AND exact keyword matches
  → Answer: "The payment API SLA is 99.95% availability..."
```
```python
# Reciprocal Rank Fusion: combine results from multiple search methods
def reciprocal_rank_fusion(result_lists: list[list], k: int = 60) -> list:
    """
    Merge multiple ranked result lists using RRF.
    k=60 is the standard constant; higher k = less weight on top results.
    Returns (doc_id, score) pairs, best first.
    """
    scores = {}
    for result_list in result_lists:
        for rank, doc_id in enumerate(result_list):
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1 / (rank + k)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
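To see fusion in action, here is a self-contained example (the RRF scorer is restated so the snippet runs on its own; the doc IDs are made up). A document that appears in both result lists outranks one that tops only a single list:

```python
# Fuse two ranked lists of doc IDs (best first) with Reciprocal Rank Fusion.
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rank + k)
    return [d for d, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]

# Hypothetical results from the two retrievers:
vector_results = ["doc_sla_overview", "doc_uptime", "doc_pricing"]
keyword_results = ["doc_payment_sla", "doc_sla_overview", "doc_billing"]

fused = rrf([vector_results, keyword_results])
# doc_sla_overview appears in both lists, so it wins overall.
```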
## Reranking: The Step That Doubles Accuracy
After retrieval, you have 10-20 candidate chunks. Not all of them are actually relevant. A reranker model scores each chunk against the original query and re-orders them by actual relevance.
Without reranking:

```
Top 5 chunks by embedding similarity → Some are relevant, some are noise
LLM uses all 5 → Answer includes irrelevant information
```

With reranking:

```
Top 20 chunks by embedding similarity (cast a wide net)
  → Reranker scores each against the query
  → Top 5 by reranker score → All highly relevant
  → LLM uses 5 high-quality chunks → Better answer
```
| Reranker | Speed | Quality | Cost |
|---|---|---|---|
| Cohere Rerank | Fast (API) | High | $1/1K queries |
| BGE Reranker v2 | Medium (local) | High | Free (self-hosted) |
| Cross-encoder (ms-marco) | Slow (local) | Highest | Free (self-hosted) |
| LLM-as-reranker (GPT-4) | Slow (API) | Very high | Expensive |
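The retrieve-wide-then-rerank pattern is the same regardless of which reranker you choose, so it helps to keep the scoring function pluggable. A sketch with a toy word-overlap scorer standing in for the real model — a production deployment would replace `overlap_score` with a cross-encoder or a rerank API call:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    # Score every candidate against the query, keep the best top_n.
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    # Toy relevance score: count of shared lowercase words.
    # Swap in a cross-encoder or rerank API here in production.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return float(len(q & c))

candidates = [
    "Tuesday deploy schedule and release notes.",
    "The payment API SLA is 99.95% availability.",
    "Payment API rate limits are 100 requests per second.",
]
top = rerank("payment api sla", candidates, overlap_score, top_n=2)
```

Because the scorer is a parameter, you can A/B test rerankers from the table above against your evaluation set without touching the retrieval code.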
## Metadata Filtering: The Overlooked Optimization
Vector search alone searches everything. Adding metadata filters narrows the search space before similarity comparison, which is both faster and more accurate.
```python
# Without metadata: search all 500K chunks
results = index.query(
    vector=query_embedding,
    top_k=10
)

# With metadata: search only the relevant subset
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "source": "engineering-handbook",
        "category": "infrastructure",
        "last_updated": {"$gte": "2024-01-01"}
    }
)
```
### Essential Metadata Fields
| Field | Why It Matters | Example |
|---|---|---|
| `source` | Filter by document source | "confluence", "github", "notion" |
| `category` | Topic-based filtering | "security", "infrastructure", "api" |
| `last_updated` | Freshness filtering | "2024-07-15" |
| `document_id` | Group chunks from the same doc | "doc_abc123" |
| `chunk_index` | Retrieve surrounding context | 0, 1, 2, 3… |
| `access_level` | Permission-based retrieval | "public", "internal", "confidential" |
## Evaluation: How to Know If Your RAG Works
| Metric | What It Measures | How to Calculate |
|---|---|---|
| Retrieval Precision | Are the retrieved chunks relevant? | Relevant chunks / total retrieved |
| Retrieval Recall | Did we find all relevant chunks? | Retrieved relevant / total relevant |
| Faithfulness | Does the answer match the retrieved context? | LLM judge or human eval |
| Answer Relevance | Does the answer address the question? | LLM judge or human eval |
| Latency | How long does end-to-end take? | Timer from query to response |
| Cost per query | How much does each query cost? | API costs (embedding + LLM + reranking) |
The minimum viable evaluation set: 50 question-answer pairs with labeled relevant documents. Run retrieval against this set after every pipeline change. If retrieval precision drops, your answer quality will drop — regardless of what the LLM does.
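Retrieval precision and recall take a few lines to compute once each question in your evaluation set has a labeled set of relevant documents:

```python
# Per-query retrieval metrics, given labeled relevant documents.
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return {
        # Precision: what fraction of what we retrieved was relevant?
        "precision": hits / len(retrieved) if retrieved else 0.0,
        # Recall: what fraction of the relevant docs did we find?
        "recall": hits / len(relevant) if relevant else 0.0,
    }

# Hypothetical query: retrieved 4 docs, 2 of the 3 relevant ones among them.
m = retrieval_metrics(retrieved=["d1", "d2", "d3", "d4"], relevant={"d1", "d3", "d9"})
```

Average these across the full evaluation set after every pipeline change; a drop in either number is your early warning, regardless of how the end-to-end answers look.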
## Implementation Checklist
- Choose chunking strategy based on your document types (start with recursive, 512 tokens, 10% overlap)
- Implement hybrid search: vector (semantic) + keyword (BM25) with reciprocal rank fusion
- Add a reranking step after retrieval to filter noise from top results
- Attach metadata to every chunk: source, category, date, document ID, chunk index
- Build an evaluation set of 50+ QA pairs with labeled relevant documents
- Implement parent document retrieval: when a chunk matches, include surrounding chunks
- Add freshness filters: exclude chunks from outdated documents
- Monitor cost per query and set budget alerts
- Track retrieval latency: initial target < 500ms for retrieval, < 3s end-to-end
- Set up automated re-indexing when source documents change
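The parent-document-retrieval item above is where the `document_id` and `chunk_index` metadata fields pay off: when a chunk matches, pull its neighbors from the same document so the LLM sees the surrounding context. A sketch, assuming chunks are stored as dicts carrying that metadata:

```python
# Expand a matching chunk with its neighbors from the same document.
# Each chunk is a dict with "document_id", "chunk_index", and "text".
def expand_with_neighbors(hit: dict, all_chunks: list[dict], window: int = 1) -> list[dict]:
    doc = hit["document_id"]
    lo, hi = hit["chunk_index"] - window, hit["chunk_index"] + window
    neighbors = [c for c in all_chunks
                 if c["document_id"] == doc and lo <= c["chunk_index"] <= hi]
    # Return in document order so the context reads coherently.
    return sorted(neighbors, key=lambda c: c["chunk_index"])

# Hypothetical index: four chunks from doc_a, one from doc_b.
chunks = [
    {"document_id": "doc_a", "chunk_index": i, "text": f"doc_a chunk {i}"}
    for i in range(4)
] + [{"document_id": "doc_b", "chunk_index": 0, "text": "doc_b chunk 0"}]

context = expand_with_neighbors(chunks[2], chunks, window=1)
```

In a real vector store, the linear scan over `all_chunks` becomes a metadata-filtered query on `document_id` and a `chunk_index` range.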