# RAG Architecture Patterns: When Vector Search Is Not Enough
Design Retrieval-Augmented Generation systems that actually work in production. Covers chunking strategies, embedding models, hybrid search, reranking, evaluation metrics, and the failure modes that textbook RAG implementations ignore.
RAG — Retrieval-Augmented Generation — has become the default answer to “how do I make an LLM use my data?” The basic idea is simple: search a knowledge base, stuff the relevant chunks into the prompt, and let the model answer. Tutorials make this look like 20 lines of code.
In production, it is 20 lines of code plus 6 months of debugging why the system confidently gives wrong answers, misses relevant documents, hallucinates citations, and costs $50,000/month in API calls.
This guide covers the architecture decisions that separate demo-quality RAG from production-quality RAG.
## The Naive RAG Pipeline (And Why It Fails)
```
Document → Chunk into 500 tokens → Embed with OpenAI → Store in vector DB
Query → Embed → Find top-5 similar chunks → Stuff into prompt → LLM answers
```
This works for demos. Here is why it breaks in production:
| Failure Mode | Why It Happens | How Often |
|---|---|---|
| Retrieves wrong chunks | Embedding similarity ≠ semantic relevance | Very common |
| Misses relevant info | Answer spans multiple chunks, none retrieved | Common |
| Hallucinates citations | Model references chunks that do not support its answer | Common |
| Exceeds context window | Too many chunks + long query = truncation | Moderate |
| Slow retrieval | Large vector index + no filtering = high latency | At scale |
| Stale data | Documents updated but embeddings not re-indexed | Common |
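The naive pipeline is easy to sketch end to end. The version below stubs out the embedding model with a bag-of-words overlap score so it runs standalone — in a real system, `embed` calls an embedding API and `similarity` is cosine distance over vectors:

```python
# Naive RAG pipeline sketch. The embedding step is stubbed out with a
# bag-of-words set; production code calls an embedding model instead.

def embed(text: str) -> set[str]:
    # Stand-in "embedding": lowercase word set (real: a dense vector).
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard overlap as a stand-in for cosine similarity.
    return len(a & b) / max(len(a | b), 1)

def retrieve(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Rank every chunk against the query, keep the top_k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # "Stuff into prompt": concatenate retrieved chunks as context.
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The payment API SLA is 99.95% availability.",
    "Deploys run every Tuesday at noon.",
    "Our office is in Berlin.",
]
query = "What is the payment API SLA?"
prompt = build_prompt(query, retrieve(query, chunks, top_k=1))
```

Every failure mode in the table lives somewhere in these few functions: `retrieve` picks the wrong chunks, `build_prompt` overflows the context window, and nothing here detects stale source documents.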
## Chunking: The Foundation Everyone Gets Wrong
How you split documents determines everything downstream. Bad chunking = bad retrieval = bad answers.
### Chunking Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simplest; reasonable baseline |
| Sentence-based | Split at sentence boundaries | Preserving complete thoughts |
| Paragraph-based | Split at paragraph boundaries | Structured documents |
| Semantic | Split when topic changes (embedding similarity) | Long documents with varied topics |
| Recursive | Try paragraph → sentence → fixed as fallback | General purpose |
| Document-aware | Use document structure (headers, sections) | Technical docs, wikis, manuals |
### Practical Chunking Configuration
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Production-tested defaults:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # characters by default; see length_function below
    chunk_overlap=50,     # ~10% overlap prevents splitting mid-thought
    separators=[
        "\n## ",   # Markdown H2 headers
        "\n### ",  # Markdown H3 headers
        "\n\n",    # Paragraph breaks
        "\n",      # Line breaks
        ". ",      # Sentences
        " ",       # Words (last resort)
    ],
    length_function=len,  # character count; swap in a tiktoken-based
                          # function to make chunk_size mean tokens
)
```
The chunk size tradeoff:
- Smaller chunks (256 tokens): More precise retrieval, but answers may need information from multiple chunks. Increases retrieval complexity.
- Larger chunks (1024 tokens): More context per chunk, but retrieval is less precise and you fit fewer chunks in the prompt.
- Sweet spot for most cases: 400-600 tokens with 10% overlap.
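If you want to think in tokens without pulling in a tokenizer dependency, the common rule of thumb of roughly 4 characters per token for English text gets you close enough for sizing; for exact counts, swap in `tiktoken`. A minimal sketch of the heuristic:

```python
# Approximate token counting without a tokenizer dependency.
# Rule of thumb: ~4 characters per token for typical English text.
# For exact counts, use a tiktoken encoding as length_function instead.

def approx_token_len(text: str) -> int:
    # Integer division; never return 0 for non-empty text.
    return max(1, len(text) // 4)

# A 512-token chunk target is therefore roughly 2048 characters:
target_chars = 512 * 4
```

The approximation drifts for code-heavy or non-English text, so treat it as a sizing heuristic, not an accounting tool.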
## Beyond Naive Vector Search: Hybrid Retrieval
Pure vector search compares embedding similarity, which captures semantic meaning but misses exact matches. Keyword search finds exact terms but misses synonyms and paraphrases. Production RAG uses both.
### Hybrid Search Architecture
```
Query: "What is the SLA for our payment API?"

Vector search (semantic):
  → Finds chunks about "service level agreements" and "uptime guarantees"
  → Misses: chunks that literally say "payment API SLA: 99.95%"

Keyword search (BM25):
  → Finds chunks containing "SLA" and "payment API"
  → Misses: chunks about "availability commitments" (same concept, different words)

Hybrid search (both):
  → Combines results using Reciprocal Rank Fusion (RRF)
  → Gets both semantic matches AND exact keyword matches
  → Answer: "The payment API SLA is 99.95% availability..."
```
```python
# Reciprocal Rank Fusion: combine results from multiple search methods
def reciprocal_rank_fusion(result_lists: list[list], k: int = 60) -> list:
    """
    Merge multiple ranked result lists using RRF.
    k=60 is the standard constant; higher k = less weight on top results.
    Returns (doc_id, score) pairs, best first.
    """
    scores = {}
    for result_list in result_lists:
        for rank, doc_id in enumerate(result_list):
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1 / (rank + k)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
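To see fusion in action, here is a self-contained example (the RRF scorer is restated so the snippet runs on its own; the doc IDs are made up). A document that appears in both result lists outranks one that tops only a single list:

```python
# Fuse two ranked lists of doc IDs (best first) with Reciprocal Rank Fusion.
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rank + k)
    return [d for d, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]

# Hypothetical results from the two retrievers:
vector_results = ["doc_sla_overview", "doc_uptime", "doc_pricing"]
keyword_results = ["doc_payment_sla", "doc_sla_overview", "doc_billing"]

fused = rrf([vector_results, keyword_results])
# doc_sla_overview appears in both lists, so it wins overall.
```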
## Reranking: The Step That Doubles Accuracy
After retrieval, you have 10-20 candidate chunks. Not all of them are actually relevant. A reranker model scores each chunk against the original query and re-orders them by actual relevance.
Without reranking:

```
Top 5 chunks by embedding similarity → Some are relevant, some are noise
LLM uses all 5 → Answer includes irrelevant information
```

With reranking:

```
Top 20 chunks by embedding similarity (cast a wide net)
  → Reranker scores each against the query
  → Top 5 by reranker score → All highly relevant
  → LLM uses 5 high-quality chunks → Better answer
```
| Reranker | Speed | Quality | Cost |
|---|---|---|---|
| Cohere Rerank | Fast (API) | High | $1/1K queries |
| BGE Reranker v2 | Medium (local) | High | Free (self-hosted) |
| Cross-encoder (ms-marco) | Slow (local) | Highest | Free (self-hosted) |
| LLM-as-reranker (GPT-4) | Slow (API) | Very high | Expensive |
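The retrieve-wide-then-rerank pattern is the same regardless of which reranker you choose, so it helps to keep the scoring function pluggable. A sketch with a toy word-overlap scorer standing in for the real model — a production deployment would replace `overlap_score` with a cross-encoder or a rerank API call:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    # Score every candidate against the query, keep the best top_n.
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    # Toy relevance score: count of shared lowercase words.
    # Swap in a cross-encoder or rerank API here in production.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return float(len(q & c))

candidates = [
    "Tuesday deploy schedule and release notes.",
    "The payment API SLA is 99.95% availability.",
    "Payment API rate limits are 100 requests per second.",
]
top = rerank("payment api sla", candidates, overlap_score, top_n=2)
```

Because the scorer is a parameter, you can A/B test rerankers from the table above against your evaluation set without touching the retrieval code.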
## Metadata Filtering: The Overlooked Optimization
Vector search alone searches everything. Adding metadata filters narrows the search space before similarity comparison, which is both faster and more accurate.
```python
# Without metadata: search all 500K chunks
results = index.query(
    vector=query_embedding,
    top_k=10
)

# With metadata: search only the relevant subset
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "source": "engineering-handbook",
        "category": "infrastructure",
        "last_updated": {"$gte": "2024-01-01"}
    }
)
```
### Essential Metadata Fields
| Field | Why It Matters | Example |
|---|---|---|
| `source` | Filter by document source | "confluence", "github", "notion" |
| `category` | Topic-based filtering | "security", "infrastructure", "api" |
| `last_updated` | Freshness filtering | "2024-07-15" |
| `document_id` | Group chunks from the same doc | "doc_abc123" |
| `chunk_index` | Retrieve surrounding context | 0, 1, 2, 3… |
| `access_level` | Permission-based retrieval | "public", "internal", "confidential" |
## Evaluation: How to Know If Your RAG Works
| Metric | What It Measures | How to Calculate |
|---|---|---|
| Retrieval Precision | Are the retrieved chunks relevant? | Relevant chunks / total retrieved |
| Retrieval Recall | Did we find all relevant chunks? | Retrieved relevant / total relevant |
| Faithfulness | Does the answer match the retrieved context? | LLM judge or human eval |
| Answer Relevance | Does the answer address the question? | LLM judge or human eval |
| Latency | How long does end-to-end take? | Timer from query to response |
| Cost per query | How much does each query cost? | API costs (embedding + LLM + reranking) |
The minimum viable evaluation set: 50 question-answer pairs with labeled relevant documents. Run retrieval against this set after every pipeline change. If retrieval precision drops, your answer quality will drop — regardless of what the LLM does.
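Retrieval precision and recall take a few lines to compute once each question in your evaluation set has a labeled set of relevant documents:

```python
# Per-query retrieval metrics, given labeled relevant documents.
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return {
        # Precision: what fraction of what we retrieved was relevant?
        "precision": hits / len(retrieved) if retrieved else 0.0,
        # Recall: what fraction of the relevant docs did we find?
        "recall": hits / len(relevant) if relevant else 0.0,
    }

# Hypothetical query: retrieved 4 docs, 2 of the 3 relevant ones among them.
m = retrieval_metrics(retrieved=["d1", "d2", "d3", "d4"], relevant={"d1", "d3", "d9"})
```

Average these across the full evaluation set after every pipeline change; a drop in either number is your early warning, regardless of how the end-to-end answers look.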
## Implementation Checklist
- Choose chunking strategy based on your document types (start with recursive, 512 tokens, 10% overlap)
- Implement hybrid search: vector (semantic) + keyword (BM25) with reciprocal rank fusion
- Add a reranking step after retrieval to filter noise from top results
- Attach metadata to every chunk: source, category, date, document ID, chunk index
- Build an evaluation set of 50+ QA pairs with labeled relevant documents
- Implement parent document retrieval: when a chunk matches, include surrounding chunks
- Add freshness filters: exclude chunks from outdated documents
- Monitor cost per query and set budget alerts
- Track retrieval latency: initial target < 500ms for retrieval, < 3s end-to-end
- Set up automated re-indexing when source documents change
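The parent-document-retrieval item above is where the `document_id` and `chunk_index` metadata fields pay off: when a chunk matches, pull its neighbors from the same document so the LLM sees the surrounding context. A sketch, assuming chunks are stored as dicts carrying that metadata:

```python
# Expand a matching chunk with its neighbors from the same document.
# Each chunk is a dict with "document_id", "chunk_index", and "text".
def expand_with_neighbors(hit: dict, all_chunks: list[dict], window: int = 1) -> list[dict]:
    doc = hit["document_id"]
    lo, hi = hit["chunk_index"] - window, hit["chunk_index"] + window
    neighbors = [c for c in all_chunks
                 if c["document_id"] == doc and lo <= c["chunk_index"] <= hi]
    # Return in document order so the context reads coherently.
    return sorted(neighbors, key=lambda c: c["chunk_index"])

# Hypothetical index: four chunks from doc_a, one from doc_b.
chunks = [
    {"document_id": "doc_a", "chunk_index": i, "text": f"doc_a chunk {i}"}
    for i in range(4)
] + [{"document_id": "doc_b", "chunk_index": 0, "text": "doc_b chunk 0"}]

context = expand_with_neighbors(chunks[2], chunks, window=1)
```

In a real vector store, the linear scan over `all_chunks` becomes a metadata-filtered query on `document_id` and a `chunk_index` range.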