RAG Architecture Patterns
Design production-ready Retrieval-Augmented Generation systems. Covers chunking strategies, embedding models, vector search, reranking, context window optimization, hybrid search, evaluation frameworks, and the patterns that make RAG reliable.
Retrieval-Augmented Generation (RAG) combines the reasoning ability of LLMs with the factual grounding of a knowledge base. Instead of relying solely on the LLM’s training data, which goes stale and leaves the model free to hallucinate, RAG retrieves relevant documents and includes them in the prompt context. This grounds the LLM’s response in real, up-to-date information.
RAG Architecture
User Query
↓
1. Query Processing
- Query rewriting / expansion
- Intent classification
↓
2. Retrieval
- Embed query → vector
- Search vector store (ANN)
- Keyword search (BM25)
- Hybrid = vector + keyword
↓
3. Reranking
- Cross-encoder reranker
- Score retrieved chunks by relevance
- Select top-K most relevant
↓
4. Context Assembly
- Format chunks for prompt
- Add system instructions
- Respect context window limit
↓
5. Generation
- LLM generates response
- Grounded in retrieved context
↓
6. Post-Processing
- Citation extraction
- Hallucination detection
- Response formatting
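Concretely, the pipeline is a thin orchestration layer over these six components. The sketch below shows the control flow; every helper it calls (rewrite_query, hybrid_search, rerank, assemble_context, generate, extract_citations) is a hypothetical stand-in for the pieces detailed in the rest of this article.

def answer(query):
    # 1. Query processing: rewrite/expand the raw query for better recall
    search_query = rewrite_query(query)

    # 2. Retrieval: cast a wide net with hybrid dense + keyword search
    candidates = hybrid_search(search_query, top_k=20)

    # 3. Reranking: a cross-encoder narrows candidates to the best few
    top_chunks = rerank(query, candidates, top_n=5)

    # 4. Context assembly: format chunks + instructions within the window
    messages = assemble_context(query, top_chunks)

    # 5. Generation: the LLM answers, grounded in the retrieved context
    response = generate(messages)

    # 6. Post-processing: extract citations, flag ungrounded claims
    return {"answer": response, "citations": extract_citations(response)}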
Chunking Strategies
# Fixed-size chunking (simple but breaks context)
def fixed_size_chunks(text, chunk_size=512, overlap=50):
chunks = []
for i in range(0, len(text), chunk_size - overlap):
chunks.append(text[i:i + chunk_size])
return chunks
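The overlap gives neighboring chunks a shared margin, so a sentence sliced at a boundary survives intact in at least one of them. The tradeoff is duplicated tokens in the index.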
# Semantic chunking (better context preservation)
import tiktoken

# count_tokens is used throughout this article; tiktoken's cl100k_base is one
# reasonable tokenizer -- swap in whichever matches your embedding model
_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    return len(_enc.encode(text))

def semantic_chunks(text, max_tokens=512):
    """Split on natural boundaries: paragraphs, sections, sentences."""
sections = text.split('\n\n')
chunks = []
current_chunk = []
current_size = 0
for section in sections:
section_tokens = count_tokens(section)
if current_size + section_tokens > max_tokens and current_chunk:
chunks.append('\n\n'.join(current_chunk))
current_chunk = [section]
current_size = section_tokens
else:
current_chunk.append(section)
current_size += section_tokens
if current_chunk:
chunks.append('\n\n'.join(current_chunk))
return chunks
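One caveat with this sketch: a single section longer than max_tokens still becomes one oversized chunk. A production version would recursively fall back to sentence-level splitting for such sections.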
# Document-aware chunking
# Split by: heading, page boundary, code block, table
# Preserve metadata: source file, section title, page number
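A minimal sketch of the heading-aware variant for markdown sources, with metadata carried alongside each chunk (the regex and field names here are illustrative assumptions, not a standard):

import re

def markdown_chunks(text, source_file):
    """Split a markdown document before each heading and keep metadata.
    Sketch only: real pipelines also handle pages, code blocks, and tables."""
    chunks = []
    for section in re.split(r'\n(?=#{1,6} )', text):
        first_line = section.split('\n', 1)[0]
        title = first_line.lstrip('#').strip() if first_line.startswith('#') else None
        chunks.append({
            'text': section.strip(),
            'source': source_file,        # feeds the [Source: ...] citations later
            'section_title': title,
        })
    return chunks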
Hybrid Search
from pinecone import Pinecone
# Combine dense (semantic) + sparse (keyword) search
pc = Pinecone()
index = pc.Index("knowledge-base")
# embedding_model, bm25_encoder, and reranker are assumed to be initialized
# elsewhere (e.g. a sentence-transformers model and a fitted BM25 encoder)
# Dense (semantic) vector from the embedding model
dense_embedding = embedding_model.encode(query)
# Sparse (keyword) vector from BM25
sparse_embedding = bm25_encoder.encode(query)
results = index.query(
vector=dense_embedding,
sparse_vector=sparse_embedding,
top_k=20,
include_metadata=True
)
# Rerank with a cross-encoder for precision
# (Pinecone returns hits under results.matches, not by iterating the response)
reranked = reranker.rerank(
    query=query,
    documents=[m.metadata['text'] for m in results.matches],
    top_n=5
)
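The reranker above is a placeholder. One common implementation uses a sentence-transformers cross-encoder; the sketch below assumes the ms-marco-MiniLM-L-6-v2 checkpoint, a popular choice rather than a requirement.

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly. It is far more
# precise than bi-encoder similarity but also slower, which is why it runs on
# ~20 retrieved candidates rather than the whole corpus.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, documents, top_n=5):
    scores = cross_encoder.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]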
Context Assembly
def assemble_context(query, retrieved_chunks, max_context_tokens=4000):
"""Build prompt with retrieved context."""
system_prompt = """You are a helpful assistant. Answer the user's question
based ONLY on the provided context. If the context doesn't contain the
answer, say "I don't have enough information to answer that."
Always cite your sources using [Source: filename] format."""
context_parts = []
token_count = count_tokens(system_prompt + query)
for chunk in retrieved_chunks:
chunk_tokens = count_tokens(chunk['text'])
if token_count + chunk_tokens > max_context_tokens:
break
context_parts.append(
f"[Source: {chunk['source']}]\n{chunk['text']}"
)
token_count += chunk_tokens
context = "\n\n---\n\n".join(context_parts)
return [
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
]
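Because the system prompt mandates a [Source: filename] citation format, the post-processing step from the architecture diagram becomes simple string work. A minimal sketch, assuming the model follows the format:

import re

def extract_citations(response_text):
    """Pull [Source: filename] citations out of a generated answer."""
    return sorted(set(re.findall(r'\[Source: ([^\]]+)\]', response_text)))

def suspect_citations(response_text, retrieved_chunks):
    """Citations that match no retrieved chunk -- a cheap hallucination signal."""
    cited = set(extract_citations(response_text))
    available = {chunk['source'] for chunk in retrieved_chunks}
    return cited - available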
Evaluation
| Metric | Measures | How |
|---|---|---|
| Retrieval Precision@K | Fraction of top-K results that are relevant | Human-labeled relevance judgments |
| Retrieval Recall | Fraction of relevant docs retrieved | Compare against a labeled ground-truth set |
| Answer Faithfulness | Whether the answer is grounded in the context | LLM-as-judge against the retrieved context |
| Answer Relevance | Whether the answer addresses the question | LLM-as-judge against the query |
| Hallucination Rate | How often the answer fabricates information | Fact-check claims against sources |
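Given labeled relevance judgments, the two retrieval metrics reduce to set arithmetic. A sketch, assuming every document carries a stable ID:

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall(retrieved_ids, relevant_ids):
    """Fraction of all relevant documents that were retrieved at all."""
    if not relevant_ids:
        return 0.0
    return sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids) / len(relevant_ids)

Faithfulness and relevance resist simple arithmetic; they typically need an LLM-as-judge comparing the answer against the retrieved context and the original query, respectively.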
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No reranking | Irrelevant context pollutes response | Cross-encoder reranker after retrieval |
| Chunks too small | Lose context, fragmented answers | 256-512 token chunks with overlap |
| Chunks too large | Dilute relevance, waste context window | Split on natural boundaries |
| No source citation | Cannot verify answers | Include source metadata in prompt |
| Dense search only | Misses keyword-specific queries | Hybrid search (dense + sparse) |
RAG turns LLMs from unreliable knowledge sources into reliable answer engines, but only when the retrieval is good. As a rule of thumb, a RAG system's quality is roughly 80% retrieval and 20% generation: if the right chunks never reach the context window, no amount of prompting will recover the answer.