RAG Architecture: Beyond Basic Retrieval
Build production-grade RAG systems. Covers chunking strategies, embedding models, hybrid search, reranking, query transformation, evaluation, and advanced patterns for enterprise retrieval-augmented generation.
Basic RAG — embed documents, throw them in a vector database, retrieve top-K chunks, paste them into a prompt — works for demos. It fails in production. Queries are ambiguous. Chunks lose context. Embeddings miss keywords. The LLM hallucinates despite having the right context. The retrieval quality degrades silently. Enterprise RAG requires engineering at every layer: chunking, embedding, retrieval, reranking, prompt construction, and evaluation.
This guide covers the architecture of production-grade RAG systems, with the engineering techniques that separate a toy prototype from a system that actually helps users find accurate answers.
RAG Architecture Layers
User Query
↓
┌── Query Processing ──────────┐
│ • Query expansion │
│ • Intent classification │
│ • Decomposition (multi-step) │
└──────────────────────────────┘
↓
┌── Retrieval ─────────────────┐
│ • Dense (semantic) search │
│ • Sparse (keyword) search │
│ • Hybrid (dense + sparse) │
│ • Metadata filtering │
└──────────────────────────────┘
↓
┌── Post-Retrieval ────────────┐
│ • Reranking │
│ • Deduplication │
│ • Context compression │
└──────────────────────────────┘
↓
┌── Generation ────────────────┐
│ • Prompt construction │
│ • Citation management │
│ • Grounding validation │
└──────────────────────────────┘
↓
Answer (with citations)
Chunking Strategies
The way you split documents determines retrieval quality more than any other decision:
| Strategy | Best For | Chunk Size | Overlap |
|---|---|---|---|
| Fixed-size | Consistent-format docs | 500-1000 tokens | 10-20% |
| Semantic (header-based) | Structured docs with headings | Variable | None |
| Recursive | Mixed-format content | 500-1500 tokens | 50-100 tokens |
| Sentence-window | Documents requiring precise retrieval | 1-3 sentences (with surrounding context stored) | N/A |
| Parent-child | Documents with hierarchies | Small (retrieval) → Large (context) | N/A |
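As a baseline for the table above, fixed-size chunking with overlap is only a few lines. This sketch approximates tokens with whitespace words; a real pipeline would count tokens with the embedding model's own tokenizer:

```python
def fixed_size_chunks(text, chunk_size=500, overlap=75):
    """Split text into word-based chunks with overlap (words approximate tokens)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap (here 15%, in the table's 10-20% range) keeps sentences that straddle a boundary retrievable from at least one chunk.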
Parent-Child Chunking
Retrieve on small, precise chunks but pass larger parent chunks to the LLM:
def create_parent_child_chunks(document, small_size=200, large_size=1000):
    """Create small chunks for retrieval, linked to large chunks for context."""
    # Create parent (large) chunks from the document text
    parents = split_text(document.text, chunk_size=large_size)
    all_chunks = []
    for parent_idx, parent in enumerate(parents):
        parent_id = f"parent_{parent_idx}"
        # Create child (small) chunks within each parent
        children = split_text(parent, chunk_size=small_size)
        for child_idx, child in enumerate(children):
            all_chunks.append({
                "id": f"child_{parent_idx}_{child_idx}",
                "text": child,            # Small: for embedding + retrieval
                "parent_id": parent_id,
                "parent_text": parent,    # Large: for LLM context
                "metadata": {
                    "source": document.source,
                    "page": document.page,
                },
            })
    return all_chunks
# At retrieval time:
# 1. Search child chunks (small, precise)
# 2. Return parent chunks (large, full context) to LLM
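The retrieval-time swap can be sketched as follows. The key detail is deduplicating parents, since several of the top-scoring children often belong to the same parent (`child_hits` is assumed to be the score-ordered chunk dicts produced above):

```python
def retrieve_parents(child_hits, top_n=4):
    """Swap retrieved child chunks for their parents, deduplicating shared parents.

    child_hits: chunk dicts from create_parent_child_chunks, ordered by score.
    """
    seen = set()
    parents = []
    for hit in child_hits:
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            parents.append(hit["parent_text"])
        if len(parents) == top_n:
            break
    return parents
```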
Hybrid Search
Combine dense (semantic) and sparse (keyword) search:
def hybrid_search(query, collection, alpha=0.7):
    """Combine dense and sparse retrieval with weighted Reciprocal Rank Fusion."""
    # Dense search (semantic similarity)
    dense_results = collection.search(
        query_embedding=embed(query),
        top_k=20,
    )
    # Sparse search (BM25 keyword matching)
    sparse_results = collection.bm25_search(
        query=query,
        top_k=20,
    )
    # Reciprocal Rank Fusion (RRF): score = 1 / (rank + k) with k=60,
    # weighted by alpha toward dense results
    combined = {}
    for rank, result in enumerate(dense_results):
        combined[result.id] = combined.get(result.id, 0) + alpha / (rank + 60)
    for rank, result in enumerate(sparse_results):
        combined[result.id] = combined.get(result.id, 0) + (1 - alpha) / (rank + 60)
    # Sort by combined score; returns (id, score) pairs
    ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return ranked[:10]
| When To Lean Dense | When To Lean Sparse |
|---|---|
| Semantic/conceptual queries | Exact term matching (product codes, names) |
| "How does X compare to Y?" | "ERROR-5042 resolution" |
| Natural language questions | Keyword-heavy technical queries |
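The `bm25_search` call above is assumed to come from the vector store or search engine. Where it does not, a minimal Okapi BM25 scorer with whitespace tokenization looks roughly like this (production systems should use a proper analyzer and inverted index instead):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)   # document frequency
            tf = doc.count(term)                          # term frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores
```

Exact-token queries like the error-code example score sharply on the matching document and zero elsewhere, which is exactly what dense embeddings tend to blur.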
Reranking
Initial retrieval casts a wide net. Reranking picks the most relevant results:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_results(query, search_results, top_n=5):
    """Cross-encoder reranking for higher precision."""
    pairs = [(query, result["text"]) for result in search_results]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(search_results, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [result for result, score in ranked[:top_n]]
Impact: Reranking typically improves precision by 10-25% over embedding-only retrieval.
Query Processing
Query Decomposition
Complex queries should be broken into sub-queries:
import json

def decompose_query(query):
    prompt = f"""Break this complex question into simpler sub-questions that
can each be answered independently, then combined.
Question: {query}
Sub-questions (JSON array):"""
    sub_queries = json.loads(llm.generate(prompt, temperature=0))
    return sub_queries

# "How does our Q3 revenue compare to Q2, and what drove the difference?"
# → ["What was Q3 revenue?", "What was Q2 revenue?",
#    "What factors changed between Q2 and Q3?"]
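Decomposition needs a matching synthesis step: answer each sub-question against its own retrieved context, then combine. A sketch, where `answer_fn` stands in for the single-question RAG pipeline and `llm_generate` for the model call (both hypothetical names):

```python
def answer_decomposed(query, sub_queries, answer_fn, llm_generate):
    """Answer each sub-question independently, then synthesize a final answer."""
    sub_answers = [(sq, answer_fn(sq)) for sq in sub_queries]
    context = "\n".join(f"Q: {sq}\nA: {ans}" for sq, ans in sub_answers)
    prompt = (
        "Using only these sub-answers, answer the original question.\n\n"
        f"{context}\n\nOriginal question: {query}\nAnswer:"
    )
    return llm_generate(prompt)
```

Keeping the sub-answers in the synthesis prompt (rather than raw chunks) bounds context size and preserves per-sub-question citations.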
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, embed that, and search for similar real documents:
def hyde_search(query, collection):
    """Generate a hypothetical answer, then retrieve with its embedding."""
    hypothetical = llm.generate(
        f"Write a short, factual answer to: {query}",
        temperature=0.5,
    )
    # Embed the hypothetical answer (not the query)
    results = collection.search(
        query_embedding=embed(hypothetical),
        top_k=10,
    )
    return results
RAG Evaluation
| Metric | What It Measures | Target |
|---|---|---|
| Context Precision | % of retrieved chunks that are relevant | > 70% |
| Context Recall | % of relevant chunks that were retrieved | > 80% |
| Answer Faithfulness | Is the answer grounded in retrieved context? | > 90% |
| Answer Relevancy | Does the answer address the question? | > 85% |
| Hallucination Rate | % of claims not supported by context | < 5% |
def evaluate_rag(test_set, rag_pipeline):
    results = {"faithfulness": [], "relevancy": [], "hallucination_rate": []}
    for test in test_set:
        answer, sources = rag_pipeline(test["query"])
        # Faithfulness: fraction of extracted claims grounded in the sources
        claims = extract_claims(answer)
        grounded = sum(1 for c in claims if is_grounded(c, sources))
        faithfulness = grounded / len(claims) if claims else 1.0
        results["faithfulness"].append(faithfulness)
        # Hallucination rate is the complement of faithfulness
        results["hallucination_rate"].append(1.0 - faithfulness)
        # Relevancy: does the answer address the question?
        relevancy = judge_relevancy(test["query"], answer)
        results["relevancy"].append(relevancy)
    return {k: sum(v) / len(v) for k, v in results.items()}
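The `is_grounded` judge above is typically another LLM call. A crude lexical-overlap baseline is useful as a cheap sanity check on the judge (not a substitute for it): treat a claim as grounded if enough of its content words appear in some source. The stopword list here is an illustrative assumption:

```python
def is_grounded_lexical(claim, sources, threshold=0.6):
    """Grounded if >= threshold of the claim's content words appear in one source."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "was"}
    words = {w for w in claim.lower().split() if w not in stop}
    if not words:
        return True
    for source in sources:
        source_words = set(source.lower().split())
        if len(words & source_words) / len(words) >= threshold:
            return True
    return False
```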
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Naive chunking | Fixed 500-char chunks split mid-sentence | Use semantic chunking with sentence boundaries |
| Top-K only | Retrieval returns 5 chunks, 3 are irrelevant | Add reranking layer, use hybrid search |
| No metadata filtering | Search returns docs from wrong department/date | Add metadata filters before vector search |
| Stuffing all context | All chunks stuffed into prompt regardless of relevance | Compress context, use map-reduce for long contexts |
| No evaluation | No way to know if retrieval quality is degrading | Build eval set with ground truth, measure weekly |
| Vector DB as the answer | Treating vector search as the only retrieval method | Hybrid search (dense + sparse + metadata) |
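The metadata-filtering fix can be as simple as restricting the candidate set before any vector scoring. A sketch over the chunk dicts produced earlier; the `department` and `date` metadata fields are hypothetical examples, and real stores push this filter down into the index rather than scanning in Python:

```python
def filter_by_metadata(chunks, department=None, after_date=None):
    """Restrict candidate chunks by metadata before vector scoring.

    after_date: ISO date string (lexicographic comparison works for ISO dates).
    """
    out = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        if department and meta.get("department") != department:
            continue
        if after_date and meta.get("date", "") < after_date:
            continue
        out.append(chunk)
    return out
```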
Checklist
- Chunking strategy selected and tested (parent-child, semantic, sliding window)
- Embedding model benchmarked on domain-specific queries
- Hybrid search configured (dense + sparse + metadata filters)
- Reranking layer added with cross-encoder
- Query processing: decomposition, expansion, HyDE where appropriate
- Context window management: compression for long contexts
- Citation management: sources tracked through pipeline
- Evaluation dataset built (50+ query-answer pairs with ground truth)
- Faithfulness and relevancy metrics tracked
- Hallucination detection pipeline in place
- Monitoring: retrieval quality, latency, cost dashboards
- Re-indexing pipeline automated for updated source documents
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For RAG architecture consulting, visit garnetgrid.com. :::