RAG Architecture: Beyond Basic Retrieval
Build production-grade RAG systems. Covers chunking strategies, embedding models, hybrid search, reranking, query transformation, evaluation, and advanced patterns for enterprise retrieval-augmented generation.
Basic RAG — embed documents, throw them in a vector database, retrieve top-K chunks, paste them into a prompt — works for demos. It fails in production. Queries are ambiguous. Chunks lose context. Embeddings miss keywords. The LLM hallucinates despite having the right context. The retrieval quality degrades silently. Enterprise RAG requires engineering at every layer: chunking, embedding, retrieval, reranking, prompt construction, and evaluation.
This guide covers the architecture of production-grade RAG systems, with the engineering techniques that separate a toy prototype from a system that actually helps users find accurate answers.
RAG Architecture Layers
User Query
↓
┌── Query Processing ──────────┐
│ • Query expansion │
│ • Intent classification │
│ • Decomposition (multi-step) │
└──────────────────────────────┘
↓
┌── Retrieval ─────────────────┐
│ • Dense (semantic) search │
│ • Sparse (keyword) search │
│ • Hybrid (dense + sparse) │
│ • Metadata filtering │
└──────────────────────────────┘
↓
┌── Post-Retrieval ────────────┐
│ • Reranking │
│ • Deduplication │
│ • Context compression │
└──────────────────────────────┘
↓
┌── Generation ────────────────┐
│ • Prompt construction │
│ • Citation management │
│ • Grounding validation │
└──────────────────────────────┘
↓
Answer (with citations)
Chunking Strategies
The way you split documents determines retrieval quality more than any other decision:
| Strategy | Best For | Chunk Size | Overlap |
|---|---|---|---|
| Fixed-size | Consistent-format docs | 500-1000 tokens | 10-20% |
| Semantic (header-based) | Structured docs with headings | Variable | None |
| Recursive | Mixed-format content | 500-1500 tokens | 50-100 tokens |
| Sentence-window | Documents requiring precise retrieval | 1-3 sentences (with surrounding context stored) | N/A |
| Parent-child | Documents with hierarchies | Small (retrieval) → Large (context) | N/A |
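As a baseline for the table above, fixed-size chunking with overlap is only a few lines. This sketch approximates tokens with whitespace words; a real pipeline would count tokens with the embedding model's own tokenizer:

```python
def fixed_size_chunks(text, chunk_size=500, overlap=75):
    """Split text into word-based chunks with overlap (words approximate tokens)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap (here 15%, in the table's 10-20% range) keeps sentences that straddle a boundary retrievable from at least one chunk.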
Parent-Child Chunking
Retrieve on small, precise chunks but pass larger parent chunks to the LLM:
def create_parent_child_chunks(document, small_size=200, large_size=1000):
    """Create small chunks for retrieval, linked to large chunks for context."""
    # Create parent (large) chunks from the document text
    parents = split_text(document.text, chunk_size=large_size)
    all_chunks = []
    for parent_idx, parent in enumerate(parents):
        parent_id = f"parent_{parent_idx}"
        # Create child (small) chunks within each parent
        children = split_text(parent, chunk_size=small_size)
        for child_idx, child in enumerate(children):
            all_chunks.append({
                "id": f"child_{parent_idx}_{child_idx}",
                "text": child,            # Small: for embedding + retrieval
                "parent_id": parent_id,
                "parent_text": parent,    # Large: for LLM context
                "metadata": {
                    "source": document.source,
                    "page": document.page,
                },
            })
    return all_chunks
# At retrieval time:
# 1. Search child chunks (small, precise)
# 2. Return parent chunks (large, full context) to LLM
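The retrieval-time swap can be sketched as follows. The key detail is deduplicating parents, since several of the top-scoring children often belong to the same parent (`child_hits` is assumed to be the score-ordered chunk dicts produced above):

```python
def retrieve_parents(child_hits, top_n=4):
    """Swap retrieved child chunks for their parents, deduplicating shared parents.

    child_hits: chunk dicts from create_parent_child_chunks, ordered by score.
    """
    seen = set()
    parents = []
    for hit in child_hits:
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            parents.append(hit["parent_text"])
        if len(parents) == top_n:
            break
    return parents
```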
Hybrid Search
Combine dense (semantic) and sparse (keyword) search:
def hybrid_search(query, collection, alpha=0.7):
    """Combine dense and sparse retrieval with weighted Reciprocal Rank Fusion."""
    # Dense search (semantic similarity)
    dense_results = collection.search(
        query_embedding=embed(query),
        top_k=20,
    )
    # Sparse search (BM25 keyword matching)
    sparse_results = collection.bm25_search(
        query=query,
        top_k=20,
    )
    # Reciprocal Rank Fusion (RRF): score = 1 / (rank + k) with k=60,
    # weighted by alpha toward dense results
    combined = {}
    for rank, result in enumerate(dense_results):
        combined[result.id] = combined.get(result.id, 0) + alpha / (rank + 60)
    for rank, result in enumerate(sparse_results):
        combined[result.id] = combined.get(result.id, 0) + (1 - alpha) / (rank + 60)
    # Sort by combined score; returns (id, score) pairs
    ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return ranked[:10]
| When To Lean Dense | When To Lean Sparse |
|---|---|
| Semantic/conceptual queries | Exact term matching (product codes, names) |
| "How does X compare to Y?" | "ERROR-5042 resolution" |
| Natural language questions | Keyword-heavy technical queries |
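The `bm25_search` call above is assumed to come from the vector store or search engine. Where it does not, a minimal Okapi BM25 scorer with whitespace tokenization looks roughly like this (production systems should use a proper analyzer and inverted index instead):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for doc in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)   # document frequency
            tf = doc.count(term)                          # term frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores
```

Exact-token queries like the error-code example score sharply on the matching document and zero elsewhere, which is exactly what dense embeddings tend to blur.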
Reranking
Initial retrieval casts a wide net. Reranking picks the most relevant results:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_results(query, search_results, top_n=5):
    """Cross-encoder reranking for higher precision."""
    pairs = [(query, result["text"]) for result in search_results]
    scores = reranker.predict(pairs)
    ranked = sorted(
        zip(search_results, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [result for result, score in ranked[:top_n]]
Impact: Reranking typically improves precision by 10-25% over embedding-only retrieval.
Query Processing
Query Decomposition
Complex queries should be broken into sub-queries:
import json

def decompose_query(query):
    prompt = f"""Break this complex question into simpler sub-questions that
can each be answered independently, then combined.
Question: {query}
Sub-questions (JSON array):"""
    sub_queries = json.loads(llm.generate(prompt, temperature=0))
    return sub_queries

# "How does our Q3 revenue compare to Q2, and what drove the difference?"
# → ["What was Q3 revenue?", "What was Q2 revenue?",
#    "What factors changed between Q2 and Q3?"]
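Decomposition needs a matching synthesis step: answer each sub-question against its own retrieved context, then combine. A sketch, where `answer_fn` stands in for the single-question RAG pipeline and `llm_generate` for the model call (both hypothetical names):

```python
def answer_decomposed(query, sub_queries, answer_fn, llm_generate):
    """Answer each sub-question independently, then synthesize a final answer."""
    sub_answers = [(sq, answer_fn(sq)) for sq in sub_queries]
    context = "\n".join(f"Q: {sq}\nA: {ans}" for sq, ans in sub_answers)
    prompt = (
        "Using only these sub-answers, answer the original question.\n\n"
        f"{context}\n\nOriginal question: {query}\nAnswer:"
    )
    return llm_generate(prompt)
```

Keeping the sub-answers in the synthesis prompt (rather than raw chunks) bounds context size and preserves per-sub-question citations.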
HyDE (Hypothetical Document Embeddings)
Generate a hypothetical answer, embed that, and search for similar real documents:
def hyde_search(query, collection):
    """Generate a hypothetical answer, then retrieve with its embedding."""
    hypothetical = llm.generate(
        f"Write a short, factual answer to: {query}",
        temperature=0.5,
    )
    # Embed the hypothetical answer (not the query)
    results = collection.search(
        query_embedding=embed(hypothetical),
        top_k=10,
    )
    return results
RAG Evaluation
| Metric | What It Measures | Target |
|---|---|---|
| Context Precision | % of retrieved chunks that are relevant | > 70% |
| Context Recall | % of relevant chunks that were retrieved | > 80% |
| Answer Faithfulness | Is the answer grounded in retrieved context? | > 90% |
| Answer Relevancy | Does the answer address the question? | > 85% |
| Hallucination Rate | % of claims not supported by context | < 5% |
def evaluate_rag(test_set, rag_pipeline):
    results = {"faithfulness": [], "relevancy": [], "hallucination_rate": []}
    for test in test_set:
        answer, sources = rag_pipeline(test["query"])
        # Faithfulness: fraction of extracted claims grounded in the sources
        claims = extract_claims(answer)
        grounded = sum(1 for c in claims if is_grounded(c, sources))
        faithfulness = grounded / len(claims) if claims else 1.0
        results["faithfulness"].append(faithfulness)
        # Hallucination rate is the complement of faithfulness
        results["hallucination_rate"].append(1.0 - faithfulness)
        # Relevancy: does the answer address the question?
        relevancy = judge_relevancy(test["query"], answer)
        results["relevancy"].append(relevancy)
    return {k: sum(v) / len(v) for k, v in results.items()}
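The `is_grounded` judge above is typically another LLM call. A crude lexical-overlap baseline is useful as a cheap sanity check on the judge (not a substitute for it): treat a claim as grounded if enough of its content words appear in some source. The stopword list here is an illustrative assumption:

```python
def is_grounded_lexical(claim, sources, threshold=0.6):
    """Grounded if >= threshold of the claim's content words appear in one source."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "was"}
    words = {w for w in claim.lower().split() if w not in stop}
    if not words:
        return True
    for source in sources:
        source_words = set(source.lower().split())
        if len(words & source_words) / len(words) >= threshold:
            return True
    return False
```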
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Naive chunking | Fixed 500-char chunks split mid-sentence | Use semantic chunking with sentence boundaries |
| Top-K only | Retrieval returns 5 chunks, 3 are irrelevant | Add reranking layer, use hybrid search |
| No metadata filtering | Search returns docs from wrong department/date | Add metadata filters before vector search |
| Stuffing all context | All chunks stuffed into prompt regardless of relevance | Compress context, use map-reduce for long contexts |
| No evaluation | No way to know if retrieval quality is degrading | Build eval set with ground truth, measure weekly |
| Vector DB as the answer | Treating vector search as the only retrieval method | Hybrid search (dense + sparse + metadata) |
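The metadata-filtering fix can be as simple as restricting the candidate set before any vector scoring. A sketch over the chunk dicts produced earlier; the `department` and `date` metadata fields are hypothetical examples, and real stores push this filter down into the index rather than scanning in Python:

```python
def filter_by_metadata(chunks, department=None, after_date=None):
    """Restrict candidate chunks by metadata before vector scoring.

    after_date: ISO date string (lexicographic comparison works for ISO dates).
    """
    out = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        if department and meta.get("department") != department:
            continue
        if after_date and meta.get("date", "") < after_date:
            continue
        out.append(chunk)
    return out
```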
Checklist
- Chunking strategy selected and tested (parent-child, semantic, sliding window)
- Embedding model benchmarked on domain-specific queries
- Hybrid search configured (dense + sparse + metadata filters)
- Reranking layer added with cross-encoder
- Query processing: decomposition, expansion, HyDE where appropriate
- Context window management: compression for long contexts
- Citation management: sources tracked through pipeline
- Evaluation dataset built (50+ query-answer pairs with ground truth)
- Faithfulness and relevancy metrics tracked
- Hallucination detection pipeline in place
- Monitoring: retrieval quality, latency, cost dashboards
- Re-indexing pipeline automated for updated source documents
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For RAG architecture consulting, visit garnetgrid.com. :::