RAG Architecture: Beyond Basic Retrieval

Build production-grade RAG systems. Covers chunking strategies, embedding models, hybrid search, reranking, query transformation, evaluation, and advanced patterns for enterprise retrieval-augmented generation.

Basic RAG — embed documents, throw them in a vector database, retrieve top-K chunks, paste them into a prompt — works for demos. It fails in production. Queries are ambiguous. Chunks lose context. Embeddings miss keywords. The LLM hallucinates despite having the right context. Retrieval quality degrades silently. Enterprise RAG requires engineering at every layer: chunking, embedding, retrieval, reranking, prompt construction, and evaluation.

This guide covers the architecture of production-grade RAG systems, with the engineering techniques that separate a toy prototype from a system that actually helps users find accurate answers.


RAG Architecture Layers

User Query

┌── Query Processing ──────────┐
│ • Query expansion             │
│ • Intent classification       │
│ • Decomposition (multi-step)  │
└──────────────────────────────┘

┌── Retrieval ─────────────────┐
│ • Dense (semantic) search     │
│ • Sparse (keyword) search     │
│ • Hybrid (dense + sparse)     │
│ • Metadata filtering          │
└──────────────────────────────┘

┌── Post-Retrieval ────────────┐
│ • Reranking                   │
│ • Deduplication               │
│ • Context compression         │
└──────────────────────────────┘

┌── Generation ────────────────┐
│ • Prompt construction         │
│ • Citation management         │
│ • Grounding validation        │
└──────────────────────────────┘

Answer (with citations)

Chunking Strategies

The way you split documents determines retrieval quality more than any other decision:

| Strategy | Best For | Chunk Size | Overlap |
|---|---|---|---|
| Fixed-size | Consistent-format docs | 500-1000 tokens | 10-20% |
| Semantic (header-based) | Structured docs with headings | Variable | None |
| Recursive | Mixed-format content | 500-1500 tokens | 50-100 tokens |
| Sentence-window | Documents requiring precise retrieval | 1-3 sentences (surrounding context stored) | N/A |
| Parent-child | Documents with hierarchies | Small (retrieval) → Large (context) | N/A |
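
For illustration, recursive chunking can be sketched as follows — a simplified version that measures size in characters rather than tokens; production code would reuse your tokenizer for the budget:

```python
def recursive_chunk(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse with finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-split at the size limit
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # This piece alone exceeds the limit; recurse with finer separators
                chunks.extend(recursive_chunk(part, chunk_size, rest))
                current = ""
            else:
                current = part
    if current.strip():
        chunks.append(current)
    return chunks
```

Splitting on paragraph breaks before sentence breaks is what keeps chunks from ending mid-thought when the input mixes formats.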

Parent-Child Chunking

Retrieve on small, precise chunks but pass larger parent chunks to the LLM:

def create_parent_child_chunks(document, small_size=200, large_size=1000):
    """Create small chunks for retrieval, linked to large chunks for context."""
    
    # Create parent (large) chunks
    parents = split_text(document, chunk_size=large_size)
    
    all_chunks = []
    for parent_idx, parent in enumerate(parents):
        parent_id = f"parent_{parent_idx}"
        
        # Create child (small) chunks within each parent
        children = split_text(parent, chunk_size=small_size)
        
        for child_idx, child in enumerate(children):
            all_chunks.append({
                "id": f"child_{parent_idx}_{child_idx}",
                "text": child,           # Small: for embedding + retrieval
                "parent_id": parent_id,
                "parent_text": parent,    # Large: for LLM context
                "metadata": {
                    "source": document.source,
                    "page": document.page,
                }
            })
    
    return all_chunks

# At retrieval time:
# 1. Search child chunks (small, precise)
# 2. Return parent chunks (large, full context) to LLM
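
The retrieval-time swap is a small function. A sketch, assuming the child-chunk hits carry the `parent_id` and `parent_text` fields created above and arrive sorted best-first:

```python
def fetch_parent_context(child_hits, max_parents=3):
    """Swap retrieved child chunks for their parents, deduplicating by parent_id."""
    seen, parents = set(), []
    for hit in child_hits:  # assumed sorted by relevance, best first
        pid = hit["parent_id"]
        if pid in seen:
            continue  # several children of the same parent matched; keep one copy
        seen.add(pid)
        parents.append({"parent_id": pid, "text": hit["parent_text"]})
        if len(parents) == max_parents:
            break
    return parents
```

Deduplication matters here: a good match often hits multiple children of the same parent, and without it the same large chunk would be pasted into the prompt several times.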

Hybrid Search

Combine dense (semantic) and sparse (keyword) search:

def hybrid_search(query, collection, alpha=0.7):
    """Combine dense and sparse retrieval with weighted scoring."""
    
    # Dense search (semantic similarity)
    dense_results = collection.search(
        query_embedding=embed(query),
        top_k=20,
    )
    
    # Sparse search (BM25 keyword matching)
    sparse_results = collection.bm25_search(
        query=query,
        top_k=20,
    )
    
    # Reciprocal Rank Fusion (RRF) for combining
    combined = {}
    for rank, result in enumerate(dense_results):
        combined[result.id] = combined.get(result.id, 0) + alpha / (rank + 60)
    
    for rank, result in enumerate(sparse_results):
        combined[result.id] = combined.get(result.id, 0) + (1 - alpha) / (rank + 60)
    
    # Sort by combined score
    ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
    return ranked[:10]

| When To Lean Dense | When To Lean Sparse |
|---|---|
| Semantic/conceptual queries | Exact term matching (product codes, names) |
| "How does X compare to Y?" | "ERROR-5042 resolution" |
| Natural language questions | Keyword-heavy technical queries |
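
For reference, the BM25 scoring behind the sparse side can be sketched in a few lines. This assumes documents are pre-tokenized into term lists; real engines add inverted indexes, caching, and smarter tokenization:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query terms with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    df = Counter()                          # document frequency per term
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates via k1; b normalizes for doc length
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The IDF term is what makes rare exact tokens like `ERROR-5042` dominate — precisely the queries where dense embeddings underperform.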

Reranking

Initial retrieval casts a wide net. Reranking picks the most relevant results:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_results(query, search_results, top_n=5):
    """Cross-encoder reranking for higher precision."""
    
    pairs = [(query, result["text"]) for result in search_results]
    scores = reranker.predict(pairs)
    
    ranked = sorted(
        zip(search_results, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    
    return [result for result, score in ranked[:top_n]]

Impact: Reranking typically improves precision by 10-25% over embedding-only retrieval.


Query Processing

Query Decomposition

Complex queries should be broken into sub-queries:

import json

def decompose_query(query):
    """Use the LLM to split a complex question into independent sub-questions."""
    prompt = f"""Break this complex question into simpler sub-questions that
can each be answered independently, then combined.

Question: {query}

Sub-questions (JSON array):"""

    sub_queries = json.loads(llm.generate(prompt, temperature=0))
    return sub_queries

# "How does our Q3 revenue compare to Q2, and what drove the difference?"
# → ["What was Q3 revenue?", "What was Q2 revenue?", 
#    "What factors changed between Q2 and Q3?"]
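
The sub-answers then have to be recombined. A minimal map-reduce sketch, with the decomposition, per-question answering, and final synthesis passed in as callables, since those depend on your retrieval stack and LLM client:

```python
def answer_with_decomposition(query, decompose, answer_one, synthesize):
    """Map-reduce over sub-questions: answer each independently, then combine."""
    sub_queries = decompose(query)
    # Map: run retrieval + generation per sub-question
    sub_answers = [(sq, answer_one(sq)) for sq in sub_queries]
    # Reduce: hand the Q/A pairs to a final synthesis call
    context = "\n".join(f"Q: {sq}\nA: {a}" for sq, a in sub_answers)
    return synthesize(query, context)
```

The map step is embarrassingly parallel; in production the per-sub-question calls are usually issued concurrently.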

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer, embed that, and search for similar real documents:

def hyde_search(query, collection):
    """Generate hypothetical answer, use it for retrieval."""
    hypothetical = llm.generate(
        f"Write a short, factual answer to: {query}",
        temperature=0.5,
    )
    
    # Embed the hypothetical answer (not the query)
    results = collection.search(
        query_embedding=embed(hypothetical),
        top_k=10,
    )
    
    return results

RAG Evaluation

| Metric | What It Measures | Target |
|---|---|---|
| Context Precision | % of retrieved chunks that are relevant | > 70% |
| Context Recall | % of relevant chunks that were retrieved | > 80% |
| Answer Faithfulness | Is the answer grounded in retrieved context? | > 90% |
| Answer Relevancy | Does the answer address the question? | > 85% |
| Hallucination Rate | % of claims not supported by context | < 5% |

def evaluate_rag(test_set, rag_pipeline):
    results = {"faithfulness": [], "relevancy": [], "hallucination_rate": []}

    for test in test_set:
        answer, sources = rag_pipeline(test["query"])

        # Faithfulness: share of extracted claims grounded in the sources
        claims = extract_claims(answer)
        grounded = sum(1 for c in claims if is_grounded(c, sources))
        faithfulness = grounded / len(claims) if claims else 1.0
        results["faithfulness"].append(faithfulness)
        results["hallucination_rate"].append(1 - faithfulness)

        # Relevancy: LLM-as-judge on query/answer pair
        relevancy = judge_relevancy(test["query"], answer)
        results["relevancy"].append(relevancy)

    return {k: sum(v) / len(v) for k, v in results.items()}
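
Context precision and recall from the table above require ground-truth relevant chunk ids per query in your eval set; the per-query computation itself is simple:

```python
def context_precision_recall(retrieved_ids, relevant_ids):
    """Context precision/recall for one query, given ground-truth relevant chunk ids."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall
```

Low precision points at reranking or filtering problems; low recall points at chunking, embedding, or top-K choices.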

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Naive chunking | Fixed 500-char chunks split mid-sentence | Use semantic chunking with sentence boundaries |
| Top-K only | Retrieval returns 5 chunks, 3 are irrelevant | Add reranking layer, use hybrid search |
| No metadata filtering | Search returns docs from wrong department/date | Add metadata filters before vector search |
| Stuffing all context | All chunks stuffed into prompt regardless of relevance | Compress context, use map-reduce for long contexts |
| No evaluation | No way to know if retrieval quality is degrading | Build eval set with ground truth, measure weekly |
| Vector DB as the answer | Treating vector search as the only retrieval method | Hybrid search (dense + sparse + metadata) |
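
As one concrete fix for context stuffing, a greedy budget packer — simplified to character budgets here, and assuming each chunk carries a `score` field (e.g. the reranker score) that is not part of the pipeline code above:

```python
def fit_to_budget(chunks, max_chars=4000):
    """Greedily keep the highest-scored chunks that fit within the context budget."""
    chosen, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + len(chunk["text"]) > max_chars:
            continue  # skip, but still try smaller lower-ranked chunks
        chosen.append(chunk)
        used += len(chunk["text"])
    return chosen
```

Continuing past an oversized chunk (rather than breaking) lets smaller lower-ranked chunks fill leftover budget, which in practice beats a strict cutoff.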

Checklist

  • Chunking strategy selected and tested (parent-child, semantic, sliding window)
  • Embedding model benchmarked on domain-specific queries
  • Hybrid search configured (dense + sparse + metadata filters)
  • Reranking layer added with cross-encoder
  • Query processing: decomposition, expansion, HyDE where appropriate
  • Context window management: compression for long contexts
  • Citation management: sources tracked through pipeline
  • Evaluation dataset built (50+ query-answer pairs with ground truth)
  • Faithfulness and relevancy metrics tracked
  • Hallucination detection pipeline in place
  • Monitoring: retrieval quality, latency, cost dashboards
  • Re-indexing pipeline automated for updated source documents

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For RAG architecture consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
