
RAG Architecture Patterns

Design production-ready Retrieval-Augmented Generation systems. Covers chunking strategies, embedding models, vector search, reranking, context window optimization, hybrid search, evaluation frameworks, and the patterns that make RAG reliable.

Retrieval-Augmented Generation (RAG) combines the reasoning ability of LLMs with the factual grounding of a knowledge base. Instead of relying solely on the LLM's training data, which is frozen at a cutoff date and invites hallucination, RAG retrieves relevant documents at query time and includes them in the prompt context. This grounds the LLM's response in real, up-to-date information.


RAG Architecture

User Query

1. Query Processing
  - Query rewriting / expansion (sketch after this list)
  - Intent classification

2. Retrieval
  - Embed query → vector
  - Search vector store (ANN)
  - Keyword search (BM25)
  - Hybrid = vector + keyword

3. Reranking
  - Cross-encoder reranker
  - Score retrieved chunks by relevance
  - Select top-K most relevant

4. Context Assembly
  - Format chunks for prompt
  - Add system instructions
  - Respect context window limit

5. Generation
  - LLM generates response
  - Grounded in retrieved context

6. Post-Processing
  - Citation extraction
  - Hallucination detection
  - Response formatting
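
Steps 2 through 5 are shown in code throughout this guide; step 1 is often just one cheap LLM call. A minimal query-rewriting sketch, assuming the OpenAI Python client (the model name and prompt wording are illustrative, not prescriptive):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def rewrite_query(raw_query):
    """Turn a conversational question into a self-contained search query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's question as a concise, self-contained "
                "search query. Expand abbreviations and resolve pronouns. "
                "Return only the rewritten query."
            )},
            {"role": "user", "content": raw_query},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()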

Chunking Strategies

# Fixed-size chunking (simple, but breaks context mid-sentence)
def fixed_size_chunks(text, chunk_size=512, overlap=50):
    """Slide a fixed window over the text; sizes here are in characters."""
    chunks = []
    for i in range(0, len(text), chunk_size - overlap):
        chunks.append(text[i:i + chunk_size])
    return chunks

# Semantic chunking (better context preservation)
import tiktoken

# Token counter: tiktoken's cl100k_base matches recent OpenAI models
# (any tokenizer works, as long as it matches your embedding/LLM stack)
_encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    return len(_encoding.encode(text))

def semantic_chunks(text, max_tokens=512):
    """Split on natural boundaries: paragraphs, sections, sentences."""
    sections = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_size = 0

    for section in sections:
        section_tokens = count_tokens(section)
        if current_size + section_tokens > max_tokens and current_chunk:
            chunks.append('\n\n'.join(current_chunk))
            current_chunk = [section]
            current_size = section_tokens
        else:
            current_chunk.append(section)
            current_size += section_tokens

    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))

    return chunks

# Document-aware chunking
# Split by: heading, page boundary, code block, table
# Preserve metadata: source file, section title, page number
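
A document-aware splitter can be as simple as cutting on headings and carrying metadata with each chunk. A rough sketch for Markdown input (the regex and metadata fields are illustrative assumptions):

import re

def heading_aware_chunks(markdown_text, source_file):
    """Split Markdown on heading lines; attach source and section metadata."""
    # Split at newlines that are followed by a heading marker,
    # so each heading stays with the section body beneath it
    parts = re.split(r'\n(?=#{1,6} )', markdown_text)
    chunks = []
    for part in parts:
        first_line = part.split('\n', 1)[0]
        title = first_line.lstrip('#').strip() if first_line.startswith('#') else None
        chunks.append({
            'text': part.strip(),
            'source': source_file,
            'section_title': title,
        })
    return chunks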

Hybrid Search

# Combine dense (semantic) + sparse (keyword) search
from pinecone import Pinecone
from pinecone_text.sparse import BM25Encoder
from sentence_transformers import SentenceTransformer
import cohere

pc = Pinecone()  # assumes PINECONE_API_KEY is set
index = pc.Index("knowledge-base")

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model
bm25_encoder = BM25Encoder.default()  # BM25 params pre-fitted on MS MARCO

# Dense vector from embedding model
dense_embedding = embedding_model.encode(query).tolist()

# Sparse vector from BM25: {"indices": [...], "values": [...]}
sparse_embedding = bm25_encoder.encode_queries(query)

results = index.query(
    vector=dense_embedding,
    sparse_vector=sparse_embedding,
    top_k=20,
    include_metadata=True
)

# Rerank with a cross-encoder for precision (Cohere's hosted reranker here)
reranker = cohere.Client()  # assumes CO_API_KEY is set
reranked = reranker.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[match.metadata['text'] for match in results.matches],
    top_n=5
)
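
If a hosted reranker is off the table, a local cross-encoder gives similar precision at the cost of some latency. A sketch assuming the sentence-transformers library (the model name is one common public checkpoint, not a requirement):

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_locally(query, matches, top_n=5):
    """Score (query, chunk) pairs with a cross-encoder and keep the best."""
    pairs = [(query, m.metadata['text']) for m in matches]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(matches, scores), key=lambda x: x[1], reverse=True)
    return [m for m, _ in ranked[:top_n]]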

Context Assembly

def assemble_context(query, retrieved_chunks, max_context_tokens=4000):
    """Build prompt with retrieved context."""
    
    system_prompt = """You are a helpful assistant. Answer the user's question 
    based ONLY on the provided context. If the context doesn't contain the 
    answer, say "I don't have enough information to answer that."
    
    Always cite your sources using [Source: filename] format."""
    
    context_parts = []
    token_count = count_tokens(system_prompt + query)
    
    for chunk in retrieved_chunks:
        chunk_tokens = count_tokens(chunk['text'])
        if token_count + chunk_tokens > max_context_tokens:
            break
        context_parts.append(
            f"[Source: {chunk['source']}]\n{chunk['text']}"
        )
        token_count += chunk_tokens
    
    context = "\n\n---\n\n".join(context_parts)
    
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    ]
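
Putting steps 4 through 6 together: send the assembled messages to the model, then pull citations out of the reply. A sketch assuming the OpenAI client and the [Source: filename] format the system prompt asks for:

import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

messages = assemble_context(query, retrieved_chunks)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=messages,
    temperature=0,
)
answer = response.choices[0].message.content

# Post-processing: extract cited sources for display and verification
citations = sorted(set(re.findall(r'\[Source: ([^\]]+)\]', answer)))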

Evaluation

Metric                | Measures                    | How
Retrieval Precision@K | Relevant docs in top K      | Human-labeled relevance
Retrieval Recall      | Coverage of relevant docs   | All relevant docs found?
Answer Faithfulness   | Grounded in context?        | LLM-as-judge vs context
Answer Relevance      | Actually answers question?  | LLM-as-judge vs query
Hallucination Rate    | Made-up information         | Fact-check against sources
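
Faithfulness is usually the first metric teams automate. A minimal LLM-as-judge sketch (the prompt and model are illustrative; libraries such as Ragas package this pattern with more rigor):

def judge_faithfulness(answer, context, client):
    """Ask a strong model whether every claim in the answer is supported."""
    prompt = (
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Is every factual claim in the answer supported by the context? "
        "Reply with exactly YES or NO."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # judge should be at least as strong as the generator
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")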

Anti-Patterns

Anti-Pattern       | Consequence                             | Fix
No reranking       | Irrelevant context pollutes response    | Cross-encoder reranker after retrieval
Chunks too small   | Lose context, fragmented answers        | 256-512 token chunks with overlap
Chunks too large   | Dilute relevance, waste context window  | Split on natural boundaries
No source citation | Cannot verify answers                   | Include source metadata in prompt
Dense search only  | Misses keyword-specific queries         | Hybrid search (dense + sparse)

RAG turns LLMs from unreliable knowledge sources into reliable answer engines, provided the retrieval is good. As a rule of thumb, the quality of a RAG system is roughly 80% retrieval quality and 20% generation quality: when answers go wrong, look first at what reached the prompt.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology, and has led multimillion-dollar ERP implementations for Fortune 500 supply chains.
