RAG Architecture Patterns
Design production-ready Retrieval-Augmented Generation systems. Covers chunking strategies, embedding models, vector search, reranking, context window optimization, hybrid search, evaluation frameworks, and the patterns that make RAG reliable.
Retrieval-Augmented Generation (RAG) combines the reasoning ability of LLMs with the factual grounding of a knowledge base. Instead of relying solely on the LLM’s training data, which goes stale and leaves the model free to hallucinate, RAG retrieves relevant documents and includes them in the prompt context. This grounds the LLM’s response in real, up-to-date information.
RAG Architecture
User Query
↓
1. Query Processing
- Query rewriting / expansion
- Intent classification
↓
2. Retrieval
- Embed query → vector
- Search vector store (ANN)
- Keyword search (BM25)
- Hybrid = vector + keyword
↓
3. Reranking
- Cross-encoder reranker
- Score retrieved chunks by relevance
- Select top-K most relevant
↓
4. Context Assembly
- Format chunks for prompt
- Add system instructions
- Respect context window limit
↓
5. Generation
- LLM generates response
- Grounded in retrieved context
↓
6. Post-Processing
- Citation extraction
- Hallucination detection
- Response formatting
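Concretely, the pipeline is a thin orchestration layer over these six components. The sketch below shows the control flow; every helper it calls (rewrite_query, hybrid_search, rerank, assemble_context, generate, extract_citations) is a hypothetical stand-in for the pieces detailed in the rest of this article.

def answer(query):
    # 1. Query processing: rewrite/expand the raw query for better recall
    search_query = rewrite_query(query)

    # 2. Retrieval: cast a wide net with hybrid dense + keyword search
    candidates = hybrid_search(search_query, top_k=20)

    # 3. Reranking: a cross-encoder narrows candidates to the best few
    top_chunks = rerank(query, candidates, top_n=5)

    # 4. Context assembly: format chunks + instructions within the window
    messages = assemble_context(query, top_chunks)

    # 5. Generation: the LLM answers, grounded in the retrieved context
    response = generate(messages)

    # 6. Post-processing: extract citations, flag ungrounded claims
    return {"answer": response, "citations": extract_citations(response)}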
Chunking Strategies
# Fixed-size chunking (simple but breaks context)
def fixed_size_chunks(text, chunk_size=512, overlap=50):
chunks = []
for i in range(0, len(text), chunk_size - overlap):
chunks.append(text[i:i + chunk_size])
return chunks
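The overlap gives neighboring chunks a shared margin, so a sentence sliced at a boundary survives intact in at least one of them. The tradeoff is duplicated tokens in the index.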
# Semantic chunking (better context preservation)
import tiktoken

# count_tokens is used throughout this article; tiktoken's cl100k_base is one
# reasonable tokenizer -- swap in whichever matches your embedding model
_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    return len(_enc.encode(text))

def semantic_chunks(text, max_tokens=512):
    """Split on natural boundaries: paragraphs, sections, sentences."""
sections = text.split('\n\n')
chunks = []
current_chunk = []
current_size = 0
for section in sections:
section_tokens = count_tokens(section)
if current_size + section_tokens > max_tokens and current_chunk:
chunks.append('\n\n'.join(current_chunk))
current_chunk = [section]
current_size = section_tokens
else:
current_chunk.append(section)
current_size += section_tokens
if current_chunk:
chunks.append('\n\n'.join(current_chunk))
return chunks
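One caveat with this sketch: a single section longer than max_tokens still becomes one oversized chunk. A production version would recursively fall back to sentence-level splitting for such sections.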
# Document-aware chunking
# Split by: heading, page boundary, code block, table
# Preserve metadata: source file, section title, page number
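A minimal sketch of the heading-aware variant for markdown sources, with metadata carried alongside each chunk (the regex and field names here are illustrative assumptions, not a standard):

import re

def markdown_chunks(text, source_file):
    """Split a markdown document before each heading and keep metadata.
    Sketch only: real pipelines also handle pages, code blocks, and tables."""
    chunks = []
    for section in re.split(r'\n(?=#{1,6} )', text):
        first_line = section.split('\n', 1)[0]
        title = first_line.lstrip('#').strip() if first_line.startswith('#') else None
        chunks.append({
            'text': section.strip(),
            'source': source_file,        # feeds the [Source: ...] citations later
            'section_title': title,
        })
    return chunks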
Hybrid Search
from pinecone import Pinecone
# Combine dense (semantic) + sparse (keyword) search
pc = Pinecone()
index = pc.Index("knowledge-base")
# embedding_model, bm25_encoder, and reranker are assumed to be initialized
# elsewhere (e.g. a sentence-transformers model and a fitted BM25 encoder)
# Dense (semantic) vector from the embedding model
dense_embedding = embedding_model.encode(query)
# Sparse (keyword) vector from BM25
sparse_embedding = bm25_encoder.encode(query)
results = index.query(
vector=dense_embedding,
sparse_vector=sparse_embedding,
top_k=20,
include_metadata=True
)
# Rerank with a cross-encoder for precision
# (Pinecone returns hits under results.matches, not by iterating the response)
reranked = reranker.rerank(
    query=query,
    documents=[m.metadata['text'] for m in results.matches],
    top_n=5
)
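The reranker above is a placeholder. One common implementation uses a sentence-transformers cross-encoder; the sketch below assumes the ms-marco-MiniLM-L-6-v2 checkpoint, a popular choice rather than a requirement.

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair jointly. It is far more
# precise than bi-encoder similarity but also slower, which is why it runs on
# ~20 retrieved candidates rather than the whole corpus.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, documents, top_n=5):
    scores = cross_encoder.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]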
Context Assembly
def assemble_context(query, retrieved_chunks, max_context_tokens=4000):
"""Build prompt with retrieved context."""
system_prompt = """You are a helpful assistant. Answer the user's question
based ONLY on the provided context. If the context doesn't contain the
answer, say "I don't have enough information to answer that."
Always cite your sources using [Source: filename] format."""
context_parts = []
token_count = count_tokens(system_prompt + query)
for chunk in retrieved_chunks:
chunk_tokens = count_tokens(chunk['text'])
if token_count + chunk_tokens > max_context_tokens:
break
context_parts.append(
f"[Source: {chunk['source']}]\n{chunk['text']}"
)
token_count += chunk_tokens
context = "\n\n---\n\n".join(context_parts)
return [
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
]
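Because the system prompt mandates a [Source: filename] citation format, the post-processing step from the architecture diagram becomes simple string work. A minimal sketch, assuming the model follows the format:

import re

def extract_citations(response_text):
    """Pull [Source: filename] citations out of a generated answer."""
    return sorted(set(re.findall(r'\[Source: ([^\]]+)\]', response_text)))

def suspect_citations(response_text, retrieved_chunks):
    """Citations that match no retrieved chunk -- a cheap hallucination signal."""
    cited = set(extract_citations(response_text))
    available = {chunk['source'] for chunk in retrieved_chunks}
    return cited - available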
Evaluation
| Metric | Measures | How |
|---|---|---|
| Retrieval Precision@K | Fraction of top-K results that are relevant | Human-labeled relevance judgments |
| Retrieval Recall | Fraction of relevant docs retrieved | Compare against a labeled ground-truth set |
| Answer Faithfulness | Whether the answer is grounded in the context | LLM-as-judge against the retrieved context |
| Answer Relevance | Whether the answer addresses the question | LLM-as-judge against the query |
| Hallucination Rate | How often the answer fabricates information | Fact-check claims against sources |
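Given labeled relevance judgments, the two retrieval metrics reduce to set arithmetic. A sketch, assuming every document carries a stable ID:

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall(retrieved_ids, relevant_ids):
    """Fraction of all relevant documents that were retrieved at all."""
    if not relevant_ids:
        return 0.0
    return sum(1 for doc_id in relevant_ids if doc_id in retrieved_ids) / len(relevant_ids)

Faithfulness and relevance resist simple arithmetic; they typically need an LLM-as-judge comparing the answer against the retrieved context and the original query, respectively.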
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No reranking | Irrelevant context pollutes response | Cross-encoder reranker after retrieval |
| Chunks too small | Lose context, fragmented answers | 256-512 token chunks with overlap |
| Chunks too large | Dilute relevance, waste context window | Split on natural boundaries |
| No source citation | Cannot verify answers | Include source metadata in prompt |
| Dense search only | Misses keyword-specific queries | Hybrid search (dense + sparse) |
RAG turns LLMs from unreliable knowledge sources into reliable answer engines, but only when the retrieval is good. As a rule of thumb, a RAG system's quality is roughly 80% retrieval and 20% generation: if the right chunks never reach the context window, no amount of prompting will recover the answer.