RAG Architecture Patterns: When Vector Search Is Not Enough

Design Retrieval-Augmented Generation systems that actually work in production. Covers chunking strategies, embedding models, hybrid search, reranking, evaluation metrics, and the failure modes that textbook RAG implementations ignore.

RAG — Retrieval-Augmented Generation — has become the default answer to “how do I make an LLM use my data.” The basic idea is simple: search a knowledge base, stuff the relevant chunks into the prompt, and let the model answer. Tutorials make this look like 20 lines of code.

In production, it is 20 lines of code plus 6 months of debugging why the system confidently gives wrong answers, misses relevant documents, hallucinates citations, and costs $50,000/month in API calls.

This guide covers the architecture decisions that separate demo-quality RAG from production-quality RAG.


The Naive RAG Pipeline (And Why It Fails)

Document → Chunk into 500 tokens → Embed with OpenAI → Store in vector DB
Query → Embed → Find top-5 similar chunks → Stuff into prompt → LLM answers

This works for demos. Here is why it breaks in production:
| Failure Mode | Why It Happens | How Often |
| --- | --- | --- |
| Retrieves wrong chunks | Embedding similarity ≠ semantic relevance | Very common |
| Misses relevant info | Answer spans multiple chunks, none retrieved | Common |
| Hallucinates citations | Model references chunks that do not support its answer | Common |
| Exceeds context window | Too many chunks + long query = truncation | Moderate |
| Slow retrieval | Large vector index + no filtering = high latency | At scale |
| Stale data | Documents updated but embeddings not re-indexed | Common |

Chunking: The Foundation Everyone Gets Wrong

How you split documents determines everything downstream. Bad chunking = bad retrieval = bad answers.

Chunking Strategies

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Fixed-size | Split every N tokens with overlap | Simplest; reasonable baseline |
| Sentence-based | Split at sentence boundaries | Preserving complete thoughts |
| Paragraph-based | Split at paragraph boundaries | Structured documents |
| Semantic | Split when topic changes (embedding similarity) | Long documents with varied topics |
| Recursive | Try paragraph → sentence → fixed as fallback | General purpose |
| Document-aware | Use document structure (headers, sections) | Technical docs, wikis, manuals |

Practical Chunking Configuration

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Production-tested defaults:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # in characters with len(); swap in a token counter for true tokens
    chunk_overlap=50,      # ~10% overlap prevents splitting mid-thought
    separators=[
        "\n## ",           # Markdown H2 headers
        "\n### ",          # Markdown H3 headers
        "\n\n",            # Paragraph breaks
        "\n",              # Line breaks
        ". ",              # Sentences
        " ",               # Words (last resort)
    ],
    length_function=len,   # counts characters; use a tiktoken encoder for token-based sizing
)

The chunk size tradeoff:

  • Smaller chunks (256 tokens): More precise retrieval, but answers may need information from multiple chunks. Increases retrieval complexity.
  • Larger chunks (1024 tokens): More context per chunk, but retrieval is less precise and you fit fewer chunks in the prompt.
  • Sweet spot for most cases: 400-600 tokens with 10% overlap.
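Since the splitter above measures length with `len` (characters), `chunk_size=512` is not really 512 tokens. A minimal sketch of a dependency-free length function that approximates tokens (the ~4 characters/token ratio is a rough heuristic for English text; use a real tokenizer such as tiktoken for exact counts):

```python
def approx_token_len(text: str) -> int:
    # Rough heuristic: English prose averages ~4 characters per token.
    # Replace with a real tokenizer (e.g. a tiktoken encoding) for exact counts.
    return len(text) // 4
```

Passed as `length_function=approx_token_len`, the splitter then interprets `chunk_size` in approximate tokens rather than characters.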

Beyond Naive Vector Search: Hybrid Retrieval

Pure vector search compares embedding similarity, which captures semantic meaning but misses exact matches. Keyword search finds exact terms but misses synonyms and paraphrases. Production RAG uses both.

Hybrid Search Architecture

Query: "What is the SLA for our payment API?"

Vector Search (semantic):
  → Finds chunks about "service level agreements" and "uptime guarantees"
  → Misses: chunks that literally say "payment API SLA: 99.95%"

Keyword Search (BM25):
  → Finds chunks containing "SLA" and "payment API"
  → Misses: chunks about "availability commitments" (same concept, different words)

Hybrid Search (both):
  → Combines results using Reciprocal Rank Fusion (RRF)
  → Gets both semantic matches AND exact keyword matches
  → Answer: "The payment API SLA is 99.95% availability..."

# Reciprocal Rank Fusion: combine results from multiple search methods
def reciprocal_rank_fusion(result_lists: list[list], k: int = 60) -> list:
    """
    Merge multiple ranked result lists using RRF.
    k=60 is standard; higher k = less weight on top results.
    """
    scores = {}
    for result_list in result_lists:
        for rank, doc_id in enumerate(result_list):
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1 / (rank + k)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
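As a quick sanity check, here is the fusion applied to two hypothetical ranked result lists for the SLA query (the function is repeated inline so the snippet runs standalone; the doc IDs are made up):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Same RRF as above, condensed: score = sum of 1/(rank + k) per list.
    scores = {}
    for result_list in result_lists:
        for rank, doc_id in enumerate(result_list):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (rank + k)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Hypothetical top-3 results from each retriever
vector_hits  = ["chunk_sla_overview", "chunk_uptime", "chunk_pricing"]
keyword_hits = ["chunk_payment_sla", "chunk_sla_overview", "chunk_api_ref"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# chunk_sla_overview ranks first: it appears near the top of both lists
```

A document that both retrievers surface beats one that only a single retriever ranks highly, which is exactly the behavior hybrid search is after.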

Reranking: The Step That Doubles Accuracy

After retrieval, you have 10-20 candidate chunks. Not all of them are actually relevant. A reranker model scores each chunk against the original query and re-orders them by actual relevance.

Without reranking:
  Top 5 chunks by embedding similarity → Some are relevant, some are noise
  LLM uses all 5 → Answer includes irrelevant information

With reranking:
  Top 20 chunks by embedding similarity (cast a wide net)
  → Reranker scores each against the query
  → Top 5 by reranker score → All highly relevant
  → LLM uses 5 high-quality chunks → Better answer

| Reranker | Speed | Quality | Cost |
| --- | --- | --- | --- |
| Cohere Rerank | Fast (API) | High | $1/1K queries |
| BGE Reranker v2 | Medium (local) | High | Free (self-hosted) |
| Cross-encoder (ms-marco) | Slow (local) | Highest | Free (self-hosted) |
| LLM-as-reranker (GPT-4) | Slow (API) | Very high | Expensive |
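The wide-net-then-filter flow sketched above is the same regardless of which reranker you choose; only the scoring function changes. A minimal sketch with a pluggable scorer (the word-overlap scorer is a toy stand-in for a real cross-encoder, and all names here are illustrative):

```python
def rerank(query: str, chunks: list[str], score_fn, keep: int = 5) -> list[str]:
    """Score every candidate chunk against the query, keep the best `keep`."""
    return sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)[:keep]

def overlap_score(query: str, chunk: str) -> float:
    # Toy relevance: fraction of query words that appear in the chunk.
    # A production system would call a cross-encoder or rerank API here.
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)
```

Swapping `score_fn` for a cross-encoder or the Cohere Rerank API changes nothing else: retrieve ~20 candidates wide, rerank, keep the top 5.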

Metadata Filtering: The Overlooked Optimization

Vector search alone searches everything. Adding metadata filters narrows the search space before similarity comparison, which is both faster and more accurate.

# Without metadata: search all 500K chunks
results = index.query(
    vector=query_embedding,
    top_k=10
)

# With metadata: search only relevant subset
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={
        "source": "engineering-handbook",
        "category": "infrastructure",
        "last_updated": {"$gte": "2024-01-01"}
    }
)
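The `index.query` calls above follow a Pinecone-style API, but the underlying idea is just filter-then-rank. A self-contained sketch with brute-force cosine similarity over a list of dicts (equality filters only, for brevity; the field names and data layout are illustrative assumptions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filtered_search(chunks, query_vec, flt, top_k=10):
    """Apply metadata filters first, then rank only the survivors by similarity."""
    candidates = [
        c for c in chunks
        if all(c["meta"].get(key) == value for key, value in flt.items())
    ]
    candidates.sort(key=lambda c: cosine(c["vec"], query_vec), reverse=True)
    return candidates[:top_k]
```

Because the filter runs before the similarity comparison, the expensive ranking step touches only the subset that can actually answer the query.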

Essential Metadata Fields

| Field | Why | Example |
| --- | --- | --- |
| source | Filter by document source | "confluence", "github", "notion" |
| category | Topic-based filtering | "security", "infrastructure", "api" |
| last_updated | Freshness filtering | "2024-07-15" |
| document_id | Group chunks from same doc | "doc_abc123" |
| chunk_index | Retrieve surrounding context | 0, 1, 2, 3… |
| access_level | Permission-based retrieval | "public", "internal", "confidential" |

Evaluation: How to Know If Your RAG Works

| Metric | What It Measures | How to Calculate |
| --- | --- | --- |
| Retrieval Precision | Are the retrieved chunks relevant? | Relevant chunks / total retrieved |
| Retrieval Recall | Did we find all relevant chunks? | Retrieved relevant / total relevant |
| Faithfulness | Does the answer match the retrieved context? | LLM judge or human eval |
| Answer Relevance | Does the answer address the question? | LLM judge or human eval |
| Latency | How long does end-to-end take? | Timer from query to response |
| Cost per query | How much does each query cost? | API costs (embedding + LLM + reranking) |

The minimum viable evaluation set: 50 question-answer pairs with labeled relevant documents. Run retrieval against this set after every pipeline change. If retrieval precision drops, your answer quality will drop — regardless of what the LLM does.
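The two retrieval metrics reduce to simple set arithmetic, so the evaluation loop can stay small. A sketch of running them over a labeled set (the `(question, relevant_doc_ids)` pair structure and the `retrieve` callable are assumptions, not a fixed schema):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Retrieval precision and recall for a single query."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

def evaluate(eval_set, retrieve, top_k=5):
    """Average precision/recall over (question, relevant_doc_ids) pairs."""
    scores = [
        precision_recall(set(retrieve(question, top_k)), set(relevant_ids))
        for question, relevant_ids in eval_set
    ]
    n = len(scores)
    avg_precision = sum(p for p, _ in scores) / n
    avg_recall = sum(r for _, r in scores) / n
    return avg_precision, avg_recall
```

Run this after every pipeline change and track the two averages over time; a drop in either is a regression even if spot-checked answers still look fine.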


Implementation Checklist

  • Choose chunking strategy based on your document types (start with recursive, 512 tokens, 10% overlap)
  • Implement hybrid search: vector (semantic) + keyword (BM25) with reciprocal rank fusion
  • Add a reranking step after retrieval to filter noise from top results
  • Attach metadata to every chunk: source, category, date, document ID, chunk index
  • Build an evaluation set of 50+ QA pairs with labeled relevant documents
  • Implement parent document retrieval: when a chunk matches, include surrounding chunks
  • Add freshness filters: exclude chunks from outdated documents
  • Monitor cost per query and set budget alerts
  • Track retrieval latency: initial target < 500ms for retrieval, < 3s end-to-end
  • Set up automated re-indexing when source documents change
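The parent-document-retrieval item in the checklist is where the `document_id` and `chunk_index` metadata pays off: when a chunk matches, fetch its neighbors from the same document. A minimal sketch (the data layout is an assumption for illustration):

```python
def expand_with_neighbors(hit: dict, chunks_by_doc: dict, window: int = 1) -> list:
    """Return the matched chunk plus `window` chunks on each side,
    using the document_id / chunk_index metadata attached at indexing time."""
    doc_chunks = chunks_by_doc[hit["document_id"]]
    i = hit["chunk_index"]
    lo = max(0, i - window)
    hi = min(len(doc_chunks), i + window + 1)
    return doc_chunks[lo:hi]
```

This recovers answers that span chunk boundaries without inflating the index: retrieval stays precise on small chunks, while the LLM sees the surrounding context.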
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
