Enterprise RAG Pipeline Design and Optimization
How to build production-grade RAG pipelines for enterprise knowledge systems. Covers chunking strategies, hybrid search, reranking, context management, and evaluation methodologies.
Retrieval-Augmented Generation is the most practical way to give LLMs access to your organization’s knowledge without fine-tuning. But naive RAG — embed documents, retrieve top-k, stuff into prompt — fails in production for predictable reasons: irrelevant retrieval, lost context from bad chunking, hallucination despite having the right documents, and latency that makes users give up.
Production RAG requires engineering at every stage: how you chunk, how you embed, how you retrieve, how you rerank, and how you present context to the model. Each decision compounds — a 10% improvement at each of 5 stages compounds to a roughly 61% improvement end-to-end (1.1⁵ ≈ 1.61).
The RAG Pipeline Stages
Document → Chunk → Embed → Index → Query → Retrieve → Rerank → Generate → Validate
Each stage has failure modes that degrade the entire system:
| Stage | Common Failure | Impact |
|---|---|---|
| Chunking | Splitting mid-sentence or mid-concept | Retrieved chunks are incoherent |
| Embedding | Wrong model for your domain | Semantic similarity doesn’t match relevance |
| Retrieval | Top-k too small or unfiltered | Missing relevant context |
| Reranking | No reranking step | Noise pollutes context window |
| Generation | Context window overflow | Model ignores relevant information |
Semantic Chunking
The most underrated stage. Bad chunks doom everything downstream. Instead of fixed-size character splits, chunk at semantic boundaries:
```python
import re

import numpy as np


class SemanticChunker:
    def __init__(self, embedding_model, similarity_threshold=0.75):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold

    @staticmethod
    def split_sentences(text: str) -> list[str]:
        # Naive regex splitter; swap in spaCy or NLTK for production text.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    @staticmethod
    def cosine_similarity(a, b) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def chunk(self, text: str) -> list[str]:
        sentences = self.split_sentences(text)
        if not sentences:
            return []
        embeddings = self.embedding_model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]
        for i in range(1, len(sentences)):
            # Start a new chunk where adjacent sentences diverge semantically.
            similarity = self.cosine_similarity(embeddings[i - 1], embeddings[i])
            if similarity < self.threshold:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i]]
            else:
                current_chunk.append(sentences[i])
        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks
```
For structured documents, preserve the hierarchy by creating parent chunks (section summaries) and child chunks (paragraphs), then including parent context when a child chunk is retrieved.
Hybrid Search with Reciprocal Rank Fusion
Pure vector search misses exact matches (product names, error codes, IDs). Pure keyword search misses semantic meaning. Combine both using Reciprocal Rank Fusion (RRF):
```python
import asyncio
from collections import defaultdict


class HybridSearchPipeline:
    def __init__(self, vector_store, keyword_index):
        self.vector_store = vector_store
        self.keyword_index = keyword_index

    async def search(self, query: str, top_k: int = 20) -> list:
        # Run dense (vector) and sparse (keyword) retrieval concurrently.
        vector_results, keyword_results = await asyncio.gather(
            self.vector_store.search(query, top_k=top_k),
            self.keyword_index.search(query, top_k=top_k),
        )
        return self.rrf_merge(vector_results, keyword_results)[:top_k]

    def rrf_merge(self, *result_lists, k=60) -> list:
        # Reciprocal Rank Fusion: each doc's score is the sum of reciprocal
        # ranks across lists; docs appearing in both lists rise to the top.
        scores = defaultdict(float)
        by_id = {}
        for results in result_lists:
            for rank, result in enumerate(results):
                scores[result.id] += 1.0 / (k + rank + 1)
                by_id[result.id] = result
        # Return Result objects (not bare ids) so callers can slice directly.
        return [by_id[doc_id] for doc_id in sorted(scores, key=scores.get, reverse=True)]
```
RRF is simple, needs no weight tuning, and consistently outperforms weighted linear combination on diverse query types.
Cross-Encoder Reranking
Retrieve broadly (top-50), then rerank precisely (top-5). The retrieval step optimizes for recall; the reranking step optimizes for precision. Cross-encoder rerankers evaluate query-document pairs jointly, yielding 15-30% better relevance over raw retrieval.
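A sketch of that two-stage pattern with the scoring model abstracted behind a callable. The `overlap_score` stand-in below is purely illustrative — in production you would pass a real cross-encoder's scoring function (e.g. `CrossEncoder.predict` from sentence-transformers) instead:

```python
from typing import Callable


def rerank(query: str, docs: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    # Score every (query, doc) pair jointly, keep only the top_n.
    # Retrieval casts a wide net (recall); this step narrows it (precision).
    return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)[:top_n]


def overlap_score(query: str, doc: str) -> float:
    # Toy stand-in scorer (token overlap) so the example runs standalone;
    # replace with a cross-encoder model for real relevance scoring.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)
```

The pipeline stays the same when you swap scorers: retrieve top-50 with hybrid search, then call `rerank(query, candidates, model_score_fn, top_n=5)` before building the prompt.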
Context Window Optimization
The “Lost in the Middle” problem: LLMs attend less to information in the middle of long contexts. Place the most relevant documents at the beginning and end, not the middle. This reordering alone can improve answer quality by 10-15%.
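One way to implement that reordering, assuming the reranker hands you documents sorted best-first (the function name is illustrative):

```python
def reorder_for_context(docs_by_relevance: list[str]) -> list[str]:
    # Alternate documents between the front and back of the context so the
    # most relevant land at the edges and the least relevant in the middle,
    # where long-context attention is weakest.
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

Given five docs ranked 1–5, this yields the order 1, 3, 5, 4, 2: the top two results bracket the context, and rank 5 sits in the middle.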
Evaluation Framework
RAG evaluation must test each stage independently AND end-to-end:
| Metric | What It Measures | Target |
|---|---|---|
| Recall@k | Does the correct doc appear in top-k? | ≥ 0.90 |
| NDCG@k | Are most relevant docs ranked highest? | ≥ 0.85 |
| Faithfulness | Are answers supported by retrieved docs? | ≥ 0.95 |
| Hallucination Rate | Claims not in retrieved context | ≤ 5% |
Build a ground-truth test set: 200+ question-answer-source triples. Run evaluations on every pipeline change. A 2% improvement in retrieval recall compounds into a 10%+ improvement in answer quality.
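As a minimal sketch, Recall@k over such a test set reduces to a few lines (document ids and the triple format here are assumptions, not a specific framework's API):

```python
def recall_at_k(retrieved: list[list[str]], gold_sources: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold source doc appears in the top-k results.

    retrieved: per-query ranked lists of retrieved doc ids.
    gold_sources: the ground-truth source doc id for each query.
    """
    if not gold_sources:
        return 0.0
    hits = sum(1 for docs, gold in zip(retrieved, gold_sources) if gold in docs[:k])
    return hits / len(gold_sources)
```

Run this on every pipeline change; a regression here shows up before users see worse answers.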