Enterprise RAG Pipeline Design and Optimization
How to build production-grade RAG pipelines for enterprise knowledge systems. Covers chunking strategies, hybrid search, reranking, context management, and evaluation methodologies.
Retrieval-Augmented Generation is the most practical way to give LLMs access to your organization’s knowledge without fine-tuning. But naive RAG — embed documents, retrieve top-k, stuff into prompt — fails in production for predictable reasons: irrelevant retrieval, lost context from bad chunking, hallucination despite having the right documents, and latency that makes users give up.
Production RAG requires engineering at every stage: how you chunk, how you embed, how you retrieve, how you rerank, and how you present context to the model. Each decision compounds — a 10% improvement at each of 5 stages compounds to a roughly 61% improvement end-to-end (1.1⁵ ≈ 1.61).
The RAG Pipeline Stages
Document → Chunk → Embed → Index → Query → Retrieve → Rerank → Generate → Validate
Each stage has failure modes that degrade the entire system:
| Stage | Common Failure | Impact |
|---|---|---|
| Chunking | Splitting mid-sentence or mid-concept | Retrieved chunks are incoherent |
| Embedding | Wrong model for your domain | Semantic similarity doesn’t match relevance |
| Retrieval | Top-k too small or unfiltered | Missing relevant context |
| Reranking | No reranking step | Noise pollutes context window |
| Generation | Context window overflow | Model ignores relevant information |
Semantic Chunking
The most underrated stage. Bad chunks doom everything downstream. Instead of fixed-size character splits, chunk at semantic boundaries:
```python
import re

import numpy as np


class SemanticChunker:
    def __init__(self, embedding_model, similarity_threshold=0.75):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold

    @staticmethod
    def split_sentences(text: str) -> list[str]:
        # Naive regex splitter; swap in spaCy or NLTK for production text.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    @staticmethod
    def cosine_similarity(a, b) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def chunk(self, text: str) -> list[str]:
        sentences = self.split_sentences(text)
        if not sentences:
            return []
        embeddings = self.embedding_model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]
        for i in range(1, len(sentences)):
            # Start a new chunk where adjacent sentences diverge semantically.
            similarity = self.cosine_similarity(embeddings[i - 1], embeddings[i])
            if similarity < self.threshold:
                chunks.append(" ".join(current_chunk))
                current_chunk = [sentences[i]]
            else:
                current_chunk.append(sentences[i])
        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks
```
For structured documents, preserve the hierarchy by creating parent chunks (section summaries) and child chunks (paragraphs), then including parent context when a child chunk is retrieved.
Hybrid Search with Reciprocal Rank Fusion
Pure vector search misses exact matches (product names, error codes, IDs). Pure keyword search misses semantic meaning. Combine both using Reciprocal Rank Fusion (RRF):
```python
import asyncio
from collections import defaultdict


class HybridSearchPipeline:
    def __init__(self, vector_store, keyword_index):
        self.vector_store = vector_store
        self.keyword_index = keyword_index

    async def search(self, query: str, top_k: int = 20) -> list:
        # Run dense (vector) and sparse (keyword) retrieval concurrently.
        vector_results, keyword_results = await asyncio.gather(
            self.vector_store.search(query, top_k=top_k),
            self.keyword_index.search(query, top_k=top_k),
        )
        return self.rrf_merge(vector_results, keyword_results)[:top_k]

    def rrf_merge(self, *result_lists, k=60) -> list:
        # Reciprocal Rank Fusion: each doc's score is the sum of reciprocal
        # ranks across lists; docs appearing in both lists rise to the top.
        scores = defaultdict(float)
        by_id = {}
        for results in result_lists:
            for rank, result in enumerate(results):
                scores[result.id] += 1.0 / (k + rank + 1)
                by_id[result.id] = result
        # Return Result objects (not bare ids) so callers can slice directly.
        return [by_id[doc_id] for doc_id in sorted(scores, key=scores.get, reverse=True)]
```
RRF is simple, needs no weight tuning, and consistently outperforms weighted linear combination on diverse query types.
Cross-Encoder Reranking
Retrieve broadly (top-50), then rerank precisely (top-5). The retrieval step optimizes for recall; the reranking step optimizes for precision. Cross-encoder rerankers evaluate query-document pairs jointly, yielding 15-30% better relevance over raw retrieval.
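A sketch of that two-stage pattern with the scoring model abstracted behind a callable. The `overlap_score` stand-in below is purely illustrative — in production you would pass a real cross-encoder's scoring function (e.g. `CrossEncoder.predict` from sentence-transformers) instead:

```python
from typing import Callable


def rerank(query: str, docs: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    # Score every (query, doc) pair jointly, keep only the top_n.
    # Retrieval casts a wide net (recall); this step narrows it (precision).
    return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)[:top_n]


def overlap_score(query: str, doc: str) -> float:
    # Toy stand-in scorer (token overlap) so the example runs standalone;
    # replace with a cross-encoder model for real relevance scoring.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)
```

The pipeline stays the same when you swap scorers: retrieve top-50 with hybrid search, then call `rerank(query, candidates, model_score_fn, top_n=5)` before building the prompt.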
Context Window Optimization
The “Lost in the Middle” problem: LLMs attend less to information in the middle of long contexts. Place the most relevant documents at the beginning and end, not the middle. This reordering alone can improve answer quality by 10-15%.
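One way to implement that reordering, assuming the reranker hands you documents sorted best-first (the function name is illustrative):

```python
def reorder_for_context(docs_by_relevance: list[str]) -> list[str]:
    # Alternate documents between the front and back of the context so the
    # most relevant land at the edges and the least relevant in the middle,
    # where long-context attention is weakest.
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

Given five docs ranked 1–5, this yields the order 1, 3, 5, 4, 2: the top two results bracket the context, and rank 5 sits in the middle.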
Evaluation Framework
RAG evaluation must test each stage independently AND end-to-end:
| Metric | What It Measures | Target |
|---|---|---|
| Recall@k | Does the correct doc appear in top-k? | ≥ 0.90 |
| NDCG@k | Are most relevant docs ranked highest? | ≥ 0.85 |
| Faithfulness | Are answers supported by retrieved docs? | ≥ 0.95 |
| Hallucination Rate | Claims not in retrieved context | ≤ 5% |
Build a ground-truth test set: 200+ question-answer-source triples. Run evaluations on every pipeline change. A 2% improvement in retrieval recall compounds into a 10%+ improvement in answer quality.
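As a minimal sketch, Recall@k over such a test set reduces to a few lines (document ids and the triple format here are assumptions, not a specific framework's API):

```python
def recall_at_k(retrieved: list[list[str]], gold_sources: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold source doc appears in the top-k results.

    retrieved: per-query ranked lists of retrieved doc ids.
    gold_sources: the ground-truth source doc id for each query.
    """
    if not gold_sources:
        return 0.0
    hits = sum(1 for docs, gold in zip(retrieved, gold_sources) if gold in docs[:k])
    return hits / len(gold_sources)
```

Run this on every pipeline change; a regression here shows up before users see worse answers.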