# Vector Embeddings & Semantic Search
Build semantic search systems with embeddings. Covers embedding models, vector databases, similarity search, hybrid search, RAG pipelines, and embedding optimization.
Traditional keyword search fails when users search for “cheap flights” but the content says “affordable airfare.” Semantic search understands meaning, not just words. It converts text into high-dimensional vectors (embeddings) and finds similar vectors — connecting intent to content regardless of exact wording.
## How Embeddings Work
```
"How to deploy a Docker container"
        │  Embedding Model
        ▼
[0.023, -0.156, 0.891, 0.034, ..., -0.445]   (1536 dimensions)

"Steps to run a containerized application"
        │  Same Model
        ▼
[0.019, -0.148, 0.887, 0.041, ..., -0.439]   (similar vector!)

Cosine similarity: 0.94 (very similar meaning)
```
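The similarity score above is cosine similarity: the dot product of two vectors divided by the product of their magnitudes. A minimal sketch (the 4-element vectors are just the truncated ones from the diagram, so their similarity comes out near 1.0 rather than the full-vector 0.94):

```python
import math

def cosine_similarity(a, b):
    # Dot product normalized by vector magnitudes:
    # 1.0 = same direction, 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.023, -0.156, 0.891, 0.034]
v2 = [0.019, -0.148, 0.887, 0.041]
print(cosine_similarity(v1, v2))  # close to 1.0 for these truncated vectors
```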
## Embedding Models
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Fast | Good | $ |
| text-embedding-3-large (OpenAI) | 3072 | Medium | Best | $$ |
| multilingual-e5-large | 1024 | Fast | Good (multilingual) | Free (self-host) |
| BGE-large-en-v1.5 | 1024 | Fast | Good | Free (self-host) |
| Cohere embed-v3 | 1024 | Fast | Good | $ |
| Voyage-3 | 1024 | Fast | Excellent | $$ |
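As a sketch of producing embeddings with one of the hosted models above, assuming the OpenAI Python SDK v1 interface (`client.embeddings.create`); batching texts into one call is cheaper and faster than embedding one at a time:

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    # One API call for the whole batch; responses come back in input order.
    response = client.embeddings.create(model=model, input=texts)
    # Each item in response.data carries the raw vector for one input text.
    return [item.embedding for item in response.data]

# Usage (requires an API key):
# from openai import OpenAI
# vectors = embed_batch(OpenAI(), ["cheap flights", "affordable airfare"])
```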
## RAG Pipeline
```
User Query: "How do I handle database migrations in production?"
        │
        ▼
┌─────────────────┐
│  Embed Query    │ → [0.12, -0.34, 0.78, ...]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Search  │ Top 5 most similar documents
│  (Pinecone/     │ from your knowledge base
│   Weaviate/     │
│   Chroma)       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LLM Generation  │ "Based on these docs, here's how
│ (GPT-4, Claude) │  to handle database migrations..."
│                 │
│ Context:        │
│ [retrieved docs]│
└─────────────────┘
```
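The three stages above can be sketched as one small function. The embedding call, vector store, and LLM are passed in as callables so the skeleton stays vendor-neutral; the names are illustrative, not any particular library's API:

```python
def answer_with_rag(query, embed, search, generate, top_k=5):
    # 1. Embed the user query into a vector.
    query_vector = embed(query)
    # 2. Retrieve the top-k most similar chunks from the vector store.
    docs = search(query_vector, top_k)
    # 3. Ground the LLM's answer in the retrieved context.
    context = "\n\n".join(docs)
    prompt = (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)
```

In production each callable wraps a real client (e.g. an embedding API, a Pinecone index query, an LLM chat call); the structure stays the same.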
## Hybrid Search
```python
# Combine keyword (BM25) and semantic (vector) search, then fuse the rankings.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("docs")

# Semantic search: embed() is your embedding-model call (see above).
semantic_results = index.query(
    vector=embed("database migration best practices"),
    top_k=20,
    include_metadata=True,
)

# Keyword search: bm25_search() is your BM25 backend
# (e.g. Elasticsearch/OpenSearch), returning ranked results with .id fields.
keyword_results = bm25_search("database migration production")

# Reciprocal Rank Fusion (RRF): reward documents that rank highly in either list.
def reciprocal_rank_fusion(results_lists, k=60):
    scores = {}
    for results in results_lists:
        for rank, result in enumerate(results):
            doc_id = result.id
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Pinecone returns matches under .matches; each match has an .id.
final_results = reciprocal_rank_fusion([semantic_results.matches, keyword_results])
```
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Embedding entire documents | Context lost in averaging | Chunk documents (200-500 tokens per chunk) |
| No chunk overlap | Context split across chunk boundaries | 10-20% overlap between consecutive chunks |
| Wrong embedding model | Poor retrieval quality | Benchmark models on your data, use MTEB leaderboard |
| Vector search only | Misses exact keyword matches | Hybrid search (vector + BM25) |
| No reranking | Top results not always most relevant | Rerank top-20 with cross-encoder |
| Stale embeddings | Content updated but embeddings not refreshed | Re-embed on content change |
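The chunking fixes in the table (200-500 tokens per chunk, 10-20% overlap) can be sketched as a sliding window over a token list. Real systems count tokens with the embedding model's tokenizer (e.g. tiktoken); here the token list is taken as given:

```python
def chunk_tokens(tokens, size=400, overlap=60):
    # size=400 with overlap=60 gives 15% shared context between neighbours,
    # so a sentence split at a boundary still appears whole in one chunk.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```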
## Checklist
- Embedding model selected (benchmark on your data)
- Chunking strategy: 200-500 tokens, 10-20% overlap
- Vector database chosen (Pinecone, Weaviate, Chroma, pgvector)
- Hybrid search: vector + keyword for best recall
- Reranking: cross-encoder on top-K results
- Metadata filtering: narrow search by category/date
- Embedding refresh pipeline for updated content
- Evaluation: retrieval quality metrics (recall@k, MRR)
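The two retrieval metrics in the last checklist item are straightforward to compute. Recall@k is the fraction of relevant documents that appear in the top-k retrieved list; MRR averages the reciprocal rank of the first relevant hit across queries:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant doc IDs found in the top-k retrieved IDs.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    # Average of 1/rank of the first relevant doc per query (0 if none found).
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

Run both over a held-out set of (query, relevant-docs) pairs before and after changes to chunking, models, or reranking, so retrieval quality regressions are caught in numbers rather than anecdotes.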
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI/ML consulting, visit garnetgrid.com.
:::