# Vector Embeddings & Semantic Search
Build semantic search systems with embeddings. Covers embedding models, vector databases, similarity search, hybrid search, RAG pipelines, and embedding optimization.
Traditional keyword search fails when users search for “cheap flights” but the content says “affordable airfare.” Semantic search understands meaning, not just words. It converts text into high-dimensional vectors (embeddings) and finds similar vectors — connecting intent to content regardless of exact wording.
## How Embeddings Work
```
"How to deploy a Docker container"
        │  Embedding Model
        ▼
[0.023, -0.156, 0.891, 0.034, ..., -0.445]   (1536 dimensions)

"Steps to run a containerized application"
        │  Same Model
        ▼
[0.019, -0.148, 0.887, 0.041, ..., -0.439]   (similar vector!)

Cosine similarity: 0.94 (very similar meaning)
```
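The similarity score above is cosine similarity: the dot product of two vectors divided by the product of their magnitudes. A minimal sketch (the 4-element vectors are just the truncated ones from the diagram, so their similarity comes out near 1.0 rather than the full-vector 0.94):

```python
import math

def cosine_similarity(a, b):
    # Dot product normalized by vector magnitudes:
    # 1.0 = same direction, 0.0 = orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.023, -0.156, 0.891, 0.034]
v2 = [0.019, -0.148, 0.887, 0.041]
print(cosine_similarity(v1, v2))  # close to 1.0 for these truncated vectors
```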
## Embedding Models
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Fast | Good | $ |
| text-embedding-3-large (OpenAI) | 3072 | Medium | Best | $$ |
| multilingual-e5-large | 1024 | Fast | Good (multilingual) | Free (self-host) |
| BGE-large-en-v1.5 | 1024 | Fast | Good | Free (self-host) |
| Cohere embed-v3 | 1024 | Fast | Good | $ |
| Voyage-3 | 1024 | Fast | Excellent | $$ |
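As a sketch of producing embeddings with one of the hosted models above, assuming the OpenAI Python SDK v1 interface (`client.embeddings.create`); batching texts into one call is cheaper and faster than embedding one at a time:

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    # One API call for the whole batch; responses come back in input order.
    response = client.embeddings.create(model=model, input=texts)
    # Each item in response.data carries the raw vector for one input text.
    return [item.embedding for item in response.data]

# Usage (requires an API key):
# from openai import OpenAI
# vectors = embed_batch(OpenAI(), ["cheap flights", "affordable airfare"])
```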
## RAG Pipeline
```
User Query: "How do I handle database migrations in production?"
        │
        ▼
┌─────────────────┐
│  Embed Query    │ → [0.12, -0.34, 0.78, ...]
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Search  │ Top 5 most similar documents
│  (Pinecone/     │ from your knowledge base
│   Weaviate/     │
│   Chroma)       │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LLM Generation  │ "Based on these docs, here's how
│ (GPT-4, Claude) │  to handle database migrations..."
│                 │
│ Context:        │
│ [retrieved docs]│
└─────────────────┘
```
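The three stages above can be sketched as one small function. The embedding call, vector store, and LLM are passed in as callables so the skeleton stays vendor-neutral; the names are illustrative, not any particular library's API:

```python
def answer_with_rag(query, embed, search, generate, top_k=5):
    # 1. Embed the user query into a vector.
    query_vector = embed(query)
    # 2. Retrieve the top-k most similar chunks from the vector store.
    docs = search(query_vector, top_k)
    # 3. Ground the LLM's answer in the retrieved context.
    context = "\n\n".join(docs)
    prompt = (
        f"Answer using only this context:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)
```

In production each callable wraps a real client (e.g. an embedding API, a Pinecone index query, an LLM chat call); the structure stays the same.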
## Hybrid Search
```python
# Combine keyword (BM25) and semantic (vector) search, then fuse the rankings.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("docs")

# Semantic search: embed() is your embedding-model call (see above).
semantic_results = index.query(
    vector=embed("database migration best practices"),
    top_k=20,
    include_metadata=True,
)

# Keyword search: bm25_search() is your BM25 backend
# (e.g. Elasticsearch/OpenSearch), returning ranked results with .id fields.
keyword_results = bm25_search("database migration production")

# Reciprocal Rank Fusion (RRF): reward documents that rank highly in either list.
def reciprocal_rank_fusion(results_lists, k=60):
    scores = {}
    for results in results_lists:
        for rank, result in enumerate(results):
            doc_id = result.id
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Pinecone returns matches under .matches; each match has an .id.
final_results = reciprocal_rank_fusion([semantic_results.matches, keyword_results])
```
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Embedding entire documents | Context lost in averaging | Chunk documents (200-500 tokens per chunk) |
| No chunk overlap | Context split across chunk boundaries | 10-20% overlap between consecutive chunks |
| Wrong embedding model | Poor retrieval quality | Benchmark models on your data, use MTEB leaderboard |
| Vector search only | Misses exact keyword matches | Hybrid search (vector + BM25) |
| No reranking | Top results not always most relevant | Rerank top-20 with cross-encoder |
| Stale embeddings | Content updated but embeddings not refreshed | Re-embed on content change |
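The chunking fixes in the table (200-500 tokens per chunk, 10-20% overlap) can be sketched as a sliding window over a token list. Real systems count tokens with the embedding model's tokenizer (e.g. tiktoken); here the token list is taken as given:

```python
def chunk_tokens(tokens, size=400, overlap=60):
    # size=400 with overlap=60 gives 15% shared context between neighbours,
    # so a sentence split at a boundary still appears whole in one chunk.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```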
## Checklist
- Embedding model selected (benchmark on your data)
- Chunking strategy: 200-500 tokens, 10-20% overlap
- Vector database chosen (Pinecone, Weaviate, Chroma, pgvector)
- Hybrid search: vector + keyword for best recall
- Reranking: cross-encoder on top-K results
- Metadata filtering: narrow search by category/date
- Embedding refresh pipeline for updated content
- Evaluation: retrieval quality metrics (recall@k, MRR)
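The two retrieval metrics in the last checklist item are straightforward to compute. Recall@k is the fraction of relevant documents that appear in the top-k retrieved list; MRR averages the reciprocal rank of the first relevant hit across queries:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of relevant doc IDs found in the top-k retrieved IDs.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    # Average of 1/rank of the first relevant doc per query (0 if none found).
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

Run both over a held-out set of (query, relevant-docs) pairs before and after changes to chunking, models, or reranking, so retrieval quality regressions are caught in numbers rather than anecdotes.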
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI/ML consulting, visit garnetgrid.com.
:::