Vector Databases
Understand vector databases and how they power semantic search, recommendation engines, and AI applications. Covers embedding generation, similarity search, indexing algorithms, hybrid search, and the patterns for building production vector search systems.
Vector databases store and search high-dimensional vectors — mathematical representations of data like text, images, and audio. Instead of keyword matching, vector search finds semantically similar items: “running shoes” matches “athletic sneakers” even though they share no keywords. This is the foundation of semantic search, RAG systems, and recommendation engines.
How Vector Search Works
Traditional search:

```
Query: "running shoes"
Match: documents containing the exact words "running" AND "shoes"
Miss:  "athletic sneakers", "jogging footwear", "nike air max"
```

Vector search:

```
Query:  "running shoes" → embed → [0.23, -0.45, 0.67, ...]
Search: find nearest vectors in the database
Match:  "athletic sneakers"    (similarity: 0.95)
        "jogging footwear"     (similarity: 0.91)
        "nike air max running" (similarity: 0.89)
        "hiking boots"         (similarity: 0.42) ← correctly low
```
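The nearest-neighbor step can be sketched in plain Python with a brute-force cosine-similarity scan. The 3-dimensional vectors and item names below are made up for illustration; real embeddings have hundreds of dimensions, and production databases use approximate indexes rather than scanning every vector:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query, corpus, k=3):
    # Score every stored vector against the query and return the best k.
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings": similar items point in similar directions.
corpus = {
    "athletic sneakers": [0.9, 0.1, 0.0],
    "jogging footwear": [0.8, 0.2, 0.1],
    "hiking boots": [0.2, 0.1, 0.9],
}
query = [0.95, 0.05, 0.05]  # stands in for embed("running shoes")
results = top_k(query, corpus, k=3)
# "athletic sneakers" and "jogging footwear" rank above "hiking boots"
```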
Embedding Generation
```python
# Text embeddings with OpenAI
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Running shoes for marathon training",
)
embedding = response.data[0].embedding  # 1536-dimensional vector
```

```python
# Sentence Transformers (open source)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "Running shoes for marathon training",
    "Athletic sneakers for jogging",
    "Formal leather dress shoes",
])
# Shape: (3, 384)

# Similarity
from sklearn.metrics.pairwise import cosine_similarity

sims = cosine_similarity(embeddings)
# sims[0, 1] ≈ 0.87 — running vs. sneakers (similar)
# sims[0, 2] ≈ 0.23 — running vs. formal (dissimilar)
```
Vector Database Comparison
| Database | Type | Best For | Hosting |
|---|---|---|---|
| Pinecone | Purpose-built | Production, managed, fast | Fully managed |
| pgvector | PostgreSQL extension | Postgres shops, small scale | Self-hosted or cloud Postgres |
| Weaviate | Purpose-built | Multi-modal, built-in ML | Managed or self-hosted |
| Qdrant | Purpose-built | Performance, Rust-based | Managed or self-hosted |
| ChromaDB | Embedded | Prototyping, small datasets | In-process |
| Milvus | Purpose-built | Large scale, enterprise | Self-hosted |
Pinecone Usage
```python
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("products")

# Upsert vectors. embed() is a helper wrapping one of the
# embedding models above; it returns a list of floats.
index.upsert(vectors=[
    {
        "id": "shoe-001",
        "values": embed("Running shoes for marathon"),
        "metadata": {
            "category": "footwear",
            "price": 129.99,
            "brand": "Nike",
        },
    },
    # ... more vectors
])

# Query with metadata filtering
results = index.query(
    vector=embed("shoes for running"),
    top_k=10,
    filter={
        "category": {"$eq": "footwear"},
        "price": {"$lte": 150.00},
    },
    include_metadata=True,
)

for match in results.matches:
    print(f"{match.id}: {match.score:.3f} - {match.metadata}")
```
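The filter-then-rank pattern the query uses (narrow candidates by metadata first, then rank survivors by similarity) can be sketched in plain Python. The catalog, item IDs, and 2-dimensional vectors here are hypothetical:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Each item carries a vector plus filterable metadata.
items = [
    {"id": "shoe-001", "vector": [0.9, 0.1], "category": "footwear", "price": 129.99},
    {"id": "shoe-002", "vector": [0.8, 0.3], "category": "footwear", "price": 189.99},
    {"id": "tent-001", "vector": [0.1, 0.9], "category": "camping", "price": 99.99},
]

def filtered_query(query_vec, category, max_price, top_k=10):
    # 1. Pre-filter on metadata so similarity runs only over candidates.
    candidates = [i for i in items
                  if i["category"] == category and i["price"] <= max_price]
    # 2. Rank the survivors by vector similarity.
    ranked = sorted(candidates,
                    key=lambda i: cosine(query_vec, i["vector"]),
                    reverse=True)
    return ranked[:top_k]

matches = filtered_query([0.95, 0.05], category="footwear", max_price=150.00)
# Only shoe-001 survives both the category and price filters.
```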
Hybrid Search
Vector-only search:
Pro: Semantic understanding
Con: Misses exact keyword matches ("SKU-12345")
Keyword-only search:
Pro: Exact matching, well-understood
Con: Misses semantic similarity
Hybrid search:
Combine both: vector similarity + keyword matching
Score = α * vector_score + (1-α) * keyword_score
Implementation:
1. Sparse embedding (BM25/TF-IDF) for keywords
2. Dense embedding for semantics
3. Combine scores with weighted fusion
4. Or: separate queries, reciprocal rank fusion
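Both fusion strategies from the steps above can be sketched in a few lines. The document IDs and scores below are made up; in practice the vector and keyword scores would come from a dense retriever and a BM25 index respectively:

```python
def weighted_fusion(vector_scores, keyword_scores, alpha=0.7):
    # score = alpha * vector_score + (1 - alpha) * keyword_score
    ids = set(vector_scores) | set(keyword_scores)
    return {
        doc: alpha * vector_scores.get(doc, 0.0)
             + (1 - alpha) * keyword_scores.get(doc, 0.0)
        for doc in ids
    }

def reciprocal_rank_fusion(rankings, k=60):
    # RRF: each ranked list contributes 1 / (k + rank) per document,
    # so no score normalization across retrievers is needed.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical normalized scores from the two retrievers.
vector_scores = {"sneakers": 0.95, "jogging": 0.91, "sku-12345": 0.10}
keyword_scores = {"sku-12345": 1.00, "sneakers": 0.20}

fused = weighted_fusion(vector_scores, keyword_scores, alpha=0.5)
order = reciprocal_rank_fusion([
    ["sneakers", "jogging", "sku-12345"],   # dense ranking
    ["sku-12345", "sneakers"],              # sparse (BM25) ranking
])
```

Note how "sku-12345" — invisible to vector search — is rescued by the keyword side in both schemes.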
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Wrong embedding model | Poor similarity results | Match model to your domain and data type |
| No metadata filtering | Slow search over entire index | Pre-filter with metadata before vector search |
| Embedding everything unstructured | Noise in results | Chunk text properly, clean data |
| No evaluation metrics | Cannot measure search quality | Precision@k, recall@k, NDCG |
| Single vector per document | Miss multi-faceted content | Multiple embeddings or chunk-level vectors |
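The evaluation metrics the table calls for are cheap to compute. A minimal sketch of precision@k and recall@k over made-up retrieval results (NDCG, which additionally weights by rank position, is omitted for brevity):

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved items that are relevant.
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant items that appear in the top k.
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

# Hypothetical search output and ground-truth labels for one query.
retrieved = ["sneakers", "hiking boots", "jogging", "dress shoes"]
relevant = {"sneakers", "jogging", "trail runners"}

p = precision_at_k(retrieved, relevant, k=3)  # 2 of the top 3 are relevant
r = recall_at_k(retrieved, relevant, k=3)     # 2 of 3 relevant items found
```

Averaging these over a held-out set of labeled queries gives a number you can track as you change embedding models or chunking strategy.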
Vector databases are the infrastructure layer for AI-native applications. They enable the shift from “search by keywords” to “search by meaning.”