Vector Database Engineering
Build and operate vector databases for similarity search at scale. Covers embedding storage, approximate nearest neighbor algorithms, index types, hybrid search, vector database selection, and the patterns that make semantic search fast and accurate.
Vector databases store high-dimensional vectors (embeddings) and enable similarity search — finding items that are semantically similar rather than exactly matching. They power recommendation engines, semantic search, RAG systems, and image retrieval. As AI applications explode, understanding vector database internals separates working demos from production systems.
How Vector Search Works
| | Traditional Database | Vector Database |
|---|---|---|
| Query | `SELECT * FROM users WHERE name = 'Alice'` | Find the 10 items most similar to this embedding vector |
| Method | Exact match (B-tree) | Approximate Nearest Neighbor (ANN) |
| Result | Exact rows matching the predicate | Top-K closest vectors |
Key insight:
Traditional: "Find rows where column = value" → Exact
Vector: "Find rows closest to this point in 768-dimensional space" → Approximate
Index Types
Flat (Brute Force):
How: Compare query against every vector
Speed: O(N) — slow for large datasets
Recall: 100% — perfect accuracy
Use: Small datasets (<100K vectors)
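FAISS is one common library for experimenting with these index types. A minimal flat-index sketch (dimension and data are made up for illustration):

```python
import faiss
import numpy as np

d = 768                                              # embedding dimension
xb = np.random.randn(100_000, d).astype("float32")   # database vectors
xq = np.random.randn(1, d).astype("float32")         # query vector

index = faiss.IndexFlatL2(d)             # brute force: no training, exact results
index.add(xb)
distances, ids = index.search(xq, k=10)  # scans every vector
```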
IVF (Inverted File Index):
How: Cluster vectors, search only nearest clusters
Speed: ~O(N · nprobe / k) — k clusters total, only nprobe of them scanned per query
Recall: 95-99%
Use: Medium datasets (100K-10M vectors)
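Continuing the FAISS sketch above (same `d`, `xb`, `xq`), an IVF index trades a little recall for a large speedup:

```python
nlist = 1024                              # number of clusters (k)
quantizer = faiss.IndexFlatL2(d)          # used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)     # learn cluster centroids (required before add)
index.add(xb)
index.nprobe = 16   # clusters searched per query: higher = better recall, slower
distances, ids = index.search(xq, k=10)
```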
HNSW (Hierarchical Navigable Small World):
How: Multi-layer graph with skip connections
Speed: O(log N) — very fast
Recall: 95-99%
Use: Large datasets, low latency requirements
Trade-off: High memory usage (graph stored in RAM)
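The equivalent HNSW sketch; `M`, `efConstruction`, and `efSearch` are the knobs that trade memory and build time against recall and latency:

```python
index = faiss.IndexHNSWFlat(d, 32)   # M=32 graph neighbors per node
index.hnsw.efConstruction = 200      # build-time search breadth
index.add(xb)                        # no training step needed
index.hnsw.efSearch = 64             # query-time breadth: recall vs. latency
distances, ids = index.search(xq, k=10)
```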
PQ (Product Quantization):
How: Compress vectors into compact codes
Speed: Fast (operates on compressed data)
Recall: 90-95%
Use: Very large datasets with memory constraints
Trade-off: Some accuracy loss from compression
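And a product-quantization sketch: each 768-dim float vector (3,072 bytes) is compressed to 64 one-byte codes, cutting memory roughly 48x at some cost in recall:

```python
m = 64                          # sub-vectors per embedding; d must divide evenly by m
index = faiss.IndexPQ(d, m, 8)  # 8 bits per sub-vector -> 64 bytes per vector
index.train(xb)                 # learn the codebooks
index.add(xb)                   # stores compressed codes, not raw floats
distances, ids = index.search(xq, k=10)  # distances computed on the codes
```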
Implementation
```python
from pinecone import Pinecone, ServerlessSpec

# Initialize client
pc = Pinecone(api_key="your-api-key")

# Create index for semantic search
pc.create_index(
    name="knowledge-base",
    dimension=1536,    # OpenAI ada-002 embedding size
    metric="cosine",   # cosine, euclidean, or dotproduct
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

# Upsert vectors with metadata.
# embed() is a placeholder for your embedding function; it must return a
# 1536-dimensional vector (e.g., from OpenAI text-embedding-ada-002).
index = pc.Index("knowledge-base")
index.upsert(vectors=[
    {
        "id": "doc-001",
        "values": embed("How to configure Kubernetes ingress"),
        "metadata": {
            "source": "docs",
            "category": "kubernetes",
            # Store dates as Unix timestamps: Pinecone's $gte/$lte
            # comparisons work on numbers, not date strings
            "updated": 1710460800,  # 2024-03-15
        },
    },
    {
        "id": "doc-002",
        "values": embed("Setting up PostgreSQL replication"),
        "metadata": {
            "source": "docs",
            "category": "database",
            "updated": 1710028800,  # 2024-03-10
        },
    },
])

# Query: find semantically similar documents
results = index.query(
    vector=embed("How do I set up load balancing in K8s?"),
    top_k=5,
    include_metadata=True,
    filter={
        "category": {"$eq": "kubernetes"},
        "updated": {"$gte": 1704067200},  # on or after 2024-01-01
    },
)

# Results ranked by cosine similarity (higher score = more similar)
for match in results.matches:
    print(f"{match.id}: {match.score:.4f}")
    print(f"  Category: {match.metadata['category']}")
```
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Wrong distance metric | Poor search quality (cosine vs euclidean) | Match metric to embedding model recommendation |
| Too few dimensions | Lose semantic information | Use full dimension from embedding model |
| No metadata filters | Search entire index for every query | Pre-filter by metadata before vector search |
| Single embedding model for everything | Suboptimal retrieval for content the model wasn't built for | Specialized models per content type |
| No evaluation of search quality | Cannot tell whether results are relevant | Track precision@k, NDCG, user click-through (see the sketch below) |
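For the last row, one lightweight starting point is recall@k of the approximate index against exact search, reusing the FAISS sketches above (`d`, `xb`, and any approximate index standing in as the hypothetical `ann_index`):

```python
xq_sample = np.random.randn(100, d).astype("float32")  # stand-in for real user queries
k = 10

flat = faiss.IndexFlatL2(d)   # exact search provides the ground truth
flat.add(xb)
_, exact_ids = flat.search(xq_sample, k)
_, ann_ids = ann_index.search(xq_sample, k)

# recall@k: fraction of the true top-k neighbors the ANN index returned
recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(ann_ids, exact_ids)])
print(f"recall@{k}: {recall:.3f}")
```

Note this measures index fidelity (how closely ANN matches exact search), not relevance; precision@k and NDCG additionally require labeled judgments or click data.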
Vector databases are the bridge between human-readable content and machine-understandable representations. Choose the right index type, match the distance metric to your embedding model, and always evaluate search quality with real queries.