Vector Databases: Architecture & Selection Guide
Understand vector database internals and choose the right one. Covers embedding storage, ANN algorithms, and comparisons of Pinecone, Weaviate, Qdrant, Milvus, and pgvector.
Vector databases store and search high-dimensional embeddings — the numerical representations that AI models create from text, images, audio, and other unstructured data. They power semantic search, recommendation engines, RAG (Retrieval-Augmented Generation) systems, anomaly detection, and de-duplication. As AI adoption accelerates, vector databases have become critical infrastructure.
This guide covers how vector search works internally, algorithm trade-offs, database selection criteria, and production best practices that prevent performance and cost surprises.
How Vector Search Works
Step 1: Generate Embeddings
An embedding model converts unstructured data (text, images) into fixed-length numerical vectors. Similar content produces similar vectors, enabling mathematical similarity comparison.
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
    input="How do I optimize PostgreSQL for high-traffic applications?",
    model="text-embedding-3-small"
)
embedding = response.data[0].embedding  # [0.012, -0.045, ...] (1536 floats)
Step 2: Choose a Distance Metric
The distance metric determines how “similarity” is calculated between vectors.
| Metric | Best For | How It Works |
|---|---|---|
| Cosine Similarity | Text embeddings (normalized vectors) | Measures the angle between vectors — 1.0 = identical direction, 0.0 = unrelated, -1.0 = opposite |
| Euclidean (L2) | Image embeddings, spatial data | Measures straight-line distance — 0.0 = identical, higher = farther |
| Dot Product | When magnitude matters (popularity, relevance scoring) | Combines direction and magnitude — higher = more similar |
Rule of thumb: Use cosine similarity for text-based applications. Most embedding models produce normalized vectors where cosine similarity and dot product are equivalent.
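The normalized-vector equivalence can be checked directly. This is a toy numpy sketch (the vectors are made up for illustration; real embeddings would have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine = dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two toy "embeddings"
a = np.array([0.3, 0.4, 0.5])
b = np.array([0.2, 0.5, 0.4])

# Normalize to unit length, as most text-embedding models already do
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# For unit vectors, cosine similarity and dot product coincide
assert np.isclose(cosine_similarity(a, b), np.dot(a_n, b_n))
```

This is why vector databases can use the cheaper dot product internally when they know the stored vectors are normalized.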
ANN Algorithms Deep Dive
Exact nearest-neighbor search (brute force) is O(n) — too slow at scale. Approximate Nearest Neighbor (ANN) algorithms trade small accuracy losses for massive speed gains.
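For reference, exact (flat) search is just a full scan over every stored vector. A minimal numpy sketch with synthetic data shows what the ANN algorithms below are approximating:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 64))               # 10k vectors, 64 dims (toy scale)
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize for cosine scoring

query = db[42] + 0.01 * rng.normal(size=64)      # a query very close to vector 42
query /= np.linalg.norm(query)

# O(n) brute force: score every vector, then take the top-k
scores = db @ query                    # cosine similarity (unit vectors)
top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 best matches
print(top_k[0])                        # vector 42 should rank first
```

At 10k vectors this is instant; at hundreds of millions, the full scan per query is what makes ANN indexes necessary.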
HNSW (Hierarchical Navigable Small World)
The most popular algorithm. HNSW builds a multi-layer graph (like a skip list) where higher layers contain fewer nodes for fast traversal, and lower layers contain all nodes for precision.
Key Parameters:
| Parameter | What It Controls | Recommended Range | Trade-off |
|---|---|---|---|
| M | Max connections per node | 16-64 | Higher = better recall, more memory |
| ef_construction | Build-time search width | 100-400 | Higher = better graph quality, slower build |
| ef_search | Query-time search width | 50-200 | Higher = better recall, slower queries |
When to use: Most production workloads. Best balance of speed, accuracy, and memory.
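Because HNSW keeps full vectors plus graph links in memory, it pays to estimate RAM before committing. This back-of-the-envelope sketch is an approximation, not a library API — real implementations add per-node overhead that varies by engine:

```python
def hnsw_memory_estimate(n_vectors, dims, M=16,
                         bytes_per_float=4, bytes_per_link=4):
    # Raw float32 vectors held in memory
    vector_bytes = n_vectors * dims * bytes_per_float
    # Graph links: roughly 2*M neighbors per node on the base layer,
    # plus about M per node for the sparser upper layers
    link_bytes = n_vectors * (2 * M + M) * bytes_per_link
    return vector_bytes + link_bytes

# 10M vectors at 1536 dims with M=16
est = hnsw_memory_estimate(10_000_000, 1536, M=16)
print(f"~{est / 1e9:.0f} GB")  # vectors dominate: ~61 GB data + ~2 GB links
```

The takeaway: at OpenAI-scale dimensions, the vectors themselves dwarf the graph, so dimensionality reduction or quantization (covered later) moves the needle far more than tuning M.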
IVF (Inverted File Index)
Clusters vectors into partitions (Voronoi cells). At query time, searches only the closest clusters instead of the full dataset.
Key Parameters:
| Parameter | What It Controls | Recommended |
|---|---|---|
| nlist | Number of clusters | √n to 4×√n |
| nprobe | Clusters searched per query | 5-20% of nlist |
When to use: Very large datasets where memory is constrained. Often combined with Product Quantization (PQ) for compression.
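The IVF idea fits in a few lines of numpy. This sketch uses a random sample of the data as centroids, a crude stand-in for the k-means training step real IVF implementations run, and all names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.normal(size=(5_000, 32)).astype(np.float32)

# "Train": random data points stand in for k-means centroids
nlist = 64
centroids = db[rng.choice(len(db), nlist, replace=False)]

# Build the inverted lists: assign every vector to its nearest centroid
assign = np.argmin(np.linalg.norm(db[:, None] - centroids[None], axis=2), axis=1)

def ivf_search(query, nprobe=8, k=5):
    # Probe only the nprobe closest clusters instead of scanning everything
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.flatnonzero(np.isin(assign, probe))
    dists = np.linalg.norm(db[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

top = ivf_search(db[123])
print(top[0])  # 123 — the query's own cluster is always among the probes
```

The recall/latency trade-off is visible in the code: raising nprobe scans more candidates, which recovers neighbors that fell into adjacent cells at the cost of a larger linear scan.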
Algorithm Comparison
| Algorithm | Speed | Memory | Accuracy | Build Time |
|---|---|---|---|---|
| HNSW | Fastest (sub-ms) | High (full vectors in memory) | Highest (99%+ recall) | Slow |
| IVF-PQ | Fast | Low (compressed vectors) | Good (90-95% recall) | Medium |
| IVF-Flat | Medium | Medium | Very Good (97%+ recall) | Fast |
| Flat (exact) | Slow (O(n)) | Low | Perfect (100%) | None |
| ScaNN | Very Fast | Medium | Very High | Medium |
Database Comparison
| Feature | Pinecone | Weaviate | Qdrant | Milvus | pgvector |
|---|---|---|---|---|---|
| Hosting | Fully managed | Both (cloud + self-hosted) | Both | Both (self-hosted + Zilliz Cloud) | PostgreSQL extension |
| Algorithm | Proprietary (optimized) | HNSW | HNSW | HNSW + IVF + ScaNN | HNSW + IVF |
| Hybrid search | Yes (dense + sparse) | Yes (BM25 + vector) | Yes (sparse + dense) | Yes | With tsvector |
| Max scale | Billions (serverless) | Billions | Billions | Billions | Tens of millions |
| Filtering | Metadata filtering (pre/post) | GraphQL-style filters | Payload filtering | Attribute filtering | Standard SQL WHERE |
| Multi-tenancy | Namespaces | Multi-tenant classes | Collection-level | Partitions | Schema-level |
| Best for | Simplicity, serverless | Multi-modal (text + images) | Raw performance | Large-scale ML pipelines | Existing PostgreSQL installations |
| Pricing | Pay-per-query (serverless) | Open source + cloud | Open source + cloud | Open source + Zilliz | Free (PostgreSQL) |
Implementation: Pinecone (Managed)
from pinecone import Pinecone
pc = Pinecone(api_key="your-api-key")
index = pc.Index("knowledge-base")
# Upsert documents with metadata for filtering
index.upsert(vectors=[{
    "id": "doc-001",
    "values": embedding_vector,
    "metadata": {
        "source": "guide.pdf",
        "category": "cloud",
        "text": "How to optimize cloud costs...",
        "word_count": 1500,
        "published": "2025-01-15"
    }
}])
# Semantic search with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"category": {"$eq": "cloud"}},
    include_metadata=True
)
for match in results['matches']:
    print(f"Score: {match['score']:.3f} | {match['metadata']['text'][:80]}")
Implementation: pgvector (Self-Hosted)
If you already run PostgreSQL, pgvector avoids adding a new database to your infrastructure.
-- Enable the extension
CREATE EXTENSION vector;
-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    category TEXT,
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);
-- Create HNSW index for fast similarity search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Semantic search with SQL filtering
SELECT id, title,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE category = 'cloud'
ORDER BY embedding <=> $1::vector
LIMIT 10;
pgvector Limitations
pgvector works well up to 5-10 million vectors. Beyond that, query latency increases and memory requirements grow quickly. If you expect to exceed this scale, plan a migration path to a dedicated vector database.
Selection Decision Tree
- Need managed, serverless, minimal ops? → Pinecone
- Already using PostgreSQL and < 5M vectors? → pgvector
- Need multi-modal (text + images + video)? → Weaviate
- Need max query performance, self-hosted? → Qdrant or Milvus
- Prototyping / local development? → Chroma or LanceDB
- Need hybrid search (keyword + semantic)? → Pinecone, Weaviate, or Qdrant
- Budget constrained, open-source preferred? → Qdrant or pgvector
Production Best Practices
Chunking Strategy
Chunking (splitting documents into smaller pieces before embedding) directly impacts retrieval quality. Too large and the embedding averages out important details. Too small and context is lost.
# Recommended: ~500 tokens with 50-token overlap
# Overlap prevents information loss at chunk boundaries
chunks = split_text(document, chunk_size=500, chunk_overlap=50)
| Content Type | Recommended Chunk Size | Overlap |
|---|---|---|
| Technical documentation | 400-600 tokens | 50-100 tokens |
| Legal contracts | 200-400 tokens | 50 tokens |
| Code files | Per function/class | 0 (natural boundaries) |
| FAQ/knowledge base | Per question-answer pair | 0 |
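The split_text call shown earlier is pseudocode. A minimal sketch follows, using whitespace-split words as a stand-in for tokens — a real pipeline would count tokens with the embedding model's tokenizer instead:

```python
def split_text(text, chunk_size=500, chunk_overlap=50):
    # Words stand in for tokens here; swap in a real tokenizer for production
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the tail of the document
    return chunks

# 1200 "tokens" → 3 overlapping windows: 0-499, 450-949, 900-1199
doc = " ".join(f"w{i}" for i in range(1200))
chunks = split_text(doc, chunk_size=500, chunk_overlap=50)
print(len(chunks))
```

Note how each window starts 50 tokens before the previous one ended, so a sentence straddling a boundary still appears whole in at least one chunk.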
Embedding Model Selection
| Model | Dimensions | Cost | Quality |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/M tokens | Good |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/M tokens | Better |
| Cohere embed-v3 | 1024 | $0.10/M tokens | Very Good |
| BGE-M3 (open source) | 1024 | Free (self-hosted) | Good |
| Voyage AI voyage-3 | 1024 | $0.06/M tokens | Excellent |
Cost Control
- Dimensionality reduction — OpenAI's embedding-3 models support a dimensions parameter to reduce vector size (e.g., 1536 → 512) with minimal quality loss
- Serverless vs provisioned — Use serverless (Pinecone) for variable workloads, provisioned for steady high throughput
- Batch embedding — Embed documents in batches, not one at a time, to reduce API calls
- Cache frequent queries — Cache embedding + results for repeated queries
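On the dimensionality-reduction point: models trained with Matryoshka-style objectives (OpenAI's embedding-3 family among them) tolerate truncating the vector to its leading dimensions, provided you re-normalize afterward. A sketch with a made-up vector, assuming such a model:

```python
import numpy as np

def truncate_embedding(vec, target_dims=512):
    # Keep the leading dimensions, then re-normalize to unit length so
    # cosine/dot-product scoring still behaves as expected downstream
    truncated = np.asarray(vec[:target_dims], dtype=np.float32)
    return truncated / np.linalg.norm(truncated)

# Stand-in for a 1536-dim embedding
full = np.random.default_rng(2).normal(size=1536)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 512)
print(small.shape)  # (512,) — one third the storage and compute per comparison
```

Truncating to a third of the dimensions cuts index memory and per-query work proportionally, which is often a better first lever than changing databases.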
Vector Database Selection Criteria
| Criteria | Pinecone | Weaviate | Milvus | Chroma | Qdrant |
|---|---|---|---|---|---|
| Deployment | Fully managed | Self-hosted + cloud | Self-hosted + Zilliz | Embedded + cloud | Self-hosted + cloud |
| Max vectors | Billions | Billions | Billions | Millions | Billions |
| Filtering | Metadata filtering | Hybrid (vector + BM25) | Attribute filtering | Metadata filtering | Payload filtering |
| Ease of use | Easiest | Good | Complex | Easiest | Good |
| Cost | Serverless (pay-per-use) | Free (self-hosted) | Free (self-hosted) | Free (self-hosted) | Free (self-hosted) |
| Best for | Production SaaS | Hybrid search | Large-scale enterprise | Prototyping, local dev | Performance-focused |
Embedding Dimension Trade-Offs
Higher dimensions capture more semantic nuance but increase storage and computation costs:
- 256-384 dimensions — Sufficient for simple similarity search, fastest queries, lowest cost
- 768-1024 dimensions — Good balance for most production use cases
- 1536-3072 dimensions — Maximum quality for complex retrieval, highest cost
- Quantization — Reduce storage 4x by converting float32 to int8 with less than 5 percent quality loss
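The 4x savings from float32 → int8 can be sketched as simple symmetric scalar quantization. Production engines add per-segment calibration and SIMD distance kernels; this minimal version just shows where the compression and the error come from:

```python
import numpy as np

def quantize_int8(vectors):
    # One scale per vector: map the largest absolute value to 127
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    q = np.round(vectors / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float32 values for distance computation
    return q.astype(np.float32) * scale

vecs = np.random.default_rng(3).normal(size=(100, 256)).astype(np.float32)
q, scale = quantize_int8(vecs)
err = np.abs(dequantize(q, scale) - vecs).max()

print(q.nbytes, vecs.nbytes)  # int8 storage is exactly 1/4 of float32
```

The reconstruction error is bounded by half a quantization step per value, which is why recall loss stays small as long as the value range per vector is well behaved.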
Checklist
- Embedding model selected and benchmarked against your data
- Chunking strategy tested with representative documents
- Vector database selected based on scale, ops capability, and budget
- Index parameters configured and tuned (M, ef_construction, ef_search)
- Metadata schema designed for filtering (categories, dates, sources)
- Retrieval quality measured (precision@k, recall@k, MRR)
- Backup and disaster recovery planned (index snapshots, embedding re-generation)
- Cost projections calculated for expected query volume and storage
- Monitoring in place (query latency p99, index size, error rates)
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI infrastructure consulting, visit garnetgrid.com.
:::