Vector Databases: Architecture & Selection Guide
Understand vector database internals and choose the right one. Covers embedding storage, ANN algorithms, and comparisons of Pinecone, Weaviate, Qdrant, Milvus, and pgvector.
Vector databases store and search high-dimensional embeddings — the numerical representations that AI models create from text, images, audio, and other unstructured data. They power semantic search, recommendation engines, RAG (Retrieval-Augmented Generation) systems, anomaly detection, and de-duplication. As AI adoption accelerates, vector databases have become critical infrastructure.
This guide covers how vector search works internally, algorithm trade-offs, database selection criteria, and production best practices that prevent performance and cost surprises.
How Vector Search Works
Step 1: Generate Embeddings
An embedding model converts unstructured data (text, images) into fixed-length numerical vectors. Similar content produces similar vectors, enabling mathematical similarity comparison.
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
    input="How do I optimize PostgreSQL for high-traffic applications?",
    model="text-embedding-3-small"
)
embedding = response.data[0].embedding  # [0.012, -0.045, ...] (1536 floats)
Step 2: Choose a Distance Metric
The distance metric determines how “similarity” is calculated between vectors.
| Metric | Best For | How It Works |
|---|---|---|
| Cosine Similarity | Text embeddings (normalized vectors) | Measures the angle between vectors — 1.0 = identical direction, 0.0 = unrelated, -1.0 = opposite |
| Euclidean (L2) | Image embeddings, spatial data | Measures straight-line distance — 0.0 = identical, higher = farther |
| Dot Product | When magnitude matters (popularity, relevance scoring) | Combines direction and magnitude — higher = more similar |
Rule of thumb: Use cosine similarity for text-based applications. Most embedding models produce normalized vectors where cosine similarity and dot product are equivalent.
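The normalized-vector equivalence can be checked directly. This is a toy numpy sketch (the vectors are made up for illustration; real embeddings would have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine = dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two toy "embeddings"
a = np.array([0.3, 0.4, 0.5])
b = np.array([0.2, 0.5, 0.4])

# Normalize to unit length, as most text-embedding models already do
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# For unit vectors, cosine similarity and dot product coincide
assert np.isclose(cosine_similarity(a, b), np.dot(a_n, b_n))
```

This is why vector databases can use the cheaper dot product internally when they know the stored vectors are normalized.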
ANN Algorithms Deep Dive
Exact nearest-neighbor search (brute force) is O(n) — too slow at scale. Approximate Nearest Neighbor (ANN) algorithms trade small accuracy losses for massive speed gains.
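For reference, exact (flat) search is just a full scan over every stored vector. A minimal numpy sketch with synthetic data shows what the ANN algorithms below are approximating:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 64))               # 10k vectors, 64 dims (toy scale)
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize for cosine scoring

query = db[42] + 0.01 * rng.normal(size=64)      # a query very close to vector 42
query /= np.linalg.norm(query)

# O(n) brute force: score every vector, then take the top-k
scores = db @ query                    # cosine similarity (unit vectors)
top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 best matches
print(top_k[0])                        # vector 42 should rank first
```

At 10k vectors this is instant; at hundreds of millions, the full scan per query is what makes ANN indexes necessary.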
HNSW (Hierarchical Navigable Small World)
The most popular algorithm. HNSW builds a multi-layer graph (like a skip list) where higher layers contain fewer nodes for fast traversal, and lower layers contain all nodes for precision.
Key Parameters:
| Parameter | What It Controls | Recommended Range | Trade-off |
|---|---|---|---|
| M | Max connections per node | 16-64 | Higher = better recall, more memory |
| ef_construction | Build-time search width | 100-400 | Higher = better graph quality, slower build |
| ef_search | Query-time search width | 50-200 | Higher = better recall, slower queries |
When to use: Most production workloads. Best balance of speed, accuracy, and memory.
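Because HNSW keeps full vectors plus graph links in memory, it pays to estimate RAM before committing. This back-of-the-envelope sketch is an approximation, not a library API — real implementations add per-node overhead that varies by engine:

```python
def hnsw_memory_estimate(n_vectors, dims, M=16,
                         bytes_per_float=4, bytes_per_link=4):
    # Raw float32 vectors held in memory
    vector_bytes = n_vectors * dims * bytes_per_float
    # Graph links: roughly 2*M neighbors per node on the base layer,
    # plus about M per node for the sparser upper layers
    link_bytes = n_vectors * (2 * M + M) * bytes_per_link
    return vector_bytes + link_bytes

# 10M vectors at 1536 dims with M=16
est = hnsw_memory_estimate(10_000_000, 1536, M=16)
print(f"~{est / 1e9:.0f} GB")  # vectors dominate: ~61 GB data + ~2 GB links
```

The takeaway: at OpenAI-scale dimensions, the vectors themselves dwarf the graph, so dimensionality reduction or quantization (covered later) moves the needle far more than tuning M.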
IVF (Inverted File Index)
Clusters vectors into partitions (Voronoi cells). At query time, searches only the closest clusters instead of the full dataset.
Key Parameters:
| Parameter | What It Controls | Recommended |
|---|---|---|
| nlist | Number of clusters | √n to 4×√n |
| nprobe | Clusters searched per query | 5-20% of nlist |
When to use: Very large datasets where memory is constrained. Often combined with Product Quantization (PQ) for compression.
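The IVF idea fits in a few lines of numpy. This sketch uses a random sample of the data as centroids, a crude stand-in for the k-means training step real IVF implementations run, and all names and sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.normal(size=(5_000, 32)).astype(np.float32)

# "Train": random data points stand in for k-means centroids
nlist = 64
centroids = db[rng.choice(len(db), nlist, replace=False)]

# Build the inverted lists: assign every vector to its nearest centroid
assign = np.argmin(np.linalg.norm(db[:, None] - centroids[None], axis=2), axis=1)

def ivf_search(query, nprobe=8, k=5):
    # Probe only the nprobe closest clusters instead of scanning everything
    probe = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.flatnonzero(np.isin(assign, probe))
    dists = np.linalg.norm(db[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

top = ivf_search(db[123])
print(top[0])  # 123 — the query's own cluster is always among the probes
```

The recall/latency trade-off is visible in the code: raising nprobe scans more candidates, which recovers neighbors that fell into adjacent cells at the cost of a larger linear scan.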
Algorithm Comparison
| Algorithm | Speed | Memory | Accuracy | Build Time |
|---|---|---|---|---|
| HNSW | Fastest (sub-ms) | High (full vectors in memory) | Highest (99%+ recall) | Slow |
| IVF-PQ | Fast | Low (compressed vectors) | Good (90-95% recall) | Medium |
| IVF-Flat | Medium | Medium | Very Good (97%+ recall) | Fast |
| Flat (exact) | Slow (O(n)) | Low | Perfect (100%) | None |
| ScaNN | Very Fast | Medium | Very High | Medium |
Database Comparison
| Feature | Pinecone | Weaviate | Qdrant | Milvus | pgvector |
|---|---|---|---|---|---|
| Hosting | Fully managed | Both (cloud + self-hosted) | Both | Both (self-hosted + Zilliz Cloud) | PostgreSQL extension |
| Algorithm | Proprietary (optimized) | HNSW | HNSW | HNSW + IVF + ScaNN | HNSW + IVF |
| Hybrid search | Yes (dense + sparse) | Yes (BM25 + vector) | Yes (sparse + dense) | Yes | With tsvector |
| Max scale | Billions (serverless) | Billions | Billions | Billions | Tens of millions |
| Filtering | Metadata filtering (pre/post) | GraphQL-style filters | Payload filtering | Attribute filtering | Standard SQL WHERE |
| Multi-tenancy | Namespaces | Multi-tenant classes | Collection-level | Partitions | Schema-level |
| Best for | Simplicity, serverless | Multi-modal (text + images) | Raw performance | Large-scale ML pipelines | Existing PostgreSQL installations |
| Pricing | Pay-per-query (serverless) | Open source + cloud | Open source + cloud | Open source + Zilliz | Free (PostgreSQL) |
Implementation: Pinecone (Managed)
from pinecone import Pinecone
pc = Pinecone(api_key="your-api-key")
index = pc.Index("knowledge-base")
# Upsert documents with metadata for filtering
index.upsert(vectors=[{
    "id": "doc-001",
    "values": embedding_vector,
    "metadata": {
        "source": "guide.pdf",
        "category": "cloud",
        "text": "How to optimize cloud costs...",
        "word_count": 1500,
        "published": "2025-01-15"
    }
}])
# Semantic search with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=10,
    filter={"category": {"$eq": "cloud"}},
    include_metadata=True
)
for match in results['matches']:
    print(f"Score: {match['score']:.3f} | {match['metadata']['text'][:80]}")
Implementation: pgvector (Self-Hosted)
If you already run PostgreSQL, pgvector avoids adding a new database to your infrastructure.
-- Enable the extension
CREATE EXTENSION vector;
-- Create table with vector column
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    category TEXT,
    embedding vector(1536),
    created_at TIMESTAMP DEFAULT NOW()
);
-- Create HNSW index for fast similarity search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
-- Semantic search with SQL filtering
SELECT id, title,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE category = 'cloud'
ORDER BY embedding <=> $1::vector
LIMIT 10;
pgvector Limitations
pgvector works well up to 5-10 million vectors. Beyond that, query latency increases and memory requirements grow quickly. If you expect to exceed this scale, plan a migration path to a dedicated vector database.
Selection Decision Tree
- Need managed, serverless, minimal ops? → Pinecone
- Already using PostgreSQL and < 5M vectors? → pgvector
- Need multi-modal (text + images + video)? → Weaviate
- Need max query performance, self-hosted? → Qdrant or Milvus
- Prototyping / local development? → Chroma or LanceDB
- Need hybrid search (keyword + semantic)? → Pinecone, Weaviate, or Qdrant
- Budget constrained, open-source preferred? → Qdrant or pgvector
Production Best Practices
Chunking Strategy
Chunking (splitting documents into smaller pieces before embedding) directly impacts retrieval quality. Too large and the embedding averages out important details. Too small and context is lost.
# Recommended: ~500 tokens with 50-token overlap
# Overlap prevents information loss at chunk boundaries
chunks = split_text(document, chunk_size=500, chunk_overlap=50)
| Content Type | Recommended Chunk Size | Overlap |
|---|---|---|
| Technical documentation | 400-600 tokens | 50-100 tokens |
| Legal contracts | 200-400 tokens | 50 tokens |
| Code files | Per function/class | 0 (natural boundaries) |
| FAQ/knowledge base | Per question-answer pair | 0 |
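The split_text call shown earlier is pseudocode. A minimal sketch follows, using whitespace-split words as a stand-in for tokens — a real pipeline would count tokens with the embedding model's tokenizer instead:

```python
def split_text(text, chunk_size=500, chunk_overlap=50):
    # Words stand in for tokens here; swap in a real tokenizer for production
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the tail of the document
    return chunks

# 1200 "tokens" → 3 overlapping windows: 0-499, 450-949, 900-1199
doc = " ".join(f"w{i}" for i in range(1200))
chunks = split_text(doc, chunk_size=500, chunk_overlap=50)
print(len(chunks))
```

Note how each window starts 50 tokens before the previous one ended, so a sentence straddling a boundary still appears whole in at least one chunk.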
Embedding Model Selection
| Model | Dimensions | Cost | Quality |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | $0.02/M tokens | Good |
| text-embedding-3-large (OpenAI) | 3072 | $0.13/M tokens | Better |
| Cohere embed-v3 | 1024 | $0.10/M tokens | Very Good |
| BGE-M3 (open source) | 1024 | Free (self-hosted) | Good |
| Voyage AI voyage-3 | 1024 | $0.06/M tokens | Excellent |
Cost Control
- Dimensionality reduction — OpenAI's embedding-3 models support a dimensions parameter to reduce vector size (e.g., 1536 → 512) with minimal quality loss
- Serverless vs provisioned — Use serverless (Pinecone) for variable workloads, provisioned for steady high throughput
- Batch embedding — Embed documents in batches, not one at a time, to reduce API calls
- Cache frequent queries — Cache embedding + results for repeated queries
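On the dimensionality-reduction point: models trained with Matryoshka-style objectives (OpenAI's embedding-3 family among them) tolerate truncating the vector to its leading dimensions, provided you re-normalize afterward. A sketch with a made-up vector, assuming such a model:

```python
import numpy as np

def truncate_embedding(vec, target_dims=512):
    # Keep the leading dimensions, then re-normalize to unit length so
    # cosine/dot-product scoring still behaves as expected downstream
    truncated = np.asarray(vec[:target_dims], dtype=np.float32)
    return truncated / np.linalg.norm(truncated)

# Stand-in for a 1536-dim embedding
full = np.random.default_rng(2).normal(size=1536)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 512)
print(small.shape)  # (512,) — one third the storage and compute per comparison
```

Truncating to a third of the dimensions cuts index memory and per-query work proportionally, which is often a better first lever than changing databases.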
Vector Database Selection Criteria
| Criteria | Pinecone | Weaviate | Milvus | Chroma | Qdrant |
|---|---|---|---|---|---|
| Deployment | Fully managed | Self-hosted + cloud | Self-hosted + Zilliz | Embedded + cloud | Self-hosted + cloud |
| Max vectors | Billions | Billions | Billions | Millions | Billions |
| Filtering | Metadata filtering | Hybrid (vector + BM25) | Attribute filtering | Metadata filtering | Payload filtering |
| Ease of use | Easiest | Good | Complex | Easiest | Good |
| Cost | Serverless (pay-per-use) | Free (self-hosted) | Free (self-hosted) | Free (self-hosted) | Free (self-hosted) |
| Best for | Production SaaS | Hybrid search | Large-scale enterprise | Prototyping, local dev | Performance-focused |
Embedding Dimension Trade-Offs
Higher dimensions capture more semantic nuance but increase storage and computation costs:
- 256-384 dimensions — Sufficient for simple similarity search, fastest queries, lowest cost
- 768-1024 dimensions — Good balance for most production use cases
- 1536-3072 dimensions — Maximum quality for complex retrieval, highest cost
- Quantization — Reduce storage 4x by converting float32 to int8 with less than 5 percent quality loss
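The 4x savings from float32 → int8 can be sketched as simple symmetric scalar quantization. Production engines add per-segment calibration and SIMD distance kernels; this minimal version just shows where the compression and the error come from:

```python
import numpy as np

def quantize_int8(vectors):
    # One scale per vector: map the largest absolute value to 127
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    q = np.round(vectors / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Reconstruct approximate float32 values for distance computation
    return q.astype(np.float32) * scale

vecs = np.random.default_rng(3).normal(size=(100, 256)).astype(np.float32)
q, scale = quantize_int8(vecs)
err = np.abs(dequantize(q, scale) - vecs).max()

print(q.nbytes, vecs.nbytes)  # int8 storage is exactly 1/4 of float32
```

The reconstruction error is bounded by half a quantization step per value, which is why recall loss stays small as long as the value range per vector is well behaved.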
Checklist
- Embedding model selected and benchmarked against your data
- Chunking strategy tested with representative documents
- Vector database selected based on scale, ops capability, and budget
- Index parameters configured and tuned (M, ef_construction, ef_search)
- Metadata schema designed for filtering (categories, dates, sources)
- Retrieval quality measured (precision@k, recall@k, MRR)
- Backup and disaster recovery planned (index snapshots, embedding re-generation)
- Cost projections calculated for expected query volume and storage
- Monitoring in place (query latency p99, index size, error rates)
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For AI infrastructure consulting, visit garnetgrid.com.
:::