Database Sharding Strategies

When a single database server cannot handle your write volume, query load, or data size, you shard — splitting data across multiple database instances. Each shard holds a subset of the data, and the application routes queries to the correct shard.

When to Shard

Consider sharding when:
  ✅ Single server cannot handle write throughput
  ✅ Data size exceeds single server storage (> 1-5 TB)
  ✅ Read replicas cannot keep up with query load
  ✅ Vertical scaling has reached practical limits

Do NOT shard if:
  ❌ Read replicas + caching can solve the problem
  ❌ Query optimization has not been exhausted
  ❌ Data fits on one server with headroom
  ❌ Team lacks operational experience for sharded systems

Sharding Strategies

Hash-Based Sharding

shard_id = hash(shard_key) % num_shards

Example: shard_key = user_id
  hash("user_123") % 4 = 2 → Shard 2
  hash("user_456") % 4 = 0 → Shard 0

Pros: Even distribution, simple routing
Cons: Adding shards requires rehashing (unless consistent hashing)

Range-Based Sharding

Shard 1: user_id    1 - 1,000,000
Shard 2: user_id    1,000,001 - 2,000,000
Shard 3: user_id    2,000,001 - 3,000,000

Pros: Range queries within a shard are efficient
Cons: Hot spots if access patterns are skewed (recent users are more active)

Directory-Based Sharding

Lookup Table:
  tenant_id "acme"     → Shard 1
  tenant_id "globex"   → Shard 2
  tenant_id "initech"  → Shard 1

Pros: Flexible, easy to rebalance
Cons: Lookup table is a single point of failure, additional latency

Shard Key Selection

The shard key determines everything. Choose poorly and you get hot spots, cross-shard queries, and operational pain.

Good Shard Keys	Why
tenant_id (multi-tenant)	All tenant data on one shard, no cross-shard queries
user_id	User’s data co-located, most queries are user-scoped
order_id (with prefix)	Even distribution, lookups are direct

Bad Shard Keys	Why
timestamp	All writes go to latest shard (hot spot)
country_code	US shard has 10x data of others
auto-increment ID	Sequential IDs create hot spots

Cross-Shard Queries

The biggest operational challenge:

-- Easy: Query within one shard (shard key in WHERE)
SELECT * FROM orders WHERE user_id = 123;
-- Routes to one shard

-- Hard: Query across all shards
SELECT COUNT(*) FROM orders WHERE created_at > '2026-01-01';
-- Must query ALL shards and aggregate results (scatter-gather)

-- Impossible efficiently: JOIN across shards
SELECT * FROM orders o
JOIN products p ON o.product_id = p.id
WHERE o.user_id = 123;
-- If orders and products are on different shards, this requires
-- fetching from multiple shards and joining in the application

Solutions

1. Denormalize: Copy frequently joined data into each shard
2. Application-level joins: Fetch from each shard, join in code
3. Co-locate related data: Same shard key for related tables
4. Materialized views: Pre-compute cross-shard aggregates

Rebalancing

Adding a new shard:
  Before: 4 shards, 250M rows each
  After:  5 shards, 200M rows each
  
  With hash-based: ~20% of data must move
  With consistent hashing: only 1/5 of data moves
  With directory-based: move tenants one at a time

Anti-Patterns

Anti-Pattern	Consequence	Fix
Sharding too early	Unnecessary complexity	Scale vertically + read replicas first
Wrong shard key	Hot spots, cross-shard queries	Analyze query patterns before choosing key
Cross-shard transactions	Distributed transaction overhead, failures	Design to avoid, or use saga pattern
No shard-aware ORM/driver	Application manages routing manually	Use shard-aware middleware (Vitess, Citus)
Uneven shard sizes	One shard overwhelmed	Monitor shard sizes, rebalance proactively

Sharding is the last resort for scaling, not the first. It permanently increases the complexity of every query, every migration, and every operational task. Make sure you have exhausted simpler options before you shard.