Database Sharding
Scale databases horizontally by distributing data across multiple shards. Covers sharding strategies, shard key selection, cross-shard queries, rebalancing, and the patterns that manage the complexity of distributed data.
Sharding splits a single database into multiple smaller databases (shards), each holding a subset of the data. When a single database server cannot handle the read/write load or data volume, sharding distributes the burden across multiple servers. It scales writes and storage beyond what one server can offer, but it is also among the most complex scaling techniques to operate.
When to Shard
You probably DON'T need sharding if:
- Single server handles the load (most applications)
- Read replicas solve read scaling
- Vertical scaling (bigger server) is viable
- Partitioning within a single database suffices
You NEED sharding when:
- Write volume exceeds single server capacity
- Data volume exceeds single server storage
- Latency on a single server stays unacceptable after indexing, caching, and query tuning
- Regulations require data to stay in specific regions (data residency)
Sharding Strategies
Hash-Based Sharding:
shard_id = hash(shard_key) % num_shards
Pro: Even distribution, simple to implement
Con: Range queries require scatter-gather, and changing num_shards remaps most keys
Best for: User data, session data, random access patterns
Example: user_id 12345 → hash(12345) % 4 = shard_2
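A minimal sketch of the hash-and-modulo routing described above. The function name `shard_for` is hypothetical, and SHA-256 is just one reasonable choice of stable hash. Python's built-in `hash()` is avoided because string hashes are randomized per process, which would route the same key to different shards after a restart.

```python
import hashlib


def shard_for(shard_key, num_shards):
    """Map a shard key to a shard index in [0, num_shards) with a stable hash."""
    digest = hashlib.sha256(str(shard_key).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the modulo.
    return int.from_bytes(digest[:8], "big") % num_shards


# Count how 10,000 sequential user IDs spread across 4 shards.
counts = [0] * 4
for user_id in range(10_000):
    counts[shard_for(user_id, 4)] += 1
print(counts)
```

Sequential IDs end up spread roughly evenly, which is the main appeal of this strategy. The index returned is zero-based, so mapping it to a name such as shard_2 is left to the routing layer.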
Range-Based Sharding:
shard_1: users A-G
shard_2: users H-N
shard_3: users O-U
shard_4: users V-Z
Pro: Range queries on shard key are efficient
Con: Hot spots (if some ranges are busier)
Best for: Time-series data, geographic data
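The A-G / H-N / O-U / V-Z split above could be routed with a sorted list of boundaries and a binary search. This is only a sketch: `UPPER_BOUNDS`, `shard_for_name`, and `shards_for_range` are hypothetical names, and routing on the first letter alone is a simplification.

```python
import bisect

# Exclusive upper bounds for the first three shards; shard_4 takes the rest.
UPPER_BOUNDS = ["H", "O", "V"]
SHARDS = ["shard_1", "shard_2", "shard_3", "shard_4"]


def shard_for_name(name):
    """Return the shard whose range contains the name's first letter."""
    first = name[:1].upper()
    return SHARDS[bisect.bisect_right(UPPER_BOUNDS, first)]


def shards_for_range(low, high):
    """Return only the shards that a range query from low to high must visit."""
    start = bisect.bisect_right(UPPER_BOUNDS, low[:1].upper())
    end = bisect.bisect_right(UPPER_BOUNDS, high[:1].upper())
    return SHARDS[start:end + 1]


print(shard_for_name("Hank"))
print(shards_for_range("C", "K"))
```

A range query from C to K touches two shards instead of all four, which is the efficiency this strategy buys. The cost is that a busy range still lands on a single shard.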
Directory-Based Sharding:
Lookup table maps each key to a shard
Pro: Flexible placement, easy rebalancing
Con: Lookup table adds a hop to every query and becomes a single point of failure unless it is replicated
Best for: Tenant-based multi-tenant systems
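As an illustration of the lookup-table idea, the sketch below keeps the directory in a plain dictionary. The class `ShardDirectory` and its methods are hypothetical. A real deployment would presumably keep this table in a replicated store and cache it, precisely because of the single-point-of-failure concern noted above.

```python
class ShardDirectory:
    """In-memory lookup table mapping tenant IDs to shard names."""

    def __init__(self, default_shard):
        self._placement = {}
        self._default_shard = default_shard

    def assign(self, tenant_id, shard):
        """Place (or move) a tenant on a specific shard."""
        self._placement[tenant_id] = shard

    def lookup(self, tenant_id):
        """Return the tenant's shard, falling back to the default shard."""
        return self._placement.get(tenant_id, self._default_shard)


directory = ShardDirectory(default_shard="shard_1")
directory.assign("acme", "shard_3")   # give a large tenant its own shard
print(directory.lookup("acme"))
print(directory.lookup("small-co"))   # unassigned tenants use the default
```

Rebalancing in this model is a data copy followed by a single `assign` call, which is why placement is so flexible.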
Shard Key Selection
CRITICAL DECISION: The shard key determines how data is distributed, which queries stay on one shard, and how hard rebalancing will be.
Good shard key properties:
☐ High cardinality (many distinct values)
☐ Even distribution (no hot spots)
☐ Stable (doesn't change for a record)
☐ Present in most queries (avoids cross-shard queries)
Examples:
Good: user_id (for user-centric applications)
Good: tenant_id (for multi-tenant SaaS)
Situational: region (fits data-residency requirements, but low cardinality can create hot spots)
Bad: status (low cardinality: 3-5 values)
Bad: created_date (all writes go to "today" shard)
Bad: email (queries often don't include email)
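One way to check the first two properties before committing to a key is to profile a sample of rows. The helper `key_profile` and the sample data below are hypothetical; the sketch only measures cardinality and the share of rows held by the most common value, not query patterns.

```python
from collections import Counter


def key_profile(rows, key):
    """Return (cardinality, share of rows held by the most common value)."""
    counts = Counter(row[key] for row in rows)
    top_share = max(counts.values()) / len(rows)
    return len(counts), top_share


# Made-up sample: 1,000 users, one in ten suspended.
rows = [
    {"user_id": i, "status": "active" if i % 10 else "suspended"}
    for i in range(1000)
]

print(key_profile(rows, "user_id"))  # many distinct values, tiny top share
print(key_profile(rows, "status"))   # two distinct values, one dominates
```

A candidate with few distinct values or a large top share is likely to produce hot spots regardless of the sharding strategy.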
Cross-Shard Queries
# The hardest part of sharding: queries that span shards
# Query on shard key (fast):
# "Get user 12345" → goes to exactly 1 shard
def get_user(user_id):
    # get_shard() is the routing function, e.g. hash(user_id) % num_shards
    shard = get_shard(user_id)
    return shard.query("SELECT * FROM users WHERE id = %s", (user_id,))
# Cross-shard query (expensive):
# "Get all users who signed up last month"
def get_recent_signups(since_date):
    results = []
    for shard in all_shards:  # one query per shard, run sequentially
        results.extend(
            shard.query("SELECT * FROM users WHERE created_at > %s", (since_date,))
        )
    return sorted(results, key=lambda u: u.created_at)
# Problem: N shard queries, with the merge and sort done in the application
# Cross-shard JOIN (very expensive):
# "Get all orders for users in New York"
# → Scatter query to all user shards for NY users
# → Gather user IDs
# → Scatter to all order shards for those user IDs
# → Merge results
# This is why shard key selection is critical
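The scatter-gather pattern above can be made less painful by querying shards in parallel and merging results that each shard has already sorted. The sketch below is one possible shape, not a definitive implementation: `scatter_gather`, `fetch`, `sort_key`, and `limit` are hypothetical names, and the shards are stand-in Python lists rather than database connections.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def scatter_gather(shards, fetch, sort_key, limit=None):
    """Run fetch(shard) on every shard in parallel and merge the results.

    Each fetch(shard) call must return rows already sorted by sort_key
    (for example via ORDER BY), so the merge is a cheap k-way merge.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = list(pool.map(fetch, shards))
    merged = heapq.merge(*per_shard, key=sort_key)
    # islice with limit=None yields every row.
    return list(islice(merged, limit))


# Stand-in shards: each list plays the role of one shard's sorted result set.
fake_shards = [
    [{"id": 1, "created_at": 3}, {"id": 4, "created_at": 9}],
    [{"id": 2, "created_at": 5}],
    [],
]
rows = scatter_gather(
    fake_shards,
    fetch=lambda shard: shard,
    sort_key=lambda row: row["created_at"],
    limit=2,
)
print([row["id"] for row in rows])
```

Parallel fan-out brings latency down to roughly that of the slowest shard, but the total work still grows with the number of shards, so queries on the shard key remain far cheaper.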
Rebalancing
Why rebalance:
- Shards grow unevenly over time
- Adding new shards to handle growth
- Removing shards to save cost
Techniques:
1. Virtual shards (consistent hashing)
- 256 virtual shards mapped to 4 physical shards (64 each)
- Add a fifth server: move about 51 virtual shards to it (12-13 from each existing server)
- Minimal data movement: only about one fifth of the data moves
2. Online resharding
- Double-write: write to old and new shard simultaneously
- Backfill: copy historical data to new shard
- Verify: compare old and new shard data
- Switch: update routing to new shard once the data matches
- Cleanup: decommission old shard
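To make the virtual-shard technique concrete, the sketch below rebalances a map of 256 virtual shards when a fifth server joins. It assumes keys are mapped to a virtual shard once (for example by a stable hash modulo 256) and that only the virtual-to-physical map changes. The function `add_server` and the server names are hypothetical. An even split of 256 virtual shards across five servers is about 51 each, so that is how many the new server receives.

```python
from collections import Counter


def add_server(assignment, new_server):
    """Return a new virtual-shard -> server map that includes new_server.

    Moves only the new server's fair share, taking one virtual shard at a
    time from whichever existing server currently holds the most.
    """
    result = dict(assignment)                 # leave the input untouched
    load = Counter(result.values())           # existing servers only
    target = len(result) // (len(load) + 1)   # fair share for the new server
    by_server = {}
    for vshard, server in sorted(result.items()):
        by_server.setdefault(server, []).append(vshard)
    for _ in range(target):
        # Most-loaded server donates next; ties are broken by name.
        donor = max(load, key=lambda s: (load[s], s))
        result[by_server[donor].pop()] = new_server
        load[donor] -= 1
    return result


before = {v: f"db{v % 4 + 1}" for v in range(256)}
after = add_server(before, "db5")
moved = [v for v in before if before[v] != after[v]]
print(len(moved), Counter(after.values()))
```

Only the virtual shards in `moved` need their data copied. Every other key keeps resolving to the same server, which is what keeps data movement small compared with changing the modulo in plain hash-based sharding.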
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Shard too early | Unnecessary complexity | Exhaust vertical scaling first |
| Bad shard key | Hot spots, cross-shard queries | Analyze query patterns before choosing |
| Monotonic shard key (timestamp) | All writes to one shard | Hash-based or compound key |
| Cross-shard transactions | Performance, consistency nightmare | Design to avoid cross-shard txns |
| No rebalancing plan | Shards grow unevenly | Virtual shards, consistent hashing |
Sharding is close to a one-way door. Once your data is sharded, merging it back into a single database is a costly, risky migration. Choose your shard key wisely, exhaust simpler alternatives first, and design your application queries around the shard key.