Database Sharding | The Garnet Wiki

Sharding splits a single database into multiple smaller databases (shards), each holding a subset of the data. When a single database server cannot handle the read/write load or data volume, sharding distributes the burden across multiple servers. It is one of the most powerful scaling techniques for databases, and also one of the most complex to operate.

When to Shard

You probably DON'T need sharding if:
  - Single server handles the load (most applications)
  - Read replicas solve read scaling
  - Vertical scaling (bigger server) is viable
  - Partitioning within a single database suffices

You NEED sharding when:
  - Write volume exceeds single server capacity
  - Data volume exceeds single server storage
  - Single-server latency is unacceptable
  - Regulations require data to stay in a specific region (data residency)

Sharding Strategies

Hash-Based Sharding:
  shard_id = hash(shard_key) % num_shards
  
  Pro: Even distribution, simple to implement
  Con: Range queries require scatter-gather; changing num_shards remaps most keys
  Best for: User data, session data, random access patterns
  
  Example: user_id 12345 → hash(12345) % 4 = 2 → shard_2 (the actual result depends on the hash function)
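
The formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names shard_for and NUM_SHARDS are hypothetical, and MD5 is chosen only because it is a stable, widely available digest. A stable digest matters because Python's built-in hash() is randomized per process for strings, so it could route the same key to a different shard after a restart.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, matching the example above


def shard_for(shard_key, num_shards=NUM_SHARDS):
    """Map a shard key to a shard index in the range [0, num_shards)."""
    # Hash the string form of the key with a digest that is stable
    # across processes and machines.
    digest = hashlib.md5(str(shard_key).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then take the modulo.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Calling shard_for(12345) always returns the same index on every machine, which is the property the routing layer depends on.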

Range-Based Sharding:
  shard_1: users A-G
  shard_2: users H-N
  shard_3: users O-U
  shard_4: users V-Z
  
  Pro: Range queries on shard key are efficient
  Con: Hot spots (if some ranges are busier)
  Best for: Time-series data, geographic data
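
A minimal sketch of range-based routing, assuming the four letter ranges listed above and routing on the first letter of the username. The names UPPER_BOUNDS, SHARD_NAMES, shard_for_name, and shards_for_range are hypothetical. It also shows why range queries on the shard key are cheap: a range of keys maps to a contiguous run of shards.

```python
import bisect

# Upper-bound letter of each range from the example above: A-G, H-N, O-U, V-Z.
UPPER_BOUNDS = ["G", "N", "U", "Z"]
SHARD_NAMES = ["shard_1", "shard_2", "shard_3", "shard_4"]


def _range_index(letter):
    """Index of the range containing an ASCII letter."""
    letter = letter.upper()
    if not "A" <= letter <= "Z":
        raise ValueError(f"unsupported character: {letter!r}")
    # bisect_left finds the first range whose upper bound is >= letter.
    return bisect.bisect_left(UPPER_BOUNDS, letter)


def shard_for_name(username):
    """Return the shard whose range contains the username's first letter."""
    return SHARD_NAMES[_range_index(username[0])]


def shards_for_range(first_letter, last_letter):
    """Return only the shards that a first-letter range query must touch."""
    lo = _range_index(first_letter)
    hi = _range_index(last_letter)
    return SHARD_NAMES[lo:hi + 1]
```

A query for usernames starting with F through J touches only shard_1 and shard_2 instead of all four shards.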
  
Directory-Based Sharding:
  Lookup table maps each key to a shard
  
  Pro: Flexible placement, easy rebalancing
  Con: Lookup adds a hop to every query; the table is a single point of failure unless replicated
  Best for: Multi-tenant systems (one directory entry per tenant)
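
A minimal sketch of a shard directory, assuming tenant IDs as keys and an in-memory dict as the lookup table. The class ShardDirectory and its methods are hypothetical; a real directory would normally live in a replicated store so that it does not become the single point of failure noted above.

```python
class ShardDirectory:
    """In-memory lookup table from tenant ID to shard name (illustrative only)."""

    def __init__(self, default_shard):
        self._table = {}
        self._default = default_shard

    def assign(self, tenant_id, shard):
        """Place a tenant on a shard, or move it by assigning it again."""
        self._table[tenant_id] = shard

    def lookup(self, tenant_id):
        """Return the tenant's shard, falling back to the default shard."""
        return self._table.get(tenant_id, self._default)
```

Rebalancing is flexible because moving a tenant only changes one directory entry; copying the tenant's rows to the new shard is a separate step that this sketch does not cover.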

Shard Key Selection

CRITICAL DECISION: The shard key determines how data is distributed, which queries stay on one shard, and how hard rebalancing will be.

Good shard key properties:
  ☐ High cardinality (many distinct values)
  ☐ Even distribution (no hot spots)
  ☐ Stable (doesn't change for a record)
  ☐ Present in most queries (avoids cross-shard queries)

Examples:
  Good: user_id (for user-centric applications)
  Good: tenant_id (for multi-tenant SaaS)
  Good: region (for data-residency requirements; cardinality is low, so watch for hot regions)
  
  Bad: status (low cardinality: 3-5 values)
  Bad: created_date (all new writes go to the shard that holds today's date)
  Bad: email (queries often don't include email)
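
One way to check the "even distribution" property before committing to a key is to hash a sample of records by each candidate key and measure how full the busiest shard gets. The sketch below assumes records are dicts and uses SHA-256 as a stable hash; the names stable_shard and largest_shard_share are hypothetical.

```python
import hashlib
from collections import Counter


def stable_shard(value, num_shards):
    """Hash a value to a shard index with a process-independent digest."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def largest_shard_share(records, key, num_shards=4):
    """Fraction of records that land on the busiest shard for a candidate key."""
    counts = Counter(stable_shard(r[key], num_shards) for r in records)
    return max(counts.values()) / len(records)


# Synthetic sample (an assumption for illustration): 1000 users, 90% "active".
sample = [
    {"user_id": i, "status": "active" if i % 10 else "inactive"}
    for i in range(1000)
]
```

With four shards, a perfectly even key gives a share close to 0.25. On this sample, largest_shard_share(sample, "user_id") stays near that value, while largest_shard_share(sample, "status") is at least 0.9 because every "active" record hashes to the same shard.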

Cross-Shard Queries

# The hardest part of sharding: queries that span shards

# get_shard() and all_shards are assumed to be provided by the routing layer.

# Query on shard key (fast):
# "Get user 12345" → goes to exactly 1 shard
def get_user(user_id):
    shard = get_shard(user_id)
    # Pass parameters as a tuple so the driver handles escaping
    return shard.query("SELECT * FROM users WHERE id = %s", (user_id,))

# Cross-shard query (expensive):
# "Get all users who signed up last month"
def get_recent_signups(since_date):
    results = []
    for shard in all_shards:  # one query per shard, run one after another
        results.extend(
            shard.query("SELECT * FROM users WHERE created_at > %s", (since_date,))
        )
    # Problem: N shard queries, and the merge and sort happen in the application
    return sorted(results, key=lambda u: u.created_at)

# Cross-shard JOIN (very expensive):
# "Get all orders for users in New York"
# → Scatter query to all user shards for NY users
# → Gather user IDs
# → Scatter to all order shards for those user IDs
# → Merge results
# This is why shard key selection is critical
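
The scatter-gather pattern described above can be made somewhat cheaper by querying the shards in parallel and merging results that each shard has already sorted. The sketch below is one possible shape, not the only one: InMemoryShard is a hypothetical stand-in for a real shard connection, and rows are assumed to be dicts with a comparable created_at value.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from operator import itemgetter


class InMemoryShard:
    """Hypothetical stand-in for a shard connection; holds rows as dicts."""

    def __init__(self, rows):
        self._rows = rows

    def signups_since(self, since_date):
        # Stands in for:
        # SELECT * FROM users WHERE created_at > %s ORDER BY created_at
        matching = [r for r in self._rows if r["created_at"] > since_date]
        return sorted(matching, key=itemgetter("created_at"))


def get_recent_signups(shards, since_date):
    """Scatter the query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = list(pool.map(lambda s: s.signups_since(since_date), shards))
    # Each per-shard list is already sorted, so a k-way merge
    # avoids re-sorting the combined result.
    return list(heapq.merge(*per_shard, key=itemgetter("created_at")))
```

Parallelism reduces wall-clock latency to roughly that of the slowest shard, but the total work is still one query per shard, so the query remains expensive compared with a lookup on the shard key.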

Rebalancing

Why rebalance:
  - Shards grow unevenly over time
  - Adding new shards to handle growth
  - Removing shards to save cost

Techniques:
  1. Virtual shards (consistent hashing)
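     - 256 virtual shards mapped to 4 physical servers (64 each)
     - Add a fifth server: move about 51 virtual shards to it (roughly 13 from each existing server)
     - Minimal data movement: only about one fifth of the data moves, and keys never change virtual shard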
  
  2. Online resharding
     - Double-write: write to both the old and the new shard
     - Backfill: copy historical data to the new shard
     - Verify: compare old and new shard data
     - Switch: update routing to the new shard once the data matches
     - Cleanup: stop double-writing and decommission the old shard
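
The virtual-shard technique (item 1) can be sketched as a table from virtual shard ID to physical server. This is a simplified illustration under stated assumptions: the function names are hypothetical, virtual shards are spread round-robin at the start, and a balanced layout after adding a fifth server gives each server about 256 // 5 = 51 virtual shards.

```python
from collections import Counter

NUM_VIRTUAL = 256  # number of virtual shards, from the example above


def initial_assignment(servers):
    """Spread virtual shards round-robin over the given physical servers."""
    return {v: servers[v % len(servers)] for v in range(NUM_VIRTUAL)}


def add_server(assignment, new_server):
    """Move just enough virtual shards to new_server to even out the load.

    Mutates assignment in place and returns the moved virtual shard IDs.
    """
    servers = set(assignment.values()) | {new_server}
    target = len(assignment) // len(servers)  # floor of the fair share
    load = Counter(assignment.values())
    moved = []
    for vshard in sorted(assignment):
        if len(moved) == target:
            break
        owner = assignment[vshard]
        # Only take from servers that still hold more than the fair share.
        if load[owner] > target:
            assignment[vshard] = new_server
            load[owner] -= 1
            moved.append(vshard)
    return moved
```

Keys are still hashed to one of the 256 virtual shards and that mapping never changes; only the virtual-to-physical table is updated, so the data that moves is limited to the virtual shards returned by add_server.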

Anti-Patterns

Anti-Pattern	Consequence	Fix
Shard too early	Unnecessary complexity	Exhaust vertical scaling first
Bad shard key	Hot spots, cross-shard queries	Analyze query patterns before choosing
Monotonic shard key (timestamp)	All writes to one shard	Hash-based or compound key
Cross-shard transactions	Slow and hard to keep consistent	Design to avoid cross-shard transactions
No rebalancing plan	Shards grow unevenly	Virtual shards, consistent hashing

Sharding is close to a one-way door. Once your data is sharded, merging it back into a single database is costly and rarely practical. Choose your shard key wisely, exhaust simpler alternatives first, and design your application queries around the shard key.