Load Balancing Strategies: Beyond Round Robin
Choose and configure load balancing algorithms that match your workload characteristics. Covers L4 vs L7 balancing, health checks, connection draining, session affinity, and the failure modes that take down your entire service.
Load balancing looks simple from the outside: distribute traffic across multiple servers. In practice, the choice of algorithm, health check configuration, and failover behavior determine whether your service degrades gracefully under pressure or falls off a cliff.
Most teams use the default round-robin algorithm, never look at it again, and then wonder why one server is at 95% CPU while three others are at 20%. This guide covers how to choose the right balancing strategy and configure it to handle the failure modes that actually happen in production.
L4 vs L7: Where to Balance
| Layer | What It Sees | Use When | Examples |
|---|---|---|---|
| L4 (Transport) | IP addresses, TCP/UDP ports | High throughput, simple routing, non-HTTP protocols | NLB, HAProxy (TCP mode) |
| L7 (Application) | HTTP headers, URL paths, cookies | Content-based routing, SSL termination, API gateway | ALB, HAProxy (HTTP mode), Envoy, Nginx |
L4 load balancer:

```
Sees:       TCP connection from 10.0.1.50:52441 → 10.0.2.10:443
Decides:    Route to backend 3 (based on IP hash or round robin)
Cannot see: HTTP method, URL path, headers, cookies
```

L7 load balancer:

```
Sees:    GET /api/v2/orders HTTP/1.1
         Host: api.example.com
         Cookie: session=abc123
Decides: Route /api/v2/* to API cluster, /static/* to CDN,
         session=abc123 → backend 2 (sticky session)
```
General rule: Use L7 for web applications and APIs (you almost always need content-based routing). Use L4 for databases, message queues, and high-throughput non-HTTP services.
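The content-based routing decision an L7 balancer makes can be sketched in a few lines. The path prefixes and pool names below are illustrative stand-ins, not any particular balancer's configuration:

```python
# Sketch of L7 path-based routing: longest matching prefix wins,
# the way rules resolve in balancers like HAProxy or Envoy.
ROUTES = [
    ("/api/", "api-cluster"),
    ("/static/", "cdn"),
]
DEFAULT_POOL = "web-cluster"

def route(path: str) -> str:
    """Return the backend pool for a request path (longest prefix wins)."""
    for prefix, pool in sorted(ROUTES, key=lambda r: -len(r[0])):
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL

print(route("/api/v2/orders"))  # api-cluster
print(route("/static/app.js"))  # cdn
print(route("/index.html"))     # web-cluster
```

An L4 balancer cannot make this decision at all: by the time the path is visible, it has already picked a backend.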
Load Balancing Algorithms
| Algorithm | How It Works | Best For | Weakness |
|---|---|---|---|
| Round Robin | Each request goes to the next server in sequence | Homogeneous backends, stateless requests | Ignores server load and request cost |
| Weighted Round Robin | Round robin with weight per server | Mixed capacity servers (old + new hardware) | Weights must be manually set |
| Least Connections | Route to the server with fewest active connections | Requests with variable processing time | Can overload recovering servers |
| Least Response Time | Route to the fastest-responding server | Latency-sensitive applications | Requires response time measurement |
| IP Hash | Hash client IP to determine server | Rough session affinity without cookies | Uneven distribution with NAT |
| Random | Pick a random server | Simple, surprisingly effective | No load awareness |
| Power of Two Choices | Pick 2 random servers, use the less loaded one | Large clusters, combines simplicity with load awareness | Slightly more complex |
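Power of two choices is simple enough to sketch directly. A minimal simulation (illustrative only, not any specific balancer's implementation) shows how sampling just two backends keeps load tight across a cluster:

```python
import random

def pick_backend(loads: list[int], rng: random.Random) -> int:
    """Power of two choices: sample two distinct backends at random
    and route to whichever currently has less load."""
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

# Simulate 10,000 requests across 8 backends.
rng = random.Random(42)
loads = [0] * 8
for _ in range(10_000):
    loads[pick_backend(loads, rng)] += 1

print(loads)  # counts stay close to the ideal 1250 each
```

The second random sample is the whole trick: purely random assignment lets imbalances grow, while comparing just two candidates pulls traffic toward less-loaded servers without requiring global load knowledge.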
When Round Robin Fails
```
Scenario: 4 backend servers, one request type takes 10x longer

Round Robin distribution:
  Server 1: light, light, HEAVY, light  → 10% CPU, then spikes to 90%
  Server 2: light, light, light, light  → 15% CPU
  Server 3: light, HEAVY, light, HEAVY  → 85% CPU (overloaded!)
  Server 4: light, light, light, light  → 15% CPU

Least Connections distribution:
  Server 1: light, light, light, light  → 20% CPU
  Server 2: light, light, HEAVY         → 45% CPU
  Server 3: light, light, light         → 20% CPU
  Server 4: light, HEAVY, light         → 45% CPU
  (Servers 2 and 4 receive fewer new connections while processing heavy requests)
```

Least connections balances better because heavy requests "protect" the server from receiving more traffic while they are processing.
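This effect is easy to demonstrate. The simulation below uses total assigned work as a stand-in for active connection count (a simplification of true least-connections, which tracks in-flight requests over time); the request mix mirrors the scenario above:

```python
import random

def simulate(assign, costs, n_servers=4):
    """Assign each request's cost to a server; return per-server totals."""
    load = [0] * n_servers
    for i, cost in enumerate(costs):
        load[assign(i, load)] += cost
    return load

# Roughly 1 in 8 requests is 10x heavier, as in the scenario above.
rng = random.Random(1)
costs = [10 if rng.random() < 0.125 else 1 for _ in range(1000)]

round_robin  = lambda i, load: i % len(load)
least_loaded = lambda i, load: load.index(min(load))

rr = simulate(round_robin, costs)
lc = simulate(least_loaded, costs)
print("round robin spread: ", max(rr) - min(rr))
print("least loaded spread:", max(lc) - min(lc))
```

With the load-aware strategy, the gap between the busiest and idlest server is bounded by the cost of a single heavy request; round robin's gap grows with however unevenly the heavy requests happen to land in the rotation.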
Health Checks: The First Line of Defense
A bad health check configuration is worse than no health checks at all. Too aggressive and you flap servers in and out. Too lenient and you route traffic to dead servers for minutes.
```yaml
# Good health check configuration
health_check:
  protocol: HTTP
  path: /health
  port: 8080
  interval: 10s           # Check every 10 seconds
  timeout: 5s             # Give up after 5 seconds
  healthy_threshold: 2    # Mark healthy after 2 consecutive successes
  unhealthy_threshold: 3  # Mark unhealthy after 3 consecutive failures

# What /health should check:
# ✅ Application is started and accepting connections
# ✅ Critical dependencies are reachable (database, cache)
# ❌ NOT full business logic validation (too slow, too flaky)
# ❌ NOT external third-party APIs (their outage ≠ your server is unhealthy)
```
| Health Check Mistake | What Happens | Fix |
|---|---|---|
| No health check | Traffic routes to crashed servers | Always configure health checks |
| Checks external dependency | One API goes down, ALL your servers marked unhealthy | Only check critical internal dependencies |
| 1-failure threshold | Healthy server marked down from one timeout | Use threshold ≥ 3 |
| Slow health endpoint | Health check times out because it runs a full DB query | Health endpoint should respond in < 100ms |
Deep vs Shallow Health Checks
```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# `db` and `redis` are the application's own clients, initialized at startup.

# Shallow health check (fast, always works)
@app.get("/health")
def health():
    return {"status": "ok"}

# Deep health check (verifies dependencies)
@app.get("/health/ready")
async def readiness():
    checks = {}

    # Database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "failed"

    # Redis cache
    try:
        await redis.ping()
        checks["cache"] = "ok"
    except Exception:
        checks["cache"] = "failed"

    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503
    return JSONResponse({"checks": checks}, status_code=status_code)
```
Use shallow checks for liveness probes (is the process running?). Use deep checks for readiness probes (is the service ready to handle traffic?).
Connection Draining
When a server is taken out of rotation (for deployment, scaling, or failure), existing connections should be allowed to finish rather than being killed mid-request.
```yaml
# Connection draining configuration
deregistration_delay: 30s  # Allow 30 seconds for in-flight requests

# During this window:
# - No NEW connections routed to this server
# - Existing connections continue until complete or timeout
# - After 30s: remaining connections are forcefully closed
```
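On the application side, draining amounts to an in-flight counter and a deadline. The class below is a hypothetical sketch of that shutdown logic (real services would wire `drain` to SIGTERM and real request handlers to the counter):

```python
import threading
import time

class DrainableServer:
    """Sketch of server-side connection draining: once draining starts,
    new requests are refused while in-flight ones finish, up to a
    deadline mirroring the balancer's deregistration_delay."""

    def __init__(self):
        self.in_flight = 0
        self.draining = False
        self.lock = threading.Lock()

    def try_start_request(self) -> bool:
        with self.lock:
            if self.draining:
                return False  # the balancer should no longer send these
            self.in_flight += 1
            return True

    def finish_request(self):
        with self.lock:
            self.in_flight -= 1

    def drain(self, timeout_s: float) -> bool:
        """Return True if all in-flight requests completed in time."""
        self.draining = True
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self.lock:
                if self.in_flight == 0:
                    return True
            time.sleep(0.01)
        return False  # remaining connections get forcefully closed
```

The timeout matters: without it, one stuck connection keeps the server in rotation-limbo indefinitely during deployments.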
| Drain Duration | Suitable For |
|---|---|
| 5-10 seconds | Fast APIs (< 1s response time) |
| 30 seconds | Standard web applications |
| 60-120 seconds | Long-running requests (file uploads, reports) |
| 300 seconds | WebSocket connections |
Session Affinity (Sticky Sessions)
| Method | How It Works | Tradeoff |
|---|---|---|
| Cookie-based | Load balancer sets a cookie mapping to a backend | Most reliable, requires L7 |
| IP-based | Hash source IP to determine backend | Breaks with NAT/proxies |
| No affinity | Any server handles any request | Requires externalized session state (Redis) |
Strong recommendation: Avoid sticky sessions. Externalize session state to Redis or a database. Sticky sessions create uneven load distribution and make deployments harder (you cannot drain a server if 30% of sessions are pinned to it).
Implementation Checklist
- Choose L4 or L7 based on your routing needs (L7 for web/API, L4 for non-HTTP)
- Select algorithm based on workload: least connections for variable-cost requests
- Configure health checks with ≥ 3 failure threshold and < 100ms response time
- Separate liveness (shallow) and readiness (deep) health checks
- Set connection draining to at least 30 seconds on all services
- Avoid sticky sessions — externalize session state to Redis
- Monitor backend health: track which servers are going in and out of rotation
- Test failover: pull a server out of rotation and verify zero dropped requests
- Set up alerts for: high error rates, uneven load distribution, health check failures
- Document your load balancing setup: algorithm, health check config, drain settings