Load Balancing Strategies: Beyond Round Robin
Choose and configure load balancing algorithms that match your workload characteristics. Covers L4 vs L7 balancing, health checks, connection draining, session affinity, and the failure modes that take down your entire service.
Load balancing looks simple from the outside: distribute traffic across multiple servers. In practice, the choice of algorithm, health check configuration, and failover behavior determine whether your service degrades gracefully under pressure or falls off a cliff.
Most teams use the default round-robin algorithm, never look at it again, and then wonder why one server is at 95% CPU while three others are at 20%. This guide covers how to choose the right balancing strategy and configure it to handle the failure modes that actually happen in production.
L4 vs L7: Where to Balance
| Layer | What It Sees | Use When | Examples |
|---|---|---|---|
| L4 (Transport) | IP addresses, TCP/UDP ports | High throughput, simple routing, non-HTTP protocols | NLB, HAProxy (TCP mode) |
| L7 (Application) | HTTP headers, URL paths, cookies | Content-based routing, SSL termination, API gateway | ALB, HAProxy (HTTP mode), Envoy, Nginx |
L4 load balancer:

```
Sees:       TCP connection from 10.0.1.50:52441 → 10.0.2.10:443
Decides:    Route to backend 3 (based on IP hash or round robin)
Cannot see: HTTP method, URL path, headers, cookies
```

L7 load balancer:

```
Sees:    GET /api/v2/orders HTTP/1.1
         Host: api.example.com
         Cookie: session=abc123
Decides: Route /api/v2/* to API cluster, /static/* to CDN,
         session=abc123 → backend 2 (sticky session)
```
General rule: Use L7 for web applications and APIs (you almost always need content-based routing). Use L4 for databases, message queues, and high-throughput non-HTTP services.
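The content-based routing decision an L7 balancer makes can be sketched in a few lines. The path prefixes and pool names below are illustrative stand-ins, not any particular balancer's configuration:

```python
# Sketch of L7 path-based routing: longest matching prefix wins,
# the way rules resolve in balancers like HAProxy or Envoy.
ROUTES = [
    ("/api/", "api-cluster"),
    ("/static/", "cdn"),
]
DEFAULT_POOL = "web-cluster"

def route(path: str) -> str:
    """Return the backend pool for a request path (longest prefix wins)."""
    for prefix, pool in sorted(ROUTES, key=lambda r: -len(r[0])):
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL

print(route("/api/v2/orders"))  # api-cluster
print(route("/static/app.js"))  # cdn
print(route("/index.html"))     # web-cluster
```

An L4 balancer cannot make this decision at all: by the time the path is visible, it has already picked a backend.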
Load Balancing Algorithms
| Algorithm | How It Works | Best For | Weakness |
|---|---|---|---|
| Round Robin | Each request goes to the next server in sequence | Homogeneous backends, stateless requests | Ignores server load and request cost |
| Weighted Round Robin | Round robin with weight per server | Mixed capacity servers (old + new hardware) | Weights must be manually set |
| Least Connections | Route to the server with fewest active connections | Requests with variable processing time | Can overload recovering servers |
| Least Response Time | Route to the fastest-responding server | Latency-sensitive applications | Requires response time measurement |
| IP Hash | Hash client IP to determine server | Rough session affinity without cookies | Uneven distribution with NAT |
| Random | Pick a random server | Simple, surprisingly effective | No load awareness |
| Power of Two Choices | Pick 2 random servers, use the less loaded one | Large clusters, combines simplicity with load awareness | Slightly more complex |
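Power of two choices is simple enough to sketch directly. A minimal simulation (illustrative only, not any specific balancer's implementation) shows how sampling just two backends keeps load tight across a cluster:

```python
import random

def pick_backend(loads: list[int], rng: random.Random) -> int:
    """Power of two choices: sample two distinct backends at random
    and route to whichever currently has less load."""
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

# Simulate 10,000 requests across 8 backends.
rng = random.Random(42)
loads = [0] * 8
for _ in range(10_000):
    loads[pick_backend(loads, rng)] += 1

print(loads)  # counts stay close to the ideal 1250 each
```

The second random sample is the whole trick: purely random assignment lets imbalances grow, while comparing just two candidates pulls traffic toward less-loaded servers without requiring global load knowledge.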
When Round Robin Fails
```
Scenario: 4 backend servers, one request type takes 10x longer

Round Robin distribution:
  Server 1: light, light, HEAVY, light  → 10% CPU, then spikes to 90%
  Server 2: light, light, light, light  → 15% CPU
  Server 3: light, HEAVY, light, HEAVY  → 85% CPU (overloaded!)
  Server 4: light, light, light, light  → 15% CPU

Least Connections distribution:
  Server 1: light, light, light, light  → 20% CPU
  Server 2: light, light, HEAVY         → 45% CPU
  Server 3: light, light, light         → 20% CPU
  Server 4: light, HEAVY, light         → 45% CPU
  (Servers 2 and 4 receive fewer new connections while processing heavy requests)
```

Least connections balances better because heavy requests "protect" the server from receiving more traffic while they are processing.
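This effect is easy to demonstrate. The simulation below uses total assigned work as a stand-in for active connection count (a simplification of true least-connections, which tracks in-flight requests over time); the request mix mirrors the scenario above:

```python
import random

def simulate(assign, costs, n_servers=4):
    """Assign each request's cost to a server; return per-server totals."""
    load = [0] * n_servers
    for i, cost in enumerate(costs):
        load[assign(i, load)] += cost
    return load

# Roughly 1 in 8 requests is 10x heavier, as in the scenario above.
rng = random.Random(1)
costs = [10 if rng.random() < 0.125 else 1 for _ in range(1000)]

round_robin  = lambda i, load: i % len(load)
least_loaded = lambda i, load: load.index(min(load))

rr = simulate(round_robin, costs)
lc = simulate(least_loaded, costs)
print("round robin spread: ", max(rr) - min(rr))
print("least loaded spread:", max(lc) - min(lc))
```

With the load-aware strategy, the gap between the busiest and idlest server is bounded by the cost of a single heavy request; round robin's gap grows with however unevenly the heavy requests happen to land in the rotation.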
Health Checks: The First Line of Defense
A bad health check configuration is worse than no health checks at all. Too aggressive and you flap servers in and out. Too lenient and you route traffic to dead servers for minutes.
```yaml
# Good health check configuration
health_check:
  protocol: HTTP
  path: /health
  port: 8080
  interval: 10s           # Check every 10 seconds
  timeout: 5s             # Give up after 5 seconds
  healthy_threshold: 2    # Mark healthy after 2 consecutive successes
  unhealthy_threshold: 3  # Mark unhealthy after 3 consecutive failures

# What /health should check:
# ✅ Application is started and accepting connections
# ✅ Critical dependencies are reachable (database, cache)
# ❌ NOT full business logic validation (too slow, too flaky)
# ❌ NOT external third-party APIs (their outage ≠ your server is unhealthy)
```
| Health Check Mistake | What Happens | Fix |
|---|---|---|
| No health check | Traffic routes to crashed servers | Always configure health checks |
| Checks external dependency | One API goes down, ALL your servers marked unhealthy | Only check critical internal dependencies |
| 1-failure threshold | Healthy server marked down from one timeout | Use threshold ≥ 3 |
| Slow health endpoint | Health check times out because it runs a full DB query | Health endpoint should respond in < 100ms |
Deep vs Shallow Health Checks
```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# `db` and `redis` are the application's own clients, initialized at startup.

# Shallow health check (fast, always works)
@app.get("/health")
def health():
    return {"status": "ok"}

# Deep health check (verifies dependencies)
@app.get("/health/ready")
async def readiness():
    checks = {}

    # Database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception:
        checks["database"] = "failed"

    # Redis cache
    try:
        await redis.ping()
        checks["cache"] = "ok"
    except Exception:
        checks["cache"] = "failed"

    all_ok = all(v == "ok" for v in checks.values())
    status_code = 200 if all_ok else 503
    return JSONResponse({"checks": checks}, status_code=status_code)
```
Use shallow checks for liveness probes (is the process running?). Use deep checks for readiness probes (is the service ready to handle traffic?).
Connection Draining
When a server is taken out of rotation (for deployment, scaling, or failure), existing connections should be allowed to finish rather than being killed mid-request.
```yaml
# Connection draining configuration
deregistration_delay: 30s  # Allow 30 seconds for in-flight requests

# During this window:
# - No NEW connections routed to this server
# - Existing connections continue until complete or timeout
# - After 30s: remaining connections are forcefully closed
```
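On the application side, draining amounts to an in-flight counter and a deadline. The class below is a hypothetical sketch of that shutdown logic (real services would wire `drain` to SIGTERM and real request handlers to the counter):

```python
import threading
import time

class DrainableServer:
    """Sketch of server-side connection draining: once draining starts,
    new requests are refused while in-flight ones finish, up to a
    deadline mirroring the balancer's deregistration_delay."""

    def __init__(self):
        self.in_flight = 0
        self.draining = False
        self.lock = threading.Lock()

    def try_start_request(self) -> bool:
        with self.lock:
            if self.draining:
                return False  # the balancer should no longer send these
            self.in_flight += 1
            return True

    def finish_request(self):
        with self.lock:
            self.in_flight -= 1

    def drain(self, timeout_s: float) -> bool:
        """Return True if all in-flight requests completed in time."""
        self.draining = True
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self.lock:
                if self.in_flight == 0:
                    return True
            time.sleep(0.01)
        return False  # remaining connections get forcefully closed
```

The timeout matters: without it, one stuck connection keeps the server in rotation-limbo indefinitely during deployments.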
| Drain Duration | Suitable For |
|---|---|
| 5-10 seconds | Fast APIs (< 1s response time) |
| 30 seconds | Standard web applications |
| 60-120 seconds | Long-running requests (file uploads, reports) |
| 300 seconds | WebSocket connections |
Session Affinity (Sticky Sessions)
| Method | How It Works | Tradeoff |
|---|---|---|
| Cookie-based | Load balancer sets a cookie mapping to a backend | Most reliable, requires L7 |
| IP-based | Hash source IP to determine backend | Breaks with NAT/proxies |
| No affinity | Any server handles any request | Requires externalized session state (Redis) |
Strong recommendation: Avoid sticky sessions. Externalize session state to Redis or a database. Sticky sessions create uneven load distribution and make deployments harder (you cannot drain a server if 30% of sessions are pinned to it).
Implementation Checklist
- Choose L4 or L7 based on your routing needs (L7 for web/API, L4 for non-HTTP)
- Select algorithm based on workload: least connections for variable-cost requests
- Configure health checks with ≥ 3 failure threshold and < 100ms response time
- Separate liveness (shallow) and readiness (deep) health checks
- Set connection draining to at least 30 seconds on all services
- Avoid sticky sessions — externalize session state to Redis
- Monitor backend health: track which servers are going in and out of rotation
- Test failover: pull a server out of rotation and verify zero dropped requests
- Set up alerts for: high error rates, uneven load distribution, health check failures
- Document your load balancing setup: algorithm, health check config, drain settings