
Rate Limiting and API Throttling: Protecting Services at Scale

Implementation patterns for rate limiting and API throttling — covering algorithms, distributed rate limiting, quota management, and graceful degradation.

Rate limiting is a critical mechanism for protecting services from abuse, ensuring fair access, and maintaining system stability under load. Without it, a single client can monopolize resources, a bug can trigger a DDoS-like cascade, or a viral moment can bring down your entire platform.

Why Rate Limiting Matters

  • Protect backend services from being overwhelmed
  • Ensure fair usage across tenants and API consumers
  • Prevent abuse from scrapers, bots, and malicious actors
  • Control costs for metered infrastructure (cloud, third-party APIs)
  • Enable SLA enforcement for tiered pricing plans

Rate Limiting Algorithms

1. Fixed Window Counter

The simplest approach. Count requests in fixed time windows.

Window: 1 minute
Limit: 100 requests
Counter resets at the top of each minute

Timeline:
12:00:00 ────────────── 12:01:00 ────────────── 12:02:00
    [     100 allowed     ]  [   100 allowed     ]

Problem: Boundary burst. A client could send 100 requests at 12:00:59 and 100 more at 12:01:00 — 200 requests in 2 seconds.
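The fixed-window counter above can be sketched as a small in-memory class. The `FixedWindowLimiter` name and the single-process dict storage are illustrative assumptions, not a production design:

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Minimal in-memory fixed-window counter; window boundaries
    fall on multiples of window_seconds, which is exactly what
    makes the boundary-burst problem possible."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = defaultdict(int)  # (client, window index) -> count

    def is_allowed(self, client_id):
        window_index = int(time.time() // self.window)
        key = (client_id, window_index)
        if self.counters[key] < self.limit:
            self.counters[key] += 1
            return True
        return False
```

Note that old `(client, window)` entries are never evicted here; a real implementation would expire them, as the Redis examples later in this article do.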

2. Sliding Window Log

Track the timestamp of every request. Count requests within the sliding window.

import time
import redis as redis_lib

redis = redis_lib.Redis()  # shared store; one sorted set per client

def is_allowed(client_id, limit, window_seconds):
    now = time.time()
    key = f"rate:{client_id}"

    # Remove entries outside the window
    redis.zremrangebyscore(key, 0, now - window_seconds)

    # Count requests in window
    count = redis.zcard(key)

    if count < limit:
        # Note: this check-then-add is not atomic across instances;
        # wrap it in a pipeline or Lua script for strict enforcement.
        redis.zadd(key, {str(now): now})
        redis.expire(key, window_seconds)
        return True
    return False

Pros: Exact window, no boundary burst. Cons: High memory usage — stores every timestamp.

3. Sliding Window Counter

Hybrid approach: use the previous window’s count weighted by overlap.

Previous window count: 84
Overlap: 60% of the previous window still falls inside the sliding window
Current window count: 36
Estimated rate: 84 × 0.6 + 36 = 86.4
Limit: 100 → Allowed

Pros: Low memory, no boundary burst, constant-time lookup. Cons: Approximate (but close enough for production).
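The weighted estimate can be sketched with just two counters. `SlidingWindowCounter` is a hypothetical name, and for brevity each instance guards a single client key rather than taking a key argument:

```python
import time

class SlidingWindowCounter:
    """Approximate sliding-window limiter: weight the previous fixed
    window's count by how much of it overlaps the sliding window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None  # index of the current fixed window
        self.current_count = 0
        self.previous_count = 0

    def is_allowed(self):
        now = time.time()
        index = int(now // self.window)
        if index != self.current_window:
            # Roll windows forward; if a whole window was skipped,
            # the previous count is zero.
            if self.current_window is not None and index == self.current_window + 1:
                self.previous_count = self.current_count
            else:
                self.previous_count = 0
            self.current_count = 0
            self.current_window = index
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now % self.window) / self.window
        estimated = self.previous_count * overlap + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False
```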

4. Token Bucket

A bucket holds tokens that refill at a constant rate. Each request consumes one token.

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()

    def consume(self, tokens=1):
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

Pros: Allows controlled bursts (up to bucket capacity). Smooth average rate. Cons: Slightly more complex implementation.

5. Leaky Bucket

Requests enter a queue (bucket) and are processed at a constant rate. Overflow is dropped.

Pros: Perfectly smooth output rate. Great for downstream services with fixed capacity. Cons: Adds latency (queuing). Drops requests during bursts.
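A leaky bucket can be sketched with a bounded queue that drains at a fixed rate. The `LeakyBucket` name and deque-based queue are illustrative; a real implementation would hand drained requests to a worker rather than discarding them:

```python
import time
from collections import deque

class LeakyBucket:
    """Requests queue up to a fixed capacity and drain at a constant
    rate; arrivals that would overflow the queue are dropped."""

    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.queue = deque()
        self.last_leak = time.time()

    def _leak(self):
        now = time.time()
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()  # would be handed to a worker
            self.last_leak = now

    def try_enqueue(self, request):
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False  # overflow: drop the request
```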

Algorithm Comparison

Algorithm        | Burst Handling  | Memory | Accuracy    | Complexity
-----------------|-----------------|--------|-------------|-----------
Fixed Window     | Poor (boundary) | O(1)   | Approximate | Simple
Sliding Log      | None (exact)    | O(n)   | Exact       | Medium
Sliding Counter  | Good            | O(1)   | Near-exact  | Medium
Token Bucket     | Controlled      | O(1)   | Good        | Medium
Leaky Bucket     | None (queued)   | O(n)   | Exact       | Medium

Distributed Rate Limiting

Single-instance rate limiting doesn’t work when you have multiple API gateway instances. Every instance needs to share state.

Centralized Counter (Redis)

API Gateway 1 ┐
API Gateway 2 ├──→ Redis (shared counter) ──→ Allow/Deny
API Gateway 3 ┘

Implementation:

-- Redis Lua script for atomic fixed-window rate limiting
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = tonumber(redis.call('GET', key) or '0')
if current < limit then
    local count = redis.call('INCR', key)
    if count == 1 then
        -- Set the expiry only on the first request in the window.
        -- Calling EXPIRE on every hit would push the expiry forward
        -- each time, so under sustained traffic the counter would
        -- never reset.
        redis.call('EXPIRE', key, window)
    end
    return 1  -- allowed
end
return 0  -- denied

Local + Sync

Each instance maintains a local counter and periodically syncs with a central store:

  • Lower latency (no network call per request)
  • Eventually consistent (may over-allow during sync gaps)
  • Acceptable for most use cases
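The local + sync pattern might look like the following sketch. A plain dict stands in for the central store, window expiry is omitted for brevity, and the `LocalSyncLimiter` name is an assumption:

```python
import time

class LocalSyncLimiter:
    """Count locally, flush to a shared store every sync_interval
    seconds. Between flushes an instance sees a stale global count,
    so the fleet as a whole may briefly over-allow."""

    def __init__(self, key, limit, central_store, sync_interval=1.0):
        self.key = key
        self.limit = limit
        self.central = central_store  # shared: key -> global count
        self.sync_interval = sync_interval
        self.local_count = 0          # increments since last flush
        self.known_global = 0         # global count seen at last sync
        self.last_sync = time.time()

    def is_allowed(self):
        now = time.time()
        if now - self.last_sync >= self.sync_interval:
            # Flush local increments, then refresh our global view.
            self.central[self.key] = self.central.get(self.key, 0) + self.local_count
            self.known_global = self.central[self.key]
            self.local_count = 0
            self.last_sync = now
        if self.known_global + self.local_count < self.limit:
            self.local_count += 1
            return True
        return False
```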

Response Headers

Always communicate rate limit status to clients:

HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 34
X-RateLimit-Reset: 1625097600

When rate limited:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1625097600
Retry-After: 42
Content-Type: application/json

{"error": "Rate limit exceeded", "retry_after": 42}
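A small helper can keep the headers consistent across responses. The `rate_limit_headers` name is illustrative; it encodes the convention that Retry-After is only meaningful when the limit is exhausted (429/503), not on a 200:

```python
import time

def rate_limit_headers(limit, remaining, reset_epoch):
    """Build the standard X-RateLimit-* headers; include Retry-After
    only when the client has no quota left."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if remaining <= 0:
        headers["Retry-After"] = str(max(0, reset_epoch - int(time.time())))
    return headers
```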

Multi-Dimensional Rate Limiting

Real-world APIs limit on multiple dimensions simultaneously:

Dimension       | Limit       | Scope
----------------|-------------|------------------------------
Per API key     | 1,000/min   | Consumer-level
Per IP address  | 100/min     | Abuse prevention
Per endpoint    | 50/min      | Protect expensive operations
Per tenant      | 10,000/min  | SaaS multi-tenancy
Global          | 100,000/min | System-wide protection

All limits must pass for a request to proceed.
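Composing dimensions can be sketched as a chain of independent limiters. `CountLimiter` and `check_all_limits` are hypothetical names, and the stand-in limiter has a fixed budget with no time window; any of the algorithms above could be swapped in:

```python
class CountLimiter:
    """Stand-in limiter: a fixed budget per key, just to show
    how dimensions compose."""
    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def is_allowed(self, key):
        n = self.counts.get(key, 0)
        if n < self.limit:
            self.counts[key] = n + 1
            return True
        return False

def check_all_limits(request, limiters):
    """Every dimension must pass; report the first one that denies
    so the 429 body can say which limit was hit. Note that earlier
    dimensions consume quota even if a later one denies."""
    for dimension, limiter, key_fn in limiters:
        if not limiter.is_allowed(key_fn(request)):
            return False, dimension
    return True, None
```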

Graceful Degradation

Instead of hard rejection (429), consider graceful degradation:

  1. Serve cached responses — Return stale data instead of no data
  2. Reduce response quality — Skip expensive computations, return simplified results
  3. Queue and process later — Accept the request but process it asynchronously
  4. Shed non-critical requests — Prioritize core functionality over nice-to-haves
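The strategies above might be dispatched like this sketch; `handle_over_limit`, the cache, and the queue are all illustrative stand-ins for whatever your stack provides:

```python
def handle_over_limit(request, cache, queue):
    """Hypothetical degradation policy for an over-limit request:
    stale cache for reads, async queueing for writes, and a plain
    429 otherwise."""
    key = (request["method"], request["path"])
    if request["method"] == "GET" and key in cache:
        return 200, cache[key]            # serve stale data
    if request["method"] == "POST":
        queue.append(request)             # accept now, process later
        return 202, {"status": "queued"}
    return 429, {"error": "Rate limit exceeded"}
```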

Anti-Patterns

Client-Side Rate Limiting Only

Trusting the client to rate-limit itself is like trusting drivers to enforce their own speed limits.

No Rate Limiting on Internal APIs

Internal services can cause cascading failures just like external clients. Rate limit service-to-service calls too.

Overly Aggressive Limits

If legitimate users are regularly hitting limits, the limits are too low. Monitor 429 rates by client and adjust.

No Retry-After Header

Clients need to know when they can retry. Without Retry-After, they’ll retry immediately — making the problem worse.
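On the client side, a well-behaved retry helper honors Retry-After when present and falls back to exponential backoff otherwise. The `retry_with_backoff` name and the `(status, headers, body)` call shape are assumptions for the sketch:

```python
import time

def retry_with_backoff(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()` (returning (status, headers, body)); on a 429,
    wait for the server-specified Retry-After, or fall back to
    exponential backoff when the header is missing."""
    for attempt in range(max_attempts):
        status, headers, body = call()
        if status != 429:
            return status, body
        delay = float(headers.get("Retry-After", base_delay * 2 ** attempt))
        sleep(delay)
    return status, body  # still rate limited after max_attempts
```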

Rate Limiting Without Monitoring

If you can’t see who’s being rate-limited and why, you can’t tune the limits. Dashboard your 429 rates by client, endpoint, and time.

Production Checklist

  • Rate limit headers on every response
  • 429 responses include Retry-After
  • Distributed rate limiting across all instances
  • Multi-dimensional limits (key + IP + endpoint)
  • Dashboard for 429 rates by client and endpoint
  • Alerting when legitimate clients hit limits
  • Graceful degradation for high-value clients
  • Documentation of limits in API docs
  • Rate limit bypass for internal health checks
  • Load test to verify limits hold under pressure
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
