Background Job Processing
Design reliable background job systems that handle retries, priorities, rate limiting, and failure recovery. Covers job queue architectures, idempotency, dead letter queues, and the patterns that prevent your background jobs from losing work or running twice.
Not everything belongs in the request-response cycle. Sending emails, processing images, generating reports, syncing data with third-party APIs — these operations are too slow, too unreliable, or too resource-intensive to run synchronously. Background job processing moves this work out of the critical path, improving response times and system resilience.
Architecture
```
Web Request → API Server → Job Queue → Worker Process → External Systems
                               ↓              ↓
                          Job Storage   Dead Letter Queue
                          (persistent)    (failed jobs)
```
Components
- Producer: The API server that enqueues jobs
- Queue: The ordered list of pending jobs (Redis, RabbitMQ, SQS)
- Worker: The process that dequeues and executes jobs
- Storage: Persistent job metadata for monitoring and replay
- DLQ: Dead letter queue for jobs that fail all retries
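The interplay of these components can be sketched in-process. This is a minimal illustration only, assuming an in-memory `queue.Queue` standing in for the broker; a real system needs a persistent queue (Redis, RabbitMQ, SQS), since `queue.Queue` loses everything on a crash:

```python
import queue

job_queue = queue.Queue()   # stands in for the broker
dead_letter_queue = []      # stands in for the DLQ

def enqueue(job_type, payload):
    """Producer: the API server pushes a job description onto the queue."""
    job_queue.put({"type": job_type, "payload": payload, "attempts": 0})

def work_once(handlers, max_attempts=3):
    """Worker: pop one job and run it; failures retry or land in the DLQ."""
    job = job_queue.get()
    try:
        handlers[job["type"]](job["payload"])
    except Exception:
        job["attempts"] += 1
        if job["attempts"] < max_attempts:
            job_queue.put(job)             # requeue for another attempt
        else:
            dead_letter_queue.append(job)  # exhausted: park for inspection
```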
Job Queue Selection
| Queue | Strengths | Scale | Persistence |
|---|---|---|---|
| Redis (Sidekiq/BullMQ) | Fast, simple, real-time | Medium | AOF/RDB |
| RabbitMQ | Routing, exchanges, reliability | High | Disk |
| AWS SQS | Managed, infinite scale | Unlimited | Managed |
| PostgreSQL (SKIP LOCKED) | No extra infrastructure | Moderate | Full ACID |
| Kafka | Streaming, replay, ordering | Unlimited | Disk |
PostgreSQL as a Job Queue
For small to medium workloads, your database is a perfectly good job queue:
```sql
CREATE TABLE jobs (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    queue        TEXT NOT NULL DEFAULT 'default',
    job_type     TEXT NOT NULL,
    payload      JSONB NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',
    priority     INTEGER DEFAULT 0,
    run_at       TIMESTAMPTZ DEFAULT NOW(),
    attempts     INTEGER DEFAULT 0,
    max_attempts INTEGER DEFAULT 3,
    last_error   TEXT,
    created_at   TIMESTAMPTZ DEFAULT NOW(),
    locked_at    TIMESTAMPTZ,
    locked_by    TEXT
);

-- Worker fetches next job atomically
UPDATE jobs
SET status = 'running', locked_at = NOW(), locked_by = 'worker-1'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
      AND run_at <= NOW()
    ORDER BY priority DESC, created_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
```
FOR UPDATE SKIP LOCKED is the key — it allows multiple workers to poll concurrently without blocking each other.
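The worker side of this pattern can be sketched with the standard-library `sqlite3` module, using a simplified copy of the table above. SQLite has no SKIP LOCKED (its writers serialize, so the UPDATE itself is the claim); in Postgres the inner SELECT would carry FOR UPDATE SKIP LOCKED so that concurrent workers skip claimed rows instead of blocking:

```python
import sqlite3

def make_queue():
    """In-memory stand-in for the jobs table above (simplified columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE jobs (
            id        INTEGER PRIMARY KEY,
            job_type  TEXT NOT NULL,
            payload   TEXT NOT NULL,
            status    TEXT NOT NULL DEFAULT 'pending',
            priority  INTEGER DEFAULT 0,
            locked_by TEXT
        )""")
    return conn

def claim_next_job(conn, worker_id):
    """Atomically flip one pending job to 'running' and return it."""
    cur = conn.execute(
        """UPDATE jobs
           SET status = 'running', locked_by = ?
           WHERE id = (SELECT id FROM jobs
                       WHERE status = 'pending'
                       ORDER BY priority DESC, id ASC
                       LIMIT 1)""",
        (worker_id,),
    )
    if cur.rowcount == 0:
        conn.rollback()
        return None  # nothing pending
    row = conn.execute(
        """SELECT id, job_type, payload FROM jobs
           WHERE status = 'running' AND locked_by = ?
           ORDER BY id DESC LIMIT 1""",
        (worker_id,),
    ).fetchone()
    conn.commit()
    return row
```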
Retry Strategies
Exponential Backoff
```python
import random

def calculate_retry_delay(attempt, base_delay=60):
    """Double the delay on each attempt, with jitter to spread retries out."""
    delay = base_delay * (2 ** attempt)
    jitter = random.uniform(0, delay * 0.1)
    return min(delay + jitter, 3600)  # Cap at 1 hour
```
Retry Classification
Not every error should be retried:
```python
RETRYABLE = [ConnectionError, TimeoutError, RateLimitError]
NOT_RETRYABLE = [ValidationError, AuthenticationError, NotFoundError]
```
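A runnable sketch of that classification follows. `RateLimitError` and `ValidationError` are placeholder exception types (real jobs would import them from the HTTP or validation library in use), and auth/not-found errors are mapped onto the built-in `PermissionError` and `LookupError` for illustration:

```python
class RateLimitError(Exception): pass
class ValidationError(Exception): pass

RETRYABLE = (ConnectionError, TimeoutError, RateLimitError)
NOT_RETRYABLE = (ValidationError, PermissionError, LookupError)

def should_retry(exc):
    """Transient faults are retried; permanent failures fail fast."""
    if isinstance(exc, NOT_RETRYABLE):
        return False
    return isinstance(exc, RETRYABLE)
```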
Idempotency
Jobs must be safe to run more than once. Network failures, worker crashes, and queue redelivery all cause duplicate execution:
```python
def process_payment(job):
    idempotency_key = f"payment-{job.order_id}"
    if already_processed(idempotency_key):
        return  # Already done, skip
    result = stripe.charges.create(
        amount=job.amount,
        idempotency_key=idempotency_key,
    )
    mark_processed(idempotency_key, result)
```
| Pattern | How | Use When |
|---|---|---|
| Deduplication table | Store processed job IDs | Any job type |
| Natural idempotency key | Use business identifier (order_id) | Payment, notification |
| Conditional update | UPDATE ... WHERE status != 'done' | State machine transitions |
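The deduplication-table pattern from the first row can be sketched with `sqlite3`; the schema and `run_once` helper here are illustrative, not from any particular library. `INSERT OR IGNORE` lets the primary key reject duplicate keys, and the marker row commits in the same transaction as the work, so a crash mid-action leaves the key unclaimed and the job retryable (external side effects, of course, are not rolled back):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (idempotency_key TEXT PRIMARY KEY)")

def run_once(conn, key, action):
    """Run action at most once per key, using a deduplication table."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed (idempotency_key) VALUES (?)",
        (key,),
    )
    if cur.rowcount == 0:
        conn.rollback()
        return False  # duplicate delivery: skip
    action()
    conn.commit()
    return True
```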
Dead Letter Queues
Jobs that exhaust all retries go to a dead letter queue for manual investigation:
```python
def execute_job(job):
    try:
        run_job(job)
        mark_completed(job)
    except RetryableError:
        if job.attempts < job.max_attempts:
            reschedule(job, delay=calculate_retry_delay(job.attempts))
        else:
            move_to_dlq(job, reason="max retries exceeded")
    except NonRetryableError as e:
        move_to_dlq(job, reason=str(e))
```
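The other half of the DLQ story is replay after the underlying failure is fixed. A minimal sketch, assuming dead-lettered jobs are plain dicts with an `attempts` counter and `reschedule` is whatever re-enqueues a job:

```python
def replay_dlq(dlq, reschedule):
    """Drain the dead letter queue back into the live queue.

    Attempt counters reset to zero so each replayed job gets a full
    retry budget; run this only after fixing the underlying failure."""
    replayed = []
    while dlq:
        job = dlq.pop(0)
        job["attempts"] = 0
        reschedule(job)
        replayed.append(job)
    return replayed
```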
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Fire-and-forget (no persistence) | Lost jobs on crash | Persistent queue with acknowledgment |
| No idempotency | Double charges, duplicate emails | Idempotency keys on every job |
| Retry without backoff | Thundering herd on recovering service | Exponential backoff with jitter |
| No dead letter queue | Failed jobs disappear silently | DLQ with monitoring and replay |
| Giant job payloads | Queue memory pressure | Store payload in DB, pass ID in queue |
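The fix for the last anti-pattern, pass an ID instead of the payload, can be sketched in a few lines. The names here are illustrative, with a dict and list standing in for the payload table and the queue:

```python
payload_store = {}  # stands in for a database table holding full payloads

def enqueue_by_id(queue, job_id, payload):
    """Persist the heavy payload, then enqueue only a small reference."""
    payload_store[job_id] = payload
    queue.append({"job_id": job_id})

def run_next(queue, handler):
    """Worker hydrates the payload from storage just before execution."""
    message = queue.pop(0)
    handler(payload_store[message["job_id"]])
```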
Background jobs are the workhorses of production systems. Designed well, they handle millions of operations reliably.