
Background Job Processing

Design reliable background job systems that handle retries, priorities, rate limiting, and failure recovery. Covers job queue architectures, idempotency, dead letter queues, and the patterns that prevent your background jobs from losing work or running twice.

Not everything belongs in the request-response cycle. Sending emails, processing images, generating reports, syncing data with third-party APIs — these operations are too slow, too unreliable, or too resource-intensive to run synchronously. Background job processing moves this work out of the critical path, improving response times and system resilience.


Architecture

Web Request → API Server → Job Queue → Worker Process → External Systems
                              ↓              ↓
                         Job Storage     Dead Letter Queue
                        (persistent)     (failed jobs)

Components

  • Producer: The API server that enqueues jobs
  • Queue: The ordered list of pending jobs (Redis, RabbitMQ, SQS)
  • Worker: The process that dequeues and executes jobs
  • Storage: Persistent job metadata for monitoring and replay
  • DLQ: Dead letter queue for jobs that fail all retries
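The flow above can be sketched with an in-memory queue. This is illustrative only: the job type `send_receipt` is made up, and a real deployment uses a persistent broker (Redis, SQS, RabbitMQ) so jobs survive process crashes:

```python
import queue
import threading

# In-memory stand-ins for the queue and the job results
job_queue = queue.Queue()
results = []

def producer(order_id):
    # The API server enqueues a small job descriptor, not the work itself
    job_queue.put({"type": "send_receipt", "order_id": order_id})

def worker():
    # The worker dequeues and executes until the queue is drained
    while True:
        try:
            job = job_queue.get(timeout=0.1)
        except queue.Empty:
            return
        results.append(f"processed {job['type']} for order {job['order_id']}")
        job_queue.task_done()

producer(1)
producer(2)
t = threading.Thread(target=worker)
t.start()
t.join()
print(results)
```

The producer returns immediately after `put()`, which is the whole point: the slow work happens on the worker's schedule, not the request's.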

Job Queue Selection

Queue                     Strengths                        Scale      Persistence
------------------------  -------------------------------  ---------  -----------
Redis (Sidekiq/BullMQ)    Fast, simple, real-time          Medium     AOF/RDB
RabbitMQ                  Routing, exchanges, reliability  High       Disk
AWS SQS                   Managed, infinite scale          Unlimited  Managed
PostgreSQL (SKIP LOCKED)  No extra infrastructure          Moderate   Full ACID
Kafka                     Streaming, replay, ordering      Unlimited  Disk

PostgreSQL as a Job Queue

For small to medium workloads, your database is a perfectly good job queue:

CREATE TABLE jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    queue TEXT NOT NULL DEFAULT 'default',
    job_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    priority INTEGER DEFAULT 0,
    run_at TIMESTAMPTZ DEFAULT NOW(),
    attempts INTEGER DEFAULT 0,
    max_attempts INTEGER DEFAULT 3,
    last_error TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    locked_at TIMESTAMPTZ,
    locked_by TEXT
);

-- Worker fetches next job atomically
UPDATE jobs
SET status = 'running', locked_at = NOW(), locked_by = 'worker-1'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
      AND run_at <= NOW()
    ORDER BY priority DESC, created_at ASC
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING *;

FOR UPDATE SKIP LOCKED is the key — it allows multiple workers to poll concurrently without blocking each other.


Retry Strategies

Exponential Backoff

import random

def calculate_retry_delay(attempt, base_delay=60):
    # Double the delay on each attempt: 60s, 120s, 240s, ...
    delay = base_delay * (2 ** attempt)
    # Jitter spreads retries out so recovering services aren't hit all at once
    jitter = random.uniform(0, delay * 0.1)
    return min(delay + jitter, 3600)  # Cap at 1 hour
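Printing the schedule makes the shape concrete (same function, repeated here so the sketch runs standalone):

```python
import random

def calculate_retry_delay(attempt, base_delay=60):
    delay = base_delay * (2 ** attempt)
    jitter = random.uniform(0, delay * 0.1)
    return min(delay + jitter, 3600)  # Cap at 1 hour

# Delays roughly double each attempt until the 1-hour cap kicks in
for attempt in range(7):
    print(f"attempt {attempt}: retry in ~{calculate_retry_delay(attempt):.0f}s")
```

By attempt 6 the uncapped delay would be 3840s, so every later retry waits exactly the 3600s cap.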

Retry Classification

Not every error should be retried:

RETRYABLE = [ConnectionError, TimeoutError, RateLimitError]
NOT_RETRYABLE = [ValidationError, AuthenticationError, NotFoundError]
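One way to act on that classification is a small dispatch helper. The custom classes here (`RateLimitError`, `ValidationError`, and so on) are stand-ins for whatever your client libraries actually raise; `ConnectionError` and `TimeoutError` are Python builtins:

```python
# Hypothetical error taxonomy mirroring the lists above
class RateLimitError(Exception): pass
class ValidationError(Exception): pass
class AuthenticationError(Exception): pass
class NotFoundError(Exception): pass

RETRYABLE = (ConnectionError, TimeoutError, RateLimitError)
NOT_RETRYABLE = (ValidationError, AuthenticationError, NotFoundError)

def should_retry(exc: Exception) -> bool:
    if isinstance(exc, NOT_RETRYABLE):
        return False
    if isinstance(exc, RETRYABLE):
        return True
    # Unknown errors are likely bugs: retrying just burns attempts,
    # so fail fast and let the job reach the dead letter queue
    return False
```

Defaulting unknown errors to "don't retry" is the conservative choice; flipping the default means a typo in a handler silently retries forever.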

Idempotency

Jobs must be safe to run more than once. Network failures, worker crashes, and queue redelivery all cause duplicate execution:

def process_payment(job):
    # Derive the key from a business identifier so every redelivery
    # of this job maps to the same logical operation
    idempotency_key = f"payment-{job.order_id}"

    if already_processed(idempotency_key):
        return  # Already done, skip

    # Stripe also deduplicates on its side via the idempotency key,
    # covering the window between the check above and this call
    result = stripe.charges.create(
        amount=job.amount,
        idempotency_key=idempotency_key
    )

    mark_processed(idempotency_key, result)

Pattern                  How                                 Use When
-----------------------  ----------------------------------  -------------------------
Deduplication table      Store processed job IDs             Any job type
Natural idempotency key  Use business identifier (order_id)  Payment, notification
Conditional update       UPDATE ... WHERE status != 'done'   State machine transitions
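The conditional-update row can be sketched with SQLite (any SQL database works; the `orders` table and `fulfill` helper are illustrative). The WHERE clause makes the state transition a no-op on redelivery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'pending')")

def fulfill(order_id):
    # The guard on status makes this UPDATE idempotent: only the
    # first delivery matches a row and performs the transition
    cur = conn.execute(
        "UPDATE orders SET status = 'done' "
        "WHERE id = ? AND status != 'done'",
        (order_id,),
    )
    return cur.rowcount == 1

print(fulfill(1))  # True: first delivery performs the transition
print(fulfill(1))  # False: duplicate delivery matches no rows
```

Checking `rowcount` tells the worker whether it won the transition, which is useful when only the first execution should trigger side effects.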

Dead Letter Queues

Jobs that exhaust all retries go to a dead letter queue for manual investigation:

def execute_job(job):
    try:
        run_job(job)
        mark_completed(job)
    except RetryableError:
        if job.attempts < job.max_attempts:
            # reschedule is expected to increment job.attempts
            reschedule(job, delay=calculate_retry_delay(job.attempts))
        else:
            move_to_dlq(job, reason="max retries exceeded")
    except NonRetryableError as e:
        # Retrying won't help here; route straight to the DLQ
        move_to_dlq(job, reason=str(e))
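A minimal harness for that flow, with in-memory stand-ins for `reschedule` and `move_to_dlq` and a job that always fails retryably. The attempts bookkeeping lives in `reschedule` here; a real system persists it with the job:

```python
import dataclasses

class RetryableError(Exception): pass
class NonRetryableError(Exception): pass

@dataclasses.dataclass
class Job:
    attempts: int = 0
    max_attempts: int = 3

scheduled, dlq = [], []

def run_job(job):
    # This demo job always fails with a transient-looking error
    raise RetryableError("upstream timeout")

def reschedule(job, delay):
    job.attempts += 1
    scheduled.append((job, delay))

def move_to_dlq(job, reason):
    dlq.append((job, reason))

def execute_job(job):
    try:
        run_job(job)
    except RetryableError:
        if job.attempts < job.max_attempts:
            reschedule(job, delay=60 * 2 ** job.attempts)
        else:
            move_to_dlq(job, reason="max retries exceeded")
    except NonRetryableError as e:
        move_to_dlq(job, reason=str(e))

job = Job()
for _ in range(4):  # initial run plus three redeliveries
    execute_job(job)
print(len(scheduled), len(dlq))  # → 3 1
```

After three rescheduled attempts the fourth delivery exceeds `max_attempts` and the job lands in the DLQ instead of retrying forever.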

Anti-Patterns

Anti-Pattern                      Consequence                            Fix
--------------------------------  -------------------------------------  -------------------------------------
Fire-and-forget (no persistence)  Lost jobs on crash                     Persistent queue with acknowledgment
No idempotency                    Double charges, duplicate emails       Idempotency keys on every job
Retry without backoff             Thundering herd on recovering service  Exponential backoff with jitter
No dead letter queue              Failed jobs disappear silently         DLQ with monitoring and replay
Giant job payloads                Queue memory pressure                  Store payload in DB, pass ID in queue
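The last fix, "store payload in DB, pass ID in queue", can be sketched like this. SQLite and `queue.Queue` stand in for the real storage and broker, and the `job_payloads` table is illustrative:

```python
import queue
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE job_payloads (id INTEGER PRIMARY KEY, payload TEXT)")

q = queue.Queue()

def enqueue(payload: str) -> int:
    # The large payload goes to durable storage...
    cur = db.execute("INSERT INTO job_payloads (payload) VALUES (?)", (payload,))
    # ...and only the small ID travels through the queue
    q.put(cur.lastrowid)
    return cur.lastrowid

def work() -> str:
    job_id = q.get()
    (payload,) = db.execute(
        "SELECT payload FROM job_payloads WHERE id = ?", (job_id,)
    ).fetchone()
    return payload  # worker hydrates the payload on its side

enqueue("a multi-megabyte report definition")
print(work())
```

The queue now holds a handful of bytes per job regardless of payload size, and the payload remains queryable in the database for monitoring and replay.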

Background jobs are the workhorses of production systems. Designed well, they handle millions of operations reliably.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
