Message queues decouple producers from consumers, allowing systems to handle traffic spikes, process work asynchronously, and survive temporary failures. The difference between a well-designed queue system and a poorly designed one is the difference between graceful degradation and silent data loss.
## When to Use Queues vs. Direct Calls

| Use Queues When | Use Direct Calls When |
|---|---|
| Work can be processed asynchronously | Response is needed immediately |
| Producer and consumer have different throughput rates | Low latency is critical |
| Downstream service may be temporarily unavailable | Target service is always available |
| Work is expensive and should be rate-limited | Processing is fast and cheap |
| You need durability (work survives crashes) | Fire-and-forget is acceptable |
| Multiple consumers need the same message | Single target consumer |
## Queue Technology Comparison

| Technology | Model | Best For | Ordering |
|---|---|---|---|
| RabbitMQ | Message broker | Task queues, routing patterns | Per-queue FIFO |
| Apache Kafka | Event streaming | Event sourcing, high-throughput logs | Per-partition |
| AWS SQS | Managed queue | Serverless, AWS-native | Best-effort (FIFO available) |
| AWS SNS + SQS | Fan-out | One-to-many notification | Per-subscription |
| Redis Streams | Lightweight streaming | Simple streaming, existing Redis | Per-stream |
| Azure Service Bus | Managed broker | Enterprise, .NET integration | FIFO with sessions |
| Google Pub/Sub | Managed pub/sub | GCP-native, global distribution | Per-subscription |
| NATS | Lightweight messaging | Microservices, low-latency | Per-subject |
## Delivery Guarantees

| Guarantee | Meaning | Tradeoff |
|---|---|---|
| At-most-once | Message delivered 0 or 1 times | Fast, but may lose messages |
| At-least-once | Message delivered 1+ times | Reliable, but may duplicate |
| Exactly-once | Message delivered exactly 1 time | Expensive, complex, often impossible at scale |
## Making At-Least-Once Safe (Idempotency)

**Producer:**

```python
# Every message carries a stable idempotency key derived from the work item.
message = {
    "id": "order-12345-payment",  # idempotency key
    "action": "process_payment",
    "payload": {"order_id": 12345, "amount": 99.99},
}
```

**Consumer:**

```python
def handle(message):
    if already_processed(message["id"]):
        acknowledge(message)  # duplicate detected: ack and skip
    else:
        process(message["payload"])
        mark_processed(message["id"])
        acknowledge(message)
```
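The consumer side can be made concretely runnable by backing the dedupe check with a unique constraint, so checking and claiming the key is one atomic step. A minimal sketch using SQLite; the table name, helper names, and in-memory store are illustrative assumptions (production would use a shared, durable store):

```python
import sqlite3

# In-memory DB for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")

def handle(message, results):
    """Process a message at most once per idempotency key."""
    try:
        # INSERT fails on a duplicate key, so check-and-claim is atomic.
        db.execute("INSERT INTO processed (message_id) VALUES (?)",
                   (message["id"],))
    except sqlite3.IntegrityError:
        return "skipped"  # duplicate delivery: acknowledge and drop
    results.append(message["payload"])  # stand-in for real processing
    return "processed"

msg = {"id": "order-12345-payment", "payload": {"order_id": 12345, "amount": 99.99}}
results = []
print(handle(msg, results))  # processed
print(handle(msg, results))  # skipped (redelivered duplicate)
```

Note this sketch claims the key *before* processing; if processing can crash midway, the message is lost. Marking after processing, inside the same transaction as the side effects, trades that risk for the possibility of reprocessing.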
## Dead Letter Queue (DLQ) Design

```
Main Queue → Consumer attempts processing
    ↓ (success) → Acknowledge, done
    ↓ (failure) → Retry (up to N times)
    ↓ (max retries exceeded) → Dead Letter Queue
    ↓ → Alert, manual investigation, replay
```

| Configuration | Recommended Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient errors, not infinite loops |
| Retry backoff | Exponential (1s, 5s, 25s) | Gives downstream time to recover |
| DLQ retention | 14 days | Time for investigation and replay |
| DLQ alerting | Immediate on first message | Every DLQ message is a failure worth investigating |
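The flow above, with the recommended values, can be sketched as a consumer loop that retries with exponential backoff (1 s, 5 s, 25 s) and dead-letters on exhaustion. The function signature and the list standing in for the DLQ are illustrative assumptions:

```python
import time

MAX_RETRIES = 3
BACKOFF = [1, 5, 25]  # seconds: 5x exponential backoff between attempts

def consume(message, process, dead_letter_queue, sleep=time.sleep):
    """Attempt processing; retry with backoff; dead-letter on exhaustion."""
    for attempt in range(MAX_RETRIES):
        try:
            process(message)
            return "acknowledged"
        except Exception:
            if attempt < MAX_RETRIES - 1:
                sleep(BACKOFF[attempt])  # give downstream time to recover
    dead_letter_queue.append(message)  # alert + manual investigation from here
    return "dead-lettered"

# A handler that always fails ends up in the DLQ after 3 attempts.
dlq = []
def flaky(msg):
    raise RuntimeError("downstream unavailable")

print(consume({"id": "m1"}, flaky, dlq, sleep=lambda s: None))  # dead-lettered
print(len(dlq))  # 1
```

Injecting `sleep` keeps the retry schedule testable; real brokers (RabbitMQ, SQS) implement redelivery and DLQ routing for you, and this loop only illustrates the policy.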
## Consumer Patterns

| Pattern | Description | When to Use |
|---|---|---|
| Competing consumers | Multiple consumers on same queue | Scale processing horizontally |
| Fan-out | One message to multiple queues | Multiple services need same event |
| Request-reply | Response sent to reply queue | Async RPC |
| Saga | Sequence of queue-based steps with compensation | Distributed transactions |
| Priority queue | Higher-priority messages processed first | Mixed-urgency workloads |
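As one illustration, the competing-consumers pattern can be sketched with Python's thread-safe `queue.Queue`: several workers pull from the same queue, and each message goes to exactly one of them. The worker count and doubling "work" are assumptions for the example:

```python
import queue
import threading

work = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    # Each consumer competes for messages from the shared queue.
    while True:
        msg = work.get()
        if msg is None:  # sentinel: shut down this worker
            work.task_done()
            return
        with lock:
            results.append(msg * 2)  # stand-in for real processing
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for i in range(10):
    work.put(i)
for _ in threads:
    work.put(None)
work.join()
for t in threads:
    t.join()
print(sorted(results))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Ordering across workers is not guaranteed (hence the `sorted`), which mirrors real competing consumers: horizontal scaling trades away cross-message ordering.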
## Monitoring Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Queue depth | Messages waiting to be processed | Growing over time = consumer can’t keep up |
| Processing latency | Time from enqueue to dequeue | > SLA target |
| Consumer lag | How far behind the consumer is | Growing lag = falling behind |
| Error rate | Failed message processing | > 1% sustained |
| DLQ depth | Messages that failed permanently | > 0 (every DLQ message is an issue) |
| Throughput | Messages produced/consumed per second | Dropping below expected rate |
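A hedged sketch of how these thresholds might be evaluated in an alerting check; the metric names, sample shapes, and default limits are illustrative assumptions, not any particular monitoring system's API:

```python
def check_queue_health(samples, sla_latency_s=5.0, error_rate_limit=0.01):
    """Return a list of alert strings from one round of metric samples."""
    alerts = []
    depth = samples["queue_depth"]  # successive readings, oldest first
    if all(b > a for a, b in zip(depth, depth[1:])):
        alerts.append("queue depth growing: consumers can't keep up")
    if samples["processing_latency_s"] > sla_latency_s:
        alerts.append("processing latency above SLA")
    if samples["error_rate"] > error_rate_limit:
        alerts.append("error rate above 1%")
    if samples["dlq_depth"] > 0:
        alerts.append("messages in DLQ: investigate")
    return alerts

print(check_queue_health({
    "queue_depth": [100, 180, 260],   # growing: fires the depth alert
    "processing_latency_s": 2.0,      # within SLA
    "error_rate": 0.002,              # under 1%
    "dlq_depth": 3,                   # any DLQ message fires an alert
}))
```

The sample above fires two alerts (growing depth and non-empty DLQ), matching the thresholds in the table.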
## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| No DLQ | Failed messages silently disappear | Always configure DLQ with alerting |
| No idempotency | Duplicate processing causes data corruption | Idempotency keys on every message |
| Unbounded retries | Poison messages retry forever | Max retry count + exponential backoff + DLQ |
| Queue as database | Storing state in queue messages | Use a database for state, queue for events |
| Too large messages | Queue performance degrades | Store payload in S3/blob, pass reference in message |
| No monitoring | Silent queue failures | Monitor depth, lag, error rate, DLQ depth |
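The "too large messages" fix is often called the claim-check pattern: store the payload externally and enqueue only a reference. A minimal sketch with a dict standing in for S3/blob storage; the store, key format, and helper names are assumptions:

```python
import json
import uuid

blob_store = {}  # stand-in for S3 / Azure Blob / GCS

def enqueue_large(payload):
    """Store the payload externally; enqueue only a small reference."""
    key = f"payloads/{uuid.uuid4()}"
    blob_store[key] = json.dumps(payload)
    return {"payload_ref": key}  # the "claim check"

def dequeue_large(message):
    """Resolve the reference back to the full payload."""
    return json.loads(blob_store[message["payload_ref"]])

msg = enqueue_large({"report": "x" * 1_000_000})
print(len(json.dumps(msg)) < 200)        # True: the queued message stays tiny
print(len(dequeue_large(msg)["report"]))  # 1000000
```

The queue then carries only bytes-sized references regardless of payload size; lifecycle rules on the blob store (e.g. matching DLQ retention) keep orphaned payloads from accumulating.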
## Checklist

- [ ] DLQ configured on every queue, with alerting on the first message
- [ ] Idempotency key on every message; consumers deduplicate
- [ ] Retry limit (3-5 attempts) with exponential backoff
- [ ] Queue depth, consumer lag, error rate, and DLQ depth monitored
- [ ] Large payloads stored externally; only references passed in messages
- [ ] State kept in a database, not in queue messages
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For messaging architecture consulting, visit garnetgrid.com.
:::
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting
Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.