Message queues decouple producers from consumers, allowing systems to handle traffic spikes, process work asynchronously, and survive temporary failures. The difference between a well-designed queue system and a poorly designed one is the difference between graceful degradation and silent data loss.
## When to Use Queues vs. Direct Calls

| Use Queues When | Use Direct Calls When |
|---|---|
| Work can be processed asynchronously | Response is needed immediately |
| Producer and consumer have different throughput rates | Low latency is critical |
| Downstream service may be temporarily unavailable | Target service is always available |
| Work is expensive and should be rate-limited | Processing is fast and cheap |
| You need durability (work survives crashes) | Fire-and-forget is acceptable |
| Multiple consumers need the same message | Single target consumer |
## Queue Technology Comparison

| Technology | Model | Best For | Ordering |
|---|---|---|---|
| RabbitMQ | Message broker | Task queues, routing patterns | Per-queue FIFO |
| Apache Kafka | Event streaming | Event sourcing, high-throughput logs | Per-partition |
| AWS SQS | Managed queue | Serverless, AWS-native | Best-effort (FIFO available) |
| AWS SNS + SQS | Fan-out | One-to-many notification | Per-subscription |
| Redis Streams | Lightweight streaming | Simple streaming, existing Redis | Per-stream |
| Azure Service Bus | Managed broker | Enterprise, .NET integration | FIFO with sessions |
| Google Pub/Sub | Managed pub/sub | GCP-native, global distribution | Per-subscription |
| NATS | Lightweight messaging | Microservices, low-latency | Per-subject |
## Delivery Guarantees

| Guarantee | Meaning | Tradeoff |
|---|---|---|
| At-most-once | Message delivered 0 or 1 times | Fast, but may lose messages |
| At-least-once | Message delivered 1+ times | Reliable, but may duplicate |
| Exactly-once | Message delivered exactly 1 time | Expensive, complex, often impossible at scale |
## Making At-Least-Once Safe (Idempotency)

**Producer:**

```python
# Every message carries a stable idempotency key derived from the work item.
message = {
    "id": "order-12345-payment",  # idempotency key
    "action": "process_payment",
    "payload": {"order_id": 12345, "amount": 99.99},
}
```

**Consumer:**

```python
def handle(message):
    if already_processed(message["id"]):
        acknowledge(message)  # duplicate detected: ack and skip
    else:
        process(message["payload"])
        mark_processed(message["id"])
        acknowledge(message)
```
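The consumer side can be made concretely runnable by backing the dedupe check with a unique constraint, so checking and claiming the key is one atomic step. A minimal sketch using SQLite; the table name, helper names, and in-memory store are illustrative assumptions (production would use a shared, durable store):

```python
import sqlite3

# In-memory DB for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")

def handle(message, results):
    """Process a message at most once per idempotency key."""
    try:
        # INSERT fails on a duplicate key, so check-and-claim is atomic.
        db.execute("INSERT INTO processed (message_id) VALUES (?)",
                   (message["id"],))
    except sqlite3.IntegrityError:
        return "skipped"  # duplicate delivery: acknowledge and drop
    results.append(message["payload"])  # stand-in for real processing
    return "processed"

msg = {"id": "order-12345-payment", "payload": {"order_id": 12345, "amount": 99.99}}
results = []
print(handle(msg, results))  # processed
print(handle(msg, results))  # skipped (redelivered duplicate)
```

Note this sketch claims the key *before* processing; if processing can crash midway, the message is lost. Marking after processing, inside the same transaction as the side effects, trades that risk for the possibility of reprocessing.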
## Dead Letter Queue (DLQ) Design

```
Main Queue → Consumer attempts processing
    ↓ (success) → Acknowledge, done
    ↓ (failure) → Retry (up to N times)
    ↓ (max retries exceeded) → Dead Letter Queue
    ↓ → Alert, manual investigation, replay
```

| Configuration | Recommended Value | Rationale |
|---|---|---|
| Max retries | 3-5 | Enough for transient errors, not infinite loops |
| Retry backoff | Exponential (1s, 5s, 25s) | Gives downstream time to recover |
| DLQ retention | 14 days | Time for investigation and replay |
| DLQ alerting | Immediate on first message | Every DLQ message is a failure worth investigating |
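The flow above, with the recommended values, can be sketched as a consumer loop that retries with exponential backoff (1 s, 5 s, 25 s) and dead-letters on exhaustion. The function signature and the list standing in for the DLQ are illustrative assumptions:

```python
import time

MAX_RETRIES = 3
BACKOFF = [1, 5, 25]  # seconds: 5x exponential backoff between attempts

def consume(message, process, dead_letter_queue, sleep=time.sleep):
    """Attempt processing; retry with backoff; dead-letter on exhaustion."""
    for attempt in range(MAX_RETRIES):
        try:
            process(message)
            return "acknowledged"
        except Exception:
            if attempt < MAX_RETRIES - 1:
                sleep(BACKOFF[attempt])  # give downstream time to recover
    dead_letter_queue.append(message)  # alert + manual investigation from here
    return "dead-lettered"

# A handler that always fails ends up in the DLQ after 3 attempts.
dlq = []
def flaky(msg):
    raise RuntimeError("downstream unavailable")

print(consume({"id": "m1"}, flaky, dlq, sleep=lambda s: None))  # dead-lettered
print(len(dlq))  # 1
```

Injecting `sleep` keeps the retry schedule testable; real brokers (RabbitMQ, SQS) implement redelivery and DLQ routing for you, and this loop only illustrates the policy.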
## Consumer Patterns

| Pattern | Description | When to Use |
|---|---|---|
| Competing consumers | Multiple consumers on same queue | Scale processing horizontally |
| Fan-out | One message to multiple queues | Multiple services need same event |
| Request-reply | Response sent to reply queue | Async RPC |
| Saga | Sequence of queue-based steps with compensation | Distributed transactions |
| Priority queue | Higher-priority messages processed first | Mixed-urgency workloads |
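As one illustration, the competing-consumers pattern can be sketched with Python's thread-safe `queue.Queue`: several workers pull from the same queue, and each message goes to exactly one of them. The worker count and doubling "work" are assumptions for the example:

```python
import queue
import threading

work = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    # Each consumer competes for messages from the shared queue.
    while True:
        msg = work.get()
        if msg is None:  # sentinel: shut down this worker
            work.task_done()
            return
        with lock:
            results.append(msg * 2)  # stand-in for real processing
        work.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for i in range(10):
    work.put(i)
for _ in threads:
    work.put(None)
work.join()
for t in threads:
    t.join()
print(sorted(results))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Ordering across workers is not guaranteed (hence the `sorted`), which mirrors real competing consumers: horizontal scaling trades away cross-message ordering.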
## Monitoring Metrics

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| Queue depth | Messages waiting to be processed | Growing over time = consumer can’t keep up |
| Processing latency | Time from enqueue to dequeue | > SLA target |
| Consumer lag | How far behind the consumer is | Growing lag = falling behind |
| Error rate | Failed message processing | > 1% sustained |
| DLQ depth | Messages that failed permanently | > 0 (every DLQ message is an issue) |
| Throughput | Messages produced/consumed per second | Dropping below expected rate |
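A hedged sketch of how these thresholds might be evaluated in an alerting check; the metric names, sample shapes, and default limits are illustrative assumptions, not any particular monitoring system's API:

```python
def check_queue_health(samples, sla_latency_s=5.0, error_rate_limit=0.01):
    """Return a list of alert strings from one round of metric samples."""
    alerts = []
    depth = samples["queue_depth"]  # successive readings, oldest first
    if all(b > a for a, b in zip(depth, depth[1:])):
        alerts.append("queue depth growing: consumers can't keep up")
    if samples["processing_latency_s"] > sla_latency_s:
        alerts.append("processing latency above SLA")
    if samples["error_rate"] > error_rate_limit:
        alerts.append("error rate above 1%")
    if samples["dlq_depth"] > 0:
        alerts.append("messages in DLQ: investigate")
    return alerts

print(check_queue_health({
    "queue_depth": [100, 180, 260],   # growing: fires the depth alert
    "processing_latency_s": 2.0,      # within SLA
    "error_rate": 0.002,              # under 1%
    "dlq_depth": 3,                   # any DLQ message fires an alert
}))
```

The sample above fires two alerts (growing depth and non-empty DLQ), matching the thresholds in the table.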
## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| No DLQ | Failed messages silently disappear | Always configure DLQ with alerting |
| No idempotency | Duplicate processing causes data corruption | Idempotency keys on every message |
| Unbounded retries | Poison messages retry forever | Max retry count + exponential backoff + DLQ |
| Queue as database | Storing state in queue messages | Use a database for state, queue for events |
| Too large messages | Queue performance degrades | Store payload in S3/blob, pass reference in message |
| No monitoring | Silent queue failures | Monitor depth, lag, error rate, DLQ depth |
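The "too large messages" fix is often called the claim-check pattern: store the payload externally and enqueue only a reference. A minimal sketch with a dict standing in for S3/blob storage; the store, key format, and helper names are assumptions:

```python
import json
import uuid

blob_store = {}  # stand-in for S3 / Azure Blob / GCS

def enqueue_large(payload):
    """Store the payload externally; enqueue only a small reference."""
    key = f"payloads/{uuid.uuid4()}"
    blob_store[key] = json.dumps(payload)
    return {"payload_ref": key}  # the "claim check"

def dequeue_large(message):
    """Resolve the reference back to the full payload."""
    return json.loads(blob_store[message["payload_ref"]])

msg = enqueue_large({"report": "x" * 1_000_000})
print(len(json.dumps(msg)) < 200)        # True: the queued message stays tiny
print(len(dequeue_large(msg)["report"]))  # 1000000
```

The queue then carries only bytes-sized references regardless of payload size; lifecycle rules on the blob store (e.g. matching DLQ retention) keep orphaned payloads from accumulating.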
## Checklist

- [ ] DLQ configured on every queue, with alerting on the first message
- [ ] Idempotency key on every message; consumers deduplicate
- [ ] Retry limit (3-5 attempts) with exponential backoff
- [ ] Queue depth, consumer lag, error rate, and DLQ depth monitored
- [ ] Large payloads stored externally; only references passed in messages
- [ ] State kept in a database, not in queue messages
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For messaging architecture consulting, visit garnetgrid.com.
:::
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting
Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.