# How to Implement Event-Driven Architecture
Design and build event-driven systems. Covers event sourcing, CQRS, message brokers, saga patterns, idempotency, and common pitfalls.
Event-driven architecture decouples services by communicating through events instead of direct calls. It enables scalability, resilience, and real-time processing — but only if you handle the complexity correctly. The difference between a well-implemented event-driven system and a poorly implemented one is the difference between a resilient, scalable platform and a debugging nightmare where events are lost, duplicated, or processed out of order.
This guide covers when to use event-driven architecture, how to choose a message broker, the patterns that make it work (event sourcing, CQRS, sagas), and the pitfalls that sink most implementations.
## When to Use Event-Driven Architecture
### ✅ Good Fit
| Scenario | Example | Why EDA Works |
|---|---|---|
| Decoupled microservices | Order placed → inventory, shipping, email all react independently | Services evolve independently |
| Real-time data processing | User action → analytics, recommendations update instantly | Low-latency, high-throughput |
| Audit trail / compliance | Every state change persisted as an immutable event | Complete history, regulatory compliance |
| Async workloads | File uploaded → resize, scan, optimize in background | Non-blocking, parallelizable |
| Multi-team systems | Team A produces events, Team B consumes without coordination | Loose coupling, team autonomy |
### ❌ Bad Fit
| Scenario | Why Not | Better Approach |
|---|---|---|
| Simple CRUD apps | Over-engineering for 10 endpoints | Direct API calls |
| Synchronous queries | User expects instant response, not “processing” | Request-response pattern |
| Strong consistency required | Events are eventually consistent by nature | Database transactions |
| Small team (< 5 devs) | Operational overhead exceeds value | Modular monolith |
| Prototyping / MVPs | Adds complexity when you need speed | Direct function calls |
## Core Patterns
### Event Sourcing
Store events as the source of truth instead of current state. Current state is derived by replaying events.
```python
# Store events, derive state
from datetime import datetime, timezone

class OrderEventStore:
    def __init__(self):
        self.events = []

    def append(self, event):
        event["timestamp"] = datetime.now(timezone.utc).isoformat()
        event["version"] = len(self.events) + 1
        self.events.append(event)

    def get_state(self, order_id):
        """Rebuild current state from events."""
        state = {"status": "unknown", "items": [], "total": 0}
        for event in self.events:
            if event["order_id"] != order_id:
                continue
            if event["type"] == "OrderCreated":
                state["status"] = "created"
                state["customer"] = event["customer_id"]
            elif event["type"] == "ItemAdded":
                state["items"].append(event["item"])
                state["total"] += event["price"]
            elif event["type"] == "OrderPaid":
                state["status"] = "paid"
            elif event["type"] == "OrderShipped":
                state["status"] = "shipped"
        return state
```
:::tip[When to Use Event Sourcing] Use event sourcing when you need a complete audit trail, the ability to rebuild state at any point in time, or debugging complex workflows by replaying events. Don’t use it for simple CRUD — the overhead isn’t worth it. :::
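Replaying the full log gets slower as events accumulate. A common optimization, sketched here under the same hypothetical event shapes as the store above, is to snapshot state every N events and replay only the tail:

```python
import copy

def apply_event(state, event):
    """Apply one event to an order-state dict (same event types as above)."""
    if event["type"] == "OrderCreated":
        state["status"] = "created"
    elif event["type"] == "ItemAdded":
        state["items"].append(event["item"])
        state["total"] += event["price"]
    elif event["type"] == "OrderPaid":
        state["status"] = "paid"
    return state

def rebuild(events, snapshot=None):
    """Rebuild state from a snapshot (if any) plus the events after it."""
    if snapshot:
        state = copy.deepcopy(snapshot["state"])
        start = snapshot["version"]  # number of events already folded in
    else:
        state = {"status": "unknown", "items": [], "total": 0}
        start = 0
    for event in events[start:]:
        state = apply_event(state, event)
    return state
```

A background job can persist `{"version": n, "state": rebuild(events[:n])}` periodically; rebuilds then start from the latest snapshot instead of event #1.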
### CQRS (Command Query Responsibility Segregation)
Separate the write path (commands) from the read path (queries). Each is optimized independently.
```
Commands (Write)         Events               Queries (Read)
┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│ CreateOrder  │────▶│ OrderCreated   │────▶│ Order List   │
│ AddItem      │     │ ItemAdded      │     │ (Denormalized│
│ Pay          │     │ OrderPaid      │     │  read model) │
│ Ship         │     │ OrderShipped   │     │              │
└──────────────┘     └────────────────┘     └──────────────┘
  (Write DB)          (Event Store)          (Read DB)
  Normalized          Append-only            Optimized for
  for writes          immutable              queries
```
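The read side of the diagram above can be sketched as a projector that consumes events and maintains a denormalized view. This is a minimal in-memory sketch — the name `OrderListProjection` is illustrative, and a real system would persist the view in a read-optimized store:

```python
# Minimal CQRS read-model projector (in-memory sketch)
class OrderListProjection:
    def __init__(self):
        self.orders = {}  # order_id -> denormalized row

    def apply(self, event):
        data = event["data"]
        order_id = data["order_id"]
        if event["event_type"] == "OrderCreated":
            self.orders[order_id] = {
                "order_id": order_id,
                "customer_id": data["customer_id"],
                "status": "created",
                "total": 0,
            }
        elif event["event_type"] == "ItemAdded":
            self.orders[order_id]["total"] += data["price"]
        elif event["event_type"] == "OrderPaid":
            self.orders[order_id]["status"] = "paid"

    def list_orders(self):
        # Queries never touch the write model; they read the prebuilt view
        return list(self.orders.values())
```

The projector is the only code that bridges the two sides: commands produce events, the projector folds events into the read model, and queries hit the read model alone.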
### When to Combine CQRS + Event Sourcing
| Pattern | Complexity | Use When |
|---|---|---|
| Simple CRUD | Low | < 10K users, simple domain |
| CQRS only (no event sourcing) | Medium | Different read/write patterns, high read volume |
| Event sourcing only (no CQRS) | Medium | Audit trail needed, simple queries |
| CQRS + Event sourcing | High | Financial systems, complex domains, regulatory requirements |
## Step 1: Choose a Message Broker
| Broker | Throughput | Ordering | Retention | Best For | Managed Options |
|---|---|---|---|---|---|
| Apache Kafka | Very High | Partition-level | Days to forever | High-volume streaming, event sourcing | Confluent, MSK, Event Hubs |
| AWS SQS | High | FIFO available | 14 days max | Simple task queuing, decoupling | AWS-native |
| RabbitMQ | Medium-High | Per-queue | Until consumed | Complex routing (fanout, topic, headers) | CloudAMQP |
| AWS EventBridge | Medium | No guarantee | 24 hours | AWS event routing, serverless | AWS-native |
| Redis Streams | Very High | Per-stream | Configurable | Low-latency, simple streaming | ElastiCache |
| NATS | Very High | Per-subject | Configurable | Cloud-native, lightweight | Synadia |
### Selection Criteria
```
Need to replay events from days/weeks ago?
├── Yes → Kafka (configurable retention)
└── No
    ├── Need complex routing (fanout, topic exchanges)?
    │   └── RabbitMQ
    ├── Simple task queue with at-least-once delivery?
    │   └── SQS (simplest to operate)
    ├── Ultra-low latency (< 1 ms)?
    │   └── Redis Streams or NATS
    └── AWS-native serverless integration?
        └── EventBridge
```
## Step 2: Design Event Schemas
Events should be self-describing, versioned, and backward-compatible.
```json
{
  "event_id": "evt-550e8400-e29b",
  "event_type": "OrderCreated",
  "event_version": "1.2",
  "timestamp": "2025-02-15T14:30:00Z",
  "source": "order-service",
  "correlation_id": "req-abc-123",
  "data": {
    "order_id": "ord-789",
    "customer_id": "cust-456",
    "items": [
      {"product_id": "prod-1", "quantity": 2, "price": 29.99}
    ],
    "total": 59.98
  },
  "metadata": {
    "user_agent": "web",
    "ip_address": "192.168.1.1"
  }
}
```
### Schema Evolution Rules
| Change Type | Backward Compatible? | Example |
|---|---|---|
| Add optional field | ✅ Yes | Add discount_code to order event |
| Add required field | ❌ No | Add shipping_address (existing events don’t have it) |
| Remove field | ❌ No | Remove customer_email |
| Rename field | ❌ No | Rename total to order_total |
| Change field type | ❌ No | Change price from int to string |
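On the consumer side, these rules translate into a "tolerant reader": treat added optional fields as absent-by-default and check the event version before assuming a shape. A sketch, with field names following the example schema above:

```python
def parse_order_created(event):
    """Tolerant reader for OrderCreated events, versions 1.x."""
    major = int(str(event.get("event_version", "1.0")).split(".")[0])
    if major != 1:
        # Unknown major version: fail loudly (route to a DLQ) rather than guess
        raise ValueError(f"Unsupported OrderCreated version: {event['event_version']}")
    data = event["data"]
    return {
        "order_id": data["order_id"],        # required since 1.0
        "customer_id": data["customer_id"],  # required since 1.0
        # Optional field added in a later 1.x release:
        # default to None instead of raising KeyError on old events
        "discount_code": data.get("discount_code"),
    }
```

This is why "add optional field" is the only backward-compatible change in the table: a tolerant reader handles it with a default, while every other change breaks either old events or old consumers.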
## Step 3: Implement Idempotent Consumers
```python
# Every event consumer MUST be idempotent:
# processing the same event twice must produce the same result.
class IdempotentConsumer:
    def __init__(self, db):
        self.db = db

    def process(self, event):
        event_id = event["id"]
        # Check if already processed (deduplication). A unique constraint on
        # event_id guards against two workers passing this check concurrently.
        if self.db.execute(
            "SELECT 1 FROM processed_events WHERE event_id = %s",
            (event_id,)
        ).fetchone():
            print(f"Event {event_id} already processed, skipping")
            return
        # Process the event
        self._handle(event)
        # Mark as processed (same transaction as the business logic!)
        self.db.execute(
            "INSERT INTO processed_events (event_id, processed_at) VALUES (%s, NOW())",
            (event_id,)
        )
        self.db.commit()

    def _handle(self, event):
        if event["type"] == "OrderPaid":
            # Send confirmation email, update inventory, etc.
            pass
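The same deduplication idea in a runnable form, here using SQLite's unique primary key so the database, not application code, rejects duplicates. The consumer above assumes a Postgres-style driver; this sketch is illustrative:

```python
import sqlite3

def process_once(conn, event_id, handler):
    """Run handler at most once per event_id; return True if it ran."""
    try:
        # A PRIMARY KEY on event_id makes the duplicate insert fail atomically
        conn.execute(
            "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
        )
    except sqlite3.IntegrityError:
        return False  # already processed, skip
    handler()         # business logic runs in the same transaction
    conn.commit()
    return True
```

Setup is one statement: `conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")`. Inserting the marker *before* the handler, inside one transaction, closes the race between the existence check and the insert.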
## Step 4: Saga Pattern for Distributed Transactions
When a business process spans multiple services, use sagas instead of distributed transactions.
```python
# Orchestration saga for order processing
class SagaFailed(Exception):
    pass

class OrderSaga:
    # Each forward action has a compensating action that undoes it
    steps = [
        {"action": "reserve_inventory", "compensate": "release_inventory"},
        {"action": "charge_payment", "compensate": "refund_payment"},
        {"action": "create_shipment", "compensate": "cancel_shipment"},
        {"action": "send_confirmation", "compensate": "send_cancellation"},
    ]

    async def execute(self, order):
        completed = []
        for step in self.steps:
            try:
                # Action/compensate names map to coroutine methods
                # (reserve_inventory, release_inventory, ...) on a subclass
                await getattr(self, step["action"])(order)
                completed.append(step)
            except Exception as e:
                # Compensate completed steps in reverse order
                for comp_step in reversed(completed):
                    await getattr(self, comp_step["compensate"])(order)
                raise SagaFailed(f"Step {step['action']} failed: {e}") from e
```
### Orchestration vs. Choreography
| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| Orchestration | Central coordinator tells each service what to do | Easy to understand, single source of truth | Single point of failure, coupling |
| Choreography | Each service reacts to events and publishes its own | Decoupled, no central coordinator | Hard to debug, implicit flow |
Rule of thumb: Use orchestration for < 5 steps, choreography for highly decoupled, independently evolving services.
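Choreography can be sketched with a tiny in-memory event bus: each service subscribes to the events it cares about and publishes its own, with no central coordinator. Service names and event types here are illustrative:

```python
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
log = []

# Inventory service: reacts to OrderCreated, publishes InventoryReserved
bus.subscribe("OrderCreated",
              lambda e: (log.append("reserve"), bus.publish("InventoryReserved", e)))
# Payment service: reacts to InventoryReserved, publishes PaymentCharged
bus.subscribe("InventoryReserved",
              lambda e: (log.append("charge"), bus.publish("PaymentCharged", e)))
# Shipping service: reacts to PaymentCharged
bus.subscribe("PaymentCharged", lambda e: log.append("ship"))

bus.publish("OrderCreated", {"order_id": "ord-1"})
```

Note that the order flow (reserve → charge → ship) exists only implicitly in the subscriptions — exactly the "hard to debug, implicit flow" con in the table above.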
## Common Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| No idempotency | Duplicate charges, double emails | Deduplication by event ID (idempotency key) |
| No dead letter queue | Failed messages silently lost | Configure DLQ on every queue, alert on DLQ depth |
| No schema versioning | New event version breaks all consumers | Schema registry (Confluent, Protobuf, Avro) |
| Eventual consistency surprise | Users see stale data and file support tickets | UI shows “processing” state, polling for updates |
| Event ordering assumptions | Payment processed before order created | Use partition keys (same entity → same partition) |
| Unbounded event size | Large payloads slow down the broker | Store data in S3/DB, put reference in event |
| Missing correlation IDs | Can’t trace a request across services | Include correlation_id in every event |
| Chat-based debugging | “What events did order X produce?” is unanswerable | Build event catalog and search tooling |
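Correlation IDs only help if every consumer copies them onto the events it publishes in turn. A minimal propagation helper (a sketch; the `causation_id` field is an assumption beyond the schema shown earlier):

```python
import uuid

def make_event(event_type, data, cause=None):
    """Build an event, propagating correlation_id from the causing event."""
    return {
        "event_id": f"evt-{uuid.uuid4()}",
        "event_type": event_type,
        # New request: mint a correlation_id; downstream: copy it verbatim
        "correlation_id": cause["correlation_id"] if cause else f"req-{uuid.uuid4()}",
        # causation_id links each event to the one that directly triggered it
        "causation_id": cause["event_id"] if cause else None,
        "data": data,
    }
```

With both fields, `correlation_id` groups everything a single request produced, while the `causation_id` chain reconstructs the exact order of cause and effect.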
## Monitoring Event-Driven Systems
| Metric | What to Watch | Alert Threshold |
|---|---|---|
| Consumer lag | Messages behind real-time | > 1,000 messages or > 5 minutes |
| DLQ depth | Failed messages accumulating | > 0 (investigate every failure) |
| Processing latency | Time from event publish to consumer processing | > P95 SLA |
| Throughput | Events/second per topic | Sudden drop (> 50% from baseline) |
| Consumer group health | Active consumers per group | Fewer consumers than expected |
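Consumer lag, the first metric in the table, is just the gap between the newest offset in each partition and the consumer group's last committed offset. A broker-agnostic sketch of the computation and the alert check (function names are illustrative):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }

def lag_alert(lag_by_partition, threshold=1000):
    """Return partitions whose lag exceeds the alert threshold."""
    return [p for p, lag in lag_by_partition.items() if lag > threshold]
```

In practice the two offset maps come from your broker's admin API (e.g. Kafka's end offsets vs. the consumer group's committed offsets); the arithmetic stays the same regardless of broker.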
## Event-Driven Architecture Checklist
- Events defined with clear schema (name, version, payload, metadata)
- Schema evolution strategy documented (backward compatibility rules)
- Message broker selected, deployed, and load-tested
- All consumers are idempotent (deduplication by event ID)
- Dead letter queues configured on every queue/topic
- Saga/compensation pattern for multi-step workflows
- Correlation IDs propagated across all events
- Monitoring: consumer lag, DLQ depth, processing latency
- Event replay capability tested (can you reprocess from offset?)
- Event catalog documented (producers, consumers, schemas)
- Load testing completed with 2x expected peak traffic
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For architecture consulting, visit garnetgrid.com. :::