
How to Implement Event-Driven Architecture

Design and build event-driven systems. Covers event sourcing, CQRS, message brokers, saga patterns, idempotency, and common pitfalls.

Event-driven architecture decouples services by communicating through events instead of direct calls. It enables scalability, resilience, and real-time processing — but only if you handle the complexity correctly. A well-implemented event-driven system is a resilient, scalable platform; a poorly implemented one is a debugging nightmare where events are lost, duplicated, or processed out of order.

This guide covers when to use event-driven architecture, how to choose a message broker, the patterns that make it work (event sourcing, CQRS, sagas), and the pitfalls that sink most implementations.


When to Use Event-Driven

✅ Good Fit

| Scenario | Example | Why EDA Works |
| --- | --- | --- |
| Decoupled microservices | Order placed → inventory, shipping, email all react independently | Services evolve independently |
| Real-time data processing | User action → analytics, recommendations update instantly | Low-latency, high-throughput |
| Audit trail / compliance | Every state change persisted as an immutable event | Complete history, regulatory compliance |
| Async workloads | File uploaded → resize, scan, optimize in background | Non-blocking, parallelizable |
| Multi-team systems | Team A produces events, Team B consumes without coordination | Loose coupling, team autonomy |

❌ Bad Fit

| Scenario | Why Not | Better Approach |
| --- | --- | --- |
| Simple CRUD apps | Over-engineering for 10 endpoints | Direct API calls |
| Synchronous queries | User expects an instant response, not "processing" | Request-response pattern |
| Strong consistency required | Events are eventually consistent by nature | Database transactions |
| Small team (< 5 devs) | Operational overhead exceeds value | Modular monolith |
| Prototyping / MVPs | Adds complexity when you need speed | Direct function calls |

Core Patterns

Event Sourcing

Store events as the source of truth instead of current state. Current state is derived by replaying events.

```python
# Store events, derive state
from datetime import datetime


class OrderEventStore:
    def __init__(self):
        self.events = []

    def append(self, event):
        event["timestamp"] = datetime.utcnow().isoformat()
        event["version"] = len(self.events) + 1
        self.events.append(event)

    def get_state(self, order_id):
        """Rebuild current state by replaying this order's events."""
        state = {"status": "unknown", "items": [], "total": 0}

        for event in self.events:
            if event["order_id"] != order_id:
                continue

            if event["type"] == "OrderCreated":
                state["status"] = "created"
                state["customer"] = event["customer_id"]
            elif event["type"] == "ItemAdded":
                state["items"].append(event["item"])
                state["total"] += event["price"]
            elif event["type"] == "OrderPaid":
                state["status"] = "paid"
            elif event["type"] == "OrderShipped":
                state["status"] = "shipped"

        return state
```

:::tip[When to Use Event Sourcing]
Use event sourcing when you need a complete audit trail, the ability to rebuild state at any point in time, or debugging of complex workflows by replaying events. Don't use it for simple CRUD — the overhead isn't worth it.
:::
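Rebuilding state as of any point in time falls out of the same replay loop used above — just stop replaying at a chosen version. A minimal sketch, with hypothetical events shaped like the ones `OrderEventStore` appends:

```python
# Hypothetical events with the version numbers the store assigns
events = [
    {"order_id": "ord-1", "type": "OrderCreated", "customer_id": "cust-9", "version": 1},
    {"order_id": "ord-1", "type": "ItemAdded", "item": "prod-1", "price": 29.99, "version": 2},
    {"order_id": "ord-1", "type": "ItemAdded", "item": "prod-2", "price": 10.00, "version": 3},
    {"order_id": "ord-1", "type": "OrderPaid", "version": 4},
]

def state_at(events, order_id, version):
    """Rebuild state as of a given event version -- the 'time travel'
    that event sourcing enables (illustrative helper, not in the store)."""
    state = {"status": "unknown", "items": [], "total": 0}
    for event in events:
        if event["order_id"] != order_id or event["version"] > version:
            continue
        if event["type"] == "OrderCreated":
            state["status"] = "created"
            state["customer"] = event["customer_id"]
        elif event["type"] == "ItemAdded":
            state["items"].append(event["item"])
            state["total"] += event["price"]
        elif event["type"] == "OrderPaid":
            state["status"] = "paid"
    return state

print(state_at(events, "ord-1", 2)["status"])  # created (before payment)
print(state_at(events, "ord-1", 4)["status"])  # paid
```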

CQRS (Command Query Responsibility Segregation)

Separate the write path (commands) from the read path (queries). Each is optimized independently.

```
Commands (Write)              Events              Queries (Read)
┌──────────────┐     ┌────────────────┐     ┌──────────────┐
│ CreateOrder  │────▶│ OrderCreated   │────▶│ Order List   │
│ AddItem      │     │ ItemAdded      │     │ (Denormalized│
│ Pay          │     │ OrderPaid      │     │  read model) │
│ Ship         │     │ OrderShipped   │     │              │
└──────────────┘     └────────────────┘     └──────────────┘
  (Write DB)           (Event Store)          (Read DB)
  Normalized           Append-only            Optimized for
  for writes           immutable              queries
```
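The two paths can be sketched in-process with plain dicts: a command handler appends an event, and a projector folds it into a query-optimized view. All names are illustrative; in production the projector consumes from a broker or outbox, asynchronously.

```python
events = []           # append-only event store (write side)
order_list_view = {}  # denormalized read model (read side)

def handle_create_order(order_id, customer_id):
    """Command handler: validate, then append an event."""
    event = {"type": "OrderCreated", "order_id": order_id,
             "customer_id": customer_id}
    events.append(event)
    project(event)  # synchronous here only for the sketch

def project(event):
    """Projector: fold events into the query-optimized view."""
    if event["type"] == "OrderCreated":
        order_list_view[event["order_id"]] = {
            "customer": event["customer_id"], "status": "created"}

handle_create_order("ord-1", "cust-9")
print(order_list_view["ord-1"])  # {'customer': 'cust-9', 'status': 'created'}
```

Queries never touch `events`; they read the view, which can live in a different database tuned for reads.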

When to Combine CQRS + Event Sourcing

| Pattern | Complexity | Use When |
| --- | --- | --- |
| Simple CRUD | Low | < 10K users, simple domain |
| CQRS only (no event sourcing) | Medium | Different read/write patterns, high read volume |
| Event sourcing only (no CQRS) | Medium | Audit trail needed, simple queries |
| CQRS + event sourcing | High | Financial systems, complex domains, regulatory requirements |

Step 1: Choose a Message Broker

| Broker | Throughput | Ordering | Retention | Best For | Managed Options |
| --- | --- | --- | --- | --- | --- |
| Apache Kafka | Very High | Partition-level | Days to forever | High-volume streaming, event sourcing | Confluent, MSK, Event Hubs |
| AWS SQS | High | FIFO available | 14 days max | Simple task queuing, decoupling | AWS-native |
| RabbitMQ | Medium-High | Per-queue | Until consumed | Complex routing (fanout, topic, headers) | CloudAMQP |
| AWS EventBridge | Medium | No guarantee | 24 hours | AWS event routing, serverless | AWS-native |
| Redis Streams | Very High | Per-stream | Configurable | Low-latency, simple streaming | ElastiCache |
| NATS | Very High | Per-subject | Configurable | Cloud-native, lightweight | Synadia |

Selection Criteria

```
Need to replay events from days/weeks ago?
├── Yes → Kafka (configurable retention)
└── No
    ├── Need complex routing (fanout, topic exchanges)?
    │   └── RabbitMQ
    ├── Simple task queue with at-least-once delivery?
    │   └── SQS (simplest to operate)
    ├── Ultra-low latency (< 1 ms)?
    │   └── Redis Streams or NATS
    └── AWS-native serverless integration?
        └── EventBridge
```

Step 2: Design Event Schemas

Events should be self-describing, versioned, and backward-compatible.

```json
{
  "event_id": "evt-550e8400-e29b",
  "event_type": "OrderCreated",
  "event_version": "1.2",
  "timestamp": "2025-02-15T14:30:00Z",
  "source": "order-service",
  "correlation_id": "req-abc-123",
  "data": {
    "order_id": "ord-789",
    "customer_id": "cust-456",
    "items": [
      {"product_id": "prod-1", "quantity": 2, "price": 29.99}
    ],
    "total": 59.98
  },
  "metadata": {
    "user_agent": "web",
    "ip_address": "192.168.1.1"
  }
}
```

Schema Evolution Rules

| Change Type | Backward Compatible? | Example |
| --- | --- | --- |
| Add optional field | ✅ Yes | Add `discount_code` to order event |
| Add required field | ❌ No | Add `shipping_address` (existing events don't have it) |
| Remove field | ❌ No | Remove `customer_email` |
| Rename field | ❌ No | Rename `total` to `order_total` |
| Change field type | ❌ No | Change `price` from int to string |
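The "add optional field" row looks like this from the consumer side: read the new field with a default so events written before the upgrade still process. `discount_code` and the discount rule here are illustrative:

```python
def apply_discount(event_data):
    """Consumer that tolerates both old and new event versions."""
    code = event_data.get("discount_code")  # absent in pre-upgrade events
    total = event_data["total"]
    if code == "WELCOME10":                 # illustrative discount rule
        return round(total * 0.9, 2)
    return total

old_event = {"total": 59.98}                                    # v1 schema
new_event = {"total": 59.98, "discount_code": "WELCOME10"}      # v2 schema

print(apply_discount(old_event))  # 59.98 -- old events still work
print(apply_discount(new_event))  # 53.98 -- new field honored
```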

Step 3: Implement Idempotent Consumers

```python
# Every event consumer MUST be idempotent:
# processing the same event twice must produce the same result.

class IdempotentConsumer:
    def __init__(self, db):
        self.db = db

    def process(self, event):
        event_id = event["id"]

        # Check if already processed (deduplication).
        # processed_events.event_id should carry a UNIQUE constraint
        # so concurrent consumers can't both pass this check.
        if self.db.execute(
            "SELECT 1 FROM processed_events WHERE event_id = %s",
            (event_id,)
        ).fetchone():
            print(f"Event {event_id} already processed, skipping")
            return

        # Process the event
        self._handle(event)

        # Mark as processed (same transaction as the business logic!)
        self.db.execute(
            "INSERT INTO processed_events (event_id, processed_at) VALUES (%s, NOW())",
            (event_id,)
        )
        self.db.commit()

    def _handle(self, event):
        if event["type"] == "OrderPaid":
            # Send confirmation email, update inventory, etc.
            pass
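Check-then-insert can still race when two consumers receive the same event concurrently. A sketch of an atomic variant that leans on a unique key, shown with sqlite3's `INSERT OR IGNORE` (the Postgres equivalent is `INSERT ... ON CONFLICT DO NOTHING`):

```python
import sqlite3

# In-memory stand-in for the consumer's database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")

def process_once(event):
    """Deduplicate and process in one atomic step."""
    cur = db.execute(
        "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
        (event["id"],))
    if cur.rowcount == 0:   # row already existed -> duplicate delivery
        return False
    # ... business logic here, committed with the dedup row ...
    db.commit()
    return True

print(process_once({"id": "evt-1"}))  # True  (first delivery)
print(process_once({"id": "evt-1"}))  # False (duplicate skipped)
```

Because the insert either claims the event ID or hits the primary-key conflict, there is no window where two workers both believe they are first.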

Step 4: Saga Pattern for Distributed Transactions

When a business process spans multiple services, use sagas instead of distributed transactions.

```python
# Orchestration saga for order processing
class SagaFailed(Exception):
    """Raised after all completed steps have been compensated."""


class OrderSaga:
    steps = [
        {"action": "reserve_inventory", "compensate": "release_inventory"},
        {"action": "charge_payment", "compensate": "refund_payment"},
        {"action": "create_shipment", "compensate": "cancel_shipment"},
        {"action": "send_confirmation", "compensate": "send_cancellation"},
    ]

    async def execute(self, order):
        completed = []
        for step in self.steps:
            try:
                await getattr(self, step["action"])(order)
                completed.append(step)
            except Exception as e:
                # Undo completed steps in reverse order
                for comp_step in reversed(completed):
                    await getattr(self, comp_step["compensate"])(order)
                raise SagaFailed(f"Step {step['action']} failed: {e}")
```
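To see the compensation path concretely, here is a runnable mini-saga with the same `execute` loop; the step bodies are illustrative, and the payment step fails on purpose so the inventory reservation gets rolled back:

```python
import asyncio

class SagaFailed(Exception):
    pass

class DemoSaga:
    steps = [
        {"action": "reserve_inventory", "compensate": "release_inventory"},
        {"action": "charge_payment", "compensate": "refund_payment"},
    ]

    def __init__(self):
        self.trace = []  # records side effects for inspection

    async def execute(self, order):
        completed = []
        for step in self.steps:
            try:
                await getattr(self, step["action"])(order)
                completed.append(step)
            except Exception as e:
                for comp in reversed(completed):
                    await getattr(self, comp["compensate"])(order)
                raise SagaFailed(f"Step {step['action']} failed: {e}")

    async def reserve_inventory(self, order):
        self.trace.append("reserved")

    async def charge_payment(self, order):
        raise RuntimeError("card declined")  # simulated failure

    async def release_inventory(self, order):
        self.trace.append("released")

saga = DemoSaga()
try:
    asyncio.run(saga.execute({"order_id": "ord-1"}))
except SagaFailed:
    print(saga.trace)  # ['reserved', 'released']
```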

Orchestration vs Choreography

| Approach | How It Works | Pros | Cons |
| --- | --- | --- | --- |
| Orchestration | Central coordinator tells each service what to do | Easy to understand, single source of truth | Single point of failure, coupling |
| Choreography | Each service reacts to events and publishes its own | Decoupled, no central coordinator | Hard to debug, implicit flow |

Rule of thumb: Use orchestration for < 5 steps, choreography for highly decoupled, independently evolving services.
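Choreography can be sketched with a tiny in-memory bus standing in for a real broker; each "service" here is just a subscribed handler that may publish its own follow-on event:

```python
from collections import defaultdict

handlers = defaultdict(list)  # event type -> subscribed handlers
log = []                      # published event types, in order

def subscribe(event_type, handler):
    handlers[event_type].append(handler)

def publish(event):
    log.append(event["type"])
    for handler in handlers[event["type"]]:
        handler(event)

# "Inventory service" reacts to OrderPaid, publishes its own event
subscribe("OrderPaid",
          lambda e: publish({"type": "InventoryReserved",
                             "order_id": e["order_id"]}))
# "Shipping service" reacts to InventoryReserved
subscribe("InventoryReserved",
          lambda e: publish({"type": "ShipmentCreated",
                             "order_id": e["order_id"]}))

publish({"type": "OrderPaid", "order_id": "ord-1"})
print(log)  # ['OrderPaid', 'InventoryReserved', 'ShipmentCreated']
```

Notice that no single component knows the full flow — which is exactly why choreographed systems need an event catalog and tracing to stay debuggable.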


Common Pitfalls

| Pitfall | Problem | Solution |
| --- | --- | --- |
| No idempotency | Duplicate charges, double emails | Deduplication by event ID (idempotency key) |
| No dead letter queue | Failed messages silently lost | Configure a DLQ on every queue, alert on DLQ depth |
| No schema versioning | New event version breaks all consumers | Schema registry (Confluent, Protobuf, Avro) |
| Eventual consistency surprise | Users see stale data and file support tickets | UI shows a "processing" state, polls for updates |
| Event ordering assumptions | Payment processed before order created | Use partition keys (same entity → same partition) |
| Unbounded event size | Large payloads slow down the broker | Store data in S3/DB, put a reference in the event |
| Missing correlation IDs | Can't trace a request across services | Include `correlation_id` in every event |
| Chat-based debugging | "What events did order X produce?" is unanswerable | Build an event catalog and search tooling |
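The ordering fix (same entity → same partition) boils down to deterministic key hashing, which mirrors what Kafka's default partitioner does with a message key. The hash function and partition count below are illustrative:

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative topic size

def partition_for(key):
    """Map an entity key to a partition deterministically."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every event keyed by ord-789 lands on the same partition, so
# OrderCreated is always consumed before OrderPaid for that order.
p1 = partition_for("ord-789")
p2 = partition_for("ord-789")
print(p1 == p2)  # True -- deterministic routing
```

The guarantee is per-partition only: events for different orders may still interleave, which is fine as long as no consumer assumes a global order.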

Monitoring Event-Driven Systems

| Metric | What to Watch | Alert Threshold |
| --- | --- | --- |
| Consumer lag | Messages behind real-time | > 1,000 messages or > 5 minutes |
| DLQ depth | Failed messages accumulating | > 0 (investigate every failure) |
| Processing latency | Time from event publish to consumer processing | > P95 SLA |
| Throughput | Events/second per topic | Sudden drop (> 50% from baseline) |
| Consumer group health | Active consumers per group | Fewer consumers than expected |
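Consumer lag, the first metric above, is just the gap between the broker's newest offsets and the group's committed offsets, summed per partition. A toy computation with made-up offset numbers:

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Total lag across partitions: newest offset minus committed offset."""
    return sum(latest_offsets[p] - committed_offsets.get(p, 0)
               for p in latest_offsets)

latest = {0: 1500, 1: 1480, 2: 1520}     # broker high-water marks
committed = {0: 1500, 1: 1100, 2: 1520}  # consumer group commits

print(consumer_lag(latest, committed))   # 380 -- partition 1 is behind
```

A lag that only ever grows means consumers can't keep up; a lag that spikes and drains is usually just bursty traffic.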

Event-Driven Architecture Checklist

  • Events defined with clear schema (name, version, payload, metadata)
  • Schema evolution strategy documented (backward compatibility rules)
  • Message broker selected, deployed, and load-tested
  • All consumers are idempotent (deduplication by event ID)
  • Dead letter queues configured on every queue/topic
  • Saga/compensation pattern for multi-step workflows
  • Correlation IDs propagated across all events
  • Monitoring: consumer lag, DLQ depth, processing latency
  • Event replay capability tested (can you reprocess from offset?)
  • Event catalog documented (producers, consumers, schemas)
  • Load testing completed with 2x expected peak traffic

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For architecture consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
