
CDC Pipeline Architecture

Capture and stream database changes in real time using Change Data Capture. Covers Debezium setup, log-based CDC, the outbox pattern, event transformation, exactly-once delivery, and the patterns that turn database mutations into reliable event streams.

Change Data Capture (CDC) turns your database’s transaction log into a stream of events. Every INSERT, UPDATE, and DELETE becomes a message on a topic — without modifying application code, without dual writes, without polling. CDC is the foundation for real-time analytics, event-driven microservices, and cache invalidation.


CDC Approaches

Polling (Timestamp-based):
  How: Query WHERE updated_at > last_poll_time
  Pros: Simple, works on any database
  Cons: Misses deletes, high DB load, not real-time
  
  SELECT * FROM orders WHERE updated_at > '2024-01-15T10:00:00Z'
  -- Run every 30 seconds
  -- Miss: Hard deletes (row gone, can't find it)
  -- Miss: Updates within same second as last poll
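
The blind spots above are easy to demonstrate. A minimal sketch of timestamp polling against an in-memory stand-in for the orders table (the rows and timestamps here are made up for illustration):

```python
# In-memory stand-in for the orders table (illustrative data).
orders = {
    1: {"id": 1, "updated_at": "2024-01-15T09:59:00Z"},
    2: {"id": 2, "updated_at": "2024-01-15T10:00:30Z"},
}

def poll_changes(table, last_poll_iso):
    """Return rows updated after the last poll.

    A hard-deleted row simply disappears from the table, so no query
    against the table can ever report it -- the core blind spot of
    timestamp polling.
    """
    return [row for row in table.values() if row["updated_at"] > last_poll_iso]

changed = poll_changes(orders, "2024-01-15T10:00:00Z")
# Only row 2 comes back; a DELETE of row 1 would be invisible.
```

ISO-8601 strings in the same format compare correctly as plain strings, which keeps the sketch dependency-free.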

Trigger-based:
  How: Database triggers write changes to a shadow table
  Pros: Captures all operations including deletes
  Cons: Adds overhead to every write, complex maintenance
  
  CREATE TRIGGER order_cdc AFTER INSERT OR UPDATE OR DELETE
  ON orders FOR EACH ROW EXECUTE FUNCTION capture_change();

Log-based (Best):
  How: Read the database's transaction log (WAL, binlog)
  Pros: Zero overhead on writes, captures everything, real-time
  Cons: Requires DB-specific tooling, log retention management
  
  Tools: Debezium (most popular), AWS DMS, Fivetran, Airbyte
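
Log-based tools emit each change as a structured envelope with before/after row images and an operation code. A trimmed sketch of decoding one, following Debezium's documented payload layout (real events also carry a schema block and richer source metadata):

```python
import json

# A trimmed Debezium change event. op codes: c = create, u = update,
# d = delete, r = snapshot read.
raw = """
{
  "payload": {
    "op": "u",
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"},
    "source": {"table": "orders"}
  }
}
"""

OPS = {"c": "INSERT", "u": "UPDATE", "d": "DELETE", "r": "SNAPSHOT"}

def decode(event_json):
    """Return (operation, old row image, new row image) from an envelope."""
    payload = json.loads(event_json)["payload"]
    return OPS[payload["op"]], payload["before"], payload["after"]

op, before, after = decode(raw)
# op == "UPDATE"; "after" carries the new row image
```

Note that a DELETE event has `after: null`, which is exactly what trigger-free polling can never produce.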

Debezium Pipeline

# docker-compose.yml: Full CDC pipeline
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres
    command: >
      postgres
      -c wal_level=logical
      -c max_replication_slots=4
      -c max_wal_senders=4

  kafka:
    image: confluentinc/cp-kafka:7.5.0
    # ... Kafka config

  debezium:
    image: debezium/connect:2.5
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: "cdc-connect"
      CONFIG_STORAGE_TOPIC: "connect-configs"
      OFFSET_STORAGE_TOPIC: "connect-offsets"
      STATUS_STORAGE_TOPIC: "connect-status"

# Register PostgreSQL connector
# POST /connectors
connector_config:
  name: "orders-connector"
  config:
    connector.class: "io.debezium.connector.postgresql.PostgresConnector"
    database.hostname: "postgres"
    database.port: "5432"
    database.user: "postgres"
    database.password: "postgres"
    database.dbname: "mydb"
    topic.prefix: "cdc"
    table.include.list: "public.orders,public.customers"
    slot.name: "debezium_slot"
    publication.name: "debezium_pub"
    
    # Produce events to: cdc.public.orders, cdc.public.customers
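
The connector config above is registered by POSTing it as JSON to the Kafka Connect REST API. A stdlib-only sketch, assuming the Connect worker is reachable at localhost:8083 (the default REST port):

```python
import json
import urllib.request

# Mirrors the connector config shown above.
connector = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "postgres",
        "database.dbname": "mydb",
        "topic.prefix": "cdc",
        "table.include.list": "public.orders,public.customers",
        "slot.name": "debezium_slot",
        "publication.name": "debezium_pub",
    },
}

def register_connector(connect_url="http://localhost:8083"):
    """POST the connector config; Connect answers 201 Created on success."""
    req = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Re-registering an existing connector returns 409 Conflict; use `PUT /connectors/{name}/config` to update one in place.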

Outbox Pattern

# Problem: Dual write = inconsistency
# Save order to DB AND publish event = what if one fails?

# Solution: Outbox pattern — single write, CDC publishes

import json

class OrderService:
    def create_order(self, order_data):
        with self.db.transaction() as tx:
            # 1. Insert order
            order = tx.insert("orders", order_data)
            
            # 2. Insert event into outbox table (same transaction!)
            tx.insert("outbox", {
                "aggregate_type": "Order",
                "aggregate_id": order.id,
                "event_type": "OrderCreated",
                "payload": json.dumps({
                    "order_id": order.id,
                    "customer_id": order.customer_id,
                    "total": str(order.total),
                    "items": order.items,
                }),
            })
            
            # Both writes in same transaction = atomic
            # If either fails, both roll back
    
    # CDC reads the outbox table and publishes events to Kafka
    # Debezium connector watches: table.include.list = "public.outbox"
    # Publication is guaranteed: the event row is committed in the DB,
    # so Debezium will eventually emit it (at-least-once delivery)
    
    # After CDC publishes, a cleanup job deletes old outbox rows

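
Because Debezium (like Kafka delivery generally) is at-least-once, the same outbox event can arrive more than once, so the consumer side must deduplicate. A sketch keyed on the outbox row's id; `apply_event` is a hypothetical stand-in for real business logic, and the in-memory set stands in for a durable dedup store:

```python
processed = set()  # production: a persistent store, checked in the same txn

def apply_event(event):
    """Stand-in for real business logic (e.g. updating a read model)."""
    pass

def handle_outbox_event(event):
    """Apply an event at most once per outbox id, skipping redeliveries."""
    if event["id"] in processed:
        return False              # duplicate delivery: ignore
    apply_event(event)
    processed.add(event["id"])
    return True
```

Checking and recording the id in the same transaction as the business write is what makes the redelivered duplicate a true no-op.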
Anti-Patterns

Anti-Pattern                              | Consequence                            | Fix
Polling-based CDC                         | Misses deletes, high DB load, latency  | Log-based CDC (Debezium)
Dual writes (DB + queue)                  | Inconsistency when one fails           | Outbox pattern (single atomic write)
No schema registry                        | Schema changes break consumers         | Use Avro/Protobuf with schema registry
Unbounded WAL retention                   | Disk fills up, DB crashes              | Set WAL retention limits, monitor slot lag
Processing CDC events without idempotency | Duplicates cause data corruption       | Idempotent consumers with deduplication
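
On the WAL-retention point: PostgreSQL retains WAL for every replication slot until the consumer confirms it, so a stalled Debezium connector will grow disk usage without bound. A monitoring sketch; `run_query` is a hypothetical helper that executes SQL and returns rows, while the SQL itself uses PostgreSQL's real pg_replication_slots view:

```python
# SQL to measure retained WAL per replication slot (PostgreSQL 10+).
LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
"""

def slots_over_limit(rows, limit_bytes=1 << 30):
    """Given (slot_name, lag_bytes) rows, return slots retaining more WAL
    than the limit (default 1 GiB). None lag means no confirmed position yet.
    """
    return [name for name, lag in rows if lag is not None and lag > limit_bytes]

# Usage sketch: alerts = slots_over_limit(run_query(LAG_SQL))
```

Alerting on this, plus a `max_slot_wal_keep_size` cap on the server, keeps a stuck slot from taking the database down.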

CDC is the cleanest way to get data out of your database and into the rest of your system. It requires zero application code changes, captures every mutation, and produces events in real time — if you set it up correctly.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
