CDC Pipeline Architecture
Capture and stream database changes in real time using Change Data Capture. Covers Debezium setup, log-based CDC, the outbox pattern, event transformation, exactly-once delivery, and the patterns that turn database mutations into reliable event streams.
Change Data Capture (CDC) turns your database’s transaction log into a stream of events. Every INSERT, UPDATE, and DELETE becomes a message on a topic — without modifying application code, without dual writes, without polling. CDC is the foundation for real-time analytics, event-driven microservices, and cache invalidation.
CDC Approaches
Polling (Timestamp-based):
How: Query WHERE updated_at > last_poll_time
Pros: Simple, works on any database
Cons: Misses deletes, high DB load, not real-time
SELECT * FROM orders WHERE updated_at > '2024-01-15T10:00:00Z'
-- Run every 30 seconds
-- Miss: Hard deletes (row gone, can't find it)
-- Miss: Updates within same second as last poll
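The delete-miss problem above is easy to demonstrate. A minimal sketch using an in-memory SQLite table (the `orders` schema and timestamps here are illustrative): once a row is hard-deleted, no `WHERE updated_at > ...` query can ever surface that change.

```python
import sqlite3

# Timestamp-based polling sketch (illustrative schema and data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'paid', '2024-01-15T10:00:05Z')")
conn.execute("INSERT INTO orders VALUES (2, 'new',  '2024-01-15T09:59:00Z')")

def poll_changes(conn, last_poll_time):
    """Return rows modified after the last poll. Deleted rows never appear."""
    cur = conn.execute(
        "SELECT id, status FROM orders WHERE updated_at > ?", (last_poll_time,))
    return cur.fetchall()

changes = poll_changes(conn, "2024-01-15T10:00:00Z")  # only order 1 qualifies

# A hard delete leaves no row behind, so the change is silently lost:
conn.execute("DELETE FROM orders WHERE id = 2")
after_delete = poll_changes(conn, "2024-01-15T09:00:00Z")  # order 2 is gone
```

Log-based CDC avoids this entirely: the DELETE is recorded in the transaction log whether or not the row still exists.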
Trigger-based:
How: Database triggers write changes to a shadow table
Pros: Captures all operations including deletes
Cons: Adds overhead to every write, complex maintenance
CREATE TRIGGER order_cdc AFTER INSERT OR UPDATE OR DELETE
ON orders FOR EACH ROW EXECUTE FUNCTION capture_change();
Log-based (Best):
How: Read the database's transaction log (WAL, binlog)
Pros: Zero overhead on writes, captures everything, real-time
Cons: Requires DB-specific tooling, log retention management
Tools: Debezium (most popular), AWS DMS, Fivetran, Airbyte
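Log-based tools like Debezium emit each change as an envelope with `before`, `after`, `op`, and `source` fields (those field names follow Debezium's documented event format; the payload values below are made up). A sketch of decoding one:

```python
import json

# A Debezium-style change event for an UPDATE (values are illustrative)
raw = json.dumps({
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"},
    "op": "u",  # c=create, u=update, d=delete, r=snapshot read
    "source": {"table": "orders", "lsn": 123456},
})

def classify(event: dict) -> str:
    """Map Debezium's single-letter op code to a readable operation name."""
    return {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}[event["op"]]

event = json.loads(raw)
operation = classify(event)
# Diff before/after to see which columns actually changed
changed = {k: v for k, v in event["after"].items()
           if event["before"] and event["before"].get(k) != v}
```

Having both `before` and `after` images is what makes downstream diffing, auditing, and delete handling possible without touching the source database.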
Debezium Pipeline
# docker-compose.yml: Full CDC pipeline
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres
    command: >
      postgres
      -c wal_level=logical
      -c max_replication_slots=4
      -c max_wal_senders=4
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    # ... Kafka config
  debezium:
    image: debezium/connect:2.5
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: "cdc-connect"
      CONFIG_STORAGE_TOPIC: "connect-configs"
      OFFSET_STORAGE_TOPIC: "connect-offsets"
      STATUS_STORAGE_TOPIC: "connect-status"
# Register the PostgreSQL connector
# POST /connectors (Kafka Connect REST API)
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "mydb",
    "topic.prefix": "cdc",
    "table.include.list": "public.orders,public.customers",
    "slot.name": "debezium_slot",
    "publication.name": "debezium_pub"
  }
}
# Produces events to: cdc.public.orders, cdc.public.customers
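Once events land on `cdc.public.orders`, a downstream service consumes them. A sketch of one common use case, cache invalidation, with the event-handling logic separated from the consumer loop (the consumer loop itself is hypothetical and shown only in comments, since it needs a running broker):

```python
import json

def handle_change(msg_value: bytes, cache: dict) -> None:
    """Apply one Debezium change event to a local cache:
    deletes evict, everything else refreshes."""
    event = json.loads(msg_value)
    key = (event["after"] or event["before"])["id"]
    if event["op"] == "d":
        cache.pop(key, None)          # delete -> evict
    else:
        cache[key] = event["after"]   # insert/update/snapshot -> refresh

cache = {}
handle_change(json.dumps({"op": "c", "before": None,
                          "after": {"id": 1, "status": "new"}}).encode(), cache)
handle_change(json.dumps({"op": "d", "before": {"id": 1, "status": "new"},
                          "after": None}).encode(), cache)

# In production this runs inside a Kafka consumer loop, e.g. (sketch):
# for msg in consumer:              # subscribed to "cdc.public.orders"
#     handle_change(msg.value, cache)
```

Keeping the handler pure (bytes in, cache mutation out) makes it trivial to unit-test without a broker.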
Outbox Pattern
# Problem: Dual write = inconsistency
# Save order to DB AND publish event = what if one fails?
# Solution: Outbox pattern — single write, CDC publishes
import json

class OrderService:
    def create_order(self, order_data):
        with self.db.transaction() as tx:
            # 1. Insert order
            order = tx.insert("orders", order_data)
            # 2. Insert event into outbox table (same transaction!)
            tx.insert("outbox", {
                "aggregate_type": "Order",
                "aggregate_id": order.id,
                "event_type": "OrderCreated",
                "payload": json.dumps({
                    "order_id": order.id,
                    "customer_id": order.customer_id,
                    "total": str(order.total),
                    "items": order.items,
                }),
            })
            # Both writes in same transaction = atomic
            # If either fails, both roll back

# CDC reads the outbox table and publishes events to Kafka
# Debezium connector watches: table.include.list = "public.outbox"
# Publication is guaranteed: the event is durably committed in the DB,
# so CDC will eventually deliver it
# After CDC publishes, a cleanup job deletes old outbox rows
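The atomicity claim is concrete and testable. A runnable sketch of the write path using SQLite (the schema is illustrative; `sqlite3`'s `with conn:` block commits on success and rolls back on exception, standing in for the app's transaction wrapper):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         aggregate_type TEXT, aggregate_id INTEGER,
                         event_type TEXT, payload TEXT);
""")

def create_order(conn, customer_id, total):
    """Insert the order and its OrderCreated event in ONE transaction."""
    with conn:  # commit on success, roll back on exception
        cur = conn.execute(
            "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
            (customer_id, total))
        order_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            ("Order", order_id, "OrderCreated",
             json.dumps({"order_id": order_id,
                         "customer_id": customer_id,
                         "total": total})))
    return order_id

order_id = create_order(conn, customer_id=7, total="99.90")
rows = conn.execute("SELECT aggregate_id, event_type FROM outbox").fetchall()
```

If the outbox INSERT fails, the order INSERT rolls back with it: there is never an order without its event, or an event without its order.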
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Polling-based CDC | Misses deletes, high DB load, latency | Log-based CDC (Debezium) |
| Dual writes (DB + queue) | Inconsistency when one fails | Outbox pattern (single atomic write) |
| No schema registry | Schema changes break consumers | Use Avro/Protobuf with schema registry |
| Unbounded WAL retention | Disk fills up, DB crashes | Set WAL retention limits, monitor slot lag |
| Process CDC events without idempotency | Duplicates cause data corruption | Idempotent consumers with deduplication |
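The last row of the table deserves a sketch: log-based CDC delivers at-least-once, so the same event can arrive twice, and the consumer must deduplicate by a stable event ID (here the outbox row ID; a `(topic, partition, offset)` triple or an event UUID also works). The in-memory set below is for brevity; a production consumer would check a durable dedup store in the same transaction as the side effect.

```python
# Idempotent-consumer sketch: skip events whose ID was already processed.
# In-memory state for illustration only.
processed: set = set()
applied: list = []

def process_once(event: dict) -> bool:
    """Apply the event only if its ID is unseen; return True if applied."""
    event_id = event["id"]
    if event_id in processed:
        return False  # duplicate delivery from at-least-once CDC: skip
    applied.append(event["payload"])  # the actual side effect
    processed.add(event_id)
    return True

first = process_once({"id": 1, "payload": 100})
duplicate = process_once({"id": 1, "payload": 100})  # redelivered event
```

With this guard, replaying a topic from an earlier offset (a routine operation after consumer failures) is harmless rather than corrupting.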
CDC is the cleanest way to get data out of your database and into the rest of your system. It requires zero application code changes, captures every mutation, and produces events in real time — if you set it up correctly.