Change Data Capture (CDC) Patterns
Implement CDC for real-time data synchronization. Covers Debezium, log-based CDC, query-based CDC, outbox pattern, event sourcing, and CDC pipeline architecture.
Change Data Capture (CDC) captures row-level changes (inserts, updates, deletes) from databases and streams them as events. Instead of nightly batch exports that are stale by morning, CDC delivers changes in near real-time — enabling live dashboards, immediate cache invalidation, event-driven microservices, and synchronized data stores.
CDC Methods
| Method | How It Works | Latency | Impact on Source |
|---|---|---|---|
| Log-based | Read database transaction log (WAL/binlog) | Seconds | Minimal (no queries) |
| Query-based | Poll tables for changes (timestamp/version) | Minutes | Medium (queries load) |
| Trigger-based | Database triggers write to change table | Immediate | High (trigger overhead) |
| Application-level | Application emits events on write | Immediate | None on DB (code change) |
Debezium (Log-Based CDC)
// Debezium connector configuration for PostgreSQL
{
"name": "orders-cdc-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "cdc_user",
"database.password": "${CDC_PASSWORD}",
"database.dbname": "commerce",
"topic.prefix": "cdc",
"table.include.list": "public.orders,public.order_items,public.customers",
"plugin.name": "pgoutput",
"slot.name": "debezium_orders",
"publication.name": "dbz_publication",
"snapshot.mode": "initial",
"transforms": "route",
"transforms.route.type": "io.debezium.transforms.ByLogicalTableRouter",
"transforms.route.topic.regex": "cdc\\.commerce\\.(.*)",
"transforms.route.topic.replacement": "commerce.$1.events"
}
}
CDC Event Structure
{
"schema": { ... },
"payload": {
"before": {
"order_id": "ORD-123",
"status": "pending",
"amount": 99.99
},
"after": {
"order_id": "ORD-123",
"status": "shipped",
"amount": 99.99
},
"source": {
"version": "2.5.0",
"connector": "postgresql",
"name": "cdc",
"ts_ms": 1709318400000,
"db": "commerce",
"schema": "public",
"table": "orders",
"txId": 45678,
"lsn": 123456789
},
"op": "u",
"ts_ms": 1709318400100
}
}
Transactional Outbox Pattern
For microservices that need to update a database AND publish an event atomically:
# Instead of:
# 1. Update database
# 2. Publish to Kafka ← Can fail after DB commit!
# Use outbox:
# 1. Update database + insert into outbox table (same transaction)
# 2. CDC reads outbox table and publishes to Kafka
async def create_order(order):
async with database.transaction():
# Business operation
await database.execute(
"INSERT INTO orders (id, customer_id, amount, status) VALUES ($1, $2, $3, $4)",
order.id, order.customer_id, order.amount, "pending"
)
# Outbox entry (same transaction — atomic!)
await database.execute(
"""INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
VALUES ($1, $2, $3, $4, $5)""",
uuid4(), "Order", order.id, "OrderCreated",
json.dumps({"order_id": order.id, "amount": order.amount})
)
# Debezium CDC picks up the outbox insert and publishes to Kafka
# No dual-write problem!
CDC Pipeline Architecture
Source Databases
├── PostgreSQL (orders) ──┐
├── MySQL (inventory) ───┼── Debezium ──▶ Kafka Topics
└── MongoDB (users) ───┘ │
├──▶ Data Warehouse (Snowflake/BQ)
├──▶ Elasticsearch (search index)
├──▶ Redis (cache invalidation)
└──▶ Downstream microservices
CDC vs Alternatives
| Approach | Freshness | Complexity | Source Impact | Deletes Captured |
|---|---|---|---|---|
| Log-based CDC | Seconds | Medium | Minimal | ✅ |
| Batch ETL | Hours | Low | High (full scan) | ❌ (without soft-delete) |
| Query polling | Minutes | Low | Medium | ❌ |
| Application events | Immediate | High (code change) | None | ✅ |
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Dual writes | Update DB + publish event = inconsistency risk | Outbox pattern + CDC |
| Full table scans for sync | Expensive, slow, misses deletes | Log-based CDC |
| No schema evolution | CDC events break consumers on schema change | Schema registry + compatibility |
| No monitoring | Replication slot bloat, lag undetected | Monitor replication lag, slot size |
| Tight coupling | Consumers depend on exact DB schema | Transform CDC events into domain events |
Checklist
- CDC method selected (log-based recommended for production)
- Debezium or equivalent configured with correct plugin
- Replication slot monitored for lag and disk usage
- Outbox pattern for services needing atomic DB + event publish
- Schema evolution strategy for CDC events
- Dead letter queue for failed event processing
- Monitoring: CDC lag, connector status, error rates
- Snapshot strategy: initial load for new consumers
- Access control: CDC user has minimal required permissions
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For CDC consulting, visit garnetgrid.com. :::