# Real-Time Streaming with Kafka: Architecture Guide
Design production Kafka architectures. Covers topic design, partitioning, consumer groups, exactly-once semantics, Kafka Connect, and operational best practices.
Apache Kafka is the backbone of real-time data infrastructure at companies processing millions of events per second. Netflix, LinkedIn, Uber, and thousands of enterprises rely on Kafka for event streaming, log aggregation, CDC (Change Data Capture), and real-time analytics.
But Kafka is infrastructure, not a solution — a misconfigured topic, a lagging consumer group, or a bad partitioning strategy can turn your real-time pipeline into a real-time bottleneck. This guide covers topic design, producer and consumer patterns, Kafka Connect, monitoring, and the operational knowledge needed to run Kafka in production without surprises.
## Core Concepts
```
Producers → Topics (partitioned, replicated) → Consumer Groups → Downstream Systems
                        │
                        ├── Partition 0: [msg1, msg4, msg7, ...] ── Consumer A
                        ├── Partition 1: [msg2, msg5, msg8, ...] ── Consumer B
                        └── Partition 2: [msg3, msg6, msg9, ...] ── Consumer C
```
| Concept | Description | Key Detail |
|---|---|---|
| Topic | Named stream of records (like a table) | Immutable append-only log |
| Partition | Ordered, immutable sequence within a topic | Unit of parallelism |
| Producer | Publishes records to topics | Controls partition assignment via key |
| Consumer Group | Set of consumers sharing work across partitions | Each partition assigned to exactly one consumer |
| Offset | Position of a consumer within a partition | Consumers track their own position |
| Broker | Kafka server that stores and serves data | Typically 3+ brokers per cluster |
| Replication Factor | Number of copies per partition | RF=3 means 3 brokers have the data |
| ISR (In-Sync Replicas) | Replicas that are caught up with the leader | Dropped ISR = potential data loss risk |
## When to Use Kafka vs Alternatives
| Use Case | Kafka | SQS/SNS | RabbitMQ | Redis Streams |
|---|---|---|---|---|
| High-throughput event streaming | ✅ Best | ❌ | ⚠️ | ⚠️ |
| Log aggregation | ✅ Best | ❌ | ❌ | ❌ |
| Simple task queue | ⚠️ Overkill | ✅ Best | ✅ | ✅ |
| CDC (Change Data Capture) | ✅ Best | ❌ | ❌ | ❌ |
| Request/reply pattern | ❌ Wrong tool | ⚠️ | ✅ Best | ✅ |
| Event replay/reprocessing | ✅ Best | ❌ | ❌ | ⚠️ |
## Topic Design

### Naming Convention

A consistent naming convention prevents confusion as topics multiply:

```
<domain>.<entity>.<event-type>
```
Examples:

```
orders.created           ← New order placed
orders.shipped           ← Order shipped
payments.processed       ← Payment completed
payments.failed          ← Payment failed
users.profile-updated    ← User profile changed
inventory.stock-changed  ← Stock level update
analytics.page-viewed    ← Page view event
cdc.public.orders        ← CDC from PostgreSQL orders table
```
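A small guard in CI or a topic-provisioning script can enforce the convention. The regex below is an assumption of this guide's convention, not a Kafka rule — it loosely accepts the two- and three-segment forms shown above:

```python
import re

# 2-3 dot-separated segments of lowercase words/hyphens, per this guide's convention.
TOPIC_NAME = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*(?:\.[a-z0-9]+(?:-[a-z0-9]+)*){1,2}$")

def is_valid_topic(name: str) -> bool:
    """Return True if the topic name follows the <domain>.<entity>.<event-type> convention."""
    return bool(TOPIC_NAME.match(name))
```

Rejecting names like `Orders.Created` or a bare `orders` at provisioning time is far cheaper than renaming a topic that consumers already depend on.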
### Partitioning Strategy
Partitioning determines parallelism and ordering. Choose your partition key carefully — you cannot change it later without creating a new topic.
```python
# Partition by entity ID — all events for an entity go to the same partition
producer.produce(
    topic="orders.created",
    key=order.customer_id.encode(),   # Partition key
    value=json.dumps(order).encode()
)
# This ensures:
# 1. All orders for customer_123 go to the SAME partition
# 2. Events for customer_123 are processed in ORDER
# 3. Different customers are distributed across partitions for parallelism
```
| Partition Count | Throughput | Ordering | When to Use |
|---|---|---|---|
| 1 | Low, but total ordering | Global order guaranteed | Strict sequential processing |
| 6-12 | Good for most workloads | Per-key ordering | Default starting point |
| 50-100 | High throughput | Per-key ordering | High-volume production |
| 500+ | Maximum throughput | Per-key ordering | Extreme scale (Netflix-level) |
Rule: Start with 6-12 partitions. You can increase partitions later, but be aware that increasing partitions redistributes keys — messages for the same key may land on different partitions after the change.
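The key-redistribution caveat can be illustrated with a toy partitioner. This sketch uses CRC32 for determinism — Kafka's default partitioner actually uses murmur2 — but the modulo effect is the same:

```python
from zlib import crc32

def toy_partition(key: bytes, num_partitions: int) -> int:
    """Illustrative only: Kafka's default partitioner uses murmur2, not CRC32."""
    return crc32(key) % num_partitions

keys = [f"customer_{i}".encode() for i in range(100)]

before = {k: toy_partition(k, 6) for k in keys}    # original topic: 6 partitions
after = {k: toy_partition(k, 12) for k in keys}    # topic expanded to 12 partitions

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys map to a different partition after expansion")
```

Any key that moves loses its per-key ordering guarantee across the boundary, which is why high-volume teams often over-provision partitions up front rather than expand later.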
## Producer Patterns

```python
import json
import logging

from confluent_kafka import Producer

log = logging.getLogger(__name__)

conf = {
    'bootstrap.servers': 'kafka-1:9092,kafka-2:9092,kafka-3:9092',
    'acks': 'all',                # Wait for all ISR replicas to acknowledge
    'enable.idempotence': True,   # Exactly-once per partition (prevents duplicates on retry)
    'max.in.flight.requests.per.connection': 5,  # Max allowed with idempotence
    'retries': 2147483647,        # Effectively infinite retries (idempotence handles dedup)
    'linger.ms': 5,               # Wait up to 5 ms to batch messages
    'batch.size': 32768,          # 32 KB batch size
    'compression.type': 'lz4',    # LZ4 for best speed; zstd for best ratio
}
producer = Producer(conf)

def on_delivery(err, msg):
    if err:
        log.error(f"Delivery failed for {msg.key()}: {err}")
        # Alert, retry, or send to a dead letter topic
    else:
        log.debug(f"Delivered to {msg.topic()}[{msg.partition()}] @ offset {msg.offset()}")

producer.produce(
    topic='orders.created',
    key=order_id.encode(),
    value=json.dumps(event).encode(),
    callback=on_delivery
)
producer.poll(0)   # Serve delivery callbacks
producer.flush()   # Block until all outstanding messages are delivered
```
### Producer Configuration Guide

| Setting | Default | Recommended | Why |
|---|---|---|---|
| `acks` | `1` | `all` | Prevents data loss on broker failure |
| `enable.idempotence` | `false` | `true` | Prevents duplicate messages on retry |
| `compression.type` | `none` | `lz4` or `zstd` | 40-60% bandwidth reduction |
| `linger.ms` | `0` | `5`-`50` | Batching improves throughput dramatically |
| `batch.size` | `16384` | `32768`-`65536` | Larger batches = fewer requests |
## Consumer Patterns

```python
import json

from confluent_kafka import Consumer

conf = {
    'bootstrap.servers': 'kafka-1:9092,kafka-2:9092,kafka-3:9092',
    'group.id': 'order-processor',
    'auto.offset.reset': 'earliest',   # Start from beginning if no offset is stored
    'enable.auto.commit': False,       # Manual commit for at-least-once processing
    'max.poll.interval.ms': 300000,    # 5 min max processing time per batch
    'session.timeout.ms': 45000,       # 45 s heartbeat timeout
    'fetch.min.bytes': 1024,           # Wait for at least 1 KB to reduce requests
}
consumer = Consumer(conf)
consumer.subscribe(['orders.created'])

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        handle_error(msg.error())
        continue
    try:
        event = json.loads(msg.value())
        process_order(event)
        consumer.commit(msg)   # Commit ONLY after successful processing
    except RetryableError as e:
        log.warning(f"Retryable error, will retry: {e}")
        # Don't commit — message will be reprocessed
    except Exception as e:
        send_to_dlq(msg, error=str(e))   # Dead letter queue for poison pills
        consumer.commit(msg)             # Commit to advance past the bad message
```
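The `send_to_dlq` helper used above is left undefined. One possible shape — the `.dlq` topic suffix and `dlq.*` header names are conventions assumed here, not Kafka built-ins, and the producer is passed in explicitly — is to republish the failed record with error metadata in headers:

```python
import time

def dlq_headers(source_topic, partition, offset, error):
    """Error-metadata headers for a dead-lettered record (header names are our convention)."""
    return [
        ('dlq.error', error.encode()),
        ('dlq.source.topic', source_topic.encode()),
        ('dlq.source.partition', str(partition).encode()),
        ('dlq.source.offset', str(offset).encode()),
        ('dlq.timestamp', str(int(time.time())).encode()),
    ]

def send_to_dlq(producer, msg, error):
    """Republish a poison-pill message to a parallel <topic>.dlq topic."""
    producer.produce(
        topic=f"{msg.topic()}.dlq",   # e.g. orders.created -> orders.created.dlq
        key=msg.key(),
        value=msg.value(),
        headers=dlq_headers(msg.topic(), msg.partition(), msg.offset(), error),
    )
    producer.flush()
```

Keeping the original key and value untouched means the DLQ topic can be reprocessed later with the same consumer code once the bug is fixed.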
### Consumer Group Scaling
| Consumers in Group | Partitions | Assignment | Notes |
|---|---|---|---|
| 3 consumers | 6 partitions | 2 partitions each | Balanced |
| 6 consumers | 6 partitions | 1 partition each | Maximum parallelism |
| 9 consumers | 6 partitions | 6 active, 3 idle | Wasted resources — max consumers = partition count |
## Kafka Connect

Pre-built connectors for moving data in and out of Kafka without writing custom code. Use Debezium for CDC (capturing database changes as Kafka events).

```json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.internal",
    "database.port": "5432",
    "database.dbname": "orders_db",
    "database.user": "kafka_connect",
    "topic.prefix": "cdc",
    "table.include.list": "public.orders,public.order_items",
    "slot.name": "kafka_connect_slot",
    "publication.name": "kafka_connect_pub",
    "transforms": "route",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "cdc.public.(.*)",
    "transforms.route.replacement": "orders.$1"
  }
}
```
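Connectors are registered by POSTing JSON like the above to the Connect REST API (`POST /connectors`). A minimal sketch — `connect.internal:8083` is a placeholder for your Connect worker address, and the trimmed config dict is illustrative:

```python
import json
import urllib.request

def register_connector(connect_url: str, connector_config: dict) -> urllib.request.Request:
    """Build the POST request that registers a connector with the Connect REST API."""
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=json.dumps(connector_config).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = register_connector(
    "http://connect.internal:8083",
    {"name": "postgres-source",
     "config": {"connector.class": "io.debezium.connector.postgresql.PostgresConnector"}},
)
# urllib.request.urlopen(req) would submit it; HTTP 201 means the connector was accepted.
```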
| Direction | Popular Connectors | Use Case |
|---|---|---|
| Source → Kafka | Debezium (CDC) | Real-time database replication |
| Source → Kafka | JDBC Source | Periodic database polling |
| Source → Kafka | S3 Source | File ingestion |
| Kafka → Sink | Elasticsearch | Search indexing |
| Kafka → Sink | S3 Sink | Data lake ingestion |
| Kafka → Sink | Snowflake/BigQuery | Data warehouse loading |
| Kafka → Sink | JDBC Sink | Database synchronization |
## Operational Best Practices

### Monitoring
| Metric | Alert Threshold | Severity | Tool |
|---|---|---|---|
| Consumer lag | > 10,000 messages | Warning | Burrow, Kafka Exporter |
| Consumer lag | > 100,000 messages | Critical | Burrow, Kafka Exporter |
| Under-replicated partitions | > 0 for > 5 min | Critical | JMX / Prometheus |
| Request latency (p99) | > 100ms | Warning | JMX / Prometheus |
| Disk usage per broker | > 75% | Warning | Node exporter |
| ISR (In-Sync Replicas) shrink | Any shrink event | Critical | JMX / Prometheus |
| Active controller count | != 1 | Critical | JMX / Prometheus |
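Consumer lag is the gap between a partition's latest offset (the high watermark) and the group's committed offset. The arithmetic is trivial; a sketch with the core computation as a pure function and the live `confluent_kafka` calls (which assume a reachable broker) shown as comments:

```python
def partition_lag(high_watermark: int, committed_offset: int) -> int:
    """Lag = newest offset in the partition minus the group's committed position."""
    if committed_offset < 0:      # No committed offset yet (e.g. OFFSET_INVALID)
        return high_watermark
    return high_watermark - committed_offset

# Live usage sketch (requires a broker):
# from confluent_kafka import Consumer, TopicPartition
# consumer = Consumer({'bootstrap.servers': '...', 'group.id': 'order-processor'})
# tp = TopicPartition('orders.created', 0)
# committed = consumer.committed([tp])[0].offset
# low, high = consumer.get_watermark_offsets(tp)
# print(partition_lag(high, committed))
```

In practice Burrow or Kafka Exporter computes this per partition and exposes it to Prometheus, so alerts fire on the thresholds in the table above rather than on ad-hoc scripts.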
### Retention & Storage

```properties
# Topic-level configuration
retention.ms=604800000        # 7 days (default for event topics)
retention.bytes=10737418240   # 10 GB per partition (cap for cost control)
cleanup.policy=delete         # Delete old segments after retention
# cleanup.policy=compact      # Keep latest value per key (changelog/state)
segment.ms=3600000            # Roll segments every hour
min.insync.replicas=2         # At least 2 replicas must acknowledge (with acks=all)
```
### Schema Management

Always use a schema registry (Confluent Schema Registry or Apicurio) with Avro or Protobuf schemas. Without it, a producer change can break every downstream consumer.

```
# Schema evolution rules
# BACKWARD compatible: new schema can read old data (add fields with defaults)
# FORWARD compatible: old schema can read new data (add optional fields)
# FULL compatible: both backward and forward (safest, recommended)
```
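For example, a BACKWARD-compatible change to an Avro schema adds the new field with a default, so readers on the new schema can still decode records written with the old one. The field names below are illustrative:

```python
# Avro record schema, v1
order_v1 = {
    "type": "record", "name": "OrderCreated",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
    ],
}

# BACKWARD-compatible v2: the added field carries a default,
# so v2 readers fill it in when decoding v1 records.
order_v2 = {
    "type": "record", "name": "OrderCreated",
    "fields": order_v1["fields"] + [
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}
```

Adding the same field without a `default` would break backward compatibility, and a registry configured with `BACKWARD` (or `FULL`) mode would reject the registration.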
## Checklist

- Topic naming convention established and documented
- Partition count set based on throughput needs (start 6-12)
- Replication factor ≥ 3 for all production topics
- `min.insync.replicas=2` configured
- Producer idempotence enabled (`enable.idempotence=true`)
- Consumer manual offset commit (not auto-commit)
- Dead letter queue (DLQ) configured for failed messages
- Consumer lag monitoring active with alerts
- Retention policy configured per topic (time + size limits)
- Schema registry deployed with compatibility mode set
- Broker disk space monitored with 75% alert threshold
- Cluster has at least 3 brokers for fault tolerance
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data engineering consulting, visit garnetgrid.com.
:::