Real-Time Stream Processing
Build real-time data processing pipelines with Apache Kafka and Apache Flink. Covers event streams, windowing, exactly-once semantics, state management, and the patterns that make stream processing reliable at scale.
Stream processing handles data as it arrives rather than waiting for a batch to accumulate. When a user clicks, a sensor reports, or a transaction occurs, the event is processed within seconds — not hours. Fraud detection, real-time recommendations, and live dashboards all require stream processing.
Batch vs Stream
Batch Processing:
Collect data → Store → Process periodically
Latency: Minutes to hours
Example: "Process yesterday's transactions at 2 AM"
Tools: Spark, Hive, BigQuery
Stream Processing:
Process data as it arrives
Latency: Milliseconds to seconds
Example: "Flag fraudulent transaction in real-time"
Tools: Kafka Streams, Flink, Spark Structured Streaming
Lambda Architecture (hybrid):
Speed layer: Real-time approximate results
Batch layer: Correct results on delay
Serving layer: Merge both for queries
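The serving layer's merge is simple in principle. A minimal sketch, assuming hypothetical `batch_view` and `speed_view` lookup tables, where the batch layer is exact but stale and the speed layer covers events since its last run:
# Lambda serving layer: combine the stale-but-exact batch view with the
# fresh-but-approximate speed view. Both stores are hypothetical.
def merged_count(key, batch_view, speed_view):
    batch = batch_view.get(key, 0)   # exact count as of the last batch run
    recent = speed_view.get(key, 0)  # events processed since that run
    return batch + recent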
Kappa Architecture (stream-first):
Everything is a stream
Replay stream for reprocessing
No separate batch layer
Apache Kafka
# Kafka Producer: Publish events
from datetime import datetime
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    acks='all',               # Wait for all in-sync replicas (durability guarantee)
    retries=3,
    enable_idempotence=True,  # Broker drops duplicates caused by retries (requires a recent kafka-python)
)

def publish_order_event(order):
    producer.send(
        topic='order-events',
        key=order['id'].encode('utf-8'),  # Same key → same partition → per-key ordering
        value={
            'event_type': 'ORDER_CREATED',
            'order_id': order['id'],
            'customer_id': order['customer_id'],
            'total': str(order['total']),
            'timestamp': datetime.utcnow().isoformat(),
        },
    )
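Note that send() is asynchronous: kafka-python batches records in the background and returns a future. Call producer.flush() before shutdown so buffered events actually reach the brokers.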
# Kafka Consumer: Process events
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'order-events',
    bootstrap_servers=['kafka1:9092'],
    group_id='fraud-detection',
    auto_offset_reset='earliest',
    enable_auto_commit=False,  # Manual commit → at-least-once delivery
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    event = message.value
    result = detect_fraud(event)  # detect_fraud / publish_alert are application code
    if result.is_fraud:
        publish_alert(event, result)
    # Committing after processing means a crash before this line replays the
    # message on restart, so downstream effects must tolerate duplicates.
    consumer.commit()
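Manual commits give at-least-once delivery, not exactly-once: a crash between processing and commit redelivers the batch. A common way to make the results effectively-once is to deduplicate on a unique event id. A minimal sketch, assuming a `seen_event_ids` store (here an in-memory set; a real deployment would use something persistent and shared, such as Redis or RocksDB):
# Effectively-once on top of at-least-once: skip events already processed.
seen_event_ids = set()  # demo only; production needs a persistent, shared store

def process_once(event):
    event_id = event['order_id']  # assumes each event carries a unique id
    if event_id in seen_event_ids:
        return  # redelivered duplicate; already handled
    result = detect_fraud(event)
    if result.is_fraud:
        publish_alert(event, result)
    seen_event_ids.add(event_id)  # mark done only after side effects succeed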
Windowing
Tumbling Window:
Fixed-size, non-overlapping windows
[0-5min] [5-10min] [10-15min]
Use case: "Count orders per 5-minute window"
Sliding Window:
Fixed-size, overlapping windows with slide interval
[0-5min] [1-6min] [2-7min] (1-min slide)
Use case: "Average latency over 5-min sliding window"
Session Window:
Dynamic size, based on activity gaps
Events within gap threshold belong to same window
Use case: "Group user clickstream into sessions"
Global Window:
Single window for all data
Results fire only via custom triggers (the window never closes on its own)
Use case: "Count unique users across all time"
Apache Flink
// Flink: Real-time fraud detection
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// Enable exactly-once checkpointing (state snapshot every 60s)
env.enableCheckpointing(60000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

DataStream<Transaction> transactions = env
    .addSource(new FlinkKafkaConsumer<>(
        "transactions",
        new TransactionSchema(),
        kafkaProperties
    ))
    // Event-time windows need timestamps and watermarks to fire; this assumes
    // Transaction.getTimestamp() returns the event time in epoch millis.
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Transaction>forBoundedOutOfOrderness(Duration.ofSeconds(5))
            .withTimestampAssigner((tx, ts) -> tx.getTimestamp()));

// Pattern: flag if > $10,000 in 10 minutes from same account
DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getAccountId)
    .window(SlidingEventTimeWindows.of(
        Time.minutes(10),  // window size
        Time.minutes(1)    // slide interval
    ))
    .aggregate(new SumAggregator())
    .filter(sum -> sum.getTotal() > 10000)
    .map(sum -> new Alert(sum.getAccountId(), "HIGH_VALUE", sum.getTotal()));

alerts.addSink(new FlinkKafkaProducer<>("fraud-alerts", alertSchema, kafkaProperties));

env.execute("Fraud Detection Pipeline");
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No backpressure handling | Consumer overwhelmed, data loss | Rate limiting, buffering, scaling |
| At-most-once processing | Data loss on failure | Exactly-once with checkpointing |
| No dead letter queue | Failed events disappear | DLQ for retry and investigation |
| Unbounded state | Memory grows forever | State TTL, windows with eviction |
| No replay capability | Cannot reprocess historical data | Kafka with retention, replay from offset |
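As a concrete example of the dead-letter-queue row: a minimal sketch for the consumer above, assuming a hypothetical `order-events-dlq` topic and an application-level `handle` function, so events that keep failing are preserved with error context instead of vanishing or blocking the partition:
MAX_ATTEMPTS = 3

def process_with_dlq(event, producer):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handle(event)  # application-specific processing
            return
        except Exception as exc:
            last_error = str(exc)
    # All retries failed: park the event for later inspection and replay.
    producer.send('order-events-dlq', value={
        'original_event': event,
        'error': last_error,
        'attempts': MAX_ATTEMPTS,
    })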
Stream processing is event-driven architecture in motion. The data never stops flowing, and your system must keep up — reliably, durably, and at scale.