Real-Time Stream Processing
Build real-time data processing pipelines with Apache Kafka and Apache Flink. Covers event streams, windowing, exactly-once semantics, state management, and the patterns that make stream processing reliable at scale.
Stream processing handles data as it arrives rather than waiting for a batch to accumulate. When a user clicks, a sensor reports, or a transaction occurs, the event is processed within seconds — not hours. Fraud detection, real-time recommendations, and live dashboards all require stream processing.
Batch vs Stream
Batch Processing:
Collect data → Store → Process periodically
Latency: Minutes to hours
Example: "Process yesterday's transactions at 2 AM"
Tools: Spark, Hive, BigQuery
Stream Processing:
Process data as it arrives
Latency: Milliseconds to seconds
Example: "Flag fraudulent transaction in real-time"
Tools: Kafka Streams, Flink, Spark Structured Streaming
Lambda Architecture (hybrid):
Speed layer: Real-time approximate results
Batch layer: Correct results on delay
Serving layer: Merge both for queries
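The serving layer's merge is simple in principle. A minimal sketch, assuming hypothetical `batch_view` and `speed_view` lookup tables, where the batch layer is exact but stale and the speed layer covers events since its last run:
# Lambda serving layer: combine the stale-but-exact batch view with the
# fresh-but-approximate speed view. Both stores are hypothetical.
def merged_count(key, batch_view, speed_view):
    batch = batch_view.get(key, 0)   # exact count as of the last batch run
    recent = speed_view.get(key, 0)  # events processed since that run
    return batch + recent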
Kappa Architecture (stream-first):
Everything is a stream
Replay stream for reprocessing
No separate batch layer
Apache Kafka
# Kafka Producer: Publish events
from datetime import datetime
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    acks='all',               # Wait for all in-sync replicas (durability guarantee)
    retries=3,
    enable_idempotence=True,  # Broker drops duplicates caused by retries (requires a recent kafka-python)
)

def publish_order_event(order):
    producer.send(
        topic='order-events',
        key=order['id'].encode('utf-8'),  # Same key → same partition → per-key ordering
        value={
            'event_type': 'ORDER_CREATED',
            'order_id': order['id'],
            'customer_id': order['customer_id'],
            'total': str(order['total']),
            'timestamp': datetime.utcnow().isoformat(),
        },
    )
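Note that send() is asynchronous: kafka-python batches records in the background and returns a future. Call producer.flush() before shutdown so buffered events actually reach the brokers.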
# Kafka Consumer: Process events
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'order-events',
    bootstrap_servers=['kafka1:9092'],
    group_id='fraud-detection',
    auto_offset_reset='earliest',
    enable_auto_commit=False,  # Manual commit → at-least-once delivery
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

for message in consumer:
    event = message.value
    result = detect_fraud(event)  # detect_fraud / publish_alert are application code
    if result.is_fraud:
        publish_alert(event, result)
    # Committing after processing means a crash before this line replays the
    # message on restart, so downstream effects must tolerate duplicates.
    consumer.commit()
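Manual commits give at-least-once delivery, not exactly-once: a crash between processing and commit redelivers the batch. A common way to make the results effectively-once is to deduplicate on a unique event id. A minimal sketch, assuming a `seen_event_ids` store (here an in-memory set; a real deployment would use something persistent and shared, such as Redis or RocksDB):
# Effectively-once on top of at-least-once: skip events already processed.
seen_event_ids = set()  # demo only; production needs a persistent, shared store

def process_once(event):
    event_id = event['order_id']  # assumes each event carries a unique id
    if event_id in seen_event_ids:
        return  # redelivered duplicate; already handled
    result = detect_fraud(event)
    if result.is_fraud:
        publish_alert(event, result)
    seen_event_ids.add(event_id)  # mark done only after side effects succeed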
Windowing
Tumbling Window:
Fixed-size, non-overlapping windows
[0-5min] [5-10min] [10-15min]
Use case: "Count orders per 5-minute window"
Sliding Window:
Fixed-size, overlapping windows with slide interval
[0-5min] [1-6min] [2-7min] (1-min slide)
Use case: "Average latency over 5-min sliding window"
Session Window:
Dynamic size, based on activity gaps
Events within gap threshold belong to same window
Use case: "Group user clickstream into sessions"
Global Window:
Single window for all data
Results fire only via custom triggers (the window never closes on its own)
Use case: "Count unique users across all time"
Apache Flink
// Flink: Real-time fraud detection
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// Enable exactly-once checkpointing (state snapshot every 60s)
env.enableCheckpointing(60000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

DataStream<Transaction> transactions = env
    .addSource(new FlinkKafkaConsumer<>(
        "transactions",
        new TransactionSchema(),
        kafkaProperties
    ))
    // Event-time windows need timestamps and watermarks to fire; this assumes
    // Transaction.getTimestamp() returns the event time in epoch millis.
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Transaction>forBoundedOutOfOrderness(Duration.ofSeconds(5))
            .withTimestampAssigner((tx, ts) -> tx.getTimestamp()));

// Pattern: flag if > $10,000 in 10 minutes from same account
DataStream<Alert> alerts = transactions
    .keyBy(Transaction::getAccountId)
    .window(SlidingEventTimeWindows.of(
        Time.minutes(10),  // window size
        Time.minutes(1)    // slide interval
    ))
    .aggregate(new SumAggregator())
    .filter(sum -> sum.getTotal() > 10000)
    .map(sum -> new Alert(sum.getAccountId(), "HIGH_VALUE", sum.getTotal()));

alerts.addSink(new FlinkKafkaProducer<>("fraud-alerts", alertSchema, kafkaProperties));

env.execute("Fraud Detection Pipeline");
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No backpressure handling | Consumer overwhelmed, data loss | Rate limiting, buffering, scaling |
| At-most-once processing | Data loss on failure | Exactly-once with checkpointing |
| No dead letter queue | Failed events disappear | DLQ for retry and investigation |
| Unbounded state | Memory grows forever | State TTL, windows with eviction |
| No replay capability | Cannot reprocess historical data | Kafka with retention, replay from offset |
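As a concrete example of the dead-letter-queue row: a minimal sketch for the consumer above, assuming a hypothetical `order-events-dlq` topic and an application-level `handle` function, so events that keep failing are preserved with error context instead of vanishing or blocking the partition:
MAX_ATTEMPTS = 3

def process_with_dlq(event, producer):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handle(event)  # application-specific processing
            return
        except Exception as exc:
            last_error = str(exc)
    # All retries failed: park the event for later inspection and replay.
    producer.send('order-events-dlq', value={
        'original_event': event,
        'error': last_error,
        'attempts': MAX_ATTEMPTS,
    })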
Stream processing is event-driven architecture in motion. The data never stops flowing, and your system must keep up — reliably, durably, and at scale.