Streaming Data Architecture
Design real-time data pipelines that process events as they occur. Covers stream processing frameworks, exactly-once semantics, windowing, stateful processing, and the patterns that make streaming architecture production-ready.
Batch processing answers questions about the past. Streaming answers questions about the present. When your business needs to detect fraud as it happens, update dashboards in real-time, or trigger alerts within seconds, batch processing is too slow. Streaming architecture processes events as they arrive, enabling sub-second response times.
Streaming vs Batch
Batch Processing:
Collect all orders from today → Process overnight → Dashboard updated at 6 AM
Latency: Hours
Use case: Reporting, analytics, ML training
Stream Processing:
Each order processed immediately → Dashboard updated in real-time
Latency: Milliseconds to seconds
Use case: Fraud detection, real-time analytics, alerting
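The contrast can be sketched in a few lines of plain Java (hypothetical order amounts, no framework): a batch job folds the complete dataset after collection, while a stream processor updates its running result on every event.

```java
import java.util.List;

public class BatchVsStream {
    // Batch: process the complete dataset once, after collection ends
    static double batchTotal(List<Double> ordersFromToday) {
        return ordersFromToday.stream().mapToDouble(Double::doubleValue).sum();
    }

    // Streaming: running state updated as each event arrives;
    // the "dashboard" value is current after every single call
    static double runningTotal = 0.0;

    static double onOrder(double amount) {
        runningTotal += amount;
        return runningTotal; // available immediately, not at 6 AM
    }

    public static void main(String[] args) {
        List<Double> day = List.of(10.0, 25.0, 5.0);
        System.out.println("batch total: " + batchTotal(day)); // one answer, after the fact
        for (double amt : day) {
            System.out.println("running total: " + onOrder(amt)); // an answer per event
        }
    }
}
```

Both arrive at the same number; the difference is *when* each intermediate answer exists.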
Lambda Architecture (both):
Batch layer: Historical accuracy (reprocessable)
Speed layer: Real-time approximation
Serving layer: Merge batch + speed
Kappa Architecture (streaming only):
Everything is a stream
Historical data = replay the stream
One pipeline to maintain, not two
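Kappa's "historical data = replay the stream" idea can be sketched without a framework: the append-only log is the source of truth, and any state is just a fold over it from offset zero (hypothetical integer events; a real system would replay from a Kafka topic).

```java
import java.util.ArrayList;
import java.util.List;

public class KappaReplay {
    // The append-only log is the single source of truth
    static final List<Integer> log = new ArrayList<>();

    static void append(int event) {
        log.add(event);
    }

    // Rebuilding state = replaying the stream from offset 0.
    // A new consumer, a bug fix, or a schema change all start here,
    // instead of in a separate batch pipeline.
    static int replayState() {
        int state = 0;
        for (int event : log) {
            state += event; // apply each event in order
        }
        return state;
    }
}
```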
Stream Processing Frameworks
| Framework | Best For | Latency | Complexity |
|---|---|---|---|
| Apache Kafka Streams | JVM-native, stateful processing | Milliseconds | Medium |
| Apache Flink | Complex event processing, large scale | Milliseconds | High |
| Apache Spark Structured Streaming | Unified batch + stream | Sub-second | Medium |
| Amazon Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) | AWS-native, managed | Sub-second | Low |
| Dataflow (Google) | GCP-native, auto-scaling | Sub-second | Medium |
Windowing
Window Types
Tumbling Window (fixed, non-overlapping):
[0:00 - 0:05] [0:05 - 0:10] [0:10 - 0:15]
Use: 5-minute aggregations
Sliding Window (fixed, overlapping):
[0:00 - 0:05] [0:01 - 0:06] [0:02 - 0:07]
Use: Moving averages
Session Window (gap-based):
[login...click...click...[5 min gap]...click...logout]
Use: User session analytics
Hopping Window (Kafka Streams' term for a sliding window with a fixed advance):
Size: 10 min, hop: 5 min → [0:00-0:10] [0:05-0:15]
Use: Overlapping aggregations
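Window assignment itself is just integer arithmetic on the event timestamp. A sketch with millisecond timestamps (this mirrors what frameworks like Flink do internally, not their actual code):

```java
public class Windows {
    // Start of the tumbling window containing `timestamp`.
    // Every event maps to exactly one window: [start, start + size)
    static long tumblingWindowStart(long timestamp, long sizeMs) {
        return timestamp - (timestamp % sizeMs);
    }

    // A sliding window of `size` advancing by `slide` assigns each event
    // to size/slide overlapping windows; this returns the earliest start.
    static long firstSlidingWindowStart(long timestamp, long sizeMs, long slideMs) {
        long lastStart = timestamp - (timestamp % slideMs);
        return lastStart - sizeMs + slideMs;
    }
}
```

For example, with a 10-minute sliding window advancing every 5 minutes, an event at minute 7 belongs to both [0:00-0:10] and [0:05-0:15].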
Flink Windowing Example
DataStream<OrderEvent> orders = env.addSource(kafkaConsumer);
orders
.keyBy(OrderEvent::getCustomerId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(new OrderAggregator())
.addSink(dashboardSink);
// For late events
orders
.keyBy(OrderEvent::getCustomerId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.allowedLateness(Time.minutes(1)) // Accept events up to 1 min late
.sideOutputLateData(lateOutputTag) // Route late data separately
.aggregate(new OrderAggregator());
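Behind allowedLateness is a simple rule: a window's state is retained until the watermark passes windowEnd + allowedLateness. A late event arriving while state is retained triggers an updated result; after that, it goes to the side output. A sketch of the decision (an illustration, not Flink's actual code):

```java
public class Lateness {
    // Classify an event for a window ending at `windowEnd`,
    // given the current watermark and the allowed lateness.
    static String classify(long watermark, long windowEnd, long allowedLatenessMs) {
        if (watermark < windowEnd) {
            return "on-time";        // window has not fired yet
        } else if (watermark < windowEnd + allowedLatenessMs) {
            return "late-accepted";  // window re-fires with an updated result
        } else {
            return "side-output";    // state is gone; routed via lateOutputTag
        }
    }
}
```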
Stateful Processing
// Flink: fraud detection with keyed state and timers
public class FraudDetector extends KeyedProcessFunction<String, Transaction, Alert> {

    private transient ValueState<Boolean> flagState;
    private transient ValueState<Long> timerState;

    @Override
    public void open(Configuration parameters) {
        // State must be registered before use, or value() fails at runtime
        flagState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("flag", Types.BOOLEAN));
        timerState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("timer-state", Types.LONG));
    }

    @Override
    public void processElement(Transaction tx, Context ctx, Collector<Alert> out) throws Exception {
        Boolean previouslyFlagged = flagState.value();
        if (previouslyFlagged != null && tx.getAmount() > 500) {
            // Small "test" transaction followed by a large one → fraud alert
            out.collect(new Alert(tx.getAccountId(), tx));
            cleanup(ctx);
        }
        if (tx.getAmount() < 1.00) {
            // Flag small "test" transactions
            flagState.update(true);
            // Set a timer: if no large transaction within 1 minute, clear the flag
            long timer = ctx.timerService().currentProcessingTime() + 60_000;
            ctx.timerService().registerProcessingTimeTimer(timer);
            timerState.update(timer);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) throws Exception {
        cleanup(ctx); // No large transaction within 1 minute — clear the flag
    }

    private void cleanup(Context ctx) throws Exception {
        // Cancel the pending timer and reset all state for this key
        Long timer = timerState.value();
        if (timer != null) {
            ctx.timerService().deleteProcessingTimeTimer(timer);
        }
        flagState.clear();
        timerState.clear();
    }
}
Exactly-Once Semantics
At-Most-Once:
Process message → don't track → if crash, message lost
Use: Metrics where occasional loss is acceptable
At-Least-Once:
Process message → crash before commit → replay → process again (duplicate)
Use: With idempotent operations (upsert)
Exactly-Once:
Process message + update state + commit offset atomically
Use: Financial transactions, critical business events
Kafka implementation:
Transactional producer + consumer isolation.level=read_committed
Flink checkpointing with Kafka offsets
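End-to-end, at-least-once delivery plus idempotent processing is often the pragmatic path to exactly-once *results*: replays happen, but they change nothing. A sketch using a hypothetical processed-ID set (in production this would be a keyed upsert or a transactional sink, not an in-memory set):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IdempotentSink {
    private final Set<String> processedIds = new HashSet<>();
    private final Map<String, Double> balances = new HashMap<>();

    // Safe under at-least-once: redelivery of the same event ID is a no-op
    public boolean apply(String eventId, String account, double delta) {
        if (!processedIds.add(eventId)) {
            return false; // duplicate delivery — already applied
        }
        balances.merge(account, delta, Double::sum);
        return true;
    }

    public double balance(String account) {
        return balances.getOrDefault(account, 0.0);
    }
}
```

The event ID here is doing the real work: deduplication only holds if producers attach a stable ID that survives retries.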
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Ignoring late events | Data loss, inaccurate windows | Allowed lateness + late data handling |
| No backpressure handling | OOM, pipeline crash under load | Backpressure propagation, rate limiting |
| State without checkpointing | State lost on failure | Regular checkpointing to durable storage |
| Streaming everything | Unnecessary complexity for batch use cases | Stream what needs real-time, batch the rest |
| No dead letter queue | Malformed events block pipeline | Route failures to DLQ for investigation |
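The dead-letter-queue fix from the table above, sketched in plain Java (parsing an integer stands in for real deserialization; in production the DLQ would be another Kafka topic, not a list):

```java
import java.util.ArrayList;
import java.util.List;

public class DlqRouter {
    final List<Integer> processed = new ArrayList<>();
    final List<String> deadLetters = new ArrayList<>(); // DLQ: inspect, fix, replay

    // A malformed event must not block the pipeline: catch, route, keep going
    void handle(String rawEvent) {
        try {
            processed.add(Integer.parseInt(rawEvent)); // stand-in for real parsing
        } catch (NumberFormatException e) {
            deadLetters.add(rawEvent); // keep the original payload for investigation
        }
    }
}
```

Without the catch-and-route, one bad message either crashes the consumer or gets retried forever, stalling everything behind it.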
Streaming architecture is about matching the speed of your data processing to the speed your business needs. Not everything needs to be real-time — but the things that do need to be reliably real-time.