Streaming Data Architecture
Design real-time data pipelines that process events as they occur. Covers stream processing frameworks, exactly-once semantics, windowing, stateful processing, and the patterns that make streaming architecture production-ready.
Batch processing answers questions about the past. Streaming answers questions about the present. When your business needs to detect fraud as it happens, update dashboards in real-time, or trigger alerts within seconds, batch processing is too slow. Streaming architecture processes events as they arrive, enabling sub-second response times.
Streaming vs Batch
Batch Processing:
Collect all orders from today → Process overnight → Dashboard updated at 6 AM
Latency: Hours
Use case: Reporting, analytics, ML training
Stream Processing:
Each order processed immediately → Dashboard updated in real-time
Latency: Milliseconds to seconds
Use case: Fraud detection, real-time analytics, alerting
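The contrast can be sketched in a few lines of plain Java (hypothetical order amounts, no framework): a batch job folds the complete dataset after collection, while a stream processor updates its running result on every event.

```java
import java.util.List;

public class BatchVsStream {
    // Batch: process the complete dataset once, after collection ends
    static double batchTotal(List<Double> ordersFromToday) {
        return ordersFromToday.stream().mapToDouble(Double::doubleValue).sum();
    }

    // Streaming: running state updated as each event arrives;
    // the "dashboard" value is current after every single call
    static double runningTotal = 0.0;

    static double onOrder(double amount) {
        runningTotal += amount;
        return runningTotal; // available immediately, not at 6 AM
    }

    public static void main(String[] args) {
        List<Double> day = List.of(10.0, 25.0, 5.0);
        System.out.println("batch total: " + batchTotal(day)); // one answer, after the fact
        for (double amt : day) {
            System.out.println("running total: " + onOrder(amt)); // an answer per event
        }
    }
}
```

Both arrive at the same number; the difference is *when* each intermediate answer exists.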
Lambda Architecture (both):
Batch layer: Historical accuracy (reprocessable)
Speed layer: Real-time approximation
Serving layer: Merge batch + speed
Kappa Architecture (streaming only):
Everything is a stream
Historical data = replay the stream
One pipeline to maintain, not two
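Kappa's "historical data = replay the stream" idea can be sketched without a framework: the append-only log is the source of truth, and any state is just a fold over it from offset zero (hypothetical integer events; a real system would replay from a Kafka topic).

```java
import java.util.ArrayList;
import java.util.List;

public class KappaReplay {
    // The append-only log is the single source of truth
    static final List<Integer> log = new ArrayList<>();

    static void append(int event) {
        log.add(event);
    }

    // Rebuilding state = replaying the stream from offset 0.
    // A new consumer, a bug fix, or a schema change all start here,
    // instead of in a separate batch pipeline.
    static int replayState() {
        int state = 0;
        for (int event : log) {
            state += event; // apply each event in order
        }
        return state;
    }
}
```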
Stream Processing Frameworks
| Framework | Best For | Latency | Complexity |
|---|---|---|---|
| Apache Kafka Streams | JVM-native, stateful processing | Milliseconds | Medium |
| Apache Flink | Complex event processing, large scale | Milliseconds | High |
| Apache Spark Structured Streaming | Unified batch + stream | Sub-second | Medium |
| Amazon Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) | AWS-native, managed | Sub-second | Low |
| Dataflow (Google) | GCP-native, auto-scaling | Sub-second | Medium |
Windowing
Window Types
Tumbling Window (fixed, non-overlapping):
[0:00 - 0:05] [0:05 - 0:10] [0:10 - 0:15]
Use: 5-minute aggregations
Sliding Window (fixed, overlapping):
[0:00 - 0:05] [0:01 - 0:06] [0:02 - 0:07]
Use: Moving averages
Session Window (gap-based):
[login...click...click...[5 min gap]...click...logout]
Use: User session analytics
Hopping Window (Kafka Streams' term for a sliding window with a fixed advance):
Size: 10 min, hop: 5 min → [0:00-0:10] [0:05-0:15]
Use: Overlapping aggregations
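Window assignment itself is just integer arithmetic on the event timestamp. A sketch with millisecond timestamps (this mirrors what frameworks like Flink do internally, not their actual code):

```java
public class Windows {
    // Start of the tumbling window containing `timestamp`.
    // Every event maps to exactly one window: [start, start + size)
    static long tumblingWindowStart(long timestamp, long sizeMs) {
        return timestamp - (timestamp % sizeMs);
    }

    // A sliding window of `size` advancing by `slide` assigns each event
    // to size/slide overlapping windows; this returns the earliest start.
    static long firstSlidingWindowStart(long timestamp, long sizeMs, long slideMs) {
        long lastStart = timestamp - (timestamp % slideMs);
        return lastStart - sizeMs + slideMs;
    }
}
```

For example, with a 10-minute sliding window advancing every 5 minutes, an event at minute 7 belongs to both [0:00-0:10] and [0:05-0:15].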
Flink Windowing Example
DataStream<OrderEvent> orders = env.addSource(kafkaConsumer);
orders
.keyBy(OrderEvent::getCustomerId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(new OrderAggregator())
.addSink(dashboardSink);
// For late events
orders
.keyBy(OrderEvent::getCustomerId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.allowedLateness(Time.minutes(1)) // Accept events up to 1 min late
.sideOutputLateData(lateOutputTag) // Route late data separately
.aggregate(new OrderAggregator());
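Behind allowedLateness is a simple rule: a window's state is retained until the watermark passes windowEnd + allowedLateness. A late event arriving while state is retained triggers an updated result; after that, it goes to the side output. A sketch of the decision (an illustration, not Flink's actual code):

```java
public class Lateness {
    // Classify an event for a window ending at `windowEnd`,
    // given the current watermark and the allowed lateness.
    static String classify(long watermark, long windowEnd, long allowedLatenessMs) {
        if (watermark < windowEnd) {
            return "on-time";        // window has not fired yet
        } else if (watermark < windowEnd + allowedLatenessMs) {
            return "late-accepted";  // window re-fires with an updated result
        } else {
            return "side-output";    // state is gone; routed via lateOutputTag
        }
    }
}
```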
Stateful Processing
// Flink: fraud detection with keyed state and timers
public class FraudDetector extends KeyedProcessFunction<String, Transaction, Alert> {

    private transient ValueState<Boolean> flagState;
    private transient ValueState<Long> timerState;

    @Override
    public void open(Configuration parameters) {
        // State must be registered before use, or value() fails at runtime
        flagState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("flag", Types.BOOLEAN));
        timerState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("timer-state", Types.LONG));
    }

    @Override
    public void processElement(Transaction tx, Context ctx, Collector<Alert> out) throws Exception {
        Boolean previouslyFlagged = flagState.value();
        if (previouslyFlagged != null && tx.getAmount() > 500) {
            // Small "test" transaction followed by a large one → fraud alert
            out.collect(new Alert(tx.getAccountId(), tx));
            cleanup(ctx);
        }
        if (tx.getAmount() < 1.00) {
            // Flag small "test" transactions
            flagState.update(true);
            // Set a timer: if no large transaction within 1 minute, clear the flag
            long timer = ctx.timerService().currentProcessingTime() + 60_000;
            ctx.timerService().registerProcessingTimeTimer(timer);
            timerState.update(timer);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) throws Exception {
        cleanup(ctx); // No large transaction within 1 minute — clear the flag
    }

    private void cleanup(Context ctx) throws Exception {
        // Cancel the pending timer and reset all state for this key
        Long timer = timerState.value();
        if (timer != null) {
            ctx.timerService().deleteProcessingTimeTimer(timer);
        }
        flagState.clear();
        timerState.clear();
    }
}
Exactly-Once Semantics
At-Most-Once:
Process message → don't track → if crash, message lost
Use: Metrics where occasional loss is acceptable
At-Least-Once:
Process message → crash before commit → replay → process again (duplicate)
Use: With idempotent operations (upsert)
Exactly-Once:
Process message + update state + commit offset atomically
Use: Financial transactions, critical business events
Kafka implementation:
Transactional producer + consumer isolation.level=read_committed
Flink checkpointing with Kafka offsets
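End-to-end, at-least-once delivery plus idempotent processing is often the pragmatic path to exactly-once *results*: replays happen, but they change nothing. A sketch using a hypothetical processed-ID set (in production this would be a keyed upsert or a transactional sink, not an in-memory set):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IdempotentSink {
    private final Set<String> processedIds = new HashSet<>();
    private final Map<String, Double> balances = new HashMap<>();

    // Safe under at-least-once: redelivery of the same event ID is a no-op
    public boolean apply(String eventId, String account, double delta) {
        if (!processedIds.add(eventId)) {
            return false; // duplicate delivery — already applied
        }
        balances.merge(account, delta, Double::sum);
        return true;
    }

    public double balance(String account) {
        return balances.getOrDefault(account, 0.0);
    }
}
```

The event ID here is doing the real work: deduplication only holds if producers attach a stable ID that survives retries.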
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Ignoring late events | Data loss, inaccurate windows | Allowed lateness + late data handling |
| No backpressure handling | OOM, pipeline crash under load | Backpressure propagation, rate limiting |
| State without checkpointing | State lost on failure | Regular checkpointing to durable storage |
| Streaming everything | Unnecessary complexity for batch use cases | Stream what needs real-time, batch the rest |
| No dead letter queue | Malformed events block pipeline | Route failures to DLQ for investigation |
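The dead-letter-queue fix from the table above, sketched in plain Java (parsing an integer stands in for real deserialization; in production the DLQ would be another Kafka topic, not a list):

```java
import java.util.ArrayList;
import java.util.List;

public class DlqRouter {
    final List<Integer> processed = new ArrayList<>();
    final List<String> deadLetters = new ArrayList<>(); // DLQ: inspect, fix, replay

    // A malformed event must not block the pipeline: catch, route, keep going
    void handle(String rawEvent) {
        try {
            processed.add(Integer.parseInt(rawEvent)); // stand-in for real parsing
        } catch (NumberFormatException e) {
            deadLetters.add(rawEvent); // keep the original payload for investigation
        }
    }
}
```

Without the catch-and-route, one bad message either crashes the consumer or gets retried forever, stalling everything behind it.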
Streaming architecture is about matching the speed of your data processing to the speed your business needs. Not everything needs to be real-time — but the things that do need to be reliably real-time.