Data Pipeline Architecture That Scales Without Rewriting Everything
Design data pipelines that survive growing data volumes, changing schemas, and the inevitable 3 AM failure. Covers batch vs streaming, orchestration, schema evolution, data quality gates, and the patterns that prevent 'Big Rewrite 2.0.'
Every data pipeline starts the same way: a cron job that runs a Python script that queries a database and dumps results into a CSV. It works for 6 months. Then data volume doubles, the source schema changes, the script silently starts producing wrong numbers, and someone makes a business decision based on data that has been wrong for 3 weeks.
This guide covers how to build data pipelines that are observable, recoverable, and maintainable — because the hardest part of data engineering is not moving data from A to B, it is knowing when the data at B is wrong.
Batch vs Streaming: The Decision Framework
This is the first architectural decision, and most teams get it wrong by defaulting to streaming because it sounds more modern.
| Factor | Batch | Streaming |
|---|---|---|
| Latency requirement | Minutes to hours is fine | Seconds to sub-second required |
| Data volume | Any (batch handles massive volumes well) | High throughput, continuous flow |
| Complexity | Lower (well-understood patterns) | Higher (ordering, exactly-once, backpressure) |
| Cost | Lower (resources used only during runs) | Higher (always-on infrastructure) |
| Debugging | Easier (replay specific runs) | Harder (stateful, distributed) |
| When to choose | Analytics, reporting, ETL, ML training | Real-time dashboards, fraud detection, event processing |
Decision tree:

```
"Do users need results in < 1 minute?"
 ├─ No  → Batch. Stop overthinking it.
 └─ Yes → "Is the data naturally event-driven?"
          ├─ No  → Micro-batch (5-minute windows). Simpler than streaming.
          └─ Yes → Streaming. Accept the complexity.
```
The uncomfortable truth: 80% of “real-time” requirements are actually “fast enough batch.” Before committing to streaming architecture, ask: “What business decision changes if this data is 15 minutes old instead of 15 seconds old?” If the answer is “none,” use batch.
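If micro-batch wins, the core mechanic is just computing the last fully closed window and processing only the records that fall inside it. A minimal sketch in plain Python — the record shape and timestamps are made up for illustration:

```python
from datetime import datetime, timedelta

def window_bounds(now: datetime, size_min: int = 5) -> tuple:
    """Return the [start, end) bounds of the most recent fully closed window."""
    minute = (now.minute // size_min) * size_min
    current_start = now.replace(minute=minute, second=0, microsecond=0)
    return current_start - timedelta(minutes=size_min), current_start

# A run at 10:13:42 processes the 10:05-10:10 window — the last one fully closed.
records = [
    {"id": 1, "ts": datetime(2024, 1, 1, 10, 7)},
    {"id": 2, "ts": datetime(2024, 1, 1, 10, 12)},  # still in the open window
]
start, end = window_bounds(datetime(2024, 1, 1, 10, 13, 42))
batch = [r for r in records if start <= r["ts"] < end]
```

Processing only closed windows is what keeps micro-batch simple: a late run just picks up the same deterministic window, so re-runs are safe.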
Pipeline Architecture Patterns
Pattern 1: Extract-Load-Transform (ELT)
The modern default. Extract raw data, load it into a warehouse, transform it using SQL inside the warehouse.
```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Sources  │───▶│ Ingestion│───▶│ Raw Layer│───▶│ Transform│
│ (APIs,   │    │ (Fivetran│    │ (BQ, Snow│    │  (dbt)   │
│  DBs,    │    │  Airbyte)│    │  flake)  │    │          │
│  Files)  │    └──────────┘    └──────────┘    └──────────┘
└──────────┘                                         │
                                                     ▼
                                               ┌──────────┐
                                               │  Marts   │
                                               │ (curated │
                                               │  tables) │
                                               └──────────┘
```
When to use: Analytics, reporting, data science. The warehouse does the heavy lifting.
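The load-then-transform split is easy to demonstrate end to end. The sketch below uses sqlite3 as a stand-in warehouse (in practice the target would be BigQuery or Snowflake, with dbt owning the SQL): raw JSON payloads land untouched in a raw-layer table, then SQL inside the "warehouse" builds the curated mart.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land raw payloads exactly as received — no parsing, no cleaning.
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
events = [
    {"order_id": "o1", "total_amount": 20.0, "status": "shipped"},
    {"order_id": "o2", "total_amount": 35.5, "status": "pending"},
]
conn.executemany("INSERT INTO raw_orders VALUES (?)",
                 [(json.dumps(e),) for e in events])

# Transform: SQL inside the warehouse extracts typed columns into a mart.
conn.execute("""
    CREATE TABLE orders_mart AS
    SELECT json_extract(payload, '$.order_id')     AS order_id,
           json_extract(payload, '$.total_amount') AS total_amount,
           json_extract(payload, '$.status')       AS status
    FROM raw_orders
""")
rows = conn.execute(
    "SELECT order_id, total_amount FROM orders_mart ORDER BY order_id"
).fetchall()
```

Because the raw table keeps the untouched payloads, a bug in the transform SQL is fixed by rewriting the mart — no re-extraction from the source needed.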
Pattern 2: Event-Driven Pipeline
```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Producers│───▶│ Message  │───▶│ Stream   │───▶│ Sinks    │
│ (Apps,   │    │ Broker   │    │ Processor│    │ (DB, S3, │
│  IoT)    │    │ (Kafka)  │    │ (Flink)  │    │  API)    │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
```
When to use: Real-time fraud detection, live dashboards, event sourcing.
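Stripped of Kafka and Flink, the pattern is a producer, a broker, and a consumer that reacts per event. A toy in-process version using `queue.Queue` as the broker, with a made-up "flag large orders" rule standing in for real fraud logic:

```python
import queue
import threading

broker = queue.Queue()  # stand-in for a Kafka topic
results = []            # stand-in for a sink

def processor():
    # Stream processor: consume events until the sentinel arrives.
    while True:
        event = broker.get()
        if event is None:
            break
        if event["amount"] > 1000:  # toy fraud rule
            results.append({**event, "flagged": True})

consumer = threading.Thread(target=processor)
consumer.start()
for amount in (50, 5000, 120):
    broker.put({"order_id": f"o{amount}", "amount": amount})
broker.put(None)  # sentinel: end of stream (real streams never end)
consumer.join()
```

The hard parts the real stack adds — partitioned ordering, exactly-once delivery, backpressure, state checkpointing — are exactly the complexity the decision framework above warns about.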
Pattern 3: Medallion Architecture (Bronze → Silver → Gold)
- Bronze (Raw): Exact copy of source data. No transformations. Schema-on-read. Append-only. The safety net.
- Silver (Cleaned): Deduplicated, validated, typed, joined. Schema-on-write. Business logic applied.
- Gold (Curated): Business-ready aggregations and metrics. Optimized for specific use cases and consumers.
| Layer | Who Consumes | Quality Level | Retention |
|---|---|---|---|
| Bronze | Data engineers (debugging) | Raw, uncleaned | Forever (it is your backup) |
| Silver | Data scientists, analysts | Cleaned, validated | 2-5 years |
| Gold | Business users, dashboards | Aggregated, fast | As needed |
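The three layers can be shown with plain Python dicts — the sample rows, the dedupe key, and the validation rule are illustrative only:

```python
# Bronze: raw events exactly as received — duplicates and bad rows included.
bronze = [
    {"order_id": "o1", "total_amount": "20.0", "status": "shipped"},
    {"order_id": "o1", "total_amount": "20.0", "status": "shipped"},  # duplicate
    {"order_id": "o2", "total_amount": "-5",   "status": "pending"},  # invalid amount
    {"order_id": "o3", "total_amount": "35.5", "status": "shipped"},
]

# Silver: deduplicated on order_id, typed, validated.
seen, silver = set(), []
for row in bronze:
    amount = float(row["total_amount"])
    if row["order_id"] in seen or amount <= 0:
        continue  # drop duplicates and invalid amounts
    seen.add(row["order_id"])
    silver.append({**row, "total_amount": amount})

# Gold: a business-ready aggregate computed from silver.
gold = {"shipped_revenue": sum(
    r["total_amount"] for r in silver if r["status"] == "shipped"
)}
```

Note that bronze is never mutated: if the silver rules turn out to be wrong, silver and gold are rebuilt from bronze, which is why the table above says to keep it forever.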
Schema Evolution: The Silent Killer
Upstream schemas change. They always change. A column gets renamed, a field becomes nullable, a new field appears, an enum gains a value. If your pipeline does not handle this, it breaks — often silently.
Defensive Schema Patterns
```python
# Pattern: Schema validation at ingestion
import logging
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, validator

logger = logging.getLogger(__name__)

class OrderEvent(BaseModel):
    order_id: str
    customer_id: str
    total_amount: float
    currency: str = "USD"  # Default for new field
    status: str
    created_at: datetime
    # Fields that might appear later:
    shipping_method: Optional[str] = None  # Optional = won't break if missing
    discount_code: Optional[str] = None

    @validator('total_amount')
    def amount_must_be_positive(cls, v):
        if v < 0:
            raise ValueError(f'Invalid amount: {v}')
        return v

    @validator('status')
    def validate_status(cls, v):
        valid = {'pending', 'confirmed', 'shipped', 'delivered', 'cancelled'}
        if v not in valid:
            # Log warning but don't crash — new statuses might be legitimate
            logger.warning(f"Unknown order status: {v}")
        return v
```
Schema Change Response Matrix
| Change Type | Impact | Response |
|---|---|---|
| New optional column | None if handled | Add to schema with default |
| Column renamed | Pipeline breaks | Map old name → new name in ingestion |
| Column removed | Pipeline breaks | Make field optional, add monitoring |
| Type change (string → int) | Data corruption | Cast with validation, alert on failure |
| New enum value | Logic may break | Log unknown values, do not crash |
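The rename row in the matrix is the most common fix in practice: keep a mapping from legacy column names to current ones and apply it at ingestion, before validation. A minimal sketch — the column names are hypothetical:

```python
# Legacy name -> current schema name. Grows as upstream renames accumulate.
RENAMES = {"cust_id": "customer_id", "amount": "total_amount"}

def normalize(record: dict) -> dict:
    """Map legacy column names onto the current schema; pass others through."""
    return {RENAMES.get(key, key): value for key, value in record.items()}

old = {"order_id": "o1", "cust_id": "c9", "amount": 20.0}
new = normalize(old)
```

Keeping the mapping in one place means a rename is a one-line change plus a deploy, instead of edits scattered across every downstream transform.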
Data Quality Gates
Never trust upstream data. Build automated quality checks that run on every pipeline execution:
```python
# Data quality checks using Great Expectations style
quality_checks = {
    "orders_table": [
        # Completeness: no nulls in critical fields
        {"check": "not_null", "columns": ["order_id", "customer_id", "total_amount"]},
        # Uniqueness: no duplicate orders
        {"check": "unique", "columns": ["order_id"]},
        # Freshness: most recent order within last 2 hours
        {"check": "freshness", "column": "created_at", "max_age": "2 hours"},
        # Volume: between 1K and 100K orders per run (anomaly detection)
        {"check": "row_count", "min": 1000, "max": 100000},
        # Referential integrity: every order has a valid customer
        {"check": "referential", "column": "customer_id", "reference": "customers.id"},
        # Statistical: total_amount should be within historical range
        {"check": "range", "column": "total_amount", "min": 0.01, "max": 50000},
    ]
}
```
| Check Type | What It Catches | Severity |
|---|---|---|
| Null check | Missing required data | 🔴 Block pipeline |
| Uniqueness | Duplicate records | 🔴 Block pipeline |
| Freshness | Stale data / broken source | 🔴 Block pipeline |
| Volume anomaly | Source outage or data explosion | 🟡 Alert, investigate |
| Distribution shift | Upstream logic change | 🟡 Alert, investigate |
| Schema validation | Upstream schema change | 🔴 Block pipeline |
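A gate that enforces this severity table needs to distinguish blocking failures from warnings. The runner below is a hand-rolled sketch (not Great Expectations) covering three of the check types: null and uniqueness failures raise and block the run, while volume anomalies only warn, matching the table above.

```python
def run_checks(rows: list, checks: list) -> list:
    """Run quality checks; raise on blocking failures, return warnings."""
    blocking, warnings = [], []
    for check in checks:
        if check["check"] == "not_null":
            for col in check["columns"]:
                if any(r.get(col) is None for r in rows):
                    blocking.append(f"null in {col}")
        elif check["check"] == "unique":
            col = check["columns"][0]
            values = [r[col] for r in rows]
            if len(values) != len(set(values)):
                blocking.append(f"duplicates in {col}")
        elif check["check"] == "row_count":
            if not (check["min"] <= len(rows) <= check["max"]):
                warnings.append(
                    f"row count {len(rows)} outside [{check['min']}, {check['max']}]"
                )
    if blocking:
        raise RuntimeError(f"Pipeline blocked: {blocking}")
    return warnings

rows = [{"order_id": "o1", "customer_id": "c1"},
        {"order_id": "o2", "customer_id": "c2"}]
checks = [
    {"check": "not_null", "columns": ["order_id", "customer_id"]},
    {"check": "unique", "columns": ["order_id"]},
    {"check": "row_count", "min": 1000, "max": 100000},
]
warnings = run_checks(rows, checks)  # volume is off: warns, but does not block
```

In an orchestrated pipeline the raise is what stops the DAG: the validate task fails, downstream load tasks never run, and the alert callback fires.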
Orchestration: Airflow vs the Alternatives
| Orchestrator | Best For | Weakness |
|---|---|---|
| Apache Airflow | Complex DAGs, mature ecosystem | Operationally heavy; DAGs must be written in Python |
| Dagster | Software-defined assets, testing | Newer, smaller ecosystem |
| Prefect | Simple orchestration, fast setup | Less mature for complex DAGs |
| dbt Cloud | SQL-only transformation pipelines | Not for extraction or loading |
| Step Functions | AWS-native event orchestration | Vendor lock-in |
Airflow DAG Best Practices
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# alert_slack, extract_orders_from_api, run_quality_checks, and
# load_to_bigquery are project-specific callables defined elsewhere.

default_args = {
    'owner': 'data-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'execution_timeout': timedelta(hours=2),  # Kill hung tasks
    'on_failure_callback': alert_slack,
}

with DAG(
    'orders_pipeline',
    default_args=default_args,
    schedule_interval='0 */4 * * *',  # Every 4 hours
    catchup=False,                    # Don't backfill on deploy
    max_active_runs=1,                # Prevent overlapping runs
    tags=['production', 'orders'],
) as dag:
    extract = PythonOperator(
        task_id='extract_orders',
        python_callable=extract_orders_from_api,
    )
    validate = PythonOperator(
        task_id='validate_data_quality',
        python_callable=run_quality_checks,
    )
    load = PythonOperator(
        task_id='load_to_warehouse',
        python_callable=load_to_bigquery,
    )

    # Always validate before loading
    extract >> validate >> load
```
Implementation Checklist
- Decide batch vs streaming based on actual latency requirements (not aspirational ones)
- Implement medallion architecture: bronze (raw) → silver (cleaned) → gold (curated)
- Add schema validation at ingestion — never trust upstream data types
- Build data quality gates that block pipeline execution on critical failures
- Set up freshness monitoring: alert if data is older than expected
- Implement idempotent pipeline runs (re-running produces the same result)
- Add volume anomaly detection (±50% from baseline triggers alert)
- Document every pipeline: source, schedule, owner, SLA, and failure runbook
- Track pipeline reliability: success rate, mean run time, data latency end-to-end
- Keep bronze layer forever — it is your disaster recovery for data
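Idempotency (re-running produces the same result) usually comes down to replacing a whole partition atomically rather than appending to it. A sketch using sqlite3 as the target, with a hypothetical `daily_orders` table partitioned by run date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (run_date TEXT, order_id TEXT, total REAL)")

def load_partition(conn: sqlite3.Connection, run_date: str, rows: list) -> None:
    """Idempotent load: replace the whole partition so re-runs never duplicate rows."""
    with conn:  # one transaction: delete + insert commit together or not at all
        conn.execute("DELETE FROM daily_orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_orders VALUES (?, ?, ?)",
            [(run_date, r["order_id"], r["total"]) for r in rows],
        )

rows = [{"order_id": "o1", "total": 20.0}, {"order_id": "o2", "total": 35.5}]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # re-run after a failure: no duplicates
count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
```

This is what makes the 3 AM failure boring: the on-call engineer re-triggers the failed run and the partition ends up in exactly the state a clean run would have produced.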