Data Pipeline Architecture That Scales Without Rewriting Everything
Design data pipelines that survive growing data volumes, changing schemas, and the inevitable 3 AM failure. Covers batch vs streaming, orchestration, schema evolution, data quality gates, and the patterns that prevent 'Big Rewrite 2.0.'
Every data pipeline starts the same way: a cron job that runs a Python script that queries a database and dumps results into a CSV. It works for 6 months. Then data volume doubles, the source schema changes, the script silently starts producing wrong numbers, and someone makes a business decision based on data that has been wrong for 3 weeks.
This guide covers how to build data pipelines that are observable, recoverable, and maintainable — because the hardest part of data engineering is not moving data from A to B, it is knowing when the data at B is wrong.
Batch vs Streaming: The Decision Framework
This is the first architectural decision, and most teams get it wrong by defaulting to streaming because it sounds more modern.
| Factor | Batch | Streaming |
|---|---|---|
| Latency requirement | Minutes to hours is fine | Seconds to sub-second required |
| Data volume | Any (batch handles massive volumes well) | High throughput, continuous flow |
| Complexity | Lower (well-understood patterns) | Higher (ordering, exactly-once, backpressure) |
| Cost | Lower (resources used only during runs) | Higher (always-on infrastructure) |
| Debugging | Easier (replay specific runs) | Harder (stateful, distributed) |
| When to choose | Analytics, reporting, ETL, ML training | Real-time dashboards, fraud detection, event processing |
Decision tree:

```
"Do users need results in < 1 minute?"
 ├─ No  → Batch. Stop overthinking it.
 └─ Yes → "Is the data naturally event-driven?"
          ├─ No  → Micro-batch (5-minute windows). Simpler than streaming.
          └─ Yes → Streaming. Accept the complexity.
```
The uncomfortable truth: 80% of “real-time” requirements are actually “fast enough batch.” Before committing to streaming architecture, ask: “What business decision changes if this data is 15 minutes old instead of 15 seconds old?” If the answer is “none,” use batch.
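If micro-batch wins, the core mechanic is just computing the last fully closed window and processing only the records that fall inside it. A minimal sketch in plain Python — the record shape and timestamps are made up for illustration:

```python
from datetime import datetime, timedelta

def window_bounds(now: datetime, size_min: int = 5) -> tuple:
    """Return the [start, end) bounds of the most recent fully closed window."""
    minute = (now.minute // size_min) * size_min
    current_start = now.replace(minute=minute, second=0, microsecond=0)
    return current_start - timedelta(minutes=size_min), current_start

# A run at 10:13:42 processes the 10:05-10:10 window — the last one fully closed.
records = [
    {"id": 1, "ts": datetime(2024, 1, 1, 10, 7)},
    {"id": 2, "ts": datetime(2024, 1, 1, 10, 12)},  # still in the open window
]
start, end = window_bounds(datetime(2024, 1, 1, 10, 13, 42))
batch = [r for r in records if start <= r["ts"] < end]
```

Processing only closed windows is what keeps micro-batch simple: a late run just picks up the same deterministic window, so re-runs are safe.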
Pipeline Architecture Patterns
Pattern 1: Extract-Load-Transform (ELT)
The modern default. Extract raw data, load it into a warehouse, transform it using SQL inside the warehouse.
```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Sources  │───▶│ Ingestion│───▶│ Raw Layer│───▶│ Transform│
│ (APIs,   │    │ (Fivetran│    │ (BQ, Snow│    │  (dbt)   │
│  DBs,    │    │  Airbyte)│    │  flake)  │    │          │
│  Files)  │    └──────────┘    └──────────┘    └──────────┘
└──────────┘                                         │
                                                     ▼
                                               ┌──────────┐
                                               │  Marts   │
                                               │ (curated │
                                               │  tables) │
                                               └──────────┘
```
When to use: Analytics, reporting, data science. The warehouse does the heavy lifting.
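The load-then-transform split is easy to demonstrate end to end. The sketch below uses sqlite3 as a stand-in warehouse (in practice the target would be BigQuery or Snowflake, with dbt owning the SQL): raw JSON payloads land untouched in a raw-layer table, then SQL inside the "warehouse" builds the curated mart.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land raw payloads exactly as received — no parsing, no cleaning.
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
events = [
    {"order_id": "o1", "total_amount": 20.0, "status": "shipped"},
    {"order_id": "o2", "total_amount": 35.5, "status": "pending"},
]
conn.executemany("INSERT INTO raw_orders VALUES (?)",
                 [(json.dumps(e),) for e in events])

# Transform: SQL inside the warehouse extracts typed columns into a mart.
conn.execute("""
    CREATE TABLE orders_mart AS
    SELECT json_extract(payload, '$.order_id')     AS order_id,
           json_extract(payload, '$.total_amount') AS total_amount,
           json_extract(payload, '$.status')       AS status
    FROM raw_orders
""")
rows = conn.execute(
    "SELECT order_id, total_amount FROM orders_mart ORDER BY order_id"
).fetchall()
```

Because the raw table keeps the untouched payloads, a bug in the transform SQL is fixed by rewriting the mart — no re-extraction from the source needed.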
Pattern 2: Event-Driven Pipeline
```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Producers│───▶│ Message  │───▶│ Stream   │───▶│ Sinks    │
│ (Apps,   │    │ Broker   │    │ Processor│    │ (DB, S3, │
│  IoT)    │    │ (Kafka)  │    │ (Flink)  │    │  API)    │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
```
When to use: Real-time fraud detection, live dashboards, event sourcing.
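Stripped of Kafka and Flink, the pattern is a producer, a broker, and a consumer that reacts per event. A toy in-process version using `queue.Queue` as the broker, with a made-up "flag large orders" rule standing in for real fraud logic:

```python
import queue
import threading

broker = queue.Queue()  # stand-in for a Kafka topic
results = []            # stand-in for a sink

def processor():
    # Stream processor: consume events until the sentinel arrives.
    while True:
        event = broker.get()
        if event is None:
            break
        if event["amount"] > 1000:  # toy fraud rule
            results.append({**event, "flagged": True})

consumer = threading.Thread(target=processor)
consumer.start()
for amount in (50, 5000, 120):
    broker.put({"order_id": f"o{amount}", "amount": amount})
broker.put(None)  # sentinel: end of stream (real streams never end)
consumer.join()
```

The hard parts the real stack adds — partitioned ordering, exactly-once delivery, backpressure, state checkpointing — are exactly the complexity the decision framework above warns about.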
Pattern 3: Medallion Architecture (Bronze → Silver → Gold)
- Bronze (Raw): Exact copy of source data. No transformations. Schema-on-read. Append-only. The safety net.
- Silver (Cleaned): Deduplicated, validated, typed, joined. Schema-on-write. Business logic applied.
- Gold (Curated): Business-ready aggregations and metrics. Optimized for specific use cases and consumers.
| Layer | Who Consumes | Quality Level | Retention |
|---|---|---|---|
| Bronze | Data engineers (debugging) | Raw, uncleaned | Forever (it is your backup) |
| Silver | Data scientists, analysts | Cleaned, validated | 2-5 years |
| Gold | Business users, dashboards | Aggregated, fast | As needed |
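The three layers can be shown with plain Python dicts — the sample rows, the dedupe key, and the validation rule are illustrative only:

```python
# Bronze: raw events exactly as received — duplicates and bad rows included.
bronze = [
    {"order_id": "o1", "total_amount": "20.0", "status": "shipped"},
    {"order_id": "o1", "total_amount": "20.0", "status": "shipped"},  # duplicate
    {"order_id": "o2", "total_amount": "-5",   "status": "pending"},  # invalid amount
    {"order_id": "o3", "total_amount": "35.5", "status": "shipped"},
]

# Silver: deduplicated on order_id, typed, validated.
seen, silver = set(), []
for row in bronze:
    amount = float(row["total_amount"])
    if row["order_id"] in seen or amount <= 0:
        continue  # drop duplicates and invalid amounts
    seen.add(row["order_id"])
    silver.append({**row, "total_amount": amount})

# Gold: a business-ready aggregate computed from silver.
gold = {"shipped_revenue": sum(
    r["total_amount"] for r in silver if r["status"] == "shipped"
)}
```

Note that bronze is never mutated: if the silver rules turn out to be wrong, silver and gold are rebuilt from bronze, which is why the table above says to keep it forever.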
Schema Evolution: The Silent Killer
Upstream schemas change. They always change. A column gets renamed, a field becomes nullable, a new field appears, an enum gains a value. If your pipeline does not handle this, it breaks — often silently.
Defensive Schema Patterns
```python
# Pattern: Schema validation at ingestion
import logging
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, validator

logger = logging.getLogger(__name__)

class OrderEvent(BaseModel):
    order_id: str
    customer_id: str
    total_amount: float
    currency: str = "USD"  # Default for new field
    status: str
    created_at: datetime
    # Fields that might appear later:
    shipping_method: Optional[str] = None  # Optional = won't break if missing
    discount_code: Optional[str] = None

    @validator('total_amount')
    def amount_must_be_positive(cls, v):
        if v < 0:
            raise ValueError(f'Invalid amount: {v}')
        return v

    @validator('status')
    def validate_status(cls, v):
        valid = {'pending', 'confirmed', 'shipped', 'delivered', 'cancelled'}
        if v not in valid:
            # Log warning but don't crash — new statuses might be legitimate
            logger.warning(f"Unknown order status: {v}")
        return v
```
Schema Change Response Matrix
| Change Type | Impact | Response |
|---|---|---|
| New optional column | None if handled | Add to schema with default |
| Column renamed | Pipeline breaks | Map old name → new name in ingestion |
| Column removed | Pipeline breaks | Make field optional, add monitoring |
| Type change (string → int) | Data corruption | Cast with validation, alert on failure |
| New enum value | Logic may break | Log unknown values, do not crash |
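The rename row in the matrix is the most common fix in practice: keep a mapping from legacy column names to current ones and apply it at ingestion, before validation. A minimal sketch — the column names are hypothetical:

```python
# Legacy name -> current schema name. Grows as upstream renames accumulate.
RENAMES = {"cust_id": "customer_id", "amount": "total_amount"}

def normalize(record: dict) -> dict:
    """Map legacy column names onto the current schema; pass others through."""
    return {RENAMES.get(key, key): value for key, value in record.items()}

old = {"order_id": "o1", "cust_id": "c9", "amount": 20.0}
new = normalize(old)
```

Keeping the mapping in one place means a rename is a one-line change plus a deploy, instead of edits scattered across every downstream transform.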
Data Quality Gates
Never trust upstream data. Build automated quality checks that run on every pipeline execution:
```python
# Data quality checks using Great Expectations style
quality_checks = {
    "orders_table": [
        # Completeness: no nulls in critical fields
        {"check": "not_null", "columns": ["order_id", "customer_id", "total_amount"]},
        # Uniqueness: no duplicate orders
        {"check": "unique", "columns": ["order_id"]},
        # Freshness: most recent order within last 2 hours
        {"check": "freshness", "column": "created_at", "max_age": "2 hours"},
        # Volume: between 1K and 100K orders per run (anomaly detection)
        {"check": "row_count", "min": 1000, "max": 100000},
        # Referential integrity: every order has a valid customer
        {"check": "referential", "column": "customer_id", "reference": "customers.id"},
        # Statistical: total_amount should be within historical range
        {"check": "range", "column": "total_amount", "min": 0.01, "max": 50000},
    ]
}
```
| Check Type | What It Catches | Severity |
|---|---|---|
| Null check | Missing required data | 🔴 Block pipeline |
| Uniqueness | Duplicate records | 🔴 Block pipeline |
| Freshness | Stale data / broken source | 🔴 Block pipeline |
| Volume anomaly | Source outage or data explosion | 🟡 Alert, investigate |
| Distribution shift | Upstream logic change | 🟡 Alert, investigate |
| Schema validation | Upstream schema change | 🔴 Block pipeline |
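A gate that enforces this severity table needs to distinguish blocking failures from warnings. The runner below is a hand-rolled sketch (not Great Expectations) covering three of the check types: null and uniqueness failures raise and block the run, while volume anomalies only warn, matching the table above.

```python
def run_checks(rows: list, checks: list) -> list:
    """Run quality checks; raise on blocking failures, return warnings."""
    blocking, warnings = [], []
    for check in checks:
        if check["check"] == "not_null":
            for col in check["columns"]:
                if any(r.get(col) is None for r in rows):
                    blocking.append(f"null in {col}")
        elif check["check"] == "unique":
            col = check["columns"][0]
            values = [r[col] for r in rows]
            if len(values) != len(set(values)):
                blocking.append(f"duplicates in {col}")
        elif check["check"] == "row_count":
            if not (check["min"] <= len(rows) <= check["max"]):
                warnings.append(
                    f"row count {len(rows)} outside [{check['min']}, {check['max']}]"
                )
    if blocking:
        raise RuntimeError(f"Pipeline blocked: {blocking}")
    return warnings

rows = [{"order_id": "o1", "customer_id": "c1"},
        {"order_id": "o2", "customer_id": "c2"}]
checks = [
    {"check": "not_null", "columns": ["order_id", "customer_id"]},
    {"check": "unique", "columns": ["order_id"]},
    {"check": "row_count", "min": 1000, "max": 100000},
]
warnings = run_checks(rows, checks)  # volume is off: warns, but does not block
```

In an orchestrated pipeline the raise is what stops the DAG: the validate task fails, downstream load tasks never run, and the alert callback fires.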
Orchestration: Airflow vs the Alternatives
| Orchestrator | Best For | Weakness |
|---|---|---|
| Apache Airflow | Complex DAGs, mature ecosystem | Operationally heavy; DAGs must be written in Python |
| Dagster | Software-defined assets, testing | Newer, smaller ecosystem |
| Prefect | Simple orchestration, fast setup | Less mature for complex DAGs |
| dbt Cloud | SQL-only transformation pipelines | Not for extraction or loading |
| Step Functions | AWS-native event orchestration | Vendor lock-in |
Airflow DAG Best Practices
```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# alert_slack, extract_orders_from_api, run_quality_checks, and
# load_to_bigquery are project-specific callables defined elsewhere.

default_args = {
    'owner': 'data-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'execution_timeout': timedelta(hours=2),  # Kill hung tasks
    'on_failure_callback': alert_slack,
}

with DAG(
    'orders_pipeline',
    default_args=default_args,
    schedule_interval='0 */4 * * *',  # Every 4 hours
    catchup=False,                    # Don't backfill on deploy
    max_active_runs=1,                # Prevent overlapping runs
    tags=['production', 'orders'],
) as dag:
    extract = PythonOperator(
        task_id='extract_orders',
        python_callable=extract_orders_from_api,
    )
    validate = PythonOperator(
        task_id='validate_data_quality',
        python_callable=run_quality_checks,
    )
    load = PythonOperator(
        task_id='load_to_warehouse',
        python_callable=load_to_bigquery,
    )

    # Always validate before loading
    extract >> validate >> load
```
Implementation Checklist
- Decide batch vs streaming based on actual latency requirements (not aspirational ones)
- Implement medallion architecture: bronze (raw) → silver (cleaned) → gold (curated)
- Add schema validation at ingestion — never trust upstream data types
- Build data quality gates that block pipeline execution on critical failures
- Set up freshness monitoring: alert if data is older than expected
- Implement idempotent pipeline runs (re-running produces the same result)
- Add volume anomaly detection (±50% from baseline triggers alert)
- Document every pipeline: source, schedule, owner, SLA, and failure runbook
- Track pipeline reliability: success rate, mean run time, data latency end-to-end
- Keep bronze layer forever — it is your disaster recovery for data
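Idempotency (re-running produces the same result) usually comes down to replacing a whole partition atomically rather than appending to it. A sketch using sqlite3 as the target, with a hypothetical `daily_orders` table partitioned by run date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (run_date TEXT, order_id TEXT, total REAL)")

def load_partition(conn: sqlite3.Connection, run_date: str, rows: list) -> None:
    """Idempotent load: replace the whole partition so re-runs never duplicate rows."""
    with conn:  # one transaction: delete + insert commit together or not at all
        conn.execute("DELETE FROM daily_orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_orders VALUES (?, ?, ?)",
            [(run_date, r["order_id"], r["total"]) for r in rows],
        )

rows = [{"order_id": "o1", "total": 20.0}, {"order_id": "o2", "total": 35.5}]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # re-run after a failure: no duplicates
count = conn.execute("SELECT COUNT(*) FROM daily_orders").fetchone()[0]
```

This is what makes the 3 AM failure boring: the on-call engineer re-triggers the failed run and the partition ends up in exactly the state a clean run would have produced.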