
Data Pipeline Architecture That Scales Without Rewriting Everything

Design data pipelines that survive growing data volumes, changing schemas, and the inevitable 3 AM failure. Covers batch vs streaming, orchestration, schema evolution, data quality gates, and the patterns that prevent 'Big Rewrite 2.0.'

Every data pipeline starts the same way: a cron job that runs a Python script that queries a database and dumps results into a CSV. It works for 6 months. Then data volume doubles, the source schema changes, the script silently starts producing wrong numbers, and someone makes a business decision based on data that has been wrong for 3 weeks.

This guide covers how to build data pipelines that are observable, recoverable, and maintainable — because the hardest part of data engineering is not moving data from A to B, it is knowing when the data at B is wrong.


Batch vs Streaming: The Decision Framework

This is the first architectural decision, and most teams get it wrong by defaulting to streaming because it sounds more modern.

| Factor | Batch | Streaming |
| --- | --- | --- |
| Latency requirement | Minutes to hours is fine | Seconds to sub-second required |
| Data volume | Any (batch handles massive volumes well) | High throughput, continuous flow |
| Complexity | Lower (well-understood patterns) | Higher (ordering, exactly-once, backpressure) |
| Cost | Lower (resources used only during runs) | Higher (always-on infrastructure) |
| Debugging | Easier (replay specific runs) | Harder (stateful, distributed) |
| When to choose | Analytics, reporting, ETL, ML training | Real-time dashboards, fraud detection, event processing |
Decision tree:

  "Do users need results in < 1 minute?"
    ├─ No  → Batch. Stop overthinking it.
    └─ Yes → "Is the data naturally event-driven?"
              ├─ No  → Micro-batch (5-minute windows). Simpler than streaming.
              └─ Yes → Streaming. Accept the complexity.

The uncomfortable truth: 80% of “real-time” requirements are actually “fast enough batch.” Before committing to streaming architecture, ask: “What business decision changes if this data is 15 minutes old instead of 15 seconds old?” If the answer is “none,” use batch.
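The decision tree above can be encoded as a small helper. This is purely illustrative — the function name and thresholds are mine, taken straight from the tree, not universal constants:

```python
def choose_processing_mode(max_latency_seconds: float, event_driven: bool) -> str:
    """Encode the decision tree: latency requirement first, data shape second."""
    if max_latency_seconds >= 60:
        return "batch"        # Minutes-to-hours latency: batch is enough.
    if not event_driven:
        return "micro-batch"  # Fast but not event-shaped: short windows, simpler ops.
    return "streaming"        # Sub-minute and event-driven: accept the complexity.

print(choose_processing_mode(3600, event_driven=False))  # batch
```

Writing the rule down as code forces the conversation the table asks for: someone has to supply an actual `max_latency_seconds`, not "real-time."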


Pipeline Architecture Patterns

Pattern 1: Extract-Load-Transform (ELT)

The modern default. Extract raw data, load it into a warehouse, transform it using SQL inside the warehouse.

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  Sources │───▶│ Ingestion│───▶│ Raw Layer│───▶│ Transform│
│  (APIs,  │    │ (Fivetran│    │ (BQ, Snow│    │ (dbt)    │
│  DBs,    │    │  Airbyte)│    │  flake)  │    │          │
│  Files)  │    └──────────┘    └──────────┘    └──────────┘
└──────────┘                                         │
                                                     ▼
                                              ┌──────────┐
                                              │ Marts    │
                                              │ (curated │
                                              │  tables) │
                                              └──────────┘

When to use: Analytics, reporting, data science. The warehouse does the heavy lifting.
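The flow above can be sketched end to end in a few lines. This is a toy sketch using sqlite3 as a stand-in for the warehouse; the table names and the raw records are hypothetical, and in practice the final `CREATE TABLE ... AS SELECT` would live in a dbt model:

```python
import sqlite3

# Extract: raw records as they arrive from a hypothetical source API.
raw_orders = [
    {"order_id": "A1", "total_amount": "19.99"},
    {"order_id": "A2", "total_amount": "5.00"},
]

conn = sqlite3.connect(":memory:")  # stand-in for BigQuery/Snowflake

# Load: land the data untouched in a raw layer (strings and all).
conn.execute("CREATE TABLE raw_orders (order_id TEXT, total_amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (:order_id, :total_amount)", raw_orders
)

# Transform: the warehouse does the typing and shaping (dbt would own this SQL).
conn.execute("""
    CREATE TABLE orders_mart AS
    SELECT order_id, CAST(total_amount AS REAL) AS total_amount
    FROM raw_orders
""")
total = conn.execute("SELECT SUM(total_amount) FROM orders_mart").fetchone()[0]
print(round(total, 2))  # 24.99
```

The point of the order of operations: because the raw layer keeps the untyped originals, a bad `CAST` can be fixed and re-run without going back to the source.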

Pattern 2: Event-Driven Pipeline

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Producers│───▶│ Message  │───▶│ Stream   │───▶│ Sinks    │
│ (Apps,   │    │ Broker   │    │ Processor│    │ (DB, S3, │
│  IoT)    │    │ (Kafka)  │    │ (Flink)  │    │  API)    │
└──────────┘    └──────────┘    └──────────┘    └──────────┘

When to use: Real-time fraud detection, live dashboards, event sourcing.
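The producer → broker → processor → sink flow can be mimicked in-process. This sketch uses `queue.Queue` as a stand-in for a Kafka topic; the event shape and the fraud threshold are made up for illustration:

```python
import queue

broker = queue.Queue()  # stand-in for a Kafka topic

# Producers publish events to the broker.
for amount in (120, 9800, 45):
    broker.put({"type": "payment", "amount": amount})

# Stream processor: consume each event, apply logic, route to a sink.
flagged = []  # stand-in sink for a fraud-alerts table
while not broker.empty():
    event = broker.get()
    if event["amount"] > 5000:  # toy fraud rule, purely illustrative
        flagged.append(event)

print(len(flagged))  # 1
```

A real stream processor adds the hard parts the table warned about — ordering guarantees, checkpointed state, and backpressure — which is exactly the complexity you sign up for by answering "yes" twice in the decision tree.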

Pattern 3: Medallion Architecture (Bronze → Silver → Gold)

Bronze (Raw):     Exact copy of source data. No transformations.
                  Schema-on-read. Append-only. The safety net.

Silver (Cleaned): Deduplicated, validated, typed, joined.
                  Schema-on-write. Business logic applied.

Gold (Curated):   Business-ready aggregations and metrics.
                  Optimized for specific use cases and consumers.
| Layer | Who Consumes | Quality Level | Retention |
| --- | --- | --- | --- |
| Bronze | Data engineers (debugging) | Raw, uncleaned | Forever (it is your backup) |
| Silver | Data scientists, analysts | Cleaned, validated | 2–5 years |
| Gold | Business users, dashboards | Aggregated, fast | As needed |
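The bronze → silver → gold promotion can be shown with plain Python. The records, the dedup key, and the validation rule are invented for the sketch; the structure is what matters:

```python
# Bronze: raw events exactly as received — duplicates and bad rows included.
bronze = [
    {"order_id": "A1", "amount": 30.0},
    {"order_id": "A1", "amount": 30.0},  # duplicate delivery
    {"order_id": "A2", "amount": -5.0},  # invalid amount
    {"order_id": "A3", "amount": 12.5},
]

# Silver: deduplicate on order_id and drop rows that fail validation.
seen, silver = set(), []
for row in bronze:
    if row["order_id"] in seen or row["amount"] <= 0:
        continue
    seen.add(row["order_id"])
    silver.append(row)

# Gold: business-ready aggregate for a specific consumer.
gold = {"order_count": len(silver), "revenue": sum(r["amount"] for r in silver)}
print(gold)  # {'order_count': 2, 'revenue': 42.5}
```

Notice that bronze is never mutated: when the silver rules turn out to be wrong, you fix them and rebuild silver and gold from bronze.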

Schema Evolution: The Silent Killer

Upstream schemas change. They always change. A column gets renamed, a field becomes nullable, a new field appears, an enum gains a value. If your pipeline does not handle this, it breaks — often silently.

Defensive Schema Patterns

# Pattern: schema validation at ingestion (pydantic v1-style validators)
import logging
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, validator

logger = logging.getLogger(__name__)

class OrderEvent(BaseModel):
    order_id: str
    customer_id: str
    total_amount: float
    currency: str = "USD"                    # Default for new field
    status: str
    created_at: datetime
    # Fields that might appear later:
    shipping_method: Optional[str] = None    # Optional = won't break if missing
    discount_code: Optional[str] = None

    @validator('total_amount')
    def amount_must_be_positive(cls, v):
        if v < 0:
            raise ValueError(f'Invalid amount: {v}')
        return v

    @validator('status')
    def validate_status(cls, v):
        valid = {'pending', 'confirmed', 'shipped', 'delivered', 'cancelled'}
        if v not in valid:
            # Log warning but don't crash — new statuses might be legitimate
            logger.warning(f"Unknown order status: {v}")
        return v

Schema Change Response Matrix

| Change Type | Impact | Response |
| --- | --- | --- |
| New optional column | None if handled | Add to schema with default |
| Column renamed | Pipeline breaks | Map old name → new name in ingestion |
| Column removed | Pipeline breaks | Make field optional, add monitoring |
| Type change (string → int) | Data corruption | Cast with validation, alert on failure |
| New enum value | Logic may break | Log unknown values, do not crash |
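The "map old name → new name" response is a one-liner worth spelling out. A minimal sketch — the rename mapping and field names are hypothetical:

```python
# Map legacy upstream column names to the canonical schema at ingestion time.
RENAMES = {"cust_id": "customer_id"}  # old name -> new name (hypothetical)

def normalize(record: dict) -> dict:
    """Apply the rename mapping so downstream code sees one stable schema."""
    return {RENAMES.get(key, key): value for key, value in record.items()}

print(normalize({"order_id": "A1", "cust_id": "C9"}))
# {'order_id': 'A1', 'customer_id': 'C9'}
```

Keeping the mapping in one place at the ingestion boundary means a rename upstream is a one-line change, not a hunt through every transformation.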

Data Quality Gates

Never trust upstream data. Build automated quality checks that run on every pipeline execution:

# Data quality checks using Great Expectations style
quality_checks = {
    "orders_table": [
        # Completeness: no nulls in critical fields
        {"check": "not_null", "columns": ["order_id", "customer_id", "total_amount"]},

        # Uniqueness: no duplicate orders
        {"check": "unique", "columns": ["order_id"]},

        # Freshness: most recent order within last 2 hours
        {"check": "freshness", "column": "created_at", "max_age": "2 hours"},

        # Volume: between 1K and 100K orders per run (anomaly detection)
        {"check": "row_count", "min": 1000, "max": 100000},

        # Referential integrity: every order has a valid customer
        {"check": "referential", "column": "customer_id", "reference": "customers.id"},

        # Statistical: total_amount should be within historical range
        {"check": "range", "column": "total_amount", "min": 0.01, "max": 50000},
    ]
}
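A runner for this config is straightforward. This sketch implements just two of the check types (`not_null` and `unique`) in pure Python to show the shape of a quality gate; a real pipeline would use Great Expectations or similar rather than hand-rolling it:

```python
def run_checks(rows, checks):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for check in checks:
        if check["check"] == "not_null":
            for col in check["columns"]:
                if any(row.get(col) is None for row in rows):
                    failures.append(f"null value in {col}")
        elif check["check"] == "unique":
            for col in check["columns"]:
                values = [row.get(col) for row in rows]
                if len(values) != len(set(values)):
                    failures.append(f"duplicates in {col}")
    return failures

rows = [{"order_id": "A1"}, {"order_id": "A1"}]
print(run_checks(rows, [{"check": "unique", "columns": ["order_id"]}]))
# ['duplicates in order_id']
```

The gate then decides, per the severity table below, whether a non-empty failure list blocks the pipeline or only fires an alert.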
| Check Type | What It Catches | Severity |
| --- | --- | --- |
| Null check | Missing required data | 🔴 Block pipeline |
| Uniqueness | Duplicate records | 🔴 Block pipeline |
| Freshness | Stale data / broken source | 🔴 Block pipeline |
| Volume anomaly | Source outage or data explosion | 🟡 Alert, investigate |
| Distribution shift | Upstream logic change | 🟡 Alert, investigate |
| Schema validation | Upstream schema change | 🔴 Block pipeline |

Orchestration: Airflow vs the Alternatives

| Orchestrator | Best For | Weakness |
| --- | --- | --- |
| Apache Airflow | Complex DAGs, mature ecosystem | Heavy; config-as-code is Python |
| Dagster | Software-defined assets, testing | Newer, smaller ecosystem |
| Prefect | Simple orchestration, fast setup | Less mature for complex DAGs |
| dbt Cloud | SQL-only transformation pipelines | Not for extraction or loading |
| Step Functions | AWS-native event orchestration | Vendor lock-in |

Airflow DAG Best Practices

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'execution_timeout': timedelta(hours=2),  # Kill hung tasks
    'on_failure_callback': alert_slack,
}

with DAG(
    'orders_pipeline',
    default_args=default_args,
    schedule_interval='0 */4 * * *',  # Every 4 hours
    catchup=False,                     # Don't backfill on deploy
    max_active_runs=1,                 # Prevent overlapping runs
    tags=['production', 'orders'],
) as dag:

    extract = PythonOperator(
        task_id='extract_orders',
        python_callable=extract_orders_from_api,
    )

    validate = PythonOperator(
        task_id='validate_data_quality',
        python_callable=run_quality_checks,
    )

    load = PythonOperator(
        task_id='load_to_warehouse',
        python_callable=load_to_bigquery,
    )

    # Always validate before loading
    extract >> validate >> load

Implementation Checklist

  • Decide batch vs streaming based on actual latency requirements (not aspirational ones)
  • Implement medallion architecture: bronze (raw) → silver (cleaned) → gold (curated)
  • Add schema validation at ingestion — never trust upstream data types
  • Build data quality gates that block pipeline execution on critical failures
  • Set up freshness monitoring: alert if data is older than expected
  • Implement idempotent pipeline runs (re-running produces the same result)
  • Add volume anomaly detection (±50% from baseline triggers alert)
  • Document every pipeline: source, schedule, owner, SLA, and failure runbook
  • Track pipeline reliability: success rate, mean run time, data latency end-to-end
  • Keep bronze layer forever — it is your disaster recovery for data
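The idempotency item from the checklist deserves a concrete shape. One common pattern is delete-then-insert by partition, sketched here with sqlite3 as a stand-in warehouse (table and dates are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (run_date TEXT, order_id TEXT)")

def load_partition(conn, run_date, order_ids):
    """Idempotent load: wipe the partition, then insert, in one transaction.
    Re-running the same run_date converges to the same state instead of
    stacking duplicates."""
    with conn:
        conn.execute("DELETE FROM orders WHERE run_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO orders VALUES (?, ?)",
            [(run_date, oid) for oid in order_ids],
        )

load_partition(conn, "2024-01-01", ["A1", "A2"])
load_partition(conn, "2024-01-01", ["A1", "A2"])  # rerun after a retry: no dupes
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2
```

This is what makes Airflow's `retries: 3` safe: a task that fails halfway can simply run again, because the second attempt overwrites rather than appends.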
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
