# Data Pipeline Monitoring & Alerting
Monitor data pipelines effectively. Covers pipeline observability, data freshness SLAs, failure detection, lineage-based impact analysis, and alerting without fatigue.
A data pipeline that silently fails is worse than one that loudly crashes. When a pipeline crashes, you get an alert. When it silently produces wrong data, the CFO discovers it three weeks later during a board meeting. Pipeline monitoring must catch both crashes and silent data quality degradation.
## What to Monitor
| Category | Metric | Alert When |
|---|---|---|
| Pipeline health | Job status (success/fail) | Any failure |
| Data freshness | Time since last successful load | Exceeds SLA (e.g., > 2 hours for hourly) |
| Data volume | Row count per run | < 50% or > 200% of typical volume |
| Data quality | Test pass rate | Any critical test fails |
| Performance | Pipeline duration | > 2x typical duration |
| Resource usage | CPU, memory, disk | > 80% utilization |
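The thresholds in the table can be evaluated mechanically per run. A minimal sketch, assuming a hypothetical `RunStats` record holding one run's metrics alongside typical values (the field names and thresholds are illustrative, not tied to any orchestrator):

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    succeeded: bool
    row_count: int
    typical_row_count: int
    duration_s: float
    typical_duration_s: float

def evaluate_run(stats: RunStats) -> list[str]:
    """Return the alert reasons triggered by a single pipeline run."""
    alerts = []
    if not stats.succeeded:
        alerts.append("job failed")
    if stats.row_count < 0.5 * stats.typical_row_count:
        alerts.append("row count below 50% of typical")
    elif stats.row_count > 2.0 * stats.typical_row_count:
        alerts.append("row count above 200% of typical")
    if stats.duration_s > 2.0 * stats.typical_duration_s:
        alerts.append("duration above 2x typical")
    return alerts
```

In practice "typical" values would come from a rolling window over recent runs rather than a hard-coded baseline.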
## Alerting Architecture

```
 Data Pipeline            Monitoring                Alerting
┌──────────┐                  ┌──────────────┐       ┌──────────────┐
│ Airflow  ├── metrics ──────▶│ Prometheus   │──────▶│ PagerDuty    │
│ dbt      ├── logs ─────────▶│ Grafana      │       │ Slack        │
│ Spark    ├── test results ─▶│ Datadog      │       │ Email        │
│ Fivetran ├── metadata ─────▶│ Monte Carlo  │       │              │
└──────────┘                  └──────────────┘       └──────────────┘
```
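The right-hand column is a severity decision: pages go to PagerDuty, triage goes to Slack, digests go to email. A minimal routing sketch, where the channel names and payload shape are assumptions for illustration:

```python
def route_alert(source: str, event: str, severity: str) -> dict:
    """Build an alert payload and pick a destination by severity."""
    destinations = {
        "critical": "pagerduty",  # page on-call
        "warning": "slack",       # async triage
        "info": "email",          # digest only
    }
    return {
        "source": source,  # e.g. "airflow", "dbt", "fivetran"
        "event": event,
        "severity": severity,
        "destination": destinations.get(severity, "slack"),
    }
```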
## Freshness SLAs
| Data Tier | Max Staleness | Check Frequency | Use Cases Affected by Breach |
|---|---|---|---|
| Real-time | 5 minutes | Every minute | Trading, fraud detection |
| Near-real-time | 1 hour | Every 15 minutes | Operational dashboards |
| Daily | 4 hours past schedule | Every 30 minutes | Business reports |
| Weekly | 24 hours past schedule | Every 4 hours | Analytics, cohort analysis |
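A freshness check reduces to comparing "now minus last successful load" against the tier's window. A sketch of the table as code, with the schedule-relative tiers (daily, weekly) simplified to absolute windows for illustration:

```python
from datetime import datetime, timedelta, timezone

# Max staleness per tier, mirroring the SLA table above. Daily and weekly
# fold the schedule interval plus the grace period into one absolute window.
MAX_STALENESS = {
    "real-time": timedelta(minutes=5),
    "near-real-time": timedelta(hours=1),
    "daily": timedelta(hours=28),  # 24h schedule + 4h grace
    "weekly": timedelta(days=8),   # 7d schedule + 24h grace
}

def is_fresh(tier: str, last_loaded_at: datetime, now=None) -> bool:
    """True if the last successful load is within the tier's SLA window."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= MAX_STALENESS[tier]
```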
```yaml
# dbt source freshness check (sources.yml)
sources:
  - name: stripe
    tables:
      - name: payments
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 2, period: hour}
          error_after: {count: 4, period: hour}
```
## Anomaly Detection
```python
import statistics

def check_volume_anomaly(table, current_count, lookback_days=30):
    """Detect unusual row counts using statistical bounds (3 sigma)."""
    # get_daily_counts and alert are assumed helpers: a warehouse query
    # returning one count per day, and a notification hook.
    historical = get_daily_counts(table, lookback_days)
    mean = statistics.mean(historical)
    stddev = statistics.stdev(historical)
    lower_bound = mean - (3 * stddev)
    upper_bound = mean + (3 * stddev)
    if current_count < lower_bound:
        alert(f"{table}: Row count {current_count} is unusually LOW "
              f"(expected {lower_bound:.0f}-{upper_bound:.0f})")
    elif current_count > upper_bound:
        alert(f"{table}: Row count {current_count} is unusually HIGH "
              f"(expected {lower_bound:.0f}-{upper_bound:.0f})")
```
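One caveat with raw 3-sigma bounds: tables with weekly seasonality will flag every quiet weekend as "unusually LOW". A hedged refinement, comparing against the same weekday only (the function name and `(weekday, count)` input shape are assumptions for this sketch):

```python
import statistics

def weekday_bounds(daily_counts, weekday, k=3.0):
    """Bounds for one weekday; daily_counts is a list of (weekday, count) pairs."""
    same_day = [count for day, count in daily_counts if day == weekday]
    mean = statistics.mean(same_day)
    # stdev needs at least two samples; fall back to zero spread otherwise.
    stddev = statistics.stdev(same_day) if len(same_day) > 1 else 0.0
    return mean - k * stddev, mean + k * stddev
```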
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Alert on every failure | Alert fatigue, alerts ignored | Categorize: critical vs warning, deduplicate |
| No freshness monitoring | Stale data served without anyone knowing | Freshness SLAs with automated checks |
| Volume checks only | Correct count but wrong data | Combine volume + quality + freshness |
| Page on non-actionable alerts | Engineers wake up, can’t do anything | Every page must have a runbook |
| DBA monitors everything | Bottleneck, slow response | Data team owns their pipeline alerts |
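The deduplication fix in the first row can be as simple as a cooldown per alert key, so a flapping pipeline produces one page instead of fifty. A minimal sketch; the 30-minute window is an illustrative choice, not a recommendation:

```python
from datetime import datetime, timedelta

class AlertDeduper:
    """Suppress repeats of the same alert key within a cooldown window."""

    def __init__(self, cooldown=timedelta(minutes=30)):
        self.cooldown = cooldown
        self.last_sent = {}  # alert key -> last send time

    def should_send(self, key: str, now: datetime) -> bool:
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown: suppress
        self.last_sent[key] = now
        return True
```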
## Checklist
- Freshness SLAs defined per data tier
- Pipeline health monitoring (success/fail/duration)
- Volume anomaly detection (statistical bounds)
- Data quality test results tracked and alerted
- Lineage-based impact analysis (which dashboards affected?)
- Alert routing: critical vs warning, right team
- Runbooks for every alert type
- Dashboards: pipeline status, freshness, quality score
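Lineage-based impact analysis from the checklist is a graph walk: given table-to-consumer edges, everything reachable downstream of a failed source is affected. A sketch, with the edge-list shape and node names assumed for illustration:

```python
from collections import deque

def downstream_impact(lineage, failed):
    """Breadth-first walk: all nodes reachable downstream of the failed one.

    lineage maps each node to its direct downstream consumers
    (tables or dashboards).
    """
    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted
```

The result answers the "which dashboards are affected?" question directly, and can be attached to the alert itself so on-call knows the blast radius.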
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data engineering consulting, visit garnetgrid.com.
:::