
Data Pipeline Monitoring & Alerting

Monitor data pipelines effectively. Covers pipeline observability, data freshness SLAs, failure detection, lineage-based impact analysis, and alerting without fatigue.

A data pipeline that silently fails is worse than one that loudly crashes. When a pipeline crashes, you get an alert. When it silently produces wrong data, the CFO discovers it three weeks later during a board meeting. Pipeline monitoring must catch both crashes and silent data quality degradation.


What to Monitor

| Category | Metric | Alert When |
| --- | --- | --- |
| Pipeline health | Job status (success/fail) | Any failure |
| Data freshness | Time since last successful load | Exceeds SLA (e.g., > 2 hours for hourly) |
| Data volume | Row count per run | < 50% or > 200% of typical volume |
| Data quality | Test pass rate | Any critical test fails |
| Performance | Pipeline duration | > 2x typical duration |
| Resource usage | CPU, memory, disk | > 80% utilization |
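The thresholds above can be applied with a small checker that compares one run against typical values. This is an illustrative sketch, not any specific tool's API; the dict keys and baseline values are made up.

```python
def evaluate_run(metrics, typical):
    """Compare one pipeline run against typical values; return alert strings.

    `metrics` and `typical` are plain dicts whose keys and thresholds
    mirror the table above. Illustrative only.
    """
    alerts = []
    if metrics["status"] != "success":
        alerts.append("CRITICAL: job failed")
    # Volume: outside 50%-200% of typical row count
    if not 0.5 * typical["rows"] <= metrics["rows"] <= 2.0 * typical["rows"]:
        alerts.append(f"WARNING: row count {metrics['rows']} outside "
                      f"50%-200% of typical {typical['rows']}")
    # Performance: more than 2x typical duration
    if metrics["duration_s"] > 2 * typical["duration_s"]:
        alerts.append(f"WARNING: duration {metrics['duration_s']}s "
                      f"> 2x typical {typical['duration_s']}s")
    return alerts

run = {"status": "success", "rows": 120, "duration_s": 900}
baseline = {"rows": 1000, "duration_s": 300}
for a in evaluate_run(run, baseline):
    print(a)
```

In practice these checks run as a post-load task in the orchestrator, so a breach fails the run visibly instead of passing silently.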

Alerting Architecture

```text
Data Pipeline                   Monitoring              Alerting
┌──────────┐                  ┌──────────────┐       ┌──────────────┐
│ Airflow  ├── metrics ──────▶│ Prometheus   │──────▶│ PagerDuty    │
│ dbt      ├── logs ─────────▶│ Grafana      │       │ Slack        │
│ Spark    ├── test results ─▶│ Datadog      │       │ Email        │
│ Fivetran ├── metadata ─────▶│ Monte Carlo  │       │              │
└──────────┘                  └──────────────┘       └──────────────┘
```
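One lightweight way to move metrics from the pipeline to the monitoring layer is to emit structured log events that a collector (a Prometheus exporter, Datadog agent, or similar) picks up. The event schema below is a hypothetical example, not a standard format.

```python
import json
import time

def emit_metric(pipeline, metric, value, tags=None):
    """Emit one metric event as a JSON log line for a collector to scrape.

    The field names here are a hypothetical schema; adapt them to
    whatever your collector expects.
    """
    event = {
        "ts": time.time(),
        "pipeline": pipeline,
        "metric": metric,
        "value": value,
        "tags": tags or {},
    }
    line = json.dumps(event, sort_keys=True)
    print(line)  # in production, write to the process's structured log stream
    return line

emit_metric("stripe_payments", "rows_loaded", 48210, {"env": "prod"})
```

Keeping the emitting side this thin means the pipeline stays decoupled from whichever monitoring backend sits on the other end of the arrow.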

Freshness SLAs

| Data Tier | Max Staleness | Check Frequency | Impact of Breach |
| --- | --- | --- | --- |
| Real-time | 5 minutes | Every minute | Trading, fraud detection |
| Near-real-time | 1 hour | Every 15 minutes | Operational dashboards |
| Daily | 4 hours past schedule | Every 30 minutes | Business reports |
| Weekly | 24 hours past schedule | Every 4 hours | Analytics, cohort analysis |
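A minimal sketch of enforcing these tiers in code, assuming you can query each table's last successful load time; the tier names and limits mirror the table above.

```python
from datetime import datetime, timedelta, timezone

# Max staleness per tier, mirroring the SLA table above.
SLA = {
    "real_time": timedelta(minutes=5),
    "near_real_time": timedelta(hours=1),
    "daily": timedelta(hours=4),    # measured past the scheduled load
    "weekly": timedelta(hours=24),  # measured past the scheduled load
}

def freshness_breach(tier, last_loaded_at, now=None):
    """Return how far staleness exceeds the tier's SLA, or None if within it."""
    now = now or datetime.now(timezone.utc)
    staleness = now - last_loaded_at
    limit = SLA[tier]
    return staleness - limit if staleness > limit else None

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
breach = freshness_breach("near_real_time", now - timedelta(hours=3), now=now)
print(breach)  # 2:00:00 — two hours past the one-hour SLA
```

Returning the breach size (rather than a boolean) lets the alerting layer escalate: a few minutes over might warn, hours over should page.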
```yaml
# dbt source freshness configuration
sources:
  - name: stripe
    tables:
      - name: payments
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 2, period: hour}
          error_after: {count: 4, period: hour}
```

Anomaly Detection

```python
import statistics

def check_volume_anomaly(table, current_count, lookback_days=30):
    """Detect unusual row counts using statistical bounds (mean ± 3σ)."""
    # get_daily_counts and alert are assumed to be provided by the
    # surrounding monitoring framework.
    historical = get_daily_counts(table, lookback_days)

    mean = statistics.mean(historical)
    stddev = statistics.stdev(historical)

    lower_bound = mean - (3 * stddev)
    upper_bound = mean + (3 * stddev)

    if current_count < lower_bound:
        alert(f"{table}: Row count {current_count} is unusually LOW "
              f"(expected {lower_bound:.0f}-{upper_bound:.0f})")
    elif current_count > upper_bound:
        alert(f"{table}: Row count {current_count} is unusually HIGH "
              f"(expected {lower_bound:.0f}-{upper_bound:.0f})")
```
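Mean/stddev bounds can themselves be distorted by past anomalies: one historical spike inflates the stddev and widens the bounds. A common robustification (not in the original, added here as an alternative) uses the median and median absolute deviation instead.

```python
import statistics

def robust_bounds(historical, k=3.0):
    """Median ± k * scaled MAD: an outlier-resistant analogue of mean ± 3σ.

    The 1.4826 factor makes MAD comparable to the stddev for
    normally distributed data.
    """
    med = statistics.median(historical)
    mad = statistics.median(abs(x - med) for x in historical)
    spread = 1.4826 * mad * k
    return med - spread, med + spread

# One historical spike barely moves the bounds:
counts = [1000, 1020, 980, 1010, 990, 5000]
lo, hi = robust_bounds(counts)
print(f"{lo:.0f}-{hi:.0f}")
```

With plain mean ± 3σ, the 5000-row spike in the history would stretch the bounds wide enough to miss a genuine drop; the MAD-based bounds stay tight around the normal range.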

Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| Alert on every failure | Alert fatigue, alerts ignored | Categorize: critical vs warning, deduplicate |
| No freshness monitoring | Stale data served without anyone knowing | Freshness SLAs with automated checks |
| Volume checks only | Correct count but wrong data | Combine volume + quality + freshness |
| Page on non-actionable alerts | Engineers wake up, can't do anything | Every page must have a runbook |
| DBA monitors everything | Bottleneck, slow response | Data team owns their pipeline alerts |
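The "categorize and deduplicate" fix can be sketched as a small router that suppresses repeats of the same alert key within a window and sends only critical alerts to the pager. The channel names and window are illustrative.

```python
import time

class AlertRouter:
    """Route alerts by severity; suppress duplicates within a time window."""

    def __init__(self, dedup_window_s=3600):
        self.dedup_window_s = dedup_window_s
        self._last_sent = {}  # alert key -> timestamp of last send

    def route(self, key, severity, message, now=None):
        """Return the channel the alert goes to, or None if deduplicated."""
        now = now if now is not None else time.time()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window_s:
            return None  # same alert fired recently; drop the repeat
        self._last_sent[key] = now
        # Critical pages a human; warnings go to chat for daytime triage.
        return "pagerduty" if severity == "critical" else "slack"

router = AlertRouter()
print(router.route("stripe.payments.freshness", "critical", "4h stale", now=0))   # pagerduty
print(router.route("stripe.payments.freshness", "critical", "4h stale", now=60))  # None (deduped)
print(router.route("stripe.payments.volume", "warning", "low rows", now=60))      # slack
```

Keying deduplication on the alert identity (table + check type) rather than the message text means a flapping check fires once per window instead of once per run.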

Checklist

- Freshness SLAs defined per data tier
- Pipeline health monitoring (success/fail/duration)
- Volume anomaly detection (statistical bounds)
- Data quality test results tracked and alerted
- Lineage-based impact analysis (which dashboards affected?)
- Alert routing: critical vs warning, right team
- Runbooks for every alert type
- Dashboards: pipeline status, freshness, quality score
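Lineage-based impact analysis from the checklist can be as simple as a breadth-first walk over an edge map from each asset to its downstream consumers. The lineage graph below is a made-up example; real deployments pull this from dbt manifests or a catalog tool.

```python
from collections import deque

# Hypothetical lineage: asset -> downstream assets that consume it.
LINEAGE = {
    "raw.stripe.payments": ["stg.payments"],
    "stg.payments": ["mart.revenue", "mart.churn"],
    "mart.revenue": ["dashboard.cfo_weekly"],
    "mart.churn": [],
}

def downstream_impact(asset, lineage):
    """Return every asset reachable downstream of `asset` (BFS)."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

impacted = downstream_impact("stg.payments", LINEAGE)
print(sorted(impacted))  # ['dashboard.cfo_weekly', 'mart.churn', 'mart.revenue']
```

Attaching this list to a failure alert ("stg.payments failed; affects dashboard.cfo_weekly") turns "a job broke" into "here is who cares", which is what makes the page actionable.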

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data engineering consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
