# Data Pipeline Monitoring & Alerting
Monitor data pipelines effectively. Covers pipeline observability, data freshness SLAs, failure detection, lineage-based impact analysis, and alerting without fatigue.
A data pipeline that silently fails is worse than one that loudly crashes. When a pipeline crashes, you get an alert. When it silently produces wrong data, the CFO discovers it three weeks later during a board meeting. Pipeline monitoring must catch both crashes and silent data quality degradation.
## What to Monitor
| Category | Metric | Alert When |
|---|---|---|
| Pipeline health | Job status (success/fail) | Any failure |
| Data freshness | Time since last successful load | Exceeds SLA (e.g., > 2 hours for hourly) |
| Data volume | Row count per run | < 50% or > 200% of typical volume |
| Data quality | Test pass rate | Any critical test fails |
| Performance | Pipeline duration | > 2x typical duration |
| Resource usage | CPU, memory, disk | > 80% utilization |
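The thresholds in the table can be evaluated mechanically per run. A minimal sketch, assuming a hypothetical `RunStats` record holding one run's metrics alongside typical values (the field names and thresholds are illustrative, not tied to any orchestrator):

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    succeeded: bool
    row_count: int
    typical_row_count: int
    duration_s: float
    typical_duration_s: float

def evaluate_run(stats: RunStats) -> list[str]:
    """Return the alert reasons triggered by a single pipeline run."""
    alerts = []
    if not stats.succeeded:
        alerts.append("job failed")
    if stats.row_count < 0.5 * stats.typical_row_count:
        alerts.append("row count below 50% of typical")
    elif stats.row_count > 2.0 * stats.typical_row_count:
        alerts.append("row count above 200% of typical")
    if stats.duration_s > 2.0 * stats.typical_duration_s:
        alerts.append("duration above 2x typical")
    return alerts
```

In practice "typical" values would come from a rolling window over recent runs rather than a hard-coded baseline.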
## Alerting Architecture

```
 Data Pipeline            Monitoring                Alerting
┌──────────┐                  ┌──────────────┐       ┌──────────────┐
│ Airflow  ├── metrics ──────▶│ Prometheus   │──────▶│ PagerDuty    │
│ dbt      ├── logs ─────────▶│ Grafana      │       │ Slack        │
│ Spark    ├── test results ─▶│ Datadog      │       │ Email        │
│ Fivetran ├── metadata ─────▶│ Monte Carlo  │       │              │
└──────────┘                  └──────────────┘       └──────────────┘
```
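The right-hand column is a severity decision: pages go to PagerDuty, triage goes to Slack, digests go to email. A minimal routing sketch, where the channel names and payload shape are assumptions for illustration:

```python
def route_alert(source: str, event: str, severity: str) -> dict:
    """Build an alert payload and pick a destination by severity."""
    destinations = {
        "critical": "pagerduty",  # page on-call
        "warning": "slack",       # async triage
        "info": "email",          # digest only
    }
    return {
        "source": source,  # e.g. "airflow", "dbt", "fivetran"
        "event": event,
        "severity": severity,
        "destination": destinations.get(severity, "slack"),
    }
```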
## Freshness SLAs
| Data Tier | Max Staleness | Check Frequency | Use Cases Affected by Breach |
|---|---|---|---|
| Real-time | 5 minutes | Every minute | Trading, fraud detection |
| Near-real-time | 1 hour | Every 15 minutes | Operational dashboards |
| Daily | 4 hours past schedule | Every 30 minutes | Business reports |
| Weekly | 24 hours past schedule | Every 4 hours | Analytics, cohort analysis |
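A freshness check reduces to comparing "now minus last successful load" against the tier's window. A sketch of the table as code, with the schedule-relative tiers (daily, weekly) simplified to absolute windows for illustration:

```python
from datetime import datetime, timedelta, timezone

# Max staleness per tier, mirroring the SLA table above. Daily and weekly
# fold the schedule interval plus the grace period into one absolute window.
MAX_STALENESS = {
    "real-time": timedelta(minutes=5),
    "near-real-time": timedelta(hours=1),
    "daily": timedelta(hours=28),  # 24h schedule + 4h grace
    "weekly": timedelta(days=8),   # 7d schedule + 24h grace
}

def is_fresh(tier: str, last_loaded_at: datetime, now=None) -> bool:
    """True if the last successful load is within the tier's SLA window."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= MAX_STALENESS[tier]
```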
```yaml
# dbt source freshness check (sources.yml)
sources:
  - name: stripe
    tables:
      - name: payments
        loaded_at_field: updated_at
        freshness:
          warn_after: {count: 2, period: hour}
          error_after: {count: 4, period: hour}
```
## Anomaly Detection
```python
import statistics

def check_volume_anomaly(table, current_count, lookback_days=30):
    """Detect unusual row counts using statistical bounds (3 sigma)."""
    # get_daily_counts and alert are assumed helpers: a warehouse query
    # returning one count per day, and a notification hook.
    historical = get_daily_counts(table, lookback_days)
    mean = statistics.mean(historical)
    stddev = statistics.stdev(historical)
    lower_bound = mean - (3 * stddev)
    upper_bound = mean + (3 * stddev)
    if current_count < lower_bound:
        alert(f"{table}: Row count {current_count} is unusually LOW "
              f"(expected {lower_bound:.0f}-{upper_bound:.0f})")
    elif current_count > upper_bound:
        alert(f"{table}: Row count {current_count} is unusually HIGH "
              f"(expected {lower_bound:.0f}-{upper_bound:.0f})")
```
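One caveat with raw 3-sigma bounds: tables with weekly seasonality will flag every quiet weekend as "unusually LOW". A hedged refinement, comparing against the same weekday only (the function name and `(weekday, count)` input shape are assumptions for this sketch):

```python
import statistics

def weekday_bounds(daily_counts, weekday, k=3.0):
    """Bounds for one weekday; daily_counts is a list of (weekday, count) pairs."""
    same_day = [count for day, count in daily_counts if day == weekday]
    mean = statistics.mean(same_day)
    # stdev needs at least two samples; fall back to zero spread otherwise.
    stddev = statistics.stdev(same_day) if len(same_day) > 1 else 0.0
    return mean - k * stddev, mean + k * stddev
```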
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Alert on every failure | Alert fatigue, alerts ignored | Categorize: critical vs warning, deduplicate |
| No freshness monitoring | Stale data served without anyone knowing | Freshness SLAs with automated checks |
| Volume checks only | Correct count but wrong data | Combine volume + quality + freshness |
| Page on non-actionable alerts | Engineers wake up, can’t do anything | Every page must have a runbook |
| DBA monitors everything | Bottleneck, slow response | Data team owns their pipeline alerts |
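The deduplication fix in the first row can be as simple as a cooldown per alert key, so a flapping pipeline produces one page instead of fifty. A minimal sketch; the 30-minute window is an illustrative choice, not a recommendation:

```python
from datetime import datetime, timedelta

class AlertDeduper:
    """Suppress repeats of the same alert key within a cooldown window."""

    def __init__(self, cooldown=timedelta(minutes=30)):
        self.cooldown = cooldown
        self.last_sent = {}  # alert key -> last send time

    def should_send(self, key: str, now: datetime) -> bool:
        last = self.last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown: suppress
        self.last_sent[key] = now
        return True
```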
## Checklist
- Freshness SLAs defined per data tier
- Pipeline health monitoring (success/fail/duration)
- Volume anomaly detection (statistical bounds)
- Data quality test results tracked and alerted
- Lineage-based impact analysis (which dashboards affected?)
- Alert routing: critical vs warning, right team
- Runbooks for every alert type
- Dashboards: pipeline status, freshness, quality score
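Lineage-based impact analysis from the checklist is a graph walk: given table-to-consumer edges, everything reachable downstream of a failed source is affected. A sketch, with the edge-list shape and node names assumed for illustration:

```python
from collections import deque

def downstream_impact(lineage, failed):
    """Breadth-first walk: all nodes reachable downstream of the failed one.

    lineage maps each node to its direct downstream consumers
    (tables or dashboards).
    """
    impacted, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted
```

The result answers the "which dashboards are affected?" question directly, and can be attached to the alert itself so on-call knows the blast radius.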
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data engineering consulting, visit garnetgrid.com.
:::