Data Quality Monitoring in Production
How to monitor data quality in production pipelines. Covers data contracts, schema validation, anomaly detection, lineage tracking, and building a data quality culture.
Bad data is the silent killer of analytics and ML systems. A model trained on correct data but served corrupted features will produce confidently wrong answers. A dashboard built on a table whose upstream schema silently changed will show zeros without raising a single error. Data quality issues are harder to detect than application bugs because the system doesn’t crash — it just produces wrong results that erode trust.
Production data quality monitoring requires automated validation at every stage of the pipeline, clear ownership of data contracts, and fast alerting when quality degrades.
The Data Quality Dimensions
| Dimension | Definition | Example Check |
|---|---|---|
| Completeness | No missing values where expected | NULL rate < 1% for required fields |
| Accuracy | Values match real-world truth | Revenue figures reconcile with source |
| Consistency | Same data, same answer across systems | Customer count matches between CRM and warehouse |
| Freshness | Data arrives on time | Table updated within last 4 hours |
| Uniqueness | No duplicate records | Primary key uniqueness check |
| Validity | Values conform to expected format/range | Email contains @, age between 0 and 150 |
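Most of these checks are one-liners in pandas. A quick sketch against a toy `orders` frame (the columns and data here are illustrative, not from the contract below):

```python
import pandas as pd

# Toy data with one missing email and one out-of-range age
orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a3"],
    "email": ["x@example.com", "y@example.com", None],
    "age": [34, 29, 151],
})

# Completeness: NULL rate for a required field
null_rate = orders["email"].isna().mean()

# Uniqueness: primary-key column has no duplicates
pk_unique = orders["order_id"].is_unique

# Validity: share of values inside the expected range
valid_age_rate = orders["age"].between(0, 150).mean()

print(null_rate, pk_unique, valid_age_rate)
```

Consistency and accuracy are the exceptions: they require comparing against a second system or a source of truth, so they can't be computed from one DataFrame alone.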
Data Contracts
A data contract is a formal agreement between a data producer and consumer about the schema, semantics, and quality of a dataset.
```yaml
# contracts/orders.yaml
contract:
  name: orders
  owner: payments-team
  sla:
    freshness: 1h
    availability: 99.9%
  schema:
    - name: order_id
      type: string
      required: true
      unique: true
    - name: customer_id
      type: string
      required: true
    - name: total_amount
      type: decimal
      required: true
      constraints:
        min: 0
        max: 1000000
    - name: status
      type: string
      required: true
      allowed_values: [pending, confirmed, shipped, delivered, cancelled]
    - name: created_at
      type: timestamp
      required: true
      constraints:
        not_in_future: true
  quality_rules:
    - name: no_negative_totals
      sql: "SELECT COUNT(*) FROM orders WHERE total_amount < 0"
      threshold: 0
    - name: valid_status_transitions
      sql: "SELECT COUNT(*) FROM orders WHERE status = 'delivered' AND shipped_at IS NULL"
      threshold: 0
```
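Once parsed (e.g. with PyYAML's `yaml.safe_load`), the contract becomes a plain dict that generic validation code can interpret. A minimal sketch of record-level enforcement (`check_record` is a hypothetical helper; it handles only the `allowed_values` and `min`/`max` constraints):

```python
def check_record(record: dict, schema: list[dict]) -> list[str]:
    """Return human-readable contract violations for a single record."""
    violations = []
    for col in schema:
        name = col["name"]
        value = record.get(name)
        if value is None:
            if col.get("required"):
                violations.append(f"{name}: missing required value")
            continue
        allowed = col.get("allowed_values")
        if allowed is not None and value not in allowed:
            violations.append(f"{name}: {value!r} not in allowed_values")
        constraints = col.get("constraints", {})
        if "min" in constraints and value < constraints["min"]:
            violations.append(f"{name}: {value} below min {constraints['min']}")
        if "max" in constraints and value > constraints["max"]:
            violations.append(f"{name}: {value} above max {constraints['max']}")
    return violations


# Two fields from the orders contract above
schema = [
    {"name": "total_amount", "required": True,
     "constraints": {"min": 0, "max": 1000000}},
    {"name": "status", "required": True,
     "allowed_values": ["pending", "confirmed", "shipped",
                        "delivered", "cancelled"]},
]
violations = check_record({"total_amount": -5, "status": "refunded"}, schema)
print(violations)  # one violation per bad field
```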
Automated Validation Pipeline
```python
import pandas as pd


class DataQualityMonitor:
    def __init__(self, contract: dict):
        self.contract = contract
        self.alerts = []

    def validate(self, df: pd.DataFrame) -> dict:
        results = {
            'completeness': self._check_completeness(df),
            'validity': self._check_validity(df),
            'freshness': self._check_freshness(df),
            'uniqueness': self._check_uniqueness(df),
            'custom_rules': self._check_custom_rules(df),
        }
        results['score'] = sum(
            1 for v in results.values()
            if isinstance(v, dict) and v.get('passed', False)
        ) / len(results)
        return results

    def _check_completeness(self, df):
        issues = {}
        for col in self.contract['schema']:
            if col['required']:
                null_rate = df[col['name']].isna().mean()
                if null_rate > 0.01:  # matches the 1% completeness SLA
                    issues[col['name']] = f'{null_rate:.2%} null'
        return {'passed': len(issues) == 0, 'issues': issues}

    def _check_validity(self, df):
        issues = {}
        for col in self.contract['schema']:
            series = df[col['name']].dropna()
            allowed = col.get('allowed_values')
            if allowed is not None and not series.isin(allowed).all():
                issues[col['name']] = 'value outside allowed_values'
            constraints = col.get('constraints', {})
            if 'min' in constraints and (series < constraints['min']).any():
                issues[col['name']] = f"below min {constraints['min']}"
            if 'max' in constraints and (series > constraints['max']).any():
                issues[col['name']] = f"above max {constraints['max']}"
        return {'passed': len(issues) == 0, 'issues': issues}

    def _check_freshness(self, df):
        if 'created_at' in df.columns:
            max_time = df['created_at'].max()
            age = (pd.Timestamp.now() - max_time).total_seconds() / 3600
            sla_hours = int(self.contract['sla']['freshness'].replace('h', ''))
            return {'passed': age <= sla_hours, 'age_hours': round(age, 1)}
        return {'passed': True}

    def _check_uniqueness(self, df):
        issues = {}
        for col in self.contract['schema']:
            if col.get('unique') and df[col['name']].duplicated().any():
                issues[col['name']] = 'duplicate values'
        return {'passed': len(issues) == 0, 'issues': issues}

    def _check_custom_rules(self, df):
        # The contract's SQL rules run against the warehouse, not this
        # DataFrame; execute them in a separate warehouse-side job.
        return {'passed': True}
```
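The freshness check assumes SLAs are always whole hours like `1h`. If contracts also use minutes or days, a small parser helps (`parse_sla_hours` is a hypothetical helper, sketched here):

```python
import re


def parse_sla_hours(sla: str) -> float:
    """Convert an SLA string like '30m', '4h', or '2d' into hours."""
    match = re.fullmatch(r"(\d+)\s*([mhd])", sla.strip())
    if match is None:
        raise ValueError(f"unrecognized SLA format: {sla!r}")
    value, unit = int(match.group(1)), match.group(2)
    return {"m": value / 60, "h": float(value), "d": value * 24.0}[unit]


print(parse_sla_hours("4h"))   # 4.0
print(parse_sla_hours("30m"))  # 0.5
```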
Anomaly Detection for Data Quality
Statistical checks catch issues that rule-based validation misses:
Volume Anomalies
```sql
-- Alert if today's row count deviates > 3 sigma from the 30-day average
WITH daily_counts AS (
    SELECT DATE(created_at) AS dt, COUNT(*) AS cnt
    FROM orders
    WHERE created_at >= CURRENT_DATE - 30
    GROUP BY DATE(created_at)
),
stats AS (
    -- Baseline excludes today so a spike can't inflate its own average
    SELECT AVG(cnt) AS avg_count, STDDEV(cnt) AS std_count
    FROM daily_counts
    WHERE dt < CURRENT_DATE
)
SELECT
    d.cnt AS today_count,
    s.avg_count,
    s.std_count,
    ABS(d.cnt - s.avg_count) / NULLIF(s.std_count, 0) AS z_score
FROM daily_counts d
CROSS JOIN stats s
WHERE d.dt = CURRENT_DATE
  AND ABS(d.cnt - s.avg_count) / NULLIF(s.std_count, 0) > 3;
```
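The same test can run in-process before loading. A pandas sketch with illustrative counts (the last entry stands in for today's load):

```python
import pandas as pd

# Daily row counts for the trailing window; the last value is today's
# (numbers are illustrative)
daily_counts = pd.Series([1000, 1020, 980, 1010, 990, 5000])

history = daily_counts.iloc[:-1]
today = daily_counts.iloc[-1]
z = abs(today - history.mean()) / history.std()  # sample std (ddof=1)

if z > 3:
    print(f"volume anomaly: today={today}, z-score={z:.1f}")
```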
Distribution Shift
Monitor the Population Stability Index (PSI) for each column:
- PSI < 0.1: No significant shift
- PSI 0.1-0.25: Moderate shift — investigate
- PSI > 0.25: Significant shift — alert immediately
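PSI bins a reference sample of the column and compares bin frequencies against the current sample. A minimal NumPy sketch (the `psi` helper and its epsilon smoothing are illustrative choices, not a standard API):

```python
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `actual` against a reference sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clip so values outside the reference range land in the outer bins
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # smooth empty bins to avoid log(0)
    e_pct, a_pct = e_pct + eps, a_pct + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
psi_same = psi(baseline, rng.normal(0, 1, 10_000))       # same distribution
psi_shifted = psi(baseline, rng.normal(1.0, 1, 10_000))  # mean shifted 1 sigma
print(f"no shift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

Computing PSI per column on each load, against a frozen reference window, turns the thresholds above into automatable alerts.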
Building a Data Quality Culture
- Assign ownership: Every dataset has an owning team responsible for its quality
- Measure and report: Weekly data quality score per team, visible to leadership
- Block on failure: Critical quality checks gate downstream pipelines — don’t process bad data
- Incident process: Treat data quality incidents like production outages
- Shift left: Validate data at ingestion, not consumption
The organizations with the best data quality don’t have the best monitoring tools — they have cultures where data producers feel responsible for the quality of what they ship.