Data Quality Monitoring in Production
How to monitor data quality in production pipelines. Covers data contracts, schema validation, anomaly detection, lineage tracking, and building a data quality culture.
Bad data is the silent killer of analytics and ML systems. A model trained on correct data but served corrupted features will produce confidently wrong answers. A dashboard built on a table whose upstream schema silently changed will show zeros without raising a single error. Data quality issues are harder to detect than application bugs because the system doesn’t crash — it just produces wrong results that erode trust.
Production data quality monitoring requires automated validation at every stage of the pipeline, clear ownership of data contracts, and fast alerting when quality degrades.
The Data Quality Dimensions
| Dimension | Definition | Example Check |
|---|---|---|
| Completeness | No missing values where expected | NULL rate < 1% for required fields |
| Accuracy | Values match real-world truth | Revenue figures reconcile with source |
| Consistency | Same data, same answer across systems | Customer count matches between CRM and warehouse |
| Freshness | Data arrives on time | Table updated within last 4 hours |
| Uniqueness | No duplicate records | Primary key uniqueness check |
| Validity | Values conform to expected format/range | Email contains @, age between 0 and 150 |
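Most of these checks are one-liners in pandas. A quick sketch against a toy `orders` frame (the columns and data here are illustrative, not from the contract below):

```python
import pandas as pd

# Toy data with one missing email and one out-of-range age
orders = pd.DataFrame({
    "order_id": ["a1", "a2", "a3"],
    "email": ["x@example.com", "y@example.com", None],
    "age": [34, 29, 151],
})

# Completeness: NULL rate for a required field
null_rate = orders["email"].isna().mean()

# Uniqueness: primary-key column has no duplicates
pk_unique = orders["order_id"].is_unique

# Validity: share of values inside the expected range
valid_age_rate = orders["age"].between(0, 150).mean()

print(null_rate, pk_unique, valid_age_rate)
```

Consistency and accuracy are the exceptions: they require comparing against a second system or a source of truth, so they can't be computed from one DataFrame alone.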
Data Contracts
A data contract is a formal agreement between a data producer and consumer about the schema, semantics, and quality of a dataset.
```yaml
# contracts/orders.yaml
contract:
  name: orders
  owner: payments-team
  sla:
    freshness: 1h
    availability: 99.9%
  schema:
    - name: order_id
      type: string
      required: true
      unique: true
    - name: customer_id
      type: string
      required: true
    - name: total_amount
      type: decimal
      required: true
      constraints:
        min: 0
        max: 1000000
    - name: status
      type: string
      required: true
      allowed_values: [pending, confirmed, shipped, delivered, cancelled]
    - name: created_at
      type: timestamp
      required: true
      constraints:
        not_in_future: true
  quality_rules:
    - name: no_negative_totals
      sql: "SELECT COUNT(*) FROM orders WHERE total_amount < 0"
      threshold: 0
    - name: valid_status_transitions
      sql: "SELECT COUNT(*) FROM orders WHERE status = 'delivered' AND shipped_at IS NULL"
      threshold: 0
```
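Once parsed (e.g. with PyYAML's `yaml.safe_load`), the contract becomes a plain dict that generic validation code can interpret. A minimal sketch of record-level enforcement (`check_record` is a hypothetical helper; it handles only the `allowed_values` and `min`/`max` constraints):

```python
def check_record(record: dict, schema: list[dict]) -> list[str]:
    """Return human-readable contract violations for a single record."""
    violations = []
    for col in schema:
        name = col["name"]
        value = record.get(name)
        if value is None:
            if col.get("required"):
                violations.append(f"{name}: missing required value")
            continue
        allowed = col.get("allowed_values")
        if allowed is not None and value not in allowed:
            violations.append(f"{name}: {value!r} not in allowed_values")
        constraints = col.get("constraints", {})
        if "min" in constraints and value < constraints["min"]:
            violations.append(f"{name}: {value} below min {constraints['min']}")
        if "max" in constraints and value > constraints["max"]:
            violations.append(f"{name}: {value} above max {constraints['max']}")
    return violations


# Two fields from the orders contract above
schema = [
    {"name": "total_amount", "required": True,
     "constraints": {"min": 0, "max": 1000000}},
    {"name": "status", "required": True,
     "allowed_values": ["pending", "confirmed", "shipped",
                        "delivered", "cancelled"]},
]
violations = check_record({"total_amount": -5, "status": "refunded"}, schema)
print(violations)  # one violation per bad field
```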
Automated Validation Pipeline
```python
import pandas as pd


class DataQualityMonitor:
    def __init__(self, contract: dict):
        self.contract = contract
        self.alerts = []

    def validate(self, df: pd.DataFrame) -> dict:
        results = {
            'completeness': self._check_completeness(df),
            'validity': self._check_validity(df),
            'freshness': self._check_freshness(df),
            'uniqueness': self._check_uniqueness(df),
            'custom_rules': self._check_custom_rules(df),
        }
        results['score'] = sum(
            1 for v in results.values()
            if isinstance(v, dict) and v.get('passed', False)
        ) / len(results)
        return results

    def _check_completeness(self, df):
        issues = {}
        for col in self.contract['schema']:
            if col['required']:
                null_rate = df[col['name']].isna().mean()
                if null_rate > 0.01:  # matches the 1% completeness SLA
                    issues[col['name']] = f'{null_rate:.2%} null'
        return {'passed': len(issues) == 0, 'issues': issues}

    def _check_validity(self, df):
        issues = {}
        for col in self.contract['schema']:
            series = df[col['name']].dropna()
            allowed = col.get('allowed_values')
            if allowed is not None and not series.isin(allowed).all():
                issues[col['name']] = 'value outside allowed_values'
            constraints = col.get('constraints', {})
            if 'min' in constraints and (series < constraints['min']).any():
                issues[col['name']] = f"below min {constraints['min']}"
            if 'max' in constraints and (series > constraints['max']).any():
                issues[col['name']] = f"above max {constraints['max']}"
        return {'passed': len(issues) == 0, 'issues': issues}

    def _check_freshness(self, df):
        if 'created_at' in df.columns:
            max_time = df['created_at'].max()
            age = (pd.Timestamp.now() - max_time).total_seconds() / 3600
            sla_hours = int(self.contract['sla']['freshness'].replace('h', ''))
            return {'passed': age <= sla_hours, 'age_hours': round(age, 1)}
        return {'passed': True}

    def _check_uniqueness(self, df):
        issues = {}
        for col in self.contract['schema']:
            if col.get('unique') and df[col['name']].duplicated().any():
                issues[col['name']] = 'duplicate values'
        return {'passed': len(issues) == 0, 'issues': issues}

    def _check_custom_rules(self, df):
        # The contract's SQL rules run against the warehouse, not this
        # DataFrame; execute them in a separate warehouse-side job.
        return {'passed': True}
```
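The freshness check assumes SLAs are always whole hours like `1h`. If contracts also use minutes or days, a small parser helps (`parse_sla_hours` is a hypothetical helper, sketched here):

```python
import re


def parse_sla_hours(sla: str) -> float:
    """Convert an SLA string like '30m', '4h', or '2d' into hours."""
    match = re.fullmatch(r"(\d+)\s*([mhd])", sla.strip())
    if match is None:
        raise ValueError(f"unrecognized SLA format: {sla!r}")
    value, unit = int(match.group(1)), match.group(2)
    return {"m": value / 60, "h": float(value), "d": value * 24.0}[unit]


print(parse_sla_hours("4h"))   # 4.0
print(parse_sla_hours("30m"))  # 0.5
```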
Anomaly Detection for Data Quality
Statistical checks catch issues that rule-based validation misses:
Volume Anomalies
```sql
-- Alert if today's row count deviates > 3 sigma from the 30-day average
WITH daily_counts AS (
    SELECT DATE(created_at) AS dt, COUNT(*) AS cnt
    FROM orders
    WHERE created_at >= CURRENT_DATE - 30
    GROUP BY DATE(created_at)
),
stats AS (
    -- Baseline excludes today so a spike can't inflate its own average
    SELECT AVG(cnt) AS avg_count, STDDEV(cnt) AS std_count
    FROM daily_counts
    WHERE dt < CURRENT_DATE
)
SELECT
    d.cnt AS today_count,
    s.avg_count,
    s.std_count,
    ABS(d.cnt - s.avg_count) / NULLIF(s.std_count, 0) AS z_score
FROM daily_counts d
CROSS JOIN stats s
WHERE d.dt = CURRENT_DATE
  AND ABS(d.cnt - s.avg_count) / NULLIF(s.std_count, 0) > 3;
```
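The same test can run in-process before loading. A pandas sketch with illustrative counts (the last entry stands in for today's load):

```python
import pandas as pd

# Daily row counts for the trailing window; the last value is today's
# (numbers are illustrative)
daily_counts = pd.Series([1000, 1020, 980, 1010, 990, 5000])

history = daily_counts.iloc[:-1]
today = daily_counts.iloc[-1]
z = abs(today - history.mean()) / history.std()  # sample std (ddof=1)

if z > 3:
    print(f"volume anomaly: today={today}, z-score={z:.1f}")
```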
Distribution Shift
Monitor the Population Stability Index (PSI) for each column:
- PSI < 0.1: No significant shift
- PSI 0.1-0.25: Moderate shift — investigate
- PSI > 0.25: Significant shift — alert immediately
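PSI bins a reference sample of the column and compares bin frequencies against the current sample. A minimal NumPy sketch (the `psi` helper and its epsilon smoothing are illustrative choices, not a standard API):

```python
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `actual` against a reference sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Clip so values outside the reference range land in the outer bins
    actual = np.clip(actual, edges[0], edges[-1])
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # smooth empty bins to avoid log(0)
    e_pct, a_pct = e_pct + eps, a_pct + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
psi_same = psi(baseline, rng.normal(0, 1, 10_000))       # same distribution
psi_shifted = psi(baseline, rng.normal(1.0, 1, 10_000))  # mean shifted 1 sigma
print(f"no shift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

Computing PSI per column on each load, against a frozen reference window, turns the thresholds above into automatable alerts.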
Building a Data Quality Culture
- Assign ownership: Every dataset has an owning team responsible for its quality
- Measure and report: Weekly data quality score per team, visible to leadership
- Block on failure: Critical quality checks gate downstream pipelines — don’t process bad data
- Incident process: Treat data quality incidents like production outages
- Shift left: Validate data at ingestion, not consumption
The organizations with the best data quality don’t have the best monitoring tools — they have cultures where data producers feel responsible for the quality of what they ship.