
Data Quality Monitoring in Production

How to monitor data quality in production pipelines. Covers data contracts, schema validation, anomaly detection, lineage tracking, and building a data quality culture.

Bad data is the silent killer of analytics and ML systems. A model trained on correct data but served corrupted features will produce confidently wrong answers. A dashboard whose upstream table silently changed schema will show zeros without raising a single error. Data quality issues are harder to detect than application bugs because the system doesn’t crash: it just produces wrong results that quietly erode trust.

Production data quality monitoring requires automated validation at every stage of the pipeline, clear ownership of data contracts, and fast alerting when quality degrades.


The Data Quality Dimensions

| Dimension    | Definition                               | Example Check                                   |
|--------------|------------------------------------------|-------------------------------------------------|
| Completeness | No missing values where expected         | NULL rate < 1% for required fields              |
| Accuracy     | Values match real-world truth            | Revenue figures reconcile with source           |
| Consistency  | Same data, same answer across systems    | Customer count matches between CRM and warehouse |
| Freshness    | Data arrives on time                     | Table updated within last 4 hours               |
| Uniqueness   | No duplicate records                     | Primary key uniqueness check                    |
| Validity     | Values conform to expected format/range  | Email contains @, age between 0 and 150         |

Data Contracts

A data contract is a formal agreement between a data producer and consumer about the schema, semantics, and quality of a dataset.

# contracts/orders.yaml
contract:
  name: orders
  owner: payments-team
  sla:
    freshness: 1h
    availability: 99.9%
  
  schema:
    - name: order_id
      type: string
      required: true
      unique: true
    - name: customer_id
      type: string
      required: true
    - name: total_amount
      type: decimal
      required: true
      constraints:
        min: 0
        max: 1000000
    - name: status
      type: string
      required: true
      allowed_values: [pending, confirmed, shipped, delivered, cancelled]
    - name: created_at
      type: timestamp
      required: true
      constraints:
        not_in_future: true
  
  quality_rules:
    - name: no_negative_totals
      sql: "SELECT COUNT(*) FROM orders WHERE total_amount < 0"
      threshold: 0
    - name: valid_status_transitions
      sql: "SELECT COUNT(*) FROM orders WHERE status = 'delivered' AND shipped_at IS NULL"
      threshold: 0
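
A parsed contract like this (e.g. loaded with PyYAML’s `yaml.safe_load`) can drive record-level validation in code. The sketch below is a minimal illustration assuming the schema dict shape above; type checking and the SQL `quality_rules` are deliberately omitted:

```python
def validate_record(record: dict, schema: list) -> list:
    """Return a list of violation messages for one record (empty = valid)."""
    violations = []
    for field in schema:
        name = field["name"]
        value = record.get(name)
        if value is None:
            if field.get("required"):
                violations.append(f"{name}: missing required field")
            continue
        # Enum constraint (allowed_values in the contract).
        allowed = field.get("allowed_values")
        if allowed and value not in allowed:
            violations.append(f"{name}: {value!r} not in {allowed}")
        # Range constraints (min/max in the contract).
        constraints = field.get("constraints", {})
        if "min" in constraints and value < constraints["min"]:
            violations.append(f"{name}: {value} below min {constraints['min']}")
        if "max" in constraints and value > constraints["max"]:
            violations.append(f"{name}: {value} above max {constraints['max']}")
    return violations

schema = [
    {"name": "order_id", "type": "string", "required": True},
    {"name": "total_amount", "type": "decimal", "required": True,
     "constraints": {"min": 0, "max": 1000000}},
    {"name": "status", "type": "string", "required": True,
     "allowed_values": ["pending", "confirmed", "shipped", "delivered", "cancelled"]},
]

good = {"order_id": "A1", "total_amount": 42.5, "status": "shipped"}
bad = {"order_id": "A2", "total_amount": -5, "status": "returned"}
```

Running this against `bad` reports two violations: a negative total and an unknown status value.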

Automated Validation Pipeline

import pandas as pd


class DataQualityMonitor:
    def __init__(self, contract: dict):
        self.contract = contract
        self.alerts = []

    def validate(self, df: pd.DataFrame) -> dict:
        results = {
            'completeness': self._check_completeness(df),
            'validity': self._check_validity(df),
            'freshness': self._check_freshness(df),
            'uniqueness': self._check_uniqueness(df),
            'custom_rules': self._check_custom_rules(df),
        }

        # Overall score: fraction of check groups that passed.
        results['score'] = sum(
            1 for v in results.values()
            if isinstance(v, dict) and v.get('passed', False)
        ) / len(results)

        return results

    def _check_completeness(self, df):
        # Flag required columns whose null rate exceeds the 1% budget.
        issues = {}
        for col in self.contract['schema']:
            if col['required']:
                null_rate = df[col['name']].isna().mean()
                if null_rate > 0.01:
                    issues[col['name']] = f'{null_rate:.2%} null'
        return {'passed': len(issues) == 0, 'issues': issues}

    def _check_freshness(self, df):
        # Compare the newest record's age against the contract SLA (e.g. '1h').
        if 'created_at' in df.columns:
            max_time = df['created_at'].max()
            age = (pd.Timestamp.now() - max_time).total_seconds() / 3600
            sla_hours = int(self.contract['sla']['freshness'].rstrip('h'))
            return {'passed': age <= sla_hours, 'age_hours': round(age, 1)}
        return {'passed': True}

    # _check_validity, _check_uniqueness, and _check_custom_rules follow the
    # same pattern: evaluate the contract's rules and return {'passed', 'issues'}.
Anomaly Detection for Data Quality

Statistical checks catch issues that rule-based validation misses:

Volume Anomalies

-- Alert if today's row count deviates more than 3 sigma from the
-- trailing 30-day average (today excluded from the baseline).
WITH daily_counts AS (
    SELECT DATE(created_at) AS dt, COUNT(*) AS cnt
    FROM orders
    WHERE created_at >= CURRENT_DATE - 30
    GROUP BY DATE(created_at)
),
baseline AS (
    SELECT AVG(cnt) AS avg_count, STDDEV(cnt) AS std_count
    FROM daily_counts
    WHERE dt < CURRENT_DATE
)
SELECT
    d.cnt AS today_count,
    b.avg_count,
    b.std_count,
    ABS(d.cnt - b.avg_count) / NULLIF(b.std_count, 0) AS z_score
FROM daily_counts d
CROSS JOIN baseline b
WHERE d.dt = CURRENT_DATE
  AND ABS(d.cnt - b.avg_count) / NULLIF(b.std_count, 0) > 3;

Distribution Shift

Monitor the Population Stability Index (PSI) for each column:

  • PSI < 0.1: No significant shift
  • PSI 0.1-0.25: Moderate shift — investigate
  • PSI > 0.25: Significant shift — alert immediately
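
PSI compares a column's current (actual) distribution against a reference (expected) distribution over matched histogram bins: PSI = sum over bins of (actual - expected) * ln(actual / expected). A minimal sketch over pre-binned proportions; the binning strategy and the epsilon guard for empty bins are choices you would tune per column:

```python
import math

def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index over matched histogram bins.

    expected/actual are per-bin proportions summing to ~1; eps guards
    against taking the log of an empty bin."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]   # reference distribution
stable   = [0.24, 0.26, 0.25, 0.25]   # small drift: PSI well under 0.1
shifted  = [0.05, 0.15, 0.30, 0.50]   # heavy drift: PSI well over 0.25
```

An identical distribution scores exactly 0, and the score grows with divergence, so the thresholds above map directly onto alert severities.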

Building a Data Quality Culture

  1. Assign ownership: Every dataset has an owning team responsible for its quality
  2. Measure and report: Weekly data quality score per team, visible to leadership
  3. Block on failure: Critical quality checks gate downstream pipelines — don’t process bad data
  4. Incident process: Treat data quality incidents like production outages
  5. Shift left: Validate data at ingestion, not consumption

The organizations with the best data quality don’t have the best monitoring tools — they have cultures where data producers feel responsible for the quality of what they ship.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
