
Data Testing & Data Quality Frameworks

Test data pipelines systematically. Covers Great Expectations, dbt tests, data profiling, anomaly detection, schema validation, and building a data quality SLA.

Data bugs are worse than code bugs. A code bug crashes the application — you know immediately. A data bug silently produces wrong numbers for weeks. Revenue reports are off, decisions are made on bad data, and nobody notices until the CFO asks “why do these numbers disagree?” Data testing catches these bugs before they become business disasters.


Data Quality Dimensions

| Dimension | What to Test | Example Check |
|---|---|---|
| Completeness | No missing required values | `NOT NULL` on `customer_id` |
| Uniqueness | No duplicates | Primary key uniqueness |
| Accuracy | Values match reality | Price > 0, age between 0 and 150 |
| Consistency | No contradictions | `order_date <= ship_date` |
| Timeliness | Data is fresh enough | Source updated within last 24h |
| Validity | Values match expected formats | Email matches pattern, ISO country codes |
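
The same dimensions can be spot-checked directly in code before reaching for a framework. A minimal stdlib sketch over a hypothetical in-memory orders dataset (the field names and values are illustrative, not from any real pipeline):

```python
from datetime import date

orders = [
    {"order_id": 1, "customer_id": 10, "price": 19.99,
     "order_date": date(2024, 1, 1), "ship_date": date(2024, 1, 2)},
    {"order_id": 2, "customer_id": 11, "price": 5.00,
     "order_date": date(2024, 1, 2), "ship_date": date(2024, 1, 3)},
]

# Completeness: required fields are never missing
completeness_ok = all(o["customer_id"] is not None for o in orders)

# Uniqueness: order_id behaves as a primary key
ids = [o["order_id"] for o in orders]
uniqueness_ok = len(ids) == len(set(ids))

# Accuracy: values are plausible
accuracy_ok = all(o["price"] > 0 for o in orders)

# Consistency: related fields do not contradict each other
consistency_ok = all(o["order_date"] <= o["ship_date"] for o in orders)

assert completeness_ok and uniqueness_ok and accuracy_ok and consistency_ok
```

Frameworks like dbt and Great Expectations encode exactly these checks declaratively, so they run on every pipeline execution instead of ad hoc.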

dbt Tests

```yaml
# schema.yml
version: 2

models:
  - name: fct_orders
    description: "One row per order line item"
    columns:
      - name: order_id
        tests:
          - not_null
          - unique

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_id

      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
              inclusive: true

      - name: order_date
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "<= current_date"
              config:
                severity: error
```
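
Under the hood, each generic test compiles to a SQL query that selects the *violating* rows; the test passes when that query returns zero rows. A rough stdlib sketch of that semantics using SQLite (table contents are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (order_id INTEGER, total_amount REAL)")
conn.executemany("INSERT INTO fct_orders VALUES (?, ?)",
                 [(1, 10.0), (2, 25.5), (3, 7.0)])

# not_null compiles to roughly: select rows where the column IS NULL
not_null_failures = conn.execute(
    "SELECT * FROM fct_orders WHERE order_id IS NULL"
).fetchall()

# unique compiles to roughly: select key values appearing more than once
unique_failures = conn.execute(
    "SELECT order_id FROM fct_orders GROUP BY order_id HAVING COUNT(*) > 1"
).fetchall()

# A dbt test passes when its compiled query returns zero rows
assert not_null_failures == [] and unique_failures == []
```

This "failing rows or nothing" convention is also what makes custom singular tests, like the one below, work with no extra machinery.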

Custom dbt Test

```sql
-- tests/assert_orders_revenue_matches_payments.sql
-- Verify total orders revenue matches total payments

WITH order_total AS (
    SELECT SUM(total_amount) AS total FROM {{ ref('fct_orders') }}
    WHERE order_date >= CURRENT_DATE - INTERVAL '1 day'
),
payment_total AS (
    SELECT SUM(amount) AS total FROM {{ ref('fct_payments') }}
    WHERE payment_date >= CURRENT_DATE - INTERVAL '1 day'
)

SELECT
    order_total.total AS order_revenue,
    payment_total.total AS payment_revenue,
    ABS(order_total.total - payment_total.total) AS difference
FROM order_total, payment_total
WHERE ABS(order_total.total - payment_total.total) > 0.01
-- Returns rows only if there's a mismatch (test fails)
```

Great Expectations

```python
import great_expectations as gx

context = gx.get_context()

# Define an expectation suite for the orders table (GX 1.x API)
suite = context.suites.add(gx.ExpectationSuite(name="orders_quality"))

# Column expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="total_amount",
        min_value=0,
        max_value=1000000,
    )
)

# Row count expectations (catch data loss)
suite.add_expectation(
    gx.expectations.ExpectTableRowCountToBeBetween(
        min_value=1000,    # At least 1000 orders per day
        max_value=100000,  # Sanity check upper bound
    )
)
```
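
Hard-coded row count bounds go stale as volume grows. A common alternative, sketched here in plain Python rather than any Great Expectations API, is to derive the bounds from recent history, e.g. mean ± 3 standard deviations of daily row counts (all numbers below are illustrative):

```python
from statistics import mean, stdev

# Daily row counts for the last two weeks (illustrative numbers)
daily_counts = [4120, 3980, 4250, 4050, 4300, 4190, 4010,
                4230, 4100, 3950, 4280, 4060, 4150, 4200]

mu, sigma = mean(daily_counts), stdev(daily_counts)
min_value = int(mu - 3 * sigma)  # lower bound of the "normal" band
max_value = int(mu + 3 * sigma)  # upper bound of the "normal" band

# Simulate a source that silently dropped most of its rows today
today_count = 120
anomalous = not (min_value <= today_count <= max_value)
assert anomalous  # 120 is far below the historical band, so the check fires
```

Recomputing the band on each run turns a static sanity check into lightweight anomaly detection that adapts as the business grows.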

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| No data tests | Bad data discovered by business users | dbt tests + Great Expectations on every model |
| Testing only schema | Data type correct but values wrong | Content-level tests (ranges, relationships, aggregates) |
| No row count checks | Source stopped sending data, nobody noticed | Min/max row count expectations per table |
| Tests only in prod | Bugs reach production | Test in CI before deploying to production |
| No data freshness monitoring | Stale data served in dashboards | Freshness tests with SLA alerts |
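
The freshness check in the last row reduces to comparing the latest load timestamp against an SLA window. A minimal stdlib sketch (the 24-hour SLA and the timestamps are illustrative assumptions):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESHNESS_SLA = timedelta(hours=24)

def is_fresh(last_loaded_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the source was updated within the SLA window."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at <= FRESHNESS_SLA

now = datetime(2024, 6, 2, 12, 0, tzinfo=timezone.utc)

# Updated 9 hours ago: within the 24h SLA
fresh = is_fresh(datetime(2024, 6, 2, 3, 0, tzinfo=timezone.utc), now)

# Updated 3 days ago: breaches the SLA and should page someone
stale = is_fresh(datetime(2024, 5, 30, 12, 0, tzinfo=timezone.utc), now)

assert fresh and not stale
```

In practice this comparison runs inside the warehouse (e.g. `dbt source freshness` against a `loaded_at` column), with the breach wired to an alert rather than an assertion.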

Checklist

  • Data quality dimensions defined per dataset
  • dbt tests on every model (not_null, unique, relationships)
  • Custom tests for business rules (cross-table validation)
  • Row count tests to detect data loss or duplication
  • Freshness monitoring with SLA alerts
  • Great Expectations (or equivalent) for source data validation
  • Data quality dashboards (pass rate, failure trends)
  • Incident response for data quality failures

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data quality consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
