
Data Quality Frameworks

Build systematic data quality management into your data pipelines. Covers data quality dimensions, the Great Expectations framework, data contracts, schema validation, data profiling, quality metrics, and the patterns that catch data problems before they reach consumers.

Bad data is worse than no data. A missing field causes a visible error. A subtly wrong field — a currency in cents instead of dollars, a timestamp in UTC instead of local time — leads to wrong decisions that nobody questions. Data quality frameworks catch these problems systematically, at the point of ingestion, not after the CEO asks “why are the numbers wrong?”


Data Quality Dimensions

Completeness:  Are all required fields present?
               Example: 15% of customer records missing email
               
Accuracy:      Do values reflect reality?
               Example: Birth date of 2099-01-01

Consistency:   Do values agree across systems?
               Example: Customer "active" in CRM but "deleted" in billing

Timeliness:    Is data available when needed?
               Example: Yesterday's sales data not available until noon

Uniqueness:    Are there duplicate records?
               Example: Same customer appears 3 times with different IDs

Validity:      Do values conform to expected format/range?
               Example: Phone number with 15 digits, negative age
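Most of these dimensions can be spot-checked with nothing more than a dataframe library before reaching for a full framework. The sketch below uses pandas against a hypothetical customers dataframe; the column names and thresholds are illustrative, not taken from any particular system.

import pandas as pd

def spot_check_dimensions(customers: pd.DataFrame) -> dict:
    """Quick checks for a few dimensions on a hypothetical customers table.

    Assumed columns (illustrative): customer_id, email, birth_date (datetime),
    phone.
    """
    today = pd.Timestamp.today()
    return {
        # Completeness: percentage of records missing an email address
        "email_missing_pct": round(customers["email"].isna().mean() * 100, 2),
        # Accuracy: birth dates in the future are clearly wrong
        "future_birth_dates": int((customers["birth_date"] > today).sum()),
        # Uniqueness: customer IDs that appear more than once
        "duplicate_customer_ids": int(customers["customer_id"].duplicated().sum()),
        # Validity: phone numbers that are not 10-15 digits
        "invalid_phones": int(
            (~customers["phone"].astype(str).str.fullmatch(r"\d{10,15}")).sum()
        ),
    }

Ad-hoc checks like these work for one table; keeping them consistent across dozens of datasets is what a framework such as Great Expectations is for.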

Great Expectations

from datetime import datetime

import great_expectations as gx

# Define expectations (data quality rules)
context = gx.get_context()

# Create expectations for a dataset
suite = context.add_expectation_suite("orders_quality")

# Completeness
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)

# Validity
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="order_total",
        min_value=0,
        max_value=1_000_000,
    )
)

# Uniqueness
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

# Consistency
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="status",
        value_set=["pending", "processing", "shipped", "delivered", "cancelled"],
    )
)

# Timeliness
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="created_at",
        min_value="2020-01-01",
        max_value=datetime.now().isoformat(),
    )
)

# Run validation
# batch_request (assembled from a configured datasource and data asset)
# selects the batch of orders data the checkpoint validates
results = context.run_checkpoint(
    checkpoint_name="orders_checkpoint",
    batch_request=batch_request,
)

if not results.success:
    # Alert, quarantine, or halt the pipeline. alert_data_team,
    # quarantine_bad_records, and results.get_failed_rows() are
    # placeholders for your own alerting, quarantining, and
    # failed-record extraction logic.
    alert_data_team(results)
    quarantine_bad_records(results.get_failed_rows())
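The "halt pipeline" option from the comment above can be as simple as raising when the checkpoint result reports failure, so the orchestrator task fails loudly instead of letting bad data continue downstream. The exception class and gate function below are illustrative conventions, not Great Expectations APIs.

class DataQualityError(Exception):
    """Raised to stop the pipeline when a quality gate fails."""

def halt_on_failure(results) -> None:
    # `results` is the checkpoint result from above; only `.success` is used.
    # Raising here fails the wrapping task, so the orchestrator (Airflow,
    # Dagster, etc.) can retry, page on-call, or block downstream jobs.
    if not results.success:
        raise DataQualityError("orders_quality validation failed")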

Data Contracts

# data-contract.yaml
# Agreement between producer and consumer about data shape

contract:
  name: "customer_events"
  version: "2.1.0"
  owner: "platform-team"
  consumers: ["analytics", "marketing", "billing"]
  
schema:
  fields:
    - name: customer_id
      type: string
      required: true
      format: "uuid"
      description: "Unique customer identifier"
    
    - name: event_type
      type: string
      required: true
      enum: ["signup", "purchase", "churn", "upgrade"]
    
    - name: timestamp
      type: timestamp
      required: true
      timezone: "UTC"
    
    - name: revenue
      type: decimal
      required: false
      precision: 2
      currency: "USD"

sla:
  freshness: "< 5 minutes"
  completeness: "> 99.5%"
  availability: "99.9%"

breaking_changes:
  notice_period: "30 days"
  process: "RFC to all consumers before schema change"

Anti-Patterns

Validate only at the end:   Bad data propagates through the pipeline.
                            Fix: validate at ingestion, transformation, and delivery.

Silent data loss:           Records dropped without notification.
                            Fix: dead letter queue plus monitoring for rejects.

Schema-only validation:     Catches format errors but misses semantic errors.
                            Fix: add business rule validation.

No data profiling:          Cannot detect drift over time.
                            Fix: regular profiling to establish baselines.

Manual quality checks:      Unsustainable, inconsistent, delayed.
                            Fix: automated quality gates in the pipeline.
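The last two anti-patterns are the easiest to automate incrementally: a small profiling job that records row counts and per-column null rates on every run, then flags drift against a stored baseline. The sketch below assumes a pandas dataframe and a JSON baseline file; the path and tolerance are illustrative.

import json
from pathlib import Path

import pandas as pd

BASELINE_PATH = Path("profiles/orders_baseline.json")  # illustrative location

def profile(df: pd.DataFrame) -> dict:
    # Capture a minimal profile: row count plus per-column null rates.
    return {
        "row_count": len(df),
        "null_rates": df.isna().mean().round(4).to_dict(),
    }

def drifted_columns(current: dict, baseline: dict, tolerance: float = 0.05) -> list:
    # Columns whose null rate moved more than `tolerance` since the baseline.
    return [
        col
        for col, rate in current["null_rates"].items()
        if abs(rate - baseline["null_rates"].get(col, 0.0)) > tolerance
    ]

def check_against_baseline(df: pd.DataFrame) -> None:
    current = profile(df)
    if BASELINE_PATH.exists():
        baseline = json.loads(BASELINE_PATH.read_text())
        for col in drifted_columns(current, baseline):
            print(f"null-rate drift detected in column: {col}")
    # Persist the latest profile so drift is measured run over run.
    BASELINE_PATH.parent.mkdir(parents=True, exist_ok=True)
    BASELINE_PATH.write_text(json.dumps(current))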

Data quality is not a one-time check — it is a continuous process. Embed quality gates into every stage of your pipeline, define contracts between producers and consumers, and treat data quality failures with the same urgency as production incidents.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
