Data Quality Frameworks
Build systematic data quality management into your data pipelines. Covers data quality dimensions, the Great Expectations framework, data contracts, schema validation, data profiling, quality metrics, and the patterns that catch data problems before they reach consumers.
Bad data is worse than no data. A missing field causes a visible error. A subtly wrong field — a currency in cents instead of dollars, a timestamp in UTC instead of local time — leads to wrong decisions that nobody questions. Data quality frameworks catch these problems systematically, at the point of ingestion, not after the CEO asks “why are the numbers wrong?”
Data Quality Dimensions
Completeness: Are all required fields present?
  Example: 15% of customer records missing email
Accuracy: Do values reflect reality?
  Example: Birth date of 2099-01-01
Consistency: Do values agree across systems?
  Example: Customer "active" in CRM but "deleted" in billing
Timeliness: Is data available when needed?
  Example: Yesterday's sales data not available until noon
Uniqueness: Are there duplicate records?
  Example: Same customer appears 3 times with different IDs
Validity: Do values conform to expected format/range?
  Example: Phone number with 15 digits, negative age
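As a rough illustration, several of these dimensions can be expressed as simple ratios over a batch of records. The sketch below uses only the standard library; the sample records, the field names, and the 0 to 120 age range are assumptions made up for the example, not part of any particular framework.

```python
from collections import Counter

# Hypothetical sample records; field names are illustrative only.
customers = [
    {"id": "c1", "email": "a@example.com", "age": 34},
    {"id": "c2", "email": None, "age": 27},
    {"id": "c3", "email": "c@example.com", "age": -5},
    {"id": "c1", "email": "a@example.com", "age": 34},
]


def completeness(records, field):
    """Fraction of records where `field` is present and not None."""
    if not records:
        return 1.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def uniqueness(records, field):
    """Fraction of records whose `field` value occurs exactly once."""
    if not records:
        return 1.0
    counts = Counter(r.get(field) for r in records)
    unique = sum(1 for r in records if counts[r.get(field)] == 1)
    return unique / len(records)


def validity(records, field, predicate):
    """Fraction of non-null `field` values that satisfy `predicate`."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return 1.0
    return sum(1 for v in values if predicate(v)) / len(values)


report = {
    "email_completeness": completeness(customers, "email"),
    "id_uniqueness": uniqueness(customers, "id"),
    "age_validity": validity(customers, "age", lambda a: 0 <= a <= 120),
}
print(report)
```

Accuracy, consistency, and timeliness are harder to score from a single batch, since they need a reference source, a second system, or arrival timestamps to compare against.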
Great Expectations
from datetime import datetime

import great_expectations as gx

# Assumes the Great Expectations 1.x API; older releases use different calls.
context = gx.get_context()

# Create an expectation suite (a named set of data quality rules)
suite = context.suites.add(gx.ExpectationSuite(name="orders_quality"))

# Completeness
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)

# Validity
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="order_total",
        min_value=0,
        max_value=1_000_000,
    )
)

# Uniqueness
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

# Consistency
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="status",
        value_set=["pending", "processing", "shipped", "delivered", "cancelled"],
    )
)

# Timeliness (assumes created_at is a datetime column; the upper bound is
# fixed at the moment the suite is defined)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="created_at",
        min_value=datetime(2020, 1, 1),
        max_value=datetime.now(),
    )
)

# Run validation. Assumes a checkpoint named "orders_checkpoint" that
# validates the orders data against this suite has already been configured.
checkpoint = context.checkpoints.get("orders_checkpoint")
results = checkpoint.run()

if not results.success:
    # Alert, quarantine, or halt the pipeline.
    # alert_data_team and quarantine_bad_records are placeholders for your
    # own handlers; they are not part of Great Expectations.
    alert_data_team(results)
    quarantine_bad_records(results)
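The quarantine step is left abstract in the snippet above. One way to picture it, independent of Great Expectations, is a function that splits a batch into accepted rows and rejected rows and records why each row was rejected. The following is a minimal plain-Python sketch under that assumption; `check_order` and `split_batch` are hypothetical names, and the rules simply mirror the completeness, validity, and consistency expectations for the orders data.

```python
ALLOWED_STATUSES = {"pending", "processing", "shipped", "delivered", "cancelled"}


def check_order(order):
    """Return a list of rule violations for one order (empty if clean)."""
    problems = []
    if order.get("order_id") is None:
        problems.append("order_id is null")
    if order.get("customer_id") is None:
        problems.append("customer_id is null")
    total = order.get("order_total")
    # Null totals are skipped here, matching how a range rule usually
    # ignores missing values; only present values are range-checked.
    if total is not None and not 0 <= total <= 1_000_000:
        problems.append("order_total out of range")
    if order.get("status") not in ALLOWED_STATUSES:
        problems.append("status not in allowed set")
    return problems


def split_batch(orders):
    """Separate clean orders from rejects; rejects keep their reasons."""
    accepted, quarantined = [], []
    for order in orders:
        problems = check_order(order)
        if problems:
            quarantined.append({"record": order, "reasons": problems})
        else:
            accepted.append(order)
    return accepted, quarantined


# Made-up sample batch for illustration.
batch = [
    {"order_id": 1, "customer_id": "c1", "order_total": 49.90, "status": "shipped"},
    {"order_id": 2, "customer_id": None, "order_total": 10.00, "status": "pending"},
    {"order_id": 3, "customer_id": "c3", "order_total": -5.00, "status": "refunded"},
]

accepted, quarantined = split_batch(batch)
print(len(accepted), "accepted,", len(quarantined), "quarantined")
for item in quarantined:
    print(item["record"]["order_id"], item["reasons"])
```

Keeping the reasons next to each rejected record means the quarantine store doubles as a debugging log, and clean rows can keep flowing while the rejects wait for review.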
Data Contracts
# data-contract.yaml
# Agreement between producer and consumer about data shape
contract:
  name: "customer_events"
  version: "2.1.0"
  owner: "platform-team"
  consumers: ["analytics", "marketing", "billing"]
  schema:
    fields:
      - name: customer_id
        type: string
        required: true
        format: "uuid"
        description: "Unique customer identifier"
      - name: event_type
        type: string
        required: true
        enum: ["signup", "purchase", "churn", "upgrade"]
      - name: timestamp
        type: timestamp
        required: true
        timezone: "UTC"
      - name: revenue
        type: decimal
        required: false
        precision: 2
        currency: "USD"
  sla:
    freshness: "< 5 minutes"
    completeness: "> 99.5%"
    availability: "99.9%"
  breaking_changes:
    notice_period: "30 days"
    process: "RFC to all consumers before schema change"
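A contract only helps if something enforces it. As a sketch of what enforcement might look like, the code below checks a single event against the required fields, the UUID format, the `event_type` enum, and the UTC timestamp rule from the contract, plus the 5-minute freshness SLA. The rules are transcribed by hand rather than loaded from the YAML file, the function names are hypothetical, freshness is interpreted here as the lag between the event timestamp and the time of the check, and the optional `revenue` field is not validated.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Rules copied by hand from the customer_events contract (an assumption;
# a real pipeline would more likely parse the contract file itself).
REQUIRED_FIELDS = ("customer_id", "event_type", "timestamp")
EVENT_TYPES = {"signup", "purchase", "churn", "upgrade"}
MAX_AGE = timedelta(minutes=5)


def is_uuid(value):
    """True if `value` is a string that parses as a UUID."""
    if not isinstance(value, str):
        return False
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def violations(event):
    """Return the list of contract violations for one event dict."""
    found = []
    for field in REQUIRED_FIELDS:
        if event.get(field) is None:
            found.append(f"missing required field: {field}")
    customer_id = event.get("customer_id")
    if customer_id is not None and not is_uuid(customer_id):
        found.append("customer_id is not a UUID")
    event_type = event.get("event_type")
    if event_type is not None and event_type not in EVENT_TYPES:
        found.append("event_type not in enum")
    ts = event.get("timestamp")
    # Naive datetimes have utcoffset() None, so they are flagged too.
    if ts is not None and (
        not isinstance(ts, datetime) or ts.utcoffset() != timedelta(0)
    ):
        found.append("timestamp is not timezone-aware UTC")
    return found


def is_fresh(event_time, now):
    """True if the event is younger than the freshness SLA."""
    return now - event_time < MAX_AGE


# Made-up events for illustration.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
good = {
    "customer_id": "123e4567-e89b-12d3-a456-426614174000",
    "event_type": "purchase",
    "timestamp": now - timedelta(minutes=2),
}
bad = {
    "customer_id": "42",
    "event_type": "refund",
    "timestamp": datetime(2024, 1, 1, 11, 0),  # naive, no timezone
}

print(violations(good))
print(violations(bad))
print(is_fresh(good["timestamp"], now))
```

Running the same check on the producer side before publishing and on the consumer side at ingestion gives both parties an early signal when the contract is broken.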
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Validate only at the end | Bad data propagates through pipeline | Validate at ingestion, transformation, and delivery |
| Silent data loss | Records dropped without notification | Dead letter queue + monitoring for rejects |
| Schema-only validation | Catches format errors, misses semantic errors | Add business rule validation |
| No data profiling | Cannot detect drift over time | Regular profiling to establish baselines |
| Manual quality checks | Unsustainable, inconsistent, delayed | Automated quality gates in pipeline |
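The profiling row deserves a concrete picture. A baseline profile can be as small as a null rate and a mean per column, compared against the same statistics for each new batch. The sketch below is one minimal way to do that in plain Python; the function names, the sample data, and the tolerance values (5 percentage points for null rate, 25% for the mean) are arbitrary assumptions chosen for the example.

```python
def profile(records, field):
    """Summarize one numeric field: row count, null rate, and mean."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "null_rate": (len(values) - len(present)) / len(values) if values else 0.0,
        "mean": sum(present) / len(present) if present else None,
    }


def drift_alerts(baseline, current, null_tolerance=0.05, mean_tolerance=0.25):
    """Compare a current profile to a baseline and list what moved too far."""
    alerts = []
    if current["null_rate"] - baseline["null_rate"] > null_tolerance:
        alerts.append("null rate increased")
    # Skip the mean check when the baseline mean is missing or zero,
    # since a relative change cannot be computed.
    if baseline["mean"] and current["mean"] is not None:
        change = abs(current["mean"] - baseline["mean"]) / abs(baseline["mean"])
        if change > mean_tolerance:
            alerts.append("mean shifted")
    return alerts


# Made-up batches: today's totals look like cents instead of dollars,
# and one value is missing.
yesterday = [
    {"order_total": 100.0},
    {"order_total": 120.0},
    {"order_total": 80.0},
    {"order_total": 100.0},
]
today = [
    {"order_total": 10000.0},
    {"order_total": None},
    {"order_total": 12000.0},
    {"order_total": 8000.0},
]

baseline = profile(yesterday, "order_total")
current = profile(today, "order_total")
print(drift_alerts(baseline, current))
```

Every value in the second batch would pass a schema check, which is why a comparison against a baseline catches problems that format validation alone does not.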
Data quality is not a one-time check — it is a continuous process. Embed quality gates into every stage of your pipeline, define contracts between producers and consumers, and treat data quality failures with the same urgency as production incidents.