# Data Contracts for Pipeline Reliability

Implement data contracts between producers and consumers. Covers schema registries, contract testing, versioning strategies, breaking change management, and organizational adoption.
Data contracts are explicit agreements between data producers and data consumers about what data will look like, when it will arrive, and what quality guarantees it carries. Without contracts, any upstream change — a renamed column, a new enum value, a format change — silently breaks every downstream pipeline, dashboard, and ML model.
This guide covers how to implement data contracts technically and organizationally, turning implicit assumptions into enforceable agreements.
## What a Data Contract Contains
```yaml
# data-contract.yaml
contract:
  name: "orders"
  version: "2.1.0"
  owner: "commerce-team"

  schema:
    type: "object"
    properties:
      order_id:
        type: string
        format: uuid
        description: "Unique order identifier"
        pii: false
      customer_id:
        type: string
        description: "Customer who placed the order"
        pii: true
      amount:
        type: number
        minimum: 0.01
        maximum: 1000000
        description: "Order total in USD"
      status:
        type: string
        enum: ["pending", "processing", "shipped", "delivered", "cancelled"]
      created_at:
        type: string
        format: "date-time"
        description: "ISO 8601 timestamp"

  quality:
    freshness:
      max_delay: "1 hour"
    completeness:
      order_id: 100%
      customer_id: 100%
      amount: 100%
    volume:
      min_daily_records: 5000
      max_daily_records: 50000

  sla:
    availability: "99.9%"
    support_channel: "#commerce-data"

  consumers:
    - team: "analytics"
      use_case: "Revenue dashboards"
    - team: "ml-platform"
      use_case: "Churn prediction features"
```
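A contract like this is only useful if it is checked at runtime. The sketch below validates a single record against the schema constraints from the contract above (required fields, the `amount` bounds, the `status` enum); the helper name and return convention are illustrative, not part of any real validation library.

```python
# Statuses copied from the contract's enum
ORDER_STATUSES = {"pending", "processing", "shipped", "delivered", "cancelled"}

REQUIRED_FIELDS = ("order_id", "customer_id", "amount", "status", "created_at")


def validate_order(record: dict) -> list[str]:
    """Return a list of contract violations (empty list means valid)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            errors.append(f"missing required field: {field}")
    amount = record.get("amount")
    if amount is not None and not (0.01 <= amount <= 1_000_000):
        errors.append(f"amount {amount} outside [0.01, 1000000]")
    status = record.get("status")
    if status is not None and status not in ORDER_STATUSES:
        errors.append(f"unknown status: {status}")
    return errors
```

In production this role is usually filled by a JSON Schema or Avro validator at the producer's write path, so that invalid records are rejected before they ever reach consumers.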
## Schema Registry
```python
# Avro schema with compatibility enforcement
ORDERS_SCHEMA_V2 = {
    "type": "record",
    "name": "Order",
    "namespace": "com.company.commerce",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "status", "type": {"type": "enum", "name": "Status",
            "symbols": ["PENDING", "PROCESSING", "SHIPPED", "DELIVERED", "CANCELLED"]}},
        {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        # New field in v2 — backward compatible (has default)
        {"name": "currency", "type": "string", "default": "USD"},
    ]
}
```
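The core rule a registry enforces for BACKWARD compatibility can be sketched in a few lines: a new schema can read old data only if every field it adds carries a default. This is a deliberately simplified check; real registries also handle type promotions, aliases, and union evolution, which are omitted here.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """BACKWARD check for Avro-style record schemas (fields only).

    A new field without a default cannot be filled in when reading
    records written with the old schema, so it breaks compatibility.
    """
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False
    return True
```

Under this rule, adding `currency` with `"default": "USD"` (as in v2 above) passes, while adding a required field without a default fails.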
### Compatibility Modes
| Mode | New Schema Can | Use Case |
|---|---|---|
| BACKWARD | Read old data | Consumers upgrade before producers |
| FORWARD | Be read by old consumers | Producers upgrade before consumers |
| FULL | Both backward and forward | Independent upgrades |
| NONE | Break anything | Development only, never production |
## Contract Testing in CI/CD
```yaml
# GitHub Actions: validate contracts on every PR
contract-validation:
  runs-on: ubuntu-latest
  steps:
    - name: Schema Compatibility Check
      id: schema-check  # referenced below in the Notify Consumers step
      run: |
        # Check new schema is backward compatible with registered version
        curl -X POST "https://schema-registry:8081/compatibility/subjects/orders-value/versions/latest" \
          -H "Content-Type: application/vnd.schemaregistry.v1+json" \
          -d @new-schema.json

    - name: Contract Test (Producer)
      run: |
        # Verify producer output matches contract
        python -m pytest tests/contract/ \
          --contract data-contracts/orders.yaml \
          --sample-data tests/fixtures/orders_sample.json

    - name: Notify Consumers
      if: steps.schema-check.outputs.breaking == 'true'
      run: |
        # Alert consuming teams about breaking change
        slack-notify --channel "#data-contracts" \
          --message "⚠️ Breaking change proposed for 'orders' contract"
```
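The producer-side contract test in the workflow above typically asserts quality SLAs as well as schema shape. Below is a minimal sketch of the completeness check: given sample records and the per-field thresholds from the contract's `quality.completeness` section, it reports which fields meet their non-null percentage. The function name and return shape are local conventions, not a specific testing framework's API.

```python
def check_completeness(records: list[dict],
                       thresholds: dict[str, float]) -> dict[str, bool]:
    """Map each field to True if its non-null rate meets the threshold (%)."""
    results = {}
    for field, required_pct in thresholds.items():
        present = sum(1 for r in records if r.get(field) is not None)
        actual_pct = 100.0 * present / len(records) if records else 0.0
        results[field] = actual_pct >= required_pct
    return results
```

A pytest contract test would then assert that every value in the result is `True` and fail the PR otherwise.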
## Versioning Strategy
| Change Type | Version Bump | Example | Breaking? |
|---|---|---|---|
| Add optional field | Minor (2.0 → 2.1) | Add currency with default | No |
| Add required field | Major (2.1 → 3.0) | New warehouse_id required | Yes |
| Remove field | Major | Drop legacy_status | Yes |
| Change field type | Major | amount: string → amount: number | Yes |
| Add enum value | Minor | Add "REFUNDED" to status | No (if consumer handles unknown) |
| Rename field | Major | created_at → order_time | Yes |
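The bump rules in the table can be automated from a field-level schema diff. The sketch below implements the field-addition and removal rows (removals and new required fields are major; optional additions are minor); type-change and rename detection are intentionally omitted, and the function name is illustrative.

```python
def required_bump(old_fields: dict[str, dict],
                  new_fields: dict[str, dict]) -> str:
    """Decide the semver bump from old/new field maps (name -> definition)."""
    removed = set(old_fields) - set(new_fields)
    added = set(new_fields) - set(old_fields)
    # Removing a field, or adding one without a default, breaks consumers
    if removed or any("default" not in new_fields[name] for name in added):
        return "major"
    if added:
        return "minor"
    return "patch"
```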
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| No contracts (implicit assumptions) | Upstream changes break downstream silently | Explicit contracts with schema + quality SLAs |
| Contracts without enforcement | Contracts exist but nobody checks them | CI/CD validation, runtime schema validation |
| Producer-only contracts | Consumers not aware of changes | Consumer registration, change notification |
| No versioning | Can’t evolve schemas without breaking | Semantic versioning with compatibility checks |
| Contracts as documentation | Written but never tested | Contract tests run in CI, production validation |
## Checklist
- Data contracts defined for all critical datasets
- Schema registry deployed (Confluent, AWS Glue, or custom)
- Compatibility mode set: BACKWARD or FULL
- Contract testing in CI/CD: schema validation on every PR
- Consumer registry: know who depends on each dataset
- Breaking change process: notification + migration period
- Runtime validation: schemas enforced at write time
- Quality SLAs: freshness, completeness, volume in contract
- Ownership: every contract has a team owner and support channel
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data contract consulting, visit garnetgrid.com.
:::