Data governance fails when it’s treated as a bureaucratic exercise. It succeeds when people see it as “being able to trust the numbers in the dashboard.” 77% of organizations say they struggle with data quality, yet only 25% have a formal governance program. This guide gives you the practical implementation path that creates trust without creating a bureaucracy that slows everyone down.
The fundamental tension: governance adds overhead. Too much governance kills productivity. Too little governance kills trust. The goal is the minimum governance that produces reliable data.
## Step 1: Establish Data Ownership
Every dataset needs exactly one owner. Not a committee — a person. When nobody owns data, nobody is accountable for its quality.
| Data Domain | Owner Role | Steward | Responsibilities |
|---|---|---|---|
| Customer | Head of Sales | CRM Admin | Define what “customer” means, quality rules |
| Financial | CFO / Controller | Finance Analyst | Accuracy of reporting figures |
| Product | VP Product | Product Ops | Catalog accuracy, pricing integrity |
| Employee | CHRO | HR Systems Admin | PII handling, access controls |
| Operational | COO | Data Engineer | Pipeline uptime, data freshness |
### Owner vs Steward vs Custodian
| Role | Responsibility | Example Person | Key Question |
|---|---|---|---|
| Data Owner | Defines business rules, approves access | VP of Sales | “What does this data mean?” |
| Data Steward | Implements quality rules, resolves issues | CRM Admin, Sr Analyst | “Is this data correct today?” |
| Data Custodian | Manages infrastructure, security | Data Engineer, DBA | “Is this data available and protected?” |
| Data Consumer | Uses data for decisions | Business Analyst | “Can I trust this number?” |
### RACI for Data Decisions
| Decision | Owner | Steward | Data Eng | Consumers |
|---|---|---|---|---|
| Define business rules | A | R | C | I |
| Data quality thresholds | A | R | C | I |
| Schema changes | C | A | R | I |
| Access requests | A | R | C | I |
| Incident response | I | A | R | C |
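The RACI matrix above can also live in code, so tooling (for example, an access-request bot or incident router) can look up who is accountable automatically. This is an illustrative sketch — the decision keys and role names are assumptions, not from any real system:

```python
# Hypothetical encoding of the RACI table as a lookup structure.
RACI = {
    "define_business_rules": {"owner": "A", "steward": "R", "data_eng": "C", "consumers": "I"},
    "quality_thresholds":    {"owner": "A", "steward": "R", "data_eng": "C", "consumers": "I"},
    "schema_changes":        {"owner": "C", "steward": "A", "data_eng": "R", "consumers": "I"},
    "incident_response":     {"owner": "I", "steward": "A", "data_eng": "R", "consumers": "C"},
}

def accountable_for(decision: str) -> str:
    """Return the single role marked Accountable (A) for a decision."""
    roles = RACI[decision]
    return next(role for role, raci in roles.items() if raci == "A")

accountable_for("schema_changes")        # "steward"
accountable_for("define_business_rules") # "owner"
```

Keeping the matrix in one place means the table in your docs and the routing logic in your tooling can't drift apart.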
## Step 2: Build Your Data Catalog
A catalog without adoption is shelf-ware. Make it the first place people go to find data.
```python
# Example: register a dataset and its quality contract with Great Expectations.
# NOTE: the GX API has changed across major versions; this sketch follows the
# 1.x "fluent" style — check the docs for the release you are running.
import great_expectations as gx

context = gx.get_context()

# Add a data source and a table asset
datasource = context.data_sources.add_postgres(
    name="production_db",
    connection_string="postgresql://...",
)
asset = datasource.add_table_asset(name="customers", table_name="customers")
batch_def = asset.add_batch_definition_whole_table("customers_full")

# Create an expectation suite (the quality contract)
suite = context.suites.add(gx.ExpectationSuite(name="customers_quality"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="email")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="lifetime_value", min_value=0, max_value=10_000_000
    )
)

# Run validation
validation = context.validation_definitions.add(
    gx.ValidationDefinition(name="customers_daily", data=batch_def, suite=suite)
)
results = validation.run()
print(f"Validation passed: {results.success}")
```
| Tool | Best For | Cost | Open Source? |
|---|---|---|---|
| DataHub | LinkedIn-scale, flexible metadata | Free | Yes |
| OpenMetadata | Full-featured, modern UI | Free | Yes |
| Atlan | Business-user friendly, collaboration | $$$$ | No |
| Alation | Enterprise ML-powered cataloging | $$$$$ | No |
| dbt Docs | dbt-centric teams, lightweight | Free | Yes |
| AWS Glue Catalog | AWS-native, auto-crawling | Per-use | No |
### What to Document Per Dataset
| Metadata Field | Example | Who Maintains |
|---|---|---|
| Description | “All active customer accounts with lifetime metrics” | Data Steward |
| Owner | sales_team | Data Owner |
| SLA | Refreshed by 6 AM daily | Data Engineer |
| PII flag | Yes (email, phone) | Data Steward |
| Source system | Salesforce CRM → Fivetran → Snowflake | Data Engineer |
| Quality score | 94% (3 of 50 checks failing) | Automated |
| Downstream consumers | Revenue dashboard, Churn model, Finance report | Auto-discovered |
| Freshness | Last updated 2 hours ago | Automated |
## Step 3: Implement Data Quality Rules

### Quality Dimensions
| Dimension | Definition | Example Check | Automation |
|---|---|---|---|
| Completeness | No critical nulls | NOT NULL on required fields | dbt test, Great Expectations |
| Accuracy | Values match reality | Revenue matches source system | Cross-system reconciliation |
| Consistency | Same value everywhere | Customer name same in CRM + billing | Cross-database assertions |
| Timeliness | Data is fresh enough | Dashboard updates within 1 hour | Freshness monitoring |
| Uniqueness | No duplicates | Primary key uniqueness | dbt_utils.unique_combination_of_columns |
| Validity | Conforms to business rules | Email matches regex, age > 0 | Schema tests + custom rules |
### Automated Quality Pipeline
```sql
-- Daily data quality checks (dbt tests pattern)

-- Test: No null customer IDs
SELECT COUNT(*) AS failures
FROM customers
WHERE customer_id IS NULL;

-- Test: Email format validation
SELECT COUNT(*) AS failures
FROM customers
WHERE email NOT LIKE '%_@_%.__%';

-- Test: Revenue must be positive
SELECT COUNT(*) AS failures
FROM orders
WHERE total_amount < 0;

-- Test: Referential integrity
SELECT COUNT(*) AS failures
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Test: Data freshness (updated within 2 hours)
SELECT CASE
         WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
         THEN 1 ELSE 0
       END AS stale
FROM orders;
```
### Quality Scoring Framework
```python
# Calculate a data quality score per dataset
def alert_data_steward(dataset_name: str, score: float, checks: list[dict]) -> None:
    """Placeholder alert hook — wire to Slack/email/PagerDuty in production."""
    failing = [c["name"] for c in checks if not c["passed"]]
    print(f"ALERT: {dataset_name} at {score:.1f}% — failing checks: {failing}")

def calculate_quality_score(dataset_name: str, checks: list[dict]) -> float:
    """
    Score = (passed weight / total weight) × 100.
    Weight critical checks more heavily.
    """
    total_weight = sum(check["weight"] for check in checks)
    passed_weight = sum(
        check["weight"] for check in checks if check["passed"]
    )
    score = (passed_weight / total_weight) * 100
    # Alert if below threshold
    if score < 90:
        alert_data_steward(dataset_name, score, checks)
    return round(score, 1)

# Example checks
checks = [
    {"name": "no_null_ids", "weight": 3, "passed": True},
    {"name": "valid_emails", "weight": 2, "passed": True},
    {"name": "fresh_data", "weight": 3, "passed": False},  # Stale!
    {"name": "no_duplicates", "weight": 2, "passed": True},
]

score = calculate_quality_score("customers", checks)
print(f"Quality score: {score}%")  # 70.0% — the freshness check failed
```
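To connect the two pieces above, each failure-count SQL test can be executed and converted into a pass/fail record that feeds the scoring function. A hedged sketch using an in-memory SQLite fixture — the schema, data, and check names are toy examples, not a real warehouse:

```python
import sqlite3

# Toy fixture: one NULL customer_id and one NULL email.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, email TEXT);
    INSERT INTO customers VALUES
        (1, 'a@example.com'),
        (2, NULL),
        (NULL, 'b@example.com');
""")

# Each check is a query returning a count of failing rows (0 = pass).
CHECKS = {
    "no_null_ids":    "SELECT COUNT(*) FROM customers WHERE customer_id IS NULL",
    "no_null_emails": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
}

def run_checks(conn: sqlite3.Connection, checks: dict[str, str]) -> list[dict]:
    """Execute each failure-count query; a check passes when failures == 0."""
    results = []
    for name, sql in checks.items():
        failures = conn.execute(sql).fetchone()[0]
        results.append({"name": name, "passed": failures == 0, "failures": failures})
    return results

results = run_checks(conn, CHECKS)
# Both checks fail here: the fixture has one NULL id and one NULL email.
```

Swap the SQLite connection for your warehouse driver and attach a `weight` to each record, and the output plugs directly into a weighted scoring function like the one above.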
### When Quality Checks Should Run
| Check Type | Frequency | Examples |
|---|---|---|
| Schema validation | Every pipeline run | Column types, nullable constraints |
| Row count anomaly | Every pipeline run | ±10% vs yesterday |
| Freshness | Every 30 minutes | Last update timestamp |
| Business rules | Daily | Revenue > 0, valid geographies |
| Cross-system reconciliation | Weekly | CRM total = DW total |
| Full data profiling | Monthly | Distribution analysis, drift detection |
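The "row count anomaly" check in the table above is simple to implement: flag any run whose row count deviates more than ±10% from the previous run. A minimal sketch (the tolerance and the zero-baseline handling are assumptions you should tune):

```python
def row_count_anomaly(today: int, yesterday: int, tolerance: float = 0.10) -> bool:
    """Return True if today's row count is outside ±tolerance of yesterday's."""
    if yesterday == 0:
        # Assumed convention: any rows after an empty baseline is anomalous.
        return today != 0
    return abs(today - yesterday) / yesterday > tolerance

row_count_anomaly(1_050, 1_000)  # 5% change  -> False
row_count_anomaly(700, 1_000)    # 30% drop   -> True
```

A fixed percentage works well as a first pass; teams with strong weekly seasonality often graduate to comparing against the same weekday last week, or to a rolling-window standard deviation.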
## Step 4: Track Data Lineage
Understanding where data comes from and where it goes is essential for trust and debugging.
```
 Source Systems         Transformation            Consumption
┌──────────┐        ┌─────────────────┐       ┌──────────────┐
│   CRM    │───────▶│  ETL Pipeline   │──────▶│  Dashboard   │
│  (SFDC)  │        │  (Airflow/dbt)  │       │  (Power BI)  │
└──────────┘        └─────────────────┘       └──────────────┘
┌──────────┐                 │                ┌──────────────┐
│   ERP    │───────▶─────────┤                │   ML Model   │
│  (D365)  │                 │                │  (Forecast)  │
└──────────┘                 ▼                └──────────────┘
┌──────────┐        ┌─────────────────┐       ┌──────────────┐
│ Website  │───────▶│ Data Warehouse  │──────▶│  Ad-hoc SQL  │
│  (GA4)   │        │  (Snowflake)    │       │  (Analysts)  │
└──────────┘        └─────────────────┘       └──────────────┘
```
### Why Lineage Matters
| Scenario | Without Lineage | With Lineage |
|---|---|---|
| Dashboard shows wrong number | “Where does this come from?” (hours of investigation) | Click through lineage graph (5 minutes) |
| Source system schema change | All downstream breaks unpredictably | Impact analysis before change |
| Data quality failure | “What’s affected?” (nobody knows) | Automated downstream alerting |
| Compliance audit (GDPR) | “Where is PII stored?” (weeks of research) | Search by classification tag |
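The impact-analysis scenarios above reduce to graph traversal: given an upstream change, find every downstream asset that might break. A sketch over a lineage graph shaped like the diagram earlier — edges point from producer to consumer, and the asset names are illustrative:

```python
from collections import deque

# Toy lineage graph: producer -> list of direct consumers.
LINEAGE = {
    "crm":          ["etl_pipeline"],
    "erp":          ["etl_pipeline"],
    "website":      ["warehouse"],
    "etl_pipeline": ["warehouse"],
    "warehouse":    ["dashboard", "ml_model", "adhoc_sql"],
}

def downstream_of(asset: str, graph: dict[str, list[str]]) -> set[str]:
    """Breadth-first search: every asset reachable downstream of `asset`."""
    seen: set[str] = set()
    queue = deque(graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen

downstream_of("crm", LINEAGE)
# {"etl_pipeline", "warehouse", "dashboard", "ml_model", "adhoc_sql"}
```

Catalog tools like DataHub build this graph for you from query logs and dbt manifests; the traversal logic is the same either way.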
### dbt Lineage
```sql
-- dbt model with documentation and lineage
-- models/marts/customers.sql
-- (the model description itself belongs in the matching schema.yml entry)
{{ config(
    materialized='table',
    meta={
        'owner': 'sales_team',
        'sla': 'refreshed by 6 AM daily',
        'pii': true
    }
) }}

SELECT
    c.customer_id,
    c.full_name,
    c.email,
    c.created_at,
    COALESCE(SUM(o.total_amount), 0) AS lifetime_value,
    COUNT(o.order_id) AS total_orders,
    MAX(o.order_date) AS last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id
GROUP BY 1, 2, 3, 4
```
## Step 5: Access Control and Classification

### Data Classification Tiers
| Tier | Label | Examples | Access | Encryption |
|---|---|---|---|---|
| Public | 🟢 | Marketing content, pricing | Anyone | Optional |
| Internal | 🟡 | Revenue metrics, KPIs | All employees | In transit |
| Confidential | 🟠 | Customer PII, contracts | Need-to-know | At rest + transit |
| Restricted | 🔴 | SSN, payment data, health | Role-based + audit | At rest + transit + key management |
### Implementation
```sql
-- Row-level security in PostgreSQL
-- (RLS must be enabled on the table before policies take effect)
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_access ON customers
    FOR SELECT
    USING (region = current_setting('app.user_region'));

-- Column-level masking: low-privilege roles see redacted values
CREATE VIEW customers_masked AS
SELECT
    customer_id,
    full_name,
    CASE
        WHEN current_user IN ('analyst', 'report_user')
        THEN '***@***.***'
        ELSE email
    END AS email,
    CASE
        WHEN current_user IN ('analyst', 'report_user')
        THEN 'XXX-XX-' || RIGHT(ssn, 4)
        ELSE ssn
    END AS ssn
FROM customers;
```
## Governance Maturity Model
| Level | Description | Key Indicators |
|---|---|---|
| 1 - Ad Hoc | No formal governance | No ownership, no catalog, tribal knowledge |
| 2 - Reactive | Governance exists but triggered by incidents | Owners assigned after a data incident |
| 3 - Proactive | Formal program with automated controls | Catalog, quality checks, scheduled reviews |
| 4 - Managed | Metrics-driven, continuously improving | Quality scores, SLA tracking, compliance dashboards |
| 5 - Optimized | Self-service with embedded governance | Automated lineage, ML-powered anomaly detection |
## Governance Checklist

- [ ] Every critical dataset has a named owner and steward
- [ ] A data catalog exists and is the first place people look for data
- [ ] Automated quality checks run on every pipeline, with a score per dataset
- [ ] Lineage is tracked from source systems through to dashboards and models
- [ ] Data is classified by sensitivity tier, with matching access controls
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data strategy consulting, visit garnetgrid.com.
:::