Data governance fails when it’s treated as a bureaucratic exercise. It succeeds when people see it as “being able to trust the numbers in the dashboard.” 77% of organizations say they struggle with data quality, yet only 25% have a formal governance program. This guide gives you the practical implementation path that creates trust without creating a bureaucracy that slows everyone down.
The fundamental tension: governance adds overhead. Too much governance kills productivity. Too little governance kills trust. The goal is the minimum governance that produces reliable data.
## Step 1: Establish Data Ownership
Every dataset needs exactly one owner. Not a committee — a person. When nobody owns data, nobody is accountable for its quality.
| Data Domain | Owner Role | Steward | Responsibilities |
|---|---|---|---|
| Customer | Head of Sales | CRM Admin | Define what “customer” means, quality rules |
| Financial | CFO / Controller | Finance Analyst | Accuracy of reporting figures |
| Product | VP Product | Product Ops | Catalog accuracy, pricing integrity |
| Employee | CHRO | HR Systems Admin | PII handling, access controls |
| Operational | COO | Data Engineer | Pipeline uptime, data freshness |
### Owner vs Steward vs Custodian
| Role | Responsibility | Example Person | Key Question |
|---|---|---|---|
| Data Owner | Defines business rules, approves access | VP of Sales | “What does this data mean?” |
| Data Steward | Implements quality rules, resolves issues | CRM Admin, Sr Analyst | “Is this data correct today?” |
| Data Custodian | Manages infrastructure, security | Data Engineer, DBA | “Is this data available and protected?” |
| Data Consumer | Uses data for decisions | Business Analyst | “Can I trust this number?” |
### RACI for Data Decisions
| Decision | Owner | Steward | Data Eng | Consumers |
|---|---|---|---|---|
| Define business rules | A | R | C | I |
| Data quality thresholds | A | R | C | I |
| Schema changes | C | A | R | I |
| Access requests | A | R | C | I |
| Incident response | I | A | R | C |
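The RACI matrix above can also live in code, so tooling (for example, an access-request bot or incident router) can look up who is accountable automatically. This is an illustrative sketch — the decision keys and role names are assumptions, not from any real system:

```python
# Hypothetical encoding of the RACI table as a lookup structure.
RACI = {
    "define_business_rules": {"owner": "A", "steward": "R", "data_eng": "C", "consumers": "I"},
    "quality_thresholds":    {"owner": "A", "steward": "R", "data_eng": "C", "consumers": "I"},
    "schema_changes":        {"owner": "C", "steward": "A", "data_eng": "R", "consumers": "I"},
    "incident_response":     {"owner": "I", "steward": "A", "data_eng": "R", "consumers": "C"},
}

def accountable_for(decision: str) -> str:
    """Return the single role marked Accountable (A) for a decision."""
    roles = RACI[decision]
    return next(role for role, raci in roles.items() if raci == "A")

accountable_for("schema_changes")        # "steward"
accountable_for("define_business_rules") # "owner"
```

Keeping the matrix in one place means the table in your docs and the routing logic in your tooling can't drift apart.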
## Step 2: Build Your Data Catalog
A catalog without adoption is shelf-ware. Make it the first place people go to find data.
```python
# Example: register a dataset and its quality contract with Great Expectations.
# NOTE: the GX API has changed across major versions; this sketch follows the
# 1.x "fluent" style — check the docs for the release you are running.
import great_expectations as gx

context = gx.get_context()

# Add a data source and a table asset
datasource = context.data_sources.add_postgres(
    name="production_db",
    connection_string="postgresql://...",
)
asset = datasource.add_table_asset(name="customers", table_name="customers")
batch_def = asset.add_batch_definition_whole_table("customers_full")

# Create an expectation suite (the quality contract)
suite = context.suites.add(gx.ExpectationSuite(name="customers_quality"))
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="email")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="lifetime_value", min_value=0, max_value=10_000_000
    )
)

# Run validation
validation = context.validation_definitions.add(
    gx.ValidationDefinition(name="customers_daily", data=batch_def, suite=suite)
)
results = validation.run()
print(f"Validation passed: {results.success}")
```
| Tool | Best For | Cost | Open Source? |
|---|---|---|---|
| DataHub | LinkedIn-scale, flexible metadata | Free | Yes |
| OpenMetadata | Full-featured, modern UI | Free | Yes |
| Atlan | Business-user friendly, collaboration | $$$$ | No |
| Alation | Enterprise ML-powered cataloging | $$$$$ | No |
| dbt Docs | dbt-centric teams, lightweight | Free | Yes |
| AWS Glue Catalog | AWS-native, auto-crawling | Per-use | No |
### What to Document Per Dataset
| Metadata Field | Example | Who Maintains |
|---|---|---|
| Description | “All active customer accounts with lifetime metrics” | Data Steward |
| Owner | sales_team | Data Owner |
| SLA | Refreshed by 6 AM daily | Data Engineer |
| PII flag | Yes (email, phone) | Data Steward |
| Source system | Salesforce CRM → Fivetran → Snowflake | Data Engineer |
| Quality score | 94% (3 of 50 checks failing) | Automated |
| Downstream consumers | Revenue dashboard, Churn model, Finance report | Auto-discovered |
| Freshness | Last updated 2 hours ago | Automated |
## Step 3: Implement Data Quality Rules

### Quality Dimensions
| Dimension | Definition | Example Check | Automation |
|---|---|---|---|
| Completeness | No critical nulls | NOT NULL on required fields | dbt test, Great Expectations |
| Accuracy | Values match reality | Revenue matches source system | Cross-system reconciliation |
| Consistency | Same value everywhere | Customer name same in CRM + billing | Cross-database assertions |
| Timeliness | Data is fresh enough | Dashboard updates within 1 hour | Freshness monitoring |
| Uniqueness | No duplicates | Primary key uniqueness | dbt_utils.unique_combination_of_columns |
| Validity | Conforms to business rules | Email matches regex, age > 0 | Schema tests + custom rules |
### Automated Quality Pipeline
```sql
-- Daily data quality checks (dbt tests pattern)

-- Test: No null customer IDs
SELECT COUNT(*) AS failures
FROM customers
WHERE customer_id IS NULL;

-- Test: Email format validation
SELECT COUNT(*) AS failures
FROM customers
WHERE email NOT LIKE '%_@_%.__%';

-- Test: Revenue must be positive
SELECT COUNT(*) AS failures
FROM orders
WHERE total_amount < 0;

-- Test: Referential integrity
SELECT COUNT(*) AS failures
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Test: Data freshness (updated within 2 hours)
SELECT CASE
         WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
         THEN 1 ELSE 0
       END AS stale
FROM orders;
```
### Quality Scoring Framework
```python
# Calculate a data quality score per dataset
def alert_data_steward(dataset_name: str, score: float, checks: list[dict]) -> None:
    """Placeholder alert hook — wire to Slack/email/PagerDuty in production."""
    failing = [c["name"] for c in checks if not c["passed"]]
    print(f"ALERT: {dataset_name} at {score:.1f}% — failing checks: {failing}")

def calculate_quality_score(dataset_name: str, checks: list[dict]) -> float:
    """
    Score = (passed weight / total weight) × 100.
    Weight critical checks more heavily.
    """
    total_weight = sum(check["weight"] for check in checks)
    passed_weight = sum(
        check["weight"] for check in checks if check["passed"]
    )
    score = (passed_weight / total_weight) * 100
    # Alert if below threshold
    if score < 90:
        alert_data_steward(dataset_name, score, checks)
    return round(score, 1)

# Example checks
checks = [
    {"name": "no_null_ids", "weight": 3, "passed": True},
    {"name": "valid_emails", "weight": 2, "passed": True},
    {"name": "fresh_data", "weight": 3, "passed": False},  # Stale!
    {"name": "no_duplicates", "weight": 2, "passed": True},
]

score = calculate_quality_score("customers", checks)
print(f"Quality score: {score}%")  # 70.0% — the freshness check failed
```
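To connect the two pieces above, each failure-count SQL test can be executed and converted into a pass/fail record that feeds the scoring function. A hedged sketch using an in-memory SQLite fixture — the schema, data, and check names are toy examples, not a real warehouse:

```python
import sqlite3

# Toy fixture: one NULL customer_id and one NULL email.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, email TEXT);
    INSERT INTO customers VALUES
        (1, 'a@example.com'),
        (2, NULL),
        (NULL, 'b@example.com');
""")

# Each check is a query returning a count of failing rows (0 = pass).
CHECKS = {
    "no_null_ids":    "SELECT COUNT(*) FROM customers WHERE customer_id IS NULL",
    "no_null_emails": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
}

def run_checks(conn: sqlite3.Connection, checks: dict[str, str]) -> list[dict]:
    """Execute each failure-count query; a check passes when failures == 0."""
    results = []
    for name, sql in checks.items():
        failures = conn.execute(sql).fetchone()[0]
        results.append({"name": name, "passed": failures == 0, "failures": failures})
    return results

results = run_checks(conn, CHECKS)
# Both checks fail here: the fixture has one NULL id and one NULL email.
```

Swap the SQLite connection for your warehouse driver and attach a `weight` to each record, and the output plugs directly into a weighted scoring function like the one above.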
### When Quality Checks Should Run
| Check Type | Frequency | Examples |
|---|---|---|
| Schema validation | Every pipeline run | Column types, nullable constraints |
| Row count anomaly | Every pipeline run | ±10% vs yesterday |
| Freshness | Every 30 minutes | Last update timestamp |
| Business rules | Daily | Revenue > 0, valid geographies |
| Cross-system reconciliation | Weekly | CRM total = DW total |
| Full data profiling | Monthly | Distribution analysis, drift detection |
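The "row count anomaly" check in the table above is simple to implement: flag any run whose row count deviates more than ±10% from the previous run. A minimal sketch (the tolerance and the zero-baseline handling are assumptions you should tune):

```python
def row_count_anomaly(today: int, yesterday: int, tolerance: float = 0.10) -> bool:
    """Return True if today's row count is outside ±tolerance of yesterday's."""
    if yesterday == 0:
        # Assumed convention: any rows after an empty baseline is anomalous.
        return today != 0
    return abs(today - yesterday) / yesterday > tolerance

row_count_anomaly(1_050, 1_000)  # 5% change  -> False
row_count_anomaly(700, 1_000)    # 30% drop   -> True
```

A fixed percentage works well as a first pass; teams with strong weekly seasonality often graduate to comparing against the same weekday last week, or to a rolling-window standard deviation.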
## Step 4: Track Data Lineage
Understanding where data comes from and where it goes is essential for trust and debugging.
```
 Source Systems         Transformation            Consumption
┌──────────┐        ┌─────────────────┐       ┌──────────────┐
│   CRM    │───────▶│  ETL Pipeline   │──────▶│  Dashboard   │
│  (SFDC)  │        │  (Airflow/dbt)  │       │  (Power BI)  │
└──────────┘        └─────────────────┘       └──────────────┘
┌──────────┐                 │                ┌──────────────┐
│   ERP    │───────▶─────────┤                │   ML Model   │
│  (D365)  │                 │                │  (Forecast)  │
└──────────┘                 ▼                └──────────────┘
┌──────────┐        ┌─────────────────┐       ┌──────────────┐
│ Website  │───────▶│ Data Warehouse  │──────▶│  Ad-hoc SQL  │
│  (GA4)   │        │  (Snowflake)    │       │  (Analysts)  │
└──────────┘        └─────────────────┘       └──────────────┘
```
### Why Lineage Matters
| Scenario | Without Lineage | With Lineage |
|---|---|---|
| Dashboard shows wrong number | “Where does this come from?” (hours of investigation) | Click through lineage graph (5 minutes) |
| Source system schema change | All downstream breaks unpredictably | Impact analysis before change |
| Data quality failure | “What’s affected?” (nobody knows) | Automated downstream alerting |
| Compliance audit (GDPR) | “Where is PII stored?” (weeks of research) | Search by classification tag |
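The impact-analysis scenarios above reduce to graph traversal: given an upstream change, find every downstream asset that might break. A sketch over a lineage graph shaped like the diagram earlier — edges point from producer to consumer, and the asset names are illustrative:

```python
from collections import deque

# Toy lineage graph: producer -> list of direct consumers.
LINEAGE = {
    "crm":          ["etl_pipeline"],
    "erp":          ["etl_pipeline"],
    "website":      ["warehouse"],
    "etl_pipeline": ["warehouse"],
    "warehouse":    ["dashboard", "ml_model", "adhoc_sql"],
}

def downstream_of(asset: str, graph: dict[str, list[str]]) -> set[str]:
    """Breadth-first search: every asset reachable downstream of `asset`."""
    seen: set[str] = set()
    queue = deque(graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen

downstream_of("crm", LINEAGE)
# {"etl_pipeline", "warehouse", "dashboard", "ml_model", "adhoc_sql"}
```

Catalog tools like DataHub build this graph for you from query logs and dbt manifests; the traversal logic is the same either way.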
### dbt Lineage
```sql
-- dbt model with documentation and lineage
-- models/marts/customers.sql
-- (the model description itself belongs in the matching schema.yml entry)
{{ config(
    materialized='table',
    meta={
        'owner': 'sales_team',
        'sla': 'refreshed by 6 AM daily',
        'pii': true
    }
) }}

SELECT
    c.customer_id,
    c.full_name,
    c.email,
    c.created_at,
    COALESCE(SUM(o.total_amount), 0) AS lifetime_value,
    COUNT(o.order_id) AS total_orders,
    MAX(o.order_date) AS last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id
GROUP BY 1, 2, 3, 4
```
## Step 5: Access Control and Classification

### Data Classification Tiers
| Tier | Label | Examples | Access | Encryption |
|---|---|---|---|---|
| Public | 🟢 | Marketing content, pricing | Anyone | Optional |
| Internal | 🟡 | Revenue metrics, KPIs | All employees | In transit |
| Confidential | 🟠 | Customer PII, contracts | Need-to-know | At rest + transit |
| Restricted | 🔴 | SSN, payment data, health | Role-based + audit | At rest + transit + key management |
### Implementation
```sql
-- Row-level security in PostgreSQL
-- (RLS must be enabled on the table before policies take effect)
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_access ON customers
    FOR SELECT
    USING (region = current_setting('app.user_region'));

-- Column-level masking: low-privilege roles see redacted values
CREATE VIEW customers_masked AS
SELECT
    customer_id,
    full_name,
    CASE
        WHEN current_user IN ('analyst', 'report_user')
        THEN '***@***.***'
        ELSE email
    END AS email,
    CASE
        WHEN current_user IN ('analyst', 'report_user')
        THEN 'XXX-XX-' || RIGHT(ssn, 4)
        ELSE ssn
    END AS ssn
FROM customers;
```
## Governance Maturity Model
| Level | Description | Key Indicators |
|---|---|---|
| 1 - Ad Hoc | No formal governance | No ownership, no catalog, tribal knowledge |
| 2 - Reactive | Governance exists but triggered by incidents | Owners assigned after a data incident |
| 3 - Proactive | Formal program with automated controls | Catalog, quality checks, scheduled reviews |
| 4 - Managed | Metrics-driven, continuously improving | Quality scores, SLA tracking, compliance dashboards |
| 5 - Optimized | Self-service with embedded governance | Automated lineage, ML-powered anomaly detection |
## Governance Checklist

- [ ] Every critical dataset has a named owner and steward
- [ ] A data catalog exists and is the first place people look for data
- [ ] Automated quality checks run on every pipeline, with a score per dataset
- [ ] Lineage is tracked from source systems through to dashboards and models
- [ ] Data is classified by sensitivity tier, with matching access controls
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data strategy consulting, visit garnetgrid.com.
:::