
Data Governance: Building Trust in Your Data

Implement data governance that actually works. Covers data catalog setup, quality rules, ownership models, lineage tracking, and compliance automation.

Data governance fails when it’s treated as a bureaucratic exercise. It succeeds when people see it as “being able to trust the numbers in the dashboard.” 77% of organizations say they struggle with data quality, yet only 25% have a formal governance program. This guide gives you the practical implementation path that creates trust without creating a bureaucracy that slows everyone down.

The fundamental tension: governance adds overhead. Too much governance kills productivity. Too little governance kills trust. The goal is the minimum governance that produces reliable data.


Step 1: Establish Data Ownership

Every dataset needs exactly one owner. Not a committee — a person. When nobody owns data, nobody is accountable for its quality.

| Data Domain | Owner Role | Steward | Responsibilities |
|---|---|---|---|
| Customer | Head of Sales | CRM Admin | Define what “customer” means, quality rules |
| Financial | CFO / Controller | Finance Analyst | Accuracy of reporting figures |
| Product | VP Product | Product Ops | Catalog accuracy, pricing integrity |
| Employee | CHRO | HR Systems Admin | PII handling, access controls |
| Operational | COO | Data Engineer | Pipeline uptime, data freshness |

Owner vs Steward vs Custodian

| Role | Responsibility | Example Person | Key Question |
|---|---|---|---|
| Data Owner | Defines business rules, approves access | VP of Sales | “What does this data mean?” |
| Data Steward | Implements quality rules, resolves issues | CRM Admin, Sr Analyst | “Is this data correct today?” |
| Data Custodian | Manages infrastructure, security | Data Engineer, DBA | “Is this data available and protected?” |
| Data Consumer | Uses data for decisions | Business Analyst | “Can I trust this number?” |

RACI for Data Decisions

| Decision | Owner | Steward | Data Eng | Consumers |
|---|---|---|---|---|
| Define business rules | A | R | C | I |
| Data quality thresholds | A | R | C | I |
| Schema changes | C | A | R | I |
| Access requests | A | R | C | I |
| Incident response | I | A | R | C |
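The matrix above can also live as plain data, so tooling (an access-request bot, an incident pager) can look up who holds the "A" for a decision. A minimal sketch; the role and decision names are illustrative, not a real API:

```python
# Hypothetical encoding of the RACI matrix above as queryable data.
from collections import OrderedDict

RACI = {
    "define_business_rules":   {"owner": "A", "steward": "R", "data_eng": "C", "consumers": "I"},
    "data_quality_thresholds": {"owner": "A", "steward": "R", "data_eng": "C", "consumers": "I"},
    "schema_changes":          {"owner": "C", "steward": "A", "data_eng": "R", "consumers": "I"},
    "access_requests":         {"owner": "A", "steward": "R", "data_eng": "C", "consumers": "I"},
    "incident_response":       {"owner": "I", "steward": "A", "data_eng": "R", "consumers": "C"},
}

def accountable_for(decision: str) -> str:
    """Return the single role holding the 'A' for a decision."""
    accountable = [role for role, letter in RACI[decision].items() if letter == "A"]
    # RACI requires exactly one Accountable per decision
    assert len(accountable) == 1, f"{decision}: expected exactly one 'A'"
    return accountable[0]

print(accountable_for("schema_changes"))  # steward
```

The assertion doubles as a lint: if someone edits the matrix and assigns two Accountables, the lookup fails loudly instead of routing a decision to the wrong person.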

Step 2: Build Your Data Catalog

A catalog without adoption is shelf-ware. Make it the first place people go to find data.

```python
# Example: register datasets in a lightweight catalog,
# using Great Expectations for documentation and quality contracts.
# NOTE: the GX API changes between major versions; this sketch follows
# the Fluent API (GX 0.17+). Check the docs for your installed version.

import great_expectations as gx

context = gx.get_context()

# Add a data source (connection string is a placeholder)
datasource = context.sources.add_postgres(
    "production_db",
    connection_string="postgresql://..."
)

# Create an expectation suite (the dataset's quality contract)
suite = context.add_expectation_suite("customers_quality")

# Define expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="customer_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="email")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="lifetime_value", min_value=0, max_value=10000000
    )
)

# Run validation ("daily_checkpoint" must be configured separately
# to tie the suite to a batch of data)
results = context.run_checkpoint("daily_checkpoint")
print(f"Validation passed: {results.success}")
```

Catalog Tool Comparison

| Tool | Best For | Cost | Open Source? |
|---|---|---|---|
| DataHub | LinkedIn-scale, flexible metadata | Free | Yes |
| OpenMetadata | Full-featured, modern UI | Free | Yes |
| Atlan | Business-user friendly, collaboration | $$$$ | No |
| Alation | Enterprise ML-powered cataloging | $$$$$ | No |
| dbt Docs | dbt-centric teams, lightweight | Free | Yes |
| AWS Glue Catalog | AWS-native, auto-crawling | Per-use | No |

What to Document Per Dataset

| Metadata Field | Example | Who Maintains |
|---|---|---|
| Description | “All active customer accounts with lifetime metrics” | Data Steward |
| Owner | sales_team | Data Owner |
| SLA | Refreshed by 6 AM daily | Data Engineer |
| PII flag | Yes (email, phone) | Data Steward |
| Source system | Salesforce CRM → Fivetran → Snowflake | Data Engineer |
| Quality score | 94% (3 of 50 checks failing) | Automated |
| Downstream consumers | Revenue dashboard, Churn model, Finance report | Auto-discovered |
| Freshness | Last updated 2 hours ago | Automated |

Step 3: Implement Data Quality Rules

Quality Dimensions

| Dimension | Definition | Example Check | Automation |
|---|---|---|---|
| Completeness | No critical nulls | NOT NULL on required fields | dbt test, Great Expectations |
| Accuracy | Values match reality | Revenue matches source system | Cross-system reconciliation |
| Consistency | Same value everywhere | Customer name same in CRM + billing | Cross-database assertions |
| Timeliness | Data is fresh enough | Dashboard updates within 1 hour | Freshness monitoring |
| Uniqueness | No duplicates | Primary key uniqueness | dbt_utils.unique_combination_of_columns |
| Validity | Conforms to business rules | Email matches regex, age > 0 | Schema tests + custom rules |

Automated Quality Pipeline

```sql
-- Daily data quality checks (dbt tests pattern)

-- Test: No null customer IDs
SELECT COUNT(*) AS failures
FROM customers
WHERE customer_id IS NULL;

-- Test: Email format validation
SELECT COUNT(*) AS failures
FROM customers
WHERE email NOT LIKE '%_@_%.__%';

-- Test: Revenue must be positive
SELECT COUNT(*) AS failures
FROM orders
WHERE total_amount < 0;

-- Test: Referential integrity
SELECT COUNT(*) AS failures
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Test: Data freshness (updated within 2 hours)
SELECT CASE
    WHEN MAX(updated_at) < NOW() - INTERVAL '2 hours'
    THEN 1 ELSE 0
END AS stale
FROM orders;
```
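The failure-count pattern above works against any SQL engine: each check is a query that returns the number of offending rows, and zero means pass. A sketch of a runner, using an in-memory SQLite database purely for illustration (your checks would point at the warehouse):

```python
# Sketch: run failure-count checks and collect results per check.
# SQLite in-memory is used here only so the example is self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, email TEXT);
    INSERT INTO customers VALUES (1, 'a@example.com'), (NULL, 'b@example.com');
""")

# Each check: name -> query returning a single failure count
CHECKS = {
    "no_null_customer_ids": "SELECT COUNT(*) FROM customers WHERE customer_id IS NULL",
    "email_format": "SELECT COUNT(*) FROM customers WHERE email NOT LIKE '%_@_%.__%'",
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
failed = {name: n for name, n in results.items() if n > 0}
print(results)  # {'no_null_customer_ids': 1, 'email_format': 0}
```

Keeping checks as plain SQL strings in version control means stewards can review and edit them without touching pipeline code.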

Quality Scoring Framework

```python
# Calculate a data quality score per dataset.
# `alert_data_steward` is a placeholder for your notification hook
# (Slack, PagerDuty, email, whatever your team uses).

def alert_data_steward(dataset_name: str, score: float, checks: list[dict]) -> None:
    print(f"ALERT: {dataset_name} quality at {score:.1f}%")  # stub

def calculate_quality_score(dataset_name: str, checks: list[dict]) -> float:
    """
    Score = (passed weight / total weight) × 100.
    Weight critical checks more heavily.
    """
    total_weight = sum(check["weight"] for check in checks)
    passed_weight = sum(
        check["weight"] for check in checks if check["passed"]
    )
    score = (passed_weight / total_weight) * 100

    # Alert the steward if the score drops below threshold
    if score < 90:
        alert_data_steward(dataset_name, score, checks)

    return round(score, 1)

# Example checks
checks = [
    {"name": "no_null_ids", "weight": 3, "passed": True},
    {"name": "valid_emails", "weight": 2, "passed": True},
    {"name": "fresh_data", "weight": 3, "passed": False},  # Stale!
    {"name": "no_duplicates", "weight": 2, "passed": True},
]

score = calculate_quality_score("customers", checks)
print(f"Quality score: {score}%")  # 70.0% — freshness check failed
```

When Quality Checks Should Run

| Check Type | Frequency | Examples |
|---|---|---|
| Schema validation | Every pipeline run | Column types, nullable constraints |
| Row count anomaly | Every pipeline run | ±10% vs yesterday |
| Freshness | Every 30 minutes | Last update timestamp |
| Business rules | Daily | Revenue > 0, valid geographies |
| Cross-system reconciliation | Weekly | CRM total = DW total |
| Full data profiling | Monthly | Distribution analysis, drift detection |
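The freshness row above reduces to a timestamp comparison. A minimal sketch, assuming the warehouse exposes a last-updated timestamp per table (every warehouse does, though the query to get it varies):

```python
# Sketch: flag a dataset stale when its last update exceeds the allowed lag.
from datetime import datetime, timedelta, timezone

def is_stale(last_updated: datetime, max_lag: timedelta) -> bool:
    """True when the dataset has not been refreshed within max_lag."""
    return datetime.now(timezone.utc) - last_updated > max_lag

# A table last touched 3 hours ago, with a 2-hour SLA, is stale
last = datetime.now(timezone.utc) - timedelta(hours=3)
print(is_stale(last, timedelta(hours=2)))  # True
```

Run this every 30 minutes from whatever scheduler you already have; the point is that freshness is cheap to check and should alert before a consumer notices.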

Step 4: Track Data Lineage

Understanding where data comes from and where it goes is essential for trust and debugging.

```
Source Systems          Transformation          Consumption
┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│ CRM      │────▶│  ETL Pipeline   │────▶│  Dashboard   │
│ (SFDC)   │     │  (Airflow/dbt)  │     │  (Power BI)  │
└──────────┘     └─────────────────┘     └──────────────┘
┌──────────┐            │                ┌──────────────┐
│ ERP      │────▶───────┤                │  ML Model    │
│ (D365)   │            │                │  (Forecast)  │
└──────────┘            ▼                └──────────────┘
┌──────────┐     ┌─────────────────┐     ┌──────────────┐
│ Website  │────▶│  Data Warehouse │────▶│  Ad-hoc SQL  │
│ (GA4)    │     │  (Snowflake)    │     │  (Analysts)  │
└──────────┘     └─────────────────┘     └──────────────┘
```

Why Lineage Matters

| Scenario | Without Lineage | With Lineage |
|---|---|---|
| Dashboard shows wrong number | “Where does this come from?” (hours of investigation) | Click through lineage graph (5 minutes) |
| Source system schema change | All downstream breaks unpredictably | Impact analysis before change |
| Data quality failure | “What’s affected?” (nobody knows) | Automated downstream alerting |
| Compliance audit (GDPR) | “Where is PII stored?” (weeks of research) | Search by classification tag |
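Under the hood, lineage is a directed graph, and "what's affected?" is a downstream walk from the changed node. A sketch using the systems from the diagram above; the edges here are assumptions read off the diagram, not discovered metadata:

```python
# Hypothetical lineage graph: node -> list of direct downstream nodes.
from collections import deque

LINEAGE = {
    "crm":          ["etl_pipeline"],
    "erp":          ["etl_pipeline"],
    "website":      ["warehouse"],
    "etl_pipeline": ["warehouse", "dashboard"],   # assumed edges
    "warehouse":    ["ml_model", "adhoc_sql"],
}

def downstream(node: str) -> set[str]:
    """All assets affected if `node` changes (breadth-first walk)."""
    seen: set[str] = set()
    queue = deque([node])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(downstream("crm")))
# ['adhoc_sql', 'dashboard', 'etl_pipeline', 'ml_model', 'warehouse']
```

This is exactly the query a lineage tool answers at column granularity; the graph structure is the same, just bigger.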

dbt Lineage

```sql
-- dbt model with lineage via ref(): models/marts/customers.sql
-- (column descriptions and docs belong in schema.yml; meta is shown
-- inline here for brevity)
{{ config(
    materialized='table',
    meta={
        'owner': 'sales_team',
        'sla': 'refreshed by 6 AM daily',
        'pii': true
    }
) }}

SELECT
    c.customer_id,
    c.full_name,
    c.email,
    c.created_at,
    COALESCE(SUM(o.total_amount), 0) AS lifetime_value,
    COUNT(o.order_id) AS total_orders,
    MAX(o.order_date) AS last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id
GROUP BY 1, 2, 3, 4
```

Step 5: Access Control and Classification

Data Classification Tiers

| Tier | Label | Examples | Access | Encryption |
|---|---|---|---|---|
| Public | 🟢 | Marketing content, pricing | Anyone | Optional |
| Internal | 🟡 | Revenue metrics, KPIs | All employees | In transit |
| Confidential | 🟠 | Customer PII, contracts | Need-to-know | At rest + transit |
| Restricted | 🔴 | SSN, payment data, health | Role-based + audit | At rest + transit + key management |
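Because the tiers are ordered, a simple clearance comparison enforces most of the table: access at one tier implies read access to everything below it. A sketch; Restricted data would additionally need role checks and audit logging, which are not shown here:

```python
# Sketch: ordered classification tiers as an IntEnum, so clearance
# comparisons are just integer comparisons.
from enum import IntEnum

class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3   # in practice: also role-based checks + audit trail

def can_read(user_clearance: Tier, data_tier: Tier) -> bool:
    """Higher or equal clearance may read the data."""
    return user_clearance >= data_tier

print(can_read(Tier.INTERNAL, Tier.CONFIDENTIAL))  # False
```

Storing the tier on every catalog entry means this one function can gate access requests, mask exports, and drive the GDPR "where is PII?" search.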

Implementation

```sql
-- Row-level security in PostgreSQL
-- (RLS must be enabled on the table before policies take effect)
ALTER TABLE customers ENABLE ROW LEVEL SECURITY;

CREATE POLICY region_access ON customers
    FOR SELECT
    USING (region = current_setting('app.user_region'));

-- Column-level masking: lower-privilege roles see masked values
CREATE VIEW customers_masked AS
SELECT
    customer_id,
    full_name,
    CASE
        WHEN current_user IN ('analyst', 'report_user')
        THEN '***@***.***'
        ELSE email
    END AS email,
    CASE
        WHEN current_user IN ('analyst', 'report_user')
        THEN 'XXX-XX-' || RIGHT(ssn, 4)
        ELSE ssn
    END AS ssn
FROM customers;
```

Governance Maturity Model

| Level | Description | Key Indicators |
|---|---|---|
| 1 - Ad Hoc | No formal governance | No ownership, no catalog, tribal knowledge |
| 2 - Reactive | Governance exists but triggered by incidents | Owners assigned after a data incident |
| 3 - Proactive | Formal program with automated controls | Catalog, quality checks, scheduled reviews |
| 4 - Managed | Metrics-driven, continuously improving | Quality scores, SLA tracking, compliance dashboards |
| 5 - Optimized | Self-service with embedded governance | Automated lineage, ML-powered anomaly detection |

Governance Checklist

  • Data ownership assigned (one owner per domain), RACI documented
  • Data catalog deployed with searchable metadata (not just a spreadsheet)
  • Quality rules defined for every critical dataset (all 6 dimensions)
  • Automated quality checks running daily with quality scores
  • Data lineage documented (source → transform → consumption)
  • Classification tiers defined and enforced (Public → Restricted)
  • Row/column-level security implemented for PII/Confidential
  • Access request process with approval workflow
  • Quarterly data quality review meetings with owners
  • Incident response plan for data quality failures
  • Governance maturity self-assessed and improvement plan documented

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For data strategy consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
