Data Governance & Data Catalog

Data governance without tooling is policy that nobody follows. Data governance without policy is tooling that nobody trusts. You need both: clear policies for how data should be classified, accessed, and used, combined with automated tooling that enforces those policies at scale.

Data Classification

Level	Data Types	Access	Examples
Public	Marketing content, product info	Anyone	Blog posts, pricing page
Internal	Business metrics, internal docs	All employees	Revenue dashboards, wiki
Confidential	Customer data, financial data	Need-to-know	Customer PII, contracts
Restricted	Cryptographic keys, credentials	Named individuals	API keys, passwords
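
A classification table only helps if code can act on it. The following is a minimal sketch of how the four levels might be encoded and checked. The `Classification` enum, the `can_access` function, and its boolean flags are hypothetical names chosen for illustration, not part of any particular governance tool.

```python
from enum import IntEnum


class Classification(IntEnum):
    """The four levels from the table, ordered from least to most sensitive."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


def can_access(level, is_employee=False, has_need_to_know=False, is_named=False):
    """Return True if the caller satisfies the access rule for the given level.

    The three flags are assumptions standing in for whatever an identity
    provider or access-request system would actually supply.
    """
    if level == Classification.PUBLIC:
        return True                 # Access: anyone
    if level == Classification.INTERNAL:
        return is_employee          # Access: all employees
    if level == Classification.CONFIDENTIAL:
        return has_need_to_know     # Access: need-to-know
    # RESTRICTED (and anything unrecognized): named individuals only
    return is_named
```

Falling through to the strictest rule means an unexpected level is denied by default rather than silently allowed.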

Data Catalog

A data catalog is the “Google for your data.” It answers: What data do we have? Where is it? What does it mean? Who owns it? Who can access it?
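
To make those five questions concrete, here is a toy in-memory catalog, assuming one record per dataset. `CatalogEntry`, its fields, and `search` are hypothetical. Real catalogs such as the tools listed in this section store far richer metadata and expose their own APIs.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str            # What data do we have?
    location: str        # Where is it?
    description: str     # What does it mean?
    owner: str           # Who owns it?
    classification: str  # Who can access it? (resolved via the classification policy)
    tags: list = field(default_factory=list)


def search(entries, query):
    """Case-insensitive substring search over name, description, and tags."""
    q = query.lower()
    return [
        e for e in entries
        if q in e.name.lower()
        or q in e.description.lower()
        or any(q in t.lower() for t in e.tags)
    ]


# Illustrative entries only; the names and locations are made up.
ENTRIES = [
    CatalogEntry("customers", "analytics.customers",
                 "All registered customer accounts.", "customer-data-team",
                 "confidential", ["pii"]),
    CatalogEntry("customer_churn", "analytics.customer_churn",
                 "Monthly churn flags per customer.", "customer-data-team",
                 "internal", ["churn", "retention"]),
]
```

With this in place, `search(ENTRIES, "churn")` returns the single churn entry, including its owner and location.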

Tool	Type	Best For
DataHub (LinkedIn)	Open source	Engineering-driven organizations
OpenMetadata	Open source	Modern data stack (dbt, Airflow)
Amundsen (Lyft)	Open source	Discovery-focused, Python ecosystem
Atlan	Commercial	Enterprise governance + discovery
Collibra	Commercial	Large enterprise, regulatory compliance
dbt docs	Built-in	Already using dbt (lightweight catalog)

Metadata Management

# Table-level metadata
table: customers
schema: analytics
description: "All registered customer accounts. One row per customer."
owner: customer-data-team
pii: true
classification: confidential
refresh: daily (6:00 AM UTC)

columns:
  - name: customer_id
    type: integer
    description: "Primary identifier for customer"
    pii: false
    
  - name: email
    type: string
    description: "Customer email address"
    pii: true
    masking: "hash in non-production environments"
    
  - name: full_name
    type: string
    description: "Customer legal name"
    pii: true
    masking: "redact in analytics views"
    
  - name: segment
    type: string
    description: "Customer segment: 'enterprise', 'mid-market', 'smb'"
    pii: false
    allowed_values: ["enterprise", "mid-market", "smb"]
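
Metadata like this is most useful when it is checked automatically. The sketch below lints one table's metadata and shows one possible reading of the "hash" masking rule. It assumes the YAML has already been loaded into a plain dict (copied by hand here to avoid a parser dependency). `lint_metadata` and `hash_value` are hypothetical helpers, and the choice of salted SHA-256 is an assumption, since the metadata does not name an algorithm.

```python
import hashlib

# Hand-copied subset of the metadata above; a real setup would load the YAML file.
TABLE = {
    "table": "customers",
    "pii": True,
    "classification": "confidential",
    "columns": [
        {"name": "customer_id", "pii": False},
        {"name": "email", "pii": True,
         "masking": "hash in non-production environments"},
        {"name": "full_name", "pii": True,
         "masking": "redact in analytics views"},
        {"name": "segment", "pii": False,
         "allowed_values": ["enterprise", "mid-market", "smb"]},
    ],
}


def lint_metadata(table):
    """Return a list of human-readable problems found in one table's metadata."""
    problems = []
    pii_columns = [c for c in table["columns"] if c.get("pii")]
    for col in pii_columns:
        if not col.get("masking"):
            problems.append(f"{col['name']}: PII column has no masking rule")
    if pii_columns and not table.get("pii"):
        problems.append("table-level pii flag is false but PII columns exist")
    if pii_columns and table.get("classification") not in ("confidential", "restricted"):
        problems.append("table with PII should be classified confidential or higher")
    return problems


def hash_value(value, salt=""):
    """Assumed masking implementation: salted SHA-256, returned as a hex digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
```

Running `lint_metadata(TABLE)` on the example returns an empty list. Removing a `masking` entry from a PII column would produce a problem message, which is the kind of check that could run in CI before a schema change merges.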

Data Stewardship Model

Role	Responsibility	Example
Data Owner	Accountable for data quality and access	VP of Sales owns CRM data
Data Steward	Day-to-day governance, quality rules	Data analyst maintains quality rules
Data Engineer	Pipeline reliability, schema management	Builds and monitors pipelines
Data Consumer	Uses data responsibly, reports issues	Business analyst building reports
Privacy Officer	Compliance, retention policies	Reviews PII handling, GDPR/CCPA

Anti-Patterns

Anti-Pattern	Problem	Fix
No data catalog	“Where is the customer churn data?” → 3-day search	Catalog all data assets, searchable
No data owners	Nobody responsible, quality degrades	Named owner for every data domain
Governance = blocking	Data requests take weeks to approve	Self-service with guardrails, not gates
Classification in name only	Data labeled but no enforcement	Automated access controls based on classification
PII everywhere	Compliance risk, unbounded breach impact	PII detection, masking, access logging
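
The "PII detection" fix in the last row can start very simply. The sketch below flags columns whose sampled values mostly look like email addresses. The regular expression is deliberately loose, and `detect_pii_columns` and its `threshold` parameter are hypothetical. Production scanners cover many more PII types and use more than pattern matching.

```python
import re

# Loose email pattern for illustration; it is not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_pii_columns(rows, threshold=0.5):
    """Return sorted names of columns where at least `threshold` of the
    non-empty sampled values look like email addresses."""
    hits, totals = {}, {}
    for row in rows:
        for col, value in row.items():
            if value is None or value == "":
                continue  # empty cells say nothing about the column
            totals[col] = totals.get(col, 0) + 1
            if EMAIL_RE.fullmatch(str(value)):
                hits[col] = hits.get(col, 0) + 1
    return sorted(c for c in totals if hits.get(c, 0) / totals[c] >= threshold)
```

Columns flagged this way could then be compared against the catalog's `pii` flags, so that unlabeled PII surfaces as a finding instead of waiting for an audit.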

Checklist

Data classification policy defined (4 levels minimum)
Data catalog deployed and populated
Every table/dataset has a documented owner
Column-level metadata: descriptions, PII flags, masking rules
Access controls enforced based on classification
Data quality rules defined and automated
PII detection and masking in non-production
Compliance: retention policies, right-to-delete processes
Data stewardship roles assigned per domain

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For data governance consulting, visit garnetgrid.com. :::
