Data Governance Frameworks
Implement data governance that enables data usage while maintaining compliance. Covers data cataloging, data classification, access policies, data quality rules, lineage tracking, and the organizational structures that make governance effective.
Data governance answers three questions: What data do we have? Who is allowed to access it? Is it accurate? Without it, organizations drown in data lakes full of undocumented, unclassified data that no one trusts and everyone is afraid to use.
Governance Pillars
1. Data Cataloging: What data exists and where
2. Data Classification: Sensitivity level of each dataset
3. Access Control: Who can access what, and how
4. Data Quality: Is the data accurate and complete?
5. Data Lineage: Where did this data come from, what transformed it?
6. Retention: How long do we keep data?
7. Privacy: PII handling, consent, right to deletion
Data Classification
classification_levels:
  public:
    description: "Data intended for public consumption"
    examples: ["Marketing content", "Public APIs", "Blog posts"]
    controls: "None required"
  internal:
    description: "Data for internal use, not sensitive"
    examples: ["Internal wikis", "Non-sensitive metrics", "Team directories"]
    controls: "Authentication required"
  confidential:
    description: "Business-sensitive data"
    examples: ["Revenue data", "Customer lists", "Strategic plans"]
    controls: "Role-based access, audit logging"
  restricted:
    description: "Highly sensitive, regulated data"
    examples: ["PII", "PHI", "Payment card data", "SSN"]
    controls: "Encryption, MFA, audit logging, DLP, retention limits"

auto_classification_rules:
  - pattern: "SSN|social_security"
    classification: restricted
  - pattern: "email|phone|address"
    classification: restricted
  - pattern: "revenue|profit|margin"
    classification: confidential
  - pattern: "password|secret|token"
    classification: restricted
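The rules above can be applied to column names with a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `classify_column`, the case-insensitive substring matching, the `internal` default for unmatched names, and the "most sensitive match wins" tie-break are all assumptions added for illustration.

```python
import re

# Rules copied from auto_classification_rules, in the same order.
# Case-insensitive substring matching is an assumption of this sketch.
RULES = [
    (re.compile(r"SSN|social_security", re.IGNORECASE), "restricted"),
    (re.compile(r"email|phone|address", re.IGNORECASE), "restricted"),
    (re.compile(r"revenue|profit|margin", re.IGNORECASE), "confidential"),
    (re.compile(r"password|secret|token", re.IGNORECASE), "restricted"),
]

# Ordered from least to most sensitive, as in classification_levels.
LEVELS = ["public", "internal", "confidential", "restricted"]


def classify_column(column_name: str, default: str = "internal") -> str:
    """Return the most sensitive level among matching rules, else the default."""
    matched = [level for pattern, level in RULES if pattern.search(column_name)]
    return max(matched, key=LEVELS.index) if matched else default


if __name__ == "__main__":
    for name in ["customer_email", "gross_margin_pct", "page_views"]:
        print(name, "->", classify_column(name))
```

Substring matching errs on the side of over-classifying: a column called `classname` contains `ssn` and would be marked restricted. That is usually the safer failure mode, but it is a reason to pair automated rules with steward review.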
Data Catalog
# Catalog entry for a dataset
catalog_entry = {
    "name": "customer_360",
    "schema": "analytics",
    "description": "Unified customer profile combining CRM, product, and billing data",
    "owner": "data-platform-team",
    "steward": "jane.doe@company.com",
    "classification": "restricted",  # Contains PII
    "columns": [
        {"name": "customer_id", "type": "UUID", "classification": "internal", "pii": False},
        {"name": "email", "type": "VARCHAR", "classification": "restricted", "pii": True},
        {"name": "name", "type": "VARCHAR", "classification": "restricted", "pii": True},
        {"name": "lifetime_value", "type": "DECIMAL", "classification": "confidential", "pii": False},
        {"name": "churn_score", "type": "FLOAT", "classification": "confidential", "pii": False},
    ],
    "lineage": {
        "sources": ["crm.contacts", "billing.customers", "product.user_events"],
        "transformations": ["dbt model: customer_360"],
        "consumers": ["marketing-dashboard", "churn-prediction-model"],
    },
    "quality": {
        "freshness": "Updated daily at 06:00 UTC",
        "completeness": "email: 98%, name: 95%, lifetime_value: 92%",
        "uniqueness": "customer_id: 100% unique",
        "tests": ["not_null: customer_id, email", "unique: customer_id"],
    },
    "retention": {
        "policy": "7 years after last activity",
        "pii_deletion": "90 days after deletion request",
    },
}
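A catalog entry like this is only useful if it stays internally consistent. The sketch below shows one way that could be checked, assuming two conventions the entry implies but does not state: a dataset should be classified at least as strictly as its most sensitive column, and every PII column should be `restricted`. The helper names (`rollup_classification`, `pii_columns`, `check_entry`) are hypothetical.

```python
# Ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential", "restricted"]

# Trimmed copy of the customer_360 entry, keeping only the fields checked here.
entry = {
    "name": "customer_360",
    "classification": "restricted",
    "columns": [
        {"name": "customer_id", "classification": "internal", "pii": False},
        {"name": "email", "classification": "restricted", "pii": True},
        {"name": "name", "classification": "restricted", "pii": True},
        {"name": "lifetime_value", "classification": "confidential", "pii": False},
        {"name": "churn_score", "classification": "confidential", "pii": False},
    ],
}


def rollup_classification(columns):
    """Return the most sensitive classification found among the columns."""
    return max((col["classification"] for col in columns), key=LEVELS.index)


def pii_columns(columns):
    """Return the names of columns flagged as PII."""
    return [col["name"] for col in columns if col["pii"]]


def check_entry(entry):
    """Return a list of consistency problems; an empty list means none found."""
    problems = []
    required = rollup_classification(entry["columns"])
    if LEVELS.index(entry["classification"]) < LEVELS.index(required):
        problems.append(
            f"{entry['name']}: classified {entry['classification']}, "
            f"but columns require {required}"
        )
    for col in entry["columns"]:
        if col["pii"] and col["classification"] != "restricted":
            problems.append(f"{entry['name']}.{col['name']}: PII column is not restricted")
    return problems


if __name__ == "__main__":
    print("PII columns:", pii_columns(entry["columns"]))
    print("Problems:", check_entry(entry))
```

Running a check like this whenever a catalog entry changes would catch the common case where a new sensitive column is added but the dataset-level classification is never raised.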
Data Quality Rules
# dbt tests for data quality
# schema.yml
models:
  - name: customer_360
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
          # accepted_values only matches exact values, so a LIKE expression
          # is used for a loose email shape check instead.
          # Column-level expression_is_true assumes dbt_utils 1.0 or later.
          - dbt_utils.expression_is_true:
              expression: "like '%@%.%'"
      - name: lifetime_value
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 10000000
      - name: churn_score
        tests:
          - dbt_utils.accepted_range:
              min_value: 0.0
              max_value: 1.0
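For pipelines that do not run dbt, the same rules can be approximated in plain Python. The following is a rough sketch under stated assumptions: rows arrive as dictionaries, the email rule is read as a loose shape check (an `@` followed later by a `.`), and a missing `churn_score` is allowed because that column has no `not_null` test. The function name `check_rows` and the sample rows are invented for illustration.

```python
import re

# Loose shape check: an "@" followed somewhere later by a ".".
EMAIL_SHAPE = re.compile(r"@.*\.", re.DOTALL)


def check_rows(rows):
    """Return failure messages for rows that break the customer_360 rules."""
    failures = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # customer_id: not_null + unique
        customer_id = row.get("customer_id")
        if customer_id is None:
            failures.append(f"row {i}: customer_id is null")
        elif customer_id in seen_ids:
            failures.append(f"row {i}: customer_id is not unique")
        else:
            seen_ids.add(customer_id)

        # email: not_null + loose shape check
        email = row.get("email")
        if email is None:
            failures.append(f"row {i}: email is null")
        elif not EMAIL_SHAPE.search(email):
            failures.append(f"row {i}: email does not look like an address")

        # lifetime_value: not_null + range 0..10,000,000
        ltv = row.get("lifetime_value")
        if ltv is None:
            failures.append(f"row {i}: lifetime_value is null")
        elif not 0 <= ltv <= 10_000_000:
            failures.append(f"row {i}: lifetime_value out of range")

        # churn_score: range 0.0..1.0, nulls allowed
        churn = row.get("churn_score")
        if churn is not None and not 0.0 <= churn <= 1.0:
            failures.append(f"row {i}: churn_score out of range")
    return failures


rows = [
    {"customer_id": "a1", "email": "ann@example.com", "lifetime_value": 1200.0, "churn_score": 0.12},
    {"customer_id": "a1", "email": "not-an-email", "lifetime_value": -5.0, "churn_score": 1.4},
]

if __name__ == "__main__":
    for message in check_rows(rows):
        print(message)
```

Returning a list of failures rather than raising on the first one lets a pipeline decide whether to block the load, quarantine the bad rows, or only alert.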
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No data catalog | "What data do we have?" → nobody knows | Automated catalog + manual enrichment |
| Classification = one-time project | New data unclassified | Auto-classification rules in pipeline |
| Governance without tooling | Manual, unsustainable | Data catalog tools (DataHub, OpenMetadata) |
| Governance blocks data access | Teams work around governance | Enable access with guardrails, not gates |
| No data quality monitoring | Decisions based on bad data | Automated quality checks in pipelines |
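One way to read "guardrails, not gates" is to encode the controls for each classification level as an automatic check, so access is granted immediately when the conditions hold instead of waiting on manual approval. The sketch below is an illustration only: the `User` type, the `decide_access` function, and the role names are hypothetical, and it covers just the authentication, role, MFA, and audit-logging controls. Encryption, DLP, and retention limits are out of scope here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    authenticated: bool = False
    roles: frozenset = frozenset()
    mfa_verified: bool = False


def decide_access(user, classification, allowed_roles, audit_log):
    """Return True if the user may read a dataset of this classification."""
    if classification == "public":
        return True
    if classification == "internal":
        return user.authenticated
    if classification not in ("confidential", "restricted"):
        raise ValueError(f"unknown classification: {classification}")

    # Confidential: authenticated user holding at least one allowed role.
    allowed = user.authenticated and bool(user.roles & allowed_roles)
    # Restricted: additionally requires a verified MFA session.
    if classification == "restricted":
        allowed = allowed and user.mfa_verified
    # Sensitive access is audit-logged whether it is granted or denied.
    audit_log.append(
        {"user": user.name, "classification": classification, "allowed": allowed}
    )
    return allowed


if __name__ == "__main__":
    audit_log = []
    analyst = User("ana", authenticated=True, roles=frozenset({"marketing-analyst"}))
    print(decide_access(analyst, "confidential", {"marketing-analyst"}, audit_log))
    print(decide_access(analyst, "restricted", {"marketing-analyst"}, audit_log))
    print(audit_log)
```

Logging denials as well as grants gives stewards a signal about where policy is blocking legitimate work, which is often where teams start working around governance.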
Data governance is the immune system for your data platform. Without it, data quality degrades, privacy violations accumulate, and trust erodes until no one believes the dashboards.