
# Data Lineage & Observability

Track data lineage across pipelines. Covers column-level lineage, OpenLineage, data catalogs, impact analysis, root cause analysis, and building lineage into your data stack.

Data lineage answers the question every data team dreads: “This dashboard number looks wrong — where did this data come from?” Without lineage, debugging data issues requires manually tracing SQL queries, checking pipeline logs, and calling five different teams. With lineage, you can click on a metric and see every table, transformation, and source that contributed to it.


## Lineage Types

| Type | What It Tracks | Use Case |
| --- | --- | --- |
| Table-level | Table A → Table B | "What tables feed this dashboard?" |
| Column-level | Column A.x → Column B.y | "Where does the revenue number come from?" |
| Job-level | Pipeline/DAG dependencies | "What breaks if this job fails?" |
| Row-level | Specific record provenance | "Which source records produced this output?" |
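Column-level lineage is, at its core, an edge mapping from output columns to source columns. A minimal sketch of that representation and a traversal over it (the table and column names are illustrative, not from a real catalog):

```python
# Column-level lineage as an adjacency map: output column -> source columns.
# All names below are illustrative.
COLUMN_LINEAGE = {
    "reports.revenue": ["warehouse.orders.amount", "warehouse.orders.currency"],
    "warehouse.orders.amount": ["postgres.orders.amount"],
    "warehouse.orders.currency": ["postgres.orders.currency"],
}

def trace_upstream(column, lineage):
    """Walk the lineage map to find every source column feeding `column`."""
    sources = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for parent in lineage.get(current, []):
            if parent not in sources:
                sources.add(parent)
                stack.append(parent)
    return sources
```

`trace_upstream("reports.revenue", COLUMN_LINEAGE)` answers the "where does the revenue number come from?" question directly: it returns every warehouse and source-system column on the path.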

## Architecture

```text
Data Sources          Transformation         Consumption
┌──────────┐         ┌──────────────┐       ┌──────────┐
│ Postgres  ├────┐    │ dbt          ├──┐    │Dashboard │
│ MySQL     ├────┼───▶│ Spark        │  ├───▶│ML Model  │
│ APIs      ├────┘    │ Airflow      │  │    │Reports   │
└──────────┘         └──────┬───────┘  │    └──────────┘
                            │          │
                    ┌───────▼──────────▼────┐
                    │   Lineage Metadata     │
                    │   (OpenLineage/        │
                    │    DataHub/Atlan)       │
                    └────────────────────────┘
```

## OpenLineage Integration

```python
from uuid import uuid4
from datetime import datetime, timezone

from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset
from openlineage.client.facet import SqlJobFacet

client = OpenLineageClient(url="http://lineage-api:5000")

# Emit lineage events from your pipeline
def emit_lineage(job_name, inputs, outputs, sql_query=None):
    run = Run(runId=str(uuid4()))

    # Attach the SQL as a job facet so the query is searchable in the catalog
    facets = {"sql": SqlJobFacet(query=sql_query)} if sql_query else {}
    job = Job(namespace="commerce-pipeline", name=job_name, facets=facets)

    event = RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer="commerce-pipeline",
        inputs=[
            Dataset(namespace="postgres", name=table)
            for table in inputs
        ],
        outputs=[
            Dataset(namespace="warehouse", name=table)
            for table in outputs
        ],
    )
    client.emit(event)
```
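Under the hood, the client POSTs a JSON `RunEvent` to the lineage backend. A minimal sketch of that raw payload can be handy when debugging what an integration actually emits; the producer URL, job name, and dataset names below are illustrative:

```python
import json
from uuid import uuid4
from datetime import datetime, timezone

def build_lineage_event(job_name, inputs, outputs):
    """Assemble a raw OpenLineage RunEvent payload (the JSON the client sends)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid4())},
        "job": {"namespace": "commerce-pipeline", "name": job_name},
        "inputs": [{"namespace": "postgres", "name": t} for t in inputs],
        "outputs": [{"namespace": "warehouse", "name": t} for t in outputs],
        "producer": "https://example.com/commerce-pipeline",  # illustrative
    }

event = build_lineage_event("load_orders", ["public.orders"], ["analytics.orders"])
print(json.dumps(event, indent=2))
```

Because the wire format is plain JSON, any tool that speaks the OpenLineage spec (DataHub, Marquez, OpenMetadata) can consume these events without caring which client produced them.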

## Impact Analysis

```python
def impact_analysis(changed_table, lineage_graph):
    """Find all downstream tables/dashboards affected by a change."""
    affected = set()
    queue = [changed_table]

    # Breadth-first traversal of the downstream lineage graph
    while queue:
        current = queue.pop(0)
        for item in lineage_graph.get_downstream(current):
            if item not in affected:
                affected.add(item)
                queue.append(item)

    return {
        "changed": changed_table,
        "affected_tables": [a for a in affected if a.type == "table"],
        "affected_dashboards": [a for a in affected if a.type == "dashboard"],
        "affected_ml_models": [a for a in affected if a.type == "model"],
        "total_affected": len(affected),
    }
```
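The function above assumes a `lineage_graph` that exposes `get_downstream` and nodes carrying a `type`. A minimal in-memory stand-in, just enough to exercise the traversal; in a real deployment this would be a query against the lineage backend (DataHub, OpenMetadata, etc.), and the node names here are illustrative:

```python
from collections import namedtuple

# Node types mirror the ones impact_analysis filters on.
Node = namedtuple("Node", ["name", "type"])

class LineageGraph:
    """Tiny in-memory lineage graph; a stand-in for a real lineage backend."""

    def __init__(self):
        self._edges = {}  # upstream name -> list of downstream Node objects

    def add_edge(self, upstream_name, downstream_node):
        self._edges.setdefault(upstream_name, []).append(downstream_node)

    def get_downstream(self, node):
        # Accept either a raw table name or a Node reached on a previous hop
        name = node.name if isinstance(node, Node) else node
        return self._edges.get(name, [])

graph = LineageGraph()
graph.add_edge("orders", Node("revenue_summary", "table"))
graph.add_edge("revenue_summary", Node("exec_dashboard", "dashboard"))
graph.add_edge("revenue_summary", Node("churn_model", "model"))
```

With this graph, `impact_analysis("orders", graph)` reports one affected table, one dashboard, and one ML model — exactly the blast radius you want to see before altering the `orders` schema.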

## Data Catalog Integration

| Catalog | Lineage Support | Best For |
| --- | --- | --- |
| DataHub | OpenLineage, SQL parsing | Open source, enterprise-ready |
| Atlan | Automated lineage discovery | Managed, collaboration-focused |
| OpenMetadata | OpenLineage, dbt | Open source, metadata-first |
| Unity Catalog | Databricks ecosystem | Databricks users |
| AWS Glue Catalog | AWS native services | AWS-centric data stacks |

## Anti-Patterns

| Anti-Pattern | Problem | Fix |
| --- | --- | --- |
| Manual lineage documentation | Stale immediately; nobody maintains it | Automated lineage extraction |
| Table-level only | "Revenue comes from orders" isn't enough to debug | Column-level lineage for debugging |
| No impact analysis | Schema changes break unknown downstream consumers | Query lineage before schema changes |
| Lineage as nice-to-have | Only used during incidents | Integrate into daily workflows (PRs, deployments) |
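Integrating lineage into the PR and deployment workflow can be as simple as a gate that blocks schema changes with live downstream consumers. A sketch, assuming the result dict has the shape returned by `impact_analysis` above (the field names are that function's, the gate itself is hypothetical):

```python
def schema_change_gate(changed_table, affected):
    """Fail a PR/deploy check when a schema change reaches dashboards or ML models.

    `affected` is assumed to match the dict returned by impact_analysis:
    lists under "affected_dashboards" and "affected_ml_models".
    """
    blockers = affected["affected_dashboards"] + affected["affected_ml_models"]
    if blockers:
        return False, f"{changed_table}: {len(blockers)} downstream consumer(s) at risk"
    return True, f"{changed_table}: no dashboards or models affected"

# Illustrative result from an impact-analysis query
ok, message = schema_change_gate("orders", {
    "affected_dashboards": ["exec_dashboard"],
    "affected_ml_models": [],
})
```

Wired into CI, the failing message gives the author the list of consumers to notify instead of letting the breakage surface in a dashboard after deploy.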

## Checklist

- [ ] Lineage tool selected (DataHub, Atlan, OpenMetadata)
- [ ] OpenLineage integration for pipelines (Airflow, Spark, dbt)
- [ ] Column-level lineage available for critical datasets
- [ ] Impact analysis run before schema changes
- [ ] Data catalog: all datasets discoverable with lineage
- [ ] Root cause analysis workflow documented
- [ ] Lineage freshness: metadata updated on every pipeline run

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data lineage consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
