
dbt Data Transformation: Best Practices & Pitfalls

Master dbt for analytics engineering. Covers project structure, model design, testing, incremental models, materializations, and common anti-patterns.

dbt (data build tool) has become the standard for SQL-based data transformation. It brings software engineering practices (version control, testing, documentation) to analytics SQL. Yet many dbt projects become unmaintainable within months because teams skip the fundamentals: they put business logic in staging models, skip testing, write thousand-line SQL files with no intermediate steps, and treat dbt like a SQL scratchpad instead of a data engineering framework.

This guide covers the patterns that keep dbt projects clean at scale, the testing strategies that prevent bad data from reaching dashboards, and the performance optimizations that keep warehouse costs under control.


Project Structure

dbt_project/
├── models/
│   ├── staging/             # 1:1 with source tables, renaming + typing only
│   │   ├── shopify/
│   │   │   ├── stg_shopify__orders.sql
│   │   │   ├── stg_shopify__customers.sql
│   │   │   └── _shopify__sources.yml
│   │   └── stripe/
│   │       ├── stg_stripe__charges.sql
│   │       └── _stripe__sources.yml
│   ├── intermediate/        # Business logic, joins, aggregations
│   │   ├── int_order_items_enriched.sql
│   │   └── int_customer_lifetime.sql
│   ├── marts/               # Final business entities (what analysts query)
│   │   ├── core/
│   │   │   ├── fct_orders.sql
│   │   │   ├── dim_customers.sql
│   │   │   └── _core__models.yml
│   │   └── marketing/
│   │       ├── fct_campaign_performance.sql
│   │       └── _marketing__models.yml
│   └── metrics/             # Metric definitions (MetricFlow)
├── tests/                   # Custom data tests (singular tests)
│   └── assert_order_total_matches_items.sql
├── macros/                  # Reusable SQL functions
│   ├── generate_schema_name.sql
│   └── cents_to_dollars.sql
├── seeds/                   # Static reference data (CSV)
│   └── country_codes.csv
├── snapshots/               # Track historical changes (SCD Type 2)
│   └── snap_customers.sql
└── dbt_project.yml

Naming Conventions

| Layer | Prefix | Example | Why |
|---|---|---|---|
| Staging | `stg_[source]__[entity]` | `stg_shopify__orders` | Source tracking, clear lineage |
| Intermediate | `int_[entity]_[verb]` | `int_orders_enriched` | Describes transformation intent |
| Fact (mart) | `fct_[entity]` | `fct_orders` | Denotes measurable business events |
| Dimension (mart) | `dim_[entity]` | `dim_customers` | Denotes descriptive attributes |
| Snapshot | `snap_[entity]` | `snap_customers` | SCD Type 2 historical tracking |
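
The `snap_` layer is only listed in the project tree above, so here is a minimal sketch of what `snap_customers.sql` might look like. The `updated_at` column name and `snapshots` target schema are assumptions; adjust to your source:

```sql
-- snapshots/snap_customers.sql (illustrative; column names are assumptions)
{% snapshot snap_customers %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- dbt compares updated_at on each run and writes SCD Type 2 rows
-- (dbt_valid_from / dbt_valid_to) for every change it detects
SELECT * FROM {{ source('shopify', 'customers') }}

{% endsnapshot %}
```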

The Staging Layer

Every source table gets exactly one staging model. No business logic here — just renaming, casting, and cleaning raw data into a consistent format.

-- models/staging/shopify/stg_shopify__orders.sql
WITH source AS (
    SELECT * FROM {{ source('shopify', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(created_at AS TIMESTAMP) AS ordered_at,
        CAST(total_price AS DECIMAL(10,2)) AS order_total,
        financial_status AS payment_status,
        fulfillment_status,
        LOWER(TRIM(email)) AS customer_email,
        cancelled_at IS NOT NULL AS is_cancelled,
        _etl_loaded_at AS _loaded_at  -- metadata column for debugging
    FROM source
    WHERE id IS NOT NULL  -- Filter truly invalid records at the source
)

SELECT * FROM renamed

# models/staging/shopify/_shopify__sources.yml
version: 2

sources:
  - name: shopify
    schema: raw_shopify
    description: "Raw Shopify data loaded by Fivetran"
    tables:
      - name: orders
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: { count: 12, period: hour }
          error_after: { count: 24, period: hour }
        columns:
          - name: id
            tests: [unique, not_null]

# models/staging/shopify/_shopify__models.yml
version: 2

models:
  - name: stg_shopify__orders
    description: "Cleaned and renamed Shopify orders. One row per order."
    columns:
      - name: order_id
        description: "Primary key from Shopify"
        tests: [unique, not_null]
      - name: order_total
        tests: [not_null]
      - name: customer_email
        tests: [not_null]

Intermediate Models

Business logic lives here. Join staging models, apply business rules, compute derived fields. This is where most of the analytical value is created.

-- models/intermediate/int_order_items_enriched.sql
{{ config(materialized='view') }}  -- Views for intermediate (save storage)

WITH order_items AS (
    SELECT * FROM {{ ref('stg_shopify__order_items') }}
),

products AS (
    SELECT * FROM {{ ref('stg_shopify__products') }}
),

enriched AS (
    SELECT
        oi.order_id,
        oi.product_id,
        p.product_name,
        p.category,
        oi.quantity,
        oi.unit_price,
        oi.quantity * oi.unit_price AS line_total,
        CASE
            WHEN p.category IN ('electronics', 'software') THEN 'digital'
            WHEN p.category IN ('food', 'beverage') THEN 'consumable'
            ELSE 'physical'
        END AS product_type,
        {{ cents_to_dollars('oi.unit_price') }} AS unit_price_dollars  -- Use macros!
    FROM order_items oi
    LEFT JOIN products p ON oi.product_id = p.product_id
)

SELECT * FROM enriched
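
The `cents_to_dollars` macro is referenced here and listed under `macros/` but never shown. A minimal sketch of a plausible implementation (the exact rounding behavior is an assumption):

```sql
-- macros/cents_to_dollars.sql (illustrative implementation)
{% macro cents_to_dollars(column_name, precision=2) %}
    -- Divide by 100.0 to force decimal division, then round
    ROUND({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

Because macros compile inline, `{{ cents_to_dollars('oi.unit_price') }}` expands to plain SQL in the rendered model, so the conversion logic lives in exactly one place.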

Mart Models

Analyst-facing tables. Wide, denormalized, well-documented. These are the only models that analysts and BI tools should query directly.

-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='sync_all_columns'
) }}

WITH orders AS (
    SELECT * FROM {{ ref('stg_shopify__orders') }}
    {% if is_incremental() %}
    WHERE ordered_at > (SELECT MAX(ordered_at) FROM {{ this }})
    {% endif %}
),

items AS (
    SELECT
        order_id,
        COUNT(*) AS item_count,
        SUM(line_total) AS items_total,
        COUNT(DISTINCT category) AS unique_categories
    FROM {{ ref('int_order_items_enriched') }}
    GROUP BY order_id
),

customer_orders AS (
    SELECT
        customer_email,
        MIN(ordered_at) AS first_order_at,
        COUNT(*) AS total_orders
    FROM {{ ref('stg_shopify__orders') }}
    GROUP BY customer_email
),

final AS (
    SELECT
        o.order_id,
        o.customer_email,
        o.ordered_at,
        o.order_total,
        o.payment_status,
        o.is_cancelled,
        i.item_count,
        i.unique_categories,
        DATE_TRUNC('month', o.ordered_at) AS order_month,
        ROW_NUMBER() OVER (
            PARTITION BY o.customer_email ORDER BY o.ordered_at
        ) AS customer_order_number,  -- NOTE: on incremental runs this only numbers rows in the current batch
        -- customer_orders scans full history, so this stays correct on incremental runs
        CASE
            WHEN o.ordered_at = co.first_order_at THEN 'new'
            ELSE 'returning'
        END AS customer_type
    FROM orders o
    LEFT JOIN items i ON o.order_id = i.order_id
    LEFT JOIN customer_orders co ON o.customer_email = co.customer_email
)

SELECT * FROM final
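
One caveat with the plain `MAX(ordered_at)` filter above: rows that arrive late or get updated after the watermark are silently skipped. A common mitigation (a sketch; the 3-day window is an arbitrary assumption, and `DATEADD` is Snowflake syntax) is to reprocess a trailing window and let the merge on `unique_key` deduplicate:

```sql
-- Reprocess a trailing window so late-arriving or updated rows are picked up.
-- The incremental merge on unique_key='order_id' deduplicates reprocessed rows.
{% if is_incremental() %}
WHERE ordered_at > (
    SELECT DATEADD('day', -3, MAX(ordered_at)) FROM {{ this }}
)
{% endif %}
```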

Materialization Strategy

| Materialization | When to Use | Storage Cost | Query Speed | Rebuild Time |
|---|---|---|---|---|
| View | Staging + intermediate models (< 10M rows) | Zero | Slower (computes on query) | Instant |
| Table | Small-medium mart models (< 100M rows) | Medium | Fast (pre-computed) | Full rebuild |
| Incremental | Large mart models (> 100M rows, append-heavy) | Medium | Fast | Partial rebuild |
| Ephemeral | CTEs used by multiple models (code reuse) | Zero | N/A (compiled inline) | N/A |
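
Folder-level defaults keep this strategy enforced without a config block in every model. A minimal `dbt_project.yml` sketch for the layout above (the project name key is an assumption; it must match your project's `name:`):

```yaml
# dbt_project.yml (excerpt)
models:
  dbt_project:                # must match the `name:` in dbt_project.yml
    staging:
      +materialized: view
    intermediate:
      +materialized: view
    marts:
      +materialized: table    # large marts override to incremental in-model
```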

Incremental Model Strategies

| Strategy | Use When | Snowflake | BigQuery | Redshift |
|---|---|---|---|---|
| `append` | Events, logs (never updated) | ✓ | ✗ | ✓ (default) |
| `merge` | Records can be updated (orders) | ✓ (default) | ✓ (default) | ✓ |
| `delete+insert` | Large batch updates | ✓ | ✗ | ✓ |
| `insert_overwrite` | Partition-level replacement | ✗ | ✓ | ✗ |

Adapter support varies by adapter version; check your adapter's documentation before relying on a strategy.
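
Choosing a strategy is a one-line config change. A sketch for an append-only events model (model and column names here are illustrative, not from this project):

```sql
-- models/marts/core/fct_events.sql (illustrative)
{{ config(
    materialized='incremental',
    incremental_strategy='append'   -- rows are only ever added, never updated
) }}

SELECT * FROM {{ ref('stg_app__events') }}
{% if is_incremental() %}
WHERE event_at > (SELECT MAX(event_at) FROM {{ this }})
{% endif %}
```

With `append` there is no merge scan against the existing table, which makes it the cheapest strategy for immutable event streams.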

Testing Strategy

Generic Tests (YAML)

# models/marts/core/_core__models.yml
version: 2

models:
  - name: fct_orders
    description: "One row per order with enriched item metrics"
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: item_count
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 1
      - name: customer_type
        tests:
          - accepted_values:
              values: ['new', 'returning']
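
The `dbt_utils.accepted_range` tests above require the dbt_utils package to be declared and installed. The version pin below is an example; any recent release works:

```yaml
# packages.yml (at the project root)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```

Run `dbt deps` to install packages before `dbt build`, both locally and in CI.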

Singular Tests (Custom SQL)

-- tests/assert_order_total_matches_items.sql
-- This test FAILS (returns rows) if order totals don't match line items
SELECT order_id, order_total, items_total
FROM {{ ref('fct_orders') }}
WHERE ABS(order_total - items_total) > 0.01
  AND NOT is_cancelled

Source Freshness

# Run freshness checks in CI to catch stale data before it reaches dashboards
dbt source freshness --select source:shopify

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Business logic in staging | Staging should be pure rename/cast; logic here is invisible | Move all business logic to the intermediate layer |
| Skipping intermediate layer | Mart models become 500-line monsters nobody can debug | Break complex transforms into composable intermediate steps |
| No tests on primary keys | Duplicate rows go undetected; dashboards show wrong numbers | Always test unique + not_null on every model's primary key |
| SELECT * in models | Schema changes in upstream sources silently break downstream | Explicitly list every column |
| Materializing everything as table | Slow full refreshes, wasted warehouse storage and compute | Use view for staging + intermediate, incremental for large marts |
| No source freshness checks | Stale data in dashboards without anyone knowing | Configure loaded_at_field and freshness thresholds |
| Hardcoded values | Magic numbers scattered across models | Use seeds for reference data, vars for configuration |
| No documentation | Nobody knows what columns mean 6 months later | Document every mart column in YAML |
| Circular references | dbt build fails, impossible to debug | Use DAG visualization (`dbt docs generate && dbt docs serve`) |

dbt in CI/CD

# .github/workflows/dbt-ci.yml
name: dbt CI
on:
  pull_request:
    paths: ['models/**', 'tests/**', 'macros/**']

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dbt
        run: pip install dbt-snowflake==1.7.0

      - name: dbt build (modified models only)
        # Assumes the production manifest.json has been fetched into ./prod-manifest/
        # so state:modified+ can diff against production
        run: dbt build --select state:modified+ --defer --state ./prod-manifest/
        env:
          DBT_PROFILES_DIR: ./ci_profiles

      - name: Source freshness
        run: dbt source freshness
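
State comparison needs the production `manifest.json` available locally; how it gets there depends on your setup. One common sketch, fetched before the build step (the S3 bucket path is a placeholder, not part of this project):

```yaml
      # Add before the build step; the bucket path is a placeholder
      - name: Fetch production manifest
        run: |
          mkdir -p prod-manifest
          aws s3 cp s3://my-dbt-artifacts/manifest.json prod-manifest/manifest.json
```

Teams on dbt Cloud can instead pull the manifest from the latest production run via the Administrative API.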

Checklist

  • Three-layer architecture enforced (staging → intermediate → marts)
  • Every source has a staging model with rename/cast only
  • Naming conventions followed (stg_, int_, fct_, dim_)
  • Primary key tests (unique + not_null) on every model
  • Source freshness checks configured for all sources
  • Incremental models for tables > 1M rows
  • Documentation (description) for all mart-layer columns
  • CI runs dbt build on every PR (modified models + downstream)
  • Macro library for repeated SQL patterns (no copy-paste SQL)
  • DAG reviewed monthly for circular dependencies and fan-out
  • Materialization strategy reviewed (views for small, incremental for large)
  • Seeds used for static reference data (not hardcoded values)

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For data engineering consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
