dbt Data Transformation: Best Practices & Pitfalls
Master dbt for analytics engineering. Covers project structure, model design, testing, incremental models, materializations, and common anti-patterns.
dbt (data build tool) has become the standard for SQL-based data transformation. It brings software engineering practices — version control, testing, documentation — to analytics SQL. In practice, though, many dbt projects become unmaintainable within months because teams skip the fundamentals: they put business logic in staging models, skip testing, write thousand-line SQL files with no intermediate steps, and treat dbt like a SQL scratchpad instead of a data engineering framework.
This guide covers the patterns that keep dbt projects clean at scale, the testing strategies that prevent bad data from reaching dashboards, and the performance optimizations that keep warehouse costs under control.
Project Structure
dbt_project/
├── models/
│ ├── staging/ # 1:1 with source tables, renaming + typing only
│ │ ├── shopify/
│ │ │ ├── stg_shopify__orders.sql
│ │ │ ├── stg_shopify__customers.sql
│ │ │ └── _shopify__sources.yml
│ │ └── stripe/
│ │ ├── stg_stripe__charges.sql
│ │ └── _stripe__sources.yml
│ ├── intermediate/ # Business logic, joins, aggregations
│ │ ├── int_order_items_enriched.sql
│ │ └── int_customer_lifetime.sql
│ ├── marts/ # Final business entities (what analysts query)
│ │ ├── core/
│ │ │ ├── fct_orders.sql
│ │ │ ├── dim_customers.sql
│ │ │ └── _core__models.yml
│ │ └── marketing/
│ │ ├── fct_campaign_performance.sql
│ │ └── _marketing__models.yml
│ └── metrics/ # Metric definitions (MetricFlow)
├── tests/ # Custom data tests (singular tests)
│ └── assert_order_total_matches_items.sql
├── macros/ # Reusable SQL functions
│ ├── generate_schema_name.sql
│ └── cents_to_dollars.sql
├── seeds/ # Static reference data (CSV)
│ └── country_codes.csv
├── snapshots/ # Track historical changes (SCD Type 2)
│ └── snap_customers.sql
└── dbt_project.yml
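These layer boundaries can be backed by folder-level defaults in dbt_project.yml, so each layer gets a sensible materialization without per-model config. A minimal sketch — the project and profile names here are placeholders:

```yaml
# dbt_project.yml (sketch — project/profile names are placeholders)
name: analytics
profile: analytics

models:
  analytics:
    staging:
      +materialized: view    # staging is cheap renaming, keep as views
    intermediate:
      +materialized: view
    marts:
      +materialized: table   # analyst-facing models are pre-computed
```

Individual models can still override these defaults with an inline `{{ config(...) }}` block, as the mart examples below do.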
Naming Conventions
| Layer | Prefix | Example | Why |
|---|---|---|---|
| Staging | stg_[source]__[entity] | stg_shopify__orders | Source tracking, clear lineage |
| Intermediate | int_[entity]_[verb] | int_orders_enriched | Describes transformation intent |
| Fact (mart) | fct_[entity] | fct_orders | Denotes measurable business events |
| Dimension (mart) | dim_[entity] | dim_customers | Denotes descriptive attributes |
| Snapshot | snap_[entity] | snap_customers | SCD Type 2 historical tracking |
The Staging Layer
Every source table gets exactly one staging model. No business logic here — just renaming, casting, and cleaning raw data into a consistent format.
-- models/staging/shopify/stg_shopify__orders.sql
WITH source AS (
    SELECT * FROM {{ source('shopify', 'orders') }}
),
renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(created_at AS TIMESTAMP) AS ordered_at,
        CAST(total_price AS DECIMAL(10,2)) AS order_total,
        financial_status AS payment_status,
        fulfillment_status,
        LOWER(TRIM(email)) AS customer_email,
        cancelled_at IS NOT NULL AS is_cancelled,
        _etl_loaded_at AS _loaded_at  -- metadata column for debugging
    FROM source
    WHERE id IS NOT NULL  -- Filter truly invalid records at the source
)
SELECT * FROM renamed
# models/staging/shopify/_shopify__sources.yml
sources:
  - name: shopify
    schema: raw_shopify
    description: "Raw Shopify data loaded by Fivetran"
    tables:
      - name: orders
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: { count: 12, period: hour }
          error_after: { count: 24, period: hour }
        columns:
          - name: id
            tests: [unique, not_null]

models:
  - name: stg_shopify__orders
    description: "Cleaned and renamed Shopify orders. One row per order."
    columns:
      - name: order_id
        description: "Primary key from Shopify"
        tests: [unique, not_null]
      - name: order_total
        tests: [not_null]
      - name: customer_email
        tests: [not_null]
Intermediate Models
Business logic lives here. Join staging models, apply business rules, compute derived fields. This is where most of the analytical value is created.
-- models/intermediate/int_order_items_enriched.sql
{{ config(materialized='view') }}  -- Views for intermediate (save storage)

WITH order_items AS (
    SELECT * FROM {{ ref('stg_shopify__order_items') }}
),
products AS (
    SELECT * FROM {{ ref('stg_shopify__products') }}
),
enriched AS (
    SELECT
        oi.order_id,
        oi.product_id,
        p.product_name,
        p.category,
        oi.quantity,
        oi.unit_price,
        oi.quantity * oi.unit_price AS line_total,
        CASE
            WHEN p.category IN ('electronics', 'software') THEN 'digital'
            WHEN p.category IN ('food', 'beverage') THEN 'consumable'
            ELSE 'physical'
        END AS product_type,
        {{ cents_to_dollars('oi.unit_price') }} AS unit_price_dollars  -- Use macros!
    FROM order_items oi
    LEFT JOIN products p ON oi.product_id = p.product_id
)
SELECT * FROM enriched
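The cents_to_dollars macro used above lives in macros/ (per the project tree) but its body isn't shown in this guide. A plausible minimal sketch of what such a macro might look like:

```sql
-- macros/cents_to_dollars.sql (illustrative sketch, not the actual macro body)
{% macro cents_to_dollars(column_name, precision=2) %}
    ROUND({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

Centralizing the conversion in one macro means a change to rounding or precision happens in exactly one place instead of in every model that touches a cents column.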
Mart Models
Analyst-facing tables. Wide, denormalized, well-documented. These are the only models that analysts and BI tools should query directly.
-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='sync_all_columns'
) }}

WITH orders AS (
    -- Window functions run over the FULL order history so that
    -- customer_order_number stays correct on incremental runs;
    -- the incremental filter is applied later, in `final`.
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_email ORDER BY ordered_at
        ) AS customer_order_number
    FROM {{ ref('stg_shopify__orders') }}
),
items AS (
    SELECT
        order_id,
        COUNT(*) AS item_count,
        SUM(line_total) AS items_total,
        COUNT(DISTINCT category) AS unique_categories
    FROM {{ ref('int_order_items_enriched') }}
    GROUP BY order_id
),
final AS (
    SELECT
        o.order_id,
        o.customer_email,
        o.ordered_at,
        o.order_total,
        o.payment_status,
        o.is_cancelled,
        i.item_count,
        i.items_total,
        i.unique_categories,
        DATE_TRUNC('month', o.ordered_at) AS order_month,
        o.customer_order_number,
        CASE
            WHEN o.customer_order_number = 1 THEN 'new'
            ELSE 'returning'
        END AS customer_type
    FROM orders o
    LEFT JOIN items i ON o.order_id = i.order_id
    {% if is_incremental() %}
    WHERE o.ordered_at > (SELECT MAX(ordered_at) FROM {{ this }})
    {% endif %}
)
SELECT * FROM final
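A note on late-arriving data: a strict `>` MAX(ordered_at) filter skips any row whose timestamp falls at or before the current high-water mark, so late-arriving or back-dated orders are silently dropped. One common variant (shown here in Snowflake syntax, matching the dbt-snowflake adapter used in the CI example later) reprocesses a short lookback window and relies on the merge on unique_key to deduplicate:

```sql
-- Illustrative variant of the is_incremental() filter: reprocess a
-- 3-day window so late-arriving or updated orders are picked up.
-- The merge on unique_key deduplicates the overlapping rows.
{% if is_incremental() %}
WHERE ordered_at > (SELECT DATEADD('day', -3, MAX(ordered_at)) FROM {{ this }})
{% endif %}
```

The window size is a trade-off: wider windows catch more stragglers but scan and merge more data on every run.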
Materialization Strategy
| Materialization | When to Use | Storage Cost | Query Speed | Rebuild Time |
|---|---|---|---|---|
| View | Staging + intermediate models (< 10M rows) | Zero | Slower (computes on query) | Instant |
| Table | Small-medium mart models (< 100M rows) | Medium | Fast (pre-computed) | Full rebuild |
| Incremental | Large mart models (> 100M rows, append-heavy) | Medium | Fast | Partial rebuild |
| Ephemeral | CTEs used by multiple models (code reuse) | Zero | N/A (compiled inline) | N/A |
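For illustration, an ephemeral model needs only a one-line config; dbt compiles it into each downstream model as a CTE rather than creating any warehouse object. The model name here is hypothetical:

```sql
-- models/intermediate/int_active_orders.sql (hypothetical model)
{{ config(materialized='ephemeral') }}

-- Compiled inline as a CTE into every model that ref()s it;
-- nothing is built in the warehouse for this model.
SELECT * FROM {{ ref('stg_shopify__orders') }}
WHERE NOT is_cancelled
```

One caveat: because ephemeral models never exist as warehouse objects, you cannot `SELECT` from them directly when debugging, so reserve them for small, genuinely shared fragments.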
Incremental Model Strategies
| Strategy | Use When | Snowflake | BigQuery | Redshift |
|---|---|---|---|---|
| append | Events, logs (never updated) | ✅ | ✅ | ✅ |
| merge | Records can be updated (orders) | ✅ | ✅ | ✅ |
| delete+insert | Large batch updates | ✅ | ❌ | ✅ |
| insert_overwrite | Partition-level replacement | ❌ | ✅ | ❌ |
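The strategy is set per model in the config block. As a sketch, an updatable table like orders would typically pair merge with a unique_key (the column name here follows the examples above):

```sql
-- Sketch: merge strategy for a table whose rows can be updated.
-- On each incremental run, rows matching unique_key are updated
-- in place and new rows are inserted.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}
```

If incremental_strategy is omitted, each adapter falls back to its own default, so it is worth setting explicitly when behavior matters.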
Testing Strategy
Generic Tests (YAML)
# models/marts/core/_core__models.yml
models:
  - name: fct_orders
    description: "One row per order with enriched item metrics"
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: item_count
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 1
      - name: customer_type
        tests:
          - accepted_values:
              values: ['new', 'returning']
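When the built-in and dbt_utils tests don't cover a rule, you can define your own generic test as a macro and reference it from YAML like any other test. A sketch — the test name and file path are illustrative:

```sql
-- tests/generic/positive_value.sql (illustrative custom generic test)
-- Fails for any row where the tested column is zero or negative.
{% test positive_value(model, column_name) %}
    SELECT {{ column_name }}
    FROM {{ model }}
    WHERE {{ column_name }} <= 0
{% endtest %}
```

It would then appear under a column's tests list as `- positive_value`, exactly like `not_null`.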
Singular Tests (Custom SQL)
-- tests/assert_order_total_matches_items.sql
-- This test FAILS (returns rows) if order totals don't match line items
SELECT order_id, order_total, items_total
FROM {{ ref('fct_orders') }}
WHERE ABS(order_total - items_total) > 0.01
AND NOT is_cancelled
Source Freshness
# Run freshness checks in CI to catch stale data before it reaches dashboards
dbt source freshness --select source:shopify
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Business logic in staging | Staging should be pure rename/cast — logic here is invisible | Move all business logic to intermediate layer |
| Skipping intermediate layer | Mart models become 500-line monsters nobody can debug | Break complex transforms into composable intermediate steps |
| No tests on primary keys | Duplicate rows go undetected, dashboards show wrong numbers | Always test unique + not_null on every model’s primary key |
| SELECT * in models | Schema changes in upstream sources silently break downstream | Explicitly list every column |
| Materializing everything as table | Slow full refreshes, wasted warehouse storage and compute | Use view for staging+intermediate, incremental for large marts |
| No source freshness checks | Stale data in dashboards without anyone knowing | Configure loaded_at_field and freshness thresholds |
| Hardcoded values | Magic numbers scattered across models | Use seeds for reference data, vars for configuration |
| No documentation | Nobody knows what columns mean 6 months later | Document every mart column in YAML |
| Circular references | dbt build fails, impossible to debug | Use DAG visualization (dbt docs generate && dbt docs serve) |
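As an example of the hardcoded-values fix, a magic number can move into vars in dbt_project.yml (the variable name here is illustrative):

```yaml
# dbt_project.yml (sketch — variable name is illustrative)
vars:
  high_value_order_threshold: 500
```

Models then read it with `{{ var('high_value_order_threshold') }}` instead of repeating the literal, so changing the threshold touches one line rather than every model that uses it.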
dbt in CI/CD
# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    paths: ['models/**', 'tests/**', 'macros/**']

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dbt
        run: pip install dbt-snowflake==1.7.0
      - name: dbt build (modified models only)
        run: dbt build --select state:modified+ --defer --state ./prod-manifest/
        env:
          DBT_PROFILES_DIR: ./ci_profiles
      - name: Source freshness
        run: dbt source freshness
Checklist
- Three-layer architecture enforced (staging → intermediate → marts)
- Every source has a staging model with rename/cast only
- Naming conventions followed (stg_, int_, fct_, dim_)
- Primary key tests (unique + not_null) on every model
- Source freshness checks configured for all sources
- Incremental materialization for large, append-heavy mart tables
- Documentation (description) for all mart-layer columns
- CI runs dbt build on every PR (modified models + downstream)
- Macro library for repeated SQL patterns (no copy-paste SQL)
- DAG reviewed monthly for circular dependencies and fan-out
- Materialization strategy reviewed (views for small, incremental for large)
- Seeds used for static reference data (not hardcoded values)
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For data engineering consulting, visit garnetgrid.com. :::