dbt Data Transformation: Best Practices & Pitfalls
Master dbt for analytics engineering. Covers project structure, model design, testing, incremental models, materializations, and common anti-patterns.
dbt (data build tool) has become the standard for SQL-based data transformation. It brings software engineering practices — version control, testing, documentation — to analytics SQL. In practice, though, many dbt projects become unmaintainable within months because teams skip the fundamentals: they put business logic in staging models, skip testing, write thousand-line SQL files with no intermediate steps, and treat dbt like a SQL scratchpad instead of a data engineering framework.
This guide covers the patterns that keep dbt projects clean at scale, the testing strategies that prevent bad data from reaching dashboards, and the performance optimizations that keep warehouse costs under control.
Project Structure
dbt_project/
├── models/
│ ├── staging/ # 1:1 with source tables, renaming + typing only
│ │ ├── shopify/
│ │ │ ├── stg_shopify__orders.sql
│ │ │ ├── stg_shopify__customers.sql
│ │ │ └── _shopify__sources.yml
│ │ └── stripe/
│ │ ├── stg_stripe__charges.sql
│ │ └── _stripe__sources.yml
│ ├── intermediate/ # Business logic, joins, aggregations
│ │ ├── int_order_items_enriched.sql
│ │ └── int_customer_lifetime.sql
│ ├── marts/ # Final business entities (what analysts query)
│ │ ├── core/
│ │ │ ├── fct_orders.sql
│ │ │ ├── dim_customers.sql
│ │ │ └── _core__models.yml
│ │ └── marketing/
│ │ ├── fct_campaign_performance.sql
│ │ └── _marketing__models.yml
│ └── metrics/ # Metric definitions (MetricFlow)
├── tests/ # Custom data tests (singular tests)
│ └── assert_order_total_matches_items.sql
├── macros/ # Reusable SQL functions
│ ├── generate_schema_name.sql
│ └── cents_to_dollars.sql
├── seeds/ # Static reference data (CSV)
│ └── country_codes.csv
├── snapshots/ # Track historical changes (SCD Type 2)
│ └── snap_customers.sql
└── dbt_project.yml
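These layer boundaries can be backed by folder-level defaults in dbt_project.yml, so each layer gets a sensible materialization without per-model config. A minimal sketch — the project and profile names here are placeholders:

```yaml
# dbt_project.yml (sketch — project/profile names are placeholders)
name: analytics
profile: analytics

models:
  analytics:
    staging:
      +materialized: view    # staging is cheap renaming, keep as views
    intermediate:
      +materialized: view
    marts:
      +materialized: table   # analyst-facing models are pre-computed
```

Individual models can still override these defaults with an inline `{{ config(...) }}` block, as the mart examples below do.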
Naming Conventions
| Layer | Prefix | Example | Why |
|---|---|---|---|
| Staging | stg_[source]__[entity] | stg_shopify__orders | Source tracking, clear lineage |
| Intermediate | int_[entity]_[verb] | int_orders_enriched | Describes transformation intent |
| Fact (mart) | fct_[entity] | fct_orders | Denotes measurable business events |
| Dimension (mart) | dim_[entity] | dim_customers | Denotes descriptive attributes |
| Snapshot | snap_[entity] | snap_customers | SCD Type 2 historical tracking |
The Staging Layer
Every source table gets exactly one staging model. No business logic here — just renaming, casting, and cleaning raw data into a consistent format.
-- models/staging/shopify/stg_shopify__orders.sql
WITH source AS (
    SELECT * FROM {{ source('shopify', 'orders') }}
),
renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(created_at AS TIMESTAMP) AS ordered_at,
        CAST(total_price AS DECIMAL(10,2)) AS order_total,
        financial_status AS payment_status,
        fulfillment_status,
        LOWER(TRIM(email)) AS customer_email,
        cancelled_at IS NOT NULL AS is_cancelled,
        _etl_loaded_at AS _loaded_at  -- metadata column for debugging
    FROM source
    WHERE id IS NOT NULL  -- Filter truly invalid records at the source
)
SELECT * FROM renamed
# models/staging/shopify/_shopify__sources.yml
sources:
  - name: shopify
    schema: raw_shopify
    description: "Raw Shopify data loaded by Fivetran"
    tables:
      - name: orders
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: { count: 12, period: hour }
          error_after: { count: 24, period: hour }
        columns:
          - name: id
            tests: [unique, not_null]

models:
  - name: stg_shopify__orders
    description: "Cleaned and renamed Shopify orders. One row per order."
    columns:
      - name: order_id
        description: "Primary key from Shopify"
        tests: [unique, not_null]
      - name: order_total
        tests: [not_null]
      - name: customer_email
        tests: [not_null]
Intermediate Models
Business logic lives here. Join staging models, apply business rules, compute derived fields. This is where most of the analytical value is created.
-- models/intermediate/int_order_items_enriched.sql
{{ config(materialized='view') }}  -- Views for intermediate (save storage)

WITH order_items AS (
    SELECT * FROM {{ ref('stg_shopify__order_items') }}
),
products AS (
    SELECT * FROM {{ ref('stg_shopify__products') }}
),
enriched AS (
    SELECT
        oi.order_id,
        oi.product_id,
        p.product_name,
        p.category,
        oi.quantity,
        oi.unit_price,
        oi.quantity * oi.unit_price AS line_total,
        CASE
            WHEN p.category IN ('electronics', 'software') THEN 'digital'
            WHEN p.category IN ('food', 'beverage') THEN 'consumable'
            ELSE 'physical'
        END AS product_type,
        {{ cents_to_dollars('oi.unit_price') }} AS unit_price_dollars  -- Use macros!
    FROM order_items oi
    LEFT JOIN products p ON oi.product_id = p.product_id
)
SELECT * FROM enriched
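The cents_to_dollars macro used above lives in macros/ (per the project tree) but its body isn't shown in this guide. A plausible minimal sketch of what such a macro might look like:

```sql
-- macros/cents_to_dollars.sql (illustrative sketch, not the actual macro body)
{% macro cents_to_dollars(column_name, precision=2) %}
    ROUND({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

Centralizing the conversion in one macro means a change to rounding or precision happens in exactly one place instead of in every model that touches a cents column.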
Mart Models
Analyst-facing tables. Wide, denormalized, well-documented. These are the only models that analysts and BI tools should query directly.
-- models/marts/core/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    on_schema_change='sync_all_columns'
) }}

WITH orders AS (
    -- Window functions run over the FULL order history so that
    -- customer_order_number stays correct on incremental runs;
    -- the incremental filter is applied later, in `final`.
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_email ORDER BY ordered_at
        ) AS customer_order_number
    FROM {{ ref('stg_shopify__orders') }}
),
items AS (
    SELECT
        order_id,
        COUNT(*) AS item_count,
        SUM(line_total) AS items_total,
        COUNT(DISTINCT category) AS unique_categories
    FROM {{ ref('int_order_items_enriched') }}
    GROUP BY order_id
),
final AS (
    SELECT
        o.order_id,
        o.customer_email,
        o.ordered_at,
        o.order_total,
        o.payment_status,
        o.is_cancelled,
        i.item_count,
        i.items_total,
        i.unique_categories,
        DATE_TRUNC('month', o.ordered_at) AS order_month,
        o.customer_order_number,
        CASE
            WHEN o.customer_order_number = 1 THEN 'new'
            ELSE 'returning'
        END AS customer_type
    FROM orders o
    LEFT JOIN items i ON o.order_id = i.order_id
    {% if is_incremental() %}
    WHERE o.ordered_at > (SELECT MAX(ordered_at) FROM {{ this }})
    {% endif %}
)
SELECT * FROM final
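A note on late-arriving data: a strict `>` MAX(ordered_at) filter skips any row whose timestamp falls at or before the current high-water mark, so late-arriving or back-dated orders are silently dropped. One common variant (shown here in Snowflake syntax, matching the dbt-snowflake adapter used in the CI example later) reprocesses a short lookback window and relies on the merge on unique_key to deduplicate:

```sql
-- Illustrative variant of the is_incremental() filter: reprocess a
-- 3-day window so late-arriving or updated orders are picked up.
-- The merge on unique_key deduplicates the overlapping rows.
{% if is_incremental() %}
WHERE ordered_at > (SELECT DATEADD('day', -3, MAX(ordered_at)) FROM {{ this }})
{% endif %}
```

The window size is a trade-off: wider windows catch more stragglers but scan and merge more data on every run.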
Materialization Strategy
| Materialization | When to Use | Storage Cost | Query Speed | Rebuild Time |
|---|---|---|---|---|
| View | Staging + intermediate models (< 10M rows) | Zero | Slower (computes on query) | Instant |
| Table | Small-medium mart models (< 100M rows) | Medium | Fast (pre-computed) | Full rebuild |
| Incremental | Large mart models (> 100M rows, append-heavy) | Medium | Fast | Partial rebuild |
| Ephemeral | CTEs used by multiple models (code reuse) | Zero | N/A (compiled inline) | N/A |
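For illustration, an ephemeral model needs only a one-line config; dbt compiles it into each downstream model as a CTE rather than creating any warehouse object. The model name here is hypothetical:

```sql
-- models/intermediate/int_active_orders.sql (hypothetical model)
{{ config(materialized='ephemeral') }}

-- Compiled inline as a CTE into every model that ref()s it;
-- nothing is built in the warehouse for this model.
SELECT * FROM {{ ref('stg_shopify__orders') }}
WHERE NOT is_cancelled
```

One caveat: because ephemeral models never exist as warehouse objects, you cannot `SELECT` from them directly when debugging, so reserve them for small, genuinely shared fragments.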
Incremental Model Strategies
| Strategy | Use When | Snowflake | BigQuery | Redshift |
|---|---|---|---|---|
| append | Events, logs (never updated) | ✅ | ✅ | ✅ |
| merge | Records can be updated (orders) | ✅ | ✅ | ✅ |
| delete+insert | Large batch updates | ✅ | ❌ | ✅ |
| insert_overwrite | Partition-level replacement | ❌ | ✅ | ❌ |
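The strategy is set per model in the config block. As a sketch, an updatable table like orders would typically pair merge with a unique_key (the column name here follows the examples above):

```sql
-- Sketch: merge strategy for a table whose rows can be updated.
-- On each incremental run, rows matching unique_key are updated
-- in place and new rows are inserted.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}
```

If incremental_strategy is omitted, each adapter falls back to its own default, so it is worth setting explicitly when behavior matters.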
Testing Strategy
Generic Tests (YAML)
# models/marts/core/_core__models.yml
models:
  - name: fct_orders
    description: "One row per order with enriched item metrics"
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: order_total
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 100000
      - name: item_count
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 1
      - name: customer_type
        tests:
          - accepted_values:
              values: ['new', 'returning']
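When the built-in and dbt_utils tests don't cover a rule, you can define your own generic test as a macro and reference it from YAML like any other test. A sketch — the test name and file path are illustrative:

```sql
-- tests/generic/positive_value.sql (illustrative custom generic test)
-- Fails for any row where the tested column is zero or negative.
{% test positive_value(model, column_name) %}
    SELECT {{ column_name }}
    FROM {{ model }}
    WHERE {{ column_name }} <= 0
{% endtest %}
```

It would then appear under a column's tests list as `- positive_value`, exactly like `not_null`.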
Singular Tests (Custom SQL)
-- tests/assert_order_total_matches_items.sql
-- This test FAILS (returns rows) if order totals don't match line items
SELECT order_id, order_total, items_total
FROM {{ ref('fct_orders') }}
WHERE ABS(order_total - items_total) > 0.01
AND NOT is_cancelled
Source Freshness
# Run freshness checks in CI to catch stale data before it reaches dashboards
dbt source freshness --select source:shopify
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Business logic in staging | Staging should be pure rename/cast — logic here is invisible | Move all business logic to intermediate layer |
| Skipping intermediate layer | Mart models become 500-line monsters nobody can debug | Break complex transforms into composable intermediate steps |
| No tests on primary keys | Duplicate rows go undetected, dashboards show wrong numbers | Always test unique + not_null on every model’s primary key |
| SELECT * in models | Schema changes in upstream sources silently break downstream | Explicitly list every column |
| Materializing everything as table | Slow full refreshes, wasted warehouse storage and compute | Use view for staging+intermediate, incremental for large marts |
| No source freshness checks | Stale data in dashboards without anyone knowing | Configure loaded_at_field and freshness thresholds |
| Hardcoded values | Magic numbers scattered across models | Use seeds for reference data, vars for configuration |
| No documentation | Nobody knows what columns mean 6 months later | Document every mart column in YAML |
| Circular references | dbt build fails, impossible to debug | Use DAG visualization (dbt docs generate && dbt docs serve) |
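As an example of the hardcoded-values fix, a magic number can move into vars in dbt_project.yml (the variable name here is illustrative):

```yaml
# dbt_project.yml (sketch — variable name is illustrative)
vars:
  high_value_order_threshold: 500
```

Models then read it with `{{ var('high_value_order_threshold') }}` instead of repeating the literal, so changing the threshold touches one line rather than every model that uses it.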
dbt in CI/CD
# .github/workflows/dbt-ci.yml
name: dbt CI

on:
  pull_request:
    paths: ['models/**', 'tests/**', 'macros/**']

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install dbt
        run: pip install dbt-snowflake==1.7.0
      - name: dbt build (modified models only)
        run: dbt build --select state:modified+ --defer --state ./prod-manifest/
        env:
          DBT_PROFILES_DIR: ./ci_profiles
      - name: Source freshness
        run: dbt source freshness
Checklist
- Three-layer architecture enforced (staging → intermediate → marts)
- Every source has a staging model with rename/cast only
- Naming conventions followed (stg_, int_, fct_, dim_)
- Primary key tests (unique + not_null) on every model
- Source freshness checks configured for all sources
- Incremental materialization for large, append-heavy mart tables
- Documentation (description) for all mart-layer columns
- CI runs dbt build on every PR (modified models + downstream)
- Macro library for repeated SQL patterns (no copy-paste SQL)
- DAG reviewed monthly for circular dependencies and fan-out
- Materialization strategy reviewed (views for small, incremental for large)
- Seeds used for static reference data (not hardcoded values)
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For data engineering consulting, visit garnetgrid.com. :::