Data Warehouse Design: From Raw Events to Business Insights
Design a data warehouse that transforms raw event streams into analytics that drive business decisions. Covers dimensional modeling, the medallion architecture, slowly changing dimensions, ETL vs ELT, data quality frameworks, and how to design a warehouse that scales without becoming an unmaintainable mess.
A data warehouse is where raw data becomes knowledge. It takes the firehose of events from production systems — page views, purchases, API calls, user actions — and transforms them into the curated datasets that answer business questions: “Which marketing channel drives the most revenue?” “What is our customer retention rate by cohort?” “Are enterprise customers adopting the new feature?”
Most data warehouse projects fail not because of technology, but because of architecture: the warehouse becomes a copy of the production database with the same schema, the same query complexity, and none of the performance benefits.
Dimensional Modeling: Think in Facts and Dimensions
| Concept | What It Is | Example |
| --- | --- | --- |
| Fact table | Measurements, events (things that happened) | Orders, page views, transactions |
| Dimension table | Context about facts (who, what, where, when) | Users, products, dates, locations |
| Grain | What one row in the fact table represents | One order line item, one page view |
Star Schema
```sql
-- Fact table: one row per order line item
CREATE TABLE fact_order_items (
    order_item_id   BIGINT PRIMARY KEY,
    order_id        BIGINT NOT NULL,
    order_date_key  INT REFERENCES dim_date(date_key),
    customer_key    INT REFERENCES dim_customer(customer_key),
    product_key     INT REFERENCES dim_product(product_key),
    channel_key     INT REFERENCES dim_channel(channel_key),
    -- Measures (things you aggregate)
    quantity        INT,
    unit_price      DECIMAL(10,2),
    discount_amount DECIMAL(10,2),
    total_amount    DECIMAL(10,2),
    cost_amount     DECIMAL(10,2)
);

-- Dimension table: product context
CREATE TABLE dim_product (
    product_key    INT PRIMARY KEY,  -- Surrogate key
    product_id     VARCHAR(50),      -- Natural key (from source)
    product_name   VARCHAR(255),
    category       VARCHAR(100),
    subcategory    VARCHAR(100),
    brand          VARCHAR(100),
    is_active      BOOLEAN,
    effective_date DATE,
    end_date       DATE
);

-- Dimension table: date (pre-populated)
CREATE TABLE dim_date (
    date_key       INT PRIMARY KEY,  -- YYYYMMDD format
    full_date      DATE,
    day_of_week    VARCHAR(10),
    month_name     VARCHAR(10),
    quarter        INT,
    year           INT,
    is_weekend     BOOLEAN,
    is_holiday     BOOLEAN,
    fiscal_quarter INT,
    fiscal_year    INT
);
```
Why Star Schema Works for Analytics
```sql
-- Simple, fast query: revenue by product category and quarter
SELECT
    d.year,
    d.quarter,
    p.category,
    SUM(f.total_amount)                                AS revenue,
    COUNT(DISTINCT f.order_id)                         AS orders,
    SUM(f.total_amount) / COUNT(DISTINCT f.order_id)   AS avg_order_value
FROM fact_order_items f
JOIN dim_date d    ON f.order_date_key = d.date_key
JOIN dim_product p ON f.product_key    = p.product_key
WHERE d.year = 2024
GROUP BY d.year, d.quarter, p.category
ORDER BY d.quarter, revenue DESC;

-- The same query in a normalized production schema would require
-- 5+ JOINs and be 10x slower on large datasets.
```
Medallion Architecture
```
┌──────────────────────────────────────────────────────────┐
│                  MEDALLION ARCHITECTURE                  │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  🥉 BRONZE (Raw)                                         │
│     Raw data as received from source systems             │
│     Append-only, no transformations, full history        │
│     "We can always go back to the original data"         │
│                                                          │
│  🥈 SILVER (Cleaned)                                     │
│     Deduplicated, validated, standardized                │
│     Column names normalized, types cast, nulls handled   │
│     "Data is clean and trustworthy"                      │
│                                                          │
│  🥇 GOLD (Business-Ready)                                │
│     Dimensional models, aggregated metrics, KPIs         │
│     Optimized for analyst consumption                    │
│     "Analysts can query this directly"                   │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
| Layer | Schema | Audience | Retention |
| --- | --- | --- | --- |
| Bronze | Same as source (JSON, CSV, APIs) | Data engineers only | Years (cheap storage) |
| Silver | Normalized, validated tables | Data engineers + power analysts | 1–3 years |
| Gold | Star schema, aggregated views | All analysts, dashboards, reports | Permanent |
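To make the bronze-to-silver step concrete, here is a minimal sketch that deduplicates raw JSON events and casts them into typed, standardized columns. The table names (`bronze.orders_raw`, `silver.orders`), the payload fields, and the Postgres-style `->>` JSON operator are all assumptions for illustration, not part of the article's schema.

```sql
-- Sketch: bronze → silver. Deduplicate append-only raw events,
-- keep the latest version of each order, cast types, standardize values.
CREATE TABLE silver.orders AS
SELECT
    CAST(payload->>'order_id'   AS BIGINT)        AS order_id,
    payload->>'customer_id'                       AS customer_id,
    CAST(payload->>'total'      AS DECIMAL(10,2)) AS total_amount,
    LOWER(TRIM(payload->>'status'))               AS status,
    CAST(payload->>'created_at' AS TIMESTAMP)     AS created_at
FROM (
    SELECT payload,
           ROW_NUMBER() OVER (
               PARTITION BY payload->>'order_id'
               ORDER BY ingested_at DESC          -- newest event wins
           ) AS rn
    FROM bronze.orders_raw                        -- append-only raw layer
) deduped
WHERE rn = 1
  AND payload->>'order_id' IS NOT NULL;           -- drop malformed events
```

The key design choice is that the raw table is never modified: silver is always rebuildable from bronze.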
ETL vs ELT
| Approach | Transform Where | Best For |
| --- | --- | --- |
| ETL (Extract, Transform, Load) | Before loading into warehouse | Small datasets, complex transformations |
| ELT (Extract, Load, Transform) | Inside the warehouse | Large datasets, modern cloud warehouses |
```
ETL (traditional):
  Source → Extract → Transform (external tool) → Load → Warehouse

ELT (modern):
  Source → Extract → Load (raw) → Transform (SQL in warehouse) → Gold tables
```

Modern warehouses (BigQuery, Snowflake, Redshift) have massive compute. It's cheaper and simpler to load raw data and transform with SQL than to maintain a separate transformation layer.
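As a sketch of what the ELT "Transform" step looks like in practice, the query below builds a gold-layer table as plain SQL inside the warehouse, using the star schema defined earlier. The `gold.daily_revenue` table name is a hypothetical example.

```sql
-- Illustrative ELT transform: the "T" runs inside the warehouse,
-- materializing a business-ready aggregate from the star schema.
CREATE TABLE gold.daily_revenue AS
SELECT
    d.full_date,
    p.category,
    SUM(f.total_amount)        AS revenue,
    COUNT(DISTINCT f.order_id) AS orders
FROM fact_order_items f
JOIN dim_date d    ON f.order_date_key = d.date_key
JOIN dim_product p ON f.product_key    = p.product_key
GROUP BY d.full_date, p.category;
```

In a real pipeline this statement would typically be managed by an orchestration or transformation tool and re-run incrementally rather than as a full rebuild.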
Slowly Changing Dimensions (SCD)
When dimension attributes change over time (customer changes their address, product changes category), you need a strategy for preserving history.
| SCD Type | Strategy | Use When |
| --- | --- | --- |
| Type 1 | Overwrite old value | History does not matter |
| Type 2 | Add new row with effective dates | Need historical accuracy |
| Type 3 | Add column for previous value | Only need current and previous |
```sql
-- SCD Type 2: Track customer address history
-- Row 1: Active from 2023 to 2024
-- Row 2: Active from 2024 to present
SELECT * FROM dim_customer WHERE customer_id = 'C001';

-- customer_key | customer_id | name  | city     | effective_date | end_date   | is_current
-- 1001         | C001        | Alice | New York | 2023-01-15     | 2024-06-01 | false
-- 1002         | C001        | Alice | Chicago  | 2024-06-01     | 9999-12-31 | true

-- Orders from 2023 JOIN to customer_key 1001 → New York
-- Orders from 2024 JOIN to customer_key 1002 → Chicago
-- Historical accuracy preserved!
```
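A Type 2 dimension also needs a load routine. The two-statement sketch below (Postgres-style `UPDATE ... FROM` syntax) closes out changed rows and inserts new current ones. The staging table `stg_customer` is a hypothetical daily snapshot from the source system, and `customer_key` is assumed to be an auto-generated surrogate (identity column or sequence).

```sql
-- Step 1: expire current rows whose tracked attribute changed.
UPDATE dim_customer d
SET end_date   = CURRENT_DATE,
    is_current = FALSE
FROM stg_customer s
WHERE d.customer_id = s.customer_id
  AND d.is_current  = TRUE
  AND d.city       <> s.city;        -- attribute whose history we track

-- Step 2: insert a new current row for changed or brand-new customers.
INSERT INTO dim_customer
    (customer_id, name, city, effective_date, end_date, is_current)
SELECT s.customer_id, s.name, s.city,
       CURRENT_DATE, DATE '9999-12-31', TRUE
FROM stg_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id
      AND d.is_current  = TRUE
WHERE d.customer_id IS NULL;          -- no current row: new, or expired in step 1
```

Warehouses with `MERGE` support can express this as a single statement, but the two-step form makes the expire-then-insert logic explicit.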
Data Quality Framework
| Quality Dimension | Check | Automation |
| --- | --- | --- |
| Completeness | Are required fields populated? | NOT NULL checks, % null monitoring |
| Uniqueness | Are there duplicates? | Primary key validation, dedup queries |
| Timeliness | Is data arriving on schedule? | Pipeline lag monitoring, freshness alerts |
| Accuracy | Does the data match the source? | Reconciliation queries between source and warehouse |
| Consistency | Do related fields agree? | Cross-table validation (orders sum = revenue report) |
```sql
-- Data quality check: daily reconciliation
SELECT
    'orders' AS table_name,
    source.count                  AS source_count,
    warehouse.count               AS warehouse_count,
    source.count - warehouse.count AS difference,
    CASE WHEN ABS(source.count - warehouse.count) > 10
         THEN 'FAIL' ELSE 'PASS' END AS status
FROM (SELECT COUNT(*) AS count FROM source_db.orders
      WHERE created_at::date = CURRENT_DATE - 1) source,
     (SELECT COUNT(*) AS count FROM bronze.orders
      WHERE created_at::date = CURRENT_DATE - 1) warehouse;
```
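The timeliness dimension can be automated the same way. The sketch below flags a table whose newest row is older than the expected load interval; the two-hour threshold and the `bronze.orders` table name are illustrative assumptions.

```sql
-- Freshness check: alert when the newest row is older than expected.
SELECT
    'bronze.orders'  AS table_name,
    MAX(created_at)  AS latest_row,
    CASE WHEN MAX(created_at) < CURRENT_TIMESTAMP - INTERVAL '2 hours'
         THEN 'STALE' ELSE 'FRESH' END AS freshness_status
FROM bronze.orders;
```

In practice a scheduler runs one such query per critical table and routes any `STALE` result to an alerting channel.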
Implementation Checklist
Design dimensional models (star schema) for your top 3 business questions
Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.