
Data Warehouse Design: From Raw Events to Business Insights

Design a data warehouse that transforms raw event streams into analytics that drive business decisions. Covers dimensional modeling, the medallion architecture, slowly changing dimensions, ETL vs ELT, data quality frameworks, and how to design a warehouse that scales without becoming an unmaintainable mess.

A data warehouse is where raw data becomes knowledge. It takes the firehose of events from production systems — page views, purchases, API calls, user actions — and transforms them into the curated datasets that answer business questions: “Which marketing channel drives the most revenue?” “What is our customer retention rate by cohort?” “Are enterprise customers adopting the new feature?”

Most data warehouse projects fail not because of technology, but because of architecture: the warehouse becomes a copy of the production database with the same schema, the same query complexity, and none of the performance benefits.


Dimensional Modeling: Think in Facts and Dimensions

Concept         | What It Is                                   | Example
----------------|----------------------------------------------|-----------------------------------
Fact table      | Measurements, events (things that happened)  | Orders, page views, transactions
Dimension table | Context about facts (who, what, where, when) | Users, products, dates, locations
Grain           | What one row in the fact table represents    | One order line item, one page view

Star Schema

-- Fact table: one row per order line item
CREATE TABLE fact_order_items (
    order_item_id    BIGINT PRIMARY KEY,
    order_id         BIGINT NOT NULL,
    order_date_key   INT REFERENCES dim_date(date_key),
    customer_key     INT REFERENCES dim_customer(customer_key),
    product_key      INT REFERENCES dim_product(product_key),
    channel_key      INT REFERENCES dim_channel(channel_key),

    -- Measures (things you aggregate)
    quantity         INT,
    unit_price       DECIMAL(10,2),
    discount_amount  DECIMAL(10,2),
    total_amount     DECIMAL(10,2),
    cost_amount      DECIMAL(10,2)
);

-- Dimension table: product context
CREATE TABLE dim_product (
    product_key      INT PRIMARY KEY,        -- Surrogate key
    product_id       VARCHAR(50),            -- Natural key (from source)
    product_name     VARCHAR(255),
    category         VARCHAR(100),
    subcategory      VARCHAR(100),
    brand            VARCHAR(100),
    is_active        BOOLEAN,
    effective_date   DATE,
    end_date         DATE
);

-- Dimension table: date (pre-populated)
CREATE TABLE dim_date (
    date_key         INT PRIMARY KEY,        -- YYYYMMDD format
    full_date        DATE,
    day_of_week      VARCHAR(10),
    month_name       VARCHAR(10),
    quarter          INT,
    year             INT,
    is_weekend       BOOLEAN,
    is_holiday       BOOLEAN,
    fiscal_quarter   INT,
    fiscal_year      INT
);
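
Populating dim_date is typically a one-time script. A minimal sketch using PostgreSQL's generate_series (the fiscal columns here assume a fiscal year aligned with the calendar year; shift them to match your own fiscal offset):

```sql
-- Populate dim_date for 2020-2034 (PostgreSQL syntax)
INSERT INTO dim_date
SELECT
    TO_CHAR(d, 'YYYYMMDD')::INT         AS date_key,
    d::DATE                             AS full_date,
    TRIM(TO_CHAR(d, 'Day'))             AS day_of_week,
    TRIM(TO_CHAR(d, 'Month'))           AS month_name,
    EXTRACT(QUARTER FROM d)::INT        AS quarter,
    EXTRACT(YEAR FROM d)::INT           AS year,
    EXTRACT(ISODOW FROM d) IN (6, 7)    AS is_weekend,
    FALSE                               AS is_holiday,      -- load from a holiday table separately
    EXTRACT(QUARTER FROM d)::INT        AS fiscal_quarter,  -- assumes fiscal = calendar
    EXTRACT(YEAR FROM d)::INT           AS fiscal_year
FROM generate_series('2020-01-01'::DATE, '2034-12-31'::DATE, INTERVAL '1 day') AS d;
```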

Why Star Schema Works for Analytics

-- Simple, fast query: revenue by product category and quarter
SELECT
    d.year,
    d.quarter,
    p.category,
    SUM(f.total_amount) AS revenue,
    COUNT(DISTINCT f.order_id) AS orders,
    SUM(f.total_amount) / COUNT(DISTINCT f.order_id) AS avg_order_value
FROM fact_order_items f
JOIN dim_date d ON f.order_date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, d.quarter, p.category
ORDER BY d.quarter, revenue DESC;

-- The same query against a normalized production schema would require
-- 5+ JOINs and run far slower on large datasets.
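
For contrast, here is what the same question looks like against a typical normalized OLTP schema (table and column names are illustrative):

```sql
-- Same question, normalized schema: five JOINs before analysis even starts
SELECT
    EXTRACT(YEAR FROM o.created_at)    AS year,
    EXTRACT(QUARTER FROM o.created_at) AS quarter,
    c.name                             AS category,
    SUM(oi.quantity * oi.unit_price - oi.discount_amount) AS revenue
FROM order_items oi
JOIN orders o         ON oi.order_id = o.id
JOIN products p       ON oi.product_id = p.id
JOIN subcategories sc ON p.subcategory_id = sc.id
JOIN categories c     ON sc.category_id = c.id
WHERE o.created_at >= '2024-01-01' AND o.created_at < '2025-01-01'
GROUP BY 1, 2, 3;
-- Every query re-derives date parts and revenue math
-- that the star schema precomputes once.
```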

Medallion Architecture

┌─────────────────────────────────────────────────────┐
│  MEDALLION ARCHITECTURE                              │
├─────────────────────────────────────────────────────┤
│                                                       │
│  🥉 BRONZE (Raw)                                      │
│  Raw data as received from source systems             │
│  Append-only, no transformations, full history        │
│  "We can always go back to the original data"         │
│                                                       │
│  🥈 SILVER (Cleaned)                                  │
│  Deduplicated, validated, standardized                │
│  Column names normalized, types cast, nulls handled   │
│  "Data is clean and trustworthy"                      │
│                                                       │
│  🥇 GOLD (Business-Ready)                             │
│  Dimensional models, aggregated metrics, KPIs         │
│  Optimized for analyst consumption                    │
│  "Analysts can query this directly"                   │
│                                                       │
└─────────────────────────────────────────────────────┘
Layer  | Schema                           | Audience                          | Retention
-------|----------------------------------|-----------------------------------|----------------------
Bronze | Same as source (JSON, CSV, APIs) | Data engineers only               | Years (cheap storage)
Silver | Normalized, validated tables     | Data engineers + power analysts   | 1-3 years
Gold   | Star schema, aggregated views    | All analysts, dashboards, reports | Permanent
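
A bronze-to-silver step is usually just SQL: deduplicate, cast types, standardize values. A sketch assuming a raw bronze.orders table with a _loaded_at lineage column (column names are illustrative), using PostgreSQL's DISTINCT ON:

```sql
-- Bronze → Silver: deduplicate on the natural key, cast types, standardize values
CREATE TABLE silver.orders AS
SELECT DISTINCT ON (order_id)
    order_id::BIGINT            AS order_id,
    customer_id::BIGINT         AS customer_id,
    NULLIF(TRIM(status), '')    AS status,        -- empty strings → NULL
    LOWER(channel)              AS channel,       -- 'Email' / 'EMAIL' → 'email'
    total_amount::DECIMAL(10,2) AS total_amount,
    created_at::TIMESTAMP       AS created_at,
    _loaded_at                  AS _loaded_at     -- lineage: when bronze received the row
FROM bronze.orders
ORDER BY order_id, _loaded_at DESC;  -- keep the most recently loaded copy of each order
```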

ETL vs ELT

Approach                       | Transform Where               | Best For
-------------------------------|-------------------------------|-----------------------------------------
ETL (Extract, Transform, Load) | Before loading into warehouse | Small datasets, complex transformations
ELT (Extract, Load, Transform) | Inside the warehouse          | Large datasets, modern cloud warehouses

ETL (traditional):
  Source → Extract → Transform (external tool) → Load → Warehouse

ELT (modern):
  Source → Extract → Load (raw) → Transform (SQL in warehouse) → Gold tables

Modern warehouses (BigQuery, Snowflake, Redshift) have massive compute.
It's cheaper and simpler to load raw data and transform with SQL
than to maintain a separate transformation layer.
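
In ELT, each transformation is a SQL statement run inside the warehouse, typically orchestrated by a scheduler. A sketch of a gold-layer build from silver tables (schema and column names are illustrative; the date-range joins pick the dimension row that was current when the order happened, matching the SCD Type 2 pattern described below):

```sql
-- ELT transform step: build the gold fact table entirely inside the warehouse
CREATE TABLE gold.fact_order_items AS
SELECT
    oi.order_item_id,
    oi.order_id,
    TO_CHAR(o.created_at, 'YYYYMMDD')::INT           AS order_date_key,
    c.customer_key,
    p.product_key,
    oi.quantity,
    oi.unit_price,
    oi.discount_amount,
    oi.quantity * oi.unit_price - oi.discount_amount AS total_amount,
    oi.quantity * p.unit_cost                        AS cost_amount
FROM silver.order_items oi
JOIN silver.orders o ON oi.order_id = o.order_id
JOIN gold.dim_customer c
  ON o.customer_id = c.customer_id
 AND o.created_at >= c.effective_date AND o.created_at < c.end_date
JOIN gold.dim_product p
  ON oi.product_id = p.product_id
 AND o.created_at >= p.effective_date AND o.created_at < p.end_date;
```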

Slowly Changing Dimensions (SCD)

When dimension attributes change over time (customer changes their address, product changes category), you need a strategy for preserving history.

SCD Type | Strategy                         | Use When
---------|----------------------------------|--------------------------------
Type 1   | Overwrite old value              | History does not matter
Type 2   | Add new row with effective dates | Need historical accuracy
Type 3   | Add column for previous value    | Only need current and previous

-- SCD Type 2: Track customer address history
-- Row 1: Active from 2023 to 2024
-- Row 2: Active from 2024 to present

SELECT * FROM dim_customer WHERE customer_id = 'C001';

-- customer_key | customer_id | name  | city     | effective_date | end_date   | is_current
-- 1001         | C001        | Alice | New York | 2023-01-15     | 2024-06-01 | false
-- 1002         | C001        | Alice | Chicago  | 2024-06-01     | 9999-12-31 | true

-- Orders from 2023 JOIN to customer_key 1001 → New York
-- Orders from 2024 JOIN to customer_key 1002 → Chicago
-- Historical accuracy preserved!
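
Applying a Type 2 change is a two-step write: expire the current row, then insert the new version. A sketch matching the Alice example above (in practice the new surrogate key would come from a sequence, not a literal):

```sql
-- Step 1: close out the current row for the changed customer
UPDATE dim_customer
SET end_date   = CURRENT_DATE,
    is_current = FALSE
WHERE customer_id = 'C001'
  AND is_current = TRUE;

-- Step 2: insert the new version as the open, current row
INSERT INTO dim_customer
    (customer_key, customer_id, name, city, effective_date, end_date, is_current)
VALUES
    (1002, 'C001', 'Alice', 'Chicago', CURRENT_DATE, DATE '9999-12-31', TRUE);
```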

Data Quality Framework

Quality Dimension | Check                           | Automation
------------------|---------------------------------|------------------------------------------------------
Completeness      | Are required fields populated?  | NOT NULL checks, % null monitoring
Uniqueness        | Are there duplicates?           | Primary key validation, dedup queries
Timeliness        | Is data arriving on schedule?   | Pipeline lag monitoring, freshness alerts
Accuracy          | Does the data match the source? | Reconciliation queries between source and warehouse
Consistency       | Do related fields agree?        | Cross-table validation (orders sum = revenue report)

-- Data quality check: daily reconciliation
SELECT
    'orders' AS table_name,
    source.count AS source_count,
    warehouse.count AS warehouse_count,
    source.count - warehouse.count AS difference,
    CASE
        WHEN ABS(source.count - warehouse.count) > 10 THEN 'FAIL'
        ELSE 'PASS'
    END AS status
FROM
    (SELECT COUNT(*) AS count FROM source_db.orders
     WHERE created_at::date = CURRENT_DATE - 1) source,
    (SELECT COUNT(*) AS count FROM bronze.orders
     WHERE created_at::date = CURRENT_DATE - 1) warehouse;
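
The same pattern extends to the other quality dimensions. Two more sketches: a uniqueness check on the fact table's declared grain, and a freshness check against a 2-hour SLA (the _loaded_at column and threshold are illustrative):

```sql
-- Uniqueness: one row per order line item (the declared grain)
SELECT order_item_id, COUNT(*) AS copies
FROM fact_order_items
GROUP BY order_item_id
HAVING COUNT(*) > 1;  -- any rows returned = grain violation

-- Freshness: alert if the newest loaded row is more than 2 hours old
SELECT
    MAX(_loaded_at) AS last_load,
    CASE
        WHEN MAX(_loaded_at) < NOW() - INTERVAL '2 hours' THEN 'ALERT'
        ELSE 'OK'
    END AS freshness_status
FROM bronze.orders;
```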

Implementation Checklist

  • Design dimensional models (star schema) for your top 3 business questions
  • Implement medallion architecture: bronze (raw), silver (clean), gold (business-ready)
  • Use ELT: load raw data first, transform with SQL in the warehouse
  • Create a dim_date table covering at least 10 years with fiscal calendar
  • Implement SCD Type 2 for dimensions that change over time (customers, products)
  • Build data quality checks: completeness, uniqueness, timeliness, accuracy
  • Run daily reconciliation between source systems and warehouse
  • Alert on data freshness: if pipeline is > 2 hours late, notify data team
  • Document the grain of every fact table (what does one row represent?)
  • Create a data catalog: what tables exist, what they contain, who owns them
Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
