Data Warehouse Design: From Raw Events to Business Insights
Design a data warehouse that transforms raw event streams into analytics that drive business decisions. Covers dimensional modeling, the medallion architecture, slowly changing dimensions, ETL vs ELT, data quality frameworks, and how to design a warehouse that scales without becoming an unmaintainable mess.
A data warehouse is where raw data becomes knowledge. It takes the firehose of events from production systems — page views, purchases, API calls, user actions — and transforms them into the curated datasets that answer business questions: “Which marketing channel drives the most revenue?” “What is our customer retention rate by cohort?” “Are enterprise customers adopting the new feature?”
Most data warehouse projects fail not because of technology, but because of architecture: the warehouse becomes a copy of the production database with the same schema, the same query complexity, and none of the performance benefits.
Dimensional Modeling: Think in Facts and Dimensions
| Concept | What It Is | Example |
| --- | --- | --- |
| Fact table | Measurements, events (things that happened) | Orders, page views, transactions |
| Dimension table | Context about facts (who, what, where, when) | Users, products, dates, locations |
| Grain | What one row in the fact table represents | One order line item, one page view |
Star Schema
```sql
-- Fact table: one row per order line item
CREATE TABLE fact_order_items (
    order_item_id   BIGINT PRIMARY KEY,
    order_id        BIGINT NOT NULL,
    order_date_key  INT REFERENCES dim_date(date_key),
    customer_key    INT REFERENCES dim_customer(customer_key),
    product_key     INT REFERENCES dim_product(product_key),
    channel_key     INT REFERENCES dim_channel(channel_key),
    -- Measures (things you aggregate)
    quantity        INT,
    unit_price      DECIMAL(10,2),
    discount_amount DECIMAL(10,2),
    total_amount    DECIMAL(10,2),
    cost_amount     DECIMAL(10,2)
);

-- Dimension table: product context
CREATE TABLE dim_product (
    product_key    INT PRIMARY KEY,  -- Surrogate key
    product_id     VARCHAR(50),      -- Natural key (from source)
    product_name   VARCHAR(255),
    category       VARCHAR(100),
    subcategory    VARCHAR(100),
    brand          VARCHAR(100),
    is_active      BOOLEAN,
    effective_date DATE,
    end_date       DATE
);

-- Dimension table: date (pre-populated)
CREATE TABLE dim_date (
    date_key       INT PRIMARY KEY,  -- YYYYMMDD format
    full_date      DATE,
    day_of_week    VARCHAR(10),
    month_name     VARCHAR(10),
    quarter        INT,
    year           INT,
    is_weekend     BOOLEAN,
    is_holiday     BOOLEAN,
    fiscal_quarter INT,
    fiscal_year    INT
);
```
Why Star Schema Works for Analytics
```sql
-- Simple, fast query: revenue by product category and quarter
SELECT
    d.year,
    d.quarter,
    p.category,
    SUM(f.total_amount)                                AS revenue,
    COUNT(DISTINCT f.order_id)                         AS orders,
    SUM(f.total_amount) / COUNT(DISTINCT f.order_id)   AS avg_order_value
FROM fact_order_items f
JOIN dim_date d    ON f.order_date_key = d.date_key
JOIN dim_product p ON f.product_key    = p.product_key
WHERE d.year = 2024
GROUP BY d.year, d.quarter, p.category
ORDER BY d.quarter, revenue DESC;

-- The same query in a normalized production schema would require
-- 5+ JOINs and be 10x slower on large datasets.
```
Medallion Architecture
```
┌──────────────────────────────────────────────────────────┐
│                  MEDALLION ARCHITECTURE                  │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  🥉 BRONZE (Raw)                                         │
│     Raw data as received from source systems             │
│     Append-only, no transformations, full history        │
│     "We can always go back to the original data"         │
│                                                          │
│  🥈 SILVER (Cleaned)                                     │
│     Deduplicated, validated, standardized                │
│     Column names normalized, types cast, nulls handled   │
│     "Data is clean and trustworthy"                      │
│                                                          │
│  🥇 GOLD (Business-Ready)                                │
│     Dimensional models, aggregated metrics, KPIs         │
│     Optimized for analyst consumption                    │
│     "Analysts can query this directly"                   │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
| Layer | Schema | Audience | Retention |
| --- | --- | --- | --- |
| Bronze | Same as source (JSON, CSV, APIs) | Data engineers only | Years (cheap storage) |
| Silver | Normalized, validated tables | Data engineers + power analysts | 1–3 years |
| Gold | Star schema, aggregated views | All analysts, dashboards, reports | Permanent |
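To make the bronze-to-silver step concrete, here is a minimal sketch that deduplicates raw JSON events and casts them into typed, standardized columns. The table names (`bronze.orders_raw`, `silver.orders`), the payload fields, and the Postgres-style `->>` JSON operator are all assumptions for illustration, not part of the article's schema.

```sql
-- Sketch: bronze → silver. Deduplicate append-only raw events,
-- keep the latest version of each order, cast types, standardize values.
CREATE TABLE silver.orders AS
SELECT
    CAST(payload->>'order_id'   AS BIGINT)        AS order_id,
    payload->>'customer_id'                       AS customer_id,
    CAST(payload->>'total'      AS DECIMAL(10,2)) AS total_amount,
    LOWER(TRIM(payload->>'status'))               AS status,
    CAST(payload->>'created_at' AS TIMESTAMP)     AS created_at
FROM (
    SELECT payload,
           ROW_NUMBER() OVER (
               PARTITION BY payload->>'order_id'
               ORDER BY ingested_at DESC          -- newest event wins
           ) AS rn
    FROM bronze.orders_raw                        -- append-only raw layer
) deduped
WHERE rn = 1
  AND payload->>'order_id' IS NOT NULL;           -- drop malformed events
```

The key design choice is that the raw table is never modified: silver is always rebuildable from bronze.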
ETL vs ELT
| Approach | Transform Where | Best For |
| --- | --- | --- |
| ETL (Extract, Transform, Load) | Before loading into warehouse | Small datasets, complex transformations |
| ELT (Extract, Load, Transform) | Inside the warehouse | Large datasets, modern cloud warehouses |
```
ETL (traditional):
  Source → Extract → Transform (external tool) → Load → Warehouse

ELT (modern):
  Source → Extract → Load (raw) → Transform (SQL in warehouse) → Gold tables
```

Modern warehouses (BigQuery, Snowflake, Redshift) have massive compute. It's cheaper and simpler to load raw data and transform with SQL than to maintain a separate transformation layer.
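As a sketch of what the ELT "Transform" step looks like in practice, the query below builds a gold-layer table as plain SQL inside the warehouse, using the star schema defined earlier. The `gold.daily_revenue` table name is a hypothetical example.

```sql
-- Illustrative ELT transform: the "T" runs inside the warehouse,
-- materializing a business-ready aggregate from the star schema.
CREATE TABLE gold.daily_revenue AS
SELECT
    d.full_date,
    p.category,
    SUM(f.total_amount)        AS revenue,
    COUNT(DISTINCT f.order_id) AS orders
FROM fact_order_items f
JOIN dim_date d    ON f.order_date_key = d.date_key
JOIN dim_product p ON f.product_key    = p.product_key
GROUP BY d.full_date, p.category;
```

In a real pipeline this statement would typically be managed by an orchestration or transformation tool and re-run incrementally rather than as a full rebuild.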
Slowly Changing Dimensions (SCD)
When dimension attributes change over time (customer changes their address, product changes category), you need a strategy for preserving history.
| SCD Type | Strategy | Use When |
| --- | --- | --- |
| Type 1 | Overwrite old value | History does not matter |
| Type 2 | Add new row with effective dates | Need historical accuracy |
| Type 3 | Add column for previous value | Only need current and previous |
```sql
-- SCD Type 2: Track customer address history
-- Row 1: Active from 2023 to 2024
-- Row 2: Active from 2024 to present
SELECT * FROM dim_customer WHERE customer_id = 'C001';

-- customer_key | customer_id | name  | city     | effective_date | end_date   | is_current
-- 1001         | C001        | Alice | New York | 2023-01-15     | 2024-06-01 | false
-- 1002         | C001        | Alice | Chicago  | 2024-06-01     | 9999-12-31 | true

-- Orders from 2023 JOIN to customer_key 1001 → New York
-- Orders from 2024 JOIN to customer_key 1002 → Chicago
-- Historical accuracy preserved!
```
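A Type 2 dimension also needs a load routine. The two-statement sketch below (Postgres-style `UPDATE ... FROM` syntax) closes out changed rows and inserts new current ones. The staging table `stg_customer` is a hypothetical daily snapshot from the source system, and `customer_key` is assumed to be an auto-generated surrogate (identity column or sequence).

```sql
-- Step 1: expire current rows whose tracked attribute changed.
UPDATE dim_customer d
SET end_date   = CURRENT_DATE,
    is_current = FALSE
FROM stg_customer s
WHERE d.customer_id = s.customer_id
  AND d.is_current  = TRUE
  AND d.city       <> s.city;        -- attribute whose history we track

-- Step 2: insert a new current row for changed or brand-new customers.
INSERT INTO dim_customer
    (customer_id, name, city, effective_date, end_date, is_current)
SELECT s.customer_id, s.name, s.city,
       CURRENT_DATE, DATE '9999-12-31', TRUE
FROM stg_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id
      AND d.is_current  = TRUE
WHERE d.customer_id IS NULL;          -- no current row: new, or expired in step 1
```

Warehouses with `MERGE` support can express this as a single statement, but the two-step form makes the expire-then-insert logic explicit.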
Data Quality Framework
| Quality Dimension | Check | Automation |
| --- | --- | --- |
| Completeness | Are required fields populated? | NOT NULL checks, % null monitoring |
| Uniqueness | Are there duplicates? | Primary key validation, dedup queries |
| Timeliness | Is data arriving on schedule? | Pipeline lag monitoring, freshness alerts |
| Accuracy | Does the data match the source? | Reconciliation queries between source and warehouse |
| Consistency | Do related fields agree? | Cross-table validation (orders sum = revenue report) |
```sql
-- Data quality check: daily reconciliation
SELECT
    'orders' AS table_name,
    source.count                  AS source_count,
    warehouse.count               AS warehouse_count,
    source.count - warehouse.count AS difference,
    CASE WHEN ABS(source.count - warehouse.count) > 10
         THEN 'FAIL' ELSE 'PASS' END AS status
FROM (SELECT COUNT(*) AS count FROM source_db.orders
      WHERE created_at::date = CURRENT_DATE - 1) source,
     (SELECT COUNT(*) AS count FROM bronze.orders
      WHERE created_at::date = CURRENT_DATE - 1) warehouse;
```
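The timeliness dimension can be automated the same way. The sketch below flags a table whose newest row is older than the expected load interval; the two-hour threshold and the `bronze.orders` table name are illustrative assumptions.

```sql
-- Freshness check: alert when the newest row is older than expected.
SELECT
    'bronze.orders'  AS table_name,
    MAX(created_at)  AS latest_row,
    CASE WHEN MAX(created_at) < CURRENT_TIMESTAMP - INTERVAL '2 hours'
         THEN 'STALE' ELSE 'FRESH' END AS freshness_status
FROM bronze.orders;
```

In practice a scheduler runs one such query per critical table and routes any `STALE` result to an alerting channel.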
Implementation Checklist
Design dimensional models (star schema) for your top 3 business questions
Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.