
Data Lakehouse Architecture

Combine data lake flexibility with data warehouse performance. Covers lakehouse design principles, Delta Lake, Apache Iceberg, table formats, schema evolution, time travel, and the patterns that eliminate the data lake vs. warehouse tradeoff.

Data lakes store everything cheaply but are slow and unreliable for analytics. Data warehouses are fast and reliable but expensive and rigid. The lakehouse combines both: open file formats on cheap object storage with ACID transactions, schema enforcement, and warehouse-level query performance.


Architecture Comparison

Data Warehouse:
  Storage: Proprietary (Snowflake, BigQuery, Redshift)
  Format: Internal, optimized
  Cost: $$$  (compute + storage bundled)
  Schema: Enforced on write (schema-on-write)
  ACID: Yes
  ML/DS: Export data to separate system
  
Data Lake:
  Storage: Object storage (S3, GCS, ADLS)
  Format: Open (Parquet, ORC, JSON, CSV)
  Cost: $ (cheap storage)
  Schema: None (schema-on-read)
  ACID: No (no multi-file transactions; concurrent writers can corrupt state)
  ML/DS: Direct access ✓
  Problem: "Data swamp" — unreliable, slow, no governance

Data Lakehouse:
  Storage: Object storage (S3, GCS, ADLS) — cheap
  Format: Open + table format (Delta Lake, Iceberg, Hudi)
  Cost: $$ (cheap storage, efficient compute)
  Schema: Enforced (schema-on-write with evolution)
  ACID: Yes (transactions on object storage)
  ML/DS: Direct access ✓
  Best of both: Warehouse reliability + lake flexibility

Table Formats

Delta Lake (Databricks):
  Transaction log: JSON files tracking changes
  Key features:
  • ACID transactions on Spark/object storage
  • Schema enforcement and evolution
  • Time travel (query historical versions)
  • MERGE (upserts on data lake)
  • Z-ordering (data layout optimization)
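Delta Lake's core trick is that the transaction log, not the directory listing, defines the table: readers replay ordered commit files to learn which data files are live. The following is a minimal sketch of that replay, using only the `add`/`remove` actions; the real protocol also includes metadata and protocol actions plus Parquet checkpoints, and the file names here are illustrative.

```python
import json
import os
import tempfile

def live_files(delta_log_dir):
    """Replay JSON commit files in order; return the set of live data files.

    Simplified: real Delta logs also carry protocol/metadata actions and
    periodic Parquet checkpoints. This toy handles only add/remove.
    """
    live = set()
    for name in sorted(os.listdir(delta_log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(delta_log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live

# Tiny two-commit log: commit 0 adds two files, commit 1 compacts them into one.
log = tempfile.mkdtemp()
with open(os.path.join(log, "00000000000000000000.json"), "w") as f:
    f.write(json.dumps({"add": {"path": "part-0.parquet"}}) + "\n")
    f.write(json.dumps({"add": {"path": "part-1.parquet"}}) + "\n")
with open(os.path.join(log, "00000000000000000001.json"), "w") as f:
    f.write(json.dumps({"remove": {"path": "part-0.parquet"}}) + "\n")
    f.write(json.dumps({"remove": {"path": "part-1.parquet"}}) + "\n")
    f.write(json.dumps({"add": {"path": "part-2.parquet"}}) + "\n")

print(sorted(live_files(log)))  # ['part-2.parquet']
```

Because each commit is a single atomically written file, a reader either sees a commit in full or not at all, which is how ACID semantics land on plain object storage. Time travel falls out for free: replay only the first N commit files to reconstruct version N.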
  
Apache Iceberg (Netflix):
  Transaction log: Manifest files with snapshot isolation
  Key features:
  • Partition evolution (change partitioning without rewrite)
  • Hidden partitioning (query without knowing partition scheme)
  • Schema evolution (add/rename/drop columns safely)
  • Time travel and rollback
  • Engine-agnostic (Spark, Trino, Flink, Dremio)
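Hidden partitioning is worth pausing on: the table stores a *transform* (e.g. `day(event_timestamp)`) and applies it for pruning, so queries filter on the source column and never mention partition columns. A minimal sketch of the idea, with a hypothetical `ToyIcebergTable` class (not the Iceberg API):

```python
from datetime import datetime

def day_transform(ts):
    # Analogous to Iceberg's day() transform: timestamp -> date partition value.
    return ts.date()

class ToyIcebergTable:
    """Rows grouped by a partition transform the query never references."""
    def __init__(self, transform):
        self.transform = transform
        self.partitions = {}  # partition value -> list of rows

    def append(self, row):
        self.partitions.setdefault(self.transform(row["ts"]), []).append(row)

    def scan(self, ts_lo, ts_hi):
        results = []
        for part, rows in self.partitions.items():
            # Prune: the filter is on ts, but the table maps it through the
            # same transform to skip whole partitions.
            if not (ts_lo.date() <= part <= ts_hi.date()):
                continue
            results.extend(r for r in rows if ts_lo <= r["ts"] <= ts_hi)
        return results

t = ToyIcebergTable(day_transform)
t.append({"ts": datetime(2024, 3, 14, 9), "user": "a"})
t.append({"ts": datetime(2024, 3, 14, 17), "user": "b"})
t.append({"ts": datetime(2024, 3, 15, 8), "user": "c"})
hits = t.scan(datetime(2024, 3, 14), datetime(2024, 3, 14, 23, 59))
print(len(hits))  # 2
```

Because the transform lives in table metadata rather than in user queries, it can be changed later (partition evolution) without rewriting old data or breaking existing SQL.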

Apache Hudi (Uber):
  Transaction log: Timeline metadata
  Key features:
  • Upserts and deletes optimized for streaming
  • Incremental processing (only changed data)
  • Record-level change tracking
  • Compaction for read optimization
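Hudi's distinguishing idea is that every record carries a key and every write lands on a commit timeline, so downstream jobs can pull only what changed since their last run instead of rescanning the table. A toy sketch of that contract (the class name and methods are illustrative, not Hudi's API):

```python
class ToyHudiTable:
    """Record-keyed upserts on a commit timeline, enabling incremental pulls."""
    def __init__(self):
        self.records = {}      # record key -> (commit_id, value)
        self.commit_id = 0

    def upsert(self, batch):
        # Each upsert batch is one commit on the timeline; existing keys are
        # overwritten in place (the thing raw Parquet cannot do).
        self.commit_id += 1
        for key, value in batch.items():
            self.records[key] = (self.commit_id, value)
        return self.commit_id

    def incremental(self, since_commit):
        # Incremental query: only records written after `since_commit`.
        return {k: v for k, (c, v) in self.records.items() if c > since_commit}

t = ToyHudiTable()
c1 = t.upsert({"u1": "a", "u2": "b"})
c2 = t.upsert({"u2": "b2", "u3": "c"})
print(t.incremental(c1))  # {'u2': 'b2', 'u3': 'c'}
```

A downstream consumer stores the last commit it processed and asks only for newer records, which is what makes Hudi a natural fit for streaming-style pipelines on a lake.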

Implementation

-- Delta Lake: Creating and querying a lakehouse table

-- Create an external Delta table (LOCATION makes it external, not managed).
-- Delta SQL cannot partition by an expression directly, so derive a
-- generated column and partition by that.
CREATE TABLE events (
    event_id STRING,
    user_id STRING,
    event_type STRING,
    event_data MAP<STRING, STRING>,
    event_timestamp TIMESTAMP,
    event_date DATE GENERATED ALWAYS AS (CAST(event_timestamp AS DATE))
)
USING DELTA
PARTITIONED BY (event_date)
LOCATION 's3://data-lake/events/';

-- ACID upserts (impossible on raw Parquet files)
MERGE INTO events AS target
USING new_events AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: Query data as it was yesterday
SELECT * FROM events TIMESTAMP AS OF '2024-03-14';

-- Version rollback: Undo a bad data load
RESTORE TABLE events TO VERSION AS OF 42;

-- Schema evolution: Add column without rewriting data
ALTER TABLE events ADD COLUMN session_id STRING;

Anti-Patterns

Anti-Pattern                             | Consequence                                       | Fix
Raw Parquet files without a table format | No ACID, no schema, data corruption risk          | Use Delta Lake, Iceberg, or Hudi
Over-partitioning                        | Too many small files, slow queries                | Partition by date, not by every dimension
No compaction                            | Small files accumulate, read performance degrades | Regular compaction (OPTIMIZE / rewrite)
Skipping schema enforcement              | "Data swamp" — unreliable data quality            | Enforce schema on write
Vendor-locked table format               | Cannot switch compute engines                     | Open formats (Iceberg) for engine flexibility
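The small-file and compaction rows above share one root cause: reading a partition costs roughly one listing/open per file, so thousands of tiny files swamp the actual scan. Compaction commands like Delta's OPTIMIZE are, at heart, a bin-packing pass. A minimal sketch of such a planner (the function and target size are illustrative, not any engine's implementation):

```python
def plan_compaction(file_sizes, target=128_000_000):
    """Greedy bin-packing: group small files into rewrite jobs near `target`.

    Files already at or above the target are left alone; the rest are packed
    so each rewrite produces one near-target output file.
    """
    jobs, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if size >= target:
            continue  # already big enough; no rewrite needed
        if current and current_size + size > target:
            jobs.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        jobs.append(current)
    return jobs

# 1,000 one-megabyte files collapse into 8 rewrite jobs of ~128 MB each.
jobs = plan_compaction([1_000_000] * 1000)
print(len(jobs))  # 8
```

Running a pass like this on a schedule (or using the engine's built-in OPTIMIZE / rewrite-data-files action) keeps read amplification flat even under high-frequency small writes.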

The lakehouse is not a product — it is an architecture pattern. You combine open storage (S3), an open table format (Delta/Iceberg), and any compute engine (Spark, Trino, Flink) to get warehouse-quality analytics on data lake economics.

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
