Data Lakehouse Architecture
Combine data lake flexibility with data warehouse performance. Covers lakehouse design principles, Delta Lake, Apache Iceberg, table formats, schema evolution, time travel, and the patterns that eliminate the data lake vs. warehouse tradeoff.
Data lakes store everything cheaply but are slow and unreliable for analytics. Data warehouses are fast and reliable but expensive and rigid. The lakehouse combines both: open file formats on cheap object storage with ACID transactions, schema enforcement, and warehouse-level query performance.
Architecture Comparison
| | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Storage | Proprietary (Snowflake, BigQuery, Redshift) | Object storage (S3, GCS, ADLS) | Object storage (S3, GCS, ADLS), cheap |
| Format | Internal, optimized | Open (Parquet, ORC, JSON, CSV) | Open + table format (Delta Lake, Iceberg, Hudi) |
| Cost | $$$ (compute + storage bundled) | $ (cheap storage) | $$ (cheap storage, efficient compute) |
| Schema | Enforced on write (schema-on-write) | None (schema-on-read) | Enforced on write, with evolution |
| ACID | Yes | No (eventual consistency, no transactions) | Yes (transactions on object storage) |
| ML/DS | Export data to a separate system | Direct access ✓ | Direct access ✓ |
| Bottom line | Fast and reliable, but expensive and rigid | "Data swamp": unreliable, slow, no governance | Best of both: warehouse reliability + lake flexibility |
Table Formats
Delta Lake (Databricks):
Transaction log: JSON files tracking changes
Key features:
☐ ACID transactions on Spark/object storage
☐ Schema enforcement and evolution
☐ Time travel (query historical versions)
☐ MERGE (upserts on data lake)
☐ Z-ordering (data layout optimization)
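MERGE, time travel, and schema evolution are shown in the Implementation section below; Z-ordering is a one-line maintenance command. A minimal sketch, using the events table defined in the Implementation section:
-- Rewrite files so rows with similar user_id values land in the same files;
-- selective queries on user_id can then skip most files entirely
OPTIMIZE events ZORDER BY (user_id);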
Apache Iceberg (Netflix):
Transaction log: Manifest files with snapshot isolation
Key features:
☐ Partition evolution (change partitioning without rewrite)
☐ Hidden partitioning (query without knowing partition scheme)
☐ Schema evolution (add/rename/drop columns safely)
☐ Time travel and rollback
☐ Engine-agnostic (Spark, Trino, Flink, Dremio)
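Hidden partitioning and partition evolution are easiest to see in Spark SQL. A rough sketch, assuming Iceberg's Spark runtime and SQL extensions are configured and a catalog named lakehouse exists (catalog, schema, and table names are illustrative):
-- Partition by a transform of a timestamp; the partition never appears as a column
CREATE TABLE lakehouse.analytics.events_iceberg (
    event_id STRING,
    user_id  STRING,
    event_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_ts));
-- Hidden partitioning: filtering on event_ts prunes partitions automatically
SELECT count(*) FROM lakehouse.analytics.events_iceberg
WHERE event_ts >= TIMESTAMP '2024-03-01 00:00:00';
-- Partition evolution: new data lands in hourly partitions, old files are not rewritten
ALTER TABLE lakehouse.analytics.events_iceberg DROP PARTITION FIELD days(event_ts);
ALTER TABLE lakehouse.analytics.events_iceberg ADD PARTITION FIELD hours(event_ts);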
Apache Hudi (Uber):
Transaction log: Timeline metadata
Key features:
☐ Upserts and deletes optimized for streaming
☐ Incremental processing (only changed data)
☐ Record-level change tracking
☐ Compaction for read optimization
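Hudi's streaming-oriented upserts hinge on a record key and a pre-combine field declared on the table. A rough sketch in Spark SQL (table, key, and field names are illustrative):
-- Merge-on-Read table: fast record-level upserts on write, compacted later for reads
CREATE TABLE events_hudi (
    event_id STRING,
    user_id  STRING,
    amount   DOUBLE,
    ts       TIMESTAMP
)
USING hudi
TBLPROPERTIES (
    type            = 'mor',        -- Merge-on-Read storage layout
    primaryKey      = 'event_id',   -- record key used to match upserts and deletes
    preCombineField = 'ts'          -- when two writes share a key, the latest ts wins
);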
Implementation
-- Delta Lake: Creating and querying a lakehouse table
-- Create a Delta table on object storage (external table at an explicit LOCATION)
CREATE TABLE events (
    event_id        STRING,
    user_id         STRING,
    event_type      STRING,
    event_data      MAP<STRING, STRING>,
    event_timestamp TIMESTAMP,
    event_date      DATE GENERATED ALWAYS AS (CAST(event_timestamp AS DATE))
)
USING DELTA
PARTITIONED BY (event_date)  -- Delta partitions on columns, not expressions; the generated column fills that gap
LOCATION 's3://data-lake/events/';
-- ACID upserts (impossible on raw Parquet files)
MERGE INTO events AS target
USING new_events AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Time travel: Query data as it was yesterday
SELECT * FROM events TIMESTAMP AS OF '2024-03-14';
-- Version rollback: Undo a bad data load
RESTORE TABLE events TO VERSION AS OF 42;
-- Schema evolution: Add column without rewriting data
ALTER TABLE events ADD COLUMNS (session_id STRING);
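-- Pick the version or timestamp to travel to / restore by inspecting the commit history
-- (DESCRIBE HISTORY is Delta-specific; each row lists version, timestamp, and operation)
DESCRIBE HISTORY events;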
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Raw Parquet files without table format | No ACID, no schema, data corruption risk | Use Delta Lake, Iceberg, or Hudi |
| Over-partitioning | Too many small files, slow queries | Partition by date, not by every dimension |
| No compaction | Small files accumulate, read performance degrades | Regular compaction (OPTIMIZE / rewrite) |
| Skip schema enforcement | "Data swamp": unreliable data quality | Schema enforcement on write |
| Vendor lock-in table format | Cannot switch compute engines | Open formats (Iceberg) for engine flexibility |
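The compaction fix differs by table format; a sketch of the two most common maintenance commands (the Delta table is the events table from above, the Iceberg catalog and table names are illustrative):
-- Delta: bin-pack small files into larger ones
OPTIMIZE events;
-- Iceberg (Spark procedures): rewrite small files, then expire snapshots no longer needed for time travel
CALL lakehouse.system.rewrite_data_files(table => 'analytics.events_iceberg');
CALL lakehouse.system.expire_snapshots(table => 'analytics.events_iceberg', older_than => TIMESTAMP '2024-03-01 00:00:00');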
The lakehouse is not a product — it is an architecture pattern. You combine open storage (S3), an open table format (Delta/Iceberg), and any compute engine (Spark, Trino, Flink) to get warehouse-quality analytics on data lake economics.
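To make that engine flexibility concrete: the same Iceberg table can be loaded by Spark and queried by Trino with no copy in between. A sketch, assuming the catalog is registered as lakehouse in Spark and iceberg in Trino, and a hypothetical staging_events source:
-- Spark SQL: batch write into the lakehouse table
INSERT INTO lakehouse.analytics.events_iceberg
SELECT event_id, user_id, event_ts FROM staging_events;
-- Trino: interactive analytics over the same files and metadata
SELECT user_id, count(*) AS event_count
FROM iceberg.analytics.events_iceberg
GROUP BY user_id;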