Data Lakehouse Architecture
Combine data lake flexibility with data warehouse performance. Covers lakehouse design principles, Delta Lake, Apache Iceberg, table formats, schema evolution, time travel, and the patterns that eliminate the data lake vs. warehouse tradeoff.
Data lakes store everything cheaply but are slow and unreliable for analytics. Data warehouses are fast and reliable but expensive and rigid. The lakehouse combines both: open file formats on cheap object storage with ACID transactions, schema enforcement, and warehouse-level query performance.
Architecture Comparison
| | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Storage | Proprietary (Snowflake, BigQuery, Redshift) | Object storage (S3, GCS, ADLS) | Object storage (S3, GCS, ADLS), cheap |
| Format | Internal, optimized | Open (Parquet, ORC, JSON, CSV) | Open + table format (Delta Lake, Iceberg, Hudi) |
| Cost | $$$ (compute + storage bundled) | $ (cheap storage) | $$ (cheap storage, efficient compute) |
| Schema | Enforced on write (schema-on-write) | None (schema-on-read) | Enforced on write, with evolution |
| ACID | Yes | No (eventual consistency, no transactions) | Yes (transactions on object storage) |
| ML/DS | Export data to a separate system | Direct access ✓ | Direct access ✓ |
| Bottom line | Fast and reliable, but expensive and rigid | "Data swamp": unreliable, slow, no governance | Best of both: warehouse reliability + lake flexibility |
Table Formats
Delta Lake (Databricks):
Transaction log: JSON files tracking changes
Key features:
☐ ACID transactions on Spark/object storage
☐ Schema enforcement and evolution
☐ Time travel (query historical versions)
☐ MERGE (upserts on data lake)
☐ Z-ordering (data layout optimization)
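MERGE, time travel, and schema evolution are shown in the Implementation section below; Z-ordering is a one-line maintenance command. A minimal sketch, using the events table defined in the Implementation section:
-- Rewrite files so rows with similar user_id values land in the same files;
-- selective queries on user_id can then skip most files entirely
OPTIMIZE events ZORDER BY (user_id);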
Apache Iceberg (Netflix):
Transaction log: Manifest files with snapshot isolation
Key features:
☐ Partition evolution (change partitioning without rewrite)
☐ Hidden partitioning (query without knowing partition scheme)
☐ Schema evolution (add/rename/drop columns safely)
☐ Time travel and rollback
☐ Engine-agnostic (Spark, Trino, Flink, Dremio)
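Hidden partitioning and partition evolution are easiest to see in Spark SQL. A rough sketch, assuming Iceberg's Spark runtime and SQL extensions are configured and a catalog named lakehouse exists (catalog, schema, and table names are illustrative):
-- Partition by a transform of a timestamp; the partition never appears as a column
CREATE TABLE lakehouse.analytics.events_iceberg (
    event_id STRING,
    user_id  STRING,
    event_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(event_ts));
-- Hidden partitioning: filtering on event_ts prunes partitions automatically
SELECT count(*) FROM lakehouse.analytics.events_iceberg
WHERE event_ts >= TIMESTAMP '2024-03-01 00:00:00';
-- Partition evolution: new data lands in hourly partitions, old files are not rewritten
ALTER TABLE lakehouse.analytics.events_iceberg DROP PARTITION FIELD days(event_ts);
ALTER TABLE lakehouse.analytics.events_iceberg ADD PARTITION FIELD hours(event_ts);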
Apache Hudi (Uber):
Transaction log: Timeline metadata
Key features:
☐ Upserts and deletes optimized for streaming
☐ Incremental processing (only changed data)
☐ Record-level change tracking
☐ Compaction for read optimization
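Hudi's streaming-oriented upserts hinge on a record key and a pre-combine field declared on the table. A rough sketch in Spark SQL (table, key, and field names are illustrative):
-- Merge-on-Read table: fast record-level upserts on write, compacted later for reads
CREATE TABLE events_hudi (
    event_id STRING,
    user_id  STRING,
    amount   DOUBLE,
    ts       TIMESTAMP
)
USING hudi
TBLPROPERTIES (
    type            = 'mor',        -- Merge-on-Read storage layout
    primaryKey      = 'event_id',   -- record key used to match upserts and deletes
    preCombineField = 'ts'          -- when two writes share a key, the latest ts wins
);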
Implementation
-- Delta Lake: Creating and querying a lakehouse table
-- Create a Delta table on object storage (external table at an explicit LOCATION)
CREATE TABLE events (
    event_id        STRING,
    user_id         STRING,
    event_type      STRING,
    event_data      MAP<STRING, STRING>,
    event_timestamp TIMESTAMP,
    event_date      DATE GENERATED ALWAYS AS (CAST(event_timestamp AS DATE))
)
USING DELTA
PARTITIONED BY (event_date)  -- Delta partitions on columns, not expressions; the generated column fills that gap
LOCATION 's3://data-lake/events/';
-- ACID upserts (impossible on raw Parquet files)
MERGE INTO events AS target
USING new_events AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Time travel: Query data as it was yesterday
SELECT * FROM events TIMESTAMP AS OF '2024-03-14';
-- Version rollback: Undo a bad data load
RESTORE TABLE events TO VERSION AS OF 42;
-- Schema evolution: Add column without rewriting data
ALTER TABLE events ADD COLUMNS (session_id STRING);
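-- Pick the version or timestamp to travel to / restore by inspecting the commit history
-- (DESCRIBE HISTORY is Delta-specific; each row lists version, timestamp, and operation)
DESCRIBE HISTORY events;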
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Raw Parquet files without table format | No ACID, no schema, data corruption risk | Use Delta Lake, Iceberg, or Hudi |
| Over-partitioning | Too many small files, slow queries | Partition by date, not by every dimension |
| No compaction | Small files accumulate, read performance degrades | Regular compaction (OPTIMIZE / rewrite) |
| Skip schema enforcement | "Data swamp": unreliable data quality | Schema enforcement on write |
| Vendor lock-in table format | Cannot switch compute engines | Open formats (Iceberg) for engine flexibility |
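The compaction fix differs by table format; a sketch of the two most common maintenance commands (the Delta table is the events table from above, the Iceberg catalog and table names are illustrative):
-- Delta: bin-pack small files into larger ones
OPTIMIZE events;
-- Iceberg (Spark procedures): rewrite small files, then expire snapshots no longer needed for time travel
CALL lakehouse.system.rewrite_data_files(table => 'analytics.events_iceberg');
CALL lakehouse.system.expire_snapshots(table => 'analytics.events_iceberg', older_than => TIMESTAMP '2024-03-01 00:00:00');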
The lakehouse is not a product — it is an architecture pattern. You combine open storage (S3), an open table format (Delta/Iceberg), and any compute engine (Spark, Trino, Flink) to get warehouse-quality analytics on data lake economics.
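To make that engine flexibility concrete: the same Iceberg table can be loaded by Spark and queried by Trino with no copy in between. A sketch, assuming the catalog is registered as lakehouse in Spark and iceberg in Trino, and a hypothetical staging_events source:
-- Spark SQL: batch write into the lakehouse table
INSERT INTO lakehouse.analytics.events_iceberg
SELECT event_id, user_id, event_ts FROM staging_events;
-- Trino: interactive analytics over the same files and metadata
SELECT user_id, count(*) AS event_count
FROM iceberg.analytics.events_iceberg
GROUP BY user_id;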