Data Lakehouse Architecture
Design modern data lakehouses. Covers Delta Lake, Apache Iceberg, Hudi, medallion architecture, ACID guarantees on object storage, time travel, schema evolution, and performance optimization.
The data lakehouse combines the flexibility of a data lake (cheap storage, any format) with the reliability of a data warehouse (ACID transactions, schema enforcement, SQL queries). Before lakehouses, organizations maintained separate lakes and warehouses — ETL-ing data from one to the other, paying for double storage, and fighting data consistency issues between the two.
Table formats like Delta Lake, Apache Iceberg, and Apache Hudi add a metadata layer on top of object storage (S3, GCS, ADLS), enabling warehouse-grade reliability on lake-grade economics.
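That metadata layer is just files. Delta Lake, for example, keeps an ordered log of JSON commits (`_delta_log/00000000000000000000.json`, …) next to the Parquet data, and readers replay that log to determine which data files make up any given table version. A minimal pure-Python sketch of that replay, with invented commit payloads (a real commit carries many more fields):

```python
# Simplified Delta-style commit log: each commit is a list of actions.
# "add" registers a data file; "remove" tombstones one.
commits = [
    [{"add": {"path": "part-000.parquet"}}, {"add": {"path": "part-001.parquet"}}],
    [{"add": {"path": "part-002.parquet"}}],
    # Compaction: rewrite two small files into one larger file.
    [{"remove": {"path": "part-000.parquet"}},
     {"remove": {"path": "part-001.parquet"}},
     {"add": {"path": "part-003.parquet"}}],
]

def table_state(commits, as_of_version=None):
    """Replay the log up to a version; return the live data files."""
    live = set()
    last = len(commits) - 1 if as_of_version is None else as_of_version
    for actions in commits[: last + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)

print(table_state(commits))                   # latest version
print(table_state(commits, as_of_version=1))  # time travel: version 1
```

Because old versions are just earlier prefixes of the log, time travel falls out for free — as long as the referenced files have not been vacuumed away.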
Lakehouse vs Lake vs Warehouse
| Capability | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage cost | $0.023/GB/mo (S3) | $5-23/TB/mo (compressed) | $0.023/GB/mo (S3) |
| ACID transactions | ❌ | ✅ | ✅ (via table format) |
| Schema enforcement | ❌ | ✅ | ✅ |
| SQL queries | Slow (scan all files) | Fast (indexed) | Fast (metadata pruning) |
| Time travel | ❌ | Limited | ✅ (version history) |
| Open format | ✅ (Parquet, ORC) | ❌ (proprietary) | ✅ (Parquet + metadata) |
| ML support | ✅ | Limited | ✅ |
Medallion Architecture
┌───────────┐     ┌───────────┐     ┌───────────┐
│  BRONZE   │────▶│  SILVER   │────▶│   GOLD    │
│   (Raw)   │     │ (Cleaned) │     │(Business) │
├───────────┤     ├───────────┤     ├───────────┤
│ Append-   │     │ Dedupe,   │     │ Aggregated│
│ only,     │     │ validate, │     │ KPIs,     │
│ schema-   │     │ conform   │     │ dimensions│
│ on-read   │     │ schema    │     │ facts     │
└───────────┘     └───────────┘     └───────────┘
 s3://lake/        s3://lake/        s3://lake/
   bronze/           silver/           gold/
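The layer contracts above can be sketched without any engine. A toy pipeline in plain Python (field names and rules invented for illustration) that applies each layer's responsibility in turn:

```python
from collections import defaultdict

# BRONZE: raw events, append-only -- duplicates and bad records included.
bronze = [
    {"order_id": 1, "category": "books", "amount": 20.0},
    {"order_id": 1, "category": "books", "amount": 20.0},   # duplicate
    {"order_id": 2, "category": "toys",  "amount": -5.0},   # invalid amount
    {"order_id": 3, "category": "books", "amount": 15.0},
]

# SILVER: dedupe on the business key, enforce validity rules.
seen, silver = set(), []
for row in bronze:
    if row["order_id"] in seen or row["amount"] <= 0:
        continue
    seen.add(row["order_id"])
    silver.append(row)

# GOLD: business aggregates per category.
gold = defaultdict(lambda: {"order_count": 0, "revenue": 0.0})
for row in silver:
    kpi = gold[row["category"]]
    kpi["order_count"] += 1
    kpi["revenue"] += row["amount"]

print(dict(gold))  # {'books': {'order_count': 2, 'revenue': 35.0}}
```

The same separation in Spark and Delta Lake follows.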
Implementation with Delta Lake
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp
spark = SparkSession.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# BRONZE: Ingest raw data (append-only)
# kafka.bootstrap.servers is required; point it at your brokers
raw_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "broker1:9092") \
.option("subscribe", "orders") \
.load()
raw_df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "s3://lake/checkpoints/bronze_orders") \
.start("s3://lake/bronze/orders")
# SILVER: Clean and deduplicate
bronze_df = spark.read.format("delta").load("s3://lake/bronze/orders")
silver_df = bronze_df \
.dropDuplicates(["order_id"]) \
.filter("amount > 0") \
.withColumn("processed_at", current_timestamp())
# Upsert (merge) into silver table
# (assumes the silver table already exists; create it with an initial write on first run)
silver_table = DeltaTable.forPath(spark, "s3://lake/silver/orders")
silver_table.alias("target") \
.merge(silver_df.alias("source"), "target.order_id = source.order_id") \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
# GOLD: Business aggregations
gold_df = spark.sql("""
SELECT
date_trunc('day', order_date) as order_date,
category,
COUNT(*) as order_count,
SUM(amount) as revenue,
AVG(amount) as avg_order_value,
COUNT(DISTINCT customer_id) as unique_customers
FROM delta.`s3://lake/silver/orders`
GROUP BY 1, 2
""")
gold_df.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.save("s3://lake/gold/daily_revenue")
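The MERGE in the silver step follows plain upsert semantics: a source row whose key matches an existing target row replaces it, and everything else is inserted. A minimal dictionary-based sketch of `whenMatchedUpdateAll` / `whenNotMatchedInsertAll` (data invented):

```python
def merge_upsert(target, source, key):
    """Upsert source rows into target on `key`:
    matched -> replace all columns, not matched -> insert."""
    by_key = {row[key]: row for row in target}
    for row in source:
        by_key[row[key]] = row  # update-all on match, insert otherwise
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 9.0}]
source = [{"order_id": 2, "amount": 12.0}, {"order_id": 3, "amount": 7.0}]
print(merge_upsert(target, source, "order_id"))
# [{'order_id': 1, 'amount': 20.0}, {'order_id': 2, 'amount': 12.0},
#  {'order_id': 3, 'amount': 7.0}]
```

What the table format adds on top of this logic is atomicity: the rewritten files become visible in a single log commit, so readers never observe a half-applied merge.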
Table Format Comparison
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Backed by | Databricks | Netflix/Apple/AWS | Uber |
| ACID | ✅ | ✅ | ✅ |
| Time travel | ✅ (version history) | ✅ (snapshot-based) | ✅ (timeline) |
| Schema evolution | ✅ | ✅ (full) | ✅ |
| Partition evolution | Limited | ✅ (hidden partitioning) | Limited |
| Upserts (MERGE) | ✅ | ✅ | ✅ (optimized) |
| Engine support | Spark, Trino, Flink | Spark, Trino, Flink, Dremio | Spark, Flink, Trino |
| Best for | Databricks ecosystems | Multi-engine, large scale | CDC/streaming workloads |
Performance Optimization
File Compaction
# Delta Lake: Optimize small files
spark.sql("""
OPTIMIZE delta.`s3://lake/silver/orders`
ZORDER BY (customer_id, order_date)
""")
# Vacuum old files (retain 7 days for time travel)
spark.sql("""
VACUUM delta.`s3://lake/silver/orders`
RETAIN 168 HOURS
""")
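ZORDER BY works by interleaving the bits of the clustering columns, so rows that are close in either column land close together on disk; file-level min/max statistics can then prune on both columns at once. A toy Morton-code (Z-order) computation, for illustration only:

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y into a single Morton (Z-order) code."""
    z = 0
    for i in range(bits):
        z |= (x >> i & 1) << (2 * i)       # even bit positions from x
        z |= (y >> i & 1) << (2 * i + 1)   # odd bit positions from y
    return z

# Points near each other in either dimension get nearby Z-values, so
# sorting rows by z_value clusters them into the same files.
points = [(7, 7), (0, 1), (1, 1), (1, 0), (0, 0)]
print(sorted(points, key=lambda p: z_value(*p)))
# [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
```

This is why Z-ordering helps queries that filter on `customer_id` *or* `order_date`, where a plain sort would only help the leading column.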
Partition Strategy
| Data Volume | Partition By | File Target Size |
|---|---|---|
| < 1 GB/day | No partition | 256 MB - 1 GB |
| 1-100 GB/day | Date (daily) | 256 MB - 1 GB |
| 100 GB - 1 TB/day | Date + category | 256 MB - 1 GB |
| > 1 TB/day | Date + hour + key hash | 256 MB - 1 GB |
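The reason partitioning must align with query patterns is pruning: the engine reads partition values from table metadata and skips whole groups of files before opening any of them. A small sketch with an invented file listing:

```python
# Files registered in table metadata, keyed by their partition values.
files = [
    {"path": "date=2024-01-01/part-0.parquet", "partition": {"date": "2024-01-01"}},
    {"path": "date=2024-01-02/part-0.parquet", "partition": {"date": "2024-01-02"}},
    {"path": "date=2024-01-03/part-0.parquet", "partition": {"date": "2024-01-03"}},
]

def prune(files, column, predicate):
    """Keep only files whose partition value satisfies the predicate."""
    return [f["path"] for f in files if predicate(f["partition"][column])]

# WHERE date >= '2024-01-02' touches 2 of 3 files; the third is skipped
# from metadata alone, without reading any Parquet footer.
print(prune(files, "date", lambda d: d >= "2024-01-02"))
```

If queries never filter on the partition column, this step eliminates nothing, and the partitioning only adds small-file overhead.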
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Small files | Thousands of tiny files kill query performance | Compaction jobs, optimize write batch sizes |
| Over-partitioning | Millions of tiny partitions, each holding a few MB | Partition only on low-cardinality columns that queries actually filter on |
| No schema evolution plan | Breaking changes corrupt downstream consumers | Prefer additive changes; use the table format's schema evolution support and validate changes before deploying |
| Treating lake as dump | Raw zone becomes swamp | Medallion architecture with quality gates |
| No metadata management | Nobody knows what’s in the lake | Data catalog (Hive Metastore, AWS Glue, Unity Catalog) |
| Skip silver, go raw→gold | Difficult to debug, no reprocessing path | Always clean data in silver before aggregating in gold |
Checklist
- Table format selected (Delta Lake, Iceberg, or Hudi)
- Medallion architecture: bronze, silver, gold layers defined
- Schema enforcement: schemas validated at each layer
- Partitioning strategy: aligned with query patterns
- Compaction: scheduled optimization to prevent small files
- Time travel: retention policy configured (7-30 days)
- Data catalog: all tables registered with descriptions
- Access control: column-level and row-level security
- Monitoring: data freshness, quality scores, file sizes
- Cost tracking: storage costs per layer, compute per job
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For data lakehouse consulting, visit garnetgrid.com. :::