Data Lakehouse Architecture
Design modern data lakehouses. Covers Delta Lake, Apache Iceberg, Hudi, medallion architecture, ACID guarantees on object storage, time travel, schema evolution, and performance optimization.
The data lakehouse combines the flexibility of a data lake (cheap storage, any format) with the reliability of a data warehouse (ACID transactions, schema enforcement, SQL queries). Before lakehouses, organizations maintained separate lakes and warehouses — ETL-ing data from one to the other, paying for double storage, and fighting data consistency issues between the two.
Table formats like Delta Lake, Apache Iceberg, and Apache Hudi add a metadata layer on top of object storage (S3, GCS, ADLS), enabling warehouse-grade reliability on lake-grade economics.
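That metadata layer is just files. Delta Lake, for example, keeps an ordered log of JSON commits (`_delta_log/00000000000000000000.json`, …) next to the Parquet data, and readers replay that log to determine which data files make up any given table version. A minimal pure-Python sketch of that replay, with invented commit payloads (a real commit carries many more fields):

```python
# Simplified Delta-style commit log: each commit is a list of actions.
# "add" registers a data file; "remove" tombstones one.
commits = [
    [{"add": {"path": "part-000.parquet"}}, {"add": {"path": "part-001.parquet"}}],
    [{"add": {"path": "part-002.parquet"}}],
    # Compaction: rewrite two small files into one larger file.
    [{"remove": {"path": "part-000.parquet"}},
     {"remove": {"path": "part-001.parquet"}},
     {"add": {"path": "part-003.parquet"}}],
]

def table_state(commits, as_of_version=None):
    """Replay the log up to a version; return the live data files."""
    live = set()
    last = len(commits) - 1 if as_of_version is None else as_of_version
    for actions in commits[: last + 1]:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)

print(table_state(commits))                   # latest version
print(table_state(commits, as_of_version=1))  # time travel: version 1
```

Because old versions are just earlier prefixes of the log, time travel falls out for free — as long as the referenced files have not been vacuumed away.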
Lakehouse vs Lake vs Warehouse
| Capability | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage cost | $0.023/GB/mo (S3) | $5-23/TB/mo (compressed) | $0.023/GB/mo (S3) |
| ACID transactions | ❌ | ✅ | ✅ (via table format) |
| Schema enforcement | ❌ | ✅ | ✅ |
| SQL queries | Slow (scan all files) | Fast (indexed) | Fast (metadata pruning) |
| Time travel | ❌ | Limited | ✅ (version history) |
| Open format | ✅ (Parquet, ORC) | ❌ (proprietary) | ✅ (Parquet + metadata) |
| ML support | ✅ | Limited | ✅ |
Medallion Architecture
┌───────────┐     ┌───────────┐     ┌───────────┐
│  BRONZE   │────▶│  SILVER   │────▶│   GOLD    │
│   (Raw)   │     │ (Cleaned) │     │(Business) │
├───────────┤     ├───────────┤     ├───────────┤
│ Append-   │     │ Dedupe,   │     │ Aggregated│
│ only,     │     │ validate, │     │ KPIs,     │
│ schema-   │     │ conform   │     │ dimensions│
│ on-read   │     │ schema    │     │ facts     │
└───────────┘     └───────────┘     └───────────┘
 s3://lake/        s3://lake/        s3://lake/
   bronze/           silver/           gold/
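The layer contracts above can be sketched without any engine. A toy pipeline in plain Python (field names and rules invented for illustration) that applies each layer's responsibility in turn:

```python
from collections import defaultdict

# BRONZE: raw events, append-only -- duplicates and bad records included.
bronze = [
    {"order_id": 1, "category": "books", "amount": 20.0},
    {"order_id": 1, "category": "books", "amount": 20.0},   # duplicate
    {"order_id": 2, "category": "toys",  "amount": -5.0},   # invalid amount
    {"order_id": 3, "category": "books", "amount": 15.0},
]

# SILVER: dedupe on the business key, enforce validity rules.
seen, silver = set(), []
for row in bronze:
    if row["order_id"] in seen or row["amount"] <= 0:
        continue
    seen.add(row["order_id"])
    silver.append(row)

# GOLD: business aggregates per category.
gold = defaultdict(lambda: {"order_count": 0, "revenue": 0.0})
for row in silver:
    kpi = gold[row["category"]]
    kpi["order_count"] += 1
    kpi["revenue"] += row["amount"]

print(dict(gold))  # {'books': {'order_count': 2, 'revenue': 35.0}}
```

The same separation in Spark and Delta Lake follows.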
Implementation with Delta Lake
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp
spark = SparkSession.builder \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# BRONZE: Ingest raw data (append-only)
# kafka.bootstrap.servers is required; point it at your brokers
raw_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "broker1:9092") \
.option("subscribe", "orders") \
.load()
raw_df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "s3://lake/checkpoints/bronze_orders") \
.start("s3://lake/bronze/orders")
# SILVER: Clean and deduplicate
bronze_df = spark.read.format("delta").load("s3://lake/bronze/orders")
silver_df = bronze_df \
.dropDuplicates(["order_id"]) \
.filter("amount > 0") \
.withColumn("processed_at", current_timestamp())
# Upsert (merge) into silver table
# (assumes the silver table already exists; create it with an initial write on first run)
silver_table = DeltaTable.forPath(spark, "s3://lake/silver/orders")
silver_table.alias("target") \
.merge(silver_df.alias("source"), "target.order_id = source.order_id") \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
# GOLD: Business aggregations
gold_df = spark.sql("""
SELECT
date_trunc('day', order_date) as order_date,
category,
COUNT(*) as order_count,
SUM(amount) as revenue,
AVG(amount) as avg_order_value,
COUNT(DISTINCT customer_id) as unique_customers
FROM delta.`s3://lake/silver/orders`
GROUP BY 1, 2
""")
gold_df.write.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.save("s3://lake/gold/daily_revenue")
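The MERGE in the silver step follows plain upsert semantics: a source row whose key matches an existing target row replaces it, and everything else is inserted. A minimal dictionary-based sketch of `whenMatchedUpdateAll` / `whenNotMatchedInsertAll` (data invented):

```python
def merge_upsert(target, source, key):
    """Upsert source rows into target on `key`:
    matched -> replace all columns, not matched -> insert."""
    by_key = {row[key]: row for row in target}
    for row in source:
        by_key[row[key]] = row  # update-all on match, insert otherwise
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 9.0}]
source = [{"order_id": 2, "amount": 12.0}, {"order_id": 3, "amount": 7.0}]
print(merge_upsert(target, source, "order_id"))
# [{'order_id': 1, 'amount': 20.0}, {'order_id': 2, 'amount': 12.0},
#  {'order_id': 3, 'amount': 7.0}]
```

What the table format adds on top of this logic is atomicity: the rewritten files become visible in a single log commit, so readers never observe a half-applied merge.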
Table Format Comparison
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Backed by | Databricks | Netflix/Apple/AWS | Uber |
| ACID | ✅ | ✅ | ✅ |
| Time travel | ✅ (version history) | ✅ (snapshot-based) | ✅ (timeline) |
| Schema evolution | ✅ | ✅ (full) | ✅ |
| Partition evolution | Limited | ✅ (hidden partitioning) | Limited |
| Upserts (MERGE) | ✅ | ✅ | ✅ (optimized) |
| Engine support | Spark, Trino, Flink | Spark, Trino, Flink, Dremio | Spark, Flink, Trino |
| Best for | Databricks ecosystems | Multi-engine, large scale | CDC/streaming workloads |
Performance Optimization
File Compaction
# Delta Lake: Optimize small files
spark.sql("""
OPTIMIZE delta.`s3://lake/silver/orders`
ZORDER BY (customer_id, order_date)
""")
# Vacuum old files (retain 7 days for time travel)
spark.sql("""
VACUUM delta.`s3://lake/silver/orders`
RETAIN 168 HOURS
""")
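ZORDER BY works by interleaving the bits of the clustering columns, so rows that are close in either column land close together on disk; file-level min/max statistics can then prune on both columns at once. A toy Morton-code (Z-order) computation, for illustration only:

```python
def z_value(x, y, bits=16):
    """Interleave the bits of x and y into a single Morton (Z-order) code."""
    z = 0
    for i in range(bits):
        z |= (x >> i & 1) << (2 * i)       # even bit positions from x
        z |= (y >> i & 1) << (2 * i + 1)   # odd bit positions from y
    return z

# Points near each other in either dimension get nearby Z-values, so
# sorting rows by z_value clusters them into the same files.
points = [(7, 7), (0, 1), (1, 1), (1, 0), (0, 0)]
print(sorted(points, key=lambda p: z_value(*p)))
# [(0, 0), (1, 0), (0, 1), (1, 1), (7, 7)]
```

This is why Z-ordering helps queries that filter on `customer_id` *or* `order_date`, where a plain sort would only help the leading column.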
Partition Strategy
| Data Volume | Partition By | File Target Size |
|---|---|---|
| < 1 GB/day | No partition | 256 MB - 1 GB |
| 1-100 GB/day | Date (daily) | 256 MB - 1 GB |
| 100 GB - 1 TB/day | Date + category | 256 MB - 1 GB |
| > 1 TB/day | Date + hour + key hash | 256 MB - 1 GB |
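The reason partitioning must align with query patterns is pruning: the engine reads partition values from table metadata and skips whole groups of files before opening any of them. A small sketch with an invented file listing:

```python
# Files registered in table metadata, keyed by their partition values.
files = [
    {"path": "date=2024-01-01/part-0.parquet", "partition": {"date": "2024-01-01"}},
    {"path": "date=2024-01-02/part-0.parquet", "partition": {"date": "2024-01-02"}},
    {"path": "date=2024-01-03/part-0.parquet", "partition": {"date": "2024-01-03"}},
]

def prune(files, column, predicate):
    """Keep only files whose partition value satisfies the predicate."""
    return [f["path"] for f in files if predicate(f["partition"][column])]

# WHERE date >= '2024-01-02' touches 2 of 3 files; the third is skipped
# from metadata alone, without reading any Parquet footer.
print(prune(files, "date", lambda d: d >= "2024-01-02"))
```

If queries never filter on the partition column, this step eliminates nothing, and the partitioning only adds small-file overhead.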
Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Small files | Thousands of tiny files kill query performance | Compaction jobs, optimize write batch sizes |
| Over-partitioning | Millions of tiny partitions, each holding a few MB | Partition only on low-cardinality columns that queries actually filter on |
| No schema evolution plan | Breaking changes corrupt downstream consumers | Prefer additive changes; use the table format's schema evolution support and validate changes before deploying |
| Treating lake as dump | Raw zone becomes swamp | Medallion architecture with quality gates |
| No metadata management | Nobody knows what’s in the lake | Data catalog (Hive Metastore, AWS Glue, Unity Catalog) |
| Skip silver, go raw→gold | Difficult to debug, no reprocessing path | Always clean data in silver before aggregating in gold |
Checklist
- Table format selected (Delta Lake, Iceberg, or Hudi)
- Medallion architecture: bronze, silver, gold layers defined
- Schema enforcement: schemas validated at each layer
- Partitioning strategy: aligned with query patterns
- Compaction: scheduled optimization to prevent small files
- Time travel: retention policy configured (7-30 days)
- Data catalog: all tables registered with descriptions
- Access control: column-level and row-level security
- Monitoring: data freshness, quality scores, file sizes
- Cost tracking: storage costs per layer, compute per job
:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For data lakehouse consulting, visit garnetgrid.com. :::