
Data Lakehouse Architecture

Design modern data lakehouses. Covers Delta Lake, Apache Iceberg, Hudi, medallion architecture, ACID guarantees on object storage, time travel, schema evolution, and performance optimization.

The data lakehouse combines the flexibility of a data lake (cheap storage, any format) with the reliability of a data warehouse (ACID transactions, schema enforcement, SQL queries). Before lakehouses, organizations maintained separate lakes and warehouses — ETL-ing data from one to the other, paying for double storage, and fighting data consistency issues between the two.

Table formats like Delta Lake, Apache Iceberg, and Apache Hudi add a metadata layer on top of object storage (S3, GCS, ADLS), enabling warehouse-grade reliability on lake-grade economics.


Lakehouse vs Lake vs Warehouse

| Capability | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage cost | $0.023/GB/mo (S3) | $5-23/TB/mo (compressed) | $0.023/GB/mo (S3) |
| ACID transactions | ❌ | ✅ | ✅ (via table format) |
| Schema enforcement | ❌ (schema-on-read) | ✅ | ✅ |
| SQL queries | Slow (scan all files) | Fast (indexed) | Fast (metadata pruning) |
| Time travel | ❌ | Limited | ✅ (version history) |
| Open format | ✅ (Parquet, ORC) | ❌ (proprietary) | ✅ (Parquet + metadata) |
| ML support | ✅ | Limited | ✅ |

Medallion Architecture

┌──────────┐     ┌───────────┐     ┌───────────┐
│ BRONZE   │────▶│ SILVER    │────▶│ GOLD      │
│ (Raw)    │     │ (Cleaned) │     │ (Business)│
├──────────┤     ├───────────┤     ├───────────┤
│ Append-  │     │ Dedupe,   │     │ Aggregated│
│ only,    │     │ validate, │     │ KPIs,     │
│ schema-  │     │ conform   │     │ dimensions│
│ on-read  │     │ schema    │     │ & facts   │
└──────────┘     └───────────┘     └───────────┘
 s3://lake/       s3://lake/        s3://lake/
 bronze/          silver/           gold/
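The three hops can be sketched end-to-end without any engine at all; a toy pipeline over list-of-dict records, illustrative only (the real Delta implementation follows below):

```python
from collections import defaultdict

# BRONZE: raw events, append-only, duplicates and bad rows included
bronze = [
    {"order_id": 1, "category": "books", "amount": 20.0},
    {"order_id": 1, "category": "books", "amount": 20.0},   # duplicate
    {"order_id": 2, "category": "toys",  "amount": -5.0},   # invalid amount
    {"order_id": 3, "category": "books", "amount": 12.5},
]

# SILVER: deduplicate on the business key, enforce validity rules
seen: set[int] = set()
silver = []
for row in bronze:
    if row["order_id"] in seen or row["amount"] <= 0:
        continue
    seen.add(row["order_id"])
    silver.append(row)

# GOLD: aggregate cleaned rows into business-level metrics per category
gold: dict = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
for row in silver:
    gold[row["category"]]["orders"] += 1
    gold[row["category"]]["revenue"] += row["amount"]

print(dict(gold))  # {'books': {'orders': 2, 'revenue': 32.5}}
```

Note that gold is derived entirely from silver, never from bronze — that is what preserves a clean reprocessing path when business logic changes.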

Implementation with Delta Lake

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# BRONZE: Ingest raw data (append-only)
# kafka.bootstrap.servers is required by the Kafka source; use your broker addresses
raw_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "orders") \
    .load()

raw_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://lake/checkpoints/bronze_orders") \
    .start("s3://lake/bronze/orders")

# SILVER: Clean and deduplicate
bronze_df = spark.read.format("delta").load("s3://lake/bronze/orders")

silver_df = bronze_df \
    .dropDuplicates(["order_id"]) \
    .filter("amount > 0") \
    .withColumn("processed_at", current_timestamp())

# Upsert (merge) into the silver table (assumes the table has already been created)
silver_table = DeltaTable.forPath(spark, "s3://lake/silver/orders")
silver_table.alias("target") \
    .merge(silver_df.alias("source"), "target.order_id = source.order_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

# GOLD: Business aggregations
gold_df = spark.sql("""
    SELECT
        date_trunc('day', order_date) as order_date,
        category,
        COUNT(*) as order_count,
        SUM(amount) as revenue,
        AVG(amount) as avg_order_value,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM delta.`s3://lake/silver/orders`
    GROUP BY 1, 2
""")

gold_df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("s3://lake/gold/daily_revenue")
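With every layer in Delta, any retained version of a table can be queried directly. These queries use standard Delta SQL time-travel syntax against the table paths above (timestamps must fall within the VACUUM retention window):

```sql
-- List the commit history for the silver table
DESCRIBE HISTORY delta.`s3://lake/silver/orders`;

-- Read the table as of a specific version...
SELECT * FROM delta.`s3://lake/silver/orders` VERSION AS OF 3;

-- ...or as of a timestamp
SELECT * FROM delta.`s3://lake/silver/orders` TIMESTAMP AS OF '2024-01-15';
```

This is the mechanism behind audits, reproducible ML training sets, and rollback after a bad write.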

Table Format Comparison

| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Backed by | Databricks | Netflix/Apple/AWS | Uber |
| ACID | ✅ | ✅ | ✅ |
| Time travel | ✅ (version history) | ✅ (snapshot-based) | ✅ (timeline) |
| Schema evolution | ✅ | ✅ (full) | ✅ |
| Partition evolution | Limited | ✅ (hidden partitions) | Limited |
| Upserts (MERGE) | ✅ | ✅ | ✅ (optimized) |
| Engine support | Spark, Trino, Flink | Spark, Trino, Flink, Dremio | Spark, Flink, Trino |
| Best for | Databricks ecosystems | Multi-engine, large scale | CDC/streaming workloads |

Performance Optimization

File Compaction

# Delta Lake: Optimize small files
spark.sql("""
    OPTIMIZE delta.`s3://lake/silver/orders`
    ZORDER BY (customer_id, order_date)
""")

# Vacuum old files (retain 7 days for time travel)
spark.sql("""
    VACUUM delta.`s3://lake/silver/orders`
    RETAIN 168 HOURS
""")

Partition Strategy

| Data Volume | Partition By | File Target Size |
|---|---|---|
| < 1 GB/day | No partition | 256 MB - 1 GB |
| 1-100 GB/day | Date (daily) | 256 MB - 1 GB |
| 100 GB - 1 TB/day | Date + category | 256 MB - 1 GB |
| > 1 TB/day | Date + hour + key hash | 256 MB - 1 GB |
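The strategy table maps directly onto the object-store path layout. A sketch of how the highest-volume tier (date + hour + key hash) might compute a partition path — `partition_path` and the 16-bucket count are hypothetical choices for illustration:

```python
import hashlib
from datetime import datetime

def partition_path(base: str, event_time: datetime, key: str, buckets: int = 16) -> str:
    """Date + hour + key-hash partitioning for >1 TB/day tables (per the table above)."""
    # Stable hash of the business key spreads hot keys across a fixed bucket count
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
    return (f"{base}/date={event_time:%Y-%m-%d}"
            f"/hour={event_time:%H}/bucket={bucket:02d}")

path = partition_path("s3://lake/silver/orders", datetime(2024, 3, 1, 14, 30), "customer-42")
print(path)  # s3://lake/silver/orders/date=2024-03-01/hour=14/bucket=...
```

Keeping the bucket count fixed and low-cardinality is deliberate: queries that filter on date and hour prune most partitions, while the hash buckets merely cap per-partition file counts.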

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Small files | Thousands of tiny files kill query performance | Compaction jobs; optimize write batch sizes |
| Over-partitioning | Millions of partitions, each holding ~1 MB | Partition only on low-cardinality columns that queries actually filter by |
| No schema evolution plan | Breaking changes corrupt downstream consumers | Plan evolution up front (e.g. Iceberg for partition evolution) and validate schemas |
| Treating lake as dump | Raw zone becomes a swamp | Medallion architecture with quality gates |
| No metadata management | Nobody knows what's in the lake | Data catalog (Hive Metastore, AWS Glue, Unity Catalog) |
| Skip silver, go raw→gold | Difficult to debug, no reprocessing path | Always clean data in silver before aggregating in gold |

Checklist

  • Table format selected (Delta Lake, Iceberg, or Hudi)
  • Medallion architecture: bronze, silver, gold layers defined
  • Schema enforcement: schemas validated at each layer
  • Partitioning strategy: aligned with query patterns
  • Compaction: scheduled optimization to prevent small files
  • Time travel: retention policy configured (7-30 days)
  • Data catalog: all tables registered with descriptions
  • Access control: column-level and row-level security
  • Monitoring: data freshness, quality scores, file sizes
  • Cost tracking: storage costs per layer, compute per job

:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For data lakehouse consulting, visit garnetgrid.com.
:::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
