Data Lake Architecture
Design and implement data lakes that scale from gigabytes to petabytes. Covers lakehouse architecture, storage and table formats (Parquet, Delta Lake, Iceberg), partitioning strategies, data lifecycle management, query engines, and the patterns that prevent data lakes from becoming data swamps.
A data lake stores raw data in its native format at any scale. Unlike data warehouses that require upfront schema design, data lakes accept any data — structured, semi-structured, unstructured — and apply schema on read. The modern lakehouse combines the flexibility of data lakes with the reliability of data warehouses.
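As a small illustration of schema-on-read (the path and fields here are hypothetical), no table definition exists up front; the schema is discovered only when the files are queried:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Nothing was declared when these JSON files landed in the lake;
# Spark infers the schema at read time
clicks = spark.read.json("s3://data-lake/raw/clicks/2026/03/04/")
clicks.printSchema()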
Lakehouse Architecture
            ┌──────────────────────┐
            │    Query Engines     │
            │ Spark, Trino, DuckDB │
            └──────────┬───────────┘
                       │
            ┌──────────┴───────────┐
            │     Table Format     │
            │ Delta Lake / Iceberg │
            │  ACID, Time Travel   │
            └──────────┬───────────┘
                       │
            ┌──────────┴───────────┐
            │    Storage Layer     │
            │   S3 / GCS / ADLS    │
            │  Open file formats   │
            └──────────┬───────────┘
                       │
┌──────────┬───────────┼───────────┬──────────┐
Parquet   ORC      JSON/CSV      Avro    Images
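Because every layer sits on open formats, any engine can query the same files in place. A minimal sketch with DuckDB (the bucket path is hypothetical, and S3 access is assumed to be configured already):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # S3 support
con.execute("LOAD httpfs")

# Aggregate directly over Parquet files in object storage, no ingestion step
rows = con.execute("""
    SELECT event_type, count(*) AS events
    FROM read_parquet('s3://data-lake/events/year=2026/month=03/*/*.parquet')
    GROUP BY event_type
""").fetchall()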
Storage Format Comparison
| Format | Best For | Compression | Schema Storage | Query Speed |
|---|---|---|---|---|
| Parquet | Analytics (columnar) | Excellent | Embedded | Fast (columnar pushdown) |
| ORC | Hive ecosystem | Excellent | Embedded | Fast |
| Avro | Streaming (row-based) | Good | Embedded | Medium |
| Delta Lake | ACID on Parquet | Excellent | Transaction log | Fast |
| Apache Iceberg | Multi-engine lakehouse | Excellent | Metadata catalog | Fast |
| JSON | Semi-structured | Poor | None | Slow |
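To make the trade-offs concrete, here is a short PySpark sketch (paths and column names are illustrative; `spark` is an existing SparkSession) that converts row-oriented JSON into compressed, columnar Parquet:

# Row-oriented JSON: every query reads every byte of every record
events = spark.read.json("s3://data-lake/landing/events/")

# Columnar Parquet with snappy compression
events.write.option("compression", "snappy").parquet("s3://data-lake/parquet/events/")

# Column pruning: only the selected columns are read from storage
spark.read.parquet("s3://data-lake/parquet/events/").select("event_type", "user_id").show()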
Partitioning
# Partition by date for time-series queries
# s3://data-lake/events/year=2026/month=03/day=04/
# Good: Queries filter by date → reads only relevant partitions
# SELECT * FROM events WHERE year = 2026 AND month = 3
# Bad: Too many small files (over-partitioning)
# s3://data-lake/events/year=2026/month=03/day=04/hour=19/minute=08/
# → 525,600 partitions per year = tiny files
# Rules of thumb:
# Each partition should hold at least ~128 MB of data
# Keep the total partition count under ~10,000 per table
# Partition by your dominant query filters, not just time (see the sketch below)
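A minimal sketch of the partitioned write and the pruned read (the `events` DataFrame, its columns, and the bucket path are assumptions for illustration):

(
    events
    .repartition("year", "month", "day")   # fewer, larger files per partition
    .write
    .partitionBy("year", "month", "day")
    .mode("append")
    .parquet("s3://data-lake/events/")
)

# The date filter prunes down to one day's directory instead of scanning the table
daily = (
    spark.read.parquet("s3://data-lake/events/")
    .where("year = 2026 AND month = 3 AND day = 4")
)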
Delta Lake (ACID on Data Lake)
from delta.tables import DeltaTable
# Write with ACID guarantees
df.write.format("delta").mode("append").save("s3://lake/orders")
# Time travel: query historical data
df_yesterday = (
    spark.read.format("delta")
    .option("timestampAsOf", "2026-03-03")
    .load("s3://lake/orders")
)
# MERGE (upsert)
delta_table = DeltaTable.forPath(spark, "s3://lake/orders")
(
    delta_table.alias("target")
    .merge(new_data.alias("source"), "target.id = source.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
# Optimize (compact small files)
delta_table.optimize().executeCompaction()
# Z-order (co-locate related data for faster queries)
delta_table.optimize().executeZOrderBy("customer_id")
# Vacuum (delete old versions)
delta_table.vacuum(168) # Keep 7 days of history
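Apache Iceberg provides the same guarantees through a catalog-centric metadata model. A rough equivalent of the operations above, assuming a Spark session already configured with an Iceberg catalog named `lake` (the catalog and table names here are illustrative):

# Create or replace an Iceberg table from a DataFrame
df.writeTo("lake.sales.orders").using("iceberg").createOrReplace()

# Append new rows atomically
new_data.writeTo("lake.sales.orders").append()

# Time travel via Spark SQL
df_yesterday = spark.sql(
    "SELECT * FROM lake.sales.orders TIMESTAMP AS OF '2026-03-03'"
)

# Compact small files with a maintenance procedure
spark.sql("CALL lake.system.rewrite_data_files('sales.orders')")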
Data Lifecycle
zones:
  raw:
    path: "s3://lake/raw/"
    format: "Original format (JSON, CSV, Avro)"
    retention: "7 years"
    access: "Data engineers only"
    purpose: "Immutable landing zone, source of truth"
  curated:
    path: "s3://lake/curated/"
    format: "Delta Lake / Iceberg (Parquet)"
    retention: "5 years"
    access: "Data team + approved analysts"
    purpose: "Cleaned, validated, typed"
  aggregated:
    path: "s3://lake/aggregated/"
    format: "Delta Lake (Parquet)"
    retention: "3 years"
    access: "All analysts and dashboards"
    purpose: "Pre-computed metrics, reports"
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No schema enforcement | Data becomes unusable swamp | Schema-on-write for curated zone |
| Tiny files (< 1MB each) | Slow queries, high metadata overhead | Compaction, proper batch sizes |
| No partitioning strategy | Full table scan for every query | Partition by primary query filter |
| No data lifecycle | Storage costs grow forever | Tiered storage, retention policies |
| Single format for everything | Suboptimal for different access patterns | Columnar for analytics, row for streaming |
A data lake without governance becomes a data swamp. The difference is structure: clear zones, enforced schemas, lifecycle management, and access controls.