Data Lake Architecture
Design and implement data lakes that scale from gigabytes to petabytes. Covers lakehouse architecture, storage and table formats (Parquet, Delta Lake, Iceberg), partitioning strategies, data lifecycle management, query engines, and the patterns that prevent data lakes from becoming data swamps.
A data lake stores raw data in its native format at any scale. Unlike data warehouses that require upfront schema design, data lakes accept any data — structured, semi-structured, unstructured — and apply schema on read. The modern lakehouse combines the flexibility of data lakes with the reliability of data warehouses.
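As a small illustration of schema-on-read (the path and fields here are hypothetical), no table definition exists up front; the schema is discovered only when the files are queried:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Nothing was declared when these JSON files landed in the lake;
# Spark infers the schema at read time
clicks = spark.read.json("s3://data-lake/raw/clicks/2026/03/04/")
clicks.printSchema()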
Lakehouse Architecture
            ┌──────────────────────┐
            │    Query Engines     │
            │ Spark, Trino, DuckDB │
            └──────────┬───────────┘
                       │
            ┌──────────┴───────────┐
            │     Table Format     │
            │ Delta Lake / Iceberg │
            │  ACID, Time Travel   │
            └──────────┬───────────┘
                       │
            ┌──────────┴───────────┐
            │    Storage Layer     │
            │   S3 / GCS / ADLS    │
            │  Open file formats   │
            └──────────┬───────────┘
                       │
┌──────────┬───────────┼───────────┬──────────┐
Parquet   ORC      JSON/CSV      Avro    Images
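Because every layer sits on open formats, any engine can query the same files in place. A minimal sketch with DuckDB (the bucket path is hypothetical, and S3 access is assumed to be configured already):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # S3 support
con.execute("LOAD httpfs")

# Aggregate directly over Parquet files in object storage, no ingestion step
rows = con.execute("""
    SELECT event_type, count(*) AS events
    FROM read_parquet('s3://data-lake/events/year=2026/month=03/*/*.parquet')
    GROUP BY event_type
""").fetchall()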
Storage Format Comparison
| Format | Best For | Compression | Schema Storage | Query Speed |
|---|---|---|---|---|
| Parquet | Analytics (columnar) | Excellent | Embedded | Fast (columnar pushdown) |
| ORC | Hive ecosystem | Excellent | Embedded | Fast |
| Avro | Streaming (row-based) | Good | Embedded | Medium |
| Delta Lake | ACID on Parquet | Excellent | Transaction log | Fast |
| Apache Iceberg | Multi-engine lakehouse | Excellent | Metadata catalog | Fast |
| JSON | Semi-structured | Poor | None | Slow |
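To make the trade-offs concrete, here is a short PySpark sketch (paths and column names are illustrative; `spark` is an existing SparkSession) that converts row-oriented JSON into compressed, columnar Parquet:

# Row-oriented JSON: every query reads every byte of every record
events = spark.read.json("s3://data-lake/landing/events/")

# Columnar Parquet with snappy compression
events.write.option("compression", "snappy").parquet("s3://data-lake/parquet/events/")

# Column pruning: only the selected columns are read from storage
spark.read.parquet("s3://data-lake/parquet/events/").select("event_type", "user_id").show()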
Partitioning
# Partition by date for time-series queries
# s3://data-lake/events/year=2026/month=03/day=04/
# Good: Queries filter by date → reads only relevant partitions
# SELECT * FROM events WHERE year = 2026 AND month = 3
# Bad: Too many small files (over-partitioning)
# s3://data-lake/events/year=2026/month=03/day=04/hour=19/minute=08/
# → 525,600 partitions per year = tiny files
# Rules of thumb:
# Each partition should hold at least ~128 MB of data
# Keep the total partition count under ~10,000 per table
# Partition by your dominant query filters, not just time (see the sketch below)
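A minimal sketch of the partitioned write and the pruned read (the `events` DataFrame, its columns, and the bucket path are assumptions for illustration):

(
    events
    .repartition("year", "month", "day")   # fewer, larger files per partition
    .write
    .partitionBy("year", "month", "day")
    .mode("append")
    .parquet("s3://data-lake/events/")
)

# The date filter prunes down to one day's directory instead of scanning the table
daily = (
    spark.read.parquet("s3://data-lake/events/")
    .where("year = 2026 AND month = 3 AND day = 4")
)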
Delta Lake (ACID on Data Lake)
from delta.tables import DeltaTable
# Write with ACID guarantees
df.write.format("delta").mode("append").save("s3://lake/orders")
# Time travel: query historical data
df_yesterday = (
    spark.read.format("delta")
    .option("timestampAsOf", "2026-03-03")
    .load("s3://lake/orders")
)
# MERGE (upsert)
delta_table = DeltaTable.forPath(spark, "s3://lake/orders")
(
    delta_table.alias("target")
    .merge(new_data.alias("source"), "target.id = source.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
# Optimize (compact small files)
delta_table.optimize().executeCompaction()
# Z-order (co-locate related data for faster queries)
delta_table.optimize().executeZOrderBy("customer_id")
# Vacuum (delete old versions)
delta_table.vacuum(168) # Keep 7 days of history
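Apache Iceberg provides the same guarantees through a catalog-centric metadata model. A rough equivalent of the operations above, assuming a Spark session already configured with an Iceberg catalog named `lake` (the catalog and table names here are illustrative):

# Create or replace an Iceberg table from a DataFrame
df.writeTo("lake.sales.orders").using("iceberg").createOrReplace()

# Append new rows atomically
new_data.writeTo("lake.sales.orders").append()

# Time travel via Spark SQL
df_yesterday = spark.sql(
    "SELECT * FROM lake.sales.orders TIMESTAMP AS OF '2026-03-03'"
)

# Compact small files with a maintenance procedure
spark.sql("CALL lake.system.rewrite_data_files('sales.orders')")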
Data Lifecycle
zones:
  raw:
    path: "s3://lake/raw/"
    format: "Original format (JSON, CSV, Avro)"
    retention: "7 years"
    access: "Data engineers only"
    purpose: "Immutable landing zone, source of truth"
  curated:
    path: "s3://lake/curated/"
    format: "Delta Lake / Iceberg (Parquet)"
    retention: "5 years"
    access: "Data team + approved analysts"
    purpose: "Cleaned, validated, typed"
  aggregated:
    path: "s3://lake/aggregated/"
    format: "Delta Lake (Parquet)"
    retention: "3 years"
    access: "All analysts and dashboards"
    purpose: "Pre-computed metrics, reports"
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| No schema enforcement | Data becomes unusable swamp | Schema-on-write for curated zone |
| Tiny files (< 1MB each) | Slow queries, high metadata overhead | Compaction, proper batch sizes |
| No partitioning strategy | Full table scan for every query | Partition by primary query filter |
| No data lifecycle | Storage costs grow forever | Tiered storage, retention policies |
| Single format for everything | Suboptimal for different access patterns | Columnar for analytics, row for streaming |
A data lake without governance becomes a data swamp. The difference is structure: clear zones, enforced schemas, lifecycle management, and access controls.