
Data Lake Architecture

Design and implement data lakes that scale from gigabytes to petabytes. Covers lakehouse architecture, storage formats (Parquet, Delta, Iceberg), partitioning strategies, data lifecycle management, query engines, and the patterns that prevent data lakes from becoming data swamps.

A data lake stores raw data in its native format at any scale. Unlike data warehouses that require upfront schema design, data lakes accept any data — structured, semi-structured, unstructured — and apply schema on read. The modern lakehouse combines the flexibility of data lakes with the reliability of data warehouses.
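
As a minimal sketch of schema-on-read with PySpark (the bucket path and column names here are illustrative, not from a real pipeline): the raw JSON files stay exactly as they landed, and the structure is applied only at read time.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema is declared when the data is read, not when it was written
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("amount", DoubleType()),
])

# Raw JSON stays in its native format in the lake; structure is applied on read
events = spark.read.schema(event_schema).json("s3://lake/raw/events/")
events.createOrReplaceTempView("events")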


Lakehouse Architecture

                    ┌───────────────────────┐
                    │     Query Engines     │
                    │ Spark, Trino, DuckDB  │
                    └───────────┬───────────┘
                                │
                    ┌───────────┴───────────┐
                    │      Table Format     │
                    │ Delta Lake / Iceberg  │
                    │   ACID, Time Travel   │
                    └───────────┬───────────┘
                                │
                    ┌───────────┴───────────┐
                    │     Storage Layer     │
                    │    S3 / GCS / ADLS    │
                    │   Open file formats   │
                    └───────────┬───────────┘
                                │
        ┌───────────┬───────────┼───────────┬───────────┐
     Parquet       ORC      JSON/CSV      Avro       Images
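
The layers above translate into a few lines of session configuration. A minimal sketch, assuming the delta-spark package is installed and S3 credentials are configured (the table path and the status column are illustrative):

from pyspark.sql import SparkSession

# Query engine (Spark) + table format (Delta Lake) + object storage (S3)
spark = (
    SparkSession.builder
    .appName("lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The table format layer adds ACID semantics over plain Parquet files in S3
orders = spark.read.format("delta").load("s3://lake/curated/orders")
orders.groupBy("status").count().show()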

Storage Format Comparison

Format          Best For                Compression  Schema            Query Speed
Parquet         Analytics (columnar)    Excellent    Embedded          Fast (columnar pushdown)
ORC             Hive ecosystem          Excellent    Embedded          Fast
Avro            Streaming (row-based)   Good         Embedded          Medium
Delta Lake      ACID on Parquet         Excellent    Transaction log   Fast
Apache Iceberg  Multi-engine lakehouse  Excellent    Metadata catalog  Fast
JSON            Semi-structured         Poor         None              Slow
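
The columnar-versus-row trade-off is easy to observe directly. A rough sketch, reusing the Spark session from the earlier example (the paths are illustrative):

# Same data, two formats: columnar Parquet vs. row-oriented JSON text
df = spark.range(1_000_000).withColumnRenamed("id", "order_id")

df.write.mode("overwrite").parquet("s3://lake/tmp/orders_parquet")  # compressed, columnar
df.write.mode("overwrite").json("s3://lake/tmp/orders_json")        # verbose, no pushdown

# Parquet reads only the columns and row groups a query needs;
# the JSON copy must be parsed line by line in full
spark.read.parquet("s3://lake/tmp/orders_parquet").where("order_id < 100").count()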

Partitioning

# Partition by date for time-series queries
# s3://data-lake/events/year=2026/month=03/day=04/

# Good: Queries filter by date → reads only relevant partitions
# SELECT * FROM events WHERE year = 2026 AND month = 3

# Bad: Too many small files (over-partitioning)
# s3://data-lake/events/year=2026/month=03/day=04/hour=19/minute=08/
# → 525,600 partitions per year = tiny files

# Rules of thumb:
# Each partition should contain > 128MB of data
# Total partitions < 10,000 for a table
# Partition by query patterns, not just time
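
A PySpark sketch of the layout above, assuming an events DataFrame with an event_time column (the path and column names are illustrative):

from pyspark.sql import functions as F

# Derive coarse date columns and write one directory per partition
(events
    .withColumn("year", F.year("event_time"))
    .withColumn("month", F.month("event_time"))
    .withColumn("day", F.dayofmonth("event_time"))
    .write
    .partitionBy("year", "month", "day")
    .mode("append")
    .parquet("s3://data-lake/events/"))

# Filters on the partition columns prune every other directory
spark.read.parquet("s3://data-lake/events/") \
    .where("year = 2026 AND month = 3 AND day = 4") \
    .count()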

Delta Lake (ACID on Data Lake)

from delta.tables import DeltaTable

# Assumes a Delta-enabled SparkSession (`spark`) and existing
# DataFrames `df` and `new_data` created upstream.

# Write with ACID guarantees
df.write.format("delta").mode("append").save("s3://lake/orders")

# Time travel: query historical data
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2026-03-03") \
    .load("s3://lake/orders")

# MERGE (upsert)
delta_table = DeltaTable.forPath(spark, "s3://lake/orders")
delta_table.alias("target") \
    .merge(new_data.alias("source"), "target.id = source.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()

# Optimize (compact small files)
delta_table.optimize().executeCompaction()

# Z-order (co-locate related data for faster queries)
delta_table.optimize().executeZOrderBy("customer_id")

# Vacuum (delete old versions)
delta_table.vacuum(168)  # Keep 7 days of history

Data Lifecycle

zones:
  raw:
    path: "s3://lake/raw/"
    format: "Original format (JSON, CSV, Avro)"
    retention: "7 years"
    access: "Data engineers only"
    purpose: "Immutable landing zone, source of truth"
    
  curated:
    path: "s3://lake/curated/"
    format: "Delta Lake / Iceberg (Parquet)"
    retention: "5 years"
    access: "Data team + approved analysts"
    purpose: "Cleaned, validated, typed"
    
  aggregated:
    path: "s3://lake/aggregated/"
    format: "Delta Lake (Parquet)"
    retention: "3 years"
    access: "All analysts and dashboards"
    purpose: "Pre-computed metrics, reports"
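
A sketch of the raw-to-curated promotion implied by these zones; the column names, casts, and validation rules are illustrative assumptions, not a prescribed pipeline:

from pyspark.sql import functions as F

# Raw zone: immutable landing data in its original format
raw = spark.read.json("s3://lake/raw/orders/")

# Clean, type, and deduplicate before promoting to the curated zone
curated = (
    raw
    .withColumn("order_id", F.col("order_id").cast("string"))
    .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .dropna(subset=["order_id", "order_ts"])
    .dropDuplicates(["order_id"])
)

# Curated zone: typed, validated Delta tables
curated.write.format("delta").mode("append").save("s3://lake/curated/orders")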

Anti-Patterns

Anti-Pattern                   Consequence                                Fix
No schema enforcement          Data becomes unusable swamp                Schema-on-write for curated zone
Tiny files (< 1MB each)        Slow queries, high metadata overhead       Compaction, proper batch sizes
No partitioning strategy       Full table scan for every query            Partition by primary query filter
No data lifecycle              Storage costs grow forever                 Tiered storage, retention policies
Single format for everything   Suboptimal for different access patterns   Columnar for analytics, row for streaming

A data lake without governance becomes a data swamp. The difference is structure: clear zones, enforced schemas, lifecycle management, and access controls.
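
Schema enforcement is often the cheapest piece of that governance to add. With Delta tables, an append whose schema does not match the target fails instead of silently polluting the curated zone. A sketch, assuming the curated orders table from earlier (the mismatched column is deliberately fabricated):

from pyspark.sql.utils import AnalysisException

# A bad batch: amount arrives as a string instead of a decimal
bad_batch = spark.createDataFrame(
    [("A-1001", "not-a-number")], ["order_id", "amount"]
)

try:
    # Delta rejects appends whose schema does not match the table
    bad_batch.write.format("delta").mode("append").save("s3://lake/curated/orders")
except AnalysisException as err:
    print(f"Rejected by schema enforcement: {err}")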

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
