Batch Processing at Scale

Design scalable batch processing systems. Covers Spark optimization, partitioning strategies, data skew handling, cost optimization, file format selection, and batch pipeline monitoring.

Batch processing handles the workloads that don’t need real-time: nightly ETL, weekly reporting, model training, data backfills, and large-scale transformations. At scale, batch processing is an engineering discipline — the difference between a job that runs in 30 minutes and one that takes 8 hours is usually a few configuration and partitioning decisions.


Spark Optimization

Memory & Partitioning

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.shuffle.partitions", 200) \
    .config("spark.sql.adaptive.enabled", True) \
    .config("spark.sql.adaptive.coalescePartitions.enabled", True) \
    .config("spark.sql.adaptive.skewJoin.enabled", True) \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

# Partition target: 128 MB per partition
def optimal_partitions(data_size_gb, target_partition_mb=128):
    return max(1, int(data_size_gb * 1024 / target_partition_mb))
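To give the helper a concrete sense of scale, here is a usage sketch (the helper is restated so the snippet stands alone; `df` is a hypothetical DataFrame that requires an active SparkSession):

```python
def optimal_partitions(data_size_gb, target_partition_mb=128):
    """Partition count targeting ~128 MB per partition."""
    return max(1, int(data_size_gb * 1024 / target_partition_mb))

# A 50 GB dataset at the 128 MB target yields 400 partitions
n = optimal_partitions(50)
print(n)  # 400

# Applying it to a DataFrame:
# df = df.repartition(n)
```

The `max(1, ...)` floor matters for small inputs: a 50 MB dataset still gets one partition rather than zero.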

Common Optimizations

| Issue | Symptom | Fix |
|---|---|---|
| Too many small files | Slow reads, high overhead | `coalesce()` or compaction job |
| Too few partitions | OOM, low parallelism | `repartition(N)` based on data size |
| Data skew | One task takes 100x longer | Salted keys, broadcast join for small tables |
| Shuffles | Slow, high network I/O | Reduce shuffles, use broadcast joins |
| Full scans | Reading all data for filtered queries | Partition pruning, predicate pushdown |
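The small-files fix in the first row can be sketched as a sizing helper plus a `coalesce()` rewrite. This is a minimal sketch, assuming Parquet input; the S3 path is hypothetical, and the total byte count would come from your object store's listing API:

```python
def compaction_partitions(total_bytes, target_partition_mb=128):
    """How many output files a compaction job should produce."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(1, total_bytes // target_bytes)

# 10 GB of small files compacts into 80 files of ~128 MB each
print(compaction_partitions(10 * 1024**3))  # 80

# Applying it with Spark (path and total_bytes are illustrative):
# df = spark.read.parquet("s3://bucket/events/")
# df.coalesce(compaction_partitions(total_bytes)) \
#     .write.mode("overwrite").parquet("s3://bucket/events_compacted/")
```

`coalesce()` is preferred over `repartition()` here because reducing partition count doesn't require a full shuffle.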

Handling Data Skew

# Problem: 90% of orders belong to 10 power users
# One partition processes most of the data

# Solution: Salt the join key
from pyspark.sql.functions import col, concat, lit, rand, floor

SALT_BUCKETS = 10

# Salt the large (skewed) table
skewed_df = orders.withColumn(
    "salted_key",
    concat(col("customer_id"), lit("_"), floor(rand() * SALT_BUCKETS))
)

# Explode the small table to match all salt buckets
small_df = customers.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
).withColumn(
    "salted_key",
    concat(col("customer_id"), lit("_"), col("salt"))
)

# Join on salted key (evenly distributed)
result = skewed_df.join(small_df, "salted_key")

File Format Selection

| Format | Compression | Read Speed | Write Speed | Best For |
|---|---|---|---|---|
| Parquet | Excellent (columnar) | Fast (column pruning) | Medium | Analytics, warehouse |
| ORC | Excellent | Fast | Medium | Hive ecosystem |
| Avro | Good (row-based) | Medium | Fast | Streaming, row-level ops |
| Delta/Iceberg | Parquet + metadata | Fast + ACID | Medium | Lakehouse |
| CSV | None | Slow | Fast | Data exchange, legacy |
| JSON | None | Slow | Fast | APIs, nested data |

Batch Pipeline Architecture

┌───────────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐
│ Schedule  │────▶│ Extract   │────▶│ Transform │────▶│ Load     │
│ (Airflow) │     │ (Sources) │     │ (Spark)   │     │ (Sink)   │
└───────────┘     └───────────┘     └─────┬─────┘     └──────────┘
                                          │
                                    ┌─────▼──────┐
                                    │ Monitoring │
                                    │ • Duration │
                                    │ • Records  │
                                    │ • Data QA  │
                                    └────────────┘
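The scheduling stage above might look like the following Airflow DAG. This is a hedged sketch, not a drop-in pipeline: the `dag_id`, cron expression, and the three scripts are all hypothetical placeholders, and the `schedule` parameter assumes Airflow 2.4+ (older versions use `schedule_interval`):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",  # off-peak: 02:00 daily
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    transform = BashOperator(
        task_id="transform", bash_command="spark-submit transform.py"
    )
    load = BashOperator(task_id="load", bash_command="python load.py")

    # Linear dependency chain mirroring the diagram above
    extract >> transform >> load
```

Keeping extract, transform, and load as separate tasks (rather than one monolithic script) is what enables per-stage retries and checkpointed restarts.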

Cost Optimization

| Strategy | Savings | How |
|---|---|---|
| Spot/preemptible instances | 60-80% | Use for fault-tolerant batch jobs |
| Right-size clusters | 20-40% | Auto-scaling based on job size |
| Partition pruning | 50-90% compute | Only read relevant partitions |
| Columnar format | 30-70% I/O | Parquet/ORC instead of CSV/JSON |
| Incremental processing | 80-95% | Process only new/changed data |
| Off-peak scheduling | 10-20% | Run during low-demand hours |
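Incremental processing, the biggest saver in the table, usually reduces to tracking a watermark between runs. A minimal sketch assuming a JSON state file and ISO date strings (the Spark usage in comments is hypothetical):

```python
import json
from pathlib import Path

def load_watermark(state_file, default="1970-01-01"):
    """High-water mark from the last successful run (ISO date string)."""
    p = Path(state_file)
    return json.loads(p.read_text())["last_processed"] if p.exists() else default

def save_watermark(state_file, value):
    """Advance the watermark only after the run succeeds."""
    Path(state_file).write_text(json.dumps({"last_processed": value}))

# Sketch of an incremental run (spark, path, and run_date are hypothetical):
# wm = load_watermark("watermark.json")
# delta = spark.read.parquet("s3://bucket/events/").filter(f"event_date > '{wm}'")
# ... transform and load only `delta` ...
# save_watermark("watermark.json", run_date)
```

In production the watermark typically lives in a metadata table or the orchestrator's state store rather than a local file, but the read-filter-advance pattern is the same.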

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Process everything daily | 95% of data hasn't changed | Incremental processing with watermarks |
| Single large job | One failure restarts everything | Modular stages with checkpointing |
| No monitoring | Job fails silently for days | Duration, record count, output validation alerts |
| Wrong file format | CSV at petabyte scale | Parquet/ORC with partitioning |
| Over-provisioned clusters | Paying for idle resources | Auto-scaling, right-sizing after profiling |
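The "no monitoring" fix can start as a simple record-count gate that fails the job before bad data reaches the sink. A sketch with hypothetical thresholds; tune `expected_min` and `max_drop_pct` to your pipeline's normal variance:

```python
def validate_batch(record_count, expected_min, prior_count=None, max_drop_pct=50):
    """Fail fast when a batch run looks wrong, instead of loading bad data."""
    if record_count < expected_min:
        raise ValueError(
            f"record count {record_count} below floor {expected_min}"
        )
    if prior_count and record_count < prior_count * (1 - max_drop_pct / 100):
        raise ValueError(
            f"record count {record_count} dropped >{max_drop_pct}% vs prior {prior_count}"
        )
    return True

# A healthy run passes; a sudden >50% drop would raise before the load stage
validate_batch(record_count=980_000, expected_min=100_000, prior_count=1_000_000)
```

Raising (rather than logging) matters: an exception fails the orchestrator task, which triggers alerting and blocks downstream stages automatically.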

Checklist

  • File format: Parquet/ORC for analytics workloads
  • Partitioning: aligned with query and filter patterns
  • Spark config: adaptive query execution, shuffle partitions tuned
  • Data skew: detected and mitigated (salting, broadcast joins)
  • Incremental: only process new/changed data where possible
  • Monitoring: job duration, record counts, data quality
  • Cost: spot instances, auto-scaling, right-sized clusters
  • Error handling: retries, checkpointing, alerting
  • Backfill: ability to reprocess historical data

:::note[Source] This guide is derived from operational intelligence at Garnet Grid Consulting. For batch processing consulting, visit garnetgrid.com. :::

Jakub Dimitri Rezayev
Founder & Chief Architect • Garnet Grid Consulting

Jakub holds an M.S. in Customer Intelligence & Analytics and a B.S. in Finance & Computer Science from Pace University. With deep expertise spanning D365 F&O, Azure, Power BI, and AI/ML systems, he architects enterprise solutions that bridge legacy systems and modern technology — and has led multi-million dollar ERP implementations for Fortune 500 supply chains.
