Batch processing handles the workloads that don’t need real-time results: nightly ETL, weekly reporting, model training, data backfills, and large-scale transformations. At scale, batch processing is an engineering discipline: the difference between a job that runs in 30 minutes and one that takes 8 hours is usually a few configuration and partitioning decisions.
## Spark Optimization

### Memory & Partitioning
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.sql.adaptive.enabled", "true")                     # AQE: re-optimize plans at runtime
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Partition target: ~128 MB per partition
def optimal_partitions(data_size_gb, target_partition_mb=128):
    return max(1, int(data_size_gb * 1024 / target_partition_mb))

# e.g. a 100 GB input: optimal_partitions(100) == 800 partitions
```
### Common Optimizations
| Issue | Symptom | Fix |
|---|---|---|
| Too many small files | Slow reads, high overhead | coalesce() or compaction job |
| Too few partitions | OOM, low parallelism | repartition(N) based on data size |
| Data skew | One task takes 100x longer | Salted keys, broadcast join for small tables |
| Shuffles | Slow, high network I/O | Reduce shuffles, use broadcast joins |
| Full scans | Reading all data for filtered queries | Partition pruning, predicate pushdown |
### Handling Data Skew
```python
from pyspark.sql.functions import col, concat, lit, rand, floor

# Problem: 90% of orders belong to 10 power users,
# so one partition processes most of the data.
# Solution: salt the join key.

SALT_BUCKETS = 10

# Salt the large (skewed) table with a random bucket suffix
skewed_df = orders.withColumn(
    "salted_key",
    concat(col("customer_id"), lit("_"), floor(rand() * SALT_BUCKETS)),
)

# Explode the small table so every salt bucket has a matching row
small_df = customers.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
).withColumn(
    "salted_key",
    concat(col("customer_id"), lit("_"), col("salt")),
)

# Join on the salted key (work is now evenly distributed)
result = skewed_df.join(small_df, "salted_key")
```
### File Formats

| Format | Compression | Read Speed | Write Speed | Best For |
|---|---|---|---|---|
| Parquet | Excellent (columnar) | Fast (column pruning) | Medium | Analytics, warehouse |
| ORC | Excellent | Fast | Medium | Hive ecosystem |
| Avro | Good (row-based) | Medium | Fast | Streaming, row-level ops |
| Delta/Iceberg | Parquet + metadata | Fast + ACID | Medium | Lakehouse |
| CSV | None | Slow | Fast | Data exchange, legacy |
| JSON | None | Slow | Fast | APIs, nested data |
## Batch Pipeline Architecture
```
┌───────────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐
│ Schedule  │────▶│  Extract  │────▶│ Transform │────▶│   Load   │
│ (Airflow) │     │ (Sources) │     │  (Spark)  │     │  (Sink)  │
└───────────┘     └───────────┘     └─────┬─────┘     └──────────┘
                                          │
                                    ┌─────▼──────┐
                                    │ Monitoring │
                                    │ • Duration │
                                    │ • Records  │
                                    │ • Data QA  │
                                    └────────────┘
```
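The monitoring box can start as a simple post-run check before anything fancier; a minimal sketch, where the function name and thresholds are illustrative:

```python
def validate_batch_run(duration_s, record_count, max_duration_s=3600,
                       expected_min_records=1, expected_max_records=10_000_000):
    """Return a list of alert messages for a completed batch run."""
    alerts = []
    if duration_s > max_duration_s:
        alerts.append(f"duration {duration_s}s exceeded limit {max_duration_s}s")
    if not (expected_min_records <= record_count <= expected_max_records):
        alerts.append(f"record count {record_count} outside expected range")
    return alerts

# Healthy run -> no alerts; slow empty run -> two alerts
ok = validate_batch_run(duration_s=100, record_count=5000)
bad = validate_batch_run(duration_s=7200, record_count=0)
```

Wiring these checks into the scheduler (e.g. failing the Airflow task when the list is non-empty) turns silent data problems into pages.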
## Cost Optimization
| Strategy | Savings | How |
|---|---|---|
| Spot/preemptible instances | 60-80% | Use for fault-tolerant batch jobs |
| Right-size clusters | 20-40% | Auto-scaling based on job size |
| Partition pruning | 50-90% compute | Only read relevant partitions |
| Columnar format | 30-70% I/O | Parquet/ORC instead of CSV/JSON |
| Incremental processing | 80-95% | Process only new/changed data |
| Off-peak scheduling | 10-20% | Run during low-demand hours |
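The incremental-processing row has a simple core shape: persist a high-water mark from the last run and process only newer records. A minimal sketch with plain tuples standing in for rows (a real pipeline would persist the watermark, e.g. in a state table):

```python
def incremental_batch(records, watermark):
    """Process only records newer than the stored watermark.

    records: iterable of (timestamp, payload) tuples.
    Returns (new_records, new_watermark).
    """
    new_records = [r for r in records if r[0] > watermark]
    new_watermark = max((r[0] for r in new_records), default=watermark)
    return new_records, new_watermark

# First run sees everything; the next run sees only the new row
batch1 = [(1, "a"), (2, "b")]
processed, wm = incremental_batch(batch1, watermark=0)
batch2 = [(1, "a"), (2, "b"), (3, "c")]
processed2, wm = incremental_batch(batch2, watermark=wm)
```

The same idea scales up to partition-level filters (`WHERE event_date > :watermark`), which is where the 80-95% savings in the table come from.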
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Process everything daily | 95% of data hasn’t changed | Incremental processing with watermarks |
| Single large job | One failure restarts everything | Modular stages with checkpointing |
| No monitoring | Job fails silently for days | Duration, record count, output validation alerts |
| Wrong file format | CSV at petabyte scale | Parquet/ORC with partitioning |
| Over-provisioned clusters | Paying for idle resources | Auto-scaling, right-sizing after profiling |
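The "single large job" fix can be sketched as stage-level checkpointing: record each completed stage and skip it on restart. The stage names and in-memory completed-set below are illustrative; a real pipeline would persist the set:

```python
def run_pipeline(stages, completed):
    """Run named stages in order, skipping any already completed.

    stages: list of (name, callable) pairs.
    completed: set of finished stage names from a prior attempt.
    """
    for name, fn in stages:
        if name in completed:
            continue  # checkpoint hit: already done, skip on restart
        fn()
        completed.add(name)
    return completed

ran = []
stages = [
    ("extract", lambda: ran.append("extract")),
    ("transform", lambda: ran.append("transform")),
]

# Simulate a restart after "extract" already succeeded
done = run_pipeline(stages, completed={"extract"})
```

This is the same contract Airflow gives you for free when each stage is its own task with retries.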
## Checklist

- [ ] Columnar format (Parquet/ORC, or Delta/Iceberg) instead of CSV/JSON at scale
- [ ] Partitions sized to ~128 MB, with AQE enabled
- [ ] Skewed joins handled (broadcast joins, salted keys, AQE skew join)
- [ ] Incremental processing with watermarks instead of daily full reprocessing
- [ ] Modular stages with checkpointing, not one monolithic job
- [ ] Monitoring and alerts on duration, record counts, and output validation
- [ ] Clusters right-sized and auto-scaled; spot instances for fault-tolerant jobs
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For batch processing consulting, visit garnetgrid.com.
:::