Batch processing handles the workloads that don’t need real-time results: nightly ETL, weekly reporting, model training, data backfills, and large-scale transformations. At scale, batch processing is an engineering discipline: the difference between a job that runs in 30 minutes and one that takes 8 hours is usually a few configuration and partitioning decisions.
## Spark Optimization

### Memory & Partitioning
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.sql.adaptive.enabled", "true")                     # AQE: re-optimize plans at runtime
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed join partitions
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Partition target: ~128 MB per partition
def optimal_partitions(data_size_gb, target_partition_mb=128):
    return max(1, int(data_size_gb * 1024 / target_partition_mb))

# e.g. a 100 GB input: optimal_partitions(100) == 800 partitions
```
### Common Optimizations
| Issue | Symptom | Fix |
|---|---|---|
| Too many small files | Slow reads, high overhead | coalesce() or compaction job |
| Too few partitions | OOM, low parallelism | repartition(N) based on data size |
| Data skew | One task takes 100x longer | Salted keys, broadcast join for small tables |
| Shuffles | Slow, high network I/O | Reduce shuffles, use broadcast joins |
| Full scans | Reading all data for filtered queries | Partition pruning, predicate pushdown |
### Handling Data Skew
```python
from pyspark.sql.functions import col, concat, lit, rand, floor

# Problem: 90% of orders belong to 10 power users,
# so one partition processes most of the data.
# Solution: salt the join key.

SALT_BUCKETS = 10

# Salt the large (skewed) table with a random bucket suffix
skewed_df = orders.withColumn(
    "salted_key",
    concat(col("customer_id"), lit("_"), floor(rand() * SALT_BUCKETS)),
)

# Explode the small table so every salt bucket has a matching row
small_df = customers.crossJoin(
    spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
).withColumn(
    "salted_key",
    concat(col("customer_id"), lit("_"), col("salt")),
)

# Join on the salted key (work is now evenly distributed)
result = skewed_df.join(small_df, "salted_key")
```
### File Formats

| Format | Compression | Read Speed | Write Speed | Best For |
|---|---|---|---|---|
| Parquet | Excellent (columnar) | Fast (column pruning) | Medium | Analytics, warehouse |
| ORC | Excellent | Fast | Medium | Hive ecosystem |
| Avro | Good (row-based) | Medium | Fast | Streaming, row-level ops |
| Delta/Iceberg | Parquet + metadata | Fast + ACID | Medium | Lakehouse |
| CSV | None | Slow | Fast | Data exchange, legacy |
| JSON | None | Slow | Fast | APIs, nested data |
## Batch Pipeline Architecture
```
┌───────────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐
│ Schedule  │────▶│  Extract  │────▶│ Transform │────▶│   Load   │
│ (Airflow) │     │ (Sources) │     │  (Spark)  │     │  (Sink)  │
└───────────┘     └───────────┘     └─────┬─────┘     └──────────┘
                                          │
                                    ┌─────▼──────┐
                                    │ Monitoring │
                                    │ • Duration │
                                    │ • Records  │
                                    │ • Data QA  │
                                    └────────────┘
```
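The monitoring box can start as a simple post-run check before anything fancier; a minimal sketch, where the function name and thresholds are illustrative:

```python
def validate_batch_run(duration_s, record_count, max_duration_s=3600,
                       expected_min_records=1, expected_max_records=10_000_000):
    """Return a list of alert messages for a completed batch run."""
    alerts = []
    if duration_s > max_duration_s:
        alerts.append(f"duration {duration_s}s exceeded limit {max_duration_s}s")
    if not (expected_min_records <= record_count <= expected_max_records):
        alerts.append(f"record count {record_count} outside expected range")
    return alerts

# Healthy run -> no alerts; slow empty run -> two alerts
ok = validate_batch_run(duration_s=100, record_count=5000)
bad = validate_batch_run(duration_s=7200, record_count=0)
```

Wiring these checks into the scheduler (e.g. failing the Airflow task when the list is non-empty) turns silent data problems into pages.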
## Cost Optimization
| Strategy | Savings | How |
|---|---|---|
| Spot/preemptible instances | 60-80% | Use for fault-tolerant batch jobs |
| Right-size clusters | 20-40% | Auto-scaling based on job size |
| Partition pruning | 50-90% compute | Only read relevant partitions |
| Columnar format | 30-70% I/O | Parquet/ORC instead of CSV/JSON |
| Incremental processing | 80-95% | Process only new/changed data |
| Off-peak scheduling | 10-20% | Run during low-demand hours |
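The incremental-processing row has a simple core shape: persist a high-water mark from the last run and process only newer records. A minimal sketch with plain tuples standing in for rows (a real pipeline would persist the watermark, e.g. in a state table):

```python
def incremental_batch(records, watermark):
    """Process only records newer than the stored watermark.

    records: iterable of (timestamp, payload) tuples.
    Returns (new_records, new_watermark).
    """
    new_records = [r for r in records if r[0] > watermark]
    new_watermark = max((r[0] for r in new_records), default=watermark)
    return new_records, new_watermark

# First run sees everything; the next run sees only the new row
batch1 = [(1, "a"), (2, "b")]
processed, wm = incremental_batch(batch1, watermark=0)
batch2 = [(1, "a"), (2, "b"), (3, "c")]
processed2, wm = incremental_batch(batch2, watermark=wm)
```

The same idea scales up to partition-level filters (`WHERE event_date > :watermark`), which is where the 80-95% savings in the table come from.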
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Process everything daily | 95% of data hasn’t changed | Incremental processing with watermarks |
| Single large job | One failure restarts everything | Modular stages with checkpointing |
| No monitoring | Job fails silently for days | Duration, record count, output validation alerts |
| Wrong file format | CSV at petabyte scale | Parquet/ORC with partitioning |
| Over-provisioned clusters | Paying for idle resources | Auto-scaling, right-sizing after profiling |
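The "single large job" fix can be sketched as stage-level checkpointing: record each completed stage and skip it on restart. The stage names and in-memory completed-set below are illustrative; a real pipeline would persist the set:

```python
def run_pipeline(stages, completed):
    """Run named stages in order, skipping any already completed.

    stages: list of (name, callable) pairs.
    completed: set of finished stage names from a prior attempt.
    """
    for name, fn in stages:
        if name in completed:
            continue  # checkpoint hit: already done, skip on restart
        fn()
        completed.add(name)
    return completed

ran = []
stages = [
    ("extract", lambda: ran.append("extract")),
    ("transform", lambda: ran.append("transform")),
]

# Simulate a restart after "extract" already succeeded
done = run_pipeline(stages, completed={"extract"})
```

This is the same contract Airflow gives you for free when each stage is its own task with retries.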
## Checklist

- [ ] Columnar format (Parquet/ORC, or Delta/Iceberg) instead of CSV/JSON at scale
- [ ] Partitions sized to ~128 MB, with AQE enabled
- [ ] Skewed joins handled (broadcast joins, salted keys, AQE skew join)
- [ ] Incremental processing with watermarks instead of daily full reprocessing
- [ ] Modular stages with checkpointing, not one monolithic job
- [ ] Monitoring and alerts on duration, record counts, and output validation
- [ ] Clusters right-sized and auto-scaled; spot instances for fault-tolerant jobs
:::note[Source]
This guide is derived from operational intelligence at Garnet Grid Consulting. For batch processing consulting, visit garnetgrid.com.
:::