Cost-Optimized Data Pipelines
Design data pipelines that minimize cloud costs without sacrificing reliability. Covers compute right-sizing, storage tiering, query optimization, batch vs streaming cost trade-offs, and the FinOps patterns that keep data infrastructure costs predictable.
Data pipelines are among the most expensive cloud workloads. A poorly optimized Spark job can cost 10x more than a well-tuned one processing the same data. A data warehouse with no lifecycle management grows storage costs linearly forever. Cost-optimized pipelines require understanding where money is spent and making deliberate trade-offs.
Where Data Pipeline Money Goes
Compute (40-60% of cost):
- Spark/Flink cluster hours
- Serverless function invocations
- Database query compute
Storage (20-30% of cost):
- Raw data landing zone
- Transformed/curated data
- Snapshots and backups
- Intermediate/temp data (often forgotten)
Data Transfer (10-20% of cost):
- Cross-region replication
- Cross-cloud movement
- API egress
Managed Services (10-15% of cost):
- Airflow/Prefect orchestration
- Schema registry
- Monitoring/logging
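The breakdown above can be turned into a rough cost model. The helper below is an illustrative sketch, not a billing API: the category shares are the midpoints of the ranges quoted above (they sum to slightly over 100% because the quoted ranges overlap).

```python
# Rough monthly cost attribution using the category shares above
# (midpoints of the quoted ranges; illustrative only).
CATEGORY_SHARE = {
    "compute": 0.50,            # 40-60% of total
    "storage": 0.25,            # 20-30%
    "data_transfer": 0.15,      # 10-20%
    "managed_services": 0.125,  # 10-15%
}

def estimate_category_costs(total_monthly_usd: float) -> dict:
    """Split a total monthly bill into the categories above."""
    return {cat: round(total_monthly_usd * share, 2)
            for cat, share in CATEGORY_SHARE.items()}

# Compute dominates: roughly half of a $10,000/month bill.
print(estimate_category_costs(10_000))
```

A model like this is most useful for sanity-checking optimization priorities: a 50% compute reduction moves the bill far more than a 50% cut to managed-service spend.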
Compute Optimization
# Right-size Spark clusters
# BAD: Fixed large cluster for all jobs
spark_config_bad = {
"num_workers": 20,
"instance_type": "m5.4xlarge", # $0.768/hr × 20 = $15.36/hr
"always_on": True,
}
# GOOD: Auto-scaling, right-sized per job
spark_config_good = {
"min_workers": 2,
"max_workers": 20,
"instance_type": "m5.xlarge", # $0.192/hr
"spot_instances": True, # 60-90% cheaper
"auto_terminate_minutes": 10, # Shut down when idle
"cluster_per_job": True, # Ephemeral clusters
}
# Cost comparison:
# Bad: $15.36/hr × 24h ≈ $369/day (always on)
# Good: $0.192/hr × 10 workers × 2h/job × 0.3 (spot) ≈ $1.15/job
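The arithmetic behind that comparison can be checked with a small calculator. This is a hypothetical sketch: the hourly rates ($0.768 for m5.4xlarge, $0.192 for m5.xlarge) and the 70% spot discount (×0.3) are the figures quoted in the example above, not live prices.

```python
# Cost of a fixed always-on cluster vs. an ephemeral spot cluster,
# using the example's quoted on-demand rates and spot discount.

def always_on_daily_cost(hourly_rate: float, workers: int) -> float:
    """Fixed cluster billed 24h/day regardless of utilization."""
    return hourly_rate * workers * 24

def ephemeral_job_cost(hourly_rate: float, workers: int,
                       job_hours: float, spot_discount: float = 0.3) -> float:
    """Ephemeral cluster: billed only for job duration, at spot pricing."""
    return hourly_rate * workers * job_hours * spot_discount

print(round(always_on_daily_cost(0.768, 20), 2))   # 368.64 $/day, always on
print(round(ephemeral_job_cost(0.192, 10, 2), 2))  # 1.15 $/job
```

Even running that ephemeral job 50 times a day would cost a fraction of the always-on cluster.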
Storage Tiering
# Lifecycle rules by data zone
storage_tiers:
hot: # < 30 days old, frequently queried
storage: S3 Standard / GCS Standard
cost: $0.023/GB/month
access: Instant
warm: # 30-90 days old, occasional queries
storage: S3 Infrequent Access / GCS Nearline
cost: $0.0125/GB/month (46% cheaper)
access: Instant (retrieval fee per request)
cold: # 90-365 days old, rare queries
storage: S3 Glacier Instant / GCS Coldline
cost: $0.004/GB/month (83% cheaper)
access: Instant (higher retrieval fee)
archive: # > 365 days, compliance/audit only
storage: S3 Glacier Deep Archive / GCS Archive
cost: $0.00099/GB/month (96% cheaper)
access: 12+ hours retrieval time
# 10TB data lake cost comparison (per month):
# All Standard: $230/month
# With tiering: $60/month (74% reduction)
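The tiered figure depends on how data ages out across tiers. The sketch below uses the per-GB prices quoted above; the split of the 10 TB across tiers is an assumed illustrative distribution (retrieval fees excluded), not a measurement.

```python
# Monthly storage cost under the tiering scheme above.
# Prices are the per-GB/month figures quoted in this section.
PRICE_PER_GB = {  # USD/GB/month
    "hot": 0.023,
    "warm": 0.0125,
    "cold": 0.004,
    "archive": 0.00099,
}

def monthly_storage_cost(gb_by_tier: dict) -> float:
    """Sum per-tier storage cost; ignores retrieval and request fees."""
    return sum(PRICE_PER_GB[tier] * gb for tier, gb in gb_by_tier.items())

all_hot = monthly_storage_cost({"hot": 10_000})          # everything Standard
tiered = monthly_storage_cost(                            # assumed age split
    {"hot": 1_000, "warm": 2_000, "cold": 2_000, "archive": 5_000})
print(round(all_hot))  # 230
print(round(tiered))   # ~61 — roughly the $60/month quoted above
```

Note that retrieval fees can erode the savings if "cold" data is actually queried often, which is why tiering should follow measured access frequency, not just age.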
Query Optimization
-- BAD: filter on a non-partition column forces a full table scan
SELECT * FROM events WHERE event_date = '2026-03-04';
-- GOOD: Partition pruning
SELECT event_id, user_id, event_type
FROM events
WHERE year = 2026 AND month = 3 AND day = 4;
-- Scans 1 partition instead of 365
-- BAD: wrapping the column in a function defeats indexes and pruning
SELECT * FROM orders WHERE UPPER(status) = 'PENDING';
-- GOOD: query the stored value directly (normalize casing on write,
-- or add an index/materialized column if casing varies)
SELECT * FROM orders WHERE status = 'pending';
-- Cost impact at scale:
-- 1TB table, BigQuery on-demand: $5/full scan
-- With partition pruning: $0.014/scan (1 day partition)
-- 100 queries/day savings: $499/day = $14,970/month
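The savings math follows directly from on-demand pricing, where cost scales with bytes scanned. The calculator below is a sketch using the $5/TB rate quoted above (check current pricing for your warehouse); the function names are illustrative.

```python
# On-demand query cost: you pay per TB scanned, so pruning partitions
# is a direct cost reduction. Rate is the $5/TB used in this section.
PRICE_PER_TB_SCANNED = 5.0

def query_cost(tb_scanned: float) -> float:
    return tb_scanned * PRICE_PER_TB_SCANNED

full_scan = query_cost(1.0)       # whole 1 TB table
pruned = query_cost(1 / 365)      # one daily partition of a year's data
daily_savings = 100 * (full_scan - pruned)  # 100 queries/day

print(round(full_scan, 2))     # 5.0
print(round(pruned, 3))        # 0.014
print(round(daily_savings))    # 499
```

The same arithmetic applies to column pruning: selecting 3 of 30 columns in a columnar store scans roughly a tenth of the bytes.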
Batch vs Streaming Cost
Batch (hourly/daily):
Cost: Low compute, efficient bulk processing
Latency: Minutes to hours
Best for: Analytics, reporting, ML training
Tool: Spark, dbt, Airflow
Micro-batch (5-15 minute windows):
Cost: Medium compute, good efficiency
Latency: Minutes
Best for: Near-real-time dashboards
Tool: Spark Structured Streaming
Real-time streaming:
Cost: High (always-on compute)
Latency: Seconds
Best for: Fraud detection, real-time pricing
Tool: Flink, Kafka Streams
Rule: Use the LEAST real-time processing you can tolerate.
Every step toward real-time increases cost 5-10x.
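The rule above can be encoded as a simple decision helper: given the latency the business can actually tolerate, pick the least real-time mode that meets it. The thresholds below are illustrative assumptions, not fixed industry cutoffs.

```python
# Pick the cheapest processing mode that still meets the latency SLA,
# per the rule above. Thresholds are illustrative assumptions.

def choose_processing_mode(max_latency_seconds: float) -> str:
    if max_latency_seconds >= 3600:   # an hour or more: plain batch
        return "batch"
    if max_latency_seconds >= 300:    # minutes: micro-batch windows
        return "micro-batch"
    return "streaming"                # seconds: always-on streaming

print(choose_processing_mode(24 * 3600))  # batch       (daily report)
print(choose_processing_mode(600))        # micro-batch (near-real-time dashboard)
print(choose_processing_mode(5))          # streaming   (fraud detection)
```

The key point is the direction of the default: start from batch and justify each move toward streaming, not the reverse.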
Anti-Patterns
| Anti-Pattern | Consequence | Fix |
|---|---|---|
| Always-on Spark clusters | Paying for idle compute | Ephemeral clusters, auto-terminate |
| No storage lifecycle | Costs grow forever | Automated tiering policies |
| All data in hot storage | 4x storage cost premium | Tier by access frequency |
| SELECT * in pipelines | Processing 10x needed data | Select columns, filter early |
| Streaming when batch works | 5-10x compute cost | Batch unless latency < 5 min required |
The cheapest data pipeline is the one that processes the minimum data, at the maximum latency your business can tolerate, on the cheapest compute that delivers acceptable performance.