In This Article
- Spark Internals: Driver, Executors, and the DAG
- Cluster Sizing: How Many Nodes, How Much Memory
- Partitioning Strategy: The Key to Parallel Performance
- The Shuffle Problem: Why Joins and GroupBys Are Expensive
- Memory Management: When Spark Spills to Disk
- 7 Performance Tuning Techniques
- Spark + Delta Lake: Optimized Storage Access
- Spark Job Monitoring and Debugging
- Spark on Fabric vs Spark on Databricks
- Common Spark Mistakes and How to Avoid Them
- Go Deeper
Spark Internals: Driver, Executors, and the DAG
Understanding Spark performance requires understanding its execution model. A Spark application has three components:
Driver: The coordinator. Receives your code (PySpark, Scala, SQL), creates the execution plan (Directed Acyclic Graph — DAG), distributes work to executors, and collects results. The driver runs on one node. It doesn't process data — it orchestrates processing. Driver failures crash the entire job; size the driver with enough memory for: plan compilation, broadcast variables, and result collection.
Executors: The workers. Each executor is a JVM process on a cluster node that: stores partitions in memory or on disk, executes tasks (transformation operations on individual partitions), and returns results to the driver. More executors = more parallelism. Each executor has: CPU cores (number of tasks it can run simultaneously) and memory (for caching data and computation). The executor count × cores per executor = total parallelism.
The DAG (Directed Acyclic Graph): Spark doesn't execute transformations immediately. It builds a DAG of all operations — reads, filters, joins, aggregations — and optimizes the execution plan before running anything. Catalyst optimizes the logical and physical plan (reordering operations, eliminating unnecessary steps, selecting efficient join strategies), while Tungsten optimizes memory layout and code generation. Understanding the DAG is how you debug slow Spark jobs: the Spark UI shows the DAG, stage boundaries, and task-level metrics that reveal bottlenecks.
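A minimal sketch of this laziness, assuming an active SparkSession and a hypothetical orders dataset at /data/orders: the read, filter, and aggregation below only extend the DAG, and explain() prints the plan Catalyst produced before any action triggers execution.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-demo").getOrCreate()

# Transformations are lazy: nothing executes here, Spark only extends the DAG.
orders = spark.read.parquet("/data/orders")  # hypothetical input path
shipped = orders.where(F.col("status") == "shipped")
daily = shipped.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

# Inspect the optimized plan before triggering any computation.
daily.explain(mode="formatted")

# An action (write, count, collect) is what finally runs the DAG.
daily.write.mode("overwrite").parquet("/data/daily_revenue")
```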
Cluster Sizing: How Many Nodes, How Much Memory
Cluster sizing balances: enough parallelism for your workload, enough memory to avoid disk spill, and not so much that you're paying for idle capacity.
| Workload | Nodes | Cores/Node | Memory/Node | Rationale |
|---|---|---|---|---|
| ETL (10-50GB) | 4-8 | 4-8 | 16-32GB | Moderate parallelism, minimal memory pressure |
| ETL (50-500GB) | 8-20 | 8-16 | 32-64GB | High parallelism, enough memory for large shuffles |
| ETL (500GB+) | 20-50+ | 16 | 64-128GB | Distributed across many nodes, large memory for joins |
| ML training | 8-20 (+ GPU nodes) | 8-16 | 64-128GB | Memory-intensive for feature matrices and model parameters |
| Interactive SQL | 4-12 | 8-16 | 32-64GB | Fast startup, responsive to ad-hoc queries |
Core allocation rules: Leave 1 core per node for the OS and cluster daemons: a 16-core node should expose 15 cores to Spark executors. Allocating all cores causes OS starvation and erratic performance. Separately, the widely cited 5-core rule caps each executor at roughly 5 cores; beyond that, concurrent tasks contend for I/O and garbage collection pauses grow, so a 15-core allocation works better as three 5-core executors than as one 15-core executor.
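A sizing sketch under those rules, with illustrative numbers for a hypothetical 8-node cluster of 16-core / 64GB machines (on Databricks and Fabric much of this is managed for you; on self-managed Spark these are launch-time configs):

```python
from pyspark.sql import SparkSession

# 15 usable cores per node (1 reserved for the OS), 5 cores per executor
# => 3 executors per node, 24 executors across 8 nodes.
spark = (
    SparkSession.builder
    .appName("sized-etl")
    .config("spark.executor.cores", "5")
    .config("spark.executor.instances", "24")
    .config("spark.executor.memory", "18g")         # ~(64GB - OS headroom) / 3 executors
    .config("spark.executor.memoryOverhead", "2g")  # off-heap / JVM overhead per executor
    .getOrCreate()
)
```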
Databricks and Fabric auto-scaling: Cloud Spark platforms auto-scale clusters based on workload. Set min and max nodes; the platform adds/removes nodes as processing demands change. Auto-scaling is ideal for variable workloads: the cluster grows for the nightly ETL and shrinks after completion. For jobs that don't run 24/7, pair auto-scaling with auto-termination so the cluster shuts down completely between runs, eliminating idle cost.
Partitioning Strategy: The Key to Parallel Performance
Spark processes data in partitions — each partition is processed by one task on one core. The number of partitions determines the degree of parallelism: too few partitions → some cores sit idle (under-utilized cluster), too many partitions → scheduling overhead exceeds processing time (diminished returns), and skewed partitions → one task processes 80% of the data while 19 tasks finish in seconds (data skew problem). Target: 2-4 partitions per available core, with each partition sized 128-256MB.
Repartitioning: When partition count or distribution is wrong, explicitly repartition: df.repartition(200) (redistribute into 200 even partitions) or df.repartition("date_column") (partition by a specific column for optimized downstream joins). Repartitioning causes a full shuffle — expensive, but necessary when the current partitioning is causing skew or suboptimal parallelism.
Coalesce vs. repartition: After filtering removes 90% of rows, the remaining data is spread across many near-empty partitions. df.coalesce(10) reduces the partition count without a full shuffle (it merges existing partitions in place rather than redistributing every row). Use coalesce when reducing partitions; use repartition when increasing or rebalancing.
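A short sketch of both operations, assuming an active SparkSession and a hypothetical events dataset:

```python
from pyspark.sql import functions as F

events = spark.read.parquet("/data/events")  # hypothetical input

# A heavy filter leaves many near-empty partitions behind.
recent = events.where(F.col("event_date") >= "2024-01-01")

# coalesce: shrink the partition count without a full shuffle.
recent.coalesce(10).write.mode("overwrite").parquet("/data/events_recent")

# repartition: full shuffle, but evenly rebalances and co-locates by a key.
by_date = recent.repartition(200, "event_date")
```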
The Shuffle Problem: Why Joins and GroupBys Are Expensive
Shuffle is the most expensive Spark operation. It occurs when data must move between nodes — during joins, groupBy, distinct, and sort operations. Shuffle involves: serializing data on the source node, writing to disk, transferring over the network to the target node, reading from disk, and deserializing. For a 100GB dataset, a shuffle can transfer 100GB across the network — dominating job execution time.
Broadcast joins: When joining a large table with a small one, broadcast the small table to every executor. Each executor holds the small table in memory and joins locally, so no shuffle is required. Spark broadcasts automatically below spark.sql.autoBroadcastJoinThreshold (10MB by default); you can broadcast explicitly with df.join(broadcast(small_df), "key") for larger tables, up to Spark's 8GB hard limit on broadcast size. In practice, keep broadcasts small enough to fit comfortably in executor memory. Broadcast joins are the single most impactful Spark optimization for join-heavy workloads.
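A sketch with hypothetical fact and dimension tables, assuming an active SparkSession:

```python
from pyspark.sql.functions import broadcast

facts = spark.read.parquet("/data/transactions")  # hypothetical large fact table
stores = spark.read.parquet("/data/stores")       # hypothetical small dimension table

# Explicit hint: ship the small table to every executor and join locally, no shuffle.
joined = facts.join(broadcast(stores), "store_id")

# Spark also broadcasts automatically below this threshold (default 10MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(256 * 1024 * 1024))
```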
Partition-aligned (bucketed) joins: If both tables are partitioned or bucketed by the join key with the same number of partitions, Spark performs a co-located join — matching partitions are already on the same node. No shuffle required. Pre-partition frequently joined tables by their join key to enable this optimization.
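A sketch of the idea using Spark's bucketing, with hypothetical orders and customers tables. Note that bucketBy requires saveAsTable and a Parquet-backed table; Delta tables don't support bucketing, where Z-ordering and adaptive execution fill a similar role.

```python
# Write both tables bucketed by the join key with the same bucket count.
(orders.write.format("parquet")
    .bucketBy(64, "customer_id").sortBy("customer_id")
    .mode("overwrite").saveAsTable("orders_bucketed"))

(customers.write.format("parquet")
    .bucketBy(64, "customer_id").sortBy("customer_id")
    .mode("overwrite").saveAsTable("customers_bucketed"))

# Matching buckets join in place: no shuffle exchange appears in the plan.
joined = spark.table("orders_bucketed").join(
    spark.table("customers_bucketed"), "customer_id"
)
```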
Memory Management: When Spark Spills to Disk
Spark executor memory is divided into: storage memory (cached DataFrames, broadcast variables), execution memory (shuffles, joins, aggregations), and overhead (JVM objects, internal metadata). When execution memory is insufficient for a shuffle or join, Spark spills to disk — writing intermediate data to local storage. Disk I/O is 10-100x slower than memory access; spilling degrades performance dramatically. Signs of spilling: "Spill (Disk)" metrics in the Spark UI, job stages taking 10x longer than expected, and executor disk space warnings. Fix: increase executor memory, reduce shuffle size (through broadcast joins or pre-aggregation), or repartition to reduce per-partition data size.
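When the UI shows spill, these are the usual knobs, shown as a launch-time sketch with illustrative values (on Databricks or Fabric, set them in the cluster's Spark configuration instead):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "32g")         # more execution memory before spill
    .config("spark.executor.memoryOverhead", "4g")  # off-heap and JVM overhead
    .config("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle partitions => less data per task
    .getOrCreate()
)
```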
7 Performance Tuning Techniques
Filter Early, Transform Late
Apply filters as early as possible in the pipeline — reduce data volume before expensive operations (joins, aggregations). A filter that removes 80% of rows before a join reduces shuffle volume by 80%.
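Catalyst often pushes filters down automatically, but being explicit is reliable across complex pipelines and UDF boundaries. A sketch with hypothetical orders and customers DataFrames:

```python
from pyspark.sql import functions as F

# Prune rows and columns before the join so only matching data is shuffled.
recent_orders = (
    orders
    .where(F.col("order_date") >= "2024-01-01")   # drop most rows first
    .select("order_id", "customer_id", "amount")  # drop unused columns too
)
enriched = recent_orders.join(customers, "customer_id")
```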
Use Broadcast Joins for Small Tables
Any join where one side is small enough to fit in executor memory (and under Spark's 8GB broadcast limit): broadcast it. Eliminates the shuffle entirely. Single biggest performance improvement for join-heavy ETL.
Optimize File Formats: Parquet + Delta
Parquet (columnar, compressed) is 10-100x more efficient than CSV for analytical queries, and already supports column pruning and predicate pushdown. Delta Lake adds: data skipping (per-file min/max statistics prune files whose value ranges can't match the query) and Z-ordering (physically co-locate related data so that pruning is effective). Read only the columns and rows you need.
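A sketch of a pruning-friendly read, assuming a hypothetical Delta table at /lake/sales:

```python
from pyspark.sql import functions as F

# Column pruning + data skipping: read only what the query needs.
sales = (
    spark.read.format("delta").load("/lake/sales")
    .select("customer_id", "amount", "sale_date")   # columnar: only these columns are read
    .where(F.col("sale_date") == "2024-03-15")      # file-level statistics prune non-matching files
)
```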
Avoid UDFs — Use Built-in Functions
Python UDFs are 10-100x slower than built-in Spark functions because they require serialization between the JVM and Python. Replace UDFs with Spark SQL functions (when, coalesce, regexp_replace, etc.) wherever possible; when custom Python logic is unavoidable, prefer vectorized pandas UDFs, which use Arrow to cut serialization cost.
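A before/after sketch with a hypothetical df holding name and amount columns:

```python
from pyspark.sql import functions as F

# Slow: a Python UDF serializes every row between the JVM and Python.
# normalize = F.udf(lambda s: s.strip().upper() if s else None)
# df = df.withColumn("name_clean", normalize("name"))

# Fast: equivalent logic with built-ins stays inside the JVM.
df = df.withColumn("name_clean", F.upper(F.trim(F.col("name"))))
df = df.withColumn(
    "tier", F.when(F.col("amount") >= 1000, "gold").otherwise("standard")
)
```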
Cache Strategically
Cache DataFrames that are reused multiple times in the same job — but don't cache everything (memory is finite). Unpersist cached data when no longer needed.
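A sketch of the cache-then-release pattern, with hypothetical paths:

```python
from pyspark.sql import functions as F

# Cache a DataFrame that feeds several downstream outputs.
base = spark.read.parquet("/data/events").where(F.col("is_valid"))
base.cache()

daily = base.groupBy("event_date").count()
by_user = base.groupBy("user_id").count()
daily.write.mode("overwrite").parquet("/out/daily")
by_user.write.mode("overwrite").parquet("/out/by_user")

base.unpersist()  # free the memory once both consumers have run
```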
Address Data Skew
When one partition is 100x larger than others: salting (append a random salt to the skewed key on the large side, replicate the small side across all salt values, join on key plus salt, then drop the salt), or separate processing (handle skewed keys in a dedicated path). A salting sketch follows this item.
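Spark 3's adaptive query execution can also split skewed shuffle partitions automatically (spark.sql.adaptive.skewJoin.enabled); manual salting remains useful when AQE doesn't catch the skew. A sketch with hypothetical large_df and small_df sharing a join column named key:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # tune to the severity of the skew

# Large side: append a random salt so one hot key spreads over 16 partitions.
large_salted = large_df.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int")
)

# Small side: replicate each row once per salt value so every bucket can match.
salts = F.array(*[F.lit(i) for i in range(SALT_BUCKETS)])
small_salted = small_df.withColumn("salt", F.explode(salts))

joined = large_salted.join(small_salted, ["key", "salt"]).drop("salt")
```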
Right-Size Partitions
Target 128-256MB per partition. Repartition after filters that dramatically change data volume. Use coalesce before writing to avoid many small output files.
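A back-of-envelope sketch of that target, assuming a roughly 50GB input (measure yours):

```python
# Aim for ~200MB per partition during processing.
TARGET_PARTITION_BYTES = 200 * 1024 * 1024
input_bytes = 50 * 1024**3
num_partitions = int(input_bytes // TARGET_PARTITION_BYTES)  # ~256

df = df.repartition(num_partitions)  # restore parallelism after a big filter
# ... transformations ...
df.coalesce(16).write.mode("overwrite").parquet("/out/final")  # fewer, larger output files
```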
Spark + Delta Lake: Optimized Storage Access
Delta Lake on Fabric and Databricks provides storage optimizations that reduce Spark processing time: data skipping (Delta maintains min/max statistics per file per column — a query for "date = 2024-03-15" skips all files where the date range doesn't include March 15), Z-ordering (physically co-locate data that's frequently filtered together — Z-order by customer_id makes customer-specific queries scan fewer files), and auto-optimize (automatically compact small files into optimally-sized files — preventing the "small file problem" that degrades Spark read performance). These optimizations are transparent — the data engineer writes standard Spark code; Delta Lake optimizes the storage access automatically.
OPTIMIZE and VACUUM: Run OPTIMIZE periodically to compact files and apply Z-ordering. Run VACUUM to delete obsolete files (old versions retained for time travel). These maintenance operations keep Delta tables performing optimally over time — without them, file fragmentation degrades read performance progressively.
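A maintenance sketch run on a schedule, with a hypothetical sales table (168 hours matches the default 7-day retention):

```python
# Compact small files and cluster by the most common filter column.
spark.sql("OPTIMIZE sales ZORDER BY (customer_id)")

# Remove files no longer referenced by versions inside the retention window.
spark.sql("VACUUM sales RETAIN 168 HOURS")
```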
Spark Job Monitoring and Debugging
The Spark UI provides the diagnostic data for performance debugging: Jobs tab (which jobs ran, how long each took), Stages tab (which stages within each job, task counts, shuffle data volumes), SQL tab (physical execution plan — which operators executed, rows processed per operator), and Storage tab (cached DataFrames, memory usage). The debugging workflow: identify the slowest stage → check task distribution (is one task 100x slower than others? → data skew) → check shuffle volumes (is a join shuffling 100GB? → broadcast instead) → check spill metrics (is data spilling to disk? → increase memory or reduce per-partition size). Every Spark performance issue traces back to: skew, shuffle, or spill. The Spark UI shows which one.
Spark on Fabric vs Spark on Databricks
Fabric Spark and Databricks Spark both run Apache Spark — but with different optimizations and management models. Fabric Spark: integrated with OneLake storage (zero-copy access to all Fabric data), shared capacity model (one CU allocation serves all Fabric workloads), Starter Pools for fast session startup, and native Power BI integration. Best for: Microsoft-ecosystem organizations using Fabric for BI and data engineering. Databricks Spark: Photon engine (C++ optimized, 2-8x faster for SQL workloads), Delta Live Tables (declarative pipeline management with auto-retry and quality constraints), Unity Catalog for fine-grained governance, and MLflow integration for ML workflows. Best for: ML-heavy workloads, multi-cloud deployments, and organizations needing the highest Spark performance. Both are "Spark" — but the surrounding platform determines which is more effective for your workload mix.
Common Spark Mistakes and How to Avoid Them
Collecting to driver: df.collect() pulls all data to the driver node; if the DataFrame has 500 million rows, the driver runs out of memory and crashes. Fix: use df.take(100) or df.show(20) for debugging; for production, write results to storage instead of collecting (see the sketch after this list).
Cartesian joins: A join without a join condition produces a Cartesian product: 1 million rows × 1 million rows = 1 trillion rows. The job runs forever. Fix: always specify join conditions; use df.crossJoin() explicitly and only when intentional.
Not caching reused DataFrames: If a DataFrame is used in 3 downstream operations, Spark recomputes it 3 times unless cached. Fix: df.cache() for reused DataFrames, df.unpersist() when done.
Writing too many small files: A Spark job with 1,000 partitions writes 1,000 files; downstream reads of 1,000 small files are slower than reading 10 large files. Fix: df.coalesce(10) before writing.
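A short sketch of the collect-safe alternatives from the first mistake above, with hypothetical names:

```python
# Debugging without pulling the full DataFrame onto the driver:
df.show(20)             # print a small sample
sample_rows = df.take(100)  # a bounded list of Rows
print(df.count())       # aggregates are fine; full row sets are not

# Production: persist results to storage instead of collecting them.
df.write.mode("overwrite").format("delta").saveAsTable("results")
```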
The Xylity Approach
We optimize Spark workloads with the partition-shuffle-memory tuning framework — right-sized clusters, optimized partitioning, broadcast joins for small tables, Delta Lake storage optimization, and monitoring-driven debugging. Our data engineers and Databricks engineers tune Spark jobs that turn 4-hour processing into 12-minute runs — because in data engineering, performance is the difference between meeting and missing the business SLA.
Go Deeper
Continue building your understanding with these related resources from our consulting practice.
Spark Performance That Meets the SLA
Cluster sizing, partitioning, broadcast joins, Delta optimization. Spark tuning that turns hours into minutes.
Start Your Spark Optimization →