In Spark, there are two ways of explicitly changing the number of partitions: the `repartition` function and `coalesce`. Both accept a numeric parameter that tells Spark how many partitions we want to have. In this article, I explain the difference between those two functions and when we should use them.
Table of Contents
- Repartition
- Coalesce
- When to Use Coalesce Instead of Repartition?
- Is Coalesce Worth Consideration In Practice?
Repartition
First of all, the `repartition` function that operates on Spark `Dataset`s comes in three versions. The first one accepts only the number of partitions we want to get. In this case, Spark performs `RoundRobinPartitioning` to uniformly distribute the whole `Dataset` across the requested number of partitions.
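Here is a minimal sketch of that first variant, assuming a local `SparkSession` and a small, hypothetical `events` DataFrame. Only the target partition count is passed, so Spark spreads the rows round-robin:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: a local session and a tiny, hypothetical dataset.
val spark = SparkSession.builder().appName("repartition-demo").master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

// Only the number of partitions is given, so Spark uses RoundRobinPartitioning
// and distributes the rows uniformly across 8 partitions.
val evenlyDistributed = events.repartition(8)
println(evenlyDistributed.rdd.getNumPartitions) // 8
```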
If we specify both the expression and the number of partitions, Spark does hash partitioning (`HashPartitioning`), which produces the requested number of partitions with the data partitioned by the given expression. Note that the result will most likely not be uniformly distributed. We may end up with partitions that contain significantly more data than others if the partitioning value is skewed.
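Continuing the sketch above, this is what the expression-plus-count variant looks like (`key` is the hypothetical partitioning column):

```scala
// Sketch: hash partitioning on `key` into 8 partitions. Rows with the same key
// land in the same partition, so a skewed key produces a skewed partition.
val byKey = events.repartition(8, $"key")
println(byKey.rdd.getNumPartitions) // 8
```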
In addition to that, we may use the version that accepts only the partitioning expression and uses the `spark.sql.shuffle.partitions` parameter as the number of partitions.
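A sketch of the expression-only variant, still using the hypothetical `events` DataFrame; the partition count is taken from the shuffle configuration:

```scala
// Sketch: only a partitioning expression is given, so the number of output
// partitions comes from spark.sql.shuffle.partitions (200 by default).
spark.conf.set("spark.sql.shuffle.partitions", "100")
val byKeyDefaultCount = events.repartition($"key")
println(byKeyDefaultCount.rdd.getNumPartitions) // 100
```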
We see that the `repartition` function can both increase and decrease the number of partitions. If we do not use a partitioning expression, we get partitions that distribute the workload equally across all worker nodes (unless we use another operation that implicitly repartitions the data using hash partitioning).
Coalesce
`coalesce` is a little bit different. It accepts only one parameter: there is no way to use a partitioning expression, and it can only decrease the number of partitions. It works this way because we should use `coalesce` only to combine existing partitions. It merges the data by draining existing partitions into others and removing the now-empty partitions.
Note that if we request a number of partitions larger than the current one, `coalesce` will not do anything.
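A short sketch, reusing the `evenlyDistributed` Dataset from the earlier example:

```scala
// Sketch: coalesce merges existing partitions without a full shuffle.
val merged = evenlyDistributed.coalesce(2)
println(merged.rdd.getNumPartitions) // 2

// Requesting more partitions than we currently have changes nothing.
println(merged.coalesce(16).rdd.getNumPartitions) // still 2
```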
When to Use Coalesce Instead of Repartition?
According to the documentation, it is better to run `repartition` instead of `coalesce` when we want to do a “drastic coalesce” (for example, merge everything into one partition). With `repartition`, the upstream calculations still execute in parallel, so the entire computation stays distributed across a larger number of workers, and only the final step moves everything into one partition.
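A hedged illustration of that trade-off, building on the sketch above (the `expensiveTransformation` helper is purely hypothetical):

```scala
// Hypothetical expensive but parallelizable transformation.
def expensiveTransformation(ds: org.apache.spark.sql.DataFrame) =
  ds.withColumn("doubled", $"value" * 2)

// coalesce(1) may pull the upstream work onto a single task;
// repartition(1) keeps the upstream computation parallel and only
// shuffles everything into one partition at the very end.
expensiveTransformation(events).repartition(1).write.mode("overwrite").csv("/tmp/single-file-output")
```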
I can think of only one use case when I am sure that the `coalesce` function is a better choice. If I have just used `where` or `filter`, and I think that some of the partitions may be “almost empty”, I will want to move the data around to have a similar number of rows in every partition. `coalesce` will not give me a uniform distribution, but it is going to be faster than `repartition`, and it should be good enough in this case.
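For example, a sketch of that situation, again using the hypothetical `events` DataFrame:

```scala
// Sketch: after a selective filter, many partitions may be nearly empty.
// coalesce compacts them without paying for a full shuffle.
val filtered = events.filter($"value" > 1)
val compacted = filtered.coalesce(4)
```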
Is Coalesce Worth Consideration In Practice?
Most likely not. Usually, Spark jobs are full of `group by` and `join` operations, so avoiding full reshuffles by using `coalesce` in a few places will not make a huge difference.