In Spark, there are two ways of explicitly changing the number of partitions: the `repartition` function and `coalesce`. Both accept a numeric parameter that tells Spark how many partitions we want to have. In this article, I explain the difference between those two functions and when we should use them.
Table of Contents
- Repartition
- Coalesce
- When to Use Coalesce Instead of Repartition?
- Is Coalesce Worth Consideration In Practice?
Repartition
First of all, the `repartition` function that operates on Spark `Dataset`s comes in three versions. The first one accepts only the number of partitions we want to get. In this case, Spark performs `RoundRobinPartitioning` to uniformly distribute the whole `Dataset` across the requested number of partitions.
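Here is a minimal sketch of that first variant, assuming a local `SparkSession` and a small, hypothetical `events` DataFrame. Only the target partition count is passed, so Spark spreads the rows round-robin:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: a local session and a tiny, hypothetical dataset.
val spark = SparkSession.builder().appName("repartition-demo").master("local[*]").getOrCreate()
import spark.implicits._

val events = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

// Only the number of partitions is given, so Spark uses RoundRobinPartitioning
// and distributes the rows uniformly across 8 partitions.
val evenlyDistributed = events.repartition(8)
println(evenlyDistributed.rdd.getNumPartitions) // 8
```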
If we specify both the expression and the number of partitions, Spark does hash partitioning (`HashPartitioning`), which produces the requested number of partitions with the data partitioned by the given expression. Note that the result will most likely not be uniformly distributed. We may end up with partitions that contain significantly more data than others if the partitioning value is skewed.
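Continuing the sketch above, this is what the expression-plus-count variant looks like (`key` is the hypothetical partitioning column):

```scala
// Sketch: hash partitioning on `key` into 8 partitions. Rows with the same key
// land in the same partition, so a skewed key produces a skewed partition.
val byKey = events.repartition(8, $"key")
println(byKey.rdd.getNumPartitions) // 8
```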
In addition to that, we may use the version that accepts only the partitioning expression and uses the `spark.sql.shuffle.partitions` parameter as the number of partitions.
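A sketch of the expression-only variant, still using the hypothetical `events` DataFrame; the partition count is taken from the shuffle configuration:

```scala
// Sketch: only a partitioning expression is given, so the number of output
// partitions comes from spark.sql.shuffle.partitions (200 by default).
spark.conf.set("spark.sql.shuffle.partitions", "100")
val byKeyDefaultCount = events.repartition($"key")
println(byKeyDefaultCount.rdd.getNumPartitions) // 100
```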
We see that the `repartition` function can both increase and decrease the number of partitions. If we do not use a partitioning expression, we get partitions that distribute the workload equally across all worker nodes (unless we use another operation that implicitly repartitions the data using hash partitioning).
Coalesce
`coalesce` is a little bit different. It accepts only one parameter: there is no way to use a partitioning expression, and it can only decrease the number of partitions. It works this way because we should use `coalesce` only to combine existing partitions. It merges the data by draining existing partitions into others and removing the now-empty partitions.
Note that if we request a number of partitions larger than the current one, `coalesce` will not do anything.
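A short sketch, reusing the `evenlyDistributed` Dataset from the earlier example:

```scala
// Sketch: coalesce merges existing partitions without a full shuffle.
val merged = evenlyDistributed.coalesce(2)
println(merged.rdd.getNumPartitions) // 2

// Requesting more partitions than we currently have changes nothing.
println(merged.coalesce(16).rdd.getNumPartitions) // still 2
```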
When to Use Coalesce Instead of Repartition?
According to the documentation, it is better to run `repartition` instead of `coalesce` when we want to do a “drastic coalesce” (for example, merge everything into one partition). With `repartition`, the upstream calculations still execute in parallel, so the entire computation stays distributed across a larger number of workers, and only the final step moves everything into one partition.
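A hedged illustration of that trade-off, building on the sketch above (the `expensiveTransformation` helper is purely hypothetical):

```scala
// Hypothetical expensive but parallelizable transformation.
def expensiveTransformation(ds: org.apache.spark.sql.DataFrame) =
  ds.withColumn("doubled", $"value" * 2)

// coalesce(1) may pull the upstream work onto a single task;
// repartition(1) keeps the upstream computation parallel and only
// shuffles everything into one partition at the very end.
expensiveTransformation(events).repartition(1).write.mode("overwrite").csv("/tmp/single-file-output")
```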
I can think of only one use case when I am sure that the `coalesce` function is a better choice. If I have just used `where` or `filter`, and I think that some of the partitions may be “almost empty”, I will want to move the data around to have a similar number of rows in every partition. `coalesce` will not give me a uniform distribution, but it is going to be faster than `repartition`, and it should be good enough in this case.
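For example, a sketch of that situation, again using the hypothetical `events` DataFrame:

```scala
// Sketch: after a selective filter, many partitions may be nearly empty.
// coalesce compacts them without paying for a full shuffle.
val filtered = events.filter($"value" > 1)
val compacted = filtered.coalesce(4)
```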
Is Coalesce Worth Consideration In Practice?
Most likely not. Usually, Spark jobs are full of `group by` and `join` operations, so avoiding full reshuffles by using `coalesce` in a few places will not make a huge difference.