---
title: "What is the difference between repartition and coalesce in Apache Spark?"
description: "When should you use coalesce instead of repartition in Apache Spark"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/what-is-the-difference-between-repartition-and-coalesce-in-apache-spark
---

In Spark, there are two ways of explicitly changing the number of partitions. We can use either the `repartition` function or `coalesce`. Both accept a numeric parameter, which tells Spark how many partitions we want to have. I will explain the difference between those two functions and when we should use them.

## Repartition

First of all, the `repartition` function that performs operations on Spark `Datasets` comes in three versions. There is a function that accepts only the number of partitions we want to get. In this case, Spark performs `RoundRobinPartitioning` to uniformly distribute the whole `Dataset` across the requested number of partitions.

If we specify both the expression and the number of partitions, Spark does `hashpartitioning`, which produces the requested number of partitions in which data is partitioned using the given expression.

Note that, the result will most likely not be uniformly distributed. We may end up with partitions that contain significantly more data if the partitioning value is skewed.

In addition to that, we may use the function that accepts only the partitioning expression and uses the `sql.shuffle.partitions` parameter as the number of partitions.

We see that the `repartition` function can both increase and decrease the number of partitions. Suppose we do not use the partitioning expression. In that case, we get partitions that distribute the workload equally across all worker nodes (unless we use another operation that implicitly repartitions the data using `hashpartitioning`).

## Coalesce

Coalesce is a little bit different. It accepts only one parameter - there is no way to use the partitioning expression, and it can only decrease the number of partitions. It works this way because we should use `coalesce` only to combine the existing partitions. It merges the data by draining existing partitions into others and removing the empty partitions.

Note that if we specify the number of partitions larger than the current partitioning, coalesce will not do anything.

## When to Use Coalesce Instead of Repartition?

According to the documentation, it is better to run `repartition` instead of `coalesce` when we want to do “drastic coalesce” (for example, merge everything into one partition). When we do it, the upstream calculations execute in parallel, so the entire computation is distributed across a larger number of workers.

I can think of only one use case when I am sure that the coalesce function is a better choice. If I have just used `where` or `filter`, and I think that some of the partitions may be “almost empty”, I will want to move the data around to have a similar number of rows in every partition. `coalesce` will not give me uniformed distribution, but it is going to be faster then `repartition`, and it should be good enough in this case.

## Is Coalesce Worth Consideration In Practice?

Most likely no. Usually, the Spark jobs are full of the `group by` and `join` operations, so avoiding full reshuffles using coalesce in a few places will not make a huge difference.