---
title: "What is the difference between CUBE and ROLLUP and how to use it in Apache Spark?"
description: "How to use the cube and rollup functions in Apache Spark or PySpark. What is the difference between a cube and a rollup."
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/cube-and-rollup-in-apache-spark
---

In this article, I will explain the difference between the `cube` and `rollup` operators, and show how to use them in Apache Spark (or PySpark).

Both functions group data by multiple columns and calculate an aggregation of another column. `cube` computes the aggregation for every possible combination of the listed columns. `rollup` computes it only for a hierarchy of column prefixes, starting from the first listed column.

For example, if I were grouping values by the columns `year`, `month`, and `day` to calculate the number of rows, `cube` would produce the following combinations. The second column shows the equivalent `GROUP BY` statement.

| CUBE | GROUP BY |
|---|---|
| year, month, day | SELECT COUNT(*) FROM table GROUP BY year, month, day |
| year, month | SELECT COUNT(*) FROM table GROUP BY year, month |
| year, day | SELECT COUNT(*) FROM table GROUP BY year, day |
| year | SELECT COUNT(*) FROM table GROUP BY year |
| month, day | SELECT COUNT(*) FROM table GROUP BY month, day |
| month | SELECT COUNT(*) FROM table GROUP BY month |
| day | SELECT COUNT(*) FROM table GROUP BY day |
| null, null, null | SELECT COUNT(*) FROM table |

If I used the `rollup` function, the grouping would look like this:

| ROLLUP | GROUP BY |
|---|---|
| year, month, day | SELECT COUNT(*) FROM table GROUP BY year, month, day |
| year, month | SELECT COUNT(*) FROM table GROUP BY year, month |
| year | SELECT COUNT(*) FROM table GROUP BY year |
| null, null, null | SELECT COUNT(*) FROM table |
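
In Spark SQL, the same expansions are available as the `WITH CUBE` and `WITH ROLLUP` modifiers of `GROUP BY`. Here is a minimal sketch, assuming a DataFrame `df` with `year`, `month`, and `day` columns, an active `SparkSession` named `spark`, and a hypothetical view name `events`:

```python
# Register a temporary view so we can query the data with SQL:
df.createOrReplaceTempView("events")

# WITH CUBE expands into all eight GROUP BY combinations from the first table:
cube_df = spark.sql("""
    SELECT year, month, day, COUNT(*) AS cnt
    FROM events
    GROUP BY year, month, day WITH CUBE
""")

# WITH ROLLUP expands into the four hierarchical combinations from the second table:
rollup_df = spark.sql("""
    SELECT year, month, day, COUNT(*) AS cnt
    FROM events
    GROUP BY year, month, day WITH ROLLUP
""")
```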

## Example in PySpark

Suppose that I have a DataFrame representing sales of financial products. It contains the product category, the name of the salesperson, and the number of products sold.

```
+---------------+---------+--------+
|       category|     name|how_many|
+---------------+---------+--------+
|      insurance|   Janusz|       0|
|savings account|  Grażyna|       1|
|    credit card|Sebastian|       0|
|       mortgage|   Janusz|       2|
|   term deposit|   Janusz|       4|
|      insurance|  Grażyna|       2|
|savings account|   Janusz|       5|
|    credit card|Sebastian|       2|
|       mortgage|Sebastian|       4|
|   term deposit|   Janusz|       9|
|      insurance|  Grażyna|       3|
|savings account|  Grażyna|       1|
|savings account|Sebastian|       0|
|savings account|Sebastian|       2|
|    credit card|Sebastian|       1|
+---------------+---------+--------+
```
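
If you want to follow along, here is a minimal sketch that recreates this DataFrame, assuming a running `SparkSession` named `spark`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data shown above:
df = spark.createDataFrame(
    [
        ("insurance", "Janusz", 0),
        ("savings account", "Grażyna", 1),
        ("credit card", "Sebastian", 0),
        ("mortgage", "Janusz", 2),
        ("term deposit", "Janusz", 4),
        ("insurance", "Grażyna", 2),
        ("savings account", "Janusz", 5),
        ("credit card", "Sebastian", 2),
        ("mortgage", "Sebastian", 4),
        ("term deposit", "Janusz", 9),
        ("insurance", "Grażyna", 3),
        ("savings account", "Grażyna", 1),
        ("savings account", "Sebastian", 0),
        ("savings account", "Sebastian", 2),
        ("credit card", "Sebastian", 1),
    ],
    ["category", "name", "how_many"],
)
```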

When I use the `cube` function to calculate the sum of sales grouped by category and name, the code looks like this:

```python
from pyspark.sql.functions import sum  # Spark's sum, not the Python built-in

df.cube('category', 'name').agg(sum('how_many'))
```

As a result, I get a DataFrame with all combinations of the category and the name. It also includes rows where either (or both) of those columns is `null`; those rows carry the subtotals:

```
+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|           null|  Grażyna|            7|
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|           null|Sebastian|            9|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
|           null|   Janusz|           20|
+---------------+---------+-------------+
```
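
In this output, `null` plays two roles: it marks the subtotal rows produced by `cube`, but it would also appear if the source data itself contained nulls. Spark's `grouping` function can tell the two apart. A sketch:

```python
from pyspark.sql.functions import grouping, sum

# grouping(col) returns 1 when the column was nulled out by cube
# (a subtotal row) and 0 when the value comes from the data itself:
df.cube('category', 'name') \
    .agg(sum('how_many'), grouping('name').alias('name_is_subtotal')) \
    .show()
```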

When I am interested in sales grouped primarily by category, but I also want a per-salesperson breakdown within each category, I should use the `rollup` function:

```python
df.rollup('category', 'name').agg(sum('how_many'))
```

```
+---------------+---------+-------------+
|       category|     name|sum(how_many)|
+---------------+---------+-------------+
|       mortgage|     null|            6|
|           null|     null|           36|
|      insurance|     null|            5|
|savings account|  Grażyna|            2|
|    credit card|Sebastian|            3|
|   term deposit|     null|           13|
|      insurance|  Grażyna|            5|
|   term deposit|   Janusz|           13|
|savings account|     null|            9|
|      insurance|   Janusz|            0|
|       mortgage|Sebastian|            4|
|savings account|   Janusz|            5|
|       mortgage|   Janusz|            2|
|savings account|Sebastian|            2|
|    credit card|     null|            3|
+---------------+---------+-------------+
```
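
`rollup` does not order its output, so the hierarchy is easier to read after sorting with nulls last: each category's per-salesperson rows come first, then the category subtotal, and the grand total ends the result. A readability sketch:

```python
from pyspark.sql.functions import asc_nulls_last, sum

# Sort with nulls last so subtotal rows follow the rows they summarize:
df.rollup('category', 'name') \
    .agg(sum('how_many').alias('total')) \
    .orderBy(asc_nulls_last('category'), asc_nulls_last('name')) \
    .show()
```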