In this article, I will explain the difference between the cube
and rollup
operators, and show how to use them in Apache Spark (or PySpark).
Table of Contents
Both functions are used to group data by multiple columns and calculate an aggregation of another column. cube
creates combinations of all values in all listed columns. rollup
returns a hierarchy of values using the given columns starting from the first given column.
For example, if I was grouping values by columns year
, month
, and day
to calculate the number of rows, cube
would return the following combinations. In the second column, I explain what the corresponding group by
statement is.
CUBE | GROUP BY |
---|---|
year, month, day | SELECT COUNT(*) FROM table GROUP BY year, month, day |
year, month | SELECT COUNT(*) FROM table GROUP BY year, month |
year, day | SELECT COUNT(*) FROM table GROUP BY year, day |
year | SELECT COUNT(*) FROM table GROUP BY year |
month, day | SELECT COUNT(*) FROM table GROUP BY month, day |
month | SELECT COUNT(*) FROM table GROUP BY month |
day | SELECT COUNT(*) FROM table GROUP BY day |
null, null, null | SELECT COUNT(*) FROM table |
If I used the rollup
function, the grouping would look like this:
ROLLUP | GROUP BY |
---|---|
year, month, day | SELECT COUNT(*) FROM table GROUP BY year, month, day |
year, month | SELECT COUNT(*) FROM table GROUP BY year, month |
year | SELECT COUNT(*) FROM table GROUP BY year |
null | SELECT COUNT(*) FROM table |
Want to build AI systems that actually work?
Download my expert-crafted GenAI Transformation Guide for Data Teams and discover how to properly measure AI performance, set up guardrails, and continuously improve your AI solutions like the pros.
Example in PySpark
Suppose that I have a DataFrame representing sales of financial products that contain the product category, the name of the salesperson, and the number of products sold.
+---------------+---------+--------+
| category| name|how_many|
+---------------+---------+--------+
| insurance| Janusz| 0|
|savings account| Grażyna| 1|
| credit card|Sebastian| 0|
| mortgage| Janusz| 2|
| term deposit| Janusz| 4|
| insurance| Grażyna| 2|
|savings account| Janusz| 5|
| credit card|Sebastian| 2|
| mortgage|Sebastian| 4|
| term deposit| Janusz| 9|
| insurance| Grażyna| 3|
|savings account| Grażyna| 1|
|savings account|Sebastian| 0|
|savings account|Sebastian| 2|
| credit card|Sebastian| 1|
+---------------+---------+--------+
When I use the cube
function to calculate the sum of sales grouped by category and name, the code looks like this:
df.cube('category', 'name').agg(sum('how_many'))
As a result, I get a DataFrame of all combinations of the category and the name. It also includes the combinations where any (or both) of those columns are not defined:
+---------------+---------+-------------+
| category| name|sum(how_many)|
+---------------+---------+-------------+
| null| Grażyna| 7|
| mortgage| null| 6|
| null| null| 36|
| insurance| null| 5|
|savings account| Grażyna| 2|
| credit card|Sebastian| 3|
| term deposit| null| 13|
| insurance| Grażyna| 5|
| null|Sebastian| 9|
| term deposit| Janusz| 13|
|savings account| null| 9|
| insurance| Janusz| 0|
| mortgage|Sebastian| 4|
|savings account| Janusz| 5|
| mortgage| Janusz| 2|
|savings account|Sebastian| 2|
| credit card| null| 3|
| null| Janusz| 20|
+---------------+---------+-------------+
When I am interested in sales grouped primarily by the category, but I also want to know who is the best salesperson in each category, I should use the rollup
function:
df.rollup('category', 'name').agg(sum('how_many'))
+---------------+---------+-------------+
| category| name|sum(how_many)|
+---------------+---------+-------------+
| mortgage| null| 6|
| null| null| 36|
| insurance| null| 5|
|savings account| Grażyna| 2|
| credit card|Sebastian| 3|
| term deposit| null| 13|
| insurance| Grażyna| 5|
| term deposit| Janusz| 13|
|savings account| null| 9|
| insurance| Janusz| 0|
| mortgage|Sebastian| 4|
|savings account| Janusz| 5|
| mortgage| Janusz| 2|
|savings account|Sebastian| 2|
| credit card| null| 3|
+---------------+---------+-------------+