In the previous article, I showed how to make a pivot DataFrame
using Apache Spark. This blog post shows the same operation in AWS Athena. Because Athena is built on top of PrestoDB, this method will also work in PrestoDB and PrestoSQL.
Table of Contents
Let’s use the same data as in the previous text. We have the following table that describes the sales of financial products. We have the category, the name of the salesperson, and the number of products sold.
+---------------+---------+--------+
| category| name|how_many|
+---------------+---------+--------+
| insurance| Janusz| 0|
|savings account| Grażyna| 1|
| credit card|Sebastian| 0|
| mortgage| Janusz| 2|
| term deposit| Janusz| 4|
| insurance| Grażyna| 2|
|savings account| Janusz| 5|
| credit card|Sebastian| 2|
| mortgage|Sebastian| 4|
| term deposit| Janusz| 9|
| insurance| Grażyna| 3|
|savings account| Grażyna| 1|
|savings account|Sebastian| 0|
|savings account|Sebastian| 2|
| credit card|Sebastian| 1|
+---------------+---------+--------+
We want to create a pivot table that contains the category, and separate columns for every person. The rows will contain the sum of products sold by each person in every product category.
There Is No Pivot Function in Athena
The first problem arises because Athena and PrestoSQL don’t have a PIVOT function. We need to use weird workaround. First, we have to group all salespeople into an array and store those arrays inside a map.
SELECT category, multimap_agg(name, how_many) as agg
FROM sales
GROUP BY category
This query creates an output which should look like this:
+---------------+-----------------------------------------------+
| category| agg |
+---------------+-----------------------------------------------+
| insurance| {Janusz=[0], Grażyna=[2, 3]}|
|savings account| {Janusz=[5], Grażyna=[1, 1], Sebastian=[0, 2]|
| credit card| {Sebastian=[0, 2, 1]}|
| mortgage| {Janusz=[2], Sebastian=[4]}|
| term deposit| {Janusz=[4, 9]|
+---------------+-----------------------------------------------+
Now, we have to use the reduce function to calculate the sum of every person’s sales. The previous SELECT
statement becomes a subquery of the following SQL:
SELECT
category,
reduce(agg['Grażyna'], 0.0, (s, x) -> s + x, s -> s) AS "Grażyna",
reduce(agg['Janusz'], 0.0, (s, x) -> s + x, s -> s) AS "Janusz",
reduce(agg['Sebastian'], 0.0, (s, x) -> s + x, s -> s) AS "Sebastian"
FROM (
SELECT category, multimap_agg(name, how_many) as agg
FROM sales
GROUP BY category
)
Finally, we get the result:
+---------------+-------+------+---------+
| category|Grażyna|Janusz|Sebastian|
+---------------+-------+------+---------+
|savings account| 2| 5| 2|
| term deposit| null| 13| null|
| mortgage| null| 2| 4|
| credit card| null| null| 3|
| insurance| 5| 0| null|
+---------------+-------+------+---------+
Want to build AI systems that actually work?
Download my expert-crafted GenAI Transformation Guide for Data Teams and discover how to properly measure AI performance, set up guardrails, and continuously improve your AI solutions like the pros.
Why There Is No Pivot Function
Unfortunately, it is not possible to automatically generate the column names, and we have to list all of the values we want and use the reduce function to extract them. In PrestoSQL, the schema must be available during query planning, so the analyzer can access all the information it needs.
In the case of the pivot function, the schema depends on the data in the table, so PrestoSQL (and Athena) would have to execute half of the query, look at the available values, and rewrite the execution plan of the subsequent operations. This limitation is also the reason why the pivot function does not exist.
For a detailed explanation, please take a look at the discussion on the PrestoSQL GitHub page: https://github.com/prestosql/presto/issues/1206