It does not happen very often, but sometimes when we group rows by a column, we don’t want to aggregate the data. Occasionally, we want to get all elements of another column as a list.
Table of Contents
Imagine that I have the following table with two columns:
C1 C2
AAA AA1
AAA AA2
AAA AA2
BBB BB1
BBB BB2
CCC CC1
and I would like to get the following result:
AAA [AA1, AA2, AA2]
BBB [BB1, BB2]
CCC [CC1]
In Hive, we can achieve that result when we group by the C1 column and use the the collect_list
function to combine the data from the C2 column:
SELECT C1, collect_list(C2)
FROM some_table
GROUP BY C1
Want to build AI systems that actually work?
Download my expert-crafted GenAI Transformation Guide for Data Teams and discover how to properly measure AI performance, set up guardrails, and continuously improve your AI solutions like the pros.
Removing duplicates
You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the collect_list
function with the collect_set
function:
SELECT C1, collect_set(C2)
FROM some_table
GROUP BY C1
In this case, the output will look like this:
AAA [AA1, AA2]
BBB [BB1, BB2]
CCC [CC1]