It does not happen very often, but sometimes when we group rows by a column, we don’t want to aggregate the data. Occasionally, we want to get all elements of another column as a list.
Table of Contents
Imagine that I have the following table with two columns:
C1 C2
AAA AA1
AAA AA2
AAA AA2
BBB BB1
BBB BB2
CCC CC1
and I would like to get the following result:
AAA [AA1, AA2, AA2]
BBB [BB1, BB2]
CCC [CC1]
In Hive, we can achieve that result when we group by the C1 column and use the the collect_list
function to combine the data from the C2 column:
SELECT C1, collect_list(C2)
FROM some_table
GROUP BY C1
Removing duplicates
You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the collect_list
function with the collect_set
function:
SELECT C1, collect_set(C2)
FROM some_table
GROUP BY C1
In this case, the output will look like this:
AAA [AA1, AA2]
BBB [BB1, BB2]
CCC [CC1]