It does not happen very often, but sometimes when we group rows by a column, we don’t want to aggregate the data. Occasionally, we want to get all elements of another column as a list.
Table of Contents
Imagine that I have the following table with two columns:
C1 C2
AAA AA1
AAA AA2
AAA AA2
BBB BB1
BBB BB2
CCC CC1
and I would like to get the following result:
AAA [AA1, AA2, AA2]
BBB [BB1, BB2]
CCC [CC1]
In Hive, we can achieve that result when we group by the C1 column and use the the collect_list
function to combine the data from the C2 column:
SELECT C1, collect_list(C2)
FROM some_table
GROUP BY C1
Get Weekly AI Implementation Insights
Join engineering leaders who receive my analysis of common AI production failures and how to prevent them. No fluff, just actionable techniques.
Removing duplicates
You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the collect_list
function with the collect_set
function:
SELECT C1, collect_set(C2)
FROM some_table
GROUP BY C1
In this case, the output will look like this:
AAA [AA1, AA2]
BBB [BB1, BB2]
CCC [CC1]