It does not happen very often, but sometimes when we group rows by a column, we don’t want to aggregate the data. Occasionally, we want to get all elements of another column as a list.

Table of Contents

  1. Removing duplicates

Imagine that I have the following table with two columns:

C1  C2
AAA AA1
AAA AA2
AAA AA2
BBB BB1
BBB BB2
CCC CC1

and I would like to get the following result:

AAA [AA1, AA2, AA2]
BBB [BB1, BB2]
CCC [CC1]

In Hive, we can achieve that result when we group by the C1 column and use the the collect_list function to combine the data from the C2 column:

SELECT C1, collect_list(C2)
FROM some_table
GROUP BY C1

Removing duplicates

You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the collect_list function with the collect_set function:

SELECT C1, collect_set(C2)
FROM some_table
GROUP BY C1

In this case, the output will look like this:

AAA [AA1, AA2]
BBB [BB1, BB2]
CCC [CC1]
Subscribe to the newsletter
Now Enrolling: A new cohort for my premium course on fixing AI hallucinations. Limited 'Founding Member' spots available. Learn more