Hive

How to get an array/bag of elements from the Hive group by operator?

How to get an array of elements from one column when grouping by another column in Hive

Bartosz Mikulski 06 Oct 2020 – 1 min read

It does not happen very often, but sometimes when we group rows by a column, we don’t want to aggregate the data. Occasionally, we want to get all elements of another column as a list.

Table of Contents

Removing duplicates

Imagine that I have the following table with two columns:

C1  C2
AAA AA1
AAA AA2
AAA AA2
BBB BB1
BBB BB2
CCC CC1

and I would like to get the following result:

AAA [AA1, AA2, AA2]
BBB [BB1, BB2]
CCC [CC1]

In Hive, we can achieve that result when we group by the C1 column and use the the collect_list function to combine the data from the C2 column:

SELECT C1, collect_list(C2)
FROM some_table
GROUP BY C1

Removing duplicates

You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the collect_list function with the collect_set function:

SELECT C1, collect_set(C2)
FROM some_table
GROUP BY C1

In this case, the output will look like this:

AAA [AA1, AA2]
BBB [BB1, BB2]
CCC [CC1]

Older post

How to get an array/bag of elements from the Hive group by operator?

Removing duplicates

Working with dates and time in Apache Spark

How to concatenate multiple MySQL rows into a single field?

How to get an array/bag of elements from the Hive group by operator?

Removing duplicates

Working with dates and time in Apache Spark

How to concatenate multiple MySQL rows into a single field?

Related Posts

How to write Hive queries with column position number in the GROUP BY or ORDER BY clauses

How to check whether a regular expression matches a string in Hive

How to find the Hive partition closest to a given date