---
title: "How to get an array/bag of elements from the Hive group by operator?"
description: "How to get an array of elements from one column when grouping by another column in Hive"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/hive-group-by-get-an-array
---

It does not happen very often, but sometimes when we group rows by a column, we don't want to aggregate the data. Occasionally, we want to get all elements of another column as a list.

Imagine that I have the following table with two columns:

```
C1  C2
AAA AA1
AAA AA2
AAA AA2
BBB BB1
BBB BB2
CCC CC1
```

and I would like to get the following result:

```
AAA [AA1, AA2, AA2]
BBB [BB1, BB2]
CCC [CC1]
```

In Hive, we can achieve that result when we group by the C1 column and use the the `collect_list` function to combine the data from the C2 column:

```sql
SELECT C1, collect_list(C2)
FROM some_table
GROUP BY C1
```

## Removing duplicates

You have probably noticed that the AA2 value occurs twice in my input dataset and the output. If I wanted to remove the duplicates, I should substitute the `collect_list` function with the `collect_set` function:

```sql
SELECT C1, collect_set(C2)
FROM some_table
GROUP BY C1
```

In this case, the output will look like this:

```
AAA [AA1, AA2]
BBB [BB1, BB2]
CCC [CC1]
```