This article will show you how to extract the struct field and convert them into separate columns in a Spark DataFrame.
Let’s assume that I have the following DataFrame, and the to_be_flattened column contains a struct with two fields:
+-------------------+
| to_be_flattened|
+-------------------+
| [1183, Amsterdam]|
| [06123, Ankara]|
| [08067, Barcelona]|
| [3030, Bern]|
| [75116, Paris]|
| [1149-014, Lisbon]|
| [00-999, Warsaw]|
| [00199, Rome]|
|[HR-10 040, Zagreb]|
+-------------------+
Extracting those fields into columns is trivial, and we need only this line of code to achieve it:
df.select(col('to_be_flattened.*'))
As a result, we get this DataFrame:
+-----------+---------+
|postal_code| city|
+-----------+---------+
| 1183|Amsterdam|
| 06123| Ankara|
| 08067|Barcelona|
| 3030| Bern|
| 75116| Paris|
| 1149-014| Lisbon|
| 00-999| Warsaw|
| 00199| Rome|
| HR-10 040| Zagreb|
+-----------+---------+
We have lost the original column name. What if I wanted to prefix the extracted columns with its previous name, and instead of postal_code and city have columns to_be_flattened_postal_code and to_be_flattened_city?
We can do it by getting the field names from the struct schema, iterating over them, and adding the prefix to every field:
df.select(col('to_be_flattened.*')) \
.select([col(c).alias('to_be_flattened_' + c) for c in struct_schema.fieldNames()])
+---------------------------+--------------------+
|to_be_flattened_postal_code|to_be_flattened_city|
+---------------------------+--------------------+
| 1183| Amsterdam|
| 06123| Ankara|
| 08067| Barcelona|
| 3030| Bern|
| 75116| Paris|
| 1149-014| Lisbon|
| 00-999| Warsaw|
| 00199| Rome|
| HR-10 040| Zagreb|
+---------------------------+--------------------+