This article shows how to extract struct fields and turn them into separate columns in a Spark DataFrame.

Let’s assume that I have the following DataFrame, and the to_be_flattened column contains a struct with two fields:

+-------------------+
|    to_be_flattened|
+-------------------+
|  [1183, Amsterdam]|
|    [06123, Ankara]|
| [08067, Barcelona]|
|       [3030, Bern]|
|     [75116, Paris]|
| [1149-014, Lisbon]|
|   [00-999, Warsaw]|
|      [00199, Rome]|
|[HR-10 040, Zagreb]|
+-------------------+
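If you want to follow along, a minimal setup like the one below recreates an equivalent DataFrame (the struct field names postal_code and city are the ones that appear in the flattened output later):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, struct

spark = SparkSession.builder.getOrCreate()

# Example rows matching the table above; postal codes are strings
# because some of them contain letters and dashes.
data = [
    ('1183', 'Amsterdam'), ('06123', 'Ankara'), ('08067', 'Barcelona'),
    ('3030', 'Bern'), ('75116', 'Paris'), ('1149-014', 'Lisbon'),
    ('00-999', 'Warsaw'), ('00199', 'Rome'), ('HR-10 040', 'Zagreb'),
]

df = spark.createDataFrame(data, ['postal_code', 'city']) \
    .select(struct('postal_code', 'city').alias('to_be_flattened'))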

Extracting those fields into columns is trivial, and we need only this line of code to achieve it:

df.select(col('to_be_flattened.*'))

As a result, we get this DataFrame:

+-----------+---------+
|postal_code|     city|
+-----------+---------+
|       1183|Amsterdam|
|      06123|   Ankara|
|      08067|Barcelona|
|       3030|     Bern|
|      75116|    Paris|
|   1149-014|   Lisbon|
|     00-999|   Warsaw|
|      00199|     Rome|
|  HR-10 040|   Zagreb|
+-----------+---------+

We have lost the original column name. What if I wanted to prefix the extracted columns with the original name, so that instead of postal_code and city I get to_be_flattened_postal_code and to_be_flattened_city?

We can do it by getting the field names from the struct schema, iterating over them, and adding the prefix to every field:

struct_schema = df.schema['to_be_flattened'].dataType

df.select(col('to_be_flattened.*')) \
    .select([col(c).alias('to_be_flattened_' + c) for c in struct_schema.fieldNames()])
+---------------------------+--------------------+
|to_be_flattened_postal_code|to_be_flattened_city|
+---------------------------+--------------------+
|                       1183|           Amsterdam|
|                      06123|              Ankara|
|                      08067|           Barcelona|
|                       3030|                Bern|
|                      75116|               Paris|
|                   1149-014|              Lisbon|
|                     00-999|              Warsaw|
|                      00199|                Rome|
|                  HR-10 040|              Zagreb|
+---------------------------+--------------------+
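As a side note, we can get the same prefixed columns in a single select by referencing the struct fields with their full path. A sketch, assuming the same to_be_flattened column:

struct_schema = df.schema['to_be_flattened'].dataType

# Reference every struct field by its full path and alias it with the prefix,
# skipping the intermediate flattening step.
df.select([
    col('to_be_flattened.' + c).alias('to_be_flattened_' + c)
    for c in struct_schema.fieldNames()
])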
