In the previous article, I described how to split a single column into multiple columns. In this one, I will show you how to do the opposite and merge multiple columns into one column.
Suppose that I have the following DataFrame, and I would like to create a column that contains the values from both of those columns with a single space in between:
+--------+--------+
|column_A|column_B|
+--------+--------+
| val_A_1| val_A_2|
| val_B_1| val_B_2|
| null| val_B_3|
| val_A_4| null|
| null| null|
+--------+--------+
In Spark, we can do that in two ways that give us a slightly different result.
concat
The first method is to use the concat function:
from pyspark.sql.functions import concat, col, lit
df.withColumn('joined', concat(col('column_A'), lit(' '), col('column_B')))
+--------+--------+---------------+
|column_A|column_B| joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
| null| val_B_3| null|
| val_A_4| null| null|
| null| null| null|
+--------+--------+---------------+
concat_ws
We see that if any of the values is null, we get null as a result. What if we prefer to ignore the null values and concatenate the remaining columns? Of course, we could use the nvl function to replace nulls with empty strings or the when function to build conditional expressions, but there is an easier method.
To eliminate the null values without breaking the concatenation, we can use the concat_ws function. That function works a little differently than concat. As the first parameter, we must pass the separator that it will put between all of the columns. After that, we specify the columns to merge:
from pyspark.sql.functions import concat_ws, col
df.withColumn('joined', concat_ws(' ', col('column_A'), col('column_B')))
As a result, we get the following DataFrame:
+--------+--------+---------------+
|column_A|column_B| joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
| null| val_B_3| val_B_3|
| val_A_4| null| val_A_4|
| null| null| |
+--------+--------+---------------+
Note that the last row contains an empty string, not a string with a single space inside! I can prove that using the length function:
from pyspark.sql.functions import concat_ws, col, length

df \
.withColumn('joined', concat_ws(' ', col('column_A'), col('column_B'))) \
.withColumn('length', length(col('joined')))
+--------+--------+---------------+------+
|column_A|column_B| joined|length|
+--------+--------+---------------+------+
| val_A_1| val_A_2|val_A_1 val_A_2| 15|
| val_B_1| val_B_2|val_B_1 val_B_2| 15|
| null| val_B_3| val_B_3| 7|
| val_A_4| null| val_A_4| 7|
| null| null| | 0|
+--------+--------+---------------+------+
The Wrong Way
To show that using the when function is a terrible idea, let’s try to achieve the same outcome as the concat_ws function.
First, I have to check whether a column is null and return an empty string in such a case:
when(col('column_A').isNotNull(), col('column_A')).otherwise(lit(''))
when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))
In addition to all of that code, I have to check whether both columns are not null. If either of them is null, I will have to use an empty string instead of a space as the second argument:
when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit(''))
The full code looks like this:
from pyspark.sql.functions import concat, when, col, lit

df \
.withColumn('null_to_blank', concat(
    when(col('column_A').isNotNull(), col('column_A')).otherwise(lit('')),
    when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit('')),
    when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))
))
Do not concatenate columns like this. Use concat_ws instead.