In the previous article, I described how to split a single column into multiple columns. In this one, I will show you how to do the opposite and merge multiple columns into one column.
Suppose that I have the following DataFrame, and I would like to create a column that contains the values from both of those columns with a single space in between:
+--------+--------+
|column_A|column_B|
+--------+--------+
| val_A_1| val_A_2|
| val_B_1| val_B_2|
| null| val_B_3|
| val_A_4| null|
| null| null|
+--------+--------+
In Spark, we can do that in two ways that give us a slightly different result.
concat
The first method is to use the concat function:
from pyspark.sql.functions import concat, col, lit
df.withColumn('joined', concat(col('column_A'), lit(' '), col('column_B')))
+--------+--------+---------------+
|column_A|column_B| joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
| null| val_B_3| null|
| val_A_4| null| null|
| null| null| null|
+--------+--------+---------------+
concat_ws
We see that if any of the values is null, we get null as a result. What if we prefer to ignore the null values and concatenate the remaining columns? Of course, we could use the nvl function to replace nulls with empty strings or the when function to build conditional expressions, but there is an easier method.
To eliminate the null values without breaking the concatenation, we can use the concat_ws function. That function works a little differently than concat. As the first parameter, we must pass the separator that it will put between all of the columns. After that, we specify the columns to merge:
from pyspark.sql.functions import concat_ws, col
df.withColumn('joined', concat_ws(' ', col('column_A'), col('column_B')))
As a result, we get the following DataFrame:
+--------+--------+---------------+
|column_A|column_B| joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
| null| val_B_3| val_B_3|
| val_A_4| null| val_A_4|
| null| null| |
+--------+--------+---------------+
Note that the last row contains an empty string, not a string with a single space inside! I can prove that using the length function:
from pyspark.sql.functions import concat_ws, col, length

df \
.withColumn('joined', concat_ws(' ', col('column_A'), col('column_B'))) \
.withColumn('length', length(col('joined')))
+--------+--------+---------------+------+
|column_A|column_B| joined|length|
+--------+--------+---------------+------+
| val_A_1| val_A_2|val_A_1 val_A_2| 15|
| val_B_1| val_B_2|val_B_1 val_B_2| 15|
| null| val_B_3| val_B_3| 7|
| val_A_4| null| val_A_4| 7|
| null| null| | 0|
+--------+--------+---------------+------+
The Wrong Way
To show that using the when function is a terrible idea, let’s try to achieve the same outcome as the concat_ws function.
First, I have to check whether a column is null and return an empty string in such a case:
when(col('column_A').isNotNull(), col('column_A')).otherwise(lit(''))
when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))
In addition to all of that code, I have to check whether both columns are not null. If either of them is null, I will have to use an empty string instead of a space as the second argument:
when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit(''))
The full code looks like this:
from pyspark.sql.functions import concat, when, col, lit

df \
.withColumn('null_to_blank', concat(
    when(col('column_A').isNotNull(), col('column_A')).otherwise(lit('')),
    when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit('')),
    when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))
))
Do not concatenate columns like this. Use concat_ws instead.