---
title: "How to concatenate columns in a PySpark DataFrame"
description: "How to use the concat and concat_ws functions to merge multiple columns into one in PySpark"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/how-to-concatenate-columns-in-pyspark-dataframe
---

In the [previous article](https://mikulskibartosz.name/derive-multiple-columns-from-single-column-in-pyspark/), I described how to split a single column into multiple columns. In this one, I will show you how to do the opposite and merge multiple columns into one column.

Suppose that I have the following DataFrame, and I would like to create a column that contains the values from both of those columns with a single space in between:

```
+--------+--------+
|column_A|column_B|
+--------+--------+
| val_A_1| val_A_2|
| val_B_1| val_B_2|
|    null| val_B_3|
| val_A_4|    null|
|    null|    null|
+--------+--------+
```

In Spark, we can do that in two ways that give us a slightly different result.

## concat

The first method is to use the `concat` function:

```python
df.withColumn('joined', concat(col('column_A'), lit(' '), col('column_B')))
```

```
+--------+--------+---------------+
|column_A|column_B|         joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
|    null| val_B_3|           null|
| val_A_4|    null|           null|
|    null|    null|           null|
+--------+--------+---------------+
```

## concat_ws

We see that if any of the values is null, we will get null as a result. What if we prefer to ignore the null values and concatenate the remaining columns? Of course, we could use the `nvl` function to replace nulls with empty strings or the `when` function to build conditional expressions, but there is an easier method.

To eliminate the null values without breaking the concatenation, we can use the `concat_ws` function. That function works a little bit differently than the `concat`. As the first parameter, we must pass the separator that it will put between all of the columns. After that, we specify the columns to merge:

```python
df.withColumn('joined', concat_ws(' ', col('column_A'), col('column_B')))
```

As a result, we get the following DataFrame:

```
+--------+--------+---------------+
|column_A|column_B|         joined|
+--------+--------+---------------+
| val_A_1| val_A_2|val_A_1 val_A_2|
| val_B_1| val_B_2|val_B_1 val_B_2|
|    null| val_B_3|        val_B_3|
| val_A_4|    null|        val_A_4|
|    null|    null|               |
+--------+--------+---------------+
```

Note that the last row contains an empty string, not a string with a single space inside! I can prove that using the `length` function:

```python
df \
        .withColumn('joined', concat_ws(' ', col('column_A'), col('column_B'))) \
        .withColumn('length', length(col('joined')))
```

```
+--------+--------+---------------+------+
|column_A|column_B|         joined|length|
+--------+--------+---------------+------+
| val_A_1| val_A_2|val_A_1 val_A_2|    15|
| val_B_1| val_B_2|val_B_1 val_B_2|    15|
|    null| val_B_3|        val_B_3|     7|
| val_A_4|    null|        val_A_4|     7|
|    null|    null|               |     0|
+--------+--------+---------------+------+
```

## The Wrong Way

To show that using the `when` function is a terrible idea, let's try to achieve the same outcome as the `concat_ws` function.

First, I have to check whether a column is null and return an empty string in such a case:

```python
when(col('column_A').isNotNull(), col('column_A')).otherwise(lit(''))
when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))
```

In addition to all of that code, I have to check whether both columns are not null. If any of them is null, I will have to use an empty string instead of space as the second argument:

```python
when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit(''))
```

The full code looks like this:

```python
df \
    .withColumn('null_to_blank', concat(
        when(col('column_A').isNotNull(), col('column_A')).otherwise(lit('')),
        when((col('column_A').isNotNull()) & (col('column_B').isNotNull()), lit(' ')).otherwise(lit('')),
        when(col('column_B').isNotNull(), col('column_B')).otherwise(lit(''))
    ))
```

**Do not concatenate columns like this. Use concat_ws instead.**