---
title: "Use regexp_replace to replace a matched string with a value of another column in PySpark"
description: "Use regex to replace the matched string with the content of another column in PySpark"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/replace-matched-string-with-another-column-in-pyspark
---

When we look at the documentation of `regexp_replace`, we see that it accepts three parameters:

* the name of the column
* the regular expression
* the replacement text

Unfortunately, we cannot specify the column name as the third parameter and use the column value as the replacement.

If I have the following DataFrame and use the regex_replace function to substitute the numbers with the content of the `b_column`:

```python
from pyspark.sql import DataFrame

from pyspark.sql.types import *
from pyspark.sql import functions as F

df_schema = StructType([StructField('a_column', StringType()), StructField('b_column', StringType())])

test_list = [
    ['aaa123', 'aaa'],
    ['bbb243', 'bbb']
]

df = spark_session.createDataFrame(test_list, schema=df_schema)

df \
    .withColumn('replaced', F.regexp_replace('a_column', '\d{3}', 'b_column')) \
    .show()
```

The resulting DataFrame is not even a little bit similar to the expected output:

```
+--------+--------+-----------+
|a_column|b_column|   replaced|
+--------+--------+-----------+
|  aaa123|     aaa|aaab_column|
|  bbb243|     bbb|bbbb_column|
+--------+--------+-----------+
```

The first solution that comes to mind is using the col function as the third parameter:

```python
.withColumn('replaced', F.regexp_replace('a_column', '\d{3}', F.col('b_column'))) \
```

This attempt fails too because we get `TypeError: Column is not iterable` error.

To solve the problem, we have to use the `expr` function and define the operation as a string:

```python
.withColumn('replaced', F.expr("regexp_replace(a_column, '([0-9]{3})', b_column)"))
```

