When we look at the documentation of regexp_replace, we see that it accepts three parameters:
- the name of the column
- the regular expression
- the replacement text
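For example, here is a minimal sketch of the usual usage, assuming a DataFrame df with a string column a_column (the masked column name is just an illustration):

from pyspark.sql import functions as F

# replace every three-digit sequence with the literal text 'X'
df.withColumn('masked', F.regexp_replace('a_column', r'\d{3}', 'X'))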
Unfortunately, we cannot specify the column name as the third parameter and use the column value as the replacement.
If I have the following DataFrame and use the regexp_replace function to substitute the numbers with the content of b_column:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F

spark_session = SparkSession.builder.getOrCreate()

df_schema = StructType([StructField('a_column', StringType()), StructField('b_column', StringType())])
test_list = [
    ['aaa123', 'aaa'],
    ['bbb243', 'bbb']
]
df = spark_session.createDataFrame(test_list, schema=df_schema)

# 'b_column' is passed as a plain string, so it is used as literal replacement text
df \
    .withColumn('replaced', F.regexp_replace('a_column', r'\d{3}', 'b_column')) \
    .show()
The resulting DataFrame is nothing like the expected output:
+--------+--------+-----------+
|a_column|b_column| replaced|
+--------+--------+-----------+
| aaa123| aaa|aaab_column|
| bbb243| bbb|bbbb_column|
+--------+--------+-----------+
As we can see, the third argument was treated as literal text, not as a column reference. The first solution that comes to mind is using the col function as the third parameter:
.withColumn('replaced', F.regexp_replace('a_column', r'\d{3}', F.col('b_column'))) \
This attempt fails too, and we get a TypeError: Column is not iterable error because the Python function expects the replacement to be a plain string, not a Column object.
To solve the problem, we have to use the expr function and define the operation as a string:
.withColumn('replaced', F.expr("regexp_replace(a_column, '([0-9]{3})', b_column)"))
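After this change, the replacement is resolved as a column reference, and the replaced column should contain the value of b_column in place of the digits:

+--------+--------+--------+
|a_column|b_column|replaced|
+--------+--------+--------+
|  aaa123|     aaa|  aaaaaa|
|  bbb243|     bbb|  bbbbbb|
+--------+--------+--------+

Side note: if I remember correctly, Spark 3.4 and later versions also accept Column arguments for the pattern and the replacement in the Python regexp_replace function, so the expr workaround is needed only on older versions.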