When validating data in PySpark, we often need the names of the columns that contain null values. In this article, I show how to get those names for every row of a DataFrame.
First, I assume that we have a DataFrame df and an array all_columns, which contains the names of the columns we want to validate.
We have to create a column containing an array of strings that denote the column names with null values. To do that, we use the when function to check whether a value is null and pass the column name as the literal value. We use * to unpack the list produced by the list comprehension into the arguments of Spark's array function:
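from pyspark.sql.functions import array, col, lit, when

# One array element per column: the column name if its value is null
missing_column_names = array(*[
    when(col(c).isNull(), lit(c)) for c in all_columns
])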
After that, we assign the values to a new column in the DataFrame:
df = df.withColumn("missing_columns", missing_column_names)
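To make the shape of the result concrete, here is a minimal plain-Python model of what the new column holds for each row. It is not Spark code, and the sample rows, the column names, and the helper functions missing_columns and drop_placeholders are hypothetical names invented for this illustration. The point it shows: because when is called without otherwise, every column whose value is present contributes a null to the array, so the array always has one entry per validated column.

```python
# Plain-Python model of the per-row result; this is not Spark code.
# The sample rows and column names below are hypothetical.
all_columns = ["id", "name", "email"]

rows = [
    {"id": 1, "name": "Ann", "email": None},
    {"id": 2, "name": None, "email": None},
    {"id": 3, "name": "Cy", "email": "cy@example.com"},
]


def missing_columns(row, columns):
    # Mirrors when(col(c).isNull(), lit(c)): the column name if the value
    # is null, otherwise None (when() without otherwise() yields null).
    return [c if row[c] is None else None for c in columns]


def drop_placeholders(names):
    # Optional cleanup: keep only the real column names.
    return [n for n in names if n is not None]


result = [missing_columns(r, all_columns) for r in rows]
compact = [drop_placeholders(names) for names in result]

for names, cleaned in zip(result, compact):
    print(names, "->", cleaned)
```

If only the names themselves are needed, the null placeholders have to be removed in a separate step, which the second helper models here.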