When we validate data in PySpark, we often need the names of all columns that contain null values. In this article, I show how to get those names for every row in the DataFrame.

First, I assume that we have a DataFrame df and a list all_columns, which contains the names of the columns we want to validate.

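We have to create a column containing an array of strings with the names of the columns that hold null values. To do that, we use the when function to check whether a value is null and pass the column name as a literal value. Because we don't call otherwise, when returns null for columns that do have a value, so the array has one element per validated column: either the column name or null. We use the * operator to unpack the list produced by the list comprehension into the arguments of the Spark array function: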

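from pyspark.sql.functions import array, col, lit, when
# For every column: its name if the value is null, otherwise null
missing_column_names = array(*[
    when(col(c).isNull(), lit(c)) for c in all_columns
])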

After that, we assign the values to a new column in the DataFrame:

df = df.withColumn("missing_columns", missing_column_names)
