In this article, I will show you how to combine two Spark DataFrames that have no common columns.

For example, suppose we have the following two DataFrames:

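// toDF needs the implicits of an existing SparkSession named `spark`
// (spark-shell imports them automatically)
import spark.implicits._

val df1 = Seq(
    ("001", "002", "003"),
    ("004", "005", "006")
  ).toDF("A", "B", "C")
val df2 = Seq(
    ("011", "022", "033"),
    ("044", "055", "066")
  ).toDF("D", "E", "F")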

The output I want to get looks like this:

+----+----+----+----+----+----+
|   A|   B|   C|   D|   E|   F|
+----+----+----+----+----+----+
| 001| 002| 003|null|null|null|
| 004| 005| 006|null|null|null|
|null|null|null| 011| 022| 033|
|null|null|null| 044| 055| 066|
+----+----+----+----+----+----+

This can be easily achieved by using the full outer join with the condition set to false:

import org.apache.spark.sql.functions.lit
df1.join(df2, lit(false), "full")

It works because a full outer join keeps every row from both DataFrames, while the lit(false) join condition guarantees that no row of df1 ever matches a row of df2. Each row therefore appears exactly once in the result, with nulls in the columns that belong to the other DataFrame.
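To try the whole thing outside of spark-shell, here is a minimal, self-contained sketch. It assumes a local SparkSession; the object name CombineDisjointFrames and the helper combine are hypothetical names I made up for illustration, not part of the Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object CombineDisjointFrames {

  // Hypothetical helper: a full outer join whose condition is always false,
  // so no rows match and every row from both sides is kept,
  // padded with nulls in the other side's columns.
  def combine(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, lit(false), "full")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this demo.
    val spark = SparkSession.builder()
      .appName("combine-disjoint-frames")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(("001", "002", "003"), ("004", "005", "006")).toDF("A", "B", "C")
    val df2 = Seq(("011", "022", "033"), ("044", "055", "066")).toDF("D", "E", "F")

    val combined = combine(df1, df2)

    // Row order is not guaranteed, so only the shape is checked here.
    assert(combined.columns.sameElements(Array("A", "B", "C", "D", "E", "F")))
    assert(combined.count() == df1.count() + df2.count())

    combined.show()
    spark.stop()
  }
}
```

Because no pair of rows can match, the row count of the result should always equal the sum of the two input counts, which makes for a cheap sanity check after the join.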
