When we know exactly which query returns the data we need from a SQL database, there is no reason to load multiple tables in PySpark and emulate the joins and selects in Python code. Instead, we can pass the SQL query as the source of the DataFrame when we retrieve it from the database.

If my code to retrieve the data looks like this:

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:port") \
    .option("dbtable", "schema.tablename") \
    ...
    .load()

I can replace the dbtable parameter with a SQL query and use the result as the table loaded by PySpark:

.option("dbtable", "(SELECT column_A, column_B FROM some_table) AS tbl")
