---
title: "How to read from SQL table in PySpark using a query instead of specifying a table"
description: "Fetching data using a SQL query in PySpark"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/read-from-sql-in-pyspark-using-query
---

When we know precisely which query returns the data we need from a SQL database, we don't have to load multiple tables into PySpark and reproduce the joins and selects in Python code. Instead, we can pass the SQL query as the source of the DataFrame when we read it from the database.

If our code to retrieve the data looks like this:

```python
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://localhost:port") \
    .option("dbtable", "schema.tablename") \
    ...
    .load()
```

we can replace the `dbtable` parameter with a SQL query, and PySpark loads the query result as if it were a table. Note that the query must be wrapped in parentheses and given an alias, because the database treats it as a derived table in a `FROM` clause:

```python
.option("dbtable", "(SELECT column_A, column_B FROM some_table) AS tbl")
```
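Putting it together, here is a minimal sketch of the whole reader call. The connection URL, driver class, credentials, and column/table names are placeholders, and the small `as_dbtable` helper is purely illustrative:

```python
# Illustrative helper: wrap an arbitrary SELECT so it can be passed as the
# "dbtable" option. The parentheses and the alias are both required, because
# the database sees the value as a derived table in a FROM clause.
def as_dbtable(query: str, alias: str = "tbl") -> str:
    return f"({query}) AS {alias}"

dbtable = as_dbtable("SELECT column_A, column_B FROM some_table")
print(dbtable)  # (SELECT column_A, column_B FROM some_table) AS tbl

# The full reader call then becomes (URL, driver, user, and password are
# placeholders for your own connection settings):
# df = spark.read \
#     .format("jdbc") \
#     .option("url", "jdbc:mysql://localhost:3306/schema") \
#     .option("driver", "com.mysql.cj.jdbc.Driver") \
#     .option("user", "username") \
#     .option("password", "password") \
#     .option("dbtable", dbtable) \
#     .load()
```

Since Spark 2.4, the JDBC source also accepts a `query` option that takes the bare SELECT statement without the parentheses-and-alias wrapping; `dbtable` and `query` are mutually exclusive, so use one or the other.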

