---
title: "How to run PySpark code using the Airflow SSHOperator"
description: "How to submit a PySpark job using SSHOperator in Airflow"
author: "Bartosz Mikulski"
author_bio: "Principal AI Engineer & MLOps Architect. I bridge the gap between \"it works in a notebook\" and \"it works for 200 million users.\""
author_url: https://mikulskibartosz.name
author_linkedin: https://www.linkedin.com/in/mikulskibartosz/
author_github: https://github.com/mikulskibartosz
canonical_url: https://mikulskibartosz.name/run-pyspark-code-using-airflow-sshoperator
---

To submit a PySpark job using the SSHOperator in Airflow, we need three things:

* an existing SSH connection to the Spark cluster
* the location of the PySpark script (for example, an S3 location if we use EMR; see the sketch after this list)
* parameters used by PySpark and the script
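
Before wiring up the operator, it is worth seeing what the script itself expects. Below is a minimal sketch of what `script.py` might look like; the argument names are assumptions chosen to match the templated parameters used later in this post:

```python
# A hypothetical script.py stored at the S3 location; the --db and
# --output_path arguments match the templated parameters defined below.
import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument('--db', required=True)
parser.add_argument('--output_path', required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName('example_job').getOrCreate()
# read from args.db, transform the data, and write it to args.output_path
spark.stop()
```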

The usage of the operator looks like this:

```python
from airflow.contrib.operators.ssh_operator import SSHOperator
# in Airflow 2+, the operator moved to the SSH provider package:
# from airflow.providers.ssh.operators.ssh import SSHOperator

script = 's3://some_bucket/script.py'
spark_parameters = '--executor-memory 100G'
# here we can use Airflow templates to define the parameters passed to the script
{% raw %}parameters = '--db {{ params.database_instance }} --output_path {{ params.output_path }}'{% endraw %}

submit_pyspark_job = SSHOperator(
    task_id='pyspark_submit',
    ssh_conn_id='ssh_connection',
    # `set -a` exports the following variable assignments, so spark-submit
    # sees PYSPARK_PYTHON and runs the job with Python 3
    command=f'set -a; PYSPARK_PYTHON=python3; '
            f'/usr/bin/spark-submit --deploy-mode cluster '
            f'{spark_parameters} {script} {parameters}',
    dag=dag
)
```
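
The `dag` variable and the values behind `params.database_instance` and `params.output_path` come from the DAG definition. Here is a minimal sketch, with a hypothetical DAG id and placeholder parameter values:

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id='pyspark_ssh_example',  # hypothetical id, not from the original setup
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    # default values for the templated fields used in the SSHOperator command
    params={
        'database_instance': 'analytics-db',
        'output_path': 's3://some_bucket/output/',
    },
)
```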

