How to run PySpark code using the Airflow SSHOperator

To submit a PySpark job using SSHOperator in Airflow, we need three things:

  • an existing SSH connection to the Spark cluster
  • the location of the PySpark script (for example, an S3 location if we use EMR)
  • parameters used by PySpark and the script
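If the SSH connection does not exist yet, one way to create it is through an Airflow connection environment variable, which Airflow reads as a connection URI. This is a sketch with a hypothetical connection id (`spark_cluster_ssh`) and hypothetical host and user values:

```shell
# Airflow picks up connections defined in AIRFLOW_CONN_<CONN_ID>
# environment variables, expressed as a connection URI.
# The host and user below are hypothetical placeholders.
export AIRFLOW_CONN_SPARK_CLUSTER_SSH='ssh://hadoop@emr-master.example.com:22'
```

Alternatively, the same connection can be defined in the Airflow web UI under Admin → Connections.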

The operator usage looks like this:

from airflow.contrib.operators.ssh_operator import SSHOperator

script = 's3://some_bucket/'
spark_parameters = '--executor-memory 100G'
# here we can use Airflow templates to define the parameters passed to the script
parameters = '--db {{ params.database_instance }} --output_path {{ params.output_path }}'

submit_pyspark_job = SSHOperator(
    task_id='submit_pyspark_job',
    # the id of the existing SSH connection to the Spark cluster
    ssh_conn_id='spark_cluster_ssh',
    command='set -a; PYSPARK_PYTHON=python3; /usr/bin/spark-submit --deploy-mode cluster %s %s %s' % (spark_parameters, script, parameters),
    dag=dag,
)
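To see exactly what `spark-submit` will receive, the command string can be assembled the same way outside Airflow. This is a sketch with hypothetical values substituted for the script location and the Jinja-templated parameters:

```python
# Assemble the spark-submit command the same way the operator does,
# using hypothetical values in place of the Jinja template variables.
script = 's3://some_bucket/job.py'  # hypothetical script location
spark_parameters = '--executor-memory 100G'
parameters = '--db prod --output_path s3://some_bucket/output'  # rendered values

command = (
    'set -a; PYSPARK_PYTHON=python3; '
    '/usr/bin/spark-submit --deploy-mode cluster %s %s %s'
    % (spark_parameters, script, parameters)
)

print(command)
```

Note that `set -a; PYSPARK_PYTHON=python3;` exports the variable in the remote shell so that `spark-submit` runs the job with Python 3.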