When we want to use external dependencies in PySpark code, we have two options: we can pass them either as jar files or as Python scripts.
In this article, I will show how to do that when running a PySpark job using AWS EMR. The jar and Python files will be stored on S3 in a location accessible from the EMR cluster (remember to set the permissions).
First, we have to add the --jars and --py-files parameters to the spark-submit command when starting a new PySpark job:
spark-submit --deploy-mode cluster \
--jars s3://some_bucket/java_code.jar \
--py-files s3://some_bucket/python_code.py \
s3://some_bucket/pyspark_job.py
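If you prefer to submit the job as an EMR step instead of calling spark-submit directly on the master node, the same parameters can be passed through boto3. The following is a minimal sketch under assumptions of my own: the region, cluster id, and step name are placeholders, not values from this article.

# A minimal sketch: submitting the same spark-submit call as an EMR step with boto3.
# The region, cluster id, and step name below are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    Steps=[
        {
            "Name": "pyspark_job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar lets an EMR step invoke spark-submit
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--jars", "s3://some_bucket/java_code.jar",
                    "--py-files", "s3://some_bucket/python_code.py",
                    "s3://some_bucket/pyspark_job.py",
                ],
            },
        }
    ],
)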
In the pyspark_job.py file, I can import the code from the python_code.py file just like any other dependency:
from python_code import something
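To make this concrete, here is a minimal sketch of what pyspark_job.py could look like. The something helper, the input and output paths, and the com.example.SomeHelper class are hypothetical names I am assuming for illustration. Note that classes from java_code.jar are not imported in Python; the jar is placed on the JVM classpath and its classes can be reached through the spark._jvm gateway.

from pyspark.sql import SparkSession

# python_code.py was shipped with --py-files, so it is importable
# on the driver and the executors (hypothetical helper name)
from python_code import something

spark = SparkSession.builder.appName("pyspark_job").getOrCreate()

# Placeholder input path
df = spark.read.json("s3://some_bucket/input/")

# Use the Python dependency like any locally installed module
result = something(df)

# Classes from java_code.jar sit on the JVM classpath; they are reached
# through the py4j gateway, not a Python import (hypothetical class name)
helper = spark._jvm.com.example.SomeHelper()

# Placeholder output path
result.write.parquet("s3://some_bucket/output/")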