When we want to use external dependencies in PySpark code, we have two options: we can pass them either as jar files or as Python scripts.

In this article, I will show how to do that when running a PySpark job using AWS EMR. The jar and Python files will be stored on S3 in a location accessible from the EMR cluster (remember to set the permissions).

First, we have to add the --jars and --py-files parameters to the spark-submit command while starting a new PySpark job:

spark-submit --deploy-mode cluster \
    --jars s3://some_bucket/java_code.jar \
    --py-files s3://some_bucket/python_code.py \
    s3://some_bucket/pyspark_job.py
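
Both parameters accept comma-separated lists, so if the job needs more than one dependency, we can pass them all in a single spark-submit call (the extra file names below are just placeholders):

spark-submit --deploy-mode cluster \
    --jars s3://some_bucket/java_code.jar,s3://some_bucket/another_library.jar \
    --py-files s3://some_bucket/python_code.py,s3://some_bucket/helpers.py \
    s3://some_bucket/pyspark_job.py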

In the pyspark_job.py file, I can import the code from the Python file (python_code.py) just like any other dependency.

from python_code import something
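
For context, here is a minimal sketch of what pyspark_job.py could look like; the input/output paths and the something helper are placeholders standing in for whatever python_code.py actually provides:

from pyspark.sql import SparkSession

# python_code.py is available on the driver and executors because of --py-files
from python_code import something  # 'something' is a placeholder helper

spark = SparkSession.builder.appName("pyspark_job").getOrCreate()

# hypothetical paths, just to show where the imported code fits in
df = spark.read.parquet("s3://some_bucket/input/")
something(df).write.parquet("s3://some_bucket/output/")

The jar passed with --jars does not need any import in the Python code. Spark puts it on the classpath of the driver and executors, so the Java/Scala classes it contains (for example, a custom data source or connector) are available to the job.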