When we run a Spark cluster on EMR, we often create a new cluster for every Spark job. In that case, we want to use all available resources, but changing the configuration is annoying and error-prone. How many times have you forgotten to update the Spark settings after switching the EMR instances to less powerful ones? Of course, the Spark job failed because it could not allocate the resources it requested.

What is even worse, when you forget to update the settings after switching to bigger instances, you pay for a more powerful cluster but never use it fully.

Fortunately, the EMR version of Spark has a special configuration option that replaces all of the cumbersome parameters, such as the executor memory, the number of executor cores, or the default parallelism.

Instead of setting them one by one, we should enable the maximizeResourceAllocation feature. It is an EMR-specific option, not a regular Spark property, so we cannot pass it to the spark-submit script with a --conf parameter. We have to enable it in the spark configuration classification when we create the cluster:

{"Classification": "spark", "Properties": {"maximizeResourceAllocation": "true"}}

With that classification in place, EMR calculates the maximum memory and number of cores available for executors on the core instances and writes the matching values to spark-defaults.conf, so every spark-submit call uses the whole cluster by default.
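For illustration, here is a minimal sketch of creating such a cluster with the AWS CLI: the classification goes into a JSON file (called spark-config.json here) and is passed to the --configurations parameter of aws emr create-cluster. The cluster name, release label, instance type, and instance count are placeholders, not values from any particular setup.

[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]

aws emr create-cluster \
  --name "spark-job-cluster" \
  --release-label emr-6.9.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --configurations file://spark-config.json

Because the setting is applied at provisioning time, it automatically adapts to whatever instance types the cluster was created with, which is exactly what saves us from the forgotten-configuration problem described above.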
