I am running a Spark job on 3 files of ~100 MB each, but for some reason the Spark UI shows the whole dataset concentrated on just 2 executors. Because of this the job has been running for 19 hours and is still not finished. Below is my Spark configuration; the version used is Spark 2.3.
spark2-submit --class org.mySparkDriver \
--master yarn \
--deploy-mode cluster \
--driver-memory 8g \
--num-executors 100 \
--conf spark.default.parallelism=40 \
--conf spark.yarn.executor.memoryOverhead=6000mb \
--conf spark.dynamicAllocation.executorIdleTimeout=6000s \
--conf spark.executor.cores=3 \
--conf spark.executor.memory=8G
I tried repartitioning inside the code with rdd.repartition(20), and that works: the input is split into 20 partitions and spread across the executors. But why should I have to repartition at all? I believed that setting spark.default.parallelism=40 in the submit script would make Spark divide the input into 40 partitions and process them across the executors.
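For reference, here is a minimal sketch of the repartition workaround; the input path and the driver skeleton are placeholders, not my exact code:

import org.apache.spark.{SparkConf, SparkContext}

object MySparkDriver {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mySparkDriver"))

    // Placeholder path: the real job reads 3 files of ~100 MB each
    val rdd = sc.textFile("hdfs:///data/input/")
    println(s"Partitions after read: ${rdd.getNumPartitions}")

    // Explicit repartition spreads the data over 20 tasks,
    // so work is no longer concentrated on 2 executors
    val repartitioned = rdd.repartition(20)
    println(s"Partitions after repartition: ${repartitioned.getNumPartitions}")

    // ... rest of the job runs on the repartitioned RDD
  }
}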
Can anyone help?
Thanks, Neethu