
Spark executor memory and cores

In this post, let us discuss how to determine Spark executor memory and cores.

Factors to consider while calculating

The spark-submit configuration parameters include the number of executors, executor cores and executor memory. Determining values for these involves a set of calculations and depends on a few factors, which we will see below.
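
For reference, here is a minimal sketch (in Scala, with placeholder values and a hypothetical SparkConf setup) of how these three parameters map to Spark configuration properties and spark-submit flags:

```scala
import org.apache.spark.SparkConf

// These map to the spark-submit flags --num-executors, --executor-cores
// and --executor-memory. The values here are placeholders; sensible values
// are derived in the calculation below.
val conf = new SparkConf()
  .set("spark.executor.instances", "4")  // number of executors
  .set("spark.executor.cores", "5")      // cores per executor
  .set("spark.executor.memory", "8g")    // heap memory per executor
```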

Hadoop/OS Daemons: NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker are daemons running in the background while a Spark application runs through a cluster manager like YARN. We need to allocate a few cores for these daemons as well, approximately 1 core per node.

YARN ApplicationMaster (AM): This negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor the containers and their resource consumption. It needs roughly 1024 MB of memory and 1 executor.

HDFS Throughput: The HDFS client achieves full write throughput with ~5 tasks per executor, whereas it struggles with many concurrent threads. So it's better to keep the number of cores per executor at 5 or fewer.

MemoryOverhead:
Full memory requested to YARN per executor =
spark.executor.memory + spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead =
max(384 MB, 7% of spark.executor.memory)

So, if we request 20 GB per executor, YARN will actually allocate 20 GB + memoryOverhead = 20 GB + 7% of 20 GB = ~21.4 GB of memory for us.
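
The same rule can be written as a small helper; this is just a sketch assuming executor memory is expressed in MB:

```scala
// Sketch of the overhead rule stated above: max(384 MB, 7% of executor memory).
def memoryOverheadMb(executorMemoryMb: Long): Long =
  math.max(384L, (0.07 * executorMemoryMb).toLong)

// Total memory YARN has to set aside for one executor container.
def totalYarnRequestMb(executorMemoryMb: Long): Long =
  executorMemoryMb + memoryOverheadMb(executorMemoryMb)

// A 20 GB executor request actually costs about 21.4 GB on YARN.
println(totalYarnRequestMb(20 * 1024))  // 21913 MB ≈ 21.4 GB
```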

Cluster configuration

We will calculate the number of executors, executor cores and executor memory for the following cluster configuration:
10 Nodes
16 cores per Node
64GB RAM per Node

Running executors with too much memory often results in excessive garbage collection delays, whereas running tiny executors (with a single core and just enough memory to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.

Calculating values for parameters

1) To identify the number of available cores within the cluster
Number of cores = available number of nodes in a cluster * number of cores per node
= 10 * 16
= 160 cores

2) Total memory = available number of nodes in a cluster * memory per node
= 10 * 64 GB
= 640 GB

3) Out of the available 160 cores, one core per node needs to be allocated for daemons
Remaining cores after allocating one core per node for daemons = 160 – (1 * number of nodes )
= 160 – (1 * 10)
= 160 – 10
= 150 cores

4) Total executors = remaining cores / 5 (the recommended number of concurrent tasks per executor for maximum HDFS throughput)
= 150 / 5
= 30

5) Executors per node = total executors / number of nodes
= 30 / 10
= 3

6) Memory per executor = total memory / total executors
= 640 GB / 30
= ~21 GB

7) MemoryOverhead = max(384 MB, 7% of spark.executor.memory)
= max(384 MB, 7% of 21 GB)
= max(384 MB, 1.47 GB)
= ~2 GB (rounded up)

8) Final Executor memory = Memory per executor – memory overhead
= 21 – 2
= 19 GB

9) Remaining executors = total executors – one executor for the YARN ApplicationMaster
= 30 – 1
= 29
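
Putting steps 1–9 together, here is a short Scala sketch (the object and variable names are illustrative, not part of any Spark API) that reproduces the calculation for the 10-node cluster above:

```scala
// Sketch of the whole sizing calculation for the cluster described above.
object ExecutorSizing {
  def main(args: Array[String]): Unit = {
    val nodes        = 10
    val coresPerNode = 16
    val memPerNodeGb = 64
    val coresPerExec = 5                                    // <= 5 for good HDFS throughput

    val totalCores     = nodes * coresPerNode               // 160
    val totalMemGb     = nodes * memPerNodeGb               // 640
    val usableCores    = totalCores - nodes                 // 1 core/node for daemons -> 150
    val totalExecutors = usableCores / coresPerExec         // 30
    val execPerNode    = totalExecutors / nodes             // 3
    val memPerExecGb   = totalMemGb / totalExecutors        // ~21
    val overheadGb     = math.max(384.0 / 1024, 0.07 * memPerExecGb)   // ~1.47
    val execMemoryGb   = memPerExecGb - math.ceil(overheadGb).toInt    // 19
    val availableExec  = totalExecutors - 1                 // 1 executor for the AM -> 29

    println(s"--num-executors $availableExec --executor-cores $coresPerExec " +
      s"--executor-memory ${execMemoryGb}g")
  }
}
```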

So, the maximum values we can have for the configuration parameters are as follows:

Executor memory – 19 GB

Executor cores – 5

Total available executors – 29

And for driver memory, we can use the same size as the executor memory; it doesn't require any specific calculation.
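
Putting it all together, here is a minimal sketch of applying these values when building a SparkSession on YARN (the application name is a placeholder; in practice the same values are usually passed as spark-submit flags):

```scala
import org.apache.spark.sql.SparkSession

// Applying the derived values when building a SparkSession on YARN.
// Note: in client mode the driver JVM is already running, so driver memory
// is normally passed to spark-submit via --driver-memory rather than set here.
val spark = SparkSession.builder()
  .appName("executor-sizing-demo")
  .config("spark.executor.instances", "29")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "19g")
  .config("spark.driver.memory", "19g")  // same size as executor memory, as noted above
  .getOrCreate()
```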

Reference: https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html