spark submit options

In this post, let us get to know the options available for spark-submit.

spark submit options

The spark-submit options mentioned below contain < > placeholders, which we can replace with the respective values. The following are a few of the prominent spark-submit options available.

Each entry consists of an option name and its value. For instance, --master is the option name, and we pass its value next to it.

spark-submit \
--master < > \
--deploy-mode < > \
--keytab < > \
--principal < > \
--driver-memory < > \
--executor-memory < > \
--executor-cores < > \
--num-executors < > \
--conf <key>=<value> \
--class < > \
<jar name>
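A filled-in example for reference, assuming the SparkPi class that ships with the Spark distribution (the jar version and the resource sizes below are illustrative values, not recommendations):

spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 2g \
--executor-memory 4g \
--executor-cores 4 \
--num-executors 3 \
--conf spark.sql.shuffle.partitions=200 \
--class org.apache.spark.examples.SparkPi \
examples/jars/spark-examples_2.12-3.3.0.jar 100

The trailing 100 is an argument passed to the application itself; anything after the jar name goes to the main method, which is why --conf must appear before the jar.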

master : The master URL that tells Spark where to run; it covers local mode as well as the cluster managers.
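The standard master URL forms are:

--master local[*]                 run locally with as many threads as cores
--master local[4]                 run locally with 4 threads
--master yarn                     connect to a YARN cluster (located via HADOOP_CONF_DIR)
--master spark://host:7077        connect to a Spark standalone master
--master k8s://https://host:port  connect to a Kubernetes cluster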

deploy-mode : default value is client

This is applicable only for a cluster set-up (YARN, standalone). It can have client or cluster as the value, depending on where the driver program needs to run: on a worker node (cluster mode) or on the local machine (client mode).
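For instance, the same job can be submitted in either mode (class and jar are placeholders here):

spark-submit --master yarn --deploy-mode client --class < > <jar name>
spark-submit --master yarn --deploy-mode cluster --class < > <jar name>

In client mode the driver's console output stays on the submitting machine, which is convenient for interactive runs; in cluster mode the driver keeps running even if the submitting machine disconnects, which suits production jobs.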

keytab and principal : 

Every host that provides a service must have a local file called a keytab. This file contains pairs of Kerberos principals and their encrypted keys. It allows scripts to authenticate using Kerberos automatically, with no human involvement, since the password is read from the file.
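A sketch of a Kerberized submission on YARN; the principal name and keytab path below are made-up values:

klist -kt /etc/security/keytabs/etl_user.keytab   (lists the principals stored in a keytab)

spark-submit \
--master yarn \
--deploy-mode cluster \
--principal etl_user@EXAMPLE.COM \
--keytab /etc/security/keytabs/etl_user.keytab \
--class < > \
<jar name>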

driver-memory : Memory allocated to the driver program, which encloses the main method. The default value is 1 GB.

executor-memory : Executors are processes launched on worker nodes that run the individual tasks in a given Spark job. This option sets the memory allocated to each executor, which should be assigned based on a sizing calculation (see num-executors below).

executor-cores : Number of cores each executor can have.

num-executors : Specifies the number of executors to launch. Running executors with too much memory often results in excessive garbage-collection delays, whereas running tiny executors (with a single core and just enough memory for a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM. It is better to calculate these values before assigning them, as in the sketch below.
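A rough worked example of that calculation, assuming a hypothetical cluster of 3 worker nodes, each with 16 cores and 64 GB RAM (this is a common rule of thumb, not an official formula):

Reserve 1 core and 1 GB per node for OS/Hadoop daemons  -> 15 cores, 63 GB usable per node
Use 5 cores per executor (often cited for good HDFS throughput)
Executors per node = 15 / 5 = 3  -> 3 nodes x 3 = 9, minus 1 for the YARN application master = 8
Memory per executor = 63 GB / 3 = 21 GB, minus ~7% YARN overhead = roughly 19 GB

spark-submit \
--num-executors 8 \
--executor-cores 5 \
--executor-memory 19g \
--class < > \
<jar name>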

class : The fully qualified name of the application's main class (for example, org.apache.spark.examples.SparkPi).

Other options :

--conf <key>=<value>

For instance, --conf spark.sql.files.maxPartitionBytes=134217728 sets that property to 128 MB (the value is in bytes, with no spaces around the =). Several such properties can be combined, as shown after the list below.

spark.sql.files.maxPartitionBytes : default (128 MB)
spark.sql.files.openCostInBytes : default (4 MB)
spark.sql.files.minPartitionNum : no fixed default; falls back to the session's default parallelism
spark.sql.broadcastTimeout : default (300 seconds)
spark.sql.autoBroadcastJoinThreshold : default (10 MB)
spark.sql.shuffle.partitions : default (200)
spark.sql.sources.parallelPartitionDiscovery.threshold : default (32)
spark.sql.sources.parallelPartitionDiscovery.parallelism : default (10000)
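These properties can be combined on one command line; for instance (the values here simply restate the defaults explicitly):

spark-submit \
--conf spark.sql.shuffle.partitions=200 \
--conf spark.sql.autoBroadcastJoinThreshold=10485760 \
--conf spark.sql.broadcastTimeout=300 \
--class < > \
<jar name>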

Reference

https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options