
spark submit options

In this post, let us get to know about the spark-submit options.

spark submit options

The spark-submit options mentioned below contain placeholders marked < >, which we replace with the respective values. The following are a few of the prominent spark-submit options available.

Each entry consists of an option name and its value. For instance, --master is the option name, and we pass its value next to it.

spark-submit \
--master < > \
--deploy-mode < > \
--keytab < > \
--principal < > \
--driver-memory < > \
--executor-memory  < > \
--executor-cores  < > \
--num-executors  < > \
--class < > \
--conf <key>=<value> \
<jar name>

master : The master URL that tells Spark where to run, e.g. local[n] for local mode, yarn, or spark://host:port for a standalone cluster.

deploy-mode : default – client

This is applicable only for a cluster setup (YARN, standalone, etc.). It can have client or cluster as the value, depending on where the driver program needs to run: on a worker node (cluster mode) or on the local machine (client mode).

keytab and principal :

Every host that provides a service must have a local file called a keytab. This file contains pairs of Kerberos principals and encrypted keys. It allows scripts to authenticate with Kerberos automatically, without any human involvement, since the key material is read from the file instead of a password being typed in.

driver-memory : Memory required for the driver program, which contains the main method. The default value is 1 GB.

executor-memory : Executors are worker processes (running on the worker nodes) responsible for running individual tasks in a given Spark job. This option sets the memory allocated to each executor, which should be assigned based on a sizing calculation.

executor-cores : Number of cores each executor can have.

num-executors : Specifies the number of executors to launch. Running executors with too much memory often results in excessive garbage-collection delays, whereas running tiny executors (with a single core and just enough memory to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM. It is better to calculate the sizing before assigning values.
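The sizing calculation above can be sketched in a few lines. This is a rough heuristic, not an official Spark formula: the function name and the specific constants (1 core and 1 GB reserved per node for the OS and daemons, about 5 cores per executor, one slot left for the YARN ApplicationMaster, roughly 7% memory kept aside for overhead) are assumptions based on a commonly cited rule of thumb.

```python
def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5):
    """Rough executor sizing heuristic (not an official formula).

    Reserves 1 core and 1 GB per node for OS/daemons, caps each
    executor at ~5 cores, keeps one executor slot for the YARN
    ApplicationMaster, and leaves ~7% of memory for overhead.
    Returns (num_executors, executor_cores, executor_memory_gb).
    """
    usable_cores = cores_per_node - 1        # 1 core for OS/daemons
    usable_mem_gb = mem_per_node_gb - 1      # 1 GB for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    num_executors = nodes * executors_per_node - 1  # 1 slot for the AM
    mem_per_executor = usable_mem_gb / executors_per_node
    executor_memory_gb = int(mem_per_executor * 0.93)  # ~7% overhead
    return num_executors, cores_per_executor, executor_memory_gb

# Example: a 10-node cluster with 16 cores and 64 GB per node
print(size_executors(10, 16, 64))  # (29, 5, 19)
```

With these example numbers, the corresponding flags would be roughly --num-executors 29 --executor-cores 5 --executor-memory 19G; treat the output as a starting point to tune, not a fixed answer.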

class : The fully qualified name of the application's main class (for Java/Scala applications).

Other options :

--conf <key>=<value>

For instance, --conf spark.sql.files.maxPartitionBytes=134217728 (i.e. 128 MB; the value is given in bytes).

spark.sql.files.maxPartitionBytes : default 128 MB
spark.sql.files.openCostInBytes : default 4 MB
spark.sql.files.minPartitionNum : default spark.default.parallelism
spark.sql.broadcastTimeout : default 300 seconds
spark.sql.autoBroadcastJoinThreshold : default 10 MB
spark.sql.shuffle.partitions : default 200
spark.sql.sources.parallelPartitionDiscovery.threshold : default 32
spark.sql.sources.parallelPartitionDiscovery.parallelism : default 10000
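Of the options above, maxPartitionBytes and openCostInBytes work together to decide the target size of file read splits. The sketch below approximates that calculation; Spark's actual logic lives in FilePartition.maxSplitBytes, and the function name and simplified parameters here are my own.

```python
def max_split_bytes(total_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,  # 128 MB default
                    open_cost_in_bytes=4 * 1024 * 1024,     # 4 MB default
                    parallelism=200):
    """Approximate Spark SQL's target file-split size.

    Each file is 'padded' with open_cost_in_bytes, so reading many
    small files yields fewer, larger partitions than the raw byte
    count alone would suggest. The result is clamped between the
    open cost and max_partition_bytes.
    """
    padded = total_bytes + num_files * open_cost_in_bytes
    bytes_per_core = padded // parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# 100 GB spread over 100 files with default settings:
# the split size is capped at maxPartitionBytes (128 MB)
print(max_split_bytes(100 * 1024**3, 100))  # 134217728
```

Raising spark.sql.files.maxPartitionBytes produces fewer, larger partitions; lowering it produces more, smaller ones, which is why it is a common knob for tuning read parallelism.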

Reference

https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options


spark local and standalone mode

In this post, let us have a look at Spark's local and standalone modes.

Local mode

Other than the local and standalone modes which we are going to see in this post, Spark has a few other deployment modes as well, such as YARN and Kubernetes.

Local Mode is the default mode of spark which runs everything on the same machine.

If the --master flag is not mentioned in the command, whether spark-shell or spark-submit, Spark runs in local mode by default.

Another way is to pass the --master option with local as the argument, which defaults to a single thread.

We can increase the number of threads by providing the required number within square brackets. For instance, spark-shell --master local[2].

By using an asterisk instead, like local[*], we can use as many threads as there are processors available to the Java virtual machine.

spark-submit --class <class name> --master local[8] <jar file>

Standalone mode

  • Spark standalone cluster in client deploy mode
  • Spark standalone cluster in cluster deploy mode with supervise
  • Run a Python application on a Spark standalone cluster
Spark standalone cluster in client deploy mode

In this mode, the application is submitted from a gateway machine that is physically co-located with the worker machines. The input and output of the application are attached to the console, so this mode is well suited for applications that involve a REPL (i.e., the Spark shell). In client mode, the driver launches directly within the spark-submit process, which acts as a client to the cluster.

spark-submit --class <class name> --master <spark://host id> --executor-memory 20G --total-executor-cores 100 <jar name> 
Spark standalone cluster in cluster deploy mode with supervise

For a Spark standalone cluster with cluster deploy mode, you can also provide --supervise. The driver then restarts automatically in case it fails with a non-zero exit code.

Some applications are submitted from a machine far from the worker machines. In such cases it is common to use cluster mode to minimize network latency between the driver and the executors.

spark-submit --class <class name> --master <spark://host id> --deploy-mode cluster --supervise --executor-memory 20G --total-executor-cores 100 <jar name>
Run a Python application on a Spark standalone cluster

Currently, the standalone mode does not support cluster mode for Python applications.

spark-submit --master <spark://host id> <python file>

Reference

https://spark.apache.org/docs/latest/submitting-applications.html