Reading the Spark startup code
Spark uses a series of shell scripts as its entry point: the bin directory holds the scripts for task submission, while the sbin directory holds the scripts for starting and stopping the master and workers.
All of these scripts eventually launch the Java (Scala) code by calling bin/spark-class.
-- spark-class: assembling the java parameters (analysis start) --
The code processing flow of spark-class:
spark-class calls org.apache.spark.launcher.Main, passing along its own arguments, to obtain the concrete java command:
("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@") — after substituting the variables, this amounts to executing:
/usr/java/jdk/bin/java -cp /home/xxx/spark/lib/spark-assembly-1.5.2-hadoop2.5.0-cdh6.3.2.jar org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class com.xxx.xxxx.stat.core.Main --master spark://xxxx1:7077,xxxx2:7077 --executor-memory 2G --driver-memory 5G --total-executor-cores 10 /home/xxx/xxxxxxx/bigdata-xxxxxxx.jar com.xxx.xxxx.stat.xxx.XXXXJob 20180527 20180528
This line of code returns:
/usr/java/jdk/bin/java -cp /home/xxx/spark/libext/*:/home/xxx/spark/conf/:/home/xxx/spark/lib/spark-assembly-1.5.2-hadoop2.5.0-cdh6.3.2.jar:/home/xxx/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/xxx/spark/lib/datanucleus-core-3.2.10.jar:/home/xxx/spark/lib/datanucleus-rdbms-3.2.9.jar:/home/xxx/yarn/etc/hadoop -DLOG_LEVEL=INFO -DROLE_NAME=console -Xms5G -Xmx5G -Xss32M -XX:PermSize=128M -XX:MaxPermSize=512M org.apache.spark.deploy.SparkSubmit --master spark://xxxx1:7077,xxxx2:7077 --conf spark.driver.memory=5G --class com.xxx.xxxx.stat.core.Main --executor-memory 2G --total-executor-cores 10 /home/xxx/xxxxxxx/bigdata-xxxxxxx.jar com.xxx.xxxx.stat.xxx.XXXXJob 20180527 20180528
You can see that the main purpose of the org.apache.spark.launcher.Main class is to fill in the parameters of the final java command that gets executed, including the classpath, the JVM heap and stack settings, and so on. The implementation is analyzed below:
1. org.apache.spark.launcher.Main lives in the separate launcher module and is not part of core.
2. When org.apache.spark.launcher.Main is asked to launch org.apache.spark.deploy.SparkSubmit, it creates builder = new SparkSubmitCommandBuilder(args), which normalizes the java parameters of the spark-submit command.
3. For other callers (mainly the master and worker start scripts), it creates builder = new SparkClassCommandBuilder(className, args), which normalizes the java parameters of those commands.
In both cases the java command is produced in two steps: the builder is constructed, and then its buildCommand method is called, as sketched below.
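The launcher entry point itself is Java code; the following Scala-flavored sketch only mirrors the two-step flow described above. The runner path, the fallback classpath, and the helper bodies are made up for illustration and are not Spark's actual launcher code.

// Sketch only: mirrors the flow of org.apache.spark.launcher.Main (which is Java).
// Assumes at least the class name to launch is passed as the first argument.
object LauncherMainSketch {
  val SparkSubmitClass = "org.apache.spark.deploy.SparkSubmit"

  def main(args: Array[String]): Unit = {
    val className = args.head           // class to launch (SparkSubmit, Master, Worker, ...)
    val classArgs = args.tail.toList    // remaining arguments are passed through

    // Step 1: pick the builder that normalizes the java parameters.
    val cmd: Seq[String] =
      if (className == SparkSubmitClass) buildSparkSubmitCommand(classArgs)   // SparkSubmitCommandBuilder path
      else buildSparkClassCommand(className, classArgs)                       // SparkClassCommandBuilder path

    // Step 2: emit the assembled command so the calling shell script can exec it.
    println(cmd.mkString(" "))
  }

  // Illustrative stand-ins for builder.buildCommand(env) of the two builders.
  def buildSparkSubmitCommand(args: List[String]): Seq[String] =
    Seq("/usr/bin/java", "-cp", sys.env.getOrElse("LAUNCH_CLASSPATH", "."), SparkSubmitClass) ++ args

  def buildSparkClassCommand(className: String, args: List[String]): Seq[String] =
    Seq("/usr/bin/java", "-cp", sys.env.getOrElse("LAUNCH_CLASSPATH", "."), className) ++ args
}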
SparkSubmitCommandBuilder parses the options with its inner class OptionParser, which extends SparkSubmitOptionParser.
The classpath is assembled mainly along the call chain buildCommand -> buildSparkSubmitCommand -> buildJavaCommand (the latter in the parent class), which collects entries from the various locations that may contribute to the classpath; a sketch follows.
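As a rough picture of where those entries come from, here is a hedged sketch. The helper object and its logic are illustrative, not Spark's actual builder code; the directory names (libext, conf, lib, the Hadoop conf dir) are taken from the generated command shown earlier.

import java.io.File

// Illustrative only: gathers classpath entries from the locations visible in the
// generated command above. Not the real buildJavaCommand implementation.
object ClassPathSketch {
  def buildClassPath(sparkHome: String, hadoopConfDir: Option[String]): String = {
    val libJars = Option(new File(s"$sparkHome/lib").listFiles())
      .getOrElse(Array.empty[File])
      .toSeq
      .filter(_.getName.endsWith(".jar"))   // assembly + datanucleus jars
      .map(_.getPath)

    val entries = Seq(s"$sparkHome/libext/*", s"$sparkHome/conf/") ++ libJars ++ hadoopConfDir.toList
    entries.mkString(File.pathSeparator)    // joined with ':' on Linux
  }
}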
The -Xms5G -Xmx5G flags are obtained in buildSparkSubmitCommand by parsing SPARK_DRIVER_MEMORY (spark.driver.memory).
The -Xss32M -XX:PermSize=128M -XX:MaxPermSize=512M flags come from addPermGenSizeOpt(cmd) and from parsing the configuration item spark.driver.extraJavaOptions (DRIVER_EXTRA_JAVA_OPTIONS = "spark.driver.extraJavaOptions") in the configuration file.
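A minimal sketch of that step, assuming a simplified precedence between spark.driver.memory and SPARK_DRIVER_MEMORY; the helper name and signature are made up, and the real logic sits in the Java launcher code.

// Illustrative only: turn the effective driver memory and
// spark.driver.extraJavaOptions into JVM flags, as described above.
object DriverJvmOptsSketch {
  def driverJvmOpts(conf: Map[String, String], env: Map[String, String]): Seq[String] = {
    // Simplified: spark.driver.memory falls back to SPARK_DRIVER_MEMORY;
    // the real precedence rules in the launcher are more involved.
    val driverMemory = conf.get("spark.driver.memory")
      .orElse(env.get("SPARK_DRIVER_MEMORY"))
      .getOrElse("1g")
    val heapFlags = Seq(s"-Xms$driverMemory", s"-Xmx$driverMemory")

    // spark.driver.extraJavaOptions is split on whitespace and appended, which is
    // where flags like -Xss32M -XX:PermSize=128M -XX:MaxPermSize=512M come from.
    val extraOpts = conf.get("spark.driver.extraJavaOptions")
      .map(_.split("\\s+").toSeq)
      .getOrElse(Seq.empty)

    heapFlags ++ extraOpts
  }
}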
-- spark-class: assembling the java parameters (analysis end) --
Now let's analyze the most frequently used entry points: bin/spark-submit (task submission, which starts the driver process), sbin/start-master.sh (starts the master process in the background), and sbin/start-slave.sh (starts a worker process in the background). The corresponding startup code all lives under spark-1.5.2\core\src\main\scala\org\apache\spark\deploy:
spark-submit
Class name called through spark-class: org.apache.spark.deploy.SparkSubmit
Let's analyze how the driver is started by walking through the class SparkSubmit.scala.
/usr/java/jdk/bin/java -cp /home/xxx/spark/libext/*:/home/xxx/spark/conf/:/home/xxx/spark/lib/spark-assembly-1.5.2-hadoop2.5.0-cdh6.3.2.jar:/home/xxx/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/xxx/spark/lib/datanucleus-core-3.2.10.jar:/home/xxx/spark/lib/datanucleus-rdbms-3.2.9.jar:/home/xxx/yarn/etc/hadoop -DLOG_LEVEL=INFO -DROLE_NAME=console -Xms5G -Xmx5G -Xss32M -XX:PermSize=128M -XX:MaxPermSize=512M org.apache.spark.deploy.SparkSubmit --master spark://xxxx1:7077,xxxx2:7077 --conf spark.driver.memory=5G --class com.xxx.xxxx.stat.core.Main --executor-memory 2G --total-executor-cores 10 /home/xxx/xxxxxxx/bigdata-xxxxxxx.jar com.xxx.xxxx.stat.xxx.XXXXJob 20180527 20180528
The main parameters passed are:
--conf spark.driver.memory=5G
--class com.xxx.xxxx.stat.core.ExcuteMain
--executor-memory 2G
--total-executor-cores 10
/home/xxx/xxxxxxx/bigdata-xxxxxxx.jar
com.xxx.xxxx.stat.xxx.XXXXJob
20180527
20180528
The main function of SparkSubmit parses the arguments further via val appArgs = new SparkSubmitArguments(args) and then calls submit(appArgs) to perform the submission.
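Abbreviated, the entry point in the 1.5.x source looks roughly like this (verbose printing omitted; the kill and status branches are shown only for context):

// SparkSubmit.main, roughly as in the 1.5.x source (abbreviated):
def main(args: Array[String]): Unit = {
  val appArgs = new SparkSubmitArguments(args)                      // further argument parsing
  appArgs.action match {
    case SparkSubmitAction.SUBMIT => submit(appArgs)                // the path analyzed here
    case SparkSubmitAction.KILL => kill(appArgs)                    // --kill
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs) // --status
  }
}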
The SparkSubmitArguments class first calls parse from org.apache.spark.launcher.SparkSubmitOptionParser (the same parser used by org.apache.spark.launcher.Main above) to parse the arguments, then calls loadEnvironmentArguments to resolve, or assign default values to, the parameters that may be configured through the environment. Finally, the action parameter defaults to SUBMIT:
action = Option(action).getOrElse(SUBMIT)
Let's take a look at the submit method:
val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)
a. This defines what will run as the driver: childArgs, childClasspath, sysProps and childMainClass. submit hands over quite a lot of information, such as the number of cores to use (which determines the number of executors), the memory per executor, and the driver code to execute (the driver code itself is also treated as part of what is submitted).
b. Determine the cluster manager (clusterManager) from the prefix of the master URL: yarn, spark, mesos, or local.
c. Determine the deploy mode (deployMode): in CLIENT mode the driver runs on the current machine; in CLUSTER mode a worker hosts the driver.
d. The code that follows handles the special combinations of clusterManager, deployMode, and Python (or R). We focus on standalone mode here.
e. Fill each parameter into the options variable.
f. In the if (deployMode == CLIENT) branch, the four return values are filled in; childMainClass = args.mainClass is set directly and is later invoked directly by runMain in SparkSubmit (see the sketch after this list).
g. In isStandaloneCluster mode (standalone plus cluster deploy mode), the code distinguishes between the legacy and REST paths and starts a client process that takes care of submitting the driver:
In REST mode, org.apache.spark.deploy.rest.RestSubmissionClient is filled into childMainClass.
In legacy mode, org.apache.spark.deploy.Client is filled into childMainClass.
SparkSubmit then executes that class and passes args.mainClass to it as an argument.
h. The spark.driver.host parameter is ignored in cluster mode.
i. The four values are returned.
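Condensing the client and standalone-cluster branches above, the choice of childMainClass can be sketched as follows. The helper is illustrative, not Spark's actual prepareSubmitEnvironment code; the yarn and mesos branches and most bookkeeping are omitted.

// Simplified sketch of how childMainClass is chosen for the standalone cases.
object ChildMainClassSketch {
  def chooseChildMainClass(deployMode: String,
                           isStandaloneCluster: Boolean,
                           useRest: Boolean,
                           userMainClass: String): String = {
    if (deployMode == "client") {
      userMainClass                                        // runMain invokes it directly
    } else if (isStandaloneCluster && useRest) {
      "org.apache.spark.deploy.rest.RestSubmissionClient"  // REST submission gateway
    } else if (isStandaloneCluster) {
      "org.apache.spark.deploy.Client"                     // legacy submission client
    } else {
      userMainClass
    }
  }
}
// In the cluster branches the user's main class goes into childArgs instead, so the
// client process can tell the cluster which class the driver should actually run.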
Explanation of four parameters:
This returns a 4-tuple:
(1) the arguments for the child process
(2) a list of classpath entries for the child
(3) a map of system properties, and
(4) the main class for the child
submit then calls doRunMain, which in turn calls runMain.
runMain invokes the submitted childMainClass via reflection:
mainClass = Utils.classForName(childMainClass)
val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
mainMethod.invoke(null, childArgs.toArray)
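The same reflection pattern as a small self-contained example: the Greeter object is a made-up stand-in for the submitted application class, and Class.forName stands in for Spark's Utils.classForName (which additionally picks the right class loader).

// A hypothetical "user" main class, standing in for the submitted application.
// Both objects are assumed to be compiled together in the default package.
object Greeter {
  def main(args: Array[String]): Unit =
    println("Greeter invoked with: " + args.mkString(", "))
}

// Demonstration of the pattern runMain uses: load a class by name, look up its
// static main(String[]) method, and invoke it with the child arguments.
object ReflectiveMainDemo {
  def main(args: Array[String]): Unit = {
    val childMainClass = "Greeter"                // Spark gets this from prepareSubmitEnvironment
    val childArgs = Seq("hello", "spark")

    val mainClass = Class.forName(childMainClass)
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, childArgs.toArray)    // null receiver: main is static
  }
}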
start-master.sh (spark-daemon.sh is called in between)
Class name called through spark-class: org.apache.spark.deploy.master.Master
The parameters with which the call is made:
start-slave.sh (spark-daemon.sh is called in between)
Class name called through spark-class: org.apache.spark.deploy.worker.Worker
The parameters with which the call is made: