This article explains the most useful options of spark-submit in detail: where the configuration files, ordinary files, and jar packages you pass actually end up on the cluster. I hope you will have a solid understanding of the mechanism after reading it.
When we use spark-submit, we have to deal with our own configuration files, ordinary files, and jar packages. Today we will not talk about how they get there, but about where they end up, so that we can locate problems more easily.
When we submit our own code to a yarn cluster with spark-submit, spark starts two kinds of process roles on the cluster: a driver and several executors. When these processes need resources or information from us, we usually pass them through spark-submit options. So where do these resources and pieces of information go once they are specified, and how can the driver and executors, running on machines far away in the computer room, read them correctly? And why, even though I specify everything exactly as the spark-submit help describes, does the driver or executor still report an error? This article gives you a method for locating such problems.
Yarn configuration
In fact, both the driver and the executors of spark need to pull these resources to their local machine before they can use them. Yarn provides containers in which the driver and executor processes are started, but a container is bound to a specific server; in other words, although the driver and executors can start once they have been granted a certain amount of cpu and memory, they also need somewhere to put files on disk. So we have to configure a local disk directory to tell the containers started by yarn where they may temporarily store files. This configuration lives in hadoop's core-site.xml, under hadoop.tmp.dir:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/Users/liyong/software/hadoop/hadoop-2.7.6/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
Once we have configured this directory, then on the remote server, after a container starts, a subdirectory under it becomes the working directory of the process bound to that container.
Verify it.
So that everyone can verify this immediately, we will not write any code of our own; that way there is no need to set up a build environment or package anything. We can simply download a pre-built spark distribution, and it is recommended that you verify this on a single-node setup first so that you do not have to log in to other nodes of a cluster. Let's start with the simplest example:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 2 \
examples/target/original-spark-examples_2.11-2.3.0.jar 100000
After we submit this task to yarn, let's take a look at the yarn temporary directory we just configured; a subdirectory related to the job we submitted has been generated:
./nm-local-dir/usercache/liyong/appcache/application_1529231285216_0012
The last part, application_1529231285216_0012, is the applicationId of our current job. Inside this directory we can see several subdirectories; the ones whose names begin with container are the working directories of the driver and executors of the job we submitted. Pick any one of them and you will find a number of subdirectories and files:
original-spark-examples_2.11-2.3.0.jar
This is the jar file containing the code of the job we submitted; it has been shipped over the network so that the executor JVMs can load classes from it.
__spark_libs__
This subdirectory stores the jar packages that the computing framework itself depends on. We can count a total of 240 jars here; if we go back to the root directory of the spark project and look at the assembly/target/scala-2.11/jars/ directory, there are exactly 240 jars as well, which shows that when spark submits the task it ships its own dependency jars to the local directory of every container. The file system, configuration, network, cluster and other functions that the spark computing framework needs are supported by these jars.
__spark_conf__
This subdirectory stores the configuration we specified, in two parts:
The __hadoop_conf__ subdirectory stores the hadoop-related configuration files that we pointed to through the HADOOP_CONF_DIR environment variable; you will recognize the familiar hadoop client configuration files in it. You can open them one by one and compare them with your hadoop client configuration. There is also a __spark_conf__.properties file, which is simply our conf/spark-defaults.conf file under a different name; open both and compare them if you do not believe it.
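Before moving on to the individual options, if you would rather confirm this working-directory behavior from code than by browsing the filesystem, here is a minimal Scala sketch; it is my own illustration rather than part of the original walkthrough, and the object name WhereAmI and the app name are made up. It prints the working directory seen by the driver and by a single executor task when submitted with --master yarn; both paths should land under the container_* subdirectories described above.

import java.io.File
import org.apache.spark.sql.SparkSession

// Minimal sketch: print the container working directory seen by the driver
// and by one executor task when the job runs on yarn.
object WhereAmI {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WhereAmI").getOrCreate()

    // In yarn cluster mode the driver itself runs inside a container_* directory.
    println(s"driver cwd: ${new File(".").getCanonicalPath}")

    // Run a single task and report where the executor thinks it is working.
    val executorCwd = spark.sparkContext
      .parallelize(Seq(1), 1)
      .map(_ => new File(".").getCanonicalPath)
      .collect()
      .head
    println(s"executor cwd: $executorCwd")

    spark.stop()
  }
}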
--jars option
Official description:
Comma-separated list of jars to include on the driver and executor classpaths.
Explanation:
This option takes a list of jar packages that the driver and executors must be able to find on their classpaths; in other words, the jars specified through this option on the spark client are shipped to the classpath of the nodes where the driver and executors run. When we develop spark applications, we often need dependency jars that the spark computing framework itself does not provide. We could bundle all of these dependencies into the application package with maven or an IDE, but that is not a good approach: the application package and its dependencies become tightly coupled, and if there are many dependencies the packaging step becomes very slow. Manually uploading a fat package to the server is also slow, which drags down the whole test-and-verify cycle. Instead, we can use the --jars option and let the spark computing framework distribute the dependencies we need. Let's check it out:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 2 \
--jars /Users/liyong/.m2/repository/org/eclipse/jetty/jetty-plus/9.3.20.v20170531/jetty-plus-9.3.20.v20170531.jar \
examples/target/original-spark-examples_2.11-2.3.0.jar 100000
In the above command, we use --jars to specify a jar package that both the driver side and the executor side need: jetty-plus-9.3.20.v20170531.jar. Run it, then go into the directory of this application: under the local directory of every container, jetty-plus-9.3.20.v20170531.jar is already sitting there. So the next time we hit a "class not found" problem, we can go to this directory and check whether the jar the JVM needs for class loading is present. If it is not there, a ClassNotFoundException is inevitable; if it is there, we can use the jar or unzip command to inspect the jar and see whether it actually contains the class files we need. No more worrying about missing classes when running spark!
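As another quick sanity check, you can ask a running JVM (for example a spark-shell started with the same --jars option) where it actually loaded a class from. This is a generic JVM trick rather than something from the original test, and the jetty class name below is only an example I picked:

// Ask the JVM which jar a class was loaded from; throws ClassNotFoundException
// if the class is not on the classpath at all.
val clazz = Class.forName("org.eclipse.jetty.plus.jndi.Resource")
println(clazz.getProtectionDomain.getCodeSource.getLocation)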
--files option
Official description:
Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
Explanation:
The files specified by this option are placed in the executors' working directories, so an executor can obtain the absolute local path of such a file through the SparkFiles.get(fileName) method and then access it in whatever way it likes.
When we write spark applications, besides giving spark the jar dependencies needed for class loading, we sometimes also need ordinary file resources. For example, location-related development may need an IP address database file, or we may want to join against some small hive tables (usually small dimension tables) whose files we ship along with the job. For this, spark provides the --files option. Note that the description says the files are placed in the working directory of each executor, not in the working directory of the driver; in testing, however, both the driver and the executors can find the file in their working directories. Let's check it out:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 2 \
--jars /Users/liyong/.m2/repository/org/eclipse/jetty/jetty-plus/9.3.20.v20170531/jetty-plus-9.3.20.v20170531.jar \
--files README.md \
examples/target/original-spark-examples_2.11-2.3.0.jar 100000
In our submit command, the README.md file from a client-side directory is specified through the --files option. All right, let's run it:
We can see that this README.md file appears in the working directories of all three local containers (including the container where the driver runs).
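The SparkPi example does not actually read README.md, so to show how application code would pick the file up, here is a minimal Scala sketch of my own that uses the SparkFiles.get API quoted in the description above; it assumes the job was submitted with --files README.md, and the object name is made up.

import scala.io.Source
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

// Minimal sketch: resolve and read a file that was distributed with --files README.md.
object ReadShippedFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadShippedFile").getOrCreate()

    // SparkFiles.get returns the absolute local path of the distributed file on this node.
    val lineCount = spark.sparkContext
      .parallelize(Seq(1), 1)
      .map { _ =>
        val src = Source.fromFile(SparkFiles.get("README.md"))
        try src.getLines().size finally src.close()
      }
      .collect()
      .head
    println(s"README.md has $lineCount lines as seen from an executor")

    spark.stop()
  }
}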
--properties-file option
Official description:
Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
Explanation:
This option points spark at a file from which to load extra configuration; if it is not specified, spark falls back to conf/spark-defaults.conf as the default configuration file. The description is clear enough, so let's just verify it:
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--num-executors 2 \
--jars /Users/liyong/.m2/repository/org/eclipse/jetty/jetty-plus/9.3.20.v20170531/jetty-plus-9.3.20.v20170531.jar \
--properties-file conf/myown-spark.conf \
--files README.md \
examples/target/original-spark-examples_2.11-2.3.0.jar 100000
We make a copy of the spark-defaults.conf file under the conf directory of the spark client and name it myown-spark.conf. To tell it apart from spark-defaults.conf, we add three configuration items to our own file that are not used in spark-defaults.conf:
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              1g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails
After submitting, we can again find the __spark_conf__.properties file in the __spark_conf__/ directory of each container under the temporary directory; the file name is the same as before. Let's open it and see whether the configuration items we added are there:
Sure enough, the three configuration items from our own configuration file are all present. Note, however, that to go through spark's configuration framework, every configuration key must start with the spark. prefix. If we do not want to follow that convention, we need to combine it with the --files option: ship our own configuration file to the container's working directory as an ordinary resource file, and then load it with the java or scala configuration APIs.
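To make that last point concrete, here is a minimal Scala sketch of my own; the file name myapp.properties and the key ip.database.path are hypothetical. It reads an ordinary, non-spark-prefixed properties file that was shipped with --files myapp.properties and loads it with java.util.Properties inside an executor task.

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

// Minimal sketch: load an ordinary properties file shipped with --files myapp.properties,
// so its keys do not have to carry the spark. prefix.
object LoadOwnConfig {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadOwnConfig").getOrCreate()

    val dbPath = spark.sparkContext
      .parallelize(Seq(1), 1)
      .map { _ =>
        val props = new Properties()
        val in = new FileInputStream(SparkFiles.get("myapp.properties"))
        try props.load(in) finally in.close()
        // Read a custom, non spark.-prefixed key from our own configuration file.
        props.getProperty("ip.database.path")
      }
      .collect()
      .head
    println(s"ip.database.path as seen by an executor: $dbPath")

    spark.stop()
  }
}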
That is all about the useful options of spark-submit. I hope the content above is of some help and points you in the right direction the next time a job cannot find its files, jars, or configuration.