
How to use spark-submit, a Spark1.0.0 application deployment tool


In this issue, the editor explains how to use spark-submit, the application deployment tool introduced in Spark1.0.0. The article is rich in content and analyzed from a professional perspective; I hope you gain something from reading it.

As Spark becomes more widely used, the need for an application deployment tool that supports multiple resource managers has become more urgent. Spark1.0.0 addresses this: starting with that release, Spark provides an easy-to-use deployment tool, bin/spark-submit, for quickly deploying Spark applications on local, Standalone, YARN, and Mesos.

1: Instructions for use

Go to the $SPARK_HOME directory and type bin/spark-submit --help to get help with this command.

hadoop@wyy:/app/hadoop/spark100$ bin/spark-submit --help

Usage: spark-submit [options] <app jar | python file> [app options]


Options:

--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local
--deploy-mode DEPLOY_MODE Where the driver runs: client runs it on the submitting machine, cluster runs it inside the cluster
--class CLASS_NAME Main class of the application jar
--name NAME Name of the application
--jars JARS Comma-separated list of local jars to include on the driver and executor classpaths
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python applications
--files FILES Comma-separated list of files to be placed in the working directory of each executor
--properties-file FILE File from which to load application properties; default is conf/spark-defaults.conf
--driver-memory MEM Memory for the driver; default 512M
--driver-java-options Extra Java options to pass to the driver
--driver-library-path Extra library path entries to pass to the driver
--driver-class-path Extra classpath entries for the driver; jars added with --jars are automatically included
--executor-memory MEM Memory per executor; default 1G

Spark standalone with cluster deploy mode only:

--driver-cores NUM Number of cores used by the driver; default 1
--supervise If given, the driver is restarted automatically when it fails

Spark standalone and Mesos only:

--total-executor-cores NUM Total number of cores across all executors

YARN-only:

--executor-cores NUM Number of cores per executor; default 1
--queue QUEUE_NAME YARN queue to submit to; default is the "default" queue
--num-executors NUM Number of executors to launch; default 2
--archives ARCHIVES Comma-separated list of archives to be extracted into the working directory of each executor
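Putting several of these options together, a minimal submission sketch looks like the following; the master URL, application name, class name, jar path, and arguments are placeholders rather than values from this article:

bin/spark-submit --master spark://host:7077 \
--deploy-mode client \
--name demo \
--executor-memory 1g \
--class MainClass \
/path/to/app.jar arg1 arg2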

With regard to the spark-submit help message above, there are a few points that need to be emphasized:

Submitting with --master spark://host:port --deploy-mode cluster can exhibit a known problem in this release: the driver is submitted to the cluster, and the worker that runs it is then killed.

If you use --properties-file, the properties defined in that file do not need to be repeated on the spark-submit command line. For example, if spark.master is defined in conf/spark-defaults.conf, you can omit --master. The precedence of Spark properties is: SparkConf in code > command-line options > configuration file. For details, see Spark1.0.0 attribute configuration.
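For example, a minimal sketch of conf/spark-defaults.conf that makes --master unnecessary on the command line (the host name and class name are placeholders):

# conf/spark-defaults.conf
spark.master spark://hadoop1:7077
spark.executor.memory 1g

# spark.master is then read from the file, so --master can be omitted:
bin/spark-submit --class MainClass /path/to/app.jar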

Unlike previous versions, Spark1.0.0 automatically ships the application jar and any jars given with --jars to the cluster.

Spark uses several URI schemes to control how files are propagated (an example follows the list):

file: absolute paths with the file:/ scheme are served by the driver's HTTP file server, and each executor pulls the files from the driver.

hdfs:, http:, https:, ftp: executors pull files directly from the given URL.

local: the file already exists locally on every executor, so nothing needs to be pulled; this also covers files shared to every node over NFS.
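A sketch of how these schemes look on the command line; every host, path, and class name below is a placeholder:

bin/spark-submit --master spark://host:7077 \
--jars local:/opt/libs/common.jar,hdfs://namenode:9000/libs/extra.jar \
--files file:/home/hadoop/app.conf \
--class MainClass /path/to/app.jar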

If you need to see where the configuration options come from, use the --verbose option to print more detailed information about the submission.
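For example (a sketch; the class and paths are placeholders):

bin/spark-submit --verbose --master spark://host:7077 --class MainClass /path/to/app.jar
# before launching, spark-submit prints the parsed arguments and the properties it loaded,
# which shows whether each value came from the command line, the properties file, or a default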

2: Test environment

The test program comes from the earlier article on developing Spark1.0.0 applications with IntelliJ IDEA; two of its classes, WordCount1 and WordCount2, are used for testing.

The test data comes from Sogou's user query log (SogouQ); for details, see Spark1.0.0 Development Environment Quick Build. This data set is not ideal for testing, but its complete version is large enough that portions can be split off for testing, and it is also needed by other routines, so it is used here. For the experiments, 100000 lines (SogouQ1.txt) and 200000 lines (SogouQ2.txt) were extracted.
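A minimal sketch of how such subsets could be cut from the complete log; SogouQ.txt stands in for the full data file, whose actual name may differ:

head -n 100000 SogouQ.txt > SogouQ1.txt
head -n 200000 SogouQ.txt > SogouQ2.txt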

3: Preparations

A: Cluster

Switch to user hadoop and start the virtual cluster built in Spark1.0.0 Development Environment Quick Build:

[hadoop@hadoop1 ~]$ su - hadoop

[hadoop@hadoop1 ~]$ cd /app/hadoop/hadoop220

[hadoop@hadoop1 hadoop220]$ sbin/start-all.sh

[hadoop@hadoop1 hadoop220]$ cd ../spark100/

[hadoop@hadoop1 spark100]$ sbin/start-all.sh


B: Client

Switch to user hadoop on the client side, upload the experimental data to the hadoop cluster from the /app/hadoop/hadoop220 directory, and then copy the jar built from the Spark1.0.0 application developed with IntelliJ IDEA into the /app/hadoop/spark100 directory.

mmicky@wyy:~/data$ su - hadoop

hadoop@wyy:~$ cd /app/hadoop/hadoop220

hadoop@wyy:/app/hadoop/hadoop220$ bin/hdfs dfs -mkdir -p /dataguru/data

hadoop@wyy:/app/hadoop/hadoop220$ bin/hdfs dfs -put /home/mmicky/data/SogouQ1.txt /dataguru/data/

hadoop@wyy:/app/hadoop/hadoop220$ bin/hdfs dfs -put /home/mmicky/data/SogouQ2.txt /dataguru/data/


Check block distribution for SogouQ1.txt

hadoop@wyy:/app/hadoop/hadoop220$ bin/hdfs fsck /dataguru/data/SogouQ1.txt -files -blocks -locations -racks

Connecting to namenode via http://hadoop1:50070

FSCK started by hadoop (auth:SIMPLE) from /192.168.1.111 for path /dataguru/data/SogouQ1.txt at Sat Jun 14 03:47:39 CST 2014

/dataguru/data/SogouQ1.txt 108750574 bytes, 1 block(s): OK

0. BP-1801429707-192.168.1.171-1400957381096:blk_1073741835_1011 len=108750574 repl=1 [/default-rack/192.168.1.171:50010]


Check block distribution for SogouQ2.txt

hadoop@wyy:/app/hadoop/hadoop220$ bin/hdfs fsck /dataguru/data/SogouQ2.txt -files -blocks -locations -racks

Connecting to namenode via http://hadoop1:50070

FSCK started by hadoop (auth:SIMPLE) from /192.168.1.111 for path /dataguru/data/SogouQ2.txt at Sat Jun 14 03:48:07 CST 2014

/dataguru/data/SogouQ2.txt 217441417 bytes, 2 block(s): OK

0. BP-1801429707-192.168.1.171-1400957381096:blk_1073741836_1012 len=134217728 repl=1 [/default-rack/192.168.1.173:50010]

1. BP-1801429707-192.168.1.171-1400957381096:blk_1073741837_1013 len=83223689 repl=1 [/default-rack/192.168.1.172:50010]


Switch to the spark directory and copy the package

hadoop@wyy:/app/hadoop/hadoop220$ cd ../spark100

hadoop@wyy:/app/hadoop/spark100$ cp /home/mmicky/IdeaProjects/week2/out/artifacts/week2/week2.jar .


4: Experiment

Several commands for the experimental cases are given below (hedged sketches follow the notes); the running architecture behind some of these examples is analyzed in Spark1.0.0 on Standalone running architecture instance parsing.

When submitting a spark application using spark-submit, note the following:

When deploying a Spark application to Spark Standalone from a client outside the cluster, make sure passwordless SSH login is set up between the client and the Spark Standalone cluster beforehand.

When deploying Spark applications to YARN, pay attention to the executor-memory size: the executor memory plus the container overhead memory (default 1G) must not exceed the NodeManager's available memory, otherwise no container can be allocated to run the executor.
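As hedged sketches of the experiment commands: the master host hadoop1 and Standalone port 7077 follow the cluster above, while the HDFS port 9000 and the unqualified class name WordCount1 are assumptions that may differ from the actual project.

Run WordCount1 on Standalone in client deploy mode:

bin/spark-submit --master spark://hadoop1:7077 \
--executor-memory 1g --total-executor-cores 2 \
--class WordCount1 \
week2.jar hdfs://hadoop1:9000/dataguru/data/SogouQ1.txt

Run the same class on YARN (Spark1.0.0 uses the yarn-client and yarn-cluster master URLs):

bin/spark-submit --master yarn-client \
--executor-memory 1g --num-executors 2 --executor-cores 1 \
--class WordCount1 \
week2.jar hdfs://hadoop1:9000/dataguru/data/SogouQ1.txt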

The above is how to use spark-submit, the Spark1.0.0 application deployment tool. If you have similar doubts, you may refer to the analysis above for understanding.
