How to Submit a Spark Application

2025-07-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article looks at how Spark applications are submitted. It first reviews what Spark is, its ecosystem, its run modes, its fault tolerance and its caching strategies, and then covers the submission methods.

1. What is Spark?

○ High scalability

○ High fault tolerance

○ Memory-based computing

2. The ecosystem of Spark (BDAS, the Berkeley Data Analytics Stack)

○ MapReduce is one component of the Hadoop ecosystem, while Spark is one component of the BDAS ecosystem.

○ Hadoop includes MapReduce, HDFS, HBase, Hive, ZooKeeper, Pig, Sqoop, etc.

○ BDAS includes Spark, Shark (the counterpart of Hive), BlinkDB, Spark Streaming (a real-time stream processing framework, similar to Storm), and so on.

○ BDAS ecosystem map:

3. Spark and MapReduce

Advantages of Spark over MapReduce:

○ MapReduce usually writes intermediate results to HDFS. Spark is a memory-based parallel big data framework that keeps intermediate results in memory, which makes it efficient for iterative computation.

○ MapReduce always spends a lot of time on sorting, even in scenarios that do not need it. Spark can avoid this unnecessary sorting overhead.

○ Spark represents a job as a directed acyclic graph (DAG: a graph in which no path that starts from a vertex ever returns to that vertex) and optimizes the execution of that graph.
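As a rough illustration of the DAG idea, the following Python sketch uses only the standard library (`graphlib`, Python 3.9+) and does not touch Spark itself. The stage names are hypothetical; the point is that an acyclic dependency graph always admits a valid execution order, while a cycle makes scheduling impossible.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical job: each stage maps to the set of stages it depends on.
job = {
    "read_users": set(),
    "read_orders": set(),
    "join": {"read_users", "read_orders"},
    "aggregate": {"join"},
}

def execution_order(dag):
    """Return one valid execution order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())

def is_acyclic(dag):
    """True if the dependency graph contains no cycle."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False

order = execution_order(job)
print(order)

# Adding a dependency from a source stage back onto the final stage creates a cycle.
cyclic_job = dict(job, read_users={"aggregate"})
print(is_acyclic(job), is_acyclic(cyclic_job))
```

A real scheduler does much more than order the stages, but every stage in `order` runs only after the stages it depends on.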

4. API supported by Spark

Scala, Python, Java, etc.

5. Operation mode

○ Local (for testing, development)

○ Standalone (Spark's own standalone cluster mode)

○ Spark on YARN (Spark running on YARN)

○ Spark on Mesos (Spark running on Mesos)
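Each run mode corresponds to a master URL that is passed to Spark when an application starts. The helper below is a small Python sketch of that mapping; the function name and the host names are hypothetical, and the default ports (7077 for standalone, 5050 for Mesos) are the usual defaults, which a given cluster may override.

```python
def master_url(mode, host=None, port=None, cores="*"):
    """Build a Spark master URL for one of the four run modes (illustrative helper)."""
    if mode == "local":
        # local[K] runs Spark in one JVM with K worker threads; local[*] uses all cores.
        return f"local[{cores}]"
    if mode == "standalone":
        return f"spark://{host}:{port or 7077}"
    if mode == "yarn":
        # The cluster location is taken from the Hadoop configuration, not from the URL.
        return "yarn"
    if mode == "mesos":
        return f"mesos://{host}:{port or 5050}"
    raise ValueError(f"unknown run mode: {mode}")

print(master_url("local", cores=4))
print(master_url("standalone", host="master.example.com"))  # hypothetical host
print(master_url("yarn"))
print(master_url("mesos", host="mesos.example.com"))        # hypothetical host
```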

6. Spark at runtime

The Driver program starts multiple Workers. The Workers load data from the file system, generate RDDs (that is, the data is placed into RDDs; an RDD is a data structure), and cache the data in memory partition by partition. As shown in the figure:
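The flow can be imitated in plain Python without Spark. In this toy sketch, threads stand in for Workers and a dictionary stands in for the in-memory cache; all function names are hypothetical and nothing here is Spark's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(records, num_partitions):
    """Distribute the records round-robin over the partitions."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def load_partition(partition_id, records):
    """Stand-in for a Worker loading one partition of the data."""
    return partition_id, list(records)

def run_driver(records, num_workers):
    """Stand-in for the Driver: hand one partition to each worker and cache the results."""
    partitions = split_into_partitions(records, num_workers)
    cache = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for pid, data in pool.map(load_partition, range(num_workers), partitions):
            cache[pid] = data  # cached in memory, keyed by partition id
    return cache

cache = run_driver(list(range(10)), num_workers=3)
print(cache)
```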

8. Fault-tolerant Lineage

8.1. Basic concepts of fault tolerance

○ Each RDD records the parent RDDs it depends on. If a partition of an RDD is lost, it can be quickly recomputed in parallel from its parents.
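The lineage idea can be sketched with a toy class in Python. `ToyRDD` and its methods are hypothetical and only mimic the behaviour described above: each dataset remembers its parent and the function that derives it, so a lost partition is rebuilt from the parent instead of being restored from a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions are computed lazily and cached."""

    def __init__(self, name, num_partitions, compute, parent=None):
        self.name = name
        self.num_partitions = num_partitions
        self.compute = compute  # function: partition index -> list of records
        self.parent = parent    # lineage: the dataset this one was derived from
        self.cache = {}

    def map(self, name, fn):
        """Derive a new dataset; partition i depends only on parent partition i."""
        def compute(i):
            return [fn(x) for x in self.partition(i)]
        return ToyRDD(name, self.num_partitions, compute, parent=self)

    def partition(self, i):
        """Return partition i, recomputing it through the lineage if it is not cached."""
        if i not in self.cache:
            self.cache[i] = self.compute(i)
        return self.cache[i]

    def lose(self, i):
        """Simulate losing the cached copy of partition i."""
        self.cache.pop(i, None)

    def lineage(self):
        """Names of this dataset and its ancestors, newest first."""
        chain, rdd = [], self
        while rdd is not None:
            chain.append(rdd.name)
            rdd = rdd.parent
        return chain

source = [[1, 2], [3, 4], [5, 6]]
base = ToyRDD("source", 3, lambda i: list(source[i]))
doubled = base.map("doubled", lambda x: x * 2)

before = [doubled.partition(i) for i in range(3)]
doubled.lose(1)                   # partition 1 is lost
recovered = doubled.partition(1)  # rebuilt from the parent, other partitions untouched
print(doubled.lineage(), before, recovered)
```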

8.2. Narrow dependencies and wide dependencies

○ The dependencies between RDDs can be divided into narrow dependencies and wide dependencies.

○ Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. Because no partition has multiple dependents, the partitions can be processed in one pass on a single node, and after data loss or corruption the partition can be quickly recovered from the previous RDD.

○ Wide dependency: each partition of the parent RDD may be used by multiple partitions of the child RDD. Because of these multiple dependencies, the next step cannot start until all the data arriving at the node has been processed. If data is lost or corrupted at that point, recovery is very costly. The data of all nodes must therefore be materialized (stored on disk) beforehand so that recovery is possible.

○ Example diagram of wide and narrow dependencies:
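The difference can also be shown in code. The Python sketch below is a toy model, not Spark's implementation: `map_values` stands in for a narrow transformation and `group_by_key` for a wide one, and each function also returns which parent partitions every child partition reads. Keys are small integers so that `key % num_out` gives a reproducible partitioning.

```python
parent = [
    [(1, "a"), (2, "b")],  # parent partition 0
    [(3, "c"), (4, "d")],  # parent partition 1
    [(1, "e"), (4, "f")],  # parent partition 2
]

def map_values(partitions, fn):
    """Narrow: child partition i is built from parent partition i alone."""
    children = [[(k, fn(v)) for k, v in part] for part in partitions]
    deps = {i: {i} for i in range(len(partitions))}
    return children, deps

def group_by_key(partitions, num_out):
    """Wide: each child partition may pull records from every parent partition."""
    children = [dict() for _ in range(num_out)]
    deps = {j: set() for j in range(num_out)}
    for i, part in enumerate(partitions):
        for k, v in part:
            j = k % num_out  # target child partition for this key
            children[j].setdefault(k, []).append(v)
            deps[j].add(i)
    return children, deps

def is_narrow(deps):
    """True if every parent partition is read by at most one child partition."""
    used_by = {}
    for child, parents in deps.items():
        for p in parents:
            used_by.setdefault(p, set()).add(child)
    return all(len(children) <= 1 for children in used_by.values())

mapped, map_deps = map_values(parent, str.upper)
grouped, group_deps = group_by_key(parent, num_out=2)
print(map_deps, is_narrow(map_deps))
print(group_deps, is_narrow(group_deps))
```

In the wide case, losing one child partition means reading from all three parent partitions again, which is why that data is worth materializing.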

9. Caching strategy

Spark defines 11 cache strategies by combining four parameters: useDisk, useMemory, deserialized and replication.

useDisk: cache on disk (boolean)

useMemory: cache in memory (boolean)

deserialized: whether the data is kept in deserialized form (serialization converts objects into a form that can be sent over the network; boolean: true = deserialized, false = serialized)

replication: number of copies (int)

The strategy is selected by passing these parameters to the constructor of the StorageLevel class, which is declared as follows:

class StorageLevel private (useDisk: Boolean, useMemory: Boolean, deserialized: Boolean, replication: Int)
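To make the combinations concrete, here is a Python model of the four parameters. `ToyStorageLevel` is a hypothetical stand-in, not the Spark class, and the level names and flag values are an assumption based on the constants shipped with early Spark releases; newer releases add further levels and an off-heap flag, so check them against the version in use.

```python
from typing import NamedTuple

class ToyStorageLevel(NamedTuple):
    """Plain-Python model of the four StorageLevel parameters."""
    use_disk: bool
    use_memory: bool
    deserialized: bool
    replication: int = 1

# Assumed level names and flags (early Spark releases).
LEVELS = {
    "NONE":                  ToyStorageLevel(False, False, False),
    "DISK_ONLY":             ToyStorageLevel(True, False, False),
    "DISK_ONLY_2":           ToyStorageLevel(True, False, False, 2),
    "MEMORY_ONLY":           ToyStorageLevel(False, True, True),
    "MEMORY_ONLY_2":         ToyStorageLevel(False, True, True, 2),
    "MEMORY_ONLY_SER":       ToyStorageLevel(False, True, False),
    "MEMORY_ONLY_SER_2":     ToyStorageLevel(False, True, False, 2),
    "MEMORY_AND_DISK":       ToyStorageLevel(True, True, True),
    "MEMORY_AND_DISK_2":     ToyStorageLevel(True, True, True, 2),
    "MEMORY_AND_DISK_SER":   ToyStorageLevel(True, True, False),
    "MEMORY_AND_DISK_SER_2": ToyStorageLevel(True, True, False, 2),
}

def describe(level):
    """Summarize a level in words."""
    places = [name for name, used in (("memory", level.use_memory), ("disk", level.use_disk)) if used]
    where = " and ".join(places) if places else "nowhere"
    form = "deserialized objects" if level.deserialized else "serialized bytes"
    return f"stored in {where} as {form}, {level.replication} copy/copies"

for name, level in LEVELS.items():
    print(f"{name}: {describe(level)}")
```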

10. Method of submission

○ spark-submit (official recommendation)

○ sbt run

○ java -jar

Various parameters can be specified when submitting:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

For example:
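A minimal sketch in Python that assembles such a command as an argument list. The helper `build_spark_submit`, the class name `com.example.WordCount`, the jar path and the input file are all hypothetical; the flags `--class`, `--master`, `--deploy-mode` and `--conf` are the ones named in this section, and `spark.executor.memory` is one common configuration key.

```python
import shlex

def build_spark_submit(main_class, master, app_jar,
                       deploy_mode="client", conf=None, app_args=()):
    """Assemble a spark-submit command line as a list of arguments."""
    cmd = ["./bin/spark-submit",
           "--class", main_class,
           "--master", master,
           "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)                  # the application jar comes after all options
    cmd += [str(a) for a in app_args]    # everything after the jar goes to the application
    return cmd

cmd = build_spark_submit(
    main_class="com.example.WordCount",   # hypothetical main class
    master="local[4]",
    app_jar="/path/to/wordcount.jar",     # hypothetical jar path
    conf={"spark.executor.memory": "2g"},
    app_args=["input.txt"],               # hypothetical application argument
)
print(shlex.join(cmd))  # shell-quoted form, suitable for pasting into a terminal
```

From a Spark installation directory, the list could be handed to `subprocess.run(cmd)`; building it as a list avoids shell quoting problems with values such as `local[4]`.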

That covers the ways a Spark application can be submitted. The best way to understand them is to try them out in practice.