
How to understand Stage, Executor, and Driver in Spark for big data development


How should we understand Stage, Executor, and Driver in Spark for big data development? This article analyzes the question in detail, hoping to help readers who want to answer it find a simple and feasible approach.

1. Introduction

For beginners, the first hurdle is Spark's operating mechanism: in conversation they often confuse terms such as deployment mode and running mode. Even veterans with development experience may know the operating mechanism yet not be fluent in Spark's terminology. Understanding these terms is therefore essential for Spark developers to communicate. Let's start with Spark's operating mechanism and then use a WordCount case to understand the various terms.

2. The operating mechanism of Spark

First, a diagram from the official website illustrates the general execution framework of a Spark application on a distributed cluster. It is mainly composed of the SparkContext (the Spark context), the cluster manager (the resource manager), and the executors (the worker processes on individual nodes). The cluster manager is responsible for unified resource management across the whole cluster; an executor is the main worker process of the application, containing multiple task threads and memory space.

The main operation flow of Spark is as follows:

After the application is submitted with spark-submit, it initializes the SparkContext, that is, Spark's running environment, in the location determined by the deploy mode specified at submission time, and creates the DAG Scheduler and the Task Scheduler. Based on the application's code, the Driver splits the whole program into multiple jobs, one per action operator, and builds a DAG for each job. The DAG Scheduler divides each DAG into multiple stages and each stage into multiple tasks, then submits the resulting task sets to the Task Scheduler, which is responsible for scheduling the tasks on the cluster. How stages relate to tasks and how they are divided is discussed in more detail later.

Based on the resource requirements in the SparkContext, the Driver requests resources from the resource manager, including the number of executors and the memory for each.

After receiving the request, the resource manager creates executor processes on worker nodes that meet the criteria.

After an executor is created, it registers itself back with the driver so that the driver can assign tasks to it.

When the program finishes executing, the driver releases the requested resources back to the resource manager.
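
As an illustration, a typical submission covering this flow might look like the sketch below. The class name and jar file are hypothetical; the flags are standard spark-submit options for choosing the deploy mode and requesting executor resources:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class WordCount \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  wordcount.jar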

3. Understand the terms in Spark

With the operating mechanism in mind, let's explain the following terms.

3.1 Driver program

The Driver is the Spark application we write; it creates the SparkContext (or SparkSession), communicates with the cluster manager, and assigns tasks to executors for execution.
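
A minimal sketch of the driver side, assuming Spark 2.x or later (the object and app names are arbitrary):

import org.apache.spark.sql.SparkSession

object MyDriver {
  def main(args: Array[String]): Unit = {
    // The driver process starts here: building the SparkSession
    // also creates the SparkContext inside the driver JVM
    val spark = SparkSession.builder()
      .master("yarn")
      .appName("MyApp")
      .getOrCreate()
    val sc = spark.sparkContext
    // Transformations and actions written here are planned by the driver
    // and executed as tasks on the executors
    spark.stop()
  }
}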

3.2 Cluster Manager

The cluster manager is responsible for resource scheduling for the whole application. The main cluster managers are:

YARN

Spark Standalone

Mesos

3.3 Executors

An executor is an independent JVM process whose main job is to run tasks on a worker node. Within one executor, multiple tasks can execute in parallel at the same time.
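
Executor resources are typically requested through configuration. A minimal sketch using standard Spark properties (the values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("yarn")
  .setAppName("ExecutorDemo")
  .set("spark.executor.instances", "4") // number of executor JVMs to request
  .set("spark.executor.cores", "2")     // parallel task slots per executor
  .set("spark.executor.memory", "2g")   // heap memory per executor
val sc = new SparkContext(conf)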

3.4 Job

A Job is a complete processing flow of the user program; it is a logical unit of work.
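
Because transformations are lazy, a Job only starts when an action runs. A small sketch, assuming a SparkContext sc and a hypothetical input path:

val lines = sc.textFile("data/spark/wc.txt") // transformation: nothing runs yet
val upper = lines.map(_.toUpperCase)         // still lazy, no Job submitted
val n = upper.count()                        // action: triggers exactly one Job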

3.5 Stage

A Job can contain multiple Stages, and those Stages execute serially. Stage boundaries are triggered by shuffle operations such as reduceByKey, or by output actions such as save.
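
A short sketch of such a boundary, again assuming a SparkContext sc and a hypothetical path; toDebugString prints the RDD lineage, where the indentation marks the shuffle between the two stages:

val pairs = sc.textFile("data/spark/wc.txt")
  .flatMap(_.split(" "))
  .map((_, 1))                          // narrow operations: stay in the first stage
val counts = pairs.reduceByKey(_ + _)   // shuffle: opens a second stage
println(counts.toDebugString)           // lineage shows the ShuffledRDD boundary
counts.collect()                        // one action -> one Job with two stages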

3.6 Task

A Stage can contain multiple tasks, one per partition of the data. In sc.textFile("/ xxxx").map().filter(), for example, the narrow operations map and filter are pipelined and run together inside each task, with the output of each operator becoming the input of the next.

3.7 Partition

A Partition is a slice of the data source in Spark. A complete data source is split into multiple partitions so that Spark can distribute them to multiple executors and execute tasks in parallel.
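
A quick sketch of working with partitions (the path and the numbers are illustrative):

val rdd = sc.textFile("data/spark/wc.txt", 8) // ask for at least 8 partitions
println(rdd.getNumPartitions)                 // how many partitions Spark created
val wider = rdd.repartition(16)               // reshuffle the data into 16 partitions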

3.8 RDD

An RDD is a resilient distributed dataset. In Spark, a data source can be regarded as one large RDD composed of multiple partitions; the data Spark loads is held in an RDD, which internally is indeed cut into multiple partitions.
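
A minimal sketch that makes the RDD-partition relationship visible (the numbers are arbitrary):

val rdd = sc.parallelize(1 to 100, 4)   // one RDD explicitly cut into 4 partitions
rdd.mapPartitionsWithIndex { (idx, it) =>
  Iterator((idx, it.size))              // (partition index, element count)
}.collect().foreach(println)            // e.g. (0,25) (1,25) (2,25) (3,25)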

So the question is: how does a Spark job actually run?

(1) The Spark program we wrote, also known as the driver, submits a job to the Cluster Manager.

(2) The Cluster Manager checks data locality and finds the most suitable nodes on which to schedule tasks.

(3) The job is split into different stages, and each stage is split into multiple tasks.

(4) The driver sends the tasks to the executors for execution.

(5) The driver tracks the execution of each task and reports the status to the master node, which we can view on the Spark master UI.

(6) When the job completes, the results from all nodes are aggregated on the master node again, including the average time, the maximum time, the median, and so on.

3.9 Deployment mode and running mode

Deployment mode refers to the Cluster Manager, generally Standalone or YARN, while running mode refers to where the Driver runs: on the cluster, or on the machine that submits the task, corresponding to Cluster and Client mode respectively. The two differ in how results are returned, where logs go, stability, and so on.
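
For example, the same application can be submitted in either running mode (the class and jar names are hypothetical):

# Client mode: the driver runs on the machine that submits the job
spark-submit --master yarn --deploy-mode client --class WordCount wordcount.jar

# Cluster mode: the driver runs inside the cluster (in YARN's ApplicationMaster)
spark-submit --master yarn --deploy-mode cluster --class WordCount wordcount.jar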

4. Understand the terms again through a WordCount case

Job: a Job is triggered by an Action, so a Job contains one Action and N transformation operations.

Stage: a Stage is a set of Tasks delimited by shuffle operations; Stages are divided according to wide dependencies.

Task: the minimum execution unit. Because each Task is responsible for processing the data of a single partition, there are generally as many Tasks as there are partitions. The Tasks of a Stage actually perform the same computation on different partitions.

Here is a WordCount program

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("yarn").setAppName("WordCount")
    val sc = new SparkContext(conf)
    val lines1: RDD[String] = sc.textFile("data/spark/wc.txt")
    val lines2: RDD[String] = sc.textFile("data/spark/wc2.txt")
    val j1 = lines1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    val j2 = lines2.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    j1.join(j2).collect()
    sc.stop()
  }
}

YARN mode is widely used in production, so consider this code from the YARN deployment point of view. There is only one action in the code, collect, so there is only one Job; because of the shuffles, the Job is divided into three stages. The flatMap, map, and reduceByKey chain on lines1 forms Stage 0; the same chain on lines2 forms Stage 1; and the third stage joins the two results and collects them. The third stage depends on Stage 0 and Stage 1, while Stage 0 and Stage 1 run in parallel. In a real production environment, looking at the stage dependency graph makes these dependencies easy to see.

That concludes this analysis of how to understand Stage, Executor, and Driver in Spark for big data development. I hope the above content is of some help. If you still have many doubts to resolve, you can follow the industry information channel to learn more.
