This article explains the basic workflow of Spark. Many readers may not be familiar with it, so it is shared here for reference; hopefully you will learn something useful from it.
Introduction
A Spark application has two parts: task scheduling and task execution. Every Spark application relies on a SparkContext and on Executors. Executors are responsible for running tasks, and a machine that runs an Executor is called a Worker node. The SparkContext is started by the user program and communicates with the Executors through the resource scheduling module.
Specifically, the SparkContext is the entry point of the whole program. During its initialization, Spark creates two scheduling layers: the DAGScheduler for job (stage-level) scheduling and the TaskScheduler for task scheduling.
The job scheduling module is a high-level, stage-oriented scheduler. For each Spark job it computes a set of stages with dependencies between them (stages are usually split at shuffle boundaries), builds a set of concrete tasks for each stage (taking data locality into account), and submits them to the task scheduling module as TaskSets for execution. The task scheduling module is responsible for launching tasks and for monitoring and reporting their progress.
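To make the stage boundary concrete, here is a minimal sketch (local mode, with a made-up object name and sample data) that uses only the public RDD API: the reduceByKey below introduces a shuffle, so the DAGScheduler splits the job into a map-side stage and a reduce-side stage, which shows up in the lineage printed by toDebugString.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and data are illustrative only.
object StageBoundaryExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stage-boundary-example").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // reduceByKey forces a shuffle

    // The printed lineage contains a ShuffledRDD: the DAGScheduler uses that shuffle
    // as the boundary between the map-side stage and the reduce-side stage.
    println(counts.toDebugString)

    counts.collect().foreach(println)                        // the action triggers the job
    sc.stop()
  }
}
```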
The runtime architecture of a Spark application:
(1) Put simply:
The Driver requests resources from the cluster, the cluster manager allocates resources and starts Executors. The Driver ships the code and files of the Spark application to the Executors. Tasks run on the Executors, and the results are returned to the Driver or written to external storage.
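As a concrete illustration of that simple flow, here is a minimal driver-program sketch; the master URL, object name, and output path are placeholders rather than anything from the original text. The reduce action brings a result back to the Driver, while saveAsTextFile writes results out to external storage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program; the master URL and output path are placeholders.
object SimpleDriverFlow {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("simple-driver-flow")
      .setMaster("spark://master-host:7077")    // placeholder cluster manager URL

    val sc = new SparkContext(conf)              // driver side: requests resources, starts Executors

    val nums    = sc.parallelize(1 to 1000, numSlices = 4)
    val squares = nums.map(n => n * n)           // transformation, later executed on Executors

    val total = squares.reduce(_ + _)            // action: the result comes back to the Driver
    println(s"sum of squares = $total")

    squares.saveAsTextFile("hdfs:///tmp/squares-output")  // or write results to the outside world
    sc.stop()
  }
}
```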
(2) In more detail:
The application is submitted and a SparkContext is built. The SparkContext builds the DAG and submits it to the scheduler for parsing; the DAG is parsed into stages, which are submitted to the cluster and scheduled by the cluster's task manager, and Spark Executors are started in the cluster. The Driver passes code and files to the Executors, which perform the various operations that make up each task. A block tracker on the Driver records the data blocks that the Executors generate on each node. When the tasks finish running, the data is written to HDFS or other data stores.
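The following sketch loosely mirrors that description using only the public API (the input and output paths and the object name are placeholders): the persisted RDD's partitions become blocks on the Executors, whose locations the Driver keeps track of, and the final output is written to HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical pipeline; paths are placeholders.
object CachedPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cached-pipeline").setMaster("local[2]"))

    val lines   = sc.textFile("hdfs:///data/input.txt")            // placeholder input path
    val cleaned = lines.filter(_.nonEmpty).persist(StorageLevel.MEMORY_ONLY)

    // The first action materializes the cached partitions as blocks on the Executors;
    // the Driver records which node holds which block.
    println(s"non-empty lines: ${cleaned.count()}")

    // Re-using the cached RDD avoids recomputation; the final result goes to HDFS.
    cleaned.map(_.toUpperCase).saveAsTextFile("hdfs:///data/output") // placeholder output path
    sc.stop()
  }
}
```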
(3) Putting it all together:
The Spark application performs various transformations and finally triggers a job through an action. After submission, the SparkContext first builds the DAG from the dependency relationships of the RDDs and submits it to the DAGScheduler for parsing. Parsing works backwards with shuffles as the boundaries, building stages that themselves depend on one another; this step analyses the DAG into stages and computes the dependencies between them. Each stage's TaskSet is then submitted to the lower-level scheduler, which in Spark is the TaskScheduler; it creates a TaskSetManager and finally submits the tasks to Executors for computation. The Executors compute with multiple threads and report back to the TaskSetManager, which reports to the TaskScheduler, which in turn reports back to the DAGScheduler. When everything has run, the data is written out.
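A small sketch of such a DAG, with hypothetical data and object name: the two shuffle operations below (reduceByKey and sortByKey) become stage boundaries, so the job triggered by collect resolves into three dependent stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical multi-stage job; data is illustrative only.
object MultiStageJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("multi-stage-job").setMaster("local[2]"))

    val pairs   = sc.parallelize(Seq("b" -> 2, "a" -> 1, "b" -> 3, "a" -> 4))
    val summed  = pairs.reduceByKey(_ + _)   // shuffle #1: boundary between the first two stages
    val ordered = summed.sortByKey()         // shuffle #2: boundary before the final stage

    // Only this action triggers the job; the DAGScheduler then walks the lineage
    // backwards from it, cutting stages at each shuffle dependency.
    ordered.collect().foreach(println)
    sc.stop()
  }
}
```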
(4) For a deeper understanding:
After the application is submitted, an action triggers the job: the SparkContext is built, the DAG is built and submitted to the DAGScheduler, stages are built, each stage's TaskSet is submitted to the TaskScheduler, a TaskSetManager is created, and the tasks are submitted to Executors to run. When an Executor finishes running a task, it reports completion to the SchedulerBackend, which passes the completion information to the TaskScheduler. The TaskScheduler notifies the TaskSetManager, which removes the finished task and moves on to the next one. At the same time, the TaskScheduler inserts the completed result into a success queue and returns a success signal; it then sends the task-success information to the TaskSetManager. Once all tasks are complete, the TaskSetManager feeds the results back to the DAGScheduler: if a task is a ResultTask, its result is handed to the JobListener; otherwise the result is saved.
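Rather than touching the internal scheduler classes, the feedback chain described above can be observed with Spark's public listener API. The sketch below (local mode, hypothetical object name) registers a SparkListener and prints the task, stage, and job completion events as they arrive back at the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted, SparkListenerTaskEnd}

// Hypothetical observation harness for scheduling feedback events.
object SchedulingFeedback {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scheduling-feedback").setMaster("local[2]"))

    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        println(s"task ${taskEnd.taskInfo.taskId} finished (successful=${taskEnd.taskInfo.successful})")
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        println(s"stage ${stage.stageInfo.stageId} completed with ${stage.stageInfo.numTasks} tasks")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"job ${jobEnd.jobId} ended: ${jobEnd.jobResult}")
    })

    // The action below triggers a job; the events printed above mirror the
    // Executor -> TaskScheduler -> DAGScheduler feedback chain described in the text.
    sc.parallelize(1 to 100, 4).map(_ * 2).count()
    sc.stop()
  }
}
```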
Submitting a Spark job for execution
SparkContext submits the job through the DAGScheduler's runJob. The DAGScheduler then divides the Job into Stages; Spark splits Stages according to the dependencies between RDDs and finally wraps each Stage into a TaskSet for submission. The TaskScheduler class is responsible for allocating scheduling resources to tasks, and the SchedulerBackend communicates with the Master and the Workers to collect the resources allocated to the application on the Workers. Executors are responsible for actually executing the tasks.
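As a sketch of the resource side of this, the executor resources an application asks for are declared on the SparkConf before the SparkContext is created; the master URL and the values below are placeholders, and the keys shown are standard Spark configuration properties for standalone mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical resource configuration; values are placeholders, not recommendations.
object ResourceConfigExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("resource-config-example")
      .setMaster("spark://master-host:7077")    // placeholder standalone Master URL
      .set("spark.executor.memory", "2g")       // memory requested per Executor
      .set("spark.executor.cores", "2")         // cores requested per Executor
      .set("spark.cores.max", "8")              // cap on total cores for this application

    val sc = new SparkContext(conf)
    // Tasks of this job are scheduled onto whatever Executors the Workers granted.
    println(sc.parallelize(1 to 1000000, 8).count())
    sc.stop()
  }
}
```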
The Driver's task submission process
1. The Driver program runs until it reaches an action, which triggers SparkContext's runJob method.
2. SparkContext calls the runJob function of DAGScheduler.
3. DAGScheduler divides the Job into stages, converts each stage into the corresponding Tasks, and hands the Tasks to the TaskScheduler.
4. The TaskScheduler adds the Tasks to its task queue and hands them to the SchedulerBackend for resource allocation and task scheduling.
5. The scheduler assigns each Task to an Executor, and the ExecutorBackend is responsible for executing the Task (a minimal sketch of this whole flow follows below).
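Here is a minimal end-to-end sketch matching those five steps (local mode, hypothetical object name); it uses only the user-level API, since the scheduler classes themselves are internal, and the comments mark where each step happens.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical end-to-end example of the Driver's submission flow.
object DriverSubmissionFlow {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-submission-flow").setMaster("local[2]"))

    val data    = sc.parallelize(1 to 10000, 4)
    val doubled = data.map(_ * 2)          // transformations only build the lineage

    // Step 1: the action below triggers SparkContext.runJob.
    // Steps 2-3: SparkContext hands the job to the DAGScheduler, which builds stages and Tasks.
    // Steps 4-5: the TaskScheduler queues the Tasks and the SchedulerBackend places them on
    //            Executors, where the ExecutorBackend runs them.
    val result = doubled.reduce(_ + _)
    println(s"result = $result")
    sc.stop()
  }
}
```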
That covers the basic workflow of Spark. Thank you for reading! Hopefully this article has given you a clearer picture; if you would like to learn more, feel free to follow the industry information channel.