First, the basic concepts of Spark are introduced.
1. Cluster Manager: the resource management center of the Spark cluster
1 > Standalone mode: the Cluster Manager is Spark's built-in resource manager, and the Master node is responsible for resource allocation
2 > Hadoop YARN mode: the ResourceManager in YARN acts as the Cluster Manager and is responsible for resource allocation
3 > Mesos mode: the Mesos Master acts as the Cluster Manager (see the master-URL sketch after this list).
2. Worker Node: a worker node in the Spark cluster that can run Application code.
3. Executor: a process running on a worker node (Worker Node) that executes specific tasks (Task) and stores data in memory or on disk.
4. Application: a Spark Application is a program built by the user on top of Spark.
As shown in the figure (not reproduced here):
1 > it contains one Driver and a set of Executor processes dedicated to that application
2 > each Application contains multiple jobs (Job), each Job contains multiple stages (Stage), and each Stage contains multiple tasks (Task)
3 > Job: a job; a Job contains multiple RDDs and the various operations applied to those RDDs
4 > Stage: the basic scheduling unit of a job; a job is divided into multiple groups of tasks, and each group is called a stage
5 > TaskScheduler: the task scheduler
6 > Task: a unit of work that runs on an Executor; it executes as a thread inside the Executor.
5. Driver Program: the driver
1 > runs the main() method of the Application and creates the SparkContext
2 > the Driver consists of two parts: the main() method and the SparkContext (a minimal driver sketch appears after this list).
6. DAG: Directed Acyclic Graph
1 > reflects the dependencies between RDDs
2 > how it works is shown in the figure (not reproduced here).
7. DAGScheduler: the directed acyclic graph scheduler
1 > splits the DAG into Stages and submits each Stage to the TaskScheduler in the form of a TaskSet
2 > responsible for splitting a job into batches of mutually dependent tasks across different stages
3 > computes the dependencies between jobs and tasks and determines the scheduling logic
4 > instantiated when the SparkContext is initialized; each SparkContext corresponds to one DAGScheduler
5 > how it works is shown in the figure (not reproduced here).
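As a quick illustration of the three deployment modes listed under Cluster Manager above, the master URL passed to SparkConf is what selects which Cluster Manager the Driver registers with. This is only a sketch; the host names and ports are placeholders, not values from this article.
```scala
import org.apache.spark.SparkConf

object ClusterManagerSketch {
  // The master URL decides which Cluster Manager the Driver registers with.
  // Host names and ports below are placeholders.
  val standalone = new SparkConf().setAppName("demo").setMaster("spark://master-host:7077") // Standalone: Spark's own Master allocates resources
  val yarn       = new SparkConf().setAppName("demo").setMaster("yarn")                     // YARN: the ResourceManager acts as the Cluster Manager
  val mesos      = new SparkConf().setAppName("demo").setMaster("mesos://mesos-host:5050")  // Mesos: the Mesos Master acts as the Cluster Manager
}
```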
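To tie the remaining concepts together (Driver, SparkContext, DAG, Job, Stage, Task), here is a minimal, hypothetical word-count driver in Scala. The input path is a placeholder, and local[2] is used only so the sketch can run on a single machine; toDebugString simply prints the RDD lineage so you can see where the shuffle splits the job into stages.
```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // The Driver: main() creates the SparkContext (local[2] here so the sketch
    // runs on one machine; on a cluster the master comes from spark-submit).
    val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[2]")
    val sc   = new SparkContext(conf)

    // The transformation chain only defines the DAG; nothing runs yet.
    val lines  = sc.textFile("input.txt")         // hypothetical input path
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                         // shuffle dependency: stage boundary

    // The lineage (DAG) that the DAGScheduler will cut into stages at the shuffle.
    println(counts.toDebugString)

    // The action triggers one job: two stages, each made of one task per partition,
    // and each task runs as a thread inside an Executor.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```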
Second, how does Spark execute a task?
The official process is shown in the figure (not reproduced here):
Next, we will explain each step in detail:
1. The client submits the application with spark-submit (see the submission sketch after this list).
2. The Driver creates the application: it runs the Application's main function and creates the SparkContext object, which prepares the running environment of the Spark application and is responsible for interacting with the Cluster Manager.
3. The SparkContext registers with the Cluster Manager (Master) and applies for the Executor resources it needs.
4. The Cluster Manager allocates resources according to its resource allocation policy.
5. The Cluster Manager sends instructions to the assigned Worker Nodes to start Executor processes.
6. After receiving the instruction, each Worker Node starts an Executor process.
7. The Executor processes send heartbeats to the Cluster Manager.
8. The SparkContext in the Driver program builds a DAG graph.
9. The SparkContext further decomposes the DAG into Stages (that is, task sets, TaskSet).
10. The SparkContext then sends each Stage (TaskSet) to the task scheduler (TaskScheduler).
11. Executors apply to the TaskScheduler for tasks (Task).
12. The TaskScheduler issues Tasks to the Executors to run, and at the same time the SparkContext ships the Application code to the Executors.
13. The Executors run the application code, and the Driver monitors task execution.
14. After an Executor finishes running its tasks, it sends a task-completion signal to the Driver.
15. The Driver shuts down the SparkContext and sends a deregistration signal to the Cluster Manager.
16. After receiving the Driver's deregistration signal, the Cluster Manager sends a release-resources signal to the Worker Nodes.
17. The Executor processes on the Worker Nodes stop running and their resources are released.
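The following sketch walks through these steps from the Driver's point of view. It is only an illustration: the jar name, class name, and spark-submit line in the comments are hypothetical, and the step numbers in the comments map roughly onto the list above.
```scala
// Step 1 (client side) would look roughly like:
//   spark-submit --master spark://master-host:7077 --class demo.SubmitSketch submit-sketch.jar
// Everything below runs inside the Driver once the application has started.
package demo

import org.apache.spark.{SparkConf, SparkContext}

object SubmitSketch {
  def main(args: Array[String]): Unit = {
    // Steps 2-3: run main() and create the SparkContext; it registers with the
    // Cluster Manager and applies for Executor resources (steps 4-7 then happen
    // between the Cluster Manager and the Worker Nodes).
    val sc = new SparkContext(new SparkConf().setAppName("submit-sketch"))

    // Steps 8-10: the transformation chain defines the DAG; the action below makes
    // the SparkContext split it into Stages and hand TaskSets to the TaskScheduler.
    val data   = sc.parallelize(1 to 1000, numSlices = 8)
    val result = data.map(_ * 2).filter(_ % 3 == 0).count()  // steps 11-14: tasks run on Executors

    println(s"multiples of 3 after doubling: $result")

    // Steps 15-17: stopping the SparkContext deregisters from the Cluster Manager,
    // after which the Worker Nodes release the Executor resources.
    sc.stop()
  }
}
```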