The Architecture and Operation Logic of Spark

2025-01-18 Update From: SLTechnology News&Howtos


This article focuses on "the architecture and running logic of Spark". The method introduced here is simple, fast, and practical, so interested friends may wish to take a look. Next, let the editor take you through "the architecture and running logic of Spark".

One: The architecture of Spark

1. Driver: runs the Application's main() function and creates the SparkContext (a minimal driver sketch follows this list).

2. Client: the client from which the user submits the job.

3. Worker: any node in the cluster that can run Application code; it runs one or more Executor processes.

4. Executor: the task executor running on a Worker. An Executor starts a thread pool to run Tasks and is responsible for storing data in memory or on disk. Each Application applies for its own Executors to handle its tasks.

5. SparkContext: the context of the whole application; it controls the application's life cycle.

6. RDD: the basic computing unit of Spark; a set of RDDs forms an executable directed acyclic graph, the RDD Graph.

7. DAGScheduler: builds a Stage-based DAG for each Job and submits the Stages to the TaskScheduler.

8. TaskScheduler: distributes Tasks to Executors for execution.

9. SparkEnv: a thread-level context that stores references to important runtime components.
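To make the roles of the Driver and the SparkContext concrete, here is a minimal Scala sketch of a driver program. The object name, application name, and master URL are placeholders for illustration, not values prescribed by Spark:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The Driver is simply the process that runs this main(); creating the
// SparkContext registers the application with the cluster manager.
object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")              // placeholder application name
      .setMaster("spark://master:7077") // placeholder master URL
    val sc = new SparkContext(conf)

    // ... build RDDs and trigger Actions here ...

    sc.stop() // ends the application's life cycle
  }
}
```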

Two: Running logic

1. Spark job submission process.

As shown in the figure below, the Client submits an application, and the Master finds a Worker on which to launch the Driver. The Driver applies for resources from the Master or the resource manager and converts the application into a directed acyclic graph of RDDs. The DAGScheduler then transforms the RDD DAG into a DAG of Stages and submits it to the TaskScheduler, which submits the Tasks to Executors for execution. During task execution, the other components cooperate to ensure the smooth execution of the whole application.
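On a Standalone cluster, this submission flow corresponds to launching the application with spark-submit in cluster deploy mode, where the Master launches the Driver on a Worker as described above. The class name, jar path, and master URL below are placeholders:

```
spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  --deploy-mode cluster \
  myapp.jar
```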

2. Spark job running logic.

As shown in the following figure, in a Spark application the whole execution process forms a directed acyclic graph of logical operations. Once an Action operator is triggered, all the accumulated operators are formed into a directed acyclic graph, and the scheduler then schedules the tasks on the graph for execution. Spark's scheduling differs from MapReduce's: Spark splits the graph into different phases (Stages) according to the dependencies between RDDs, and one phase contains a series of functions executed as a pipeline. A, B, C, D, E, and F in the figure represent different RDDs, and a box inside an RDD represents a data block. Data is read from HDFS into Spark to form RDD A and RDD C; a map operation converts RDD C to RDD D; RDD B and RDD E are converted to RDD F by a join, and a Shuffle is performed in the process from B to F. Finally, RDD F is written back to HDFS through the output function saveAsSequenceFile.
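The figure itself is not reproduced here, but the lineage the paragraph describes can be sketched in Scala. The HDFS paths and key extraction are placeholders, and since the paragraph does not say how RDD B and RDD E arise or how RDD D is consumed, those parts are assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageSketch"))

    // RDD C is read from HDFS; the map from C to D is a narrow
    // dependency, so C and D are pipelined within a single Stage.
    // (The paragraph does not say what consumes D further.)
    val c = sc.textFile("hdfs:///input/c")
    val d = c.map(line => (line.split(",")(0), line))

    // RDD B and RDD E as key-value RDDs; their origin is not specified
    // in the paragraph, so they are simply read from HDFS here.
    val b = sc.textFile("hdfs:///input/b").map(l => (l.split(",")(0), l))
    val e = sc.textFile("hdfs:///input/e").map(l => (l.split(",")(0), l))

    // join is a wide dependency: it forces the Shuffle from B (and E)
    // to F, which is where the DAGScheduler cuts a Stage boundary.
    val f = b.join(e).mapValues { case (x, y) => x + "," + y }

    // The Action triggers DAG construction and scheduling, and writes
    // RDD F back to HDFS as a SequenceFile.
    f.saveAsSequenceFile("hdfs:///output/f")

    sc.stop()
  }
}
```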

At this point, I believe you have a deeper understanding of "the architecture and running logic of Spark". You might as well put it into practice. For more related content, you can enter the relevant channels to inquire; follow us and keep learning!
