Today I would like to talk with you about the principle behind Spark computing. Many people may not know much about it, so to help you understand it better, I have summarized the following content; I hope you can take something away from this article.
Hadoop's MapReduce and YARN were the first-generation products of the big data era. They met the need for offline computing, but fell short when it came to real-time computing. To fill this gap, Spark was later developed, and it greatly improved computing efficiency.
The computing principle of Spark
The structure of Spark is as follows:
Node introduction:
Cluster Manager: in standalone mode this is the Master node, which controls the entire cluster and monitors the Workers; in YARN mode it is the resource manager responsible for allocating resources, much like the ResourceManager role in YARN. In the construction analogy, it is the butler who holds all the working resources and belongs to Party B (the contractor side).
WorkerNode: a node that can do the work, dispatched by the butler, the ClusterManager; it is the one that actually holds the resources and does the work. As the slave node, it controls the compute node and starts the Executor or the Driver.
Executor: a process started on a WorkerNode, the equivalent of a contractor, responsible for preparing the Task environment and executing Tasks.
Task: each concrete job in the construction project; when it runs, it is what actually uses the memory and disk.
Driver: runs the main() function of the Application, controls the generation of Tasks and sends them to the Executors; it is Party A's commander-in-chief.
SparkContext: the one that deals with the ClusterManager and pays to apply for resources; it is Party A's interface. A minimal code sketch of these roles follows.
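To make the roles above a little more concrete, here is a minimal Scala sketch of the Driver side creating a SparkContext. The master URL, host name and resource settings are made-up illustration values, not anything prescribed by this article.

import org.apache.spark.{SparkConf, SparkContext}

object ArchitectureSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("architecture-sketch")        // the Application whose main() the Driver runs
      .setMaster("spark://master-host:7077")    // standalone Master acting as the Cluster Manager (hypothetical host)
      .set("spark.executor.memory", "2g")       // part of the "quotation": memory per Executor
      .set("spark.executor.cores", "2")         // and CPU cores per Executor
    val sc = new SparkContext(conf)             // SparkContext: Party A's interface to the Cluster Manager
    // ... build RDDs and submit jobs here ...
    sc.stop()
  }
}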
The whole interaction process goes like this:
Party A gets a project and creates a SparkContext. The SparkContext goes to the ClusterManager to apply for resources and gives a quotation: how much CPU and memory are needed. The ClusterManager goes to the WorkerNodes, starts the Executors, and introduces the Executors to the Driver.
The Driver splits the work into batches of Tasks according to the construction drawings and sends the Tasks to the Executors for execution.
After receiving a Task, the Executor prepares the Task's runtime dependencies, executes it, and returns the execution result to the Driver.
The Driver keeps directing the next step according to the returned Task status until all Tasks have finished executing. A rough sketch of this flow follows.
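Reusing the sc from the sketch above, the fragment below is a hedged illustration of this interaction: the Driver defines a job over 8 partitions, so each stage becomes 8 Tasks that run inside the Executors, and the action pulls the partial results back to the Driver. The data and the computation are arbitrary.

val data = sc.parallelize(1 to 1000000, numSlices = 8) // 8 partitions -> 8 Tasks per stage
val squared = data.map(x => x.toLong * x)              // transformation, executed inside the Executors
val total = squared.reduce(_ + _)                      // action: Tasks run, partial results return to the Driver
println(s"sum of squares = $total")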
The operation process and characteristics are as follows:
The role of SparkContext: one part is to distribute Tasks, apply for resources, and so on; a more important function is to split RDDs into Tasks, that is, to draw the DAG graph.
Using the figure above, let's walk through how Spark runs:
Build the running environment for the Spark Application and start SparkContext.
SparkContext applies to the resource manager (which can be Standalone, Mesos, or YARN) for resources to run Executors and starts StandaloneExecutorBackend.
The Executors apply to SparkContext for Tasks.
SparkContext distributes the application code to the Executors.
SparkContext builds the DAG graph, decomposes the DAG into Stages, and sends TaskSets to the Task Scheduler; finally, the Task Scheduler sends Tasks to the Executors to run (see the sketch after this list).
Tasks run on the Executors, and all resources are released when the run finishes.
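As a small sketch of that DAG-building step (with a hypothetical HDFS path): transformations only build up the lineage, the action triggers the actual job, and RDD.toDebugString prints the lineage that the DAGScheduler will cut into stages at shuffle boundaries.

val words = sc.textFile("hdfs:///tmp/input.txt")  // hypothetical input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
val counts = words.reduceByKey(_ + _)             // wide dependency: introduces a shuffle, hence a new stage
println(counts.toDebugString)                     // prints the RDD lineage, with shuffle boundaries visible
counts.count()                                    // action: SparkContext submits the job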
An RDD computation case
Let's use a case to analyze the computation process of an RDD:
Build an RDD graph on the client side through the RDD API, as shown in the first part of the figure: rdd1.join(rdd2).groupBy(…).filter(…). A rough sketch of this pipeline follows after this list.
The DAGScheduler inside SparkContext builds the RDD graph from the previous step into a DAG graph, as shown in the second part of the figure.
The TaskScheduler splits the DAG graph into multiple Tasks.
The ClusterManager assigns the Tasks, through the YARN scheduler, to the Executors on each node, which compute them with the allocated resources.
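The pipeline in the first step can be reconstructed roughly as below; the sample data, the grouping key, and the filter predicate are invented purely for illustration.

val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val rdd2 = sc.parallelize(Seq((1, 10), (2, 20), (3, 30)))
val result = rdd1.join(rdd2)                          // wide dependency: both sides are shuffled by key
  .groupBy { case (k, _) => k % 2 }                   // another wide dependency: regroup by a derived key
  .filter { case (groupKey, _) => groupKey == 0 }     // narrow dependency: stays in the same stage
result.collect().foreach(println)                     // action: triggers DAG construction and stage division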
The DAGScheduler follows certain rules when dividing the RDD graph into stages:
Stage division is triggered by an action and proceeds from back to front, so in this figure it starts from RDD_G.
RDD_G depends on both RDD_B and RDD_F; which dependency is examined first is decided arbitrarily and has no effect on the result.
RDD_B and RDD_G have a narrow dependency, so they belong to the same stage. RDD_B and its parent RDD_A have a wide dependency, so they cannot be put in the same stage, and RDD_A by itself forms stage1.
RDD_F and RDD_G have a wide dependency, so they cannot be put in the same stage either; this bounds the last stage, and RDD_B together with RDD_G makes up stage3.
RDD_F has narrow dependencies on both of its parents, RDD_D and RDD_E, and RDD_D has a narrow dependency on its parent RDD_C, so they all belong to the same stage2.
During execution, stage1 and stage2 do not depend on each other, so they can run in parallel, and within each stage the Task for each partition also runs in parallel.
stage3 depends on the results produced by the partitions of stage1 and stage2, so it can start only after the first two stages have finished.
As mentioned earlier, Spark has two kinds of Tasks: ShuffleMapTask and ResultTask. ResultTasks are pushed to the Executors only in the final stage of the DAG, while every other stage pushes ShuffleMapTasks. In this case, stage1 and stage2 generate ShuffleMapTasks, and stage3 generates ResultTasks.
Although stage division is computed from back to front, the actual creation of the stages, once the dependency logic has been judged, happens from front to back. In other words, if a stage's ID is taken as its identity, a stage that must execute earlier has a smaller ID than one that executes later. In this case, the IDs of stage1 and stage2 are both smaller than that of stage3, while the ordering between stage1 and stage2 themselves is arbitrary, determined by which dependency was examined first in rule 2 above. One way to observe these boundaries in code is sketched below.
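If you want to see where these boundaries fall in your own code, one approach (not taken from the article) is to inspect an RDD's dependencies: a ShuffleDependency marks a wide dependency, i.e. a stage boundary, while other dependency types (such as OneToOneDependency) are narrow.

import org.apache.spark.ShuffleDependency

val grouped = sc.parallelize(1 to 100).map(x => (x % 10, x)).groupByKey()
grouped.dependencies.foreach {
  case _: ShuffleDependency[_, _, _] => println("wide dependency: a stage boundary sits above this RDD")
  case dep                           => println(s"narrow dependency: ${dep.getClass.getSimpleName}")
}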
The Executor is the laborer that finally runs the Tasks. It feeds the execution result of each Task back to the Driver, adopting different strategies according to the result's size (a configuration sketch follows this list):
If the result is larger than maxResultSize (1 GB by default), it is discarded directly.
If it is "large", that is, larger than the configured frameSize (10 MB by default), it is saved into the BlockManager with the taskId as the key.
Otherwise, the whole result is sent straight back to the Driver.
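Both thresholds are configurable. The snippet below is a hedged sketch: spark.driver.maxResultSize is the Driver-side cap mentioned first, and in newer Spark versions the frame size corresponds to spark.rpc.message.maxSize (it replaced the old spark.akka.frameSize, which defaulted to roughly 10 MB); treat the exact values as placeholders.

import org.apache.spark.SparkConf

val tunedConf = new SparkConf()
  .setAppName("result-size-sketch")
  .set("spark.driver.maxResultSize", "2g")  // cap on the total result size returned to the Driver (default 1g)
  .set("spark.rpc.message.maxSize", "128")  // RPC message size limit in MB, the successor of the Akka frameSize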
After reading the above, do you have a better understanding of the principle of Spark computing? If you would like to learn more, please follow our industry information channel. Thank you for your support.