This article walks through how a Spark job runs. The content is fairly detailed; interested readers can use it as a reference, and I hope it helps you.
What is the operating principle of a Spark job?
The YARN cluster manager launches a certain number of Executor processes on the worker nodes according to the resource parameters we set when submitting the Spark job; each Executor process occupies a fixed amount of memory and a fixed number of CPU cores.
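As a concrete illustration, here is a minimal sketch of how these resource parameters are typically set; the numbers are placeholders, not recommendations, and the same values can equivalently be passed to spark-submit as --num-executors, --executor-memory and --executor-cores:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative resource settings for a YARN-managed Spark job.
val conf = new SparkConf()
  .setAppName("MyJob")
  .set("spark.executor.instances", "10")  // how many Executor processes YARN launches
  .set("spark.executor.memory", "4g")     // memory per Executor
  .set("spark.executor.cores", "2")       // CPU cores per Executor
val sc = new SparkContext(conf)
```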
After requesting the resources needed to execute the job, the Driver process will start scheduling and executing the job code we wrote.
The Driver process splits the Spark job code we wrote into multiple stages. Each stage executes one piece of that code; the Driver creates a batch of tasks for each stage and then distributes those tasks to the Executor processes for execution.
A task is the smallest unit of computation. All tasks in a stage execute exactly the same computational logic (i.e., the same snippet of the code we wrote); they differ only in the slice of data each one processes.
After all tasks of a stage have finished, the intermediate results of the computation are written to local disk files on each node, and the Driver then schedules the next stage to run.
The input of the next stage's tasks is the intermediate output of the previous stage. This repeats until all of our code logic has been executed and all of the data has been processed, at which point we get the result we want.
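To make the stage/task decomposition concrete, here is a minimal sketch; the HDFS path and partition count are hypothetical. Each stage launches one task per partition of the RDD it computes, so the partition count of the input RDD tells us roughly how many tasks the first stage will run:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("StageTaskDemo"))

// Hypothetical input path; the second argument is a minimum-partition hint.
val lines = sc.textFile("hdfs:///tmp/input.txt", 8)

// One task per partition: this is roughly the task count of the first stage.
println(lines.getNumPartitions)
```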
Spark divides stages according to shuffle operators. If our code calls a shuffle operator (such as reduceByKey or join), a stage boundary is drawn at that operator.
Roughly speaking, the code before the shuffle operator belongs to one stage, and the code from the shuffle operator onward belongs to the next stage.
Therefore, when a stage starts executing, each of its tasks may have to pull, over the network, the keys it is responsible for from the nodes where the previous stage's tasks ran, and then aggregate all values with the same key using the operator function we supplied (for example, the function passed to reduceByKey()). This process is the shuffle.
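As an illustration, here is a minimal word-count sketch (the input and output paths are hypothetical). The reduceByKey call is a shuffle operator, so the Driver cuts this job into two stages at that point:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

// Stage 0: read, split, and map each word to (word, 1).
// Its tasks write their partial output to local disk files on each node.
val pairs = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))

// Stage boundary: reduceByKey is a shuffle operator.
// Stage 1: each task pulls the keys it is responsible for over the network
// and merges all values of the same key with the function below (_ + _).
val counts = pairs.reduceByKey(_ + _)

// toDebugString prints the lineage; the ShuffledRDD marks the stage boundary.
println(counts.toDebugString)
counts.saveAsTextFile("hdfs:///tmp/output")
```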
When we call persistence operations such as cache() or persist() in our code, the data computed by each task is also saved into the Executor process's memory, or to disk files on the node where it runs, depending on the persistence level we choose.
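A minimal sketch of the two options, using a small in-memory RDD for illustration; the StorageLevel values are part of the standard API:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("PersistDemo"))
val words = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1))

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY):
// computed partitions are kept in Executor memory only.
val counts = words.reduceByKey(_ + _).cache()

// A storage level can only be assigned once per RDD; to spill to local disk
// when memory is insufficient, pick the level explicitly on another RDD.
val persisted = words.persist(StorageLevel.MEMORY_AND_DISK)
```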
Therefore, the memory of the Executor is mainly divided into three blocks:
The first block is used when tasks execute the code we wrote ourselves; by default it is 20% of the Executor's total memory.
The second block is used when tasks pull the output of the previous stage's tasks during the shuffle and perform aggregation on it; by default it also accounts for 20% of the Executor's total memory.
The third block is used for RDD persistence; by default it accounts for 60% of the Executor's total memory.
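Note that this three-way split describes Spark's legacy static memory manager; newer versions use unified memory management instead. As a minimal sketch, the two fractions that define the split above are exposed through the (now deprecated) configuration keys below, shown with their default values:

```scala
import org.apache.spark.SparkConf

// Legacy (static) memory management fractions, matching the 20% / 20% / 60% split above.
// These defaults only apply when static memory management is in use.
val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.2")  // shuffle pull and aggregation buffers
  .set("spark.storage.memoryFraction", "0.6")  // RDD cache / persistence
// The remaining ~20% is left for the task's own user code.
```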
The execution speed of tasks is directly related to the number of CPU cores each Executor process has. A CPU core can run only one thread at a time, and each Executor process runs its assigned tasks concurrently, one thread per task.
If the number of CPU cores is sufficient and the number of tasks allocated is reasonable, then in general, these task threads can be executed quickly and efficiently.
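As a rough illustration of how these knobs relate (the specific numbers are placeholders, not a tuning recommendation): the total number of task slots running at once is roughly num-executors × executor-cores, and spark.default.parallelism controls how many tasks shuffle operators create by default:

```scala
import org.apache.spark.SparkConf

// Illustrative numbers only: 10 Executors x 2 cores = ~20 tasks running at once.
// Setting default parallelism to a small multiple of that keeps all cores busy
// without making individual tasks too small.
val conf = new SparkConf()
  .set("spark.executor.instances", "10")
  .set("spark.executor.cores", "2")
  .set("spark.default.parallelism", "60")  // hypothetical: ~3x the total core count
```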
That is all I have to share about how a Spark job runs. I hope the content above is of some help and that you learned something from it. If you found the article useful, feel free to share it so more people can see it.