
Talking about the Internal Operation Mechanism of Spark


What are the most important mechanisms in Spark?

1. RDD; 2. the Spark scheduling mechanism; 3. the Shuffle process.

What is RDD?

It can be said that if you understand RDD, you have already mastered half of Hadoop and Spark. So what exactly is an RDD?

An RDD (Resilient Distributed Dataset) is, first of all, a dataset: it is a wrapper around the original data that partitions the data logically. "Distributed" refers to parallel computation, which also requires fault tolerance: when a partition is lost, the dependency (lineage) information is followed back to the earliest ancestor RDD so the partition can be recomputed. An RDD id plus a partition id uniquely determines the block id of that partition, so the partition's data can be retrieved from the storage medium. "Resilient" means the RDD can adjust the partition structure of its parallel computing units (for example within a Stage) without changing the data records stored inside.
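As a minimal sketch (assuming an existing SparkContext named sc and a readable README.md file), the partition structure and the lineage an RDD keeps for fault tolerance can be inspected like this:

val lines = sc.textFile("README.md", 4)    // wrap the raw data in an RDD with 4 logical partitions
val lengths = lines.map(_.length)          // a new RDD that depends on the previous one
println(lengths.getNumPartitions)          // the partition structure used as the unit of parallelism
println(lengths.toDebugString)             // the lineage (dependencies) used to recompute lost partitions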

Basic concepts

(1) Application: a user-built Spark application that contains the Driver code (the driver function) and the Executor code that runs on multiple Worker nodes in the cluster.

(2) Driver: the program that contains the main entry function and instantiates the SparkContext object inside main is called the driver program. Enough talk; here is the code:

val logFile = "YOUR_SPARK_HOME/README.md"                      // local file path

val conf = new SparkConf().setAppName("Simple Application")    // name the Application

val sc = new SparkContext(conf)

(3) Master (ClusterManager): manages the whole cluster. At present, Spark mainly supports three types: Standalone mode, Mesos mode and YARN mode.

(4) Worker node: a cluster node running the Worker daemon.

(5) Task executor (Executor): there may be multiple Executors on a Worker node, and each Executor has a fixed number of cores and a fixed amount of memory (heap size).

(6) Job: a parallel computation consisting of multiple Tasks (one per partition, computed in parallel), typically triggered by a Spark action. A Job is submitted to the Spark cluster through the runJob method in Spark.

(7) Stage: because of the dependencies between RDDs, each Job is split into multiple sets of Tasks; each such set is called a Stage, also known as a TaskSet (task set).

Additional notes:

Each Application may contain multiple Jobs, which are independent of each other.

Each Worker can host one or more Executors.

Each Executor has several cores, and each core can execute only one Task at a time (see the configuration sketch below).

Each Task produces one partition of the target RDD.
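As an illustrative sketch (the values below are assumptions, not recommendations; spark.executor.instances applies to YARN and similar managers), the number of Executors, the cores per Executor and the memory per Executor can be set through SparkConf before the SparkContext is created:

val conf = new SparkConf()
  .setAppName("Simple Application")
  .set("spark.executor.instances", "2")   // number of Executors to request (assumed value)
  .set("spark.executor.cores", "4")       // cores per Executor; each core runs one Task at a time
  .set("spark.executor.memory", "2g")     // memory per Executor (assumed value)
val sc = new SparkContext(conf)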

How should we understand dependencies and parallel computation?

4.1 A partition is the basic unit of parallel computation: if the raw data is divided into 10 partitions, can those 10 partitions always be computed in parallel at the same time? Not necessarily. If all the dependencies are narrow there is no problem, but once wide dependencies are involved, data has to cross between partitions, and the job can no longer simply finish all 10 partitions simultaneously.

4.2 The computation over each partition's data is treated as one parallel task; each task contains a chain of computations, and a CPU core executes that chain. Rather than staying abstract, the computing chain is easiest to see in code:

rdd.map(line => line.length).filter(_ > 0)   // ...and so on (the filter predicate here is only illustrative)

If these computing chains are independent of each other, they can be computed in parallel. The relationships between such chains are what we call narrow dependencies (one-to-one dependencies and range dependencies).
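For example (a sketch, assuming an existing SparkContext named sc), map creates a narrow one-to-one dependency that stays inside a partition, while reduceByKey creates a wide (Shuffle) dependency because records with the same key may sit in different partitions:

val words = sc.parallelize(Seq("a", "b", "a", "c"), 2)   // 2 partitions of raw data
val pairs = words.map(word => (word, 1))                 // narrow dependency: each output partition depends on one input partition
val counts = pairs.reduceByKey(_ + _)                    // wide dependency: data must be shuffled across partitions
counts.collect()                                         // the action that actually triggers the Job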

Why are RDDs divided into Stages, and how is the division done?

If a partition of a child RDD depends on the data of multiple partitions of a parent RDD, the dependency is called a wide dependency, or Shuffle dependency. When there are several child RDDs, each depending on partition data from multiple parent RDDs, it pays to materialize the parent's output and serve it to those child partitions; otherwise every child partition would have to recompute the parent RDD's data. This is exactly why Stages are divided here: whenever a wide dependency is encountered, a new Stage boundary is drawn.
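A rough sketch of what this looks like in practice (again assuming an existing sc): printing the lineage of an RDD that contains a wide dependency shows the Shuffle boundary where the Stage split happens, and persist() can be used to keep a parent RDD around when several child RDDs depend on it:

val pairs = sc.parallelize(1 to 100, 4).map(n => (n % 10, n))
pairs.persist()                           // keep the parent RDD so both children below can reuse it
val sums = pairs.reduceByKey(_ + _)       // wide dependency -> new Stage
val groups = pairs.groupByKey()           // another child that reuses the persisted parent
println(sums.toDebugString)               // the indentation marks the Stage (Shuffle) boundary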

How does Spark manage resources?

There are three types of Spark cluster managers: Standalone mode, Mesos mode and YARN mode. This is worth knowing, but it is not essential, so not understanding it deeply costs little.
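For illustration only (the host names and ports below are placeholders), the cluster manager is chosen through the master URL, either in the SparkConf or when the application is submitted:

val conf = new SparkConf().setAppName("Simple Application")
conf.setMaster("local[*]")                      // run locally, using all available cores
// conf.setMaster("spark://master-host:7077")   // Standalone mode
// conf.setMaster("mesos://master-host:5050")   // Mesos mode
// conf.setMaster("yarn")                       // YARN mode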

How does scheduling work inside Spark?

DAGScheduler is the Stage-oriented scheduler. It receives the Jobs submitted by the Spark application, divides them into Stages according to RDD dependencies, and submits the Stages to the TaskScheduler.

TaskScheduler is the Task-oriented scheduler. It accepts the TaskSets submitted by the DAGScheduler and then submits each Task to a Worker node to run; which Task each Executor runs is also decided here.
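As a rough sketch of the hand-off (assuming an existing sc; this is user-facing code, not Spark's internal source): an action such as count() goes through SparkContext.runJob, the runJob call hands the Job to the DAGScheduler, and the DAGScheduler splits it into Stages whose TaskSets are passed to the TaskScheduler:

val rdd = sc.parallelize(1 to 1000, 8)
val total = rdd.count()                                                // an action; internally submitted via sc.runJob
val partitionSums = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)  // runJob can also be called directly, producing one result per partition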

The most important part is the overall run-process diagram (not reproduced here); the following steps walk through it, with a minimal end-to-end sketch after the list:

(1) Every Spark application contains Driver code and Executor code. The application first initializes the SparkContext in the Driver, because the SparkContext is the application's only gateway to the cluster. The SparkContext contains two schedulers, the DAGScheduler and the TaskScheduler, which are created automatically when the SparkContext object is created.

(2) After the SparkContext is initialized, it first applies to the cluster Master for the required resources according to the Spark configuration, and then initializes the corresponding Executors on the Worker nodes. Once the Executors are initialized, the Driver parses the RDD code in the application and builds the RDD graph, which describes the RDDs and the dependencies between them. This corresponds to the first part of the diagram, the RDD objects.

(3) After the RDD graph is built, the Driver submits it to the DAGScheduler for parsing. When an action operator is encountered, the DAGScheduler parses the RDD graph backwards: based on the dependencies between RDDs and whether a Shuffle exists, the graph is resolved into a series of Stages with sequential dependencies. Stages are split at Shuffle boundaries: if two RDDs are connected by a wide (Shuffle) dependency, the DAGScheduler splits them into two Stages, and the later Stage is executed only after the earlier one has finished.

(4) The series of Stages (TaskSets) produced by the DAGScheduler is submitted to the underlying TaskScheduler for execution in Stage order.

(5) After receiving a Stage from the DAGScheduler, the TaskScheduler builds a TaskSetManager instance in the cluster environment to manage the life cycle of that Stage (TaskSet).

(6) The TaskSetManager sends the computing code and data resource files to the corresponding Executors and starts execution in the Executors' thread pools.

(7) During Task execution, some applications may involve data input and output; on each Executor this is managed by the corresponding BlockManager, and the BlockManager's information is exchanged and synchronized with the block tracker in the Driver.

(8) If a task thread hits a runtime error, or some other problem on the Worker affects execution, the TaskSetManager retries the task (3 times by default). If all attempts fail, it reports to the TaskScheduler; if the TaskScheduler cannot resolve it, it reports to the DAGScheduler, and the DAGScheduler resubmits the task to a different Executor based on the state of the Worker nodes.

(9) When a task thread finishes, its result is fed back to the TaskSetManager, which feeds it back to the TaskScheduler, which feeds it back to the DAGScheduler. Depending on whether Stages are still waiting to run, the DAGScheduler submits the next Stage to the TaskScheduler, and the loop continues.

(10) After all Stages have been executed, the application finally reaches its goal, either writing output to a file or displaying it on the screen. The current run of the Driver ends, and it waits for further instructions from the user or shuts down.

(11) When the user explicitly closes the SparkContext, the whole run ends, and the related resources are released or reclaimed.
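Putting steps (1)-(11) together, a minimal driver program (the file path and names are placeholders) goes through exactly this life cycle: build the SparkContext, run an action that becomes a Job, Stages and Tasks, and finally stop the context so resources are released:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApplication {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)                               // step (1): SparkContext creates the DAGScheduler and TaskScheduler
    val lines = sc.textFile("YOUR_SPARK_HOME/README.md")          // step (2): the Driver builds the RDD graph
    val numSparkLines = lines.filter(_.contains("Spark")).count() // steps (3)-(9): Job -> Stages -> Tasks
    println(s"Lines containing 'Spark': $numSparkLines")          // step (10): output the result
    sc.stop()                                                     // step (11): close the SparkContext and release resources
  }
}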

Spark's execution model makes it easy to schedule resources between different Applications, which also means that different Applications cannot communicate with each other or exchange information.

The Driver is responsible for scheduling all tasks, so it should be as close to the Worker nodes as possible, ideally in the same local network.

What is the Shuffle process?

Only after the data of all partitions of the parent RDD in a Shuffle dependency has been computed and stored does the child RDD start pulling the partition data it needs. This whole data-transfer process is called Spark's Shuffle process. Within it, the part where a partition's data is computed and then written to disk is called the Shuffle write; correspondingly, the part where a partition of the child RDD pulls the data it needs from the parent RDD is called the Shuffle read.

Whether in Spark or Hadoop, the shuffle process has many similarities, and some concepts carry over directly. For example, in the shuffle process the side that provides the data is called the map side, and each task generated on the map side is called a mapper; correspondingly, the side that receives the data is called the reduce side, and each task that pulls data on the reduce side is called a reducer. In essence, the Shuffle process divides the data produced by the map side with a partitioner and sends each piece to the corresponding reducer.
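As a small sketch (assuming an existing sc), in a reduceByKey the Tasks that produce the key-value pairs form the map side (the mappers performing the Shuffle write), the Tasks that aggregate values per key form the reduce side (the reducers performing the Shuffle read), and the partitioner decides which reducer each record is routed to:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)), 2)   // map side: computes and writes the shuffle data
val counts = pairs.reduceByKey(new HashPartitioner(2), _ + _)      // reduce side: pulls the data the partitioner routed to it
counts.collect()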
