This article shares the principles of SparkContext in Spark 2.x. The editor finds the topic very practical, so it is shared here for everyone to study; I hope you get something out of it after reading.
SparkContext initialization mainly does three things:
1). TaskScheduler initialization, registration of the Application with the Spark Master node, and reverse registration of Executors with the driver (the core part)
2). Creation and initialization of DAGScheduler
3). Creation and initialization of the SparkUI web interface
Next, I will explain the principle and loading process of SparkContext in detail with the source code. Since we use Spark 2.x in our production environment, the walkthrough is based on the Spark 2.2.0 source. The analysis follows a SparkContext principle diagram found on the Internet.
I. TaskScheduler:
In the source file SparkContext.scala, the function createTaskScheduler() is called first to create the TaskScheduler.
Inside createTaskScheduler, the code path matching your submission mode is executed; different submission modes create different TaskScheduler implementations. Here we explain the standalone mode:
The function createTaskScheduler first creates a TaskSchedulerImpl (which is the concrete TaskScheduler), then creates a StandaloneSchedulerBackend (named SparkDeploySchedulerBackend in Spark 1.x). The backend sits underneath TaskSchedulerImpl and is actually responsible for registering with the Master, deregistering Executors, sending Tasks to Executors, and so on. Finally the initialize() method of TaskSchedulerImpl is called, as shown in the following code:
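The snippet below is a simplified paraphrase of the standalone branch of createTaskScheduler in the Spark 2.2.0 source (abbreviated, not a verbatim copy):

master match {
  case SPARK_REGEX(sparkUrl) =>
    // TaskSchedulerImpl is the concrete TaskScheduler implementation
    val scheduler = new TaskSchedulerImpl(sc)
    val masterUrls = sparkUrl.split(",").map("spark://" + _)
    // the backend communicates with the Master: it registers the application,
    // requests executors, and later launches tasks on them
    val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
    // wire the backend into the scheduler and build the scheduling pools
    scheduler.initialize(backend)
    (backend, scheduler)
  // ... other master URL patterns (local, yarn, mesos) omitted
}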
The last line calls buildPools() to create the scheduling pool according to the configured scheduling policy (FIFO or FAIR).
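A simplified sketch of TaskSchedulerImpl.initialize, paraphrased from the Spark 2.2.0 source (the exact code may differ slightly):

def initialize(backend: SchedulerBackend) {
  this.backend = backend
  schedulableBuilder = {
    schedulingMode match {
      case SchedulingMode.FIFO =>
        new FIFOSchedulableBuilder(rootPool)
      case SchedulingMode.FAIR =>
        new FairSchedulableBuilder(rootPool, conf)
      case _ =>
        throw new IllegalArgumentException(s"Unsupported scheduling mode: $schedulingMode")
    }
  }
  // build the scheduling pool according to the chosen scheduling policy
  schedulableBuilder.buildPools()
}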
After the TaskScheduler and the DAGScheduler have been created, the start() function of the TaskScheduler is called; internally it calls the start() function of the SchedulerBackend.
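In SparkContext this wiring looks roughly like the following (simplified from the 2.2.0 source; the heartbeat-receiver handshake is omitted):

// create the scheduler backend and the task scheduler for the given master URL
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
// DAGScheduler picks up the TaskScheduler reference from the SparkContext
_dagScheduler = new DAGScheduler(this)
// start the TaskScheduler only after the DAGScheduler exists;
// internally this calls backend.start()
_taskScheduler.start()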
In the start() function, parameters taken from the spark-submit command line are initialized first, such as driverUrl, extraJavaOpts, classPathEntries, libraryPathEntries, and so on. These parameters are used to create an ApplicationDescription instance. This ApplicationDescription is very important: it represents all the information about the application submitted by the current user, including the maximum number of CPU cores the application needs and the amount of memory required by each executor. Finally, an AppClient instance is created; since this is standalone mode, it is a StandaloneAppClient, which handles the application's communication with the Spark cluster. It receives the URL of the Spark Master, the ApplicationDescription, and a listener for cluster events together with the callback functions invoked when those events occur, as shown below:
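A condensed sketch of StandaloneSchedulerBackend.start(); the argument lists are simplified and not a verbatim copy of the 2.2.0 source:

override def start() {
  super.start()
  // the driver URL, extra java options, classpath and library path entries are
  // assembled from the spark-submit configuration into the executor launch command
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries, libraryPathEntries, javaOpts)
  // everything the Master needs to know about this application:
  // app name, max cores, memory per executor, the executor launch command, ...
  val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory,
    command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor)
  // StandaloneAppClient registers the application with the Master and forwards
  // cluster events back to this backend through listener callbacks
  client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  waitForRegistration()
}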
At this point the TaskScheduler startup is complete, and waitForRegistration() is called to wait until the registration with the Master has finished.
II. The creation of DAGScheduler
The DAGScheduler class implements the high-level, stage-oriented scheduling layer; it lives in org.apache.spark.scheduler.DAGScheduler in the Spark source.
The DAGScheduler is mainly responsible for the following:
1). For each job it computes a DAG (directed acyclic graph) of stages; a job is triggered by an action, and stages are split at shuffle (wide-dependency) boundaries.
2). It tracks which RDD and stage outputs have been materialized, i.e. written to storage such as disk or memory.
3). It looks for a minimum-cost (optimal) schedule to run the job.
4). It wraps each stage into a TaskSet and submits it to TaskSchedulerImpl, which runs that batch of tasks on the cluster. Note that every task in a batch runs the same code and only processes a different partition of the data; this is where distributed computing is reflected (see the sketch after this list).
5). It determines the preferred locations for each task based on the current cache status, and submits them to TaskSchedulerImpl along with the TaskSet.
6). If a stage fails because shuffle output files are lost, the stage is resubmitted. Failures not caused by lost shuffle files, such as OOM, are handled by TaskSchedulerImpl, which retries each task several times; if a task still fails in the end, the stage run is cancelled and the whole application fails.
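For point 4), the hand-off from DAGScheduler to TaskSchedulerImpl is essentially a single call inside submitMissingTasks; the snippet below is a simplified sketch based on the 2.2.0 source, not a verbatim copy:

// wrap the tasks of one stage into a TaskSet and hand it to the TaskScheduler
taskScheduler.submitTasks(new TaskSet(
  tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))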
III. The creation of SparkUI
This is the last step of SparkContext initialization. The function createLiveUI in SparkUI is called to create the web interface, which is bound to port 4040 by default and displays the running status of the Application. A Jetty server is started here to serve the web pages:
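In SparkContext the UI creation looks roughly like this (simplified from the 2.2.0 source):

_ui =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
      _env.securityManager, appName, startTime = startTime))
  } else {
    None
  }
// bind() starts the Jetty server, on port 4040 by default (spark.ui.port)
_ui.foreach(_.bind())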
A SparkListenerBus is also registered here; all Spark SparkListenerEvent messages are posted to it asynchronously. The main functions of this class are as follows:
1). It keeps a message queue, which is responsible for caching events.
2). It keeps the registered listeners and is responsible for dispatching messages to them.
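The following is a conceptual sketch of those two responsibilities, not Spark's actual LiveListenerBus implementation: events are cached in a queue, and a single dispatcher thread fans them out asynchronously to every registered listener.

class TinyListenerBus[E] {
  private val queue = new java.util.concurrent.LinkedBlockingQueue[E]()
  private val listeners = new java.util.concurrent.CopyOnWriteArrayList[E => Unit]()

  // keep the registered listeners
  def addListener(l: E => Unit): Unit = { listeners.add(l); () }

  // producers only enqueue; they never block on slow listeners
  def post(event: E): Unit = queue.put(event)

  // a single daemon thread drains the queue and dispatches each event
  private val dispatcher = new Thread(() => {
    while (true) {
      val event = queue.take()
      listeners.forEach(l => l(event))
    }
  })
  dispatcher.setDaemon(true)
  dispatcher.start()
}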
As a supplement, here are the three scheduling strategies commonly used in YARN:
1). FIFO Scheduler:
Applications are placed in a queue in the order they are submitted, forming a first-in, first-out queue. When resources are allocated, they go to the application at the head of the queue first; only after its needs are met does the next application receive resources, and so on.
2). Fair Scheduler:
With the Fair scheduler, there is no need to reserve a fixed amount of system resources in advance; the Fair scheduler dynamically adjusts resources among all running jobs. For example, when the first large job is submitted, it is the only job running and acquires all of the cluster resources; when a second small job is submitted, the Fair scheduler gives half of the resources to the small job, so the two jobs share the cluster fairly.
3). Capacity Scheduler:
The Capacity scheduler has a dedicated queue for running small tasks, but setting aside a dedicated queue reserves part of the cluster resources in advance, which causes large tasks to finish later than they would under the FIFO scheduler.
These are the principles of SparkContext in Spark 2.x. The editor believes some of these points may come up in daily work; I hope you can learn more from this article.