This article introduces the Spark kernel architecture. In general, a Spark application consists of two parts: the Driver, which contains the SparkConf and SparkContext, and the Executors, which run the actual business logic.
There are two ways to submit an application:
1. Client mode: the Driver process runs on the client machine that submits the job and monitors the application from there.
2. Cluster mode: the Master assigns a Worker node to launch the Driver, which is then responsible for monitoring the entire application.
The Driver generally runs on a machine dedicated to submitting Spark programs. That machine must be in the same network environment as the Spark cluster (because the Driver communicates frequently with the Executors, or more precisely with the CoarseGrainedExecutorBackend processes), and its configuration is the same as that of an ordinary Worker node. The program is run through spark-submit, which accepts various parameters such as memory and cores; in a real production environment this is usually wrapped in a shell script that configures and submits the program automatically. The submitting machine must have Spark installed, but it does not itself belong to the cluster.
The core of the Driver is the SparkContext, which in turn depends on a SparkConf. During initialization, the SparkContext creates the DAGScheduler, the TaskScheduler, and the SchedulerBackend.
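As a minimal sketch of this Driver-side entry point (the master URL, resource values, and application name below are placeholders chosen only for illustration; in production they are normally passed through spark-submit rather than hard-coded):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // Driver-side configuration; these settings correspond to the
    // --master, --executor-memory and --executor-cores options of spark-submit.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://master-host:7077")      // hypothetical standalone Master URL
      .set("spark.executor.memory", "2g")
      .set("spark.executor.cores", "2")

    // Creating the SparkContext registers the application with the Master and
    // internally builds the DAGScheduler, TaskScheduler and SchedulerBackend.
    val sc = new SparkContext(conf)

    // ... define RDD transformations and actions here ...

    sc.stop()
  }
}
```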
During this instantiation, the application registers with the Master. The Master accepts the registration and, if there is no problem, allocates an AppId and computing resources to the current application: it sends instructions to the Workers to allocate resources for the program. By default, each Worker node starts one Executor for the current program, and that Executor runs tasks concurrently through a thread pool. When a Worker receives the LaunchExecutor instruction from the Master, it creates an ExecutorRunner instance and calls its start method, which launches a CoarseGrainedExecutorBackend process. The Executor lives inside this process; CoarseGrainedExecutorBackend and Executor correspond one to one. Internally the Executor maintains a thread pool: each task is wrapped in a TaskRunner, a thread is taken from the pool to execute it, and after execution the thread is returned to the pool for reuse.
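The thread-pool model described above can be illustrated with a highly simplified sketch. This is not Spark's actual Executor code, only an illustration of the "wrap the task in a Runnable, hand it to a reusable thread pool" pattern:

```scala
import java.util.concurrent.{Executors, ExecutorService}

// Simplified illustration of the Executor's task execution model;
// Spark's real Executor and TaskRunner classes are far more involved.
class SimpleExecutor {
  // A cached thread pool: threads are reused once a task finishes.
  private val threadPool: ExecutorService = Executors.newCachedThreadPool()

  // A "TaskRunner" here is just a Runnable wrapping the task's work.
  private class TaskRunner(taskId: Long, work: () => Unit) extends Runnable {
    override def run(): Unit = {
      println(s"running task $taskId")
      work()
    }
  }

  // Launching a task means submitting its TaskRunner to the pool.
  def launchTask(taskId: Long, work: () => Unit): Unit =
    threadPool.execute(new TaskRunner(taskId, work))

  def stop(): Unit = threadPool.shutdown()
}
```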
In Spark, transformations are lazy; a job is triggered only when an action operator is called. The SparkContext, through the DAGScheduler, divides the DAG formed by the job's RDDs into stages, and each stage consists of a set of Tasks that run the same business logic over different partitions of the data, packaged together as a TaskSet.
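For example (a minimal sketch reusing the sc from above; the input path is hypothetical):

```scala
// Transformations are lazy: nothing runs on the cluster yet.
val lines   = sc.textFile("hdfs:///path/to/input")   // hypothetical input path
val lengths = lines.map(_.length)

// The action triggers a job: the DAGScheduler builds stages from the RDD DAG
// and submits the resulting TaskSets to the TaskScheduler.
val total = lengths.reduce(_ + _)
println(total)
```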
The TaskScheduler and SchedulerBackend are responsible for launching the specific tasks on Executors (following data locality).
A job may contain multiple stages. The Tasks in the last Stage are called ResultTasks and produce the job's result; the Tasks in all preceding Stages are called ShuffleMapTasks, whose output serves as the data input to the next Stage, similar to the Mapper in MapReduce.
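A word-count style job illustrates this: the reduceByKey below introduces a shuffle, so the map side runs as ShuffleMapTasks in one stage and the final counting runs as ResultTasks in the next (again a sketch assuming the sc and a hypothetical input path):

```scala
// Stage 1: these pipelined operations run as ShuffleMapTasks,
// writing shuffle output partitioned by key.
val words  = sc.textFile("hdfs:///path/to/input")    // hypothetical input path
               .flatMap(_.split(" "))
               .map(word => (word, 1))

// reduceByKey introduces a shuffle dependency, i.e. a stage boundary.
val counts = words.reduceByKey(_ + _)

// Stage 2: reading the shuffle output and producing the final result
// runs as ResultTasks, because collect() is the action.
counts.collect().foreach(println)
```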
To summarize the whole flow: the DAGScheduler divides each job into Stages and submits the resulting TaskSets to the TaskScheduler, which dispatches the tasks to Executors for execution (in accordance with data locality). Each Task computes one Partition of an RDD, applying the pipelined functions of its Stage to that Partition, and this repeats until the whole program has run.
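The number of tasks in a stage follows the number of partitions; for instance (a sketch with an explicitly chosen partition count):

```scala
// Create an RDD with 4 partitions: each stage over this RDD
// is executed as 4 tasks, one per partition.
val data = sc.parallelize(1 to 1000, numSlices = 4)
println(data.getNumPartitions)            // 4

// map and filter are pipelined within one stage and applied
// partition by partition by each task.
val result = data.map(_ * 2).filter(_ % 3 == 0).count()
println(result)
```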