1. The core functions of Spark Core
(1) SparkContext:
Generally speaking, the execution and output of a Driver Application are done through SparkContext, so SparkContext must be initialized before the application is formally submitted. SparkContext hides the details of network communication, distributed deployment, message passing, storage, computation, caching, the metrics system, the file service, the web service, and so on; application developers only need to use the API provided by SparkContext to complete their functional development.
one application -> one or more jobs -> one or more stages -> multiple tasks
The DAGScheduler built into SparkContext is responsible for creating Jobs, converting one application into one or more Jobs (how many Jobs an application produces depends on how many action operators it executes; the sketch after this list illustrates this).
The TaskScheduler built into SparkContext is responsible for scheduling tasks, sending them to the appropriate nodes, and having the tasks executed there.
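A minimal Scala sketch of these ideas, assuming a local development setup (the application name, data, and partition count are illustrative, not taken from the article): the SparkContext is initialized first, and each of the two action operators triggers its own Job, which the DAGScheduler splits into stages and the TaskScheduler dispatches as tasks.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextSketch {
  def main(args: Array[String]): Unit = {
    // Initialize SparkContext before any work is submitted.
    // local[*] is only for the development/debugging scenario described later.
    val conf = new SparkConf().setAppName("core-demo").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val nums = sc.parallelize(1 to 100, numSlices = 4) // 4 partitions -> 4 tasks per stage

    // Each action operator triggers one Job via the DAGScheduler.
    val total = nums.map(_ * 2).reduce(_ + _)   // Job 1
    val cnt   = nums.filter(_ % 2 == 0).count() // Job 2

    println(s"total=$total, evenCount=$cnt")
    sc.stop()
  }
}
```

Running this with the Spark UI open would show two separate jobs, one per action.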
(2) Storage system:
Spark gives priority to using node memory as storage and only considers spilling to disk when memory is insufficient, which greatly reduces disk I/O, improves execution efficiency, and makes Spark suitable for real-time computing, streaming computing, and similar scenarios. In addition, Spark also provides a memory-centric, highly fault-tolerant distributed file system, Tachyon, for users to choose from.
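As a hedged illustration of the memory-first behavior described above (the HDFS path is a placeholder, and `sc` is assumed to be an existing SparkContext as in the earlier sketch), persisting an RDD with StorageLevel.MEMORY_AND_DISK keeps partitions in memory and spills to disk only when memory runs short:

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/logs") // placeholder path

// Keep partitions in memory when possible; spill to disk only when memory is insufficient.
val cached = logs.persist(StorageLevel.MEMORY_AND_DISK)

println(cached.count())                              // first action materializes and caches the RDD
println(cached.filter(_.contains("ERROR")).count())  // reuses the cached data
```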
(3) Computing engine:
The computing engine consists of the DAGScheduler and RDDs inside SparkContext, together with the Map and Reduce tasks executed by Executors on specific nodes. Although the DAGScheduler and RDDs live inside SparkContext, before a task is formally submitted and executed the RDDs in a Job are organized into a directed acyclic graph (DAG) and divided into Stages; this division determines how many tasks run in each execution phase, how iterative computation proceeds, where data is shuffled, and so on.
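The following sketch (reusing the `sc` from the first example; the path is a placeholder) shows how narrow transformations stay in one Stage while a wide dependency such as reduceByKey introduces a shuffle and therefore a Stage boundary; toDebugString prints the resulting lineage:

```scala
val words = sc.textFile("hdfs:///data/words.txt")   // placeholder path
  .flatMap(_.split("\\s+"))
  .map(w => (w, 1))                // narrow dependencies: stay in the same stage

val counts = words.reduceByKey(_ + _)  // wide dependency: shuffle => new stage

// Prints the lineage, showing the stage boundary introduced by the shuffle.
println(counts.toDebugString)

counts.take(10).foreach(println)   // the action submits the Job
```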
(4) Deployment modes:
A single node cannot provide sufficient storage and computing capacity for big data processing, so Spark provides a Standalone deployment mode as well as support for YARN, Mesos, and other distributed resource management systems in the TaskScheduler component of SparkContext. In addition to the deployment modes usable in production environments, such as Standalone, YARN, Mesos, Kubernetes, and cloud deployments, Spark also provides Local mode and local-cluster mode for development and debugging; a configuration sketch follows the list below.
Standalone, YARN, Mesos, Kubernetes, Cloud: for distributed production scenarios.
Local (and local-cluster) mode is used for local testing and debugging.
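A minimal configuration sketch, assuming the common practice of hard-coding a master only for local debugging and letting spark-submit supply it in production:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// For local development/debugging: run everything in one JVM.
val devConf = new SparkConf()
  .setAppName("deploy-demo")
  .setMaster("local[*]")            // Local mode, as described above

// For production, leave the master unset in code and supply it at submit time,
// e.g. with `--master yarn` (or a Standalone/Mesos/Kubernetes master URL)
// passed to spark-submit; the cluster manager then performs resource allocation.
val prodConf = new SparkConf().setAppName("deploy-demo")

val sc = new SparkContext(devConf)  // pick one of the two configurations
```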
2. Spark cluster architecture:
Cluster Manager: the manager of the Spark cluster, mainly responsible for the allocation and management of resources. The resources allocated by the cluster manager are a first-level allocation: it allocates memory, CPU, and other resources on each Worker to the application, but it does not allocate resources to Executors. At present, Standalone, YARN, Mesos, Kubernetes, EC2, and so on can all serve as Spark cluster managers.
Master: the primary node of the Spark cluster.
Worker: a work node of the Spark cluster. For a Spark application, a Worker assigned by the cluster manager is mainly responsible for the following tasks: creating Executors, further allocating resources and tasks to those Executors, and synchronizing resource information with the Cluster Manager.
Executor: the process that performs computing tasks, mainly responsible for executing tasks and synchronizing information with the Worker and the Driver Application. (It mainly initializes a thread pool and schedules threads to run the corresponding computing tasks.)
Driver Application: the client driver program, which can also be understood as the client application. It converts the task program into RDDs and a DAG and communicates and coordinates with the Cluster Manager; the SparkContext object is created inside the Driver Application.
Deploy mode: the way the application is deployed when the written code is submitted to the cluster to run; --master specifies the resource manager.
A common choice is YARN. When you submit a task to YARN, you usually also configure another parameter, --deploy-mode, which takes either client or cluster.
client means the Driver program runs on the node that submits the application.
cluster means the Driver program runs on some node inside the cluster. Where the Driver program runs determines where the SparkContext object is created.
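As a hedged sketch, a running Driver Application can check how it was submitted by reading the standard spark.master and spark.submit.deployMode configuration keys, which spark-submit sets from --master and --deploy-mode (`sc` is again an assumed, existing SparkContext):

```scala
// Inside the Driver Application: inspect how the application was submitted.
val conf = sc.getConf
val master     = conf.get("spark.master", "not set")
val deployMode = conf.get("spark.submit.deployMode", "client") // client is the default

println(s"master=$master, deployMode=$deployMode")
```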
The relationship between the roles:
The Spark computing platform has two important roles, the Driver and the Executor. Whether in Standalone mode or YARN mode, the Driver acts as the master role of the application, responsible for generating the task execution plan and for distributing and scheduling tasks; the Executors act as the worker role, responsible for actually running the tasks and returning the computed results to the Driver.
3. Spark programming model:
(1) The flow of a Spark application from writing to output:
- Users write a Driver Application program using the API provided by SparkContext (commonly used calls include textFile, sequenceFile, runJob, stop, etc.).
- When the user application is submitted through the SparkContext object, the resource configuration for the task is first broadcast using BlockManager and BroadcastManager. The DAGScheduler then converts the job into RDDs, organizes them into a DAG, and divides the DAG into different Stages. Finally, the TaskScheduler submits the tasks to the cluster manager (ClusterManager) with the help of ActorSystem (see the sketch after this list).
- The cluster manager (ClusterManager) allocates resources to tasks; that is, specific tasks are assigned to a Worker, and the Worker creates Executors to handle the running of those tasks. Standalone, YARN, Mesos, Kubernetes, EC2, and so on can all serve as cluster managers for Spark.
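A small user-side sketch of this flow (paths and data are placeholders; `sc` is the assumed SparkContext): the driver broadcasts a lookup table, which the BroadcastManager and BlockManager distribute to executors, and the final action triggers DAG construction in the DAGScheduler and task submission through the TaskScheduler:

```scala
// Broadcast a lookup table; BroadcastManager/BlockManager distribute it to executors.
val countryNames = sc.broadcast(Map("CN" -> "China", "US" -> "United States"))

val visits = sc.textFile("hdfs:///data/visits.csv")   // placeholder path: lines like "CN,123"
  .map(_.split(","))
  .map(cols => (countryNames.value.getOrElse(cols(0), "Unknown"), cols(1).toLong))

// The action below triggers the DAGScheduler (DAG/stage construction)
// and the TaskScheduler (task submission to the cluster manager).
visits.take(5).foreach(println)
```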
(2) Spark computing model:
RDDs can be seen as a unified abstraction over various data computation models. Spark's computation process is essentially the iterative computation of RDDs, which is very similar to a pipeline. The number of partitions depends on the partition setting, and the data in each partition is processed by exactly one Task; all partitions can be executed in parallel on the Executors of multiple machine nodes.
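A brief sketch of the partition-to-task relationship described above (the numbers are illustrative; `sc` is the assumed SparkContext): each partition is processed by exactly one Task, and changing the partition count changes the number of tasks:

```scala
// One Task processes exactly one partition; partitions run in parallel across Executors.
val data = sc.parallelize(1 to 1000, numSlices = 8)
println(s"partitions = ${data.getNumPartitions}")   // 8 partitions -> 8 tasks per stage

// mapPartitions works on a whole partition inside a single Task.
val partitionSums = data.mapPartitions(iter => Iterator(iter.sum))
println(partitionSums.collect().mkString(", "))     // one partial sum per partition/Task

// Changing the number of partitions changes the number of tasks (repartition shuffles data).
println(data.repartition(4).getNumPartitions)       // 4
```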