The Core Principles and Architecture of Spark


This article mainly explains the core principles and architecture of Spark. Interested readers may wish to take a look; the approach introduced here is simple, fast, and practical. Let's work through the core principles and architecture of Spark together.

Spark RDD characteristics

RDD (Resilient Distributed Dataset) is an in-memory abstraction of a distributed dataset that provides fault tolerance through a restricted form of shared memory. At the same time, this memory model makes computation more efficient than traditional data-flow models. An RDD has five important properties, listed below (a short code sketch follows the list):

1. A list of partitions, the basic units of the dataset.

2. A function for computing each partition.

3. A list of dependencies on parent RDDs, which describes the lineage between RDDs.

4. Optionally, for key-value RDDs, a Partitioner (usually HashPartitioner or RangePartitioner).

5. Optionally, a list of preferred locations (for example, the locations of the HDFS blocks that back the file).
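
As a rough illustration (this sketch is not from the original article), the five properties can be observed through the public RDD API. The local master URL and the sample data below are placeholders; property 2, the per-partition compute function, is only exercised when an action such as collect() is triggered.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RddProperties {
  def main(args: Array[String]): Unit = {
    // "local[2]" and the sample data are illustrative placeholders.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-properties").setMaster("local[2]"))

    // A key-value RDD repartitioned with a HashPartitioner (property 4).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), 3)
      .partitionBy(new HashPartitioner(3))

    println(pairs.partitions.length)                        // property 1: the list of partitions
    println(pairs.partitioner)                              // property 4: Some(HashPartitioner)
    println(pairs.dependencies)                             // property 3: lineage back to the parent RDD
    println(pairs.preferredLocations(pairs.partitions(0)))  // property 5: locality hints (empty for in-memory data)

    // Property 2, the per-partition compute function, runs when an action is triggered.
    println(pairs.mapValues(_ * 10).collect().toSeq)

    sc.stop()
  }
}
```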

Spark Running Mode Analysis: Overview

When a user submits a task to Spark for processing, the following two parameters together determine how Spark runs it. --master MASTER_URL: determines which cluster the Spark task is submitted to. --deploy-mode DEPLOY_MODE: determines where the Driver runs; the available values are client and cluster.
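
For illustration only (not part of the original text), these two flags have equivalent configuration keys, spark.master and spark.submit.deployMode, which spark-submit also reads from configuration; the master URL below is a placeholder.

```scala
import org.apache.spark.SparkConf

// Minimal sketch: the two spark-submit flags map onto these configuration keys.
// "spark://master-host:7077" is a placeholder, not a real cluster address.
val conf = new SparkConf()
  .set("spark.master", "spark://master-host:7077")  // equivalent of --master MASTER_URL
  .set("spark.submit.deployMode", "cluster")        // equivalent of --deploy-mode DEPLOY_MODE
```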

Characteristics of the Spark Running Architecture

Each Application gets its own Executor processes, which stay up for the lifetime of the Application and run Tasks in multiple threads. This application isolation mechanism has advantages both from a scheduling perspective (each Driver schedules its own tasks) and from an execution perspective (Tasks from different Applications run in separate JVMs). Of course, it also means that data cannot be shared across Spark Applications unless it is written to an external storage system.

Spark is not tied to any particular resource manager: as long as it can obtain Executor processes and keep communicating with them, Spark can run.

The Client that submits the SparkContext should be close to the Worker nodes (the nodes running the Executors), preferably in the same rack, because a large amount of information is exchanged between the SparkContext and the Executors while a Spark Application runs. If the job must run against a remote cluster, it is better to use RPC to submit the SparkContext to the cluster than to run the SparkContext far away from the Workers.

Terminology:

Application: every Spark program is called an Application.

Driver: each Spark program runs a Driver process, which coordinates tasks and tracks their progress.

Worker: each Spark program uses multiple Worker processes; Workers can run on one or more nodes and host multiple Executor child processes.

Executor: each Spark program runs multiple Executor processes, which carry out the actual computation.

Standalone Running Mode

Spark Standalone mode is Spark's own independent deployment mode: it comes with a complete set of services (resource management plus resource scheduling) and can be deployed on a cluster by itself, without relying on any other resource management system. In this mode, users start a cluster by manually launching a Master and Workers, where the Master plays the role of resource manager and the Workers act as compute nodes. In this mode, the Spark Driver program runs on the client, while the Executors run on the Worker nodes.

Standalone component analysis: the whole cluster is divided into a Master node and Worker nodes, and the Driver program runs on the client.

The Master node is responsible for allocating compute resources on the Worker nodes to tasks, and the two keep their resource status in sync by communicating with each other, as shown by the red two-way arrow in the figure.

After the client starts a task, it runs the Driver program; the Driver program initializes a SparkContext object and registers with the Master.

There are one or more ExecutorBackend processes on each Worker node. Each process contains an Executor object that holds a thread pool, and each thread in the pool can execute one task (see the configuration sketch below). The ExecutorBackend process is also responsible for communicating with the Driver program on the client node and reporting task status.
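
As a hedged sketch (the values below are illustrative, not from the original article), the number of tasks that Executor thread pool runs concurrently is governed by two standard configuration properties:

```scala
import org.apache.spark.SparkConf

// Sketch: an Executor can run roughly spark.executor.cores / spark.task.cpus tasks at once.
val conf = new SparkConf()
  .set("spark.executor.cores", "4")  // cores (threads) available to each Executor
  .set("spark.task.cpus", "1")       // cores claimed by each task
// => with these values, each ExecutorBackend would run up to 4 tasks concurrently
```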

Spark Standalone task running process

The process above reflects the overall interaction among the client, the Master, and the Worker nodes when Spark runs in standalone mode. The specific running process of a task breaks down into finer steps, shown in the detailed annotations in the figure; a short code sketch of this flow follows the numbered steps.

1. The user starts the application's Driver process through the bin/spark-submit or bin/spark-class deployment tools; the Driver process initializes a SparkContext object and registers with the Master node.

2. The Master node accepts the Driver program's registration, checks the Worker nodes it manages, and allocates the required Executor compute resources to the Driver program. After a Worker node finishes launching its Executor, it reports the Executor's status to the Master.

3. After the ExecutorBackend process on a Worker node starts, it registers with the Driver process.

4. After the DAGScheduler (which divides the job into stages), the TaskScheduler, and other components inside the Driver process finish dividing the work into tasks, the tasks are assigned to the ExecutorBackends on the Worker nodes.

5. The ExecutorBackends perform the task computation and report task status to the Driver until the tasks finish.

6. After all tasks have been processed, the Driver process deregisters from the Master.
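
To make the flow concrete, here is a minimal word-count sketch (not from the original article) submitted against a standalone Master; the master URL and the HDFS path are placeholders. Because reduceByKey introduces a shuffle, the DAGScheduler splits the job into two stages and the TaskScheduler ships the resulting tasks to the ExecutorBackends.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StandaloneWordCount {
  def main(args: Array[String]): Unit = {
    // The master URL and input path are illustrative placeholders.
    val conf = new SparkConf()
      .setAppName("standalone-wordcount")
      .setMaster("spark://master-host:7077")
    val sc = new SparkContext(conf)    // step 1: the Driver registers with the Master

    // The shuffle introduced by reduceByKey splits the job into two stages (step 4).
    val counts = sc.textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.collect().foreach(println)  // step 5: task results and status flow back to the Driver
    sc.stop()                          // step 6: the Driver deregisters from the Master
  }
}
```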

Spark Running Mode Analysis: Standalone

Summary of Spark Standalone Mode

Spark can run in standalone mode, which is a cluster mode provided by Spark itself. Users can start a separate cluster by manually launching the master and worker processes, or run these daemons on a single machine for testing. Standalone mode can be used in production environments and effectively lowers the cost of learning and testing the Spark framework.

Standalone mode currently supports only simple FIFO scheduling across applications. To allow multiple concurrent users, however, you can cap the maximum amount of resources used by each application; by default, an application requests all of the cluster's CPU cores (see the sketch below).
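
A minimal sketch of such a cap, assuming a standalone cluster (the values are illustrative, not from the original article):

```scala
import org.apache.spark.SparkConf

// Without spark.cores.max, the application requests all available cores by default.
val conf = new SparkConf()
  .setAppName("capped-app")
  .set("spark.cores.max", "8")         // at most 8 cores across the whole cluster
  .set("spark.executor.memory", "2g")  // memory per executor
```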

By default, standalone task scheduling tolerates worker failures (in which case failed tasks are moved to other workers). However, scheduling is performed by the master, which creates a single point of failure: if the master crashes, no new applications can be created. To address this, ZooKeeper's leader-election mechanism can be used to run multiple masters in the cluster, or single-node recovery can be implemented with local files.

Spark Running Mode Analysis: Cluster

Spark Cluster mode task running process

1. The user submits the application to the Yarn cluster through the bin/spark-submit or bin/spark-class deployment tools.

2. The Yarn cluster's Resource Manager selects a Node Manager node for the submitted application, allocates the first container, and launches the SparkContext object in that node's container.

3. The SparkContext object requests resources from the Yarn cluster's Resource Manager in order to run Executors (a small configuration sketch follows these steps).

4. The Yarn cluster's Resource Manager allocates containers to the SparkContext object; the SparkContext communicates with the corresponding Node Managers and starts an ExecutorBackend daemon in each acquired container. Once started, each ExecutorBackend registers with the SparkContext and requests Tasks.

5. The SparkContext assigns Tasks to the ExecutorBackends for execution.

6. The ExecutorBackends execute the Tasks and report their status to the SparkContext in a timely manner.

7. When the Tasks are finished, the SparkContext returns the resources to the Node Managers and deregisters.
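
As an illustrative sketch (not from the original article), the amount of Executor resources requested in step 3 is typically shaped by settings like the following; the values are placeholders.

```scala
import org.apache.spark.SparkConf

// How many ExecutorBackend containers the SparkContext asks the Resource Manager for,
// and how large each container is (illustrative values).
val conf = new SparkConf()
  .setAppName("yarn-cluster-app")
  .set("spark.executor.instances", "4")  // number of executor containers requested
  .set("spark.executor.memory", "2g")    // memory per executor container
  .set("spark.executor.cores", "2")      // cores per executor container
```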

Summary of Spark Cluster Mode

Spark can also run on a cluster manager; the available options here are Yarn and Mesos. In cluster mode, Spark's Driver program may be scheduled onto any node, and after the task is executed, the resources allocated by the cluster are reclaimed.

At this point, I believe you have a deeper understanding of the core principles and architecture of Spark. You might as well put it into practice. For more related content, follow us and keep learning!
