This article introduces the Spark architecture and its core concepts. Many readers may not be familiar with the details, so it is shared here for reference; I hope you learn something useful from it.
Hadoop and Spark
Hadoop, a big data processing technology, has been around for about ten years and is regarded as the go-to solution for collecting and processing big data. MapReduce works well for single-pass computation, but it is inefficient for use cases that require multi-pass computation or iterative algorithms. Each step in the pipeline needs a Map phase and a Reduce phase, so to use this model you must express every use case as a sequence of MapReduce jobs.
Before the next step can begin, the output of the previous step must be written to the distributed file system, so replication and disk I/O slow this approach down. In addition, Hadoop deployments usually involve clusters that are difficult to install and manage, and handling different big data use cases requires integrating several separate tools (such as Mahout for machine learning and Storm for stream processing).
To do anything more complex, you must chain a series of MapReduce jobs and execute them sequentially. Each job has high latency, and the next job cannot start until the previous one has finished.
Spark lets developers build complex, multi-step data pipelines as directed acyclic graphs (DAGs). It also supports in-memory data sharing across the DAG, so different jobs can work with the same data without re-reading it from disk.
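As a rough illustration of this point, here is a minimal Scala sketch of a two-branch pipeline that caches an intermediate RDD so both downstream jobs reuse it in memory; the file names are placeholders, not from the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline").setMaster("local[*]"))

    // Step 1: load and tokenize the raw data (placeholder path).
    val words = sc.textFile("input.txt").flatMap(_.split("\\s+"))

    // Cache the intermediate RDD so both downstream jobs reuse it in memory
    // instead of re-reading the file.
    words.cache()

    // Step 2a: word counts (one job in the DAG).
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Step 2b: distinct vocabulary size (a second job over the same cached data).
    val vocabulary = words.distinct().count()

    counts.saveAsTextFile("counts-out")
    println(s"distinct words: $vocabulary")
    sc.stop()
  }
}
```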
Spark runs on top of the existing Hadoop Distributed File System (HDFS) and provides additional capabilities. It can be deployed on an existing Hadoop v1 cluster (using SIMR, Spark Inside MapReduce), on a Hadoop v2 YARN cluster, or even on Apache Mesos.
Spark should be seen as an alternative to Hadoop MapReduce rather than a replacement for Hadoop itself. The intent is not to replace Hadoop but to provide a comprehensive, unified solution for managing different big data use cases and requirements.
Spark ecosystem
Spark integrates machine learning (MLlib), graph processing (GraphX), stream processing (Spark Streaming), and data warehousing (Spark SQL) into a single big data platform built on the Spark compute engine and resilient distributed datasets (RDDs). The Spark ecosystem uses HDFS, S3, and Tachyon as underlying storage engines, and YARN, Mesos, and Standalone as resource schedulers. Spark itself can implement MapReduce-style applications; Spark SQL supports ad hoc queries; Spark Streaming handles real-time applications; MLlib implements machine learning algorithms; GraphX performs graph computation; and SparkR supports complex mathematical computation.
Spark Streaming:
Spark Streaming processes real-time data streams using micro-batch computation. It is built on DStreams, which are simply sequences of resilient distributed datasets (RDDs) representing the incoming data.
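Below is a minimal DStream word-count sketch, assuming a plain text source on localhost:9999 (for example started with `nc -lk 9999`) and 5-second micro-batches; both are assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one core for the receiver, one for processing.
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each micro-batch of the DStream is an RDD of the lines received in that interval.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```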
Spark SQL:
Spark SQL can expose Spark datasets over a JDBC API and lets traditional BI and visualization tools run SQL-like queries against Spark data. Users can also use Spark SQL to perform ETL on data in different formats (such as JSON, Parquet, or databases), transform it, and expose it for specific queries.
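A minimal sketch of that ETL pattern follows: read JSON, register it as a view, query it with SQL, and write the result back out as Parquet. The file name and column names are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()

    // Load semi-structured JSON records into a DataFrame (placeholder path).
    val events = spark.read.json("events.json")

    // Register the data as a temporary view so SQL-like queries can run against it.
    events.createOrReplaceTempView("events")
    val daily = spark.sql("SELECT date, COUNT(*) AS n FROM events GROUP BY date")

    // Write the transformed result back out in Parquet format.
    daily.write.parquet("daily-counts.parquet")
    spark.stop()
  }
}
```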
Spark MLlib:
MLlib is Spark's scalable machine learning library. It consists of common learning algorithms and utilities, including binary classification, linear regression, clustering, collaborative filtering, gradient descent, and lower-level optimization primitives.
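As one example of the algorithms listed above, here is a minimal k-means clustering sketch using MLlib's RDD-based API; the data is a tiny in-memory sample rather than a real dataset.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-sketch").setMaster("local[*]"))

    // Two obvious clusters around (0, 0) and (10, 10).
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(0.2, 0.0),
      Vectors.dense(9.9, 10.0), Vectors.dense(10.1, 9.8)
    ))

    // Train a k-means model with k = 2 and at most 20 iterations.
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}
```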
Spark GraphX:
GraphX is a new (alpha) Spark API for graphs and graph-parallel computation. It extends the Spark RDD by introducing the Resilient Distributed Property Graph, a directed multigraph with attributes attached to vertices and edges. To support graph computation, GraphX exposes a set of fundamental operators (such as subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders that simplify graph analytics tasks.
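Here is a minimal GraphX sketch that builds a small property graph and uses the aggregateMessages operator mentioned above to count each vertex's in-degree; the vertex names and edge labels are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // Vertices carry a name attribute; edges carry a relationship label.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
    val graph = Graph(vertices, edges)

    // Send the value 1 to each edge's destination vertex, then sum per vertex.
    val inDegrees = graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
    inDegrees.collect().foreach { case (id, d) => println(s"vertex $id has in-degree $d") }
    sc.stop()
  }
}
```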
Basic concepts of Spark
Client: the client process, responsible for submitting jobs to the Master.
Application: a Spark Application is similar to an application in Hadoop MapReduce; it is a user-written Spark program containing the Driver code and the Executor code that runs on multiple nodes in the cluster.
Cluster Manager: the external service that acquires resources on the cluster. Currently there are:
Standalone: Spark's native resource management, in which the Master is responsible for resource allocation
Hadoop YARN: the ResourceManager in YARN is responsible for resource allocation
Master: in Standalone mode, the master node is responsible for receiving jobs submitted by the Client, managing Workers, and instructing Workers to start the Driver and Executors.
Worker: any node in the cluster that can run Application code, similar to a NodeManager node in YARN. In Standalone mode it is a Worker node configured through the slaves file; in Spark on YARN mode it is a NodeManager node. It manages the resources of its node, periodically reports heartbeats to the Master, receives commands from the Master, and starts the Driver and Executors.
Driver: a Spark job's runtime includes a Driver process, which is the main process of the job. It is responsible for parsing the job, generating Stages, and scheduling Tasks onto Executors. It includes the DAGScheduler and TaskScheduler.
Executor: the process in which tasks actually execute. A cluster generally contains multiple Executors; each Executor receives Launch Task commands from the Driver, and an Executor can execute one or more Tasks.
Job: a parallel computation consisting of multiple Tasks, typically triggered by a Spark Action. A Job contains multiple RDDs and the various operations applied to them.
Stage: a Spark job typically contains one or more Stages.
Task: a Stage contains one or more Tasks, which run in parallel.
DAGScheduler: splits a Spark job into one or more Stages. The number of Tasks in each Stage is determined by the number of Partitions of the RDD; the DAGScheduler then generates the corresponding Task set and hands it to the TaskScheduler.
TaskScheduler: assigns Tasks to Executors for execution.
SparkContext: the context of the entire application; it controls the application's life cycle (see the sketch after this list).
RDD: Spark's basic unit of computation; a set of RDDs forms an executable directed acyclic graph, the RDD Graph.
SparkEnv: a thread-level context that stores references to important runtime components.
SparkEnv creates and holds references to the following important components:
MapOutputTracker: responsible for storing Shuffle metadata.
BroadcastManager: responsible for controlling broadcast variables and storing their metadata.
BlockManager: responsible for storage management, and for creating and locating blocks.
MetricsSystem: monitors runtime performance metrics.
SparkConf: stores configuration information.
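The sketch below ties several of the terms above together: SparkConf holds configuration, SparkContext is the application's entry point, and the final action triggers a Job that the DAGScheduler splits into two Stages at the shuffle boundary, with one Task per partition. The master URL and data are assumptions for a local demonstration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConceptsSketch {
  def main(args: Array[String]): Unit = {
    // SparkConf holds configuration; SparkContext controls the application's life cycle.
    val conf = new SparkConf().setAppName("concepts-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // 4 partitions => 4 Tasks in the first Stage.
    val pairs = sc.parallelize(1 to 100, numSlices = 4).map(n => (n % 10, n))

    // reduceByKey introduces a shuffle, so the DAGScheduler creates a second Stage.
    val sums = pairs.reduceByKey(_ + _)

    // The action below submits the Job; the TaskScheduler ships Tasks to Executors.
    sums.collect().foreach(println)
    sc.stop()
  }
}
```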
Spark architecture
The Spark architecture follows the Master-Slave model common in distributed computing. The Master is the node in the cluster running the Master process, and a Slave is a node running a Worker process. The Master, as the controller of the entire cluster, is responsible for its healthy operation; a Worker is a compute node that receives commands from the master node and reports its status; Executors are responsible for executing tasks; the Client submits applications on the user's behalf; and the Driver controls the execution of one application, as shown in the figure.
Once a Spark cluster is deployed, the Master process and Worker processes must be started on the master node and slave nodes respectively to control the cluster. The Driver and the Workers play two key roles in executing a Spark application: the Driver program is the starting point of the application logic and is responsible for job scheduling, that is, distributing Tasks, while the Workers manage compute nodes and create Executors to process tasks in parallel. During execution, the Driver serializes each Task and the files and jars it depends on and ships them to the corresponding Worker machine, where an Executor processes the tasks for its data partitions.
The overall flow of Spark is as follows: the Client submits the application; the Master finds a Worker on which to launch the Driver; the Driver requests resources from the Master or resource manager and turns the application into an RDD Graph; the DAGScheduler converts the RDD Graph into a directed acyclic graph of Stages and submits it to the TaskScheduler; and the TaskScheduler submits tasks to Executors for execution. While tasks run, the other components cooperate to ensure the whole application executes smoothly.
1. When the cluster starts, each slave node (Worker) registers with the cluster's Master, telling the Master that it is ready to work on demand.
2. The Master monitors the status of each Worker in real time through a heartbeat mechanism, checking that it is working properly.
3. The Driver Application also registers with the Master when a job is submitted.
4. After the job is registered, the Master issues commands to the Workers to launch Executors.
5. Each Worker launches a number of Executors ready for execution.
6. The Executors on each Worker register with the Driver Application, so that the Driver Application can distribute work to specific Executors.
7. Executors regularly report their current status to the Driver Application.
8. The Driver Application launches Tasks on Executors for execution (see the submission sketch below).
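For completeness, here is a minimal sketch of an application as it might be submitted to a standalone cluster (for example via spark-submit with `--master spark://<master-host>:7077` and `--deploy-mode cluster`); the host name and input path are placeholders. The comments indicate roughly where the steps above happen behind the scenes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SubmitSketch {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master URL is supplied by spark-submit.
    // Creating the SparkContext registers this Driver with the Master (step 3);
    // the Master then asks Workers to launch Executors (steps 4-6).
    val sc = new SparkContext(new SparkConf().setAppName("submit-sketch"))

    // The action below makes the Driver ship Tasks to the registered Executors
    // (step 8); Executors report status back to the Driver as they run (step 7).
    val total = sc.textFile("hdfs:///data/input.txt").map(_.length.toLong).reduce(_ + _)
    println(s"total characters: $total")
    sc.stop()
  }
}
```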
That covers the Spark architecture and its core concepts. Thank you for reading, and I hope it helps you understand how Spark works.