
How does Spark work


This article explains how Spark works. If you are interested, read on; the material is simple, practical, and quick to get through.

01 Overview

As requirements from different application fields have become more and more specialized, people have gradually built a variety of technical frameworks for those subdivided fields on top of Hadoop's original HDFS and MapReduce. Some specialize in batch processing, some in data warehousing, some in stream computing, and some in machine learning.

In my opinion, this is a bit like patching Hadoop: Hadoop did not consider so many scenarios at the beginning of its design, it was only meant to support offline batch processing. But the demands are there, and to meet them, people had to start from scratch and design new systems. This situation has caused a number of problems.

Repetitive work: different systems have to solve some of the same common problems, such as distributed execution and fault tolerance. For example, MapReduce, SQL query engines, and machine learning systems all implement aggregation operations.

Combination: combining different systems is very "expensive" because there is no efficient way to exchange data between them. To use them together, we have to frequently export and import data between different systems, and moving the data around may take more time than the computation itself.

Maintenance cost: although each of these systems is excellent on its own, they were designed and implemented by different teams in different periods, and their design ideas and implementation approaches differ. This makes deploying and operating them together on one platform very painful, because they are so different.

Learning cost: the huge differences between systems also burden developers. Each framework has its own logical objects, technical terms, APIs, and programming model, and each has to be learned from scratch before it can be used.

Spark is aware of these problems and, as a latecomer, has a natural advantage. Spark appeared in 2012, by which time the Hadoop ecosystem had gone through six years of development and its shape had largely settled. Spark could see which subdivided fields big data had developed, and since MapReduce, Hive, Storm, and other open source components had been around for years, Spark could also learn from their strengths and weaknesses.

So Spark came to the fore and became the most popular distributed in-memory computing engine in the open source community. Spark uses the DAG (directed acyclic graph) model as its execution model and relies mainly on in-memory computation for its tasks.

Based on a unified data model (RDD) and programming model (Transformation / Action), Spark has built branches such as Spark SQL, Spark Streaming, and Spark MLlib, whose functionality covers many areas of big data, as shown in figure 2-14.

▲ figure 2-14 areas covered by Spark

Through its unified data model and programming model, Spark has built branch libraries for SQL queries, stream computing, machine learning, and graph computing.

02 Data Model

RDD is short for Resilient Distributed Dataset. It is an extension of the MapReduce model, and the fact that Spark can support so many areas of big data at once depends, to a large extent, on the capabilities of RDD.

Although batch processing, stream computing, graph computing, and machine learning look unrelated at first glance, they share a common requirement: data must be shared efficiently between parallel computing phases.

The designers of RDD saw through this and built RDD around efficient data sharing and MapReduce-like operations, enabling it to express a variety of programming models such as iterative algorithms, relational queries, MapReduce, and stream processing.

At the same time, RDD is a fault-tolerant, parallel data structure that lets users choose whether data is stored on disk or in memory and control how the data is partitioned, and it provides a set of efficient programming interfaces for manipulating datasets.
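As a minimal sketch of these interfaces (Scala, using a local SparkSession created only for illustration; the numbers and names are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Illustrative local session; on a real cluster the master URL would differ.
val spark = SparkSession.builder()
  .appName("rdd-storage-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Create an RDD with an explicit number of partitions (4 is arbitrary here).
val numbers = sc.parallelize(1 to 1000000, 4)

// Ask Spark to keep the data in memory, spilling to disk if it does not fit.
val doubled = numbers.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

// Control partitioning explicitly by redistributing the data into 8 partitions.
val repartitioned = doubled.repartition(8)
println(repartitioned.getNumPartitions)   // 8

spark.stop()
```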

03 Programming Model and Job Scheduling

Spark divides operations on RDDs into two categories: transformations and actions.

A transformation is a lazy operation: it only defines a new RDD and does not execute immediately. An action triggers the computation right away, either returning the result to the driver process or writing it to external storage. Common transformations include map, flatMap, and filter; common actions include count and collect.
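A minimal sketch of the difference (Scala, runnable in spark-shell where sc is predefined; the tiny dataset is made up):

```scala
// Transformations are lazy: they only describe new RDDs.
val lines   = sc.parallelize(Seq("hello spark", "hello world"))
val words   = lines.flatMap(_.split(" "))   // nothing computed yet
val matched = words.filter(_ == "hello")    // still nothing computed

// Actions trigger the actual computation.
val total  = matched.count()                // a job runs here
val sample = words.collect()                // results are pulled back to the driver
println(s"hello appears $total times; words: ${sample.mkString(", ")}")
```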

When the user performs an action on an RDD, the scheduler builds a DAG (directed acyclic graph) from the RDD's dependency chain to execute the program. The DAG consists of several stages; each stage contains a run of consecutive narrow dependencies, while the boundaries between stages are wide dependencies. As shown in figure 2-15, a solid box represents an RDD, the rectangles inside it represent partitions, and a partition is drawn in black if it is already held in memory.

▲ figure 2-15 Spark task split diagram
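For instance, in the classic word count below (a sketch for spark-shell, where sc is predefined and the input file name is illustrative), reduceByKey introduces a shuffle, so the single job triggered by collect() is split by the scheduler into two stages:

```scala
val counts = sc.textFile("input.txt")        // illustrative file name
  .flatMap(_.split("\\s+"))                  // narrow: stays in the first stage
  .map(word => (word, 1))                    // narrow: still the first stage
  .reduceByKey(_ + _)                        // wide: shuffle, starts a second stage

counts.collect()   // only now does the scheduler build the DAG and run both stages
```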

04 Dependencies

As a data structure, an RDD is essentially a read-only collection of partitioned records. An RDD can contain multiple partitions, each of which is a slice of the data.

RDDs can depend on one another. If each partition of the parent RDD is used by at most one partition of a child RDD, the dependency is called narrow; if multiple child RDD partitions depend on a single partition of the parent RDD, the dependency is called wide. Different operations produce different kinds of dependency: map, for example, produces a narrow dependency, while join produces a wide dependency.
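The dependency kind can be inspected directly on an RDD (a sketch for spark-shell with sc predefined; the printed class names such as OneToOneDependency and ShuffleDependency are what current Spark versions report):

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val narrow = pairs.mapValues(_ * 10)   // each child partition reads one parent partition
val wide   = pairs.groupByKey()        // child partitions read many parent partitions

println(narrow.dependencies.map(_.getClass.getSimpleName))   // e.g. OneToOneDependency
println(wide.dependencies.map(_.getClass.getSimpleName))     // e.g. ShuffleDependency
```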

Spark distinguishes these two kinds of dependency for two reasons. First, narrow dependencies can be pipelined on a single cluster node, for example running map followed immediately by filter. Wide dependencies, by contrast, require the data of all parent partitions to be available and passed between nodes through a shuffle.

Second, failure recovery is more efficient with narrow dependencies, because only the lost parent partitions need to be recomputed, and those recomputations can run in parallel on different nodes. With wide dependencies, a single failed node may cause the loss of some partitions in every ancestor RDD of a given RDD, forcing the computation to be re-executed. Figure 2-16 illustrates the difference between narrow and wide dependencies.

▲ figure 2-16 Spark RDD wide and narrow dependency diagram

05 Fault Tolerance

Traditional distributed systems use two fault-tolerance schemes: data replication and log-based recovery. Both are expensive for data-intensive systems, because they require large amounts of data to be copied across the cluster network, and network bandwidth is far slower than memory access.

RDD is inherently fault tolerant. First, it is an immutable dataset; second, Spark uses a DAG as its execution model, so the lineage of operations recorded in RDD dependencies can be used to regenerate the DAG. When a task fails, Spark only needs to recompute along the DAG to recover. Because fault tolerance does not require replication, Spark avoids the cost of shipping data across the network.
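The lineage that Spark would replay to recompute a lost partition can be inspected with toDebugString (a sketch for spark-shell with sc predefined; the input path is illustrative):

```scala
val base     = sc.textFile("hdfs:///tmp/input.txt")   // illustrative path
val filtered = base.filter(_.nonEmpty)
val lengths  = filtered.map(_.length)

// No replicas are kept: if a partition of `lengths` is lost, Spark re-runs only
// the filter/map chain above for that partition, as recorded in this lineage.
println(lengths.toDebugString)
```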

06 Cluster Mode

A Spark application runs on a cluster as a set of independent processes coordinated by the SparkContext object in the main program (also called the driver program). Spark currently supports three cluster operation modes.

Specifically, Spark can run on its own in standalone mode, or on Mesos or YARN.

As shown in figure 2-17, once the SparkContext is connected to the cluster, Spark first acquires executor processes on the cluster nodes; these perform the computation and storage logic of our program. It then ships our program code to each executor process as jar packages, and finally the SparkContext dispatches tasks to the executors for execution.

▲ figure 2-17 Spark task progress diagram
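In code, the cluster manager is selected through the master URL while the rest of the application stays unchanged (a sketch; the host names and ports below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cluster-mode-sketch")
  // .master("spark://master-host:7077")   // Spark standalone cluster
  // .master("mesos://mesos-host:5050")    // Apache Mesos
  // .master("yarn")                       // Hadoop YARN (reads HADOOP_CONF_DIR)
  .master("local[*]")                      // single machine, for trying things out
  .getOrCreate()

println(spark.sparkContext.master)
spark.stop()
```

In practice the master is usually supplied through spark-submit's --master option rather than hard-coded, so the same application jar can be sent to any of the three cluster managers.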

Each application has its own executor processes, which stay up for the lifetime of the application and run tasks in multiple threads. The advantage of this design is that resource consumption is isolated between applications, each running in its own JVM. It also means, however, that the SparkContexts of different applications cannot share data except through external storage.

Spark is agnostic to the underlying cluster manager. As long as it can acquire executors and those processes can communicate with each other, it can run easily on general-purpose cluster resource scheduling frameworks such as Mesos and YARN.

07 Usage Scenarios

Thanks to the excellent design of RDD, Spark supports many fields at once, which means we can integrate several kinds of operation in a single piece of program logic.

For example, we can use SQL queries to filter data and then run machine learning on the result, or manipulate streaming data through SQL. This is not only convenient, it also flattens the learning curve: with Spark you only need to learn one programming model to work across multiple fields.
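A minimal sketch of that kind of integration (Scala, for spark-shell where spark is predefined; the column names and the tiny in-memory dataset are made up):

```scala
import spark.implicits._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val events = Seq((1.0, 2.0, "ok"), (5.0, 6.0, "ok"), (9.0, 0.0, "bad"))
  .toDF("x", "y", "status")
events.createOrReplaceTempView("events")

// Step 1: filter the data with a SQL query.
val good = spark.sql("SELECT x, y FROM events WHERE status = 'ok'")

// Step 2: feed the filtered result straight into MLlib.
val features = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(good)

val model = new KMeans().setK(2).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)
```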

Therefore, Spark is the most appropriate one-stop computing solution at the platform level.

At this point, I believe you have a deeper understanding of how Spark works. You might as well try it out in practice.
