
How to analyze the underlying principles of Spark


How do you analyze the underlying principles of Spark? Many readers new to Spark are unsure where to start, so this article walks through the main mechanisms step by step. I hope it helps you answer that question.

Introduction to Spark

Apache Spark is a unified analytics engine for large-scale data processing. By computing in memory, it improves the real-time performance of data processing in big data environments while ensuring high fault tolerance and high scalability, and it allows users to deploy Spark on large numbers of machines to form a cluster.

The Spark source code has grown from roughly 400,000 lines in the 1.x releases to more than 1,000,000 lines today, with more than 1,400 contributors. The whole Spark framework is a huge project. Let's take a look at the underlying implementation principles of Spark.

Spark running process


The specific running process is as follows:

1. SparkContext registers with the resource manager and applies for resources to run Executors.

2. The resource manager allocates resources and starts the Executors.

3. The Executors send heartbeats to the resource manager.

4. SparkContext builds the DAG (directed acyclic graph).

5. The DAG is decomposed into Stages (TaskSets).

6. The Stages are sent to the TaskScheduler.

7. The Executors request Tasks from SparkContext.

8. The TaskScheduler sends Tasks to the Executors to run.

9. At the same time, SparkContext ships the application code to the Executors.

10. The Tasks run on the Executors; when they finish, all resources are released.
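To make these steps concrete, here is a minimal, self-contained Scala application sketched against them; the input path and filter condition are placeholder assumptions, not part of the original text:

import org.apache.spark.{SparkConf, SparkContext}

object MinimalApp {
  def main(args: Array[String]): Unit = {
    // Steps 1-2: creating the SparkContext registers with the resource
    // manager, which allocates and starts the Executors.
    val conf = new SparkConf().setAppName("MinimalApp")
    val sc = new SparkContext(conf)

    // Transformations only describe the DAG; nothing runs yet.
    val errors = sc.textFile("hdfs:///tmp/app.log")   // placeholder path
      .filter(_.contains("ERROR"))

    // Steps 4-8: the action triggers DAG construction, Stage splitting and
    // Task scheduling on the Executors.
    println(errors.count())

    // Step 10: stopping the context releases all Executors and resources.
    sc.stop()
  }
}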

1. The construction of the DAG, from the code's point of view

val lines1 = sc.textFile(inputPath2).map(...).map(...)
val lines2 = sc.textFile(inputPath3).map(...)
val lines3 = sc.textFile(inputPath4)
val dtinone1 = lines2.union(lines3)
val dtinone = lines1.join(dtinone1)
dtinone.saveAsTextFile(...)
dtinone.filter(...).foreach(...)

Build DAG diagrams

When a computation is triggered, the Spark kernel draws a directed acyclic graph of the computation path, namely the DAG shown in the figure above.

Spark's computation happens in an RDD's Action operations; for all the Transformations before an Action, Spark only records the lineage of the RDDs and does not trigger any actual computation.
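A quick spark-shell style illustration of this laziness (the data here is an assumption for the example):

val nums = sc.parallelize(1 to 1000000)

// Transformations: only the RDD lineage is recorded, no computation happens here.
val doubled = nums.map(_ * 2)
val filtered = doubled.filter(_ % 3 == 0)

// The Action is what actually submits a job and triggers the computation.
val total = filtered.count()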

2. Core algorithm for partitioning the DAG into Stages

An Application can contain multiple jobs and multiple Stages:

Within a Spark Application, different Actions can trigger many jobs, so an Application may contain many jobs, and each job consists of one or more Stages. A later Stage depends on earlier ones: it runs only after the Stages it depends on have finished computing.

Basis for the division:

Stages are split at wide dependencies; operators such as reduceByKey and groupByKey introduce wide dependencies.

A review of how narrow and wide dependencies are distinguished:

Narrow dependency: a partition of the parent RDD is depended on by only one partition of the child RDD, i.e. a one-to-one or many-to-one relationship. Common narrow dependencies: map, filter, union, mapPartitions, mapValues, join (when the parent RDDs are hash-partitioned), and so on.

Wide dependency: a partition of the parent RDD is depended on by multiple partitions of the child RDD, which involves a shuffle, i.e. a one-to-many relationship. Common wide dependencies: groupByKey, partitionBy, reduceByKey, join (when the parent RDDs are not hash-partitioned), and so on.
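The difference can be observed directly on an RDD's lineage. A small sketch, assuming a spark-shell session (OneToOneDependency and ShuffleDependency are the dependency classes in org.apache.spark; the exact debug output varies by version):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val mapped = pairs.mapValues(_ + 1)     // narrow dependency (one-to-one)
val reduced = pairs.reduceByKey(_ + _)  // wide dependency (shuffle)

// Narrow dependencies show up as OneToOneDependency, wide ones as ShuffleDependency.
println(mapped.dependencies.map(_.getClass.getSimpleName).mkString(","))
println(reduced.dependencies.map(_.getClass.getSimpleName).mkString(","))

// toDebugString shows where the shuffle (and hence the Stage boundary) will fall.
println(reduced.toDebugString)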

Core algorithm: backtracking

Traverse the lineage backwards, from the end towards the beginning: a narrow dependency adds the RDD to the current Stage, while a wide dependency triggers a Stage split.

The Spark kernel works backwards from the RDD on which the Action was triggered. It first creates a Stage for the final RDD and then keeps walking backwards; whenever it finds that an RDD has a wide dependency, it creates a new Stage for the wide-dependency RDD, which becomes the last RDD of that new Stage.

It continues backwards in this way, splitting Stages at wide dependencies and extending them along narrow ones, until every RDD has been traversed.
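The idea can be sketched as a simplified backwards walk over the lineage. This is only an illustrative sketch, not Spark's actual DAGScheduler code; the Stage case class and the buildStages helper are hypothetical:

import org.apache.spark.ShuffleDependency
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Hypothetical, simplified Stage: identified only by its last RDD.
case class Stage(lastRdd: RDD[_])

// Walk backwards from the RDD on which the Action was called.
def buildStages(finalRdd: RDD[_]): Seq[Stage] = {
  val stages = mutable.ArrayBuffer(Stage(finalRdd))
  val visited = mutable.Set[RDD[_]]()

  def visit(rdd: RDD[_]): Unit = {
    if (visited.add(rdd)) {                 // skip RDDs we have already seen
      rdd.dependencies.foreach {
        // Wide dependency: cut here and start a new Stage at the parent RDD.
        case shuffle: ShuffleDependency[_, _, _] =>
          stages += Stage(shuffle.rdd)
          visit(shuffle.rdd)
        // Narrow dependency: the parent RDD joins the current Stage.
        case narrow =>
          visit(narrow.rdd)
      }
    }
  }

  visit(finalRdd)
  stages.toSeq
}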

3. A closer look at dividing the DAG into Stages

DAG partition Stage

A Spark program can have multiple DAGs (one per Action: as many Actions as there are, that many DAGs; the flow in the figure ends with a single Action, not shown, so it forms one DAG).

A DAG can contain multiple Stages (split at wide dependencies / shuffles).

Tasks within the same Stage can run in parallel (number of Tasks = number of partitions; in the figure above, Stage1 has three partitions P1, P2 and P3, and correspondingly three Tasks).
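The Task count per Stage follows directly from the partition count, which can be inspected (and changed) on the RDD; a small spark-shell style sketch with a placeholder path:

val data = sc.textFile("hdfs:///tmp/input.txt", minPartitions = 3)  // placeholder path
println(data.getNumPartitions)   // 3 partitions -> 3 Tasks in that Stage

// Repartitioning changes the parallelism of the Stage that follows the shuffle.
val wider = data.repartition(6)
println(wider.getNumPartitions)  // 6 partitions -> 6 Tasks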

You can see that the only wide dependency in this DAG is the reduceByKey operation, which the Spark kernel uses as the boundary for splitting the graph into different Stages.

Notice also that within Stage1 in the figure, the steps from textFile to flatMap to map are all narrow dependencies, so they can be pipelined: a partition produced by flatMap can move straight into the map operation without waiting for the whole RDD to be processed, which greatly improves computation efficiency.
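The Stage1 pipeline in the figure corresponds to the classic word-count chain; a sketch with a placeholder input path:

val counts = sc.textFile("hdfs:///tmp/words.txt")  // placeholder path
  .flatMap(_.split("\\s+"))  // narrow: pipelined inside Stage1
  .map(word => (word, 1))    // narrow: a partition flows straight from flatMap to map
  .reduceByKey(_ + _)        // wide: shuffle, the Stage boundary falls here

counts.collect().foreach(println)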

4. Submit Stages

Submitting a scheduling stage is ultimately turned into submitting a task set. DAGScheduler submits the task set through the TaskScheduler interface, which eventually causes TaskScheduler to build a TaskSetManager instance to manage the lifecycle of that task set. At that point, as far as DAGScheduler is concerned, the work of submitting the scheduling stage is done.

The concrete TaskScheduler implementation, in turn, uses the TaskSetManager to schedule the individual tasks onto the corresponding Executor nodes once computing resources become available.

Overall task scheduling
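As a purely illustrative sketch of the DAGScheduler-to-TaskScheduler handoff described above (the types below are hypothetical simplifications, not Spark's real scheduler classes):

// Hypothetical, simplified model of the submission handoff.
case class Task(id: Int)
case class TaskSet(stageId: Int, tasks: Seq[Task])

class SimpleTaskSetManager(taskSet: TaskSet) {
  // Manages the lifecycle of one task set: launching, retries, completion.
  def scheduleOn(executorId: String): Unit =
    taskSet.tasks.foreach(t => println(s"task ${t.id} -> executor $executorId"))
}

class SimpleTaskScheduler {
  def submitTasks(taskSet: TaskSet): Unit = {
    // The DAGScheduler's work ends with this call; from here on the
    // TaskScheduler (via the manager) decides where and when tasks run.
    val manager = new SimpleTaskSetManager(taskSet)
    manager.scheduleOn("executor-1")  // placeholder: a real scheduler waits for resources
  }
}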

Features of the Spark running architecture

1. Dedicated Executor processes

Each Application gets its own Executor processes, which stay alive for the lifetime of the Application and run Tasks in a multithreaded manner.

A Spark Application cannot share data with other Applications unless the data is written to an external storage system, as shown in the figure.
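A sketch of the only supported workaround, sharing through external storage (the HDFS paths are placeholders, and otherSc stands for the second application's own SparkContext):

// Application A: compute a result and persist it to external storage.
val resultA = sc.parallelize(1 to 100).map(n => (n, n * n))
resultA.saveAsTextFile("hdfs:///shared/resultA")   // placeholder path
sc.stop()

// Application B runs as a separate process with its own SparkContext; the only
// way it can see Application A's result is to read it back from that storage:
// val shared = otherSc.textFile("hdfs:///shared/resultA")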

2. Support for multiple resource managers

3. Job submission proximity principle

The Client that submits the SparkContext should be close to the Worker nodes (the nodes that run the Executors), preferably in the same rack, because a large amount of information is exchanged between SparkContext and the Executors while a Spark Application is running.

If you want to run against a remote cluster, it is better to use RPC to submit the SparkContext to the cluster than to run SparkContext far away from the Workers.

4. The principle of moving computation rather than moving data

Execution follows the principle of moving the program to the data rather than moving the data to the program; Task scheduling uses data locality and speculative execution as optimization mechanisms.

Key methods: taskIdToLocations, getPreferredLocations.
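Both mechanisms are exposed through standard configuration keys; a sketch with example values (spark.locality.wait and spark.speculation are standard Spark settings, the values here are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("LocalityDemo")
  // How long to wait for a data-local slot before falling back to a less local one.
  .set("spark.locality.wait", "3s")
  // Re-launch suspiciously slow tasks on other executors (speculative execution).
  .set("spark.speculation", "true")

val sc = new SparkContext(conf)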

After reading the above, have you got a grip on how to analyze the underlying principles of Spark? If you want to learn more, you are welcome to keep following. Thank you for reading!
