First: thinking about pipeline again
Even with pipelining, a function f could operate on the data set of a dependent RDD in one of two ways:
1. f(record): f acts on the records of the set one at a time, handling a single record per call.
2. f(records): f acts on all the data of the set at once.
Spark adopts the first approach, for the following reasons (a short sketch follows the list):
1. There is no waiting, so the computing resources of the cluster can be used to the fullest.
2. It reduces the occurrence of OOM.
3. It maximizes concurrency.
4. It gives precise control over each Partition's dependencies (Dependency) and its internal computation (compute).
5. Lineage-based chaining of functional operators avoids materializing intermediate results and lets recovery happen as quickly as possible.
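A minimal sketch of the difference between the two modes, using plain Scala iterators to stand in for the records of one Partition rather than Spark's internal MapPartitionsRDD; the object and function names are illustrative only:

object RecordVsRecords {
  def main(args: Array[String]): Unit = {
    def process(r: Int): Int = r * 2

    // 1. f(record): records are pulled one at a time through the operator chain,
    //    so downstream steps start immediately and nothing is buffered.
    val oneAtATime = Iterator(1, 2, 3, 4).map(process).filter(_ > 4)
    println(oneAtATime.toList) // List(6, 8)

    // 2. f(records): the whole partition is materialized before the operator runs,
    //    which blocks downstream steps and risks OOM on large partitions.
    val allAtOnce = List(1, 2, 3, 4).map(process).filter(_ > 4)
    println(allAtOnce)         // List(6, 8)
  }
}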
Second: the physical execution of a Spark Job
A Spark Application can generate one or more Jobs. For example, when spark-shell starts, no Job is created by default; the shell merely acts as a resource-allocation program, but code written in it can trigger any number of Jobs. Generally speaking, a program can contain different Actions, and each Action usually triggers a Job.
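A minimal sketch, assuming a spark-shell session where sc is the SparkContext the shell provides; each of the two actions below (count and collect) would normally trigger its own Job:

// Entered in spark-shell; transformations alone never start a Job.
val nums    = sc.parallelize(1 to 100, 4)             // no Job yet
val doubled = nums.map(_ * 2)                          // still no Job: transformations are lazy

val total  = doubled.count()                           // Action 1: triggers Job 0
val sample = doubled.filter(_ % 10 == 0).collect()     // Action 2: triggers Job 1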
Spark is a more refined and efficient implementation of MapReduce. There are many different implementations of MapReduce; in Hadoop's MapReduce, for example, the basic flow is as follows: Mappers run concurrently, each in its own JVM, and execute map to produce output data; that output is written to the local file system according to the rules specified by the Partitioner; after Shuffle, Sort, and Aggregate turn it into the Reducer's input, reduce is executed to produce the final result. Although the process Hadoop MapReduce executes is simple, it is too rigid: it is ill-suited to building complex (especially iterative) algorithms, and its execution efficiency is extremely low.
The most basic core of Spark's algorithm construction and physical execution is maximizing the use of pipeline.
According to the idea of Pipeline, computation begins only when the data is actually used. Viewed as data flow, the data moves to where the computation is; in essence, though, and from a logical point of view, it is the operators that flow over the data.
From the point of view of algorithm construction: operators act on data, so the operators flow over the data.
From the point of view of physical execution: the data flows to the location of the computation.
For pipelining, the location where the data is computed is the last RDD of each Stage.
Because of the lazy nature of the computation, it is traced back from the last RDD toward the front, forming a Computing Chain; as a result, the computation has to begin with the Partitions it depends on in the RDDs on the left side of the Stage.
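A minimal sketch of this lazy, back-to-front triggering, again assuming a SparkContext sc; the input path is illustrative, and toDebugString is used only to print the lineage the Computing Chain is built from:

val lines  = sc.textFile("hdfs:///path/to/input")      // illustrative path
val words  = lines.flatMap(_.split(" "))               // narrow dependency
val longer = words.filter(_.length > 3)                // narrow dependency

println(longer.toDebugString)   // prints the lineage (the Computing Chain)

val n = longer.count()          // only this Action triggers computation,
                                // backtracking from `longer` all the way to `lines`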
Third: the inside story of narrow-dependency physical execution
The RDDs within a Stage are connected by narrow dependencies, and logically the narrow-dependency computation starts immediately from the leftmost RDD in the Stage. Following the Computing Chain, each piece of data (a Record) flows from one computation step to the next, and so on, until the last RDD in the Stage has been computed and produces the result.
The Computing Chain is constructed from back to front, while the actual physical computation lets each piece of data flow forward through the operators until it can flow no further, and only then is the next Record processed. This leads to a very good result: although a later RDD depends on an earlier one, the parent RDD does not have to finish computing all the Records in its Partition before the data can flow onward, which greatly improves computation speed.
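A minimal sketch of this record-by-record flow, using plain Scala iterators to stand in for one Partition; the println calls are only there to make the interleaving visible:

object RecordFlow {
  def main(args: Array[String]): Unit = {
    val partition = Iterator(1, 2, 3)

    val chained = partition
      .map    { r => println(s"map sees $r");    r * 10 }
      .filter { r => println(s"filter sees $r"); r > 10 }

    chained.foreach(r => println(s"result $r"))
    // Output interleaves per record: map sees 1, filter sees 10, map sees 2,
    // filter sees 20, result 20, ... Each record flows through the whole chain
    // before the next one is read, so the parent step never has to finish the
    // entire Partition first.
  }
}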
Fourth: wide-dependency physical execution
The current Stage can only be computed, via shuffle, after all the data of the last RDD in the parent Stage it depends on has been fully computed.
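A minimal sketch of such a shuffle boundary, assuming a SparkContext sc; reduceByKey introduces a wide dependency, so the stage that runs mapValues must finish all of its partitions before the reducing stage can run (toDebugString shows the ShuffledRDD in the lineage):

val pairs  = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), 2)
val mapped = pairs.mapValues(_ * 10)       // narrow: stays inside the same stage
val summed = mapped.reduceByKey(_ + _)     // wide: the shuffle starts a new stage

println(summed.toDebugString)              // lineage shows a ShuffledRDD boundary
summed.collect().foreach(println)          // the Action triggers both stages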