2025-01-19 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report
Today I will talk about the dependency relationships of Spark RDDs. Many readers may not be familiar with them, so this article summarizes the essentials; I hope you get something out of it.
Basic concepts of RDD dependencies
Dependencies between RDDs are like a chain of ancestry, and they exist between any pair of RDD operators. The relationship between two adjacent RDDs is called a dependency; the whole chain of dependencies across multiple consecutive RDDs is called lineage.
Every RDD preserves its lineage, just as you know who your father is, and who your father's father is.
An RDD does not store data itself. When an operator fails, Spark improves fault tolerance by using the dependencies between operators to trace back to the data source and then re-execute the chain in order, recomputing the lost data.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object RDD_Dependence_01 {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(sparkConf)

    // Each toDebugString call prints the lineage accumulated so far;
    // the source line numbers below appear in the output.

    val lines: RDD[String] = sc.makeRDD(List("hello world", "hello spark"))
    println(lines.toDebugString)
    println("*")
    val words: RDD[String] = lines.flatMap(_.split(" "))
    println(words.toDebugString)
    println("*")
    val wordToOne = words.map(word => (word, 1))
    println(wordToOne.toDebugString)
    println("*")
    val wordToSum: RDD[(String, Int)] = wordToOne.reduceByKey(_ + _)
    println(wordToSum.toDebugString)
    println("*")
    val array: Array[(String, Int)] = wordToSum.collect()
    array.foreach(println)
    sc.stop()
  }
}
The lineage log printed by the program is as follows:
(1) ParallelCollectionRDD[0] at makeRDD at RDD_Dependence_01.scala:13 []
*
(1) MapPartitionsRDD[1] at flatMap at RDD_Dependence_01.scala:16 []
 |  ParallelCollectionRDD[0] at makeRDD at RDD_Dependence_01.scala:13 []
*
(1) MapPartitionsRDD[2] at map at RDD_Dependence_01.scala:19 []
 |  MapPartitionsRDD[1] at flatMap at RDD_Dependence_01.scala:16 []
 |  ParallelCollectionRDD[0] at makeRDD at RDD_Dependence_01.scala:13 []
*
(1) ShuffledRDD[3] at reduceByKey at RDD_Dependence_01.scala:22 []
 +-(1) MapPartitionsRDD[2] at map at RDD_Dependence_01.scala:19 []
    |  MapPartitionsRDD[1] at flatMap at RDD_Dependence_01.scala:16 []
    |  ParallelCollectionRDD[0] at makeRDD at RDD_Dependence_01.scala:13 []
*
Wide dependency and narrow dependency
Narrow dependency
A narrow dependency means that each partition of the parent RDD is consumed by at most one partition of the child RDD (for example, map and filter).
Wide dependency
A wide dependency means that a partition of the parent RDD is consumed by multiple partitions of the child RDD. Whenever the parent RDD undergoes a Shuffle operation, the dependency between the parent and child RDD must be wide, so a wide dependency is also called a Shuffle dependency (for example, reduceByKey and groupByKey).
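To make the distinction concrete, the dependency type of an RDD can be inspected directly through its `dependencies` field. The sketch below is illustrative (the object and variable names are mine, not from the article) and assumes a local Spark setup:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DependencyKinds {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("DependencyKinds"))
    val nums = sc.makeRDD(List(1, 2, 3, 4), 2)

    // map: each parent partition feeds exactly one child partition -> narrow.
    val doubled = nums.map(_ * 2)
    println(doubled.dependencies.head.getClass.getSimpleName) // OneToOneDependency

    // reduceByKey: records must be regrouped across partitions -> wide (shuffle).
    val summed = nums.map(n => (n % 2, n)).reduceByKey(_ + _)
    println(summed.dependencies.head.getClass.getSimpleName) // ShuffleDependency

    sc.stop()
  }
}
```

`OneToOneDependency` extends `NarrowDependency`, while `ShuffleDependency` is exactly the class the stage-division logic below looks for.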
Stage division
A DAG (Directed Acyclic Graph) is a topological graph composed of vertices and edges: the edges have direction, and the graph contains no cycles. Spark uses a DAG to record the transformation process of RDDs and the stage boundaries of a job.
The stage-division process can be traced in the DAGScheduler source code:
The handleJobSubmitted method receives the final RDD (finalRDD) as a parameter. From finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite) you can see that, no matter how many RDDs a job contains, exactly one ResultStage is always created from the final RDD.
createResultStage then calls getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage], which builds the chain of parent stages from the shuffle dependencies returned by getShuffleDependencies(rdd: RDD[_]):

    private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
      getShuffleDependencies(rdd).map { shuffleDep =>
        getOrCreateShuffleMapStage(shuffleDep, firstJobId)
      }.toList
    }

The scaladoc on getShuffleDependencies notes that it returns only the shuffle dependencies that are immediate parents of the given RDD, not more distant ancestors: if C has a shuffle dependency on B, which has a shuffle dependency on A, calling it with RDD C returns only the B <-- C dependency. Each shuffle dependency therefore opens a new ShuffleMapStage, and Job -> Stage -> Task is a one-to-n relationship at each layer.
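Under these rules, the number of stages in a job is always the number of shuffle dependencies in the lineage plus the one ResultStage. The sketch below counts them by hand; the countShuffles helper is my own simplified stand-in for DAGScheduler's traversal, not Spark's actual implementation:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

object StageCount {
  // Simplified stand-in for DAGScheduler's traversal: walk the dependency
  // chain and count every ShuffleDependency encountered.
  def countShuffles(rdd: RDD[_]): Int =
    rdd.dependencies.map {
      case s: ShuffleDependency[_, _, _] => 1 + countShuffles(s.rdd)
      case d                             => countShuffles(d.rdd)
    }.sum

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("StageCount"))
    val wordToSum = sc.makeRDD(List("hello world", "hello spark"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // One ResultStage is always created for the final RDD; each shuffle
    // dependency adds one ShuffleMapStage in front of it.
    val stages = countShuffles(wordToSum) + 1
    println(s"stages: $stages") // stages: 2 (ShuffleMapStage + ResultStage)
    sc.stop()
  }
}
```

For the word-count job above this yields 2 stages, matching the single "+-" stage boundary visible in the toDebugString output.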
Having read the above, do you have a better understanding of Spark RDD dependencies? If you want to learn more, please follow the industry information channel. Thank you for your support.