RDD lineage: a detailed walkthrough of the source code

2025-01-18 Update From: SLTechnology News&Howtos shulou

I. RDD dependencies

RDD dependencies fall into two types: narrow dependencies and wide dependencies. They can be understood as follows:

(1) Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.

(2) Wide dependency: each partition of the parent RDD is used by multiple partitions of the child RDD.

With a narrow dependency, each child RDD partition can be computed in parallel from its own parent partitions alone. With a wide dependency, a child partition cannot be computed until the shuffle output of all parent RDD partitions is available.
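To make the two definitions concrete, here is a minimal Spark-free sketch. The object `DependencyKind`, the `Lineage` type, and the three sample mappings are hypothetical names introduced only for illustration; they are not part of Spark. Each mapping records, for every child partition, which parent partitions it reads.

```scala
object DependencyKind {
  // Hypothetical model: child partition index -> parent partition indices it reads.
  type Lineage = Map[Int, Set[Int]]

  // Narrow: every parent partition is read by at most one child partition.
  def isNarrow(lineage: Lineage): Boolean =
    lineage.values.toSeq.flatten      // all parent indices, one entry per use
      .groupBy(identity)              // parent index -> its uses
      .values
      .forall(_.size <= 1)

  // map-like: child i reads only parent i.
  val mapLike: Lineage = Map(0 -> Set(0), 1 -> Set(1), 2 -> Set(2))

  // coalesce-like (no shuffle): a child may read several parents,
  // but no parent is shared between children.
  val coalesceLike: Lineage = Map(0 -> Set(0, 1), 1 -> Set(2))

  // shuffle-like: every child reads every parent.
  val shuffleLike: Lineage = Map(0 -> Set(0, 1, 2), 1 -> Set(0, 1, 2))
}
```

Under this model, `isNarrow` holds for `mapLike` and `coalesceLike` but not for `shuffleLike`. The definition constrains how often a parent partition is used, not how many parents a child reads.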

II. Source code analysis of org.apache.spark.Dependency.scala

Dependency is an abstract class:

    // Dependency.scala
    abstract class Dependency[T] extends Serializable {
      def rdd: RDD[T]
    }

It has two subclasses, NarrowDependency and ShuffleDependency, which correspond to narrow and wide dependencies respectively.

(1) NarrowDependency is also an abstract class

It declares the abstract method getParents, which takes the partitionId of a child RDD partition and returns all the parent RDD partitions that the child partition depends on.

    // Dependency.scala
    abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
      /**
       * Get the parent partitions for a child partition.
       * @param partitionId a partition of the child RDD
       * @return the partitions of the parent RDD that the child partition depends upon
       */
      def getParents(partitionId: Int): Seq[Int]

      override def rdd: RDD[T] = _rdd
    }

NarrowDependency has two concrete implementations: OneToOneDependency and RangeDependency.
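As a rough illustration of how the two mappings differ, the sketch below models them in plain Scala with no RDD involved. The names `NarrowModel`, `NarrowDep`, `OneToOneDep`, `RangeDep`, `fromFirst`, and `fromSecond` are hypothetical stand-ins, and the partition counts are made up: a union of a 3-partition parent and a 2-partition parent, giving a child with 5 partitions.

```scala
object NarrowModel {
  // Simplified stand-in for the getParents contract; not Spark's class.
  trait NarrowDep {
    def getParents(partitionId: Int): Seq[Int]
  }

  // Child partition i depends on parent partition i.
  final class OneToOneDep extends NarrowDep {
    def getParents(partitionId: Int): List[Int] = List(partitionId)
  }

  // Child partitions [outStart, outStart + length) map one-to-one onto
  // parent partitions [inStart, inStart + length).
  final class RangeDep(inStart: Int, outStart: Int, length: Int) extends NarrowDep {
    def getParents(partitionId: Int): List[Int] =
      if (partitionId >= outStart && partitionId < outStart + length)
        List(partitionId - outStart + inStart)
      else
        Nil
  }

  // Assumed layout of the union: child partitions 0..2 come from the
  // first parent, child partitions 3..4 come from the second parent.
  val fromFirst  = new RangeDep(inStart = 0, outStart = 0, length = 3)
  val fromSecond = new RangeDep(inStart = 0, outStart = 3, length = 2)
}
```

In this layout, child partition 4 resolves to partition 1 of the second parent, while the first parent contributes nothing to it, so `fromFirst.getParents(4)` is empty.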

(a) OneToOneDependency means that each partition of the child RDD depends on exactly one partition of the parent RDD, the one with the same index. Operators that produce a OneToOneDependency include map, filter, and flatMap. The getParents implementation is very simple: it takes a partitionId, wraps it in a List, and returns it.

    // Dependency.scala
    class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
      override def getParents(partitionId: Int): List[Int] = List(partitionId)
    }

(b) RangeDependency means that a contiguous range of child RDD partitions depends one-to-one on a contiguous range of parent RDD partitions. It is used mainly for union.

    // Dependency.scala
    class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
      extends NarrowDependency[T](rdd) {
      // inStart is the starting index of the range in the parent RDD;
      // outStart is the starting index of the range in the child RDD
      override def getParents(partitionId: Int): List[Int] = {
        if (partitionId >= outStart && partitionId < outStart + length) {
          // offset of the current index within the range, shifted to the parent range
          List(partitionId - outStart + inStart)
        } else {
          Nil
        }
      }
    }

(2) ShuffleDependency: wide dependency

A ShuffleDependency indicates that a partition of the parent RDD is used by multiple partitions of the child RDD. Building the child partitions requires a shuffle.
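The effect of a shuffle on partition usage can be sketched without Spark. Everything in the block below is an assumption made for illustration: the object `ShuffleModel`, the sample records, and the `targetPartition` helper, which routes a key by a non-negative modulo of its hash code, in the spirit of hash partitioning.

```scala
object ShuffleModel {
  // Hypothetical parent RDD contents: partition index -> (key, value) records.
  val parent: Map[Int, Seq[(String, Int)]] = Map(
    0 -> Seq("a" -> 1, "b" -> 2),
    1 -> Seq("a" -> 3, "c" -> 4)
  )

  // Route a key to a child partition using a non-negative modulo of its hash.
  def targetPartition(key: String, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // For each parent partition, the set of child partitions that receive its records.
  def childrenOfParents(numPartitions: Int): Map[Int, Set[Int]] =
    parent.map { case (parentId, records) =>
      parentId -> records.map { case (key, _) => targetPartition(key, numPartitions) }.toSet
    }
}
```

With two child partitions, the records of parent partition 0 land in both child partitions, which is the defining property of a wide dependency. All records with key "a" end up in the same child partition regardless of which parent partition they came from.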

    // Dependency.scala
    class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
        @transient private val _rdd: RDD[_
