2025-01-19 Update. From: SLTechnology News & Howtos (shulou.com)
Shulou (Shulou.com), 06/03 report:
Before learning any of Spark's APIs, it is best to build a correct mental model of Spark first; see the companion article "Correctly understand Spark."
I. Preface
For the two key-value RDD APIs reduceByKey and foldByKey, we often only know that foldByKey takes one extra argument, an initial (zero) value, compared with reduceByKey. That alone is not enough to use the two APIs well, so a detailed comparison is worthwhile. We will start with the two corresponding RDD action APIs, reduce and fold, and then work up to the usage scenarios of reduceByKey and foldByKey.
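Since the Spark snippets below need a running SparkContext, a plain Python analogue can illustrate the core distinction first: the optional initializer of functools.reduce plays the same role as fold's zero value. This is only an analogy for the semantics, not Spark code:

```python
from functools import reduce

nums = [1, 2, 3, 4]

# reduce-style: combine elements pairwise, with no initial value
total_reduce = reduce(lambda a, b: a + b, nums)

# fold-style: the same combination, but seeded with an explicit zero value,
# analogous to RDD.fold(zeroValue)(op) / foldByKey(zeroValue)(op) in Spark
total_fold = reduce(lambda a, b: a + b, nums, 0)

print(total_reduce, total_fold)  # 10 10
```

For simple addition the results are identical; the differences only show up on empty inputs and in how many temporary objects the combine step allocates, as the following sections demonstrate.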
II. RDD action APIs: reduce and fold
Let's compare these two APIs from the following two aspects.
1. Applying the two APIs to an empty RDD:
```scala
// create an empty RDD
val emptyRdd = sc.emptyRDD[Int]

// reduce on an empty RDD throws:
// java.lang.UnsupportedOperationException: empty collection
//   at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1027)
emptyRdd.reduce(_ + _)

// fold on an empty RDD does not throw; it simply returns the zero value
emptyRdd.fold(0)(_ + _) // res1: Int = 0
```
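The same empty-input asymmetry can be reproduced locally without a cluster. As an analogy (not Spark itself), Python's functools.reduce raises on an empty sequence unless an initializer is supplied:

```python
from functools import reduce

empty = []

# without an initializer, reduce on an empty sequence raises TypeError,
# mirroring reduce's UnsupportedOperationException on an empty RDD
try:
    reduce(lambda a, b: a + b, empty)
except TypeError as e:
    print("reduce failed:", e)

# with an initializer (the "zero value"), the fold-style call just returns it,
# mirroring emptyRdd.fold(0)(_ + _) returning 0
print(reduce(lambda a, b: a + b, empty, 0))  # 0
```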
2. Distinguishing the two APIs by the number of temporary objects they generate:
```scala
import scala.collection.mutable.ArrayBuffer

// create an RDD whose elements are ArrayBuffers
val testRdds = sc.parallelize(Seq(ArrayBuffer(0, 1, 3), ArrayBuffer(2, 4, 5)))

// reduce generates many intermediate temporary objects,
// because each ArrayBuffer ++ ArrayBuffer creates a new ArrayBuffer
testRdds.reduce(_ ++ _)

// fold initializes the ArrayBuffer only once; each element is appended
// to the same accumulator in place, so no intermediate objects are created
testRdds.fold(ArrayBuffer.empty[Int])((buff, elem) => buff ++= elem)
```
As the example shows, the fold version produces far fewer intermediate objects. Too many short-lived intermediate objects cause frequent GC, which hurts performance. So when choosing between reduce and fold, consider whether the combine operation creates many temporary objects; if it does, prefer fold with a mutable accumulator to avoid them.
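The allocation argument above can be sketched in plain Python (an analogy for the ArrayBuffer behavior, not Spark code): list concatenation allocates a new list at every merge step, while an in-place extend reuses one accumulator:

```python
from functools import reduce

buffers = [[0, 1, 3], [2, 4, 5], [2, 1, 3]]

# reduce-style: each `a + b` allocates a brand-new list, so a chain of
# n merges creates n intermediate lists that immediately become garbage
merged_reduce = reduce(lambda a, b: a + b, buffers)

# fold-style: one accumulator is allocated up front and mutated in place,
# like ArrayBuffer's ++= in the Scala code, so no intermediate lists appear
def fold_merge(acc, b):
    acc.extend(b)  # in-place append, returns the same accumulator object
    return acc

merged_fold = reduce(fold_merge, buffers, [])

print(merged_reduce == merged_fold)  # True
```

Both produce the same merged sequence; only the number of temporary objects differs, which is exactly the GC-pressure argument made above.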
III. Key-value RDD APIs: reduceByKey and foldByKey
Again we compare them from the same two aspects.
1. When the two APIs are applied to an empty RDD, the behavior is the same: both simply return an empty result rather than throwing an exception.
```scala
val emptyKeyValueRdd = sc.emptyRDD[(Int, Int)]
emptyKeyValueRdd.reduceByKey(_ + _).collect  // empty Array, no exception
emptyKeyValueRdd.foldByKey(0)(_ + _).collect // empty Array, no exception
```
2. Distinguishing the two APIs by the number of temporary objects they generate:
```scala
import scala.collection.mutable.ArrayBuffer

// build a key-value RDD
val testPairRdds = sc.parallelize(Seq(
  ("key1", ArrayBuffer(0, 1, 3)),
  ("key2", ArrayBuffer(2, 4, 5)),
  ("key1", ArrayBuffer(2, 1, 3))))

// reduceByKey: when a key has many values, many temporary objects are
// created, because each ArrayBuffer ++ ArrayBuffer builds a new ArrayBuffer
testPairRdds.reduceByKey(_ ++ _).collect()

// foldByKey: for each key, the ArrayBuffer is initialized only once and each
// value is appended in place, so no intermediate temporary objects are created
testPairRdds.foldByKey(ArrayBuffer.empty[Int])((buff, elem) => buff ++= elem).collect()
```
As the example shows, the trade-off between foldByKey and reduceByKey is the same as between fold and reduce: when the combine operation would generate many intermediate temporary objects, prefer foldByKey with a mutable accumulator.
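As a single-process sketch of the foldByKey aggregation pattern (fold_by_key is a hypothetical helper for illustration, not Spark's distributed implementation), one accumulator is kept per key and mutated in place:

```python
from collections import defaultdict

pairs = [("key1", [0, 1, 3]), ("key2", [2, 4, 5]), ("key1", [2, 1, 3])]

def fold_by_key(pairs):
    # the per-key "zero value" is an empty list, supplied lazily by defaultdict
    acc = defaultdict(list)
    for key, values in pairs:
        acc[key].extend(values)  # in-place append, no intermediate lists
    return dict(acc)

result = fold_by_key(pairs)
print(result)  # {'key1': [0, 1, 3, 2, 1, 3], 'key2': [2, 4, 5]}
```

However many records share a key, only one accumulator list per key is ever allocated, which is the property that makes foldByKey preferable when values are merged into a collection.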
For the principles behind the other RDD APIs and the points to watch when using them, see: a detailed explanation of the principles of the Spark core RDD API.