This article presents "Example Analysis of Key-Value-Type Transformation Operators in Spark RDD Operators". The content is easy to understand and clearly organized, and I hope it helps resolve your doubts; follow along below to study the topic.
Key-Value-type Transformation operators
Transformation operators that process Key-Value data can be roughly divided into three types: one-to-one mapping of input partitions to output partitions, aggregation, and join operations.
1. One-to-one mapping of input partitions to output partitions
mapValues(f)
A map operation is applied to the Value of each (Key, Value) pair, without touching the Key.
The boxes in Figure 3-19 represent RDD partitions. The function a => a + 2 adds 2 to the value 1 of the data (V1, 1), so the returned result is (V1, 3).
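As a minimal sketch of this mapValues behavior (assuming it runs in spark-shell, where sc is the predefined SparkContext, and using illustrative sample data):

```scala
// Sample Key-Value data mirroring the Figure 3-19 example (illustrative values).
val rdd = sc.parallelize(Seq(("V1", 1), ("V2", 2)))

// mapValues applies the function only to the values; the keys are left unchanged.
val result = rdd.mapValues(a => a + 2)

result.collect().foreach(println)  // e.g. (V1,3), (V2,4)
```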
2. Aggregation on a single RDD or two RDDs
(1) Aggregation on a single RDD
1) combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine, serializerClass)
Description:
createCombiner: V => C. Used when no combiner C exists yet for a Key, for example creating a Seq C from a value V.
mergeValue: (C, V) => C. Used when C already exists and a new value must be merged in, for example appending item V to Seq C, or accumulating it.
mergeCombiners: (C, C) => C. Merges two C values.
partitioner: Partitioner. The shuffle partitions data according to the Partitioner's partitioning strategy.
mapSideCombine: Boolean = true. To reduce the amount of data transferred, much of the combining can be done on the map side first; for a summation, for example, the values of all identical Keys within a partition can be summed before the shuffle.
serializerClass: String = null. The transferred data needs to be serialized, and the user can supply a custom serialization class.
For example, it is equivalent to transforming an RDD whose elements are of type (Int, Int) into an RDD of type (Int, Seq[Int]).
The boxes in Figure 3-20 represent RDD partitions. Through combineByKey, the data (V1, 2) and (V1, 1) are merged into (V1, Seq(2, 1)).
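A minimal sketch of the example above, collecting the values of each Key into a Seq (assuming spark-shell with the predefined sc; the sample data is illustrative):

```scala
val rdd = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 3)))

// createCombiner: V => C          -- wrap the first value seen for a Key in a Seq
// mergeValue:     (C, V) => C     -- append a further value to the existing Seq
// mergeCombiners: (C, C) => C     -- concatenate Seqs built on different partitions
val combined = rdd.combineByKey(
  (v: Int) => Seq(v),
  (c: Seq[Int], v: Int) => c :+ v,
  (c1: Seq[Int], c2: Seq[Int]) => c1 ++ c2
)

combined.collect().foreach(println)  // e.g. (V1,List(2, 1)), (V2,List(3))
```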
2) reduceByKey(func, numPartitions)
reduceByKey is a simpler special case in which two values of the same Key are merged into one value, so createCombiner is trivial (it simply returns v), and mergeValue and mergeCombiners share the same logic, with no difference between them.
The boxes in Figure 3-21 represent RDD partitions. Through the user-defined function (a, b) => a + b, the values of the data (V1, 2) and (V1, 1), which share the same Key, are added, and the result is (V1, 3).
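A minimal sketch of this reduceByKey example (same spark-shell assumption, illustrative data):

```scala
val rdd = sc.parallelize(Seq(("V1", 2), ("V1", 1), ("V2", 3)))

// (a, b) => a + b merges two values with the same Key into their sum.
val summed = rdd.reduceByKey((a, b) => a + b)

summed.collect().foreach(println)  // e.g. (V1,3), (V2,3)
```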
3) partitionBy(numPartitions, partitionFunc)
The partitionBy function re-partitions an RDD. If the original RDD's partitioner is already consistent with the specified partitioner, no re-partitioning is performed; otherwise, the operation is equivalent to generating a new ShuffledRDD according to the new partitioner.
The boxes in Figure 3-22 represent RDD partitions. Under the new partitioning strategy, the V1 and V2 data from different partitions are consolidated into one partition.
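A minimal sketch of partitionBy using a HashPartitioner (spark-shell assumption, illustrative data and partition counts):

```scala
import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(Seq(("V1", 1), ("V2", 2), ("V1", 3)), 4)

// Re-partition by Key with a HashPartitioner; if the RDD already had an equal
// partitioner, Spark keeps the existing partitioning instead of shuffling again.
val repartitioned = rdd.partitionBy(new HashPartitioner(2))

println(repartitioned.getNumPartitions)  // 2
```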
(2) Aggregation on two RDDs
cogroup(other, numPartitions)
The cogroup function co-partitions two RDDs. For the Key-Value elements of the two RDDs, it gathers the elements with the same Key in each RDD into a collection, and returns, for each Key, the collections of that Key's elements from both RDDs. The result is of Key-Value form, where the Value is a tuple made up of two iterators, one over each RDD's values for that Key.
The large boxes in Figure 3-23 represent RDDs, and the small boxes inside them represent partitions within an RDD. The data (U1, 1) and (U1, 2) in RDD1 and the data (U1, 2) in RDD2 are merged into (U1, ((1, 2), (2))).
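A minimal sketch of this cogroup example (spark-shell assumption, illustrative data):

```scala
val rdd1 = sc.parallelize(Seq(("U1", 1), ("U1", 2), ("U2", 1)))
val rdd2 = sc.parallelize(Seq(("U1", 2)))

// cogroup gathers, per Key, the values from both RDDs into a pair of iterables.
val grouped = rdd1.cogroup(rdd2)  // RDD[(String, (Iterable[Int], Iterable[Int]))]

grouped.collect().foreach { case (k, (left, right)) =>
  println(s"($k, (${left.mkString(",")}), (${right.mkString(",")}))")
}
// e.g. (U1, (1,2), (2)) and (U2, (1), ())
```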
3. Join
(1) join
join applies the cogroup function to the two RDDs being joined (the principle of cogroup is described above). On the new RDD formed by cogroup, it performs a Cartesian product of the values under each Key, flattens the returned result into the collection of all tuples for that Key, and finally returns RDD[(K, (V, W))].
Figure 3-24 is a schematic diagram of the join operation on two RDDs. The large boxes represent RDDs and the small boxes represent partitions within an RDD. Elements with the same Key (for example, V1) are joined, so each joined result for that Key pairs a value from the first RDD with a value from the second.
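A minimal sketch of join (spark-shell assumption, illustrative data):

```scala
val rdd1 = sc.parallelize(Seq(("V1", 1), ("V2", 3)))
val rdd2 = sc.parallelize(Seq(("V1", 1), ("V1", 2)))

// join cogroups the two RDDs and then takes the Cartesian product of the
// values under each Key, producing RDD[(K, (V, W))].
val joined = rdd1.join(rdd2)

joined.collect().foreach(println)  // e.g. (V1,(1,1)) and (V1,(1,2)); V2 has no match and is dropped
```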
(2) leftOuterJoin and rightOuterJoin
leftOuterJoin (left outer join) and rightOuterJoin (right outer join) are equivalent to join plus a check of whether a matching element exists on one side: if a Key has no matching element in the other RDD, that side of the result is filled with an empty value (None); if it does, the data are joined as in join and the result is returned.
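A minimal sketch of both outer joins (spark-shell assumption, illustrative data):

```scala
val rdd1 = sc.parallelize(Seq(("V1", 1), ("V2", 3)))
val rdd2 = sc.parallelize(Seq(("V1", 2)))

// leftOuterJoin keeps every Key of rdd1; a missing right-side value becomes None.
val left = rdd1.leftOuterJoin(rdd2)    // RDD[(String, (Int, Option[Int]))]
left.collect().foreach(println)        // e.g. (V1,(1,Some(2))), (V2,(3,None))

// rightOuterJoin keeps every Key of rdd2; a missing left-side value becomes None.
val right = rdd1.rightOuterJoin(rdd2)  // RDD[(String, (Option[Int], Int))]
right.collect().foreach(println)       // e.g. (V1,(Some(1),2))
```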
The above is the full content of "Example Analysis of Key-Value-Type Transformation Operators in Spark RDD Operators". Thank you for reading! I hope the shared content has helped you gain a clearer understanding.