
What are the categories of Spark RDD operators


This article explains the categories into which Spark RDD operators are divided. The methods introduced here are simple, fast, and practical. Let's take a look at what categories Spark RDD operators fall into.

RDD operators can be roughly divided into two categories:

1. Transformation (conversion) operators: these do not trigger job submission; they make up the intermediate stages of a job.

2. Action operators: these trigger SparkContext to submit a job.
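
A minimal sketch of this difference (assuming a running spark-shell where sc is the SparkContext; the variable names are illustrative, not from the original):

scala> val nums = sc.parallelize(1 to 5)   // illustrative name
scala> val doubled = nums.map(_ * 2)       // transformation: only records the lineage, no job runs yet
scala> doubled.count                       // action: triggers SparkContext to submit a job
res0: Long = 5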

One: Transformation (conversion) operators

1. Map:

Each data item of the original RDD is transformed into a new element by the user-defined function f passed to map. In the source code, the map operator is equivalent to creating a new RDD, MappedRDD(this, sc.clean(f)). That is:

Map applies the specified function to each element in the RDD to generate a new RDD. Every element in the original RDD corresponds to exactly one element in the new RDD.

scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:27
scala> val b = a.map(x => x * 3)
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[7] at map at <console>:29
scala> a.collect
res7: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> b.collect
res8: Array[Int] = Array(3, 6, 9, 12, 15, 18, 21, 24, 27)

In the above example, each element in the original RDD is multiplied by 3 to produce a new RDD.

2. MapPartitions:

The mapPartitions function obtains an iterator over each partition and operates on the elements of the whole partition through that iterator. Internally it generates a MapPartitionsRDD.

scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:27
scala> a.collect
res11: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> var c = a.mapPartitions(a => a.filter(_ >= 7))
c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[11] at mapPartitions at <console>:29
scala> c.collect
res12: Array[Int] = Array(7, 8, 9)

The above example filters the data in each partition through the function filter.

3. MapValues

Operates on the Value of (Key, Value) data without touching the Key. That is:

MapValues, as its name implies, applies the input function to the Value of each Key-Value pair in the RDD. The Key in the original RDD remains unchanged and, together with the new Value, forms the elements of the new RDD. Therefore, this function applies only to RDDs whose elements are KV pairs.

scala> val a = sc.parallelize(List("Hadoop", "HBase", "Hive", "Spark"), 2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[12] at parallelize at <console>:27
scala> val b = a.map(x => (x.length, x))
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[13] at map at <console>:29
scala> b.mapValues("x" + _ + "x").collect
res14: Array[(Int, String)] = Array((6,xHadoopx), (5,xHBasex), (4,xHivex), (5,xSparkx))

4. MapWith:

MapWith is another variant of map. Map requires only one input function, while mapWith has two input functions.

Eg: multiply the partition index by 10, then add 2, and use the result as the elements of the new RDD. (The 3 splits the ten numbers into three partitions.)

scala> val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
scala> x.mapWith(a => a * 10)((a, b) => (b + 2)).collect
res16: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)

5. FlatMap:

Each element in the original RDD is converted into new elements through the function f, and the elements of each generated collection are merged into one collection; internally this creates FlatMappedRDD(this, sc.clean(f)). That is:

Similar to map, except that an element in the original RDD produces exactly one element after map, whereas it can produce multiple elements after flatMap, which together build the new RDD.

Eg: for each element x in the original RDD, produce the elements 1 to x (so element x yields x new elements).

scala> val a = sc.parallelize(1 to 4, 2)
scala> val b = a.flatMap(x => 1 to x)
scala> a.collect
res17: Array[Int] = Array(1, 2, 3, 4)
scala> b.collect
res18: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)

6. FlatMapWith:

FlatMapWith and mapWith are very similar; both take two functions. One function takes the partition index as input and outputs a value of a new type A; the other function takes the pair (T, A) as input and outputs a sequence, and the elements of these sequences form the new RDD.

scala> val a = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
scala> a.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect
res1: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)

7. FlatMapValues:

FlatMapValues is similar to mapValues, except that flatMapValues is applied to Value in RDD whose elements are KV pairs. The Value of each element is mapped to a series of values by the input function, and then these values form a series of new KV pairs with the Key in the original RDD.

scala> val a = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
scala> a.collect
res2: Array[(Int, Int)] = Array((1,2), (3,4), (3,6))
scala> val b = a.flatMapValues(x => x.to(5))
scala> b.collect
res3: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (3,4), (3,5))

In the above example, the Value of each element in the original RDD is converted into a sequence (from its current value up to 5). For example, the first KV pair (1,2) has its value 2 converted into 2, 3, 4, 5, which then form a series of new KV pairs with the original Key: (1,2), (1,3), (1,4), (1,5).

8. Reduce:

Reduce passes a pair of elements of the RDD to the input function and produces a new value; that new value and the next element of the RDD are then passed to the input function again, until only one value remains.

Eg: summation of elements.

scala> val a = sc.parallelize(1 to 10)
scala> a.reduce((x, y) => x + y)
res5: Int = 55

9. ReduceByKey

As the name implies, reduceByKey reduces the Values of elements with the same Key in an RDD of KV pairs: the Values of all elements sharing a Key are reduced to a single value, which then forms a new KV pair with that Key.

Eg: sum the Values of elements with the same Key, so the two elements with Key 3 are converted into (3,10).

scala> val a = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
scala> a.reduceByKey((x, y) => x + y).collect
res6: Array[(Int, Int)] = Array((1,2), (3,10))

10. Cartesian:

Performs a Cartesian product over all elements of the two RDDs (memory-intensive); internally it returns a CartesianRDD.

scala> val a = sc.parallelize(List(1, 2, 3))
scala> val b = sc.parallelize(List(4, 5, 6))
scala> val c = a.cartesian(b)
scala> c.collect
res15: Array[(Int, Int)] = Array((1,4), (1,5), (1,6), (2,4), (2,5), (2,6), (3,4), (3,5), (3,6))

11. Sample:

Sample samples the elements of the RDD to obtain a subset of all elements. The user sets whether sampling is with replacement, the sampling fraction, and the random seed, which together determine the sampling behaviour.

Internal implementation: SampledRDD(withReplacement, fraction, seed).

Function parameter settings:

withReplacement=true indicates sampling with replacement.

withReplacement=false indicates sampling without replacement.

The data is sampled according to the proportion specified by fraction, and seed specifies the seed of the random number generator.

scala> val a = sc.parallelize(1 to 100, 3)
scala> a.sample(false, 0.1, 0).count
res16: Long = 12
scala> a.sample(false, 0.1, 0).collect
res17: Array[Int] = Array(10, 47, 55, 73, 76, 84, 87, 88, 91, 92, 95, 98)
scala> a.sample(true, 0.7, scala.util.Random.nextInt(10000)).count
res19: Long = 75
scala> a.sample(true, 0.7, scala.util.Random.nextInt(10000)).collect
res20: Array[Int] = Array(1, 3, 3, 3, 5, 6, 9, 9, 9, 9, 10, 15, 17, 20, 23, 23, 27, 28, 31, 32, 32, 34, 35, 36, 36, 36, 36, 38, 39, 41, 42, 42, 43, 45, 47, 49, 49, 50, 50, 51, 54, 55, 55, 57, 57, 57, 57, 57, 59, 61, 61, 63, 67, 72, 74, 76, 76, 80, 80, 81, 81, 82, 83, 85, 87, 88, 90, 93, 95, 96, 97, 97, 99, 100)

12. Union:

When using the union function, the element types of the two RDDs must be identical, and the returned RDD has the same element type as the merged RDDs. union does not deduplicate; it keeps all elements. If you want deduplication, use distinct(). Spark also provides a more concise API: the ++ operator performs the same operation as the union function.

Eg: the union of a and b

scala> val a = sc.parallelize(List(("A", 1), ("B", 2), ("c", 3), ("A", 4), ("C", 5)))
scala> val b = sc.parallelize(List(("A", 5), ("B", 6), ("A", 4), ("C", 9)))
scala> a.union(b).collect
res22: Array[(String, Int)] = Array((A,1), (B,2), (c,3), (A,4), (C,5), (A,5), (B,6), (A,4), (C,9))

To deduplicate:

scala> val d = sc.parallelize(List(("A", 5), ("B", 6), ("A", 5)))
scala> d.distinct.collect
res25: Array[(String, Int)] = Array((B,6), (A,5))

13. GroupBy:

Each element is mapped to a Key by the given function, converting the data into Key-Value format, and then the elements with the same Key are grouped together.

Eg: group the data according to the Key of each element in the dataset

scala> val a = sc.parallelize(List(("A", 1), ("B", 2), ("c", 3), ("A", 4), ("C", 5)))
scala> a.groupByKey().collect
res21: Array[(String, Iterable[Int])] = Array((B,CompactBuffer(2)), (A,CompactBuffer(1, 4)), (C,CompactBuffer(5)), (c,CompactBuffer(3)))

14. Join:

Join performs a cogroup operation on the two RDDs to be joined, which first puts data with the same key into the same partition. After the cogroup operation, a new RDD is formed and a Cartesian product is taken over the elements under each key; the result is flattened so that all tuples under a key form one set. Finally an RDD[(K, (V, W))] is returned.

Eg: joining the two datasets a and b, which is equivalent to a table join

scala> val a = sc.parallelize(List(("A", 1), ("B", 2), ("c", 3), ("A", 4), ("C", 5)))
scala> val b = sc.parallelize(List(("A", 5), ("B", 6), ("A", 4), ("C", 9)))
scala> a.join(b).collect
res23: Array[(String, (Int, Int))] = Array((B,(2,6)), (A,(1,5)), (A,(1,4)), (A,(4,5)), (A,(4,4)), (C,(5,9)))

15. Cache:

Cache caches RDD elements from disk into memory. It is equivalent to the persist(MEMORY_ONLY) function.
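
A minimal usage sketch (the RDD and the repeated count are illustrative, not from the original):

scala> val a = sc.parallelize(1 to 1000)
scala> a.cache()     // mark the RDD to be kept in memory, same as persist(MEMORY_ONLY)
scala> a.count       // the first action computes the partitions and caches them
scala> a.count       // later actions read the cached partitions from memory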

16. Persist:

The persist function caches the RDD; where the data is cached is determined by the enumerated type StorageLevel. DISK means disk, MEMORY means memory, and SER means the data is stored serialized.

Function definition: persist(newLevel: StorageLevel)

StorageLevel is an enumerated type that represents the storage mode.

MEMORY_AND_DISK_SER means the data can be stored in both memory and disk and is stored in serialized form; the other levels follow the same naming pattern.
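
For illustration, a hedged sketch of persisting with the MEMORY_AND_DISK_SER level mentioned above (the RDD is illustrative):

scala> import org.apache.spark.storage.StorageLevel
scala> val a = sc.parallelize(1 to 1000)
scala> a.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized storage; spills to disk when memory is insufficient
scala> a.count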

Two: Action operators

1. Foreach:

Foreach applies the function f to each element in the RDD. It returns Unit rather than an RDD or Array.

scala> val a = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
scala> a.foreach(println(_))
4 5 6 7 8 9 1 2 3

2. SaveAsTextFile:

The function outputs the data and stores it in the specified directory of HDFS.

Internally, the function is implemented by calling saveAsHadoopFile:

this.map(x => (NullWritable.get(), new Text(x.toString)))
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)

Each element in the RDD is mapped to (null, x.toString) and then written to HDFS.
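
A usage sketch (the output path below is purely illustrative):

scala> val a = sc.parallelize(1 to 9, 3)
scala> a.saveAsTextFile("hdfs:///tmp/rdd_output")   // illustrative path; one part file is written per partition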

3. Collect:

Collect is equivalent to toArray, but toArray is deprecated and not recommended. Collect returns the distributed RDD to the driver as a local Scala Array, on which Scala's functional operations can then be used.
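
A small sketch of collect bringing the distributed data back to the driver as a local Array (names are illustrative):

scala> val a = sc.parallelize(1 to 5, 2)
scala> val local = a.collect
local: Array[Int] = Array(1, 2, 3, 4, 5)
scala> local.map(_ * 2).sum   // ordinary Scala collection operations on the driver
res1: Int = 30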

4. Count:

Count returns the number of elements for the entire RDD.

scala> val a = sc.parallelize(1 to 10)
scala> a.collect
res9: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> a.count
res10: Long = 10

At this point, I believe you have a deeper understanding of the categories into which Spark RDD operators are divided. Why not try them out in practice? Follow us and keep learning!
