This article introduces, through example analysis, the Map- and Reduce-related operations in the Spark RDD API. It is quite detailed and has certain reference value; interested readers are encouraged to read it through to the end!
What is an RDD?
RDD is an abstract data structure type in Spark; any data in Spark is represented as an RDD. From a programming point of view, an RDD can simply be thought of as an array. Unlike an ordinary array, though, the data in an RDD is stored in partitions, so that the data in different partitions can be distributed across different machines and processed in parallel. All a Spark application does, therefore, is convert the data to be processed into RDDs and then perform a series of transformations and actions on those RDDs to get the result. This article is the first part and introduces the Map- and Reduce-related APIs in Spark RDD.
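To make that pattern concrete, here is a minimal sketch, assuming a spark-shell session with its built-in SparkContext sc (filter and count are standard RDD operations not covered in this article):
scala> val data = sc.parallelize(1 to 100, 4)     // distribute the numbers 1 to 100 into 4 partitions
scala> val doubled = data.map(x => x * 2)         // transformation: lazily describes a new RDD
scala> doubled.filter(x => x > 100).count()       // action: triggers the actual parallel computation
res20: Long = 50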
How do I create an RDD?
An RDD can be created from an ordinary array, or from a file in the local file system or in HDFS.
Example: create an RDD from an ordinary array containing the nine numbers 1 to 9, spread across three partitions.
scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12
For example: read the file README.md to create an RDD; each line in the file becomes one element of the RDD.
scala> val b = sc.textFile("README.md")
b: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at <console>:12
Although there are other ways to create an RDD, in this article we mainly use these two to create the RDDs that illustrate the API.
Map
map executes a specified function on each element of the RDD to produce a new RDD. Each element of the original RDD corresponds to exactly one element in the new RDD.
For example:
scala> val a = sc.parallelize(1 to 9, 3)
scala> val b = a.map(x => x * 2)
scala> a.collect
res10: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> b.collect
res11: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18)
In the above example, each element in the original RDD is multiplied by 2 to produce a new RDD.
MapPartitions
mapPartitions is a variant of map. The input function of map is applied to each element of the RDD, while the input function of mapPartitions is applied to each partition; that is, the contents of each partition are treated as a whole.
Its function is defined as:
def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]
f is the input function, which processes the contents of one partition. The contents of each partition are passed to f as an Iterator[T], and f's output is an Iterator[U]. The final RDD is obtained by merging the results of applying the input function to every partition.
For example:
scala> val a = sc.parallelize(1 to 9, 3)
scala> def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
     |   var res = List[(T, T)]()
     |   var pre = iter.next
     |   while (iter.hasNext) {
     |     val cur = iter.next
     |     res ::= (pre, cur)
     |     pre = cur
     |   }
     |   res.iterator
     | }
scala> a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
The function myfunc in the above example pairs each element in a partition with the element that follows it, forming a Tuple. Because the last element of a partition has no following element (partition boundaries are not crossed), (3,4) and (6,7) do not appear in the result.
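If you want to check the partition boundaries yourself, glom (which gathers the contents of each partition into an array) is a quick way to do it; a small sketch, reusing the RDD a from above:
scala> a.glom.collect
res21: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))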
There are also variants of mapPartitions, such as mapPartitionsWithContext, which can pass some state information during processing to user-specified input functions. There is also mapPartitionsWithIndex, which passes the index of the partition to the user-specified input function.
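As a minimal sketch of mapPartitionsWithIndex (reusing the RDD a from above), here each element is tagged with the index of the partition it lives in:
scala> a.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x))).collect
res22: Array[(Int, Int)] = Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), (2,8), (2,9))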
MapValues
mapValues, as the name implies, applies the input function to the Value of each Key-Value pair in the RDD. The Key of each element in the original RDD stays unchanged and, together with the new Value, forms the elements of the new RDD. This function therefore applies only to RDDs whose elements are KV pairs.
For example:
scala> val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
scala> val b = a.map(x => (x.length, x))
scala> b.mapValues("x" + _ + "x").collect
res5: Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))
MapWith
MapWith is another variant of map. Map requires only one input function, while mapWith has two input functions. It is defined as follows:
def mapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => U): RDD[U]
The first function, constructA, takes the RDD's partition index (the index starts at 0) as input and produces an output of the new type A.
The second function, f, takes a binary tuple (T, A) as input (where T is an element of the original RDD and A is the output of the first function), and produces output of type U.
For example: multiply the partition index by 10 and then add 2, using the result as the elements of the new RDD.
scala> val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
scala> x.mapWith(a => a * 10)((a, b) => (b + 2)).collect
res4: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)
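Note that mapWith was deprecated in later Spark releases in favor of mapPartitionsWithIndex. Assuming such a version, the same result can be obtained with a sketch like this:
scala> x.mapPartitionsWithIndex { (index, iter) =>
     |   val a = index * 10        // what constructA computed, once per partition
     |   iter.map(_ => a + 2)      // f ignores the element and returns index * 10 + 2
     | }.collect
res23: Array[Int] = Array(2, 2, 2, 12, 12, 12, 22, 22, 22, 22)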
FlatMap
flatMap is similar to map. The difference is that under map each element of the original RDD produces exactly one element, while under flatMap each element of the original RDD can produce multiple elements, which build the new RDD.
For example: generate y elements for each element x in the original RDD (the numbers from 1 to y, where y is the value of element x).
scala> val a = sc.parallelize(1 to 4, 2)
scala> val b = a.flatMap(x => 1 to x)
scala> b.collect
res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
FlatMapWith
flatMapWith is very similar to mapWith; both take two functions. One function takes the partition index as input and outputs a value of a new type A; the other takes a binary tuple (T, A) as input and outputs a sequence, and the elements of these sequences make up the new RDD. It is defined as follows:
def flatMapWith[A: ClassTag, U: ClassTag](constructA: Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]): RDD[U]
For example:
scala> val a = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)
scala> a.flatMapWith(x => x, true)((x, y) => List(y, x)).collect
res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9)
FlatMapValues
flatMapValues is similar to mapValues, except that flatMapValues applies to the Values of an RDD whose elements are KV pairs. Each element's Value is mapped by the input function to a series of values, which then form a series of new KV pairs with the original element's Key.
For example:
scala> val a = sc.parallelize(List((1,2), (3,4), (3,6)))
scala> val b = a.flatMapValues(x => x.to(5))
scala> b.collect
res3: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (3,4), (3,5))
In the above example, the Value of each element in the original RDD is turned into a sequence running from its current value up to 5. For example, the first KV pair (1,2) has its Value 2 turned into 2, 3, 4, 5, which then form a series of new KV pairs with the original Key: (1,2), (1,3), (1,4), (1,5). The element (3,6) produces nothing, because 6 is already greater than 5 and its sequence is empty.
Reduce
reduce passes a pair of elements of the RDD to the input function and produces a new value; the new value and the next element of the RDD are then passed to the input function again, and so on until only one value is left. (Because partitions are reduced in parallel, the input function should be commutative and associative.)
For example:
scala> val c = sc.parallelize(1 to 10)
scala> c.reduce((x, y) => x + y)
res4: Int = 55
The above example sums the elements of the RDD.
ReduceByKey
As the name implies, reduceByKey reduces the Values of elements that share the same Key in an RDD of KV pairs. The Values of the multiple elements with the same Key are reduced to a single value, which then forms a new KV pair with the original Key.
For example:
scala> val a = sc.parallelize(List((1,2), (3,4), (3,6)))
scala> a.reduceByKey((x, y) => x + y).collect
res7: Array[(Int, Int)] = Array((1,2), (3,10))
In the above example, the Values of elements with the same Key are summed, so the two elements with Key 3 are reduced to (3,10).
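The reducing function is not limited to addition. As another small sketch, reusing the RDD a from above, this keeps the largest Value for each Key:
scala> a.reduceByKey((x, y) => math.max(x, y)).collect
res24: Array[(Int, Int)] = Array((1,2), (3,6))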
That is all of the content of the article "Example Analysis of Map and Reduce in Spark RDD API". Thanks for reading! I hope what has been shared here is helpful to you; for more related knowledge, welcome to follow the industry information channel!