This article introduces how to use Spark's Transformation and Action operators. Many people run into trouble with these in real projects, so the examples below walk through each operator in turn. I hope you read them carefully and get something out of them!
1. Demonstration of Transformation operators
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Test").setMaster("local")
val sc = new SparkContext(conf)
// create an RDD by parallelizing a local collection
val rdd = sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 10))
// map: multiply each element of the RDD by 2, then sort the result
val rdd2: RDD[Int] = rdd.map(_ * 2).sortBy(x => x, true)
// collect returns all elements of the dataset as an array (it is an Action operator)
println(rdd2.collect().toBuffer)
// filter: the resulting RDD keeps only the input elements for which the function returns true
val rdd3: RDD[Int] = rdd2.filter(_ > 10)
println(rdd3.collect().toBuffer)
val rdd4 = sc.parallelize(Array("a b c", "b c d"))
// flatMap: split each element of rdd4 and flatten the results
val rdd5: RDD[String] = rdd4.flatMap(_.split(" "))
println(rdd5.collect().toBuffer)
// to flatten a nested collection, nest the calls: rddOfLists.flatMap(_.flatMap(_.split(" ")))
// sample: random sampling
// withReplacement: whether sampled elements are put back; true samples with replacement, false without
// fraction: the expected sampling ratio, e.g. 0.3 for roughly 30%; it is a target, not an exact count
// seed: seeds the random number generator; it has a default value and need not be passed
val rdd5_1 = sc.parallelize(1 to 10)
val sample = rdd5_1.sample(false, 0.5)
println(sample.collect().toBuffer)
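// Passing the third argument seeds the sampler so runs are reproducible; the seed
// value 42 below is an arbitrary illustrative choice, not part of the original demo:
println(rdd5_1.sample(false, 0.3, 42).collect().toBuffer)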
// union: union of two RDDs
val rdd6 = sc.parallelize(List(5, 6, 7, 8))
val rdd7 = sc.parallelize(List(1, 2, 5, 6)) // illustrative values; the original definition was dropped
val rdd8 = rdd6 union rdd7
println(rdd8.collect.toBuffer)
// intersection: intersection of two RDDs
val rdd9 = rdd6 intersection rdd7
println(rdd9.collect.toBuffer)
// distinct: remove duplicates
println(rdd8.distinct.collect.toBuffer)
// join: pairs with the same key are combined
val rdd10_1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd10_2 = sc.parallelize(List(("jerry", 2), ("tom", 2), ("dog", 10)))
val rdd10_3 = rdd10_1 join rdd10_2
println(rdd10_3.collect().toBuffer)
// left outer join and right outer join
// the non-base side is wrapped in Option, because its value may be absent
val rdd10_4 = rdd10_1 leftOuterJoin rdd10_2 // keeps every key of the left side
val rdd10_5 = rdd10_1 rightOuterJoin rdd10_2 // keeps every key of the right side
println(rdd10_4.collect().toList)
println(rdd10_5.collect().toBuffer)
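// The Option side of an outer join can be unwrapped with getOrElse; a minimal
// sketch (the default value 0 is an illustrative assumption):
val rdd10_6 = rdd10_4.mapValues { case (l, r) => (l, r.getOrElse(0)) }
println(rdd10_6.collect().toList)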
val rdd11_1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd11_2 = sc.parallelize(List(("jerry", 2), ("tom", 2), ("dog", 10))) // illustrative values; the original definition was dropped
// cartesian: Cartesian product
val rdd11_3 = rdd11_1 cartesian rdd11_2
println(rdd11_3.collect.toBuffer)
// groupBy: group by the key produced by the passed-in function
val rdd11_4 = rdd11_1 union rdd11_2 // merged first, as groupByKey below requires
val rdd11_5_1 = rdd11_4.groupBy(_._1)
println(rdd11_5_1.collect().toList)
val rdd11_5_2 = rdd11_4.groupByKey
println(rdd11_5_2.collect().toList)
// the difference between cogroup and groupByKey:
// cogroup does not require merging the datasets first; for each key it pairs the
// value collections from the different datasets side by side
// groupByKey works on a single dataset, so the data must be merged first and is then grouped by key
val rdd11_6: RDD[(String, (Iterable[Int], Iterable[Int]))] = rdd11_1 cogroup rdd11_2
println(rdd11_6.collect.toList)
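// For the sample data above, the two results differ in shape (element order within
// the collections is not guaranteed):
//   rdd11_4.groupByKey      -> e.g. ("tom", [1, 2]): one merged collection per key
//   rdd11_1 cogroup rdd11_2 -> e.g. ("tom", ([1], [2])), ("dog", ([], [10])): one collection per source dataset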
2. Demonstration of Action operators
val conf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sc = new SparkContext(conf)
/* Action operators */
// reduce: aggregate the elements of the RDD with the given function
val rdd1 = sc.parallelize(List(2, 1, 3, 6, 5))
val rdd1_1 = rdd1.reduce(_ + _)
println(rdd1_1)
println(rdd1.collect().toBuffer)
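// reduce is the simplest aggregation; for reference, the more general aggregate action
// takes a zero value plus separate within-partition and cross-partition functions.
// A minimal sketch, not part of the original demo:
val sum = rdd1.aggregate(0)(_ + _, _ + _) // seqOp folds within a partition, combOp merges partition results
println(sum) // 17 for the list above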
// count: return the number of elements in the RDD
println(rdd1.count())
// top: take the given number of elements, sorted in descending order by default; passing 0 returns an empty array
println(rdd1.top(3).toBuffer)
// takeOrdered: take the given number of elements, sorted in ascending order by default
println(rdd1.takeOrdered(3).toBuffer)
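// takeOrdered also accepts an explicit Ordering; reversing it reproduces top:
println(rdd1.takeOrdered(3)(Ordering[Int].reverse).toBuffer) // same elements as rdd1.top(3)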
// first: take the first element; equivalent to take(1)
println(rdd1.first())
// write the processed data to a file (stored in HDFS or the local file system)
// rdd1.saveAsTextFile("dir/file1")
// countByKey: count occurrences per key, returning a Map whose key is the RDD's key and whose value is that key's count
val rdd2 = sc.parallelize(List(("key1", 2), ("key2", 1), ("key3", 3), ("key4", 6), ("key5", 5)), 2)
val rdd2_1 = rdd2.countByKey()
println(rdd2_1)
// foreach: traverse the data
rdd2.foreach(println)
/* other operators */
// countByValue: count occurrences per value, treating each whole element of the collection as a value
val value: collection.Map[(String, Int), Long] = rdd2.countByValue()
println(value)
// filterByRange: filter the elements of the RDD, returning those whose keys fall in the given range
val rdd3 = sc.parallelize(List(("e", 5), ("c", 3), ("d", 4), ("c", 2), ("a", 1)))
val rdd3_1: RDD[(String, Int)] = rdd3.filterByRange("c", "e") // both bounds inclusive
println(rdd3_1.collect.toList)
// flatMapValues: flatMap over the values of a pair RDD, keeping each key
val rdd3_2 = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))
println(rdd3_2.flatMapValues(_.split(" ")).collect.toList)
// foreachPartition: iterate over the data one partition at a time
// foreachPartition is typically used for persistence, e.g. writing to a database,
// because per-partition work such as opening a connection is done only once per partition
val rdd4 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3) // 3 partitions so the per-partition sums are predictable
rdd4.foreachPartition(x => println(x.reduce(_ + _)))
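// A sketch of the persistence pattern described above; `createConnection` and
// `insert` are hypothetical placeholders (any real JDBC or client API has the same shape):
rdd4.foreachPartition { partition =>
  // val conn = createConnection() // hypothetical: open once per partition, not per record
  partition.foreach { record =>
    // conn.insert(record)              // hypothetical per-record write over the shared connection
    println(s"would persist: $record")  // stand-in so the sketch runs as-is
  }
  // conn.close()
}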
// keyBy: build tuples whose key is the function's return value and whose value is the original RDD element
val rdd5 = sc.parallelize(List("dog", "cat", "pig", "wolf", "bee"), 3)
val rdd5_1: RDD[(Int, String)] = rdd5.keyBy(_.length)
println(rdd5_1.collect.toList)
// keys returns all keys; values returns all values
println(rdd5_1.keys.collect.toList)
println(rdd5_1.values.collect.toList)
// collectAsMap: collect a pair RDD into a Map (with duplicate keys, later values overwrite earlier ones)
val map: collection.Map[Int, String] = rdd5_1.collectAsMap()
println(map)
That's all for "how to use Transformation and Action operators". Thank you for reading, and I hope you got something out of it!