How to use Transformation and Action

2025-02-25 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article introduces how to use Spark's Transformation and Action operators. Many people run into difficulties with these in real cases, so let's walk through how to handle them. I hope you read carefully and get something out of it!

I. Demonstration of Transformation operators

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("Test").setMaster("local")
val sc = new SparkContext(conf)

// generate an RDD through parallelization
val rdd = sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 10))

// map: multiply each element in the RDD by 2, then sort
val rdd2: RDD[Int] = rdd.map(_ * 2).sortBy(x => x, true)

// collect returns all elements of the dataset as an array (it is an Action operator)
println(rdd2.collect().toBuffer)

// filter: the new RDD consists of the input elements for which the function returns true
val rdd3: RDD[Int] = rdd2.filter(_ > 10)
println(rdd3.collect().toBuffer)

val rdd4 = sc.parallelize(Array("a b c", "b c d"))

// flatMap: split each element of rdd4 and flatten the results
val rdd5: RDD[String] = rdd4.flatMap(_.split(" "))
println(rdd5.collect().toBuffer)

// nested collections can be flattened with flatMap(_.flatMap(_.split(" ")))

// sample: random sampling
// withReplacement indicates whether sampled elements are put back: true samples with replacement, false without
// fraction: the expected sampling ratio, e.g. 0.3 for roughly 30%; it is a floating-point value and not exact
// seed: the seed for the random number generator; it has a default if not passed
val rdd5_1 = sc.parallelize(1 to 10)
val sample = rdd5_1.sample(false, 0.5)
println(sample.collect().toBuffer)
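Because `fraction` is only an expected ratio, the sample size varies from run to run. A plain-Python sketch (not Spark code) of per-element Bernoulli sampling without replacement shows why:

```python
import random

def sample_without_replacement(data, fraction, seed=None):
    """Mimic RDD.sample(False, fraction): keep each element
    independently with probability `fraction`, so the output
    size is only approximately fraction * len(data)."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

picked = sample_without_replacement(list(range(1, 11)), 0.5, seed=42)
print(picked)  # a subset of 1..10 whose size is near 5, but not guaranteed
```

The same seed reproduces the same sample, which matches the role of the `seed` parameter above.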

// union: union of two RDDs
val rdd6 = sc.parallelize(List(5, 6, 7, 8))
val rdd7 = sc.parallelize(List(1, 2, 5, 6)) // rdd7's definition was missing from the original; an assumed example
val rdd8 = rdd6 union rdd7
println(rdd8.collect.toBuffer)

// intersection: find the intersection
val rdd9 = rdd6 intersection rdd7
println(rdd9.collect.toBuffer)

// distinct: remove duplicates
println(rdd8.distinct.collect.toBuffer)
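The set operators can be sketched in plain Python (not Spark): `union` simply concatenates and keeps duplicates, while `intersection` and `distinct` deduplicate. The `rdd7` contents are hypothetical, since its definition is missing above.

```python
def rdd_union(a, b):
    # union keeps duplicates, like RDD.union
    return a + b

def rdd_intersection(a, b):
    # intersection returns the distinct common elements
    return [x for x in sorted(set(a)) if x in set(b)]

def rdd_distinct(a):
    # distinct removes duplicates
    return sorted(set(a))

rdd6 = [5, 6, 7, 8]
rdd7 = [1, 2, 5, 6]  # hypothetical contents

print(rdd_union(rdd6, rdd7))                # [5, 6, 7, 8, 1, 2, 5, 6]
print(rdd_intersection(rdd6, rdd7))         # [5, 6]
print(rdd_distinct(rdd_union(rdd6, rdd7)))  # [1, 2, 5, 6, 7, 8]
```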

// join: pairs with the same key are merged
val rdd10_1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd10_2 = sc.parallelize(List(("jerry", 2), ("tom", 2), ("dog", 10)))
val rdd10_3 = rdd10_1 join rdd10_2
println(rdd10_3.collect().toBuffer)

// left outer join and right outer join
// the non-base side is of type Option; Option is used because a value may be absent (null)
val rdd10_4 = rdd10_1 leftOuterJoin rdd10_2 // based on the left side, whose values are never None
val rdd10_5 = rdd10_1 rightOuterJoin rdd10_2 // based on the right side, whose values are never None
println(rdd10_4.collect().toList)
println(rdd10_5.collect().toBuffer)
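A plain-Python sketch (not Spark, and assuming unique keys per side) of `join` versus `leftOuterJoin`: the inner join keeps only keys present on both sides, while the outer join keeps every key of the base side and uses `None` where Spark would use `Option`:

```python
def inner_join(left, right):
    # keep only keys that appear on both sides
    r = dict(right)
    return [(k, (v, r[k])) for k, v in left if k in r]

def left_outer_join(left, right):
    # every left key survives; the right value may be None (Spark's Option)
    r = dict(right)
    return [(k, (v, r.get(k))) for k, v in left]

rdd10_1 = [("tom", 1), ("jerry", 3), ("kitty", 2)]
rdd10_2 = [("jerry", 2), ("tom", 2), ("dog", 10)]

print(inner_join(rdd10_1, rdd10_2))       # [('tom', (1, 2)), ('jerry', (3, 2))]
print(left_outer_join(rdd10_1, rdd10_2))  # 'kitty' pairs with None
```

Real Spark additionally produces a cross product when a key repeats on both sides; the dict shortcut here glosses over that.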

val rdd11_1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd11_2 = sc.parallelize(List(("jerry", 2), ("tom", 2), ("dog", 10))) // rdd11_2's definition was missing from the original; an assumed example

// Cartesian product
val rdd11_3 = rdd11_1 cartesian rdd11_2
println(rdd11_3.collect.toBuffer)

// group according to the passed function
val rdd11_4 = rdd11_1 union rdd11_2 // rdd11_4's definition was missing from the original; an assumed example
val rdd11_5_1 = rdd11_4.groupBy(_._1)
println(rdd11_5_1.collect().toList)

val rdd11_5_2 = rdd11_4.groupByKey
println(rdd11_5_2.collect().toList)

// the difference between cogroup and groupByKey:
// cogroup groups without merging the data first; for each key, the result holds the data sets from the different RDDs separately
// groupByKey merges the data first and then groups by the same key
val rdd11_6: RDD[(String, (Iterable[Int], Iterable[Int]))] = rdd11_1 cogroup rdd11_2
println(rdd11_6.collect().toList)
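A plain-Python sketch (not Spark) of that difference: `groupByKey` on the merged data loses track of which RDD a value came from, while `cogroup` keeps one value list per source RDD for each key:

```python
from collections import defaultdict

def group_by_key(pairs):
    # groupByKey: collect all values sharing a key into one list
    out = defaultdict(list)
    for k, v in pairs:
        out[k].append(v)
    return dict(out)

def cogroup(a, b):
    # cogroup: for each key, a tuple of per-source value lists
    keys = {k for k, _ in a} | {k for k, _ in b}
    ga, gb = group_by_key(a), group_by_key(b)
    return {k: (ga.get(k, []), gb.get(k, [])) for k in keys}

rdd11_1 = [("tom", 1), ("jerry", 3), ("kitty", 2)]
rdd11_2 = [("jerry", 2), ("tom", 2), ("dog", 10)]

# groupByKey on the merged data: origins are mixed together
print(group_by_key(rdd11_1 + rdd11_2)["tom"])   # [1, 2]
# cogroup: one value list per source RDD
print(cogroup(rdd11_1, rdd11_2)["tom"])         # ([1], [2])
print(cogroup(rdd11_1, rdd11_2)["dog"])         # ([], [10])
```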

II. Demonstration of Action operators

val conf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sc = new SparkContext(conf)

/* Action operators */

// reduce: aggregate the elements
val rdd1 = sc.parallelize(List(2, 1, 3, 6, 5))
val rdd1_1 = rdd1.reduce(_ + _)
println(rdd1_1)
println(rdd1.collect().toBuffer)

// count: return the number of elements in the RDD
println(rdd1.count())

// top: take the given number of values, in descending order by default; top(0) returns an empty array
println(rdd1.top(3).toBuffer)

// takeOrdered: take the given number of values in order
// the default ordering is ascending
println(rdd1.takeOrdered(3).toBuffer)

// first: get the first value, equivalent to take(1)
println(rdd1.first())
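In plain Python (not Spark), `top(n)` amounts to "sort descending, take n" and `takeOrdered(n)` to "sort ascending, take n":

```python
def top(data, n):
    # top(n): the n largest values, in descending order; top(0) gives []
    return sorted(data, reverse=True)[:n]

def take_ordered(data, n):
    # takeOrdered(n): the n smallest values, in ascending order
    return sorted(data)[:n]

rdd1 = [2, 1, 3, 6, 5]
print(top(rdd1, 3))           # [6, 5, 3]
print(take_ordered(rdd1, 3))  # [1, 2, 3]
print(top(rdd1, 0))           # []
```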

// saveAsTextFile: write the processed data to a file (stored in HDFS or the local file system)
// rdd1.saveAsTextFile("dir/file1")

// countByKey: count the occurrences of each key, producing a Map whose key is the key name and whose value is that key's count
val rdd2 = sc.parallelize(List(("key1", 2), ("key2", 1), ("key3", 3), ("key4", 6), ("key5", 5)), 2)
val rdd2_1 = rdd2.countByKey() // rdd2_1's definition was missing from the original; assumed from the comment
println(rdd2_1)

// foreach: traverse the data
// rdd2.foreach(println)

/* other operators */

// countByValue: count the occurrences of each value, treating the whole element of the collection (here a tuple) as the value
val value: collection.Map[(String, Int), Long] = rdd2.countByValue
println(value)
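The distinction in plain Python (not Spark): `countByKey` counts occurrences of the first tuple component only, while `countByValue` counts occurrences of whole elements, so for a pair RDD each `(key, value)` tuple is the "value" being counted:

```python
from collections import Counter

rdd2 = [("key1", 2), ("key2", 1), ("key3", 3), ("key4", 6), ("key5", 5)]

count_by_key = Counter(k for k, _ in rdd2)   # counts keys only
count_by_value = Counter(rdd2)               # counts whole tuples

print(count_by_key)    # each key appears once in this data
print(count_by_value)  # each (key, value) tuple appears once
```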

// filterByRange: filter the elements of the RDD, returning the data within the specified key range
val rdd3 = sc.parallelize(List(("e", 5), ("c", 3), ("d", 4), ("c", 2), ("a", 1)))
val rdd3_1: RDD[(String, Int)] = rdd3.filterByRange("c", "e") // both endpoints are inclusive
println(rdd3_1.collect.toList)
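A plain-Python sketch (not Spark) of `filterByRange("c", "e")`: keep the pairs whose key falls in the inclusive range:

```python
def filter_by_range(pairs, lower, upper):
    # both endpoints are inclusive, as with RDD.filterByRange
    return [(k, v) for k, v in pairs if lower <= k <= upper]

rdd3 = [("e", 5), ("c", 3), ("d", 4), ("c", 2), ("a", 1)]
print(filter_by_range(rdd3, "c", "e"))
# [('e', 5), ('c', 3), ('d', 4), ('c', 2)] -- ('a', 1) is out of range
```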

// flatMapValues: flatten the values, keeping each result paired with its key
val rdd3_2 = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))
println(rdd3_2.flatMapValues(_.split(" ")).collect.toList)

// foreachPartition: loops over the data one partition at a time
// foreachPartition is generally used for data persistence, such as storing to a database, writing partition by partition
val rdd4 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
rdd4.foreachPartition(x => println(x.reduce(_ + _)))
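A plain-Python sketch (not Spark) of what `foreachPartition` does: the data is split into partitions and the function sees one whole partition at a time (Spark passes an iterator), here summing each partition separately. The even 3-way split is an assumption; Spark's actual partition boundaries depend on how the RDD was created.

```python
def foreach_partition(data, num_partitions, f):
    # split data into roughly equal chunks, one per partition,
    # and apply f once to each whole chunk
    size = -(-len(data) // num_partitions)  # ceiling division
    for i in range(0, len(data), size):
        f(data[i:i + size])

sums = []
foreach_partition(list(range(1, 10)), 3, lambda part: sums.append(sum(part)))
print(sums)  # [6, 15, 24]: partitions [1,2,3], [4,5,6], [7,8,9]
```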

// keyBy: uses the function's return value as the key; the RDD element becomes the value of the new tuple
val rdd5 = sc.parallelize(List("dog", "cat", "pig", "wolf", "bee"), 3)
val rdd5_1: RDD[(Int, String)] = rdd5.keyBy(_.length)
println(rdd5_1.collect.toList)

// keys gets all the keys; values gets all the values
println(rdd5_1.keys.collect.toList)
println(rdd5_1.values.collect.toList)

// collectAsMap: converts a pair RDD to a Map
val map = rdd5_1.collectAsMap() // the definition of map was missing from the original; assumed from the comment
println(map)
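In plain Python (not Spark), `keyBy`, `keys`, `values`, and `collectAsMap` amount to the following; note how `collectAsMap` keeps only one entry per key when keys collide:

```python
def key_by(data, f):
    # keyBy: the function's result becomes the key, the element the value
    return [(f(x), x) for x in data]

rdd5 = ["dog", "cat", "pig", "wolf", "bee"]
rdd5_1 = key_by(rdd5, len)

print(rdd5_1)                  # [(3, 'dog'), (3, 'cat'), (3, 'pig'), (4, 'wolf'), (3, 'bee')]
print([k for k, _ in rdd5_1])  # keys
print([v for _, v in rdd5_1])  # values
print(dict(rdd5_1))            # collectAsMap: later values win for duplicate keys
```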

That's all for "how to use Transformation and Action". Thank you for reading. If you want to learn more, you can follow the website, where the editor will keep publishing practical articles for you!
