
Action operators commonly used in Spark


A brief introduction to action operators

Action operators are a class of operators (functions) such as foreach, collect, and count. Transformation operators are lazily evaluated, while action operators trigger execution: an application (that is, a program we write) launches one job for each action operator it executes, so an application with several action operators runs several jobs.
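As a minimal sketch of this laziness (assuming a SparkContext named sc, as in the REPL examples below), the transformation is only recorded until an action runs:

// map is a transformation: it is recorded, but no job runs yet
val mapped = sc.makeRDD(1 to 10).map(_ * 2)
// count is an action: calling it triggers a job that actually computes the map
val n = mapped.count()   // n = 10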

1.reduce

Aggregates all elements in the dataset with the function func. The function must be associative and commutative so that it can be computed correctly in parallel.

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at makeRDD at <console>:24
scala> rdd1.reduce(_ + _)
res3: Int = 55
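To see why associativity and commutativity matter, a non-associative function such as subtraction gives partition-dependent results (a sketch; the actual value varies with the partitioning):

// With several partitions, reduce combines per-partition results in an
// unspecified order, so a non-associative function is unreliable
val rdd = sc.makeRDD(1 to 10, 4)
rdd.reduce(_ - _)   // value depends on partitioning; do not rely on it
rdd.reduce(_ + _)   // addition is associative and commutative: always 55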

2.collect

Returns all elements of the dataset to the driver program as an array. This is usually useful only after a filter or other operation has already reduced the data to a sufficiently small subset.

scala> var rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at makeRDD at <console>:24
scala> rdd1.collect
res2: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
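As a sketch of the filter-then-collect pattern mentioned above (assuming the same rdd1), only the small filtered subset is brought back to the driver:

// Collect only the even elements rather than the whole dataset
val evens = rdd1.filter(_ % 2 == 0).collect()
// evens: Array[Int] = Array(2, 4, 6, 8, 10)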

3.count

Returns the number of elements in the dataset.

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at makeRDD at <console>:24
scala> rdd1.count
res4: Long = 10

4.first

Returns the first element of the dataset (similar to take(1)).

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at makeRDD at <console>:24
scala> rdd1.first
res5: Int = 1

5.take

Returns an array consisting of the first n elements of the dataset. Note that this operation is not currently performed in parallel; the result is assembled on the machine where the driver program runs.

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at makeRDD at <console>:24
scala> rdd1.take(3)
res6: Array[Int] = Array(1, 2, 3)

6.takeSample(withReplacement, num, seed)

withReplacement: whether an element may appear more than once in the result

num: how many elements to take

seed: the seed for the random number generator

Returns an array of num elements randomly sampled from the dataset.

Principle

The takeSample() function works on the same principle as the sample function, except that it samples a fixed number of elements rather than a relative fraction. Also, the return value is no longer an RDD: it is equivalent to calling collect() on the sampled data, so the result is a local array on the driver.

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at makeRDD at <console>:24
scala> rdd1.takeSample(true, 4, 10)
res19: Array[Int] = Array(10, 10, 2, 3)
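For contrast, a minimal sketch of sample() next to takeSample() (assuming the same rdd1): sample() takes a relative fraction and stays distributed, while takeSample() takes an exact count and collects to the driver:

// sample(): fraction-based, returns an RDD
val sampledRdd = rdd1.sample(withReplacement = false, fraction = 0.5, seed = 10)
// takeSample(): exact count, returns a local array on the driver
val sampledArr = rdd1.takeSample(withReplacement = false, num = 5, seed = 10)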

7.takeOrdered

takeOrdered is similar to top, except that it returns elements in the opposite order.

top defaults to descending order; takeOrdered defaults to ascending order.

The top method is actually implemented by calling takeOrdered with the reversed ordering:

def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  takeOrdered(num)(ord.reverse)
}

scala> val rdd1 = sc.makeRDD(1 to 10)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at makeRDD at <console>:24
scala> rdd1.top(5)
res22: Array[Int] = Array(10, 9, 8, 7, 6)
scala> rdd1.takeOrdered(5)
res23: Array[Int] = Array(1, 2, 3, 4, 5)
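Since both methods take an implicit Ordering, one can also be passed explicitly; for example (a sketch, assuming the same rdd1), handing takeOrdered the reversed Int ordering reproduces top:

// Explicitly passing the reversed ordering: takeOrdered behaves like top
rdd1.takeOrdered(5)(Ordering[Int].reverse)   // Array(10, 9, 8, 7, 6)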

8.saveAsTextFile

saveAsTextFile is used to store an RDD in the file system as a text file.

val conf = new SparkConf().setAppName("saveFile").setMaster("local[*]")
val sc = new SparkContext(conf)
val rdd1: RDD[Int] = sc.parallelize(1 to 10)
rdd1.repartition(1).saveAsTextFile("/tmp/fff")

9.saveAsSequenceFile

saveAsSequenceFile is used to save an RDD of key-value pairs to HDFS in the SequenceFile format. It is used in the same way as saveAsTextFile.
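No example is given above, so here is a minimal sketch (the path /tmp/seq is made up for illustration); saveAsSequenceFile is defined for RDDs of key-value pairs whose types are convertible to Hadoop Writables:

// Save a pair RDD as a Hadoop SequenceFile
val pairs = sc.parallelize(Seq(("A", 1), ("B", 2), ("C", 3)))
pairs.saveAsSequenceFile("/tmp/seq")
// Read it back, specifying the key and value types
val restored = sc.sequenceFile[String, Int]("/tmp/seq")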

10.saveAsObjectFile

saveAsObjectFile is used to serialize the elements of an RDD and store them in a file. It is used in the same way as saveAsTextFile.
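Likewise a minimal sketch (the path /tmp/obj is made up); the reading counterpart is sc.objectFile:

// Serialize the RDD's elements and save them
val nums = sc.parallelize(1 to 10)
nums.saveAsObjectFile("/tmp/obj")
// Read the objects back, specifying the element type
val restored = sc.objectFile[Int]("/tmp/obj")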

11.countByKey

Only valid for RDDs of type (K, V). Returns a map of (K, Long) pairs, giving the number of elements for each key.

scala> val rdd1 = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 3)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at makeRDD at <console>:24
scala> rdd1.countByKey
res1: scala.collection.Map[String,Long] = Map(B -> 2, A -> 2, C -> 1)
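As with top above, it helps to see how countByKey is defined; in Spark's source it is approximately the following (a sketch from memory, not a verbatim quote):

def countByKey(): Map[K, Long] = self.withScope {
  // Count each key by mapping every value to 1 and summing per key,
  // then collect the (key, count) pairs to the driver as a Map
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

Because the result is collected to the driver, countByKey should only be used when the number of distinct keys is expected to be small.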

12.foreach

Runs the function func on each element of the dataset. It is usually used to update an accumulator variable or to interact with an external storage system, as sketched after the example below.

scala> val rdd1 = sc.makeRDD(Array(("A", 0), ("A", 2), ("B", 1), ("B", 2), ("C", 3)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[9] at makeRDD at <console>:24
scala> rdd1.collect.foreach(println(_))
(A,0)
(A,2)
(B,1)
(B,2)
(C,3)
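Note that the example above calls foreach on the collected array, so println runs on the driver. To illustrate the accumulator use mentioned above, here is a minimal sketch calling foreach directly on the RDD (the accumulator name "valueSum" is made up):

// func runs on the executors, so use an accumulator to aggregate safely
val acc = sc.longAccumulator("valueSum")
rdd1.foreach { case (_, v) => acc.add(v) }
println(acc.value)   // 8 = 0 + 2 + 1 + 2 + 3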
