6. Spark Core: key-value pair operations

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

A key-value pair RDD (pair RDD) is a data type required by many Spark operations and is typically used for aggregate computations.

Create Pair RDD

Spark provides several ways to create a pair RDD. Many storage formats that hold key-value pairs yield a pair RDD directly when read; an ordinary RDD can also be converted into a pair RDD with the map() operator.

```scala
// use the first word as the key to create a pair RDD
val pairs = lines.map(x => (x.split(" ")(0), x))
```

```java
// use the first word as the key to create a pair RDD
// (under JDK 1.8, a lambda expression can be used instead)
PairFunction<String, String, String> keyData = new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String x) {
        return new Tuple2(x.split(" ")[0], x);
    }
};
JavaPairRDD<String, String> pairs = lines.mapToPair(keyData);
```

```python
# use the first word as the key to create a pair RDD
pairs = lines.map(lambda x: (x.split(" ")[0], x))
```

When creating a pair RDD from an in-memory dataset, Scala and Python only need to call SparkContext's parallelize() method on a collection of tuples; Java needs the SparkContext.parallelizePairs() method.

Pair RDD transformation operations

Transformations on a single pair RDD:

- reduceByKey(func): merge values that have the same key. Example: rdd.reduceByKey((x, y) => x + y)
- groupByKey(): group values that have the same key. Example: rdd.groupByKey()
- combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner): merge values that have the same key, possibly returning a different type. Example: rdd.combineByKey(v => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
- mapValues(func): apply a function to each value in the pair RDD without changing the key. Example: rdd.mapValues(x => x + 1)
- flatMapValues(func): apply a function that returns an iterator to each value, producing one record under the original key for each element returned. Example: rdd.flatMapValues(x => (x to 5))
- keys(): return an RDD containing only the keys. Example: rdd.keys
- values(): return an RDD containing only the values. Example: rdd.values
- sortByKey(): return an RDD sorted by key. Example: rdd.sortByKey()

Transformations on two pair RDDs (rdd and other):

- subtractByKey: remove elements whose key is present in the other RDD. Example: rdd.subtractByKey(other)
- join: inner join of the two RDDs. Example: rdd.join(other)
- leftOuterJoin: join the two RDDs, keeping every key of the first (left) RDD (left outer join). Example: rdd.leftOuterJoin(other)
- rightOuterJoin: join the two RDDs, keeping every key of the second (right) RDD (right outer join). Example: rdd.rightOuterJoin(other)
- cogroup: group data sharing the same key from both RDDs. Example: rdd.cogroup(other)

Aggregation

Use mapValues() and reduceByKey() to compute the mean of the values for each key:

```scala
rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
```

```python
rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
```

Word count with flatMap(), map() and reduceByKey():

```scala
val input = sc.textFile("...")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
```

```java
JavaRDD<String> input = sc.textFile("...");
JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String x) { return Arrays.asList(x.split(" ")); }
});
JavaPairRDD<String, Integer> result = words.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String x) { return new Tuple2(x, 1); }
}).reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
});
```

```python
rdd = sc.textFile("s3://...")
words = rdd.flatMap(lambda x: x.split(" "))
result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
```

Use combineByKey() to compute the average of the values for each key; unlike reduceByKey(), its return type may differ from the input type. As combineByKey() traverses the elements of each partition of the RDD:

1. If an element's key appears for the first time in that partition, the createCombiner() function creates the initial value of the accumulator for that key.
2. If the key has already appeared in the current partition, the mergeValue() function merges the accumulator's current value with the new value.
3. If two or more partitions have accumulators for the same key, the mergeCombiners() function merges the per-partition results.
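The traversal just described can be sketched without a Spark cluster. In the following plain-Python emulation (the combine_by_key function and the nested-list "partitions" are illustrative, not the PySpark API), the three callbacks play the roles of createCombiner, mergeValue and mergeCombiners to compute per-key averages:

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    per_partition = []
    for part in partitions:
        acc = {}
        for k, v in part:
            if k not in acc:
                # first time this key is seen in the partition
                acc[k] = create_combiner(v)
            else:
                # key already seen in this partition
                acc[k] = merge_value(acc[k], v)
        per_partition.append(acc)
    # merge the per-partition accumulators for keys shared across partitions
    merged = {}
    for acc in per_partition:
        for k, c in acc.items():
            merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged

partitions = [[("a", 1), ("a", 3)], [("a", 5), ("b", 4)]]
sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]))
avg = {k: s / n for k, (s, n) in sum_count.items()}
print(avg)  # {'a': 3.0, 'b': 4.0}
```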
```scala
val result = rdd.combineByKey(
  v => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (key, value) => (key, value._1 / value._2.toFloat) }
```

```java
public static class AvgCount implements Serializable {
    public int total_;
    public int num_;
    public AvgCount(int total, int num) { total_ = total; num_ = num; }
    public float avg() { return total_ / (float) num_; }
}

Function<Integer, AvgCount> createAcc = new Function<Integer, AvgCount>() {
    public AvgCount call(Integer x) { return new AvgCount(x, 1); }
};
Function2<AvgCount, Integer, AvgCount> addAndCount = new Function2<AvgCount, Integer, AvgCount>() {
    public AvgCount call(AvgCount a, Integer x) { a.total_ += x; a.num_ += 1; return a; }
};
Function2<AvgCount, AvgCount, AvgCount> combine = new Function2<AvgCount, AvgCount, AvgCount>() {
    public AvgCount call(AvgCount a, AvgCount b) { a.total_ += b.total_; a.num_ += b.num_; return a; }
};
JavaPairRDD<String, AvgCount> avgCounts = input.combineByKey(createAcc, addAndCount, combine);
Map<String, AvgCount> countMap = avgCounts.collectAsMap();
for (Entry<String, AvgCount> entry : countMap.entrySet()) {
    System.out.println(entry.getKey() + ": " + entry.getValue().avg());
}
```

```python
sumCount = input.combineByKey(
    (lambda v: (v, 1)),
    (lambda acc, v: (acc[0] + v, acc[1] + 1)),
    (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))
sumCount.map(lambda kv: (kv[0], kv[1][0] / float(kv[1][1]))).collectAsMap()
```

Grouping

Use groupByKey() when grouping the data of a single RDD. If you find yourself calling groupByKey() and then reduce() or fold() on the result, you can usually do the same work more efficiently with a per-key aggregation function. For example, rdd.reduceByKey(func) produces the same RDD as rdd.groupByKey().mapValues(value => value.reduce(func)), but the former is more efficient because it avoids the step of building a list of values for each key.
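To see why the two are equivalent, here is a plain-Python sketch of both strategies (reduce_by_key and group_by_key are illustrative helpers, not Spark calls); the reduceByKey-style version folds each value into a running accumulator instead of materializing per-key lists:

```python
from collections import defaultdict
from functools import reduce

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def reduce_by_key(func, pairs):
    # fold each value into a per-key running accumulator (no per-key lists kept)
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return acc

def group_by_key(pairs):
    # materialize the full list of values for every key first
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

direct = reduce_by_key(lambda x, y: x + y, pairs)
grouped = {k: reduce(lambda x, y: x + y, vs) for k, vs in group_by_key(pairs).items()}
print(direct)             # {'a': 4, 'b': 6}
print(direct == grouped)  # True
```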

Use cogroup() when grouping data from multiple RDDs that share the same key type. Calling cogroup() on two RDDs with key type K and value types V and W yields an RDD of type [(K, (Iterable[V], Iterable[W]))].
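The shape of cogroup()'s result can be sketched in plain Python (the cogroup function here is an illustrative emulation over lists of pairs, not the Spark API):

```python
from collections import defaultdict

def cogroup(left, right):
    # group values from both datasets under every key that appears in either one
    lgroups, rgroups = defaultdict(list), defaultdict(list)
    for k, v in left:
        lgroups[k].append(v)
    for k, w in right:
        rgroups[k].append(w)
    keys = set(lgroups) | set(rgroups)
    return {k: (lgroups[k], rgroups[k]) for k in keys}

rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]
print(cogroup(rdd, other)[3])  # ([4, 6], [9])
print(cogroup(rdd, other)[1])  # ([2], [])
```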

Joins

Joining one set of keyed data with another set of keyed data is a common operation. The main join types are: inner join, left outer join, and right outer join.
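The semantics of the three join types can be sketched in plain Python (inner_join, left_outer_join and right_outer_join are illustrative helpers, not the Spark API; where Spark uses Option for a possibly-missing value, this sketch uses None):

```python
def inner_join(left, right):
    # keep only keys present in both datasets
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

def left_outer_join(left, right):
    # keep every key of the left dataset; missing right values become None
    rkeys = {k for k, _ in right}
    return ([(k, (v, w)) for k, v in left for k2, w in right if k == k2]
            + [(k, (v, None)) for k, v in left if k not in rkeys])

def right_outer_join(left, right):
    # keep every key of the right dataset; missing left values become None
    return [(k, (w, v)) for k, (v, w) in left_outer_join(right, left)]

left = [("Ritual", "1026 Valencia St"), ("Starbucks", "Seattle")]
right = [("Ritual", 4.9)]
print(inner_join(left, right))       # [('Ritual', ('1026 Valencia St', 4.9))]
print(left_outer_join(left, right))  # Starbucks is kept, paired with None
```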

```scala
val storeAddress = sc.parallelize(Seq(
  (Store("Ritual"), "1026 Valencia St"),
  (Store("Philz"), "748 Van Ness Ave"),
  (Store("Philz"), "3101 24th St"),
  (Store("Starbucks"), "Seattle")))
val storeRating = sc.parallelize(Seq(
  (Store("Ritual"), 4.9),
  (Store("Philz"), 4.8)))

// inner join
storeAddress.join(storeRating)
// left outer join
storeAddress.leftOuterJoin(storeRating)
// right outer join
storeAddress.rightOuterJoin(storeRating)
```

Sort

Sorting data is another common scenario. The sortByKey() function takes a parameter called ascending, indicating whether the results should be sorted in ascending order (default: true). Sometimes you can also provide a custom comparison function, for example to sort integers in string order.
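The effect of comparing integers as strings can be previewed with ordinary Python sorting, no Spark involved (the key function here mirrors the str conversion):

```python
# sorting integer keys in string order rather than numeric order
data = {10: "a", 2: "b", 3: "c"}
by_string = sorted(data.items(), key=lambda kv: str(kv[0]))
print([k for k, _ in by_string])  # [10, 2, 3] because "10" < "2" < "3"
```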

```scala
implicit val sortIntegersByString = new Ordering[Int] {
  override def compare(a: Int, b: Int) = a.toString.compare(b.toString)
}
rdd.sortByKey()
```

```java
class IntegerComparator implements Comparator<Integer> {
    public int compare(Integer a, Integer b) {
        return String.valueOf(a).compareTo(String.valueOf(b));
    }
}
rdd.sortByKey(new IntegerComparator());
```

```python
rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))
```

Pair RDD action operations

Like the transformation operations, all the actions supported by basic RDDs are also available on pair RDDs. In addition, pair RDDs provide a few extra actions.

- countByKey(): count the number of elements for each key separately. Example: rdd.countByKey()
- collectAsMap(): return the result in the form of a map. Example: rdd.collectAsMap()
- lookup(key): return all values associated with the given key. Example: rdd.lookup(3)
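These three actions can be emulated on a plain Python list of pairs (Counter, dict and a list comprehension stand in for countByKey(), collectAsMap() and lookup(); this is a sketch of the semantics, not the Spark API):

```python
from collections import Counter

pairs = [(1, 2), (3, 4), (3, 6)]

count_by_key = Counter(k for k, _ in pairs)   # countByKey(): elements per key
collect_as_map = dict(pairs)                  # collectAsMap(): only one value per key survives
lookup = [v for k, v in pairs if k == 3]      # lookup(3): every value for key 3

print(count_by_key[3])    # 2
print(collect_as_map[3])  # 6
print(lookup)             # [4, 6]
```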
