Study notes on Spark transformation and action operations

2025-07-04 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

1. The difference between transformation and action in Spark

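Spark provides a set of basic transformation and action operations. A transformation produces a new RDD from an existing one and is evaluated lazily. An action does not produce an RDD; it aggregates, collects or saves the data of an RDD, and it is the action that triggers execution.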
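The distinction can be observed directly. The following is a minimal sketch, not part of the original notes. It assumes Spark 2.0 or later (for longAccumulator), a local master, and a hypothetical object name LazyEvaluationSketch. An accumulator counts how often the function passed to map actually runs, before and after an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls before the action, map calls after it, collected result).
  def run(sc: SparkContext): (Long, Long, Array[Int]) = {
    // Accumulator that counts how many times the map function really runs.
    val calls = sc.longAccumulator("map calls")
    val numbers = sc.parallelize(List(1, 2, 3, 4, 5, 6))

    // Transformation: only the lineage is recorded, nothing is computed yet.
    val doubled = numbers.map { x => calls.add(1L); x * 2 }
    val before = calls.sum

    // Action: runs the job and returns the data to the driver.
    val result = doubled.collect()
    val after = calls.sum

    (before, after, result)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for running outside spark-shell.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch")
    val sc = new SparkContext(conf)
    try {
      val (before, after, result) = run(sc)
      println(s"map calls before collect: $before, after collect: $after")
      println(result.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

In a local run this is expected to report 0 map calls before collect and 6 after it, since the map function only runs once the action starts a job.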

2. What are the transformations?

There are 13 kinds of transformation: map, filter, flatMap (which differs from map), sample, groupByKey, reduceByKey, union, join, cogroup, crossProduct (exposed as cartesian in the RDD API), mapValues, sort and partitionBy. The examples below also use sortByKey, which is how sort appears in the RDD API.

1. map

val rdd = sc.parallelize(List(1, 2, 3, 4, 5, 6))

val mapRdd = rdd.map(_ * 2) // a typical functional-programming style

mapRdd.collect() // the map above is a transformation; collect is an action, and it is what starts execution. It returns an Array: Array(2, 4, 6, 8, 10, 12)

map(x => (x, 1)) maps each element x to the pair (x, 1). This form is commonly used when counting occurrences per key.
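As an illustration of the (x, 1) pattern, the sketch below counts how often each element occurs by mapping to pairs and then summing the ones per key with reduceByKey (described in item 3 of these notes). The object name PairCountSketch, the helper countByElement, the local master and the sample input are assumptions made for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairCountSketch {
  // Counts occurrences of each element: x becomes (x, 1), then the ones are summed per key.
  def countByElement(sc: SparkContext, items: Seq[String]): Map[String, Int] =
    sc.parallelize(items)
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for running outside spark-shell.
    val conf = new SparkConf().setMaster("local[2]").setAppName("pair-count-sketch")
    val sc = new SparkContext(conf)
    try {
      val counts = countByElement(sc, Seq("a", "b", "a", "c", "a"))
      counts.foreach { case (key, n) => println(s"$key -> $n") }
    } finally {
      sc.stop()
    }
  }
}
```

For the sample input the counts are a -> 3, b -> 1 and c -> 1; the order in which they are printed is not guaranteed.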

2. filter

filter selects the elements that satisfy a predicate.

val filterRdd = mapRdd.filter(_ > 5)

filterRdd.collect() // returns an Array of all elements greater than 5: Array(6, 8, 10, 12)

3. flatMap plus reduceByKey

val wordcount = rdd.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _) // rdd here is assumed to be an RDD of text lines (for example, loaded with sc.textFile); each line is split on spaces, flatMap flattens the resulting lists into one, and map turns each word into a (word, 1) tuple

// reduceByKey then adds up the values of elements that share the same key; the function passed to reduceByKey operates on the values.

wordcount.saveAsTextFile("/xxx/ss/aa") // save the results to the file system

wordcount.collect() // returns the results as an array

4. groupByKey

After each line is split on spaces, groupByKey groups the resulting (word, 1) pairs by word.

val wordcount = rdd.flatMap(_.split(' ')).map((_, 1)).groupByKey()

Use collect to see the results:

wordcount.collect()

5. union

Merges two RDDs into one.

val rdd1 = sc.parallelize(List(('a', 1), ('a', 2)))

val rdd2 = sc.parallelize(List(('b', 1), ('b', 2)))

val result_union = rdd1 union rdd2 // the two lists are merged into one: (('a', 1), ('a', 2), ('b', 1), ('b', 2))

6. join

Works like a Cartesian product taken within each key group: for every key, each value from the first RDD is paired with each value from the second.

val rdd1 = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3)))

val rdd2 = sc.parallelize(List(('a', 4), ('b', 5)))

val result_join = rdd1 join rdd2 // per key, every value of rdd1 is combined with every value of rdd2; collecting it gives Array(('a', (1, 4)), ('a', (2, 4)), ('b', (3, 5))), possibly in a different order
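To make the difference from a full Cartesian product concrete, the sketch below runs join and cartesian on the same two RDDs as above. The object name JoinSketch and the local master are assumptions for this example; join and cartesian are the RDD API calls being compared.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  // Returns (sorted join result, number of pairs in the full Cartesian product).
  def run(sc: SparkContext): (List[(Char, (Int, Int))], Long) = {
    val rdd1 = sc.parallelize(List(('a', 1), ('a', 2), ('b', 3)))
    val rdd2 = sc.parallelize(List(('a', 4), ('b', 5)))

    // join only pairs values whose keys match; sorted to get a stable order.
    val joined = rdd1.join(rdd2).collect().toList.sorted

    // cartesian pairs every element of rdd1 with every element of rdd2.
    val allPairs = rdd1.cartesian(rdd2).count()

    (joined, allPairs)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for running outside spark-shell.
    val conf = new SparkConf().setMaster("local[2]").setAppName("join-sketch")
    val sc = new SparkContext(conf)
    try {
      val (joined, allPairs) = run(sc)
      println(s"join: $joined")
      println(s"cartesian pair count: $allPairs")
    } finally {
      sc.stop()
    }
  }
}
```

With this data the join yields three pairs, while the Cartesian product of a 3-element and a 2-element RDD has six, so join is a product only within each key.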

7. sortByKey

Sorts by key; very handy.

val wordcount = rdd.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))

// this effectively sorts by value: the pairs are swapped so that the count becomes the key, sortByKey(false) sorts in descending order, and the pairs are then swapped back
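The swap, sort, swap-back idiom can be compared with sortBy, an RDD method that takes the sort key as a function. The sketch below is an illustration under assumptions: the object name SortByValueSketch, the two sample lines and the local master are made up for this example, and sortBy is offered as an alternative rather than as the approach of these notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SortByValueSketch {
  // Returns the word counts sorted by count (descending) using both approaches.
  def run(sc: SparkContext): (List[(String, Int)], List[(String, Int)]) = {
    val counts = sc.parallelize(List("b a b", "c b a"))
      .flatMap(_.split(' '))
      .map((_, 1))
      .reduceByKey(_ + _)

    // Approach from the notes: swap, sort by key in descending order, swap back.
    val viaSwap = counts
      .map(x => (x._2, x._1))
      .sortByKey(false)
      .map(x => (x._2, x._1))
      .collect()
      .toList

    // Alternative: sortBy with the count as the sort key.
    val viaSortBy = counts.sortBy(_._2, ascending = false).collect().toList

    (viaSwap, viaSortBy)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for running outside spark-shell.
    val conf = new SparkConf().setMaster("local[2]").setAppName("sort-by-value-sketch")
    val sc = new SparkContext(conf)
    try {
      val (viaSwap, viaSortBy) = run(sc)
      println(viaSwap)
      println(viaSortBy)
    } finally {
      sc.stop()
    }
  }
}
```

The sample counts are b = 3, a = 2 and c = 1, so both approaches are expected to give the same descending order.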

3. What are the actions?

There are 5 kinds of action: count, collect, reduce, lookup and save.

1. count

Counts the number of elements in an RDD.

val rdd = sc.textFile("/xxx/sss/ee")

rdd.count // counts the lines

rdd.cache // marks the RDD to be kept in memory

rdd.count // counts the lines again; cache is lazy, so this action is the one that fills the cache, and later actions on rdd will be much faster
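The effect of cache can be observed without timing anything, by counting how often an upstream function runs across two actions. This is a minimal sketch under assumptions: Spark 2.0 or later (for longAccumulator), a local master with enough memory to hold the cached data, and a hypothetical object name CacheSketch.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  // Runs two count actions and returns how many times the map function ran in total.
  def mapCalls(sc: SparkContext, useCache: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val mapped = sc.parallelize(1 to 100).map { x => calls.add(1L); x }
    val rdd = if (useCache) mapped.cache() else mapped

    rdd.count() // first action: computes the data (and fills the cache if enabled)
    rdd.count() // second action: reuses cached partitions when they are available

    calls.sum
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for running outside spark-shell.
    val conf = new SparkConf().setMaster("local[2]").setAppName("cache-sketch")
    val sc = new SparkContext(conf)
    try {
      println(s"map calls without cache: ${mapCalls(sc, useCache = false)}")
      println(s"map calls with cache: ${mapCalls(sc, useCache = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

Without cache the map function is expected to run 200 times (100 per action); with cache it should run only 100 times, because the second count reads the stored partitions instead of recomputing them.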

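2. collect

collect returns all the elements of an RDD to the driver.

val rdd1 = sc.parallelize(List(('a', 1), ('b', 1))) // the original element values were lost in extraction; these are illustrative

val rdd2 = sc.parallelize(List(('c', 1), ('d', 1))) // partly reconstructed; the value for 'd' is assumed

val result = rdd1 union rdd2

Use the collect operation to view the result, for example result.collect().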

3. reduce

map and reduce are the two core operations of Hadoop: map is the mapping step, and reduce is the reduction (aggregation) step.

val rdd = sc.parallelize(List(1, 2, 3, 4))

rdd.reduce(_ + _) // reduce is an action; the result here is 10

4. lookup

Looks up values by key.

val rdd = sc.parallelize(List(('a', 1), ('a', 2), ('b', 1), ('b', 2)))

rdd.lookup('a') // returns a Seq containing 1 and 2: the values of all elements whose key is 'a' are gathered into a Seq (the key must have the same type as the RDD's keys, here Char)

5. save

Query the records whose search-result rank is 1 and whose click order is 2, then save them.

val rdd1 = sc.textFile("hdfs://192.168.0.10:9000/input/SogouQ2.txt").map(_.split("\t")).filter(_.length == 6) // the log does not seem to be fully standard: some lines have 6 fields and some do not, so only lines with 6 fields are kept

rdd1.count()

val rdd2 = rdd1.filter(_(3).toInt == 1).filter(_(4).toInt == 2) // keep this as an RDD; appending .count() here would produce a Long, which cannot be saved

rdd2.count()

rdd2.map(_.mkString("\t")).saveAsTextFile("hdfs://192.168.0.10:9000/output/sogou1111/") // join the fields back into a tab-separated line before saving
