Good Programmer Big Data Learning Route: The Resilient Distributed Dataset (RDD)


Definition of RDD: An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable (in both data and metadata), partitioned collection whose elements can be computed in parallel.

Features of RDD: automatic fault tolerance, location-aware scheduling and scalability

Properties of RDD

1. A set of partitions (shards)

A partition is the basic unit of the dataset. Each partition is processed by one computing task, and the number of partitions determines the granularity of parallelism. You can specify the number of partitions when creating an RDD; if you do not, a default value is used, namely the number of CPU cores allocated to the program. For instance, you can check the default and request an explicit partition count, as shown in the sketch below.
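A minimal spark-shell sketch (the numbers are arbitrary and only for illustration):

val rddDefault = sc.parallelize(1 to 100)
rddDefault.partitions.length   // equals sc.defaultParallelism, i.e. the cores allocated to the application
val rddSix = sc.parallelize(1 to 100, 6)
rddSix.partitions.length       // 6, because the partition count was given explicitly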

2. A function for computing each partition

RDDs in Spark are computed partition by partition, and every RDD implements a compute function for this purpose. The compute function composes iterators, so the result of each intermediate computation does not have to be saved.

3. Dependencies between RDDs

Each transformation of an RDD produces a new RDD, so RDDs form a pipeline-like chain of dependencies (lineage).

Fault tolerance: when part of the partition data is lost, Spark can recompute just the lost partitions through this lineage instead of recomputing all partitions of the RDD.
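In spark-shell you can inspect this lineage with toDebugString; a minimal sketch (the sample data is made up for illustration):

val wc = sc.parallelize(List("a a", "b a"))
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
// prints the chain of parent RDDs this RDD was derived from;
// Spark replays exactly this chain to rebuild a lost partition
println(wc.toDebugString)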

4. A Partitioner (the partitioning function of the RDD)

Spark currently implements two kinds of partitioners: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for a non-key-value RDD the Partitioner is None. The Partitioner determines both the number of partitions of the RDD itself and the number of partitions of the parent RDD's shuffle output.
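A minimal sketch of attaching the two built-in partitioners to a key-value RDD (the pair data is made up for illustration):

import org.apache.spark.{HashPartitioner, RangePartitioner}

val pairs = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2), ("shuke", 2)))
pairs.partitioner                                    // None: a freshly parallelized RDD has no partitioner

val hashed = pairs.partitionBy(new HashPartitioner(4))
hashed.partitioner                                   // Some(HashPartitioner): keys are placed by their hash

val ranged = pairs.partitionBy(new RangePartitioner(2, pairs))
ranged.partitioner                                   // Some(RangePartitioner): keys are placed by sorted key ranges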

5. A list of preferred locations

A list of preferred locations for storing and accessing each Partition (the data-locality principle).

For an HDFS file, this list holds the locations of the blocks where each Partition resides. Following the idea that "moving computation is cheaper than moving data", Spark tries, when scheduling tasks, to assign each computing task to the storage location of the data blocks it needs to process.

RDD Types

1. Transformations -> record the computation (the parameters and the computing method) but do not execute it

Transformation: Meaning

map(func): returns a new RDD formed by passing each input element through the function func.
filter(func): returns a new RDD consisting of the input elements for which func returns true.
flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).
mapPartitions(func): similar to map, but runs separately on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].
mapPartitionsWithIndex(func): similar to mapPartitions, but func also receives an integer argument holding the partition index, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].
sample(withReplacement, fraction, seed): samples the data at the ratio specified by fraction, with or without replacement; seed specifies the random number generator seed.
union(otherDataset): returns a new RDD that is the union of the source RDD and the argument RDD.
intersection(otherDataset): returns a new RDD that is the intersection of the source RDD and the argument RDD (compare subtract, which is the difference).
distinct([numTasks]): removes duplicates from the source RDD and returns a new RDD (numTasks can change the number of partitions).
groupByKey([numTasks]): called on an RDD of (K, V) pairs, returns an RDD of (K, Iterable[V]) pairs.
reduceByKey(func, [numTasks]): called on an RDD of (K, V) pairs, returns an RDD of (K, V) pairs in which the values of each key are aggregated with the given reduce function. As with groupByKey, the number of reduce tasks can be set with an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): aggregates the values of each key using the given combine functions and a neutral zeroValue; seqOp aggregates within each partition and combOp merges the partial results across partitions.
sortByKey([ascending], [numTasks]): called on an RDD of (K, V) pairs where K implements the Ordered interface, returns an RDD of (K, V) pairs sorted by key.
sortBy(func, [ascending], [numTasks]): similar to sortByKey, but more flexible.
join(otherDataset, [numTasks]): called on RDDs of type (K, V) and (K, W), returns an RDD of (K, (V, W)) pairs containing all pairs of elements for each key.
cogroup(otherDataset, [numTasks]): called on RDDs of type (K, V) and (K, W), returns an RDD of type (K, (Iterable[V], Iterable[W])).
cartesian(otherDataset): Cartesian product.
pipe(command, [envVars]): pipes each partition of the RDD through a shell command.
coalesce(numPartitions): reduces the number of partitions of the RDD to numPartitions.
repartition(numPartitions): reshuffles the data in the RDD into the given number of partitions.
repartitionAndSortWithinPartitions(partitioner): repartitions the RDD according to the given partitioner and, within each resulting partition, sorts the records by key.
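A minimal spark-shell sketch of two of the less obvious transformations above, aggregateByKey and mapPartitionsWithIndex (the sample data is made up for illustration):

val scores = sc.parallelize(List(("tom", 3), ("tom", 7), ("jerry", 5)), 2)

// aggregateByKey: 0 is the per-partition starting value,
// the first function aggregates inside a partition, the second merges partitions
val sumPerKey = scores.aggregateByKey(0)(_ + _, _ + _)
sumPerKey.collect                 // Array((tom,10), (jerry,5)) in some order

// mapPartitionsWithIndex: the function also receives the partition index
val tagged = scores.mapPartitionsWithIndex((idx, it) => it.map(kv => (idx, kv)))
tagged.collect                    // each element is paired with the index of the partition it lives in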

2. Actions -> trigger a job (one job corresponds to one action operator)

Action: Meaning

reduce(func): aggregates all elements of the RDD with the function func, which must be commutative and associative so that it can be computed correctly in parallel.
collect(): returns all elements of the dataset as an array to the driver.
count(): returns the number of elements in the RDD.
first(): returns the first element of the RDD (similar to take(1)).
take(n): returns an array of the first n elements of the dataset.
takeSample(withReplacement, num, [seed]): returns an array of num elements sampled randomly from the dataset, with or without replacement; seed specifies the random number generator seed.
takeOrdered(n, [ordering]): similar to top, except that the elements are returned in the opposite (ascending) order.
saveAsTextFile(path): saves the elements of the dataset as a text file to HDFS or another supported file system. For each element, Spark calls toString to convert it into a line of text in the file.
saveAsSequenceFile(path): saves the elements of the dataset in Hadoop SequenceFile format to the given directory on HDFS or another file system supported by Hadoop.
saveAsObjectFile(path): saves the elements of the dataset as serialized objects to the given path.
countByKey(): for an RDD of type (K, V), returns a map of (K, Int) pairs giving the number of elements for each key.
foreach(func): runs the function func on each element of the dataset.
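A minimal spark-shell sketch of a few of the actions that the exercises below do not cover (the data is made up for illustration):

val pets = sc.parallelize(List(("cat", 1), ("dog", 1), ("cat", 2)))

pets.countByKey()             // Map(cat -> 2, dog -> 1): number of elements per key, returned to the driver
pets.takeSample(false, 2)     // two elements sampled at random without replacement
pets.foreach(println)         // runs on the executors, so the output appears in the executor logs, not the driver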

Create RDD

Start spark-shell on Linux:

/usr/local/spark.../bin/spark-shell \
--master spark://hadoop01:7077 \
--executor-memory 512m \
--total-executor-cores 2

Or in a Maven project:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object lx03 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf()
      .setAppName("SparkAPI")
      .setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    // generate an RDD by parallelizing a collection
    val rdd1: RDD[Int] = sc.parallelize(List(24, 56, 3, 2, 1))
    // multiply each element of rdd1 by 2, then sort
    val rdd2: RDD[Int] = rdd1.map(_ * 2).sortBy(x => x, true)
    println(rdd2.collect().toBuffer)
    // filter out elements greater than or equal to 10
    // val rdd3: RDD[Int] = rdd2.filter(_ >= 10)
    // println(rdd3.collect().toBuffer)
  }
}

Exercise 2

val rdd1 = sc.parallelize(Array("a b c", "d e f", "h i j"))
// split each element of rdd1 and then flatten the result
val rdd2 = rdd1.flatMap(_.split(' '))
rdd2.collect

// a more complex case: nested lists
val rdd1 = sc.parallelize(List(List("a b c", "a b b"), List("e f g", "a f g"), List("h i j", "a a b")))
// split each element of rdd1 and then flatten the result
val rdd2 = rdd1.flatMap(_.flatMap(_.split(" ")))

Exercise 3

val rdd1 = sc.parallelize(List(5, 6, 4, 3))
val rdd2 = sc.parallelize(List(1, 2, 3, 4))
// union
val rdd3 = rdd1.union(rdd2)
// intersection
val rdd4 = rdd1.intersection(rdd2)
// deduplicate
rdd3.distinct.collect
rdd4.collect

Exercise 4

val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
// join: pairs with the same key are combined into (key, (value1, value2))
val rdd3 = rdd1.join(rdd2)
// result: Array[(String, (Int, Int))] = Array((tom,(1,1)), (jerry,(3,2)))
rdd3.collect
// left outer join and right outer join
val rdd3 = rdd1.leftOuterJoin(rdd2)
rdd3.collect
val rdd3 = rdd1.rightOuterJoin(rdd2)
rdd3.collect
// union
val rdd4 = rdd1 union rdd2
// group by key
rdd4.groupByKey
rdd4.collect
// count words with groupByKey and with reduceByKey respectively
val rdd3 = rdd1 union rdd2
rdd3.groupByKey().mapValues(_.sum).collect
rdd3.reduceByKey(_ + _).collect

The difference between groupByKey and reduceByKey

The reduceByKey operator is special: it first aggregates locally within each partition and then aggregates globally across partitions, and we only need to pass it the aggregation function once.
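One way to make the "local then global" aggregation explicit is combineByKey, which reduceByKey is essentially a shorthand for; a minimal sketch (the word data is made up for illustration):

val words = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1)), 2)

// reduceByKey(_ + _) behaves like this combineByKey call:
//   v => v               creates the per-partition accumulator from a key's first value
//   (acc, v) => acc + v  local (map-side) aggregation inside each partition
//   (a, b) => a + b      global aggregation of the per-partition results after the shuffle
val counts = words.combineByKey(
  (v: Int) => v,
  (acc: Int, v: Int) => acc + v,
  (a: Int, b: Int) => a + b)
counts.collect                // Array((a,2), (b,1)) in some order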

Exercise 5

val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
// cogroup
val rdd3 = rdd1.cogroup(rdd2)
// note the difference between cogroup and groupByKey
rdd3.collect

val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5))
// reduce aggregation
val rdd2 = rdd1.reduce(_ + _)

// sort in descending order of value
val rdd5 = rdd4.map(t => (t._2, t._1)).sortByKey(false).map(t => (t._2, t._1))
rdd5.collect

// Cartesian product (note: both operands must be RDDs)
val rdd3 = rdd1.cartesian(rdd2)

Calculate the number of elements

scala> val rdd1 = sc.parallelize(List(2, 3, 1, 5, 7, 3, 4))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> rdd1.count
res0: Long = 7

top first sorts in descending order and then takes the first n elements.

scala> rdd1.top(3)
res1: Array[Int] = Array(7, 5, 4)

scala> rdd1.top(0)
res2: Array[Int] = Array()

scala> rdd1.top(30)
res3: Array[Int] = Array(7, 5, 4, 3, 3, 2, 1)

take returns the first n elements of the original collection; if there are fewer than n elements, it returns as many as there are.

scala> rdd1.take(3)
res4: Array[Int] = Array(2, 3, 1)

scala> rdd1.take(30)
res5: Array[Int] = Array(2, 3, 1, 5, 7, 3, 4)

scala> rdd1.first
res6: Int = 2

takeOrdered sorts in the opposite order of top (ascending) and then takes the first n values.

scala> rdd1.takeOrdered(3)
res7: Array[Int] = Array(1, 2, 3)

scala> rdd1.takeOrdered(30)
res8: Array[Int] = Array(1, 2, 3, 3, 4, 5, 7)

Two ways to generate RDD

1. Generate from a parallelized collection (the default here is two partitions)

You can also specify the number of partitions manually:

scala> val rdd1 = sc.parallelize(List(1, 2, 3, 5))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:27

scala> rdd1.partitions.length // get the number of partitions
res9: Int = 2

scala> val rdd1 = sc.parallelize(List(1, 2, 3, 5), 3)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:27

scala> rdd1.partitions.length
res10: Int = 3

2. Use textFile to read data from a file storage system

scala> val rdd2 = sc.textFile("hdfs://hadoop01:9000/wordcount/input/a.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at <console>:27

scala> rdd2.collect // call an action operator to compute the RDD and display the result
res11: Array[(String, Int)] = Array((hello,6), (beijing,1), (java,1), (gp1808,1), (world,1), (good,1), (qianfeng,1))

scala> val rdd2 = sc.textFile("hdfs://hadoop01:9000/wordcount/input/a.txt", 4).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[26] at reduceByKey at <console>:27

scala> rdd2.partitions.length // you can also specify the number of partitions yourself
res15: Int = 4
