Example Analysis of RDD Operators in Spark

2025-07-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article walks through the RDD operators in Spark with examples, and is intended as a reference for readers who work with them.

Transformation operators of Value type

Transformation operators that process Value-type data can be divided into the following types, according to the relationship between the input partitions and the output partitions of the RDD transformation.

1) Input and output partitions are one-to-one.

2) Input and output partitions are many-to-one.

3) Input and output partitions are many-to-many.

4) The output partitions are a subset of the input partitions.

5) A special one-to-one type: the Cache type. A Cache operator caches the RDD partitions.

The correspondence here refers to the dependency relationship between partitions.

1. One-to-one input and output partitions

(1) map(func)

map applies the specified function to each element in the RDD to produce a new RDD, internally MappedRDD(this, sc.clean(f)). Each element in the original RDD corresponds to exactly one element in the new RDD.

Each box in figure 3-4 represents an RDD partition. The partition on the left is mapped to the new RDD partition on the right by the user-defined function f: T -> U. In practice, f is not applied to the data immediately; it runs, together with the other functions in the same Stage, only when an Action operator is triggered. In the figure, V1 is passed through f and becomes V'1.
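The one-to-one mapping and the lazy execution described above can be illustrated with a small plain-Python model. This is only a sketch and not Spark code: an RDD is modeled as a list of partitions, and map_rdd, collect, and times_ten are hypothetical names chosen for the illustration.

```python
# Plain-Python model (not Spark): an "RDD" is a list of partitions.

def map_rdd(partitions, f):
    """Return one lazy generator per input partition; f is not called yet."""
    return [(f(x) for x in part) for part in partitions]


def collect(partitions):
    """Stand-in for an action: force every partition to be computed."""
    return [list(part) for part in partitions]


calls = []


def times_ten(v):
    calls.append(v)  # record that f really ran
    return v * 10


lazy = map_rdd([[1, 2], [3, 4, 5]], times_ten)
calls_before_action = len(calls)  # still 0: nothing has been computed
result = collect(lazy)            # the "action" triggers the computation
print(calls_before_action, result)  # 0 [[10, 20], [30, 40, 50]]
```

Each output partition is built from exactly one input partition, and the function only runs once the stand-in action is called.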

(2) flatMap(func)

Similar to map, but each input element can be mapped to zero or more output elements (so func returns a Seq rather than a single element). Internally it creates FlatMappedRDD(this, sc.clean(f)).

In figure 3-5, the outer large box can be thought of as an RDD partition that is processed by the flatMap function, and each small box represents a collection. The function passed to flatMap is f: T -> U, where T and U can be any data type. The data in the partition is converted into new data by the user-defined function f. V1, V2, V3 sit in one collection that is a single data item of the RDD; after they are converted to V'1, V'2, V'3, the collection is flattened so that each of them becomes a data item of the new RDD.
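A minimal plain-Python sketch of the flattening step, assuming the same list-of-partitions model of an RDD. It is not Spark code; flat_map_rdd and the sample lines are made up for the illustration.

```python
# Plain-Python model (not Spark): an "RDD" is a list of partitions.

def flat_map_rdd(partitions, f):
    """Apply f to each element and flatten the returned sequences.

    f may return zero, one, or many items per input element.
    """
    return [[y for x in part for y in f(x)] for part in partitions]


lines = [["a b", "c"], ["", "d e f"]]          # two partitions of text lines
words = flat_map_rdd(lines, lambda s: s.split())
print(words)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

The empty string produces no output elements at all, which is the "zero or more" case that map cannot express.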

(3) mapPartitions(func)

mapPartitions is a variant of map. The function passed to map is applied to each element of the RDD, while the function passed to mapPartitions is applied to each partition, that is, the contents of each partition are processed as a whole.

The mapPartitions function receives the iterator of each partition and operates on all the elements of that partition through it. Internally it generates a MapPartitionsRDD. Each box in figure 3-6 represents an RDD partition.

In figure 3-6, the user filters the data in each partition with the function f(iter) => iter.filter(_ >= 3), so only elements greater than or equal to 3 are retained. The partition containing 1, 2, and 3 is left with only the element 3.
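The per-partition behavior can be sketched in plain Python as follows. This is an illustrative model rather than Spark itself, and map_partitions_rdd is a hypothetical helper; the filter mirrors the f(iter) example from the text.

```python
# Plain-Python model (not Spark): an "RDD" is a list of partitions.

def map_partitions_rdd(partitions, f):
    """Call f once per partition, handing it an iterator over that partition."""
    return [list(f(iter(part))) for part in partitions]


rdd = [[1, 2, 3], [4, 5, 6]]

# Counterpart of f(iter) => iter.filter(_ >= 3) from the text.
kept = map_partitions_rdd(rdd, lambda it: (x for x in it if x >= 3))
print(kept)  # [[3], [4, 5, 6]]

# Because f sees the whole partition, it can also aggregate per partition.
sums = map_partitions_rdd(rdd, lambda it: [sum(it)])
print(sums)  # [[6], [15]]
```

The second call shows what map cannot do on its own: one result per partition instead of one per element.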

(4) glom()

The glom function turns each partition into an array; internally it returns a GlommedRDD.

Each box in figure 3-7 represents an RDD partition. The figure shows the partition containing V1, V2, V3 being turned into the array Array[(V1), (V2), (V3)] by glom.
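As a rough plain-Python illustration (not Spark code, with glom_rdd as a made-up name), glom can be pictured as wrapping the contents of each partition into a single array element:

```python
# Plain-Python model (not Spark): an "RDD" is a list of partitions.

def glom_rdd(partitions):
    """Turn every partition into a partition holding one element: its array."""
    return [[list(part)] for part in partitions]


rdd = [["V1", "V2", "V3"], ["U1", "U2"]]
glommed = glom_rdd(rdd)
print(glommed)  # [[['V1', 'V2', 'V3']], [['U1', 'U2']]]
```

The number of partitions is unchanged, but each one now contains exactly one item.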

2. Many-to-one input and output partitions

(1) union(otherDataset)

When using the union function, the element types of the two RDDs must be the same. The returned RDD has the same element type as the RDDs being merged, and all elements are kept without deduplication. To remove duplicates, use distinct(). The ++ operator is equivalent to union.

The large boxes on the left in figure 3-8 represent the two RDDs, and the small boxes inside them represent their partitions. The large box on the right represents the merged RDD, and the small boxes inside it represent its partitions. The RDD containing V1, V2, ... U4 and the RDD containing V1, V8, ... U8 merge all their elements into one RDD. V1, V1, V2, V8 form one partition, and the other elements are merged in the same way.
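The "no deduplication" behavior can be shown with a short plain-Python sketch. It works on flat element lists only and deliberately does not model how partitions are laid out; union_elements and distinct_elements are hypothetical helpers, not Spark functions.

```python
# Element-level model only; partition layout is intentionally not modeled.

def union_elements(a, b):
    """Keep every element of both inputs, duplicates included."""
    return list(a) + list(b)


def distinct_elements(xs):
    """Drop duplicates while keeping the first occurrence of each element."""
    return list(dict.fromkeys(xs))


first = ["V1", "V2", "U1"]
second = ["V1", "V8", "U5"]

merged = union_elements(first, second)
print(merged)   # ['V1', 'V2', 'U1', 'V1', 'V8', 'U5']

deduped = distinct_elements(merged)
print(deduped)  # ['V1', 'V2', 'U1', 'V8', 'U5']
```

V1 appears twice after the union and only once after the extra deduplication step.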

(2) cartesian(otherDataset)

Performs a Cartesian product over all the elements of the two RDDs. Internally, the operation returns a CartesianRDD.

The large boxes on the left represent the two RDDs, and the small boxes inside them represent their partitions. The large box on the right represents the resulting RDD, and the small boxes inside it represent its partitions. For example, V1 is combined with W1, W2, and Q5 from the other RDD to form (V1, W1), (V1, W2), (V1, Q5).
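The pairing in this example can be reproduced with itertools.product from the Python standard library. This is an element-level sketch, not Spark code, and cartesian_elements is a hypothetical name.

```python
from itertools import product

# Element-level model only; partition layout is intentionally not modeled.

def cartesian_elements(a, b):
    """Pair every element of a with every element of b."""
    return list(product(a, b))


pairs = cartesian_elements(["V1", "V2"], ["W1", "W2", "Q5"])
print(pairs[:3])   # [('V1', 'W1'), ('V1', 'W2'), ('V1', 'Q5')]
print(len(pairs))  # 6
```

The result size is the product of the two input sizes, which is why the operation becomes expensive quickly on large inputs.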

3. Many-to-many input and output partitions

groupBy(func)

A Key is generated for each element by the function, so the data is converted to Key-Value form, and then the elements with the same Key are grouped together.

In the figure, each box represents an RDD partition, and elements with the same key are merged into one group. For example, V1 and V2 are merged into one Key-Value pair whose key is "V" and whose value is "V1, V2", forming (V, Seq(V1, V2)).
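The two steps, deriving a key and then grouping by it, might be sketched in plain Python like this. The helper group_by_elements and the key function (first character of each item) are assumptions made for the illustration, not Spark code.

```python
from collections import defaultdict

# Element-level model only; the shuffle between partitions is not modeled.

def group_by_elements(elements, key_func):
    """Derive a key for each element, then collect elements sharing a key."""
    groups = defaultdict(list)
    for x in elements:
        groups[key_func(x)].append(x)
    return dict(groups)


grouped = group_by_elements(["V1", "U1", "V2", "U2"], lambda s: s[0])
print(grouped)  # {'V': ['V1', 'V2'], 'U': ['U1', 'U2']}
```

Elements that started out anywhere in the input end up together under their key, which is why the partition relationship is many-to-many.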

4. Output partitions are a subset of the input partitions

(1) filter(func)

filter applies the function f to each element; elements for which f returns true are retained in the RDD, and those for which it returns false are filtered out. Internally this is equivalent to generating FilteredRDD(this, sc.clean(f)).

Each box in figure 3-11 represents an RDD partition, and T can be of any type. The user-defined filter function f is applied to each data item, and items for which the result is true are retained. For example, V2 and V3 are filtered out and V1 is retained; the retained item is labeled V'1 to distinguish it.
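A small plain-Python sketch of the subset relationship, again modeling an RDD as a list of partitions. filter_rdd is a hypothetical helper and the even-number predicate is chosen only for the example.

```python
# Plain-Python model (not Spark): an "RDD" is a list of partitions.

def filter_rdd(partitions, pred):
    """Keep only the elements for which pred returns True, per partition."""
    return [[x for x in part if pred(x)] for part in partitions]


rdd = [[1, 2, 3], [4, 5, 6]]
evens = filter_rdd(rdd, lambda v: v % 2 == 0)
print(evens)  # [[2], [4, 6]]

# Every output partition is a subset of the matching input partition.
is_subset = all(set(out) <= set(src) for out, src in zip(evens, rdd))
print(is_subset)  # True
```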

(2) distinct([numTasks])

distinct removes duplicate elements from the RDD.

Each box in figure 3-12 represents an RDD partition, which is deduplicated by the distinct function. For example, the duplicate data V1 and V1 are reduced to a single V1.
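One way to picture deduplication across partitions is the plain-Python sketch below. Placing each surviving element by hash(x) % num_partitions is an assumption of this model, not a statement about how Spark distributes the data, and distinct_rdd is a hypothetical name. Integers are used so that the hash values are stable between runs.

```python
# Plain-Python model (not Spark): an "RDD" is a list of partitions.

def distinct_rdd(partitions, num_partitions):
    """Keep one copy of each element and place it by hash into a bucket."""
    buckets = [[] for _ in range(num_partitions)]
    seen = set()
    for part in partitions:
        for x in part:
            if x not in seen:
                seen.add(x)
                buckets[hash(x) % num_partitions].append(x)
    return buckets


rdd = [[1, 1, 2], [2, 3, 3, 4]]
unique = distinct_rdd(rdd, num_partitions=2)
print(unique)  # [[2, 4], [1, 3]]
```

Duplicates are removed even when the copies sit in different input partitions, as with the value 2 here.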

(3) subtract(other, numPartitions=None)

subtract is the set difference operation: it removes from RDD 1 every element that is in the intersection of RDD 1 and RDD 2.

The large boxes on the left in figure 3-13 represent the two RDDs, and the small boxes inside them represent their partitions. The large box on the right represents the resulting RDD, and the small boxes inside it represent its partitions. V1 is in both RDDs, so by the subtraction rule it is not kept in the new RDD. V2 is in the first RDD but not in the second, so it is included in the new RDD.
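The difference rule from this example can be written as a short plain-Python sketch on flat element lists. subtract_elements is a hypothetical helper, not the Spark function, and the element names beyond V1 and V2 are invented for the illustration.

```python
# Element-level model only; partition layout is intentionally not modeled.

def subtract_elements(a, b):
    """Return the elements of a that do not appear anywhere in b."""
    exclude = set(b)
    return [x for x in a if x not in exclude]


rdd1 = ["V1", "V2", "U1"]
rdd2 = ["V1", "W1"]
difference = subtract_elements(rdd1, rdd2)
print(difference)  # ['V2', 'U1']
```

V1 is dropped because it occurs in both inputs, while W1 has no effect because it was never in the first input.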

(4) sample(withReplacement, fraction, seed=None)

sample draws a sample of the elements in the RDD to obtain a subset. The user can choose whether to sample with replacement, the sampling fraction, and the random seed, which together determine the sampling method.

Internally it generates SampledRDD(withReplacement, fraction, seed).

The withReplacement parameter is set as follows.

withReplacement=true: sampling with replacement.

withReplacement=false: sampling without replacement.

Each box in figure 3-14 is an RDD partition. The sample function samples 50% of the data: from V1, V2, U1, U2, U3, U4 it draws V1, U1, and U2 to form a new RDD.
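The difference between the two withReplacement settings can be illustrated with Python's random module. This is a simplified sketch: the function sample_elements and the two sampling strategies it uses are assumptions chosen for clarity and are not claimed to match Spark's internal samplers.

```python
import random

# Element-level model only; sampling strategies are simplified assumptions.

def sample_elements(elements, with_replacement, fraction, seed=None):
    """Draw roughly `fraction` of the elements, with or without replacement."""
    rng = random.Random(seed)
    if with_replacement:
        # An element may be drawn more than once.
        k = round(fraction * len(elements))
        return rng.choices(elements, k=k)
    # Each element is kept independently, so it appears at most once.
    return [x for x in elements if rng.random() < fraction]


data = ["V1", "V2", "U1", "U2", "U3", "U4"]
without = sample_elements(data, False, 0.5, seed=42)
again = sample_elements(data, False, 0.5, seed=42)   # same seed, same sample
with_repl = sample_elements(data, True, 0.5, seed=42)
print(without, with_repl)
```

With a fixed seed the sample is reproducible. Without replacement, the size of the result varies around the requested fraction rather than matching it exactly.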

(5) takeSample(withReplacement, num, seed=None)

takeSample() works on the same principle as sample above, but instead of sampling by a relative fraction it samples a fixed number of elements. The result is no longer an RDD: it is equivalent to calling collect() on the sampled data, so the result is returned as a local array on a single machine.

The boxes on the left in figure 3-15 represent partitions on the distributed nodes, and the box on the right represents the result array returned on the single machine. Here takeSample is set to sample one data item, and the returned result is V1.
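A plain-Python sketch of the two differences from sample, a fixed count and a local result, is given below. take_sample is a hypothetical helper working on a list-of-partitions model and is not the Spark implementation.

```python
import random

# Plain-Python model (not Spark): an "RDD" is a list of partitions.

def take_sample(partitions, with_replacement, num, seed=None):
    """Return a flat local list of `num` sampled elements, not partitions."""
    rng = random.Random(seed)
    elements = [x for part in partitions for x in part]  # like collect()
    if with_replacement:
        return rng.choices(elements, k=num)
    return rng.sample(elements, k=min(num, len(elements)))


rdd = [["V1", "V2"], ["U1", "U2"], ["U3", "U4"]]
picked = take_sample(rdd, False, 1, seed=7)
print(picked)  # a one-element list such as ['V1']; the item depends on the seed
```

The return value is an ordinary list, so later steps run locally rather than as distributed operations.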

5. Cache type

(1) cache

cache caches the RDD elements from disk into memory; it is equivalent to persist(MEMORY_ONLY).

Each box in figure 3-16 represents an RDD partition. On the left, the data partitions are stored on disk; the cache operator caches them in memory.
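The effect of caching can be modeled with a small plain-Python class. Everything here is a made-up illustration, not Spark code: CachedRDD, load_from_disk, and the dictionary standing in for disk storage are all hypothetical.

```python
# Plain-Python model (not Spark): partitions are loaded on demand.

class CachedRDD:
    def __init__(self, load_partition, num_partitions):
        self._load = load_partition      # reads one partition "from disk"
        self._num_partitions = num_partitions
        self._memory = {}                # partition index -> cached data
        self._cache_enabled = False

    def cache(self):
        """Mark the data for caching; nothing is loaded until an action runs."""
        self._cache_enabled = True
        return self

    def collect(self):
        """Stand-in for an action: gather all elements into one local list."""
        out = []
        for i in range(self._num_partitions):
            if i in self._memory:
                part = self._memory[i]   # served from memory
            else:
                part = self._load(i)     # read from "disk"
                if self._cache_enabled:
                    self._memory[i] = part
            out.extend(part)
        return out


disk = {0: ["V1", "V2"], 1: ["U1", "U2"]}
disk_reads = []


def load_from_disk(index):
    disk_reads.append(index)
    return list(disk[index])


rdd = CachedRDD(load_from_disk, 2).cache()
first = rdd.collect()
second = rdd.collect()
print(len(disk_reads))  # 2: the second collect() is served from memory
```

In this model the cache is filled during the first action, and later actions reuse the in-memory copies.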

(2) persist(storageLevel=StorageLevel(False, True, False, False, 1))

The persist function caches the RDD. Where the data is cached is determined by the StorageLevel enumeration type, which offers several combinations (see figure 3-17). DISK stands for disk, MEMORY stands for memory, and SER indicates that the data is stored in serialized form.

StorageLevel is an enumerated type that represents the storage mode; the user can choose a level from figure 3-17 as needed.

The modes that persist can use are listed in figure 3-17. For example, MEMORY_AND_DISK_SER means the data can be stored in memory and on disk, in serialized form. The other levels follow the same naming pattern. In the figure, each box represents an RDD partition; disk means stored on disk and mem means stored in memory. Initially all the data is stored on disk, and persist(MEMORY_AND_DISK) caches it into memory, but memory cannot hold every partition. For example, in figure 3-18, the partition containing V1, V2, V3 is stored on disk, while the partition containing U1, U2 is still stored in memory.
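The spill-to-disk behavior in this example can be sketched in plain Python. This is a toy model and not Spark code: the StorageLevel namedtuple only mirrors the flag order used in the signature above (disk, memory, off-heap, deserialized, replication), its field names are assumptions, and memory_capacity is a made-up limit counted in elements.

```python
from collections import namedtuple

# Toy model (not Spark). Field names are assumptions for the illustration.
StorageLevel = namedtuple(
    "StorageLevel",
    "use_disk use_memory use_off_heap deserialized replication",
)

MEMORY_ONLY = StorageLevel(False, True, False, False, 1)
MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)


def persist(partitions, level, memory_capacity):
    """Place each partition in memory if it fits, otherwise on disk if allowed."""
    memory, disk, used = [], [], 0
    for part in partitions:
        if level.use_memory and used + len(part) <= memory_capacity:
            memory.append(part)
            used += len(part)
        elif level.use_disk:
            disk.append(part)
        # With neither option available the partition is simply not stored.
    return memory, disk


rdd = [["V1", "V2", "V3"], ["U1", "U2"]]

in_memory, on_disk = persist(rdd, MEMORY_AND_DISK, memory_capacity=2)
print(in_memory, on_disk)  # [['U1', 'U2']] [['V1', 'V2', 'V3']]

only_memory, nothing_on_disk = persist(rdd, MEMORY_ONLY, memory_capacity=2)
print(only_memory, nothing_on_disk)  # [['U1', 'U2']] []
```

With the memory-only level, the partition that does not fit is not stored anywhere in this model, whereas the memory-and-disk level keeps it on disk.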

