What are the types of RDD


This article mainly explains the types of RDD. The content is simple and clear and easy to learn and understand. Please follow along to study the types of RDD.

I. Definition of RDD

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection of elements (data and metadata) that can be computed on in parallel. Its key characteristics are automatic fault tolerance, location-aware scheduling, and scalability.
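As a minimal sketch of what this abstraction looks like in practice (assuming a local Spark installation; the application name, master URL, and file path are illustrative), an RDD can be created from an in-memory collection or from an external file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup, for illustration only.
val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
val sc = new SparkContext(conf)

// An RDD from an in-memory collection, with an explicit partition count.
val nums = sc.parallelize(1 to 10, numSlices = 4)

// An RDD from an external file (the path here is hypothetical).
val lines = sc.textFile("hdfs:///tmp/input.txt")

println(nums.count()) // 10: forces evaluation of the collection-backed RDD
```

The later sketches in this article reuse this `sc` context.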

II. Properties of RDD

1. A list of partitions, i.e., the basic units of the dataset. Each partition is processed by one computing task, and the number of partitions determines the granularity of parallel computation. You can specify the number of partitions when creating an RDD; if you do not, a default value is used, namely the number of CPU cores assigned to the program.

2. A function for computing each partition. Spark computes RDDs partition by partition, and each RDD implements a compute function for this purpose. The compute function composes iterators, eliminating the need to save the result of each intermediate computation.

3. Dependencies between RDDs. Each transformation of an RDD generates a new RDD, so RDDs form a pipeline-like chain of dependencies. This is what enables fault-tolerant processing: when some partition data is lost, Spark can recompute just the lost partitions through these dependencies instead of recomputing all partitions of the RDD.

4. A Partitioner, i.e., the partitioning function of the RDD. Spark currently implements two kinds of partitioning functions: the hash-based HashPartitioner and the range-based RangePartitioner. A Partitioner exists only for key-value RDDs; for non-key-value RDDs the value of Partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself, but also the number of partitions in the shuffle output of the parent RDD.

5. A list of preferred locations, i.e., the priority locations for storing and accessing each partition. For an HDFS file, this list holds the location of the block where each partition resides. Following the principle that "moving computation is cheaper than moving data", Spark assigns computing tasks to the storage locations of the data blocks they process whenever possible when scheduling tasks. (A small sketch inspecting some of these properties follows this list.)
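As a hedged sketch (reusing the `sc` context assumed earlier), several of these properties can be inspected directly on an RDD:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

println(pairs.getNumPartitions) // 2: the list of partitions (property 1)
println(pairs.partitioner)      // None: a plain RDD has no partitioner yet

// After partitionBy, the RDD carries a HashPartitioner (property 4).
val hashed = pairs.partitionBy(new HashPartitioner(4))
println(hashed.partitioner)     // Some(org.apache.spark.HashPartitioner@...)

// toDebugString prints the lineage, i.e., the dependencies (property 3).
println(hashed.toDebugString)
```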

III. Types of RDD

1. Transformation -> records the computation process (the parameters and the computation method) without executing it. (A combined usage sketch follows the table below.)

Transformation -> Meaning

map(func): Returns a new RDD formed by passing each element of the input through the function func.

filter(func): Returns a new RDD formed by selecting those input elements on which func returns true.

flatMap(func): Similar to map, but each input element can be mapped to 0 or more output elements (so func should return a sequence rather than a single element).

mapPartitions(func): Similar to map, but runs independently on each partition of the RDD, so when running on an RDD of type T, func must be of type Iterator[T] => Iterator[U].

mapPartitionsWithIndex(func): Similar to mapPartitions, but func takes an extra integer argument representing the index of the partition, so when running on an RDD of type T, func must be of type (Int, Iterator[T]) => Iterator[U].

sample(withReplacement, fraction, seed): Samples the data according to the proportion specified by fraction; you can choose whether to sample with replacement. seed specifies the random number generator seed.

union(otherDataset): Returns a new RDD that is the union of the source RDD and the argument RDD.

intersection(otherDataset): Returns a new RDD that is the intersection of the source RDD and the argument RDD (the corresponding difference operation is subtract).

distinct([numTasks]): Removes duplicates from the source RDD and returns a new RDD (the optional numTasks changes the number of partitions).

groupByKey([numTasks]): Called on an RDD of (K, V) pairs; returns an RDD of (K, Iterable[V]) pairs.

reduceByKey(func, [numTasks]): Called on an RDD of (K, V) pairs; returns an RDD of (K, V) pairs in which the values of each key are aggregated using the given reduce function. As with groupByKey, the number of reduce tasks can be set via the optional second argument.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): Called on an RDD of (K, V) pairs; aggregates the values of each key using the given combine functions and a neutral zero value.

sortByKey([ascending], [numTasks]): Called on an RDD of (K, V) pairs where K must implement the Ordered interface; returns an RDD of (K, V) pairs sorted by key.

sortBy(func, [ascending], [numTasks]): Similar to sortByKey, but more flexible.

join(otherDataset, [numTasks]): Called on RDDs of types (K, V) and (K, W); returns an RDD of (K, (V, W)) pairs with all pairs of elements for each key.

cogroup(otherDataset, [numTasks]): Called on RDDs of types (K, V) and (K, W); returns an RDD of type (K, (Iterable[V], Iterable[W])).

cartesian(otherDataset): Returns the Cartesian product of the two RDDs.

pipe(command, [envVars]): Pipes each partition of the RDD through a shell command.

coalesce(numPartitions): Decreases the number of partitions in the RDD to numPartitions.

repartition(numPartitions): Reshuffles the data in the RDD to create more or fewer partitions (repartitioning).

repartitionAndSortWithinPartitions(partitioner): Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys.
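As a hedged sketch of how several of these transformations chain together (a word-count-style pipeline reusing the `sc` context assumed earlier; the input data is made up for illustration):

```scala
val text = sc.parallelize(Seq("a b a", "b c"))

val counts = text
  .flatMap(line => line.split(" ")) // flatMap: one line -> many words
  .map(word => (word, 1))           // map: word -> (word, 1) pair
  .reduceByKey(_ + _)               // reduceByKey: sum the counts per key
  .sortByKey()                      // sortByKey: order the results by word

// aggregateByKey computes the same sums with an explicit zero value:
val counts2 = text
  .flatMap(_.split(" "))
  .map((_, 1))
  .aggregateByKey(0)(_ + _, _ + _)

// Nothing has executed yet: transformations only record the computation.
```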

2. Action -> triggers a job (one job corresponds to one action operator). (A usage sketch follows the table below.)

Action -> Meaning

reduce(func): Aggregates all elements of the RDD with the function func, which must be commutative and associative so that it can be computed correctly in parallel.

collect(): Returns all elements of the dataset as an array to the driver.

count(): Returns the number of elements in the RDD.

first(): Returns the first element of the RDD (similar to take(1)).

take(n): Returns an array of the first n elements of the dataset.

takeSample(withReplacement, num, [seed]): Returns an array of num elements randomly sampled from the dataset; you can choose whether to sample with replacement. seed specifies the random number generator seed.

takeOrdered(n, [ordering]): Similar to top, except that the elements are returned in the opposite order to top (ascending by default).

saveAsTextFile(path): Saves the elements of the dataset as a text file to the HDFS file system or another supported file system. For each element, Spark calls its toString method to turn it into a line of text in the file.

saveAsSequenceFile(path): Saves the elements of the dataset in Hadoop SequenceFile format to a specified directory on HDFS or another file system supported by Hadoop.

saveAsObjectFile(path): Saves the elements of the dataset to the given path using Java serialization; they can later be loaded with SparkContext.objectFile.

countByKey(): For an RDD of type (K, V), returns a map of (K, Int) pairs giving the number of elements for each key.

foreach(func): Runs the function func on each element of the dataset, typically for side effects such as updating an accumulator or writing to external storage.
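A brief sketch of a few of these actions, continuing with the `counts` RDD from the previous sketch (the printed values assume that example's input; the output path is hypothetical):

```scala
println(counts.count()) // 3: the number of distinct words
println(counts.first()) // (a,2): the first (K, V) pair after sortByKey
counts.collect().foreach(println) // bring all results back to the driver

// reduce: total number of words across all keys (2 + 2 + 1 = 5).
val total = counts.map(_._2).reduce(_ + _)
println(total)

counts.saveAsTextFile("/tmp/wordcount-output") // hypothetical output path
```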

Thank you for reading. That concludes "what are the types of RDD". After studying this article, you should have a deeper understanding of the types of RDD; their specific use still needs to be verified in practice.
