
What Are the Common RDD Operations of Spark in Big Data Development?


What are the common RDD operations of Spark in big data development? This article gives a detailed analysis and answer to that question, hoping to help readers who want to solve this problem find a simple and workable approach.

1. Five Basic Properties

A list of partitions

A function for computing each split

A list of dependencies on other RDDs

Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

These five properties are written in the comments of the RDD source code. Each of them is explained below.

1.1 Partitions

A list of partitions (Partition), the basic units of the dataset. For an RDD, each partition is processed by one compute task, and the partitioning determines the granularity of parallel computation. You can specify the number of partitions when creating an RDD; if you do not, a default value is used.
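For example (a minimal sketch, assuming an existing SparkContext named sc), the partition count can be given at creation time and inspected afterwards:

val rdd = sc.parallelize(1 to 100, 4)     // explicitly request 4 partitions
println(rdd.getNumPartitions)             // 4
val rddDefault = sc.parallelize(1 to 100) // no count given: falls back to the default parallelism
println(rddDefault.getNumPartitions)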

1.2 Compute Function

A function for computing each partition's data. Spark computes RDDs partition by partition, and every RDD implements a compute function for this purpose. The compute function chains iterators together and does not need to materialize the result of each intermediate computation.

1.3 Dependencies

RDDs depend on one another. Each transformation of an RDD generates a new RDD, so RDDs form a pipelined lineage. When some partition data is lost, Spark can recompute the lost partitions through this dependency instead of recomputing all partitions of the RDD.

1.4 Partitioner

For key-value RDDs, there may be a Partitioner. Spark implements two kinds of partitioning functions: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs can have a Partitioner; for non-key-value RDDs the Partitioner is None. The Partitioner function determines not only the number of partitions of the RDD itself, but also the number of partitions of the parent RDD's shuffle output.
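A small sketch (again assuming an existing SparkContext sc) showing a key-value RDD acquiring a Partitioner:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
println(pairs.partitioner)                             // None: no Partitioner yet
val hashed = pairs.partitionBy(new HashPartitioner(4)) // hash-partition into 4 partitions
println(hashed.partitioner)                            // Some(...) wrapping the HashPartitioner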

1.5 Preferred Locations

A list that stores the preferred location (preferred location) of each Partition. For an HDFS file, this list holds the location of each Partition's block. Following the idea that it is cheaper to move computation than to move data, Spark tries its best, when scheduling tasks, to assign each computing task to the storage location of the data block it will process.
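You can query these locations directly (a sketch; the HDFS path below is hypothetical, and for an RDD built from a local collection the list is typically empty):

val rdd = sc.textFile("hdfs://namenode:8020/path/to/file") // hypothetical path
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
}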

2. Common RDD Operators

Building on the basic properties above: the programs written at work mostly consist of creating RDDs, transforming RDDs, and executing operators on RDDs, and creating an RDD usually means letting data from an external system flow into the Spark cluster. Creating RDDs from local collections is generally only used during testing, so it is not covered in detail here. Transformations of an RDD correspond to a special class of operators called Transformations, which are lazily evaluated. The operations that trigger Transformations to execute, typically collecting results, printing them, or returning a value, as well as writing output from the cluster to other systems, go by the professional name Action.
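The division of labor is easy to see in a few lines (a minimal sketch, assuming an existing SparkContext sc and a hypothetical input file): nothing is computed until an Action runs:

val lines = sc.textFile("data.txt")  // hypothetical file; nothing is read yet
val lengths = lines.map(_.length)    // Transformation: lazy, only records the lineage
val total = lengths.reduce(_ + _)    // Action: triggers a Job and returns a value to the driver
lengths.saveAsTextFile("out")        // Action: writes from the cluster to an external system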

2.1 Common Transformation Operators

Transformation operators convert one RDD into another RDD; they correspond to built-in compute functions. These operators divide into wide-dependency and narrow-dependency operators according to whether or not they involve a shuffle.

2.1.1 The Difference Between Wide and Narrow Dependencies

Articles online generally offer two versions. One works from the definition: whether one parent RDD partition is depended on by multiple child partitions. The other looks for a Shuffle: a wide dependency if there is one, a narrow dependency if there is not. The first is reliable; the second uses the thing itself as the criterion (the shuffle exists because the dependency is wide), so it has no reference value. Section 2.1.3 shows how to actually tell them apart.

2.1.2 Common Wide- and Narrow-Dependency Operators

Common narrow-dependency operators

map(func): applies func to each element of the dataset and returns a new RDD

filter(func): applies func to each element of the dataset and returns an RDD containing the elements for which func is true

flatMap(func): like map, but each input element is mapped to 0 or more output elements

mapPartitions(func): much like map, but where map applies func to each element, mapPartitions applies func to a whole partition. If an RDD has N elements and M partitions (N >> M), map's function is called N times, while the function passed to mapPartitions is called only M times, processing all the elements of a partition at once

mapPartitionsWithIndex(func): like mapPartitions, but func additionally receives the index of the partition

glom(): gathers each partition into an array, forming a new RDD of type RDD[Array[T]]

sample(withReplacement, fraction, seed): sampling operator. Samples a fraction of the data using the given random seed; withReplacement indicates whether sampled data is put back: true means sampling with replacement, false without

coalesce(numPartitions, false): no shuffle; generally used to reduce the number of partitions

union(otherRDD): computes the union of two RDDs

cartesian(otherRDD): computes the Cartesian product

zip(otherRDD): combines two RDDs into one key-value RDD. By default the two RDDs must have the same number of partitions and the same number of elements, otherwise an exception is thrown

The difference between map and mapPartitions: map processes one element at a time, while mapPartitions processes one partition at a time, and a partition's data can only be released after the whole partition has been processed, so insufficient memory can easily lead to OOM. Best practice: when memory is sufficient, prefer mapPartitions for better efficiency; see the sketch below.
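A minimal sketch of the contrast (assuming an existing SparkContext sc); mapPartitions pays off when per-partition setup, such as opening a database connection, is too expensive to repeat for every element:

val rdd = sc.parallelize(1 to 10, 2)
// map: the function runs once per element (10 times here)
val viaMap = rdd.map(_ * 2)
// mapPartitions: the function runs once per partition (2 times here)
val viaPartitions = rdd.mapPartitions { iter =>
  // per-partition setup (e.g. a connection) would go here
  iter.map(_ * 2)
}
println(viaPartitions.collect().mkString(","))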

Common wide-dependency operators

groupBy(func): groups elements by the return value of the given function; values with the same key are put into one iterator

distinct([numTasks]): deduplicates the RDD's elements and returns a new RDD. The numTasks parameter can be passed to change the number of partitions

coalesce(numPartitions, true): with shuffle. Whether increasing or decreasing partitions, repartition is generally used instead

repartition(numPartitions): increases or decreases the number of partitions, with shuffle

sortBy(func, [ascending], [numTasks]): processes the data with func and sorts by the processed result

intersection(otherRDD): computes the intersection of two RDDs

subtract(otherRDD): computes the difference of two RDDs
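A few of the wide-dependency operators above in action (a minimal sketch, assuming an existing SparkContext sc):

val nums = sc.parallelize(Seq(3, 1, 2, 3, 2), 2)
nums.distinct().collect()                        // Array(1, 2, 3), order not guaranteed
nums.groupBy(_ % 2).collect()                    // groups by parity; equal keys share one iterator
nums.sortBy(x => x, ascending = false).collect() // Array(3, 3, 2, 2, 1)
nums.repartition(4).getNumPartitions             // 4, via a shuffle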

2.1.3 How to Distinguish Wide from Narrow Dependencies

For an operator you cannot reason about, my suggestion is the practitioner's approach: look at the dependency graph in Spark's history server and check whether a new Stage is split off. If it is, the dependency is wide; if not, it is narrow. When a colleague or classmate asks about the problem, you can show them the code and then the dependency graph. As someone who keeps theory and practice in parallel, I will add a second way to tell. It starts from the definition: is a partition of the parent RDD depended on by multiple child partitions? From that angle, think about whether the data of a single parent partition may flow into different child RDD partitions. Take distinct or sortBy: global deduplication and global sorting. Suppose 1, 2, 3 start out in one partition. After map(x => (x, null)).reduceByKey((x, y) => x).map(_._1), which is how distinct is implemented, the number of partitions may be unchanged, but each partition has to see the data of the other partitions to know whether to keep an element. Between the input and output partitions the data must be gathered and reorganized, so there must be a shuffle. sortBy is the same.
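If you would rather check in code than in the history server UI, the dependency type answers directly (a sketch, assuming an existing SparkContext sc):

val rdd = sc.parallelize(Seq(1, 2, 3, 3), 2)
val pairs = rdd.map(x => (x, null))
println(pairs.dependencies)                  // List(...OneToOneDependency...): narrow
val reduced = pairs.reduceByKey((x, _) => x)
println(reduced.dependencies)                // List(...ShuffleDependency...): wide
println(rdd.distinct().toDebugString)        // each new indentation level marks a shuffle / Stage boundary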

2.2 Common Action Operators

An Action triggers a Job: a Spark application (Driver program) runs as many Jobs as it contains Action operators. Typical Action operators: collect / count. For example, the call chain of collect is collect() => sc.runJob() => ... => dagScheduler.runJob(), which triggers the Job.

collect() / collectAsMap()

stats / count / mean / stdev / max / min

reduce(func) / fold(func) / aggregate(func)

first(): returns the first element of the RDD

take(n): returns the first n elements of the RDD

top(n): returns the first n elements according to the default ordering (descending) or a specified ordering

takeSample(withReplacement, num, [seed]): returns a sample of the data

foreach(func) / foreachPartition(func): like map and mapPartitions, except that foreach is an Action

saveAsTextFile(path) / saveAsSequenceFile(path) / saveAsObjectFile(path): save the RDD to external storage
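A few Actions side by side (a minimal sketch, assuming an existing SparkContext sc); each call below triggers its own Job:

val rdd = sc.parallelize(Seq(5, 1, 4, 2, 3))
rdd.collect()        // Array(5, 1, 4, 2, 3): brings all data to the driver
rdd.count()          // 5
rdd.first()          // 5
rdd.take(2)          // Array(5, 1)
rdd.top(2)           // Array(5, 4): descending order by default
rdd.reduce(_ + _)    // 15
rdd.foreach(println) // runs on the executors, not on the driver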

3. Common PairRDD Operations

RDDs fall into Value types and Key-Value types. Everything introduced so far operates on Value-type RDDs; in practice, key-value RDDs, also called PairRDDs, are used even more. Operations on Value-type RDDs are mostly defined in RDD.scala, while key-value RDD operations are concentrated in PairRDDFunctions.scala.

Most of the operators introduced above also work on PairRDDs. When an RDD's elements are key-value pairs, it is implicitly converted to a PairRDD. A PairRDD additionally has its own Transformation and Action operators.

3.1.1 Map-Like Transformation Operations

mapValues / flatMapValues / keys / values: all of these can be implemented with map; they are shorthand operations.
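For instance (a sketch, assuming an existing SparkContext sc), mapValues is simply a map that leaves the key untouched:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.mapValues(_ * 10).collect()                  // Array((a,10), (b,20))
pairs.map { case (k, v) => (k, v * 10) }.collect() // the same thing spelled out with map
pairs.keys.collect()                               // Array(a, b)
pairs.values.collect()                             // Array(1, 2)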

3.1.2 Aggregation Operations [important and difficult]

PairRDD (k, v) aggregations are widely used: groupByKey / reduceByKey / foldByKey / aggregateByKey. combineByKey (old) / combineByKeyWithClassTag (new) are the underlying implementation. subtractByKey: like subtract, it removes elements whose key also appears in the other RDD.

Conclusion: where their efficiency is equal, use the method you are most familiar with; groupByKey, however, is generally inefficient and should be used as little as possible. The sketch below shows why.
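Word count is the classic comparison (a sketch, assuming an existing SparkContext sc): reduceByKey combines values on the map side before shuffling, while groupByKey shuffles every raw pair, which is why it is generally slower:

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map((_, 1))
val viaReduce = words.reduceByKey(_ + _)                       // map-side combine, then shuffle
val viaGroup  = words.groupByKey().mapValues(_.sum)            // shuffles every ("word", 1) pair
println(viaReduce.collect().toMap == viaGroup.collect().toMap) // true: same result, different cost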

3.1.3 Sort Operations

sortByKey: the sortByKey function acts on a PairRDD and sorts it by Key.
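For example (a sketch, assuming an existing SparkContext sc):

val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))
pairs.sortByKey().collect()                  // Array((1,a), (2,b), (3,c))
pairs.sortByKey(ascending = false).collect() // Array((3,c), (2,b), (1,a))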

3.1.4 Join Operations

cogroup / join / leftOuterJoin / rightOuterJoin / fullOuterJoin

val rdd1 = sc.makeRDD(Array((1, "Spark"), (2, "Hadoop"), (3, "Kylin"), (4, "Flink")))
val rdd2 = sc.makeRDD(Array((3, "Li Si"), (4, "Wang Wu"), (5, "Zhao Liu"), (6, "Feng Qi")))
val rdd3 = rdd1.cogroup(rdd2)
rdd3.collect.foreach(println)
// keep only the keys present in both RDDs
rdd3.filter { case (_, (v1, v2)) => v1.nonEmpty && v2.nonEmpty }.collect
// imitate the source code to implement the join operation
rdd3.flatMapValues { pair =>
  for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
}.collect
