

What are the dependencies and caching of Spark RDD, the distributed dataset


This article explains in detail the dependencies and caching of Spark RDD, the distributed dataset. It is shared here as a reference; I hope you will have a good understanding of the topic after reading it.

Introduction to RDD

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitioned collection of elements that can be computed in parallel. Concretely, RDD is a class in Spark.

Properties of RDD

1. A list that stores the preferred location of each Partition. For an HDFS file, this list holds the location of the block where each Partition resides. Following the principle that "moving computation is cheaper than moving data", Spark tries, when scheduling, to assign each computing task to the storage location of the data block it needs to process.

2. A function for computing each partition, applied to every data block. RDDs in Spark are computed partition by partition, and each RDD implements a compute function for this purpose. The compute function composes iterators, so intermediate results do not need to be stored.

3. Dependencies between RDDs. Each transformation of an RDD generates a new RDD, so RDDs form pipeline-like front-to-back dependencies. When part of a partition's data is lost, Spark can recompute just the lost partition through these dependencies instead of recomputing all partitions of the RDD.

4. A Partitioner, the partitioning function of the RDD. There are two kinds: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non-key-value RDDs the Partitioner is None. The Partitioner determines not only the number of partitions of the RDD itself, but also the number of partitions of the parent RDD's shuffle output.

5. A list of partitions (Partition), the basic units of the dataset. Each partition is processed by one computing task, which determines the granularity of parallel computation. You can specify the number of partitions when creating an RDD; if you do not, the default is used, which is the number of CPU cores assigned to the program.
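The five properties above can be inspected directly on an RDD. Below is a minimal sketch (the app name, master, and sample data are illustrative, not from the original article):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("rdd-properties").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // A key-value RDD built with 4 partitions and a shuffle
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
                  .reduceByKey(_ + _)

    println(pairs.partitions.length)  // property 5: the list of partitions
    println(pairs.partitioner)        // property 4: Some(HashPartitioner) for key-value RDDs
    println(pairs.dependencies)       // property 3: the dependency on the parent RDD
    // property 1: preferred locations are empty for in-memory collections,
    // but hold HDFS block locations for file-backed RDDs
    println(pairs.preferredLocations(pairs.partitions(0)))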

How to create an RDD

1. By parallelizing a collection (parallelize, makeRDD)

2. By reading an external data source (textFile)

3. By applying a transformation to another RDD, which yields a new RDD (see the sketch below)
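A short sketch of the three routes, assuming an existing SparkContext sc; the HDFS path is a placeholder:

    val fromCollection = sc.parallelize(1 to 100)              // 1. parallelize a collection
    val fromMakeRDD    = sc.makeRDD(Seq("a", "b", "c"))        // 1. makeRDD variant
    val fromFile       = sc.textFile("hdfs:///path/to/input")  // 2. read an external source
    val fromOther      = fromCollection.map(_ * 2)             // 3. transform an existing RDD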

The two kinds of RDD operators:

1. Transformation

map(func): returns a new distributed dataset consisting of each original element transformed by the function func

filter(func): returns a new dataset consisting of the original elements for which func returns true

flatMap(func): similar to map, but each input element can be mapped to 0 or more output elements (so func should return a Seq rather than a single element)

sample(withReplacement, fraction, seed): samples the data at the proportion given by fraction, with or without replacement; seed specifies the random number generator seed

union(otherDataset): returns a new dataset formed by the union of the source dataset and the argument

reduceByKey(func, [numTasks]): used on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs in which all values of the same key are aggregated using the given reduce function. As with groupByKey, the number of tasks can be configured through the optional second argument.

join(otherDataset, [numTasks]): called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

groupWith(otherDataset, [numTasks]): called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples. In other frameworks this operation is called CoGroup.

cartesian(otherDataset): Cartesian product. When called on datasets of types T and U, returns a dataset of (T, U) pairs covering all combinations of elements.

intersection(otherDataset): returns a new RDD containing the intersection of the source RDD and the argument RDD

distinct([numTasks]): deduplicates the source RDD and returns a new RDD

groupByKey([numTasks]): called on a (K, V) RDD, returns a (K, Iterable[V]) RDD.

reduceByKey(func, [numTasks]): called on a (K, V) RDD, returns a (K, V) RDD in which the values of the same key are aggregated using the given reduce function. As with groupByKey, the number of reduce tasks can be set through the optional second argument.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): aggregates the values of each key using the given combine functions and a neutral zero value

sortByKey([ascending], [numTasks]): called on a (K, V) RDD where K implements the Ordered interface, returns an RDD sorted by key.

sortBy(func, [ascending], [numTasks]): similar to sortByKey, but more flexible

join(otherDataset, [numTasks]): called on RDDs of type (K, V) and (K, W), returns a (K, (V, W)) RDD with all pairs of elements for each key.

cogroup(otherDataset, [numTasks]): called on RDDs of type (K, V) and (K, W), returns a (K, (Iterable[V], Iterable[W])) RDD.
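To make the transformation list concrete, here is a small word-count-style pipeline (the sample data is illustrative); every step below is lazy and only describes a new RDD:

    val lines  = sc.parallelize(Seq("spark rdd", "spark cache"))
    val words  = lines.flatMap(_.split(" "))   // flatMap: one line -> many words
    val pairs  = words.map(w => (w, 1))        // map: word -> (word, 1)
    val counts = pairs.reduceByKey(_ + _)      // reduceByKey: sum the 1s per key
    val sorted = counts.sortByKey()            // sortByKey: order the result by word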

2. Action

reduce(func): aggregates all the elements of the RDD through the function func. The function must be commutative and associative so that it can be computed correctly in parallel.

collect(): returns all elements of the dataset as an array at the driver

count(): returns the number of elements in the RDD

first(): returns the first element of the RDD (similar to take(1))

take(n): returns an array with the first n elements of the dataset

takeSample(withReplacement, num, [seed]): returns an array of num elements sampled randomly from the dataset, with or without replacement; seed specifies the random number generator seed

takeOrdered(n, [ordering]): returns the first n elements of the RDD using either their natural order or a custom ordering

saveAsTextFile(path): saves the elements of the dataset as a text file to HDFS or another supported file system. For each element, Spark calls its toString method to convert it to a line of text in the file.

saveAsSequenceFile(path): saves the elements of the dataset in Hadoop SequenceFile format to a given directory in HDFS or another file system supported by Hadoop.

saveAsObjectFile(path): saves the elements of the dataset as serialized Java objects, which can later be loaded with objectFile()

countByKey(): for a (K, V) RDD, returns a map of (K, Int) pairs with the number of elements for each key.

foreach(func): runs the function func on each element of the dataset, usually for side effects such as updating an accumulator.
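Unlike transformations, actions trigger an actual job. A short sketch (the output path is a placeholder):

    val nums = sc.parallelize(1 to 10, 2)
    println(nums.count())                 // 10
    println(nums.first())                 // 1, same as take(1)(0)
    println(nums.take(3).mkString(","))   // 1,2,3
    println(nums.reduce(_ + _))           // 55; the function must be commutative and associative
    nums.saveAsTextFile("hdfs:///tmp/nums-out")  // writes one text file per partition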

Dependencies between RDDs

1. Narrow dependency

A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD.

Conclusion: an intuitive metaphor for a narrow dependency is an only child.

2. Wide dependency

A wide dependency means that multiple partitions of the child RDD depend on the same partition of the parent RDD.

Conclusion: an intuitive metaphor for a wide dependency is having many children, since one parent partition feeds several child partitions.
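The two kinds of dependency can be seen side by side in a couple of lines (the data is illustrative):

    val src    = sc.parallelize(1 to 8, 4)
    val narrow = src.map(_ * 2)           // narrow: each child partition reads
                                          // exactly one parent partition
    val wide   = src.map(n => (n % 2, n))
                    .reduceByKey(_ + _)   // wide: the shuffle makes each child
                                          // partition read from many parents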

3. Lineage

RDDs support only coarse-grained transformations, that is, single operations applied to many records at once. To be able to recover lost partitions, an RDD records the Lineage that created it: its metadata and the series of transformations it went through. When part of the RDD's partition data is lost, this information is enough to recompute and recover the lost partitions.

Generation of DAG

DAG stands for Directed Acyclic Graph. The original RDDs form a DAG through a series of transformations, and the DAG is divided into different Stages according to the dependencies between the RDDs. For narrow dependencies, partition transformations are pipelined within a single stage. For wide dependencies, because of the shuffle, the next computation can start only after the parent RDD has been fully processed, so wide dependencies are the basis for dividing stages.
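The stage split can be observed with toDebugString, which prints the lineage with an extra indentation level at each shuffle boundary (the input path is a placeholder):

    val counts = sc.textFile("hdfs:///path/to/input")
                   .flatMap(_.split(" "))
                   .map((_, 1))
                   .reduceByKey(_ + _)   // wide dependency: stage boundary
    println(counts.toDebugString)        // indented lineage, one level per stage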

Caching of RDD

One of the reasons Spark is so fast is that it can persist (or cache) a dataset in memory across operations. When an RDD is persisted, each node stores its computed partitions in memory and reuses them in other actions on that RDD or on RDDs derived from it, which makes subsequent actions much faster. Persistence and caching are among the most important features of Spark; it is fair to say that caching is the key to building iterative algorithms and fast interactive queries with Spark.

One purpose of dividing stages according to dependencies is to decide where to cache. How do you place caches based on the stage division?

(1) For a narrow dependency, use cache.

(2) For a wide dependency, use checkpoint.

How do you set up cache and checkpoint?

Cache: someRDD.cache() caches the RDD in memory.

someRDD.persist(StorageLevel.MEMORY_AND_DISK): sets the storage level (here memory and disk) according to your needs.

Checkpoint: stores the data computed by the RDD on the local disk or on HDFS.

sc.setCheckpointDir("hdfs://hadoop1:9000/checkpoint") sets the checkpoint path; call it before the wide dependency.

someRDD.checkpoint() marks the RDD for checkpointing.
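Putting the above together, an end-to-end sketch (the checkpoint URI follows the example value in the text; the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("hdfs://hadoop1:9000/checkpoint")  // before the wide dependency

    val counts = sc.textFile("hdfs:///path/to/input")
                   .flatMap(_.split(" "))
                   .map((_, 1))
                   .reduceByKey(_ + _)

    counts.cache()                                    // memory only
    // counts.persist(StorageLevel.MEMORY_AND_DISK)  // alternative: memory and disk
    counts.checkpoint()                               // write to the checkpoint directory

    counts.count()  // the first action materializes the cache and the checkpoint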

The difference between cache and checkpoint

Cache only caches the data and does not change the RDD's dependencies. Checkpoint generates a new RDD, and later RDDs depend on that new RDD, so the dependencies do change. Order of data recovery: checkpoint -> cache -> recomputation.

This concludes the discussion of the dependencies and caching of Spark RDD, the distributed dataset. I hope the content above has been helpful and lets you learn more. If you found the article useful, feel free to share it so more people can see it.
