Many people who are new to Spark find its core concept, the RDD, hard to understand. This article summarizes what an RDD is and how it works; hopefully it will help you make sense of it.
RDD stands for Resilient Distributed Dataset. It is a distributed memory abstraction that represents a read-only collection of record partitions, and it can only be created from data in stable storage or from other RDDs through transformations. For this purpose, RDD supports a rich set of transformation operations (such as map, join, filter, groupBy, etc.). A new RDD produced by a transformation contains the information needed to derive it from other RDDs, so dependencies exist between RDDs.
Based on these dependencies, RDDs form a directed acyclic graph (DAG), which describes the entire flow of the computation. In practice, fault tolerance in RDD is achieved through this lineage: even if partition data is lost, the partitions can be rebuilt from the lineage.
To sum up, an RDD-based computing task can be described as: load records from stable physical storage (such as a distributed file system), pass them through a DAG made up of a set of deterministic operations, and write the results back to stable storage. In addition, RDD can cache datasets in memory so that a dataset can be reused across multiple operations. This feature makes it easy to build iterative applications (graph computing, machine learning, etc.) and interactive data analysis applications.
It is fair to say that Spark started out as a distributed system implementing RDD, and through continuous development has grown into a fairly complete big data ecosystem. Simply put, the relationship between Spark and RDD is similar to that between Hadoop and MapReduce.
Characteristics of RDD
An RDD represents a dataset of read-only partitions. To modify an RDD, you can only derive a new RDD from an existing one through a transformation; the new RDD contains the information needed to derive it from its parent RDDs.
Dependencies exist between RDDs, and RDD execution is computed lazily according to the lineage. If the lineage becomes too long, it can be cut by persisting the RDD.
Partitioning
As shown in the figure below, an RDD is logically partitioned; the data of each partition exists only abstractly, and during computation the data of each partition is obtained through a compute function.
If the RDD is built from an existing file system, the compute function reads the data from that file system; if the RDD is derived from another RDD, the compute function applies the transformation logic to the data of the other RDD.
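A minimal sketch of this, assuming a spark-shell session where a SparkContext named sc is predefined and the input path is only a placeholder:

// A minimal sketch (as in spark-shell, where a SparkContext named sc is predefined);
// the input path is a placeholder.
val lines = sc.textFile("hdfs://.../input.txt", 4)  // request at least 4 partitions
println(lines.getNumPartitions)                     // how many partitions were actually created

// mapPartitions runs the given function once per partition, which is roughly
// what the compute function does for each partition.
val lineLengths = lines.mapPartitions(iter => iter.map(_.length))
println(lineLengths.count())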
Read-only
As shown in the figure below, RDDs are read-only; to change the data in an RDD, you can only create a new RDD based on an existing one.
The transformation from one RDD to another can be expressed with a rich set of operators, instead of having to write everything as map and reduce the way MapReduce does, as shown in the figure below.
RDD operators come in two kinds: transformations, which turn one RDD into another and build up the RDD lineage, and actions, which trigger the computation of an RDD to return a result or save the RDD to a file system. The figure below lists the operators supported by RDD.
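As a small sketch of the difference, assuming a spark-shell session where a SparkContext named sc is predefined: transformations such as map and filter only build the lineage, while an action such as count or collect triggers the actual computation.

// A minimal sketch (as in spark-shell, where a SparkContext named sc is predefined).
val nums = sc.parallelize(1 to 10)

// Transformations: build new RDDs lazily and record the lineage.
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Actions: trigger the computation and return a result or write output.
println(evens.count())                     // 5
println(evens.collect().mkString(", "))    // 4, 8, 12, 16, 20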
Dependencies
RDDs are transformed by operators, and each new RDD obtained from a transformation contains the information needed to derive it from its parents. RDDs maintain this kinship, known as a dependency. As shown in the figure below, there are two kinds of dependencies: a narrow dependency, in which partitions map one-to-one between RDDs, and a wide dependency, in which each partition of the downstream RDD depends on every partition of the upstream (parent) RDD, a many-to-many relationship.
Through these dependencies between RDDs, a task flow can be described as a DAG (directed acyclic graph). As shown in the figure below, wide dependencies correspond to a Shuffle (reduceByKey and join in the figure), while all the transformations inside a narrow dependency can be executed in a pipeline (in the figure, map and union can be executed together).
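The distinction can be seen in a small sketch, assuming a spark-shell session where a SparkContext named sc is predefined: map produces only a narrow dependency, while reduceByKey introduces a shuffle, and toDebugString prints the resulting lineage and stage boundary.

// A minimal sketch (as in spark-shell, where a SparkContext named sc is predefined).
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

val pairs  = words.map(word => (word, 1))   // narrow dependency: partitions map one-to-one
val counts = pairs.reduceByKey(_ + _)       // wide dependency: introduces a shuffle

// toDebugString prints the lineage; the ShuffledRDD marks the stage boundary.
println(counts.toDebugString)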
Caching
If the same RDD is used multiple times in an application, the RDD can be cached. The RDD then derives its partition data from the lineage only during the first computation; wherever it is used afterwards, the data is taken directly from the cache instead of being recomputed from the lineage, which speeds up later reuse.
As shown in the figure below, RDD-1 goes through a series of transformations to produce RDD-n, which is saved to HDFS. If RDD-1 is cached in memory, then during the subsequent transformation from RDD-1 to RDD-m, the earlier RDD-0 does not need to be recomputed.
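A minimal caching sketch, assuming a spark-shell session where a SparkContext named sc is predefined and the paths are only placeholders: once the intermediate RDD is cached, later actions that reuse it read the partitions from memory instead of recomputing them from the lineage.

// A minimal sketch (as in spark-shell, where a SparkContext named sc is predefined);
// the paths are placeholders.
val rdd1 = sc.textFile("hdfs://.../input.txt")
  .map(_.toUpperCase)
  .cache()                                 // equivalent to persist(StorageLevel.MEMORY_ONLY)

rdd1.saveAsTextFile("hdfs://.../out-n")    // first action: computes rdd1 and fills the cache
val rddM = rdd1.filter(_.nonEmpty)
println(rddM.count())                      // reuses the cached partitions of rdd1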
Checkpoint
Although the lineage of an RDD naturally provides fault tolerance (when a partition of an RDD fails or is lost, it can be rebuilt from the lineage), in long-running iterative applications the lineage grows longer and longer as the iterations proceed. Once an error occurs in a later iteration, the data must be rebuilt through a very long lineage, which inevitably hurts performance.
For this reason, RDD supports checkpointing, which saves the data to persistent storage and cuts the earlier lineage: an RDD that has been checkpointed no longer needs to know its parent RDDs, because it can read its data from the checkpoint.
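A minimal checkpoint sketch, assuming a spark-shell session where a SparkContext named sc is predefined and the checkpoint directory is only a placeholder: after the checkpoint is materialized, the RDD no longer depends on its parents and the earlier lineage is cut.

// A minimal sketch (as in spark-shell, where a SparkContext named sc is predefined);
// the checkpoint directory is a placeholder.
sc.setCheckpointDir("hdfs://.../checkpoints")

val base    = sc.parallelize(1 to 1000000)
val derived = base.map(_ * 2).filter(_ % 3 == 0)

derived.cache()        // recommended, so writing the checkpoint does not recompute the RDD
derived.checkpoint()   // mark for checkpointing; data is written when an action runs
derived.count()        // triggers the computation and materializes the checkpoint

println(derived.toDebugString)   // the lineage now starts from the checkpointed data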
Summary
To sum up, given an RDD, we know at least the following:
1. The number of partitions and how the data is partitioned.
2. The dependencies on the parent RDDs from which it is derived.
3. How to compute the data of each partition: 1) if the RDD is cached, take the partition data from the cache; 2) if the RDD has a checkpoint, recover the data from the checkpoint; 3) otherwise, compute the partition data according to the lineage.
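These pieces of information are visible directly on the RDD object; a small sketch, assuming a spark-shell session where a SparkContext named sc is predefined:

// A minimal sketch (as in spark-shell, where a SparkContext named sc is predefined).
val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
  .reduceByKey(_ + _)

println(counts.partitions.length)   // 1. number of partitions
println(counts.partitioner)         //    and how the data is partitioned
println(counts.dependencies)        // 2. dependencies on the parent RDDs
println(counts.toDebugString)       // 3. the lineage used to compute each partition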
Programming model
In Spark, an RDD is represented as an object, and RDDs are transformed through method calls on those objects. After a series of transformations, an action can be called to trigger the computation of the RDD; an action either returns a result to the application (count, collect, etc.) or saves the data to a storage system (saveAsTextFile, etc.). In Spark, the computation of an RDD is performed only when an action is encountered (that is, execution is lazy), so that multiple transformations can be pipelined at run time.
To use Spark, a developer writes a Driver program, which is submitted to the cluster to schedule Workers, as shown in the figure below. One or more RDDs are defined in the Driver, and calling an action on an RDD causes the Workers to execute the partition-computation tasks for that RDD.
Application example
The following is a simple Spark application, WordCount, which counts the number of occurrences of each word in a dataset. First, data is loaded from HDFS to obtain the original RDD-0, in which each record is a line of text. A flatMap operation splits each line into individual words, producing RDD-1. A map operation then turns each word into a key-value pair, where the key is the word itself and the value is the initial count of 1, producing RDD-2. Finally, all records in RDD-2 are merged to count the occurrences of each word, producing RDD-3, which is saved to HDFS.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: WordCount <input> <output>")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val result = sc.textFile(args(0))      // load the input text file (RDD-0)
      .flatMap(line => line.split(" "))    // split each line into words (RDD-1)
      .map(word => (word, 1))              // map each word to (word, 1) (RDD-2)
      .reduceByKey(_ + _)                  // sum the counts per word (RDD-3)
    result.saveAsTextFile(args(1))         // save the result to HDFS
  }
}
Conclusion
What advantages does RDD-based Spark have over traditional Hadoop MapReduce? To sum up, there are at least three:
1. RDD provides a rich set of operators, not just map and reduce, making it easier to describe applications.
2. A DAG is constructed through the transformations between RDDs, so intermediate results do not need to be written to disk.
3. RDD supports caching, so computations can be completed quickly in memory.