
What is the operating mechanism of RDD?


This article introduces the operating mechanism of RDD. Many people have questions about how RDD actually works, so this article collects the key concepts into a simple, easy-to-follow walkthrough. I hope it helps answer your questions about the operating mechanism of RDD.

RDD stands for Resilient Distributed Dataset and is the core abstraction provided by Spark. An RDD is an abstraction over a distributed dataset: it is partitioned, and each partition is distributed on a different node in the cluster, so the data can be processed in parallel. Its main characteristics are resilience and fault tolerance.
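To make this concrete, here is a minimal sketch in Scala, assuming a spark-shell session where the SparkContext sc is predefined; the data and the partition count of 4 are arbitrary choices for illustration:

```scala
// Distribute a local collection across 4 partitions of the cluster.
val rdd = sc.parallelize(1 to 100, numSlices = 4)

// Each partition is a fragment of the dataset and may live on a different node.
println(rdd.getNumPartitions)   // 4

// Operations such as sum() run over all partitions in parallel.
println(rdd.sum())              // 5050.0
```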

1. The design and operating principles of RDD

The core of Spark is built on the unified abstraction of RDD. Transformations and actions on RDDs let the various components of Spark integrate seamlessly, so that a big data computing task can be completed within a single application.

In practical applications, there are many iterative algorithms and interactive data mining tools. What these scenarios have in common is that intermediate results are reused across computing phases: the output of one stage serves as the input of the next. The MapReduce framework in Hadoop writes intermediate results to HDFS, which incurs heavy data replication, disk I/O, and serialization overhead, and usually supports only certain specific computing patterns. RDD, by contrast, provides an abstract data architecture: developers do not have to worry about the distributed nature of the underlying data, and only need to express the application logic as a series of transformations. The transformations between different RDDs form dependencies that can be pipelined, which avoids storing intermediate results and greatly reduces the overhead of data replication, disk I/O, and serialization.

1.1. RDD concept

An RDD is a distributed collection of objects that provides a highly restricted shared memory model: it is essentially a read-only collection of partitioned records and cannot be modified directly. Each RDD can be divided into multiple partitions, each partition is a fragment of the dataset, and different partitions of an RDD can be stored on different nodes in the cluster, so that computation can proceed in parallel across those nodes.

RDD provides a rich set of operations to support common data processing, divided into "actions" and "transformations": the former perform calculations and specify the form of the output, while the latter specify the interdependencies between RDDs. The transformation interfaces provided by RDD are deliberately simple: they are coarse-grained data transformation operations such as map, filter, groupBy, and join, rather than fine-grained modifications to individual data items. RDD is therefore well suited to batch applications that apply the same operation to every element of a dataset, but not to applications that require asynchronous, fine-grained updates to state, such as web applications or incremental web crawlers.
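Here is a short sketch of the two kinds of operations, again assuming a spark-shell session with sc predefined; the word data is made up for the example:

```scala
// Transformations are coarse-grained: they apply to every element,
// return a new RDD, and only record dependencies.
val words  = sc.parallelize(Seq("spark", "rdd", "spark", "dag"))
val pairs  = words.map(word => (word, 1))   // transformation
val counts = pairs.reduceByKey(_ + _)       // transformation

// Actions perform the calculation and specify the form of the output.
counts.collect().foreach(println)           // action: e.g. (spark,2)
println(counts.count())                     // action: 3
```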

The typical execution process of RDD is as follows:

1. An RDD is created by reading an external data source (or a collection in memory).
2. The RDD goes through a series of "transformation" operations, each of which produces a different RDD for the next transformation.
3. The last RDD is processed by an "action" operation and outputs a value of the specified type.

RDD uses lazy evaluation: during this process, the transformation operations do not perform any real computation but only record dependencies, and only when an action is encountered is the real computation triggered, following the previously recorded dependencies to produce the final result (see the sketch below).
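The following sketch demonstrates the lazy behavior, assuming a spark-shell session with sc predefined: the transformations only record lineage, which toDebugString can print, and nothing runs until collect is called.

```scala
// Each transformation only records a dependency on its parent RDD;
// nothing is computed yet.
val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)            // recorded, not executed
val evens   = doubled.filter(_ % 4 == 0) // recorded, not executed

// The recorded lineage (dependency chain) can be inspected at any time.
println(evens.toDebugString)

// Only an action triggers the real computation over the whole chain.
println(evens.collect().mkString(", "))  // 4, 8, 12, 16, 20
```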

Here is an example describing the actual execution of an RDD. As shown in the following figure, two RDDs, A and C, are created from the input, and a series of transformation operations then produces F, which is also an RDD. Note that no real calculation is performed while these transformations execute; only the data flow trajectory is recorded. When F performs an action operation and generates output data, Spark builds a directed acyclic graph (DAG) from the RDD dependencies and performs the real calculation from the starting point. It is this lazy evaluation mechanism that allows the intermediate results of transformations to flow directly into the next operation without being saved.

1.2. RDD characteristics

Overall, the main reasons why Spark can achieve efficient computing with RDD are as follows:

Efficient fault tolerance. In the design of RDD, data can only be modified through transformations from a parent RDD to a child RDD, which means lost partitions can be recomputed directly from the dependencies between RDDs, without maintaining redundant copies of the data. There is also no need to log fine-grained operations or record the specific data they touch, which greatly reduces the cost of fault tolerance in data-intensive applications.

Intermediate results are persisted in memory. Data is passed between RDD operations in memory, without being written to and read back from disk, which avoids unnecessary disk I/O overhead (see the caching sketch after this list).

The stored data can be Java objects, avoiding unnecessary object serialization and deserialization overhead.
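To make the second and third points concrete, here is a sketch assuming a spark-shell session with sc predefined; logs.txt is a hypothetical input file:

```scala
import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("logs.txt")        // hypothetical input file
  .filter(_.contains("ERROR"))

// Keep the intermediate result in memory as deserialized Java objects,
// so repeated actions do not recompute it from disk.
errors.persist(StorageLevel.MEMORY_ONLY)    // cache() is shorthand for this

println(errors.count())   // first action: computes and caches the RDD
println(errors.first())   // second action: served from memory
```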

1.3. Dependencies between RDD

Different operations produce different dependencies between the partitions of different RDDs, which are mainly divided into narrow dependencies (Narrow Dependency) and wide dependencies (Wide Dependency). A narrow dependency is a one-to-one or many-to-one relationship between parent partitions and a child partition, produced by operations such as map, filter, and union. A wide dependency is a one-to-many relationship, that is, one partition of the parent RDD is used by multiple partitions of the child RDD, produced by operations such as groupByKey and sortByKey.
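The two kinds of dependencies can be observed directly on an RDD's dependencies field, as in this sketch (spark-shell session, sc predefined; the data is made up):

```scala
val base = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// map: each child partition depends on exactly one parent partition
// (narrow dependency).
val mapped = base.map { case (k, v) => (k, v * 10) }
println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)

// groupByKey: each child partition may depend on every parent partition,
// so a shuffle is required (wide dependency).
val grouped = base.groupByKey()
println(grouped.dependencies)  // e.g. List(org.apache.spark.ShuffleDependency@...)
```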

For narrowly dependent RDDs, all parent partitions can be computed in a pipelined manner, without shuffling data across the network. A wide dependency, by contrast, is usually accompanied by a Shuffle operation: all parent partitions must be computed first, and their data is then shuffled between nodes. The same distinction matters for recovery: with narrow dependencies, only the lost partitions need to be recomputed from the corresponding parent partitions, and this can be done in parallel on different nodes; with wide dependencies, a single node failure usually means that the recomputation involves multiple parent RDD partitions, which is expensive. In addition, Spark provides data checkpointing and logging to persist intermediate RDDs, so that recovery does not have to go back to the very beginning. During fault recovery, Spark compares the cost of using a data checkpoint against the cost of recomputing the RDD partitions, and automatically selects the optimal recovery strategy.
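A minimal checkpointing sketch, assuming a spark-shell session with sc predefined; the checkpoint directory is an arbitrary example path:

```scala
// Tell Spark where to persist checkpointed RDDs (reliable storage such
// as HDFS in a real cluster; a local path here for illustration).
sc.setCheckpointDir("/tmp/spark-checkpoints")

val longLineage = sc.parallelize(1 to 1000)
  .map(_ + 1)
  .filter(_ % 2 == 0)

longLineage.checkpoint()   // marked for checkpointing
longLineage.count()        // an action materializes the checkpoint

println(longLineage.isCheckpointed)  // true: recovery can restart from here
```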

1.4. Division of stages

Spark generates the DAG by analyzing the dependencies between RDDs, and then determines how to divide stages by analyzing the dependencies between the partitions of each RDD. The specific method is to parse the DAG in reverse: break it whenever a wide dependency is encountered, and add the current RDD to the current stage whenever a narrow dependency is encountered, putting narrow dependencies into the same stage as far as possible to enable pipelined computation. For example, in the following figure, the DAG is first generated from the reads, transformations, and action on the data. When the action is executed, the DAG is parsed in reverse; because the transformations from A to B and from B and F to G are wide dependencies, the DAG is broken at those points, dividing it into three stages. After the DAG is divided into multiple stages, each stage represents a set of tasks that are related to each other and have no Shuffle dependencies among themselves. Each task set is submitted to the task scheduler (TaskScheduler) for processing, and the task scheduler distributes the tasks to Executors to run.
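The stage boundary introduced by a wide dependency can be seen in the lineage printout, as in this sketch (spark-shell session, sc predefined):

```scala
val text = sc.parallelize(Seq("a b a", "b c"))

val wordCounts = text
  .flatMap(_.split(" "))   // narrow dependency: stays in the same stage
  .map((_, 1))             // narrow dependency: stays in the same stage
  .reduceByKey(_ + _)      // wide dependency: forces a stage boundary (shuffle)

// The indentation in the lineage printout marks the shuffle boundary
// where the DAG is broken into stages.
println(wordCounts.toDebugString)
```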

1.5. RDD running process

Through the above introduction to the concepts, dependencies, and stage division of RDD, and combined with the basic process of running Spark described earlier, here is a summary of the running process of an RDD in the Spark architecture (as shown in the following figure):

1. Create the RDD objects.
2. SparkContext computes the dependencies between the RDDs and builds the DAG.
3. DAGScheduler parses the DAG into multiple stages, each containing multiple tasks, and the task scheduler distributes each task to an Executor on a worker node for execution.

The word-count sketch below traces these three steps end to end.
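This is a minimal sketch in Scala, assuming a spark-shell session with sc predefined; the input lines are made up:

```scala
// 1. Create the RDD (here from an in-memory collection; a file would also
//    work). SparkContext records the lineage as transformations are applied.
val lines = sc.parallelize(Seq("spark makes rdds", "rdds are resilient"))

// 2. Transformations build the DAG of dependencies.
val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)

// 3. The action triggers DAGScheduler to split the DAG into stages and
//    tasks, which the task scheduler ships to Executors for execution.
counts.collect().foreach(println)
```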

This concludes the study of the operating mechanism of RDD. I hope it has resolved your doubts; pairing the theory with hands-on practice is the best way to learn, so go and try it! If you want to keep learning more related knowledge, please continue to follow the site for more practical articles.
