This article explains Spark's core abstraction, the RDD, in some detail; readers interested in Spark internals may find it a useful reference.
Unlike many specialized big data processing platforms, Spark is built on a single unified abstraction, the RDD, which lets it handle different big data processing scenarios, including MapReduce, streaming, SQL, machine learning, and graph processing, in an essentially uniform way. This is what Matei Zaharia describes as designing a unified programming abstraction, and it is a large part of what makes Spark so appealing. To understand Spark, you need to understand RDD.
What is an RDD?
An RDD (Resilient Distributed Dataset) is a fault-tolerant, parallel data structure that lets users explicitly persist data in memory or on disk and control how the data is partitioned. At the same time, the RDD provides a rich set of operations for manipulating the data. Among them, transformations such as map, flatMap, and filter follow the monad pattern and fit naturally with Scala's collection operations.
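As a minimal sketch of chaining such transformations, assuming a SparkContext named sc is already available (for example from spark-shell) and using made-up sample data:
val lines = sc.parallelize(Seq("spark builds on rdds", "rdds are resilient"))
val words = lines.flatMap(_.split(" "))   // transformation, like Scala's flatMap
val short = words.filter(_.length <= 5)   // transformation
val upper = short.map(_.toUpperCase)      // transformation
println(upper.collect().mkString(", "))   // action: triggers the actual computation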
In addition, RDD provides more convenient operations such as join, groupBy, and reduceByKey (note that reduceByKey is a transformation, not an action) to support common data processing tasks. Generally speaking, there are several common models for data processing: Iterative Algorithms, Relational Queries, MapReduce, and Stream Processing. For example, Hadoop MapReduce uses the MapReduce model and Storm uses the Stream Processing model.
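A minimal sketch of such key-based operations, assuming the same sc as above and made-up data:
val sales  = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))
val totals = sales.reduceByKey(_ + _)      // transformation: sums the values per key
val kinds  = sc.parallelize(Seq(("apples", "fruit"), ("pears", "fruit")))
val joined = totals.join(kinds)            // transformation: (key, (total, kind)) pairs
joined.collect().foreach(println)          // action: runs the job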
RDD mixes these four models, which allows Spark to be applied to a wide variety of big data processing scenarios. As a data structure, an RDD is essentially a read-only collection of partitioned records. An RDD can contain multiple partitions, each of which is a fragment of the dataset. RDDs can depend on one another. If each partition of the parent RDD is used by at most one partition of a single child RDD, the dependency is called a narrow dependency; if a parent partition can be used by multiple child partitions, it is called a wide dependency. Different operations produce different kinds of dependencies depending on their characteristics.
For example, the map operation produces a narrow dependency, while the join operation produces a wide dependency. Spark distinguishes narrow from wide dependencies for two reasons. First, narrow dependencies allow multiple operations to be pipelined on the same cluster node, such as executing filter immediately after map; wide dependencies, by contrast, require all parent partitions to be available and may need a MapReduce-like shuffle to pass data across the nodes. Which kind of dependency an operation produces can be observed directly, as in the sketch below.
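A minimal sketch, assuming the same sc as above; dependencies is a standard method on RDD, and the dependency classes named in the comments are what current Spark versions typically report:
val pairs   = sc.parallelize(1 to 100).map(x => (x % 3, x))
println(pairs.dependencies)     // OneToOneDependency: map produces a narrow dependency
val grouped = pairs.groupByKey()
println(grouped.dependencies)   // ShuffleDependency: groupByKey produces a wide dependency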
The second reason concerns failure recovery. Recovery under narrow dependencies is more efficient, because only the lost parent partitions need to be recomputed, and that recomputation can proceed in parallel on different nodes. Under wide dependencies, recovering a single lost partition may involve partitions from multiple parent RDDs across the lineage. The following figure illustrates the difference between narrow and wide dependencies:
This figure comes from An Architecture for Fast and General Data Processing on Large Clusters, Matei Zaharia's doctoral dissertation. In the figure, each box represents an RDD and each shaded rectangle represents a partition. How does an RDD keep data processing efficient? RDD provides two features: persistence and partitioning, which users control through the persist and partitionBy functions. Partitioning, together with RDD's support for parallel computation (data can be turned into an RDD with SparkContext's parallelize function), lets Spark make better use of scalable hardware resources. Combining partitioning with persistence makes processing large volumes of data even more efficient. For example:
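A minimal sketch of combining partitioning with persistence, assuming the same sc as above; the number of partitions (4) and the data are arbitrary:
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val users = sc.parallelize(Seq((1, "ann"), (2, "bob"), (3, "carol")))
val partitioned = users
  .partitionBy(new HashPartitioner(4))   // co-locate records with the same key
  .persist(StorageLevel.MEMORY_ONLY)     // keep the partitioned data in memory
// Later jobs that join or aggregate on the same key can reuse the cached,
// pre-partitioned data instead of re-shuffling it.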
The partitionBy function takes a Partitioner object, for example:
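Besides the built-in HashPartitioner (and RangePartitioner), a custom Partitioner can be supplied; the class below is a hypothetical example for illustration:
import org.apache.spark.Partitioner

// Hypothetical partitioner: even keys go to partition 0, odd keys to partition 1.
class EvenOddPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % 2
}

val squares  = sc.parallelize(1 to 10).map(n => (n, n * n))
val byParity = squares.partitionBy(new EvenOddPartitioner)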
An RDD is essentially a dataset held in memory, and when an RDD is accessed, the pointer refers only to the part relevant to the operation. For example, consider a column-oriented data structure in which one column is implemented as an array of Int and another as an array of Float. If an operation only needs the Int column, the RDD's pointer can access just the Int array, avoiding a scan of the whole structure. RDD divides operations into two categories: transformations and actions. No matter how many transformations are applied, the RDD does not actually execute anything; computation is triggered only when an action is invoked.
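A minimal sketch of this lazy behaviour, assuming the same sc as above; the input path is a made-up placeholder:
val lines  = sc.textFile("hdfs://namenode:9000/logs/input.txt")  // nothing runs yet
val errors = lines.filter(_.contains("ERROR"))                   // transformation, still lazy
val words  = errors.flatMap(_.split(" "))                        // transformation, still lazy
val n      = words.count()                                       // action: the job actually runs here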
In the internal implementation of RDD, the underlying interface is based on iterators, which makes data access more efficient and avoids the memory cost of materializing large intermediate results. In the implementation, each transformation has a corresponding class that inherits from RDD; for example, the map operation returns a MappedRDD, while flatMap returns a FlatMappedRDD. When we call map or flatMap, we simply pass the current RDD object into the constructor of the corresponding new RDD object. For example:
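As a rough sketch of how this looked in the Spark source of that era (around Spark 1.x; later versions replaced MappedRDD with MapPartitionsRDD, but the idea is unchanged), map simply wraps the current RDD:
def map[U: ClassTag](f: T => U): RDD[U] = new MappedRDD(this, sc.clean(f))
The new MappedRDD does no work at this point; it only records its parent (this) and the function f to apply later.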
Each of these RDD subclasses defines a compute function. This function is triggered when an action is invoked, and the corresponding transformation is applied through the iterator inside it:
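The following is a sketch abbreviated from the MappedRDD implementation of that era (Spark 1.x source), showing the compute function built on the parent's iterator:
private[spark] class MappedRDD[U: ClassTag, T: ClassTag](prev: RDD[T], f: T => U)
  extends RDD[U](prev) {

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Invoked only when an action runs; applies f lazily over the parent partition's iterator.
  override def compute(split: Partition, context: TaskContext) =
    firstParent[T].iterator(split, context).map(f)
}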
RDD support for fault tolerance
Fault tolerance is usually supported in one of two ways: data replication or logging updates. Both approaches are expensive for data-centric systems, because they require copying large amounts of data across the cluster network, and network bandwidth is far lower than memory bandwidth. RDD is inherently fault tolerant: first, it is an immutable dataset; second, it remembers the graph of operations used to build it, so when the worker running a task fails, the previously applied operations can be retrieved from this operation graph and recomputed.
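A small sketch of inspecting that operation graph, assuming the same sc as above; toDebugString is a standard RDD method that prints the lineage Spark would use for recomputation:
val cleaned = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)
println(cleaned.toDebugString)   // prints the chain of RDDs back to the original parallelize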
Because replication is not needed to support fault tolerance, the cost of transferring data across the network is greatly reduced. In some scenarios, however, Spark also needs logging-like support. For example, in Spark Streaming, when performing update operations on data or invoking the window operations that Streaming provides, the intermediate state of the computation must be recoverable. In those cases, Spark's checkpoint mechanism is used so that the operations can be recovered from a checkpoint.
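A minimal sketch of the checkpoint mechanism on a plain RDD, assuming the same sc as above; the checkpoint directory is a made-up placeholder:
sc.setCheckpointDir("hdfs://namenode:9000/spark/checkpoints")
val totals = sc.parallelize(1 to 1000).map(x => (x % 10, x)).reduceByKey(_ + _)
totals.checkpoint()   // marks the RDD for checkpointing; data is saved when the next action runs
totals.count()        // action: materializes the RDD and writes the checkpoint
In Spark Streaming the analogous call is StreamingContext.checkpoint(directory), which stateful operations such as updateStateByKey and the window operations rely on for recovery.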
For RDDs with wide dependencies, the most effective fault tolerance method is also the checkpoint mechanism. As of this writing, however, it seems Spark has not yet introduced an automatic checkpointing mechanism.
Summary
RDD is the core of Spark and the architectural foundation of the whole framework. Its features can be summarized as follows:
It is an immutable (read-only) data structure.
It is a distributed data structure that spans the nodes of a cluster.
The structure can be partitioned according to the key of each data record.
It provides coarse-grained operations that work on partitions.
It stores data in memory, thereby providing low latency.