
What are the knowledge points of RDD?


Today, the editor will share the relevant knowledge points of RDD with you. The content is detailed and the logic is clear; I believe most readers are not yet very familiar with this topic, so this article is shared for your reference. I hope you gain something from reading it. Let's take a look.

Spark's programming model is the resilient distributed dataset (Resilient Distributed Dataset, RDD). It extends the MapReduce model while fixing one of its defects: the lack of efficient data sharing between the stages of a parallel computation. By combining efficient data sharing with MapReduce-like operations, parallel computations can be carried out efficiently and optimized within a single system.

Earlier cluster fault-tolerance models such as MapReduce transform a computation into a set of tasks in a directed acyclic graph (DAG). This lets them recover effectively from failed and slow nodes by re-running tasks in the DAG, but apart from the file system they provide no other way to share data, which leads to frequent data replication across the network and heavy I/O pressure.

Because RDD provides an interface based on coarse-grained transformations (such as map, filter, and join) that apply the same operation to an entire dataset, it can record the lineage ("Lineage") that created a dataset instead of storing the actual data, thus achieving efficient fault tolerance.
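As a minimal sketch of this idea in Scala (assuming a spark-shell session, where `sc` is a predefined SparkContext; the log path is a made-up example), a chain of coarse-grained transformations only records lineage and computes nothing until an action runs:

```scala
// Assumes spark-shell, where sc (SparkContext) is predefined.
val lines  = sc.textFile("hdfs:///logs/access.log")   // hypothetical input path
val errors = lines.filter(_.contains("ERROR"))        // coarse-grained transformation
val codes  = errors.map(_.split(" ")(0))              // another transformation

// Nothing has been computed yet; Spark has only recorded the lineage:
println(codes.toDebugString)                          // prints the RDD dependency chain
```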

When an RDD partition is lost, the RDD holds enough information to recompute it, and only that partition needs to be recomputed, so the lost data can be recovered quickly without the cost of expensive replication.

Many kinds of computation models, including several existing cluster programming models, have been implemented on top of the RDD mechanism. In these models, RDD not only matches the performance of earlier systems but also adds capabilities they lacked, such as fault tolerance, straggler mitigation, and flexible resource allocation. These models include the following.

(1) Iterative computation: currently the most common kind of workload, such as the algorithms used in graph processing, numerical optimization, and machine learning. RDD can support many models of this kind, including Pregel, MapReduce, GraphLab, and PowerGraph.
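As a toy illustration of iterative computation on RDDs (purely illustrative; real graph or machine-learning workloads would use libraries such as GraphX or MLlib, and the update rule here is made up), each iteration derives a new RDD from the previous one while the data stays in cluster memory:

```scala
// Toy iterative refinement over an RDD of (id, value) pairs.
var ranks = sc.parallelize(1 to 1000).map(id => (id, 1.0))

for (_ <- 1 to 10) {
  // Each iteration produces a new RDD from the previous one; cache() keeps
  // the intermediate result in memory instead of writing it to a file system.
  ranks = ranks.mapValues(r => 0.15 + 0.85 * r).cache()
}

println(ranks.take(3).mkString(", "))
```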

(2) Interactive SQL queries: most workloads on MapReduce clusters are SQL queries, but MapReduce falls far short of parallel databases for interactive querying. Spark's RDDs not only provide many of the features of common database engines at comparable performance, but Spark SQL also offers a complete fault-tolerance mechanism that handles failures and slow nodes well for both short and long queries.
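A minimal Spark SQL sketch, assuming a Spark 2.x+ spark-shell where `spark` (a SparkSession) is predefined; the table name and sample rows are made up for illustration:

```scala
// Assumes a Spark 2.x+ spark-shell, where spark (SparkSession) is predefined.
import spark.implicits._

val logs = Seq(("ERROR", 500), ("INFO", 200), ("ERROR", 404)).toDF("level", "code")
logs.createOrReplaceTempView("logs")

// An interactive SQL query executed by the same fault-tolerant engine:
spark.sql("SELECT level, COUNT(*) AS cnt FROM logs GROUP BY level").show()
```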

(3) MapReduce: RDD provides a superset of MapReduce, so it can efficiently execute MapReduce programs as well as more general DAG data-flow applications such as DryadLINQ.
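For instance, word count, the canonical MapReduce program, can be expressed directly with RDD operations (continuing with an `sc` as in spark-shell; the input path is hypothetical):

```scala
val counts = sc.textFile("hdfs:///data/words.txt")
  .flatMap(_.split(" "))     // "map" phase: emit words
  .map(word => (word, 1))
  .reduceByKey(_ + _)        // "reduce" phase: sum counts per word
```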

(4) Streaming data processing: stream processing has long been studied in the database and systems communities, but large-scale stream processing remains a challenge. Current models do not solve the problem of stragglers, which appear frequently in large clusters, and their failure handling is limited, requiring either heavy replication or a long recovery time. To recover a lost node, current systems must either keep two copies of every operator's state or replay the data from before the point of failure through a series of costly serial steps.

Spark proposed discretized streams (D-Streams) to solve this problem. D-Streams treat a streaming computation as a sequence of short, deterministic batch computations and keep the state in RDDs. D-Streams perform parallel recovery based on the dependency graph of the relevant RDDs, which achieves fast fault recovery and avoids data replication.

D-Streams also handle stragglers through speculative execution, for example by running speculative backup copies of slow tasks. Although D-Streams add some latency by turning the computation into many small independent jobs, this latency is only sub-second in D-Streams cluster processing.
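A minimal Spark Streaming (DStream) sketch of this model, where each short batch becomes a small deterministic RDD job; the socket source, port, window size, and checkpoint directory are assumptions chosen for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each 1-second batch becomes a small, deterministic RDD job; state can be
// recovered from the RDD lineage plus periodic checkpoints.
val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/streaming-checkpoint")            // hypothetical checkpoint directory

val lines  = ssc.socketTextStream("localhost", 9999)   // assumed test source
val counts = lines.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(10))            // sliding 10-second window
counts.print()

ssc.start()
ssc.awaitTermination()
```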

RDD can also support new applications that existing systems cannot express. For example, many streaming applications also need to combine the stream with historical data; with RDDs, batch and streaming can be used in the same program, sharing data and fault-tolerant recovery across both models. Similarly, operators of streaming applications often need to run ad-hoc queries against the state of the data stream. More generally, batch applications often need to combine multiple processing types: an application may use SQL to extract a dataset, train a machine learning model on it, and then query the model.

Because most of the time in such workflows is spent on distributed file system I/O used to share data between systems, pipelines assembled from multiple separate systems are very inefficient. With a system built on the RDD mechanism, these computations can run back to back in the same engine without the extra I/O, greatly improving processing efficiency.

In Spark programming, developers write a driver program (Driver Program) that connects to worker processes (Worker). The driver defines one or more RDDs and the actions on them, and it also records the derivation relationships among the RDDs, that is, their lineage. A worker is a long-running process that keeps the RDD partition data produced by a series of operations in memory.
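A minimal driver-program sketch in Scala (the application name, local master, and data are placeholders for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The driver defines RDDs and actions; the workers hold partition data in memory.
object RDDDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-driver-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val data   = sc.parallelize(1 to 100)        // RDD defined on the driver
    val result = data.map(_ * 2).filter(_ > 50)  // lineage recorded on the driver
    println(result.count())                      // action triggers work on the workers

    sc.stop()
  }
}
```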

Operations in Spark can be divided into four types: creation operations, transformation operations, control operations, and action operations.

Creation operation (Creation Operation): used to create RDDs. There are only two ways to create an RDD: from a memory collection or an external storage system, or from another RDD produced by a transformation operation.
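For example (assuming an `sc` like the one in the driver sketch above; the storage path is hypothetical):

```scala
// Creating RDDs from a memory collection and from an external storage system,
// plus one derived from another RDD via a transformation.
val fromMemory  = sc.parallelize(Seq(1, 2, 3, 4))
val fromStorage = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
val derived     = fromMemory.map(_ + 1)
```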

Transformation operation (Transformation Operation): transforms an RDD into a new RDD through some operation. For example, a HadoopRDD can be transformed into a MappedRDD with the map operation. Transformations are lazy: they only define a new RDD and are not executed immediately.
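A small sketch of this laziness (same assumed `sc`; the path is hypothetical):

```scala
val words = sc.textFile("hdfs:///data/words.txt")   // hypothetical path
val upper = words.map(_.toUpperCase)                // lazy: nothing runs yet
val short = upper.filter(_.length < 5)              // still lazy, just more lineage
// Only an action such as short.count() would trigger execution.
```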

Control operation (Control Operation): RDD persistence, which stores an RDD on disk or in memory according to different storage policies. For example, the cache API caches an RDD in memory by default.
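For instance (same assumed `sc`; the log paths are hypothetical):

```scala
import org.apache.spark.storage.StorageLevel

val hot  = sc.textFile("hdfs:///logs/hot.log").cache()                  // MEMORY_ONLY by default
val cold = sc.textFile("hdfs:///logs/cold.log")
  .persist(StorageLevel.MEMORY_AND_DISK)                                // spill to disk if memory is short
```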

Action operation (Action Operation): an operation that triggers Spark to actually run; for example, collect on an RDD is an action. There are two kinds of actions in Spark: those whose results are Scala collections or scalar values, and those that save the RDD to an external file system or database.
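Both kinds are shown in this sketch (same assumed `sc`; the output path is hypothetical):

```scala
val nums = sc.parallelize(1 to 10)

// Actions that return Scala values to the driver:
val all   = nums.collect()        // Array[Int]
val total = nums.reduce(_ + _)    // Int

// Actions that write to an external system:
nums.saveAsTextFile("hdfs:///out/nums")   // hypothetical output path
```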

Those are all the contents of the article "What are the knowledge points of RDD". Thank you for reading! I believe you will gain a lot from this article. The editor updates different knowledge for you every day; if you want to learn more, please follow the industry information channel.
