How to analyze the core RDD features of Spark for big data

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to analyze the core features of the RDD in Spark, the big data processing engine. The content is fairly detailed, and interested readers can use it as a reference. I hope you find it helpful.

Summary of RDD features:

A. The RDD is the core abstraction provided by Spark. Its full name is Resilient Distributed Dataset.

B. Abstractly, an RDD is a collection of data elements. It is divided into multiple partitions that are distributed across different nodes in the cluster, so that the data in the RDD can be processed in parallel.

C. RDDs are usually created from files in the Hadoop ecosystem, that is, HDFS files or Hive tables, and sometimes from collections in the application.

D. The most important feature of an RDD is fault tolerance: it can recover automatically from node failure. If an RDD partition on a node is lost because the node fails, the partition is automatically recomputed from its data source. All of this is transparent to the user.

E. RDD data is kept in memory by default. When memory is insufficient, Spark can write RDD data to disk, depending on the storage strategy configured for the RDD.

Next, let's analyze its key features in detail.

Figure 1-RDD distributed feature

Analysis:

An RDD (Resilient Distributed Dataset) is an abstraction over distributed memory. It can represent, for example, a file on HDFS as a single logical collection, but the data is actually divided into multiple partitions that are scattered across different nodes in the Spark cluster. Suppose our RDD holds 400,000 records and is divided into four partitions. The four partitions are stored on nodes 1, 2, 3, and 4 of the cluster, with 100,000 records in each partition. As shown in Figure 1, an RDD spreads its data over a group of nodes in the cluster, and each node stores only some of the RDD's partitions. This is the distributed architecture model of the RDD.
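The 400,000-record example can be sketched in plain Python. This is only a toy model of the idea, not Spark code: `split_into_partitions`, `count_even`, and the `node1`..`node4` names are hypothetical, and a thread pool stands in for the cluster. In Spark itself, a comparable dataset would typically be created with a call such as `sc.parallelize(data, 4)`.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split records into num_partitions contiguous, near-equal slices."""
    size, extra = divmod(len(records), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions


def count_even(partition):
    """Per-partition work: count the even values in one partition."""
    return sum(1 for value in partition if value % 2 == 0)


records = range(400_000)  # the 400,000 records from the example
partitions = split_into_partitions(records, 4)

# Hypothetical placement: partition i lives on "node i+1".
placement = {f"node{i + 1}": part for i, part in enumerate(partitions)}

# Each partition is processed independently, then the partial results are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_even, placement.values()))

total_even = sum(partial_counts)
print({node: len(part) for node, part in placement.items()})
print(total_even)
```

The point of the sketch is that no single node ever needs the whole dataset: each worker touches only its own partition, and only the small per-partition results are combined at the end.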

Figure 2-RDD elastic feature

Analysis:

The elasticity of an RDD concerns how each partition is stored on a Spark cluster node. Partition data is kept in memory by default, but what happens if there is not enough memory? This is where the elastic nature of the RDD shows. As shown in Figure 2, node 3 can hold at most 60,000 records in memory, but it needs to store a partition of 100,000 records, so the remaining 40,000 records of the partition have to be written to disk. This allocation is transparent to users, who do not need to care where the data is stored. The storage mechanism does have configuration parameters to choose from, and a later in-depth article will describe how to choose a storage strategy, so we will not go into detail here. This automatic trade-off between memory and disk is the elastic feature of the RDD.
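The node 3 scenario can be modeled with a few lines of plain Python. This is a conceptual sketch that follows the figure's record-level split and is not how Spark is implemented: `store_partition` and `read_partition` are hypothetical helpers, and two Python lists stand in for memory and disk. In Spark, the storage strategy is chosen with `persist` and a `StorageLevel` such as `MEMORY_AND_DISK`.

```python
def store_partition(records, memory_capacity):
    """Keep as many records as fit in memory; spill the rest to 'disk'."""
    in_memory = records[:memory_capacity]
    on_disk = records[memory_capacity:]
    return {"memory": in_memory, "disk": on_disk}


def read_partition(stored):
    """Readers see one logical partition, wherever the records live."""
    yield from stored["memory"]
    yield from stored["disk"]


partition = list(range(100_000))  # the 100,000-record partition on node 3
stored = store_partition(partition, memory_capacity=60_000)

print(len(stored["memory"]), len(stored["disk"]))
print(sum(1 for _ in read_partition(stored)))
```

Code that consumes the partition through `read_partition` never needs to know which records were spilled, which is the transparency the paragraph above describes.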

Figure 3-RDD fault-tolerant feature

Analysis:

Finally, since an RDD is scattered across various nodes in the cluster, what happens if a node fails at run time? Spark's RDD supports a powerful fault-tolerance mechanism, illustrated in Figure 3. When node n fails, the RDD starts the fault-tolerance mechanism and goes back to the upstream data it depends on, here on node 3, to obtain the data again and recompute the lost partition. A deeper analysis introduces another concept, the DAG (directed acyclic graph), which describes the dependencies between RDDs and underlies this recovery logic.
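The recovery idea can be illustrated with a small plain-Python model. `ToyRDD` and its methods are hypothetical names invented for this sketch and are not Spark's API; the model only assumes that each dataset remembers its upstream parent and the function used to derive each partition.

```python
class ToyRDD:
    """A toy dataset that remembers how each partition is derived (its lineage)."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute  # function: partition index -> list of records
        self.parent = parent     # upstream dataset this one depends on
        self._cache = {}         # materialized partitions, keyed by index

    def map(self, fn):
        """Derive a new dataset; partition i depends only on the parent's partition i."""
        return ToyRDD(
            self.num_partitions,
            lambda i: [fn(x) for x in self.partition(i)],
            parent=self,
        )

    def partition(self, i):
        """Return partition i, recomputing it from lineage if it is not cached."""
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        """Simulate a node failure that wipes out one cached partition."""
        self._cache.pop(i, None)


# Source dataset: 4 partitions of 5 records each (a stand-in for a file on HDFS).
source = ToyRDD(4, lambda i: list(range(i * 5, (i + 1) * 5)))
doubled = source.map(lambda x: x * 2)

before = doubled.partition(2)
doubled.lose_partition(2)     # the node holding partition 2 fails
after = doubled.partition(2)  # rebuilt from the upstream partition

print(before == after)
```

Only the lost partition is rebuilt, and only from the upstream partition it depends on. The chain of `parent` links in this toy model is a very small version of the dependency graph that the DAG concept describes.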

That concludes this analysis of the core RDD features of Spark. I hope the content above is of some help and that you learn something from it. If you found the article useful, feel free to share it so more people can see it.
