What are the characteristics of Spark RDD?


This article introduces the characteristics of Spark's RDD (Resilient Distributed Dataset) and is intended as a practical reference.

Concept

RDD has the following characteristics:

Creation: an RDD can only be created through deterministic transformations (such as map/filter/groupBy/join, as opposed to actions) applied to either 1) data in stable storage, or 2) other RDDs.

Read-only: an RDD's state is immutable and cannot be modified once created.

Partitioning: the elements of an RDD can be partitioned by key and stored on multiple nodes. During recovery, only the data of the lost partitions is recomputed, without affecting the rest of the system.

Lineage: each RDD carries enough information (its lineage) about how it was derived from other RDDs.

Persistence: an RDD that will be reused can be cached (for example in memory, or spilled to disk).

Lazy evaluation: like DryadLINQ, Spark defers the computation of RDDs so that transformations can be pipelined.

Operations: a rich set of actions, such as count/reduce/collect/save.

Regarding the difference between transformations and actions: a transformation produces a new RDD, while an action simply returns the result of an operation on an RDD to the program, not a new RDD.
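To make the distinction concrete, here is a minimal, self-contained sketch (the application name, master setting, and data are assumptions chosen for illustration) in which transformations lazily build new RDDs and a single action triggers the job and returns a plain value:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationVsAction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    val nums    = sc.parallelize(1 to 10)   // create an RDD from a local collection
    val evens   = nums.filter(_ % 2 == 0)   // transformation: returns a new RDD, nothing runs yet
    val doubled = evens.map(_ * 2)          // transformation: also lazy

    val total = doubled.reduce(_ + _)       // action: triggers the job and returns an Int to the driver
    println(total)

    sc.stop()
  }
}
```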

The underlying implementation principle of RDD

RDD is a distributed dataset, and as the name implies its data is spread across multiple machines. Concretely, the data of each RDD is stored on multiple machines in the form of Blocks: each Executor starts a BlockManagerSlave that manages part of the Blocks, while the metadata of the Blocks is kept by the BlockManagerMaster on the Driver node. After a BlockManagerSlave creates a Block, it registers the Block with the BlockManagerMaster, which maintains the mapping between RDDs and Blocks. When an RDD no longer needs to be stored, the BlockManagerMaster sends an instruction to the BlockManagerSlaves to delete the corresponding Blocks.

The principle of the RDD cache

During a chain of RDD transformations, not every RDD is stored. If an RDD will be reused, or is expensive to compute, it can be persisted by calling the cache() method that RDD provides. So how is the RDD cache implemented?

The cache() method provided by RDD simply puts the RDD on a cache list. When the RDD's iterator is invoked, the RDD is computed through CacheManager and stored in BlockManager; the next time the RDD's data is needed, it can be read directly from BlockManager via CacheManager.
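As a small usage sketch (assuming an existing SparkContext named sc; the input path is hypothetical), caching is only a marker until the first action runs:

```scala
// Assumes an existing SparkContext `sc`; the HDFS path is illustrative.
val words = sc.textFile("hdfs:///tmp/words.txt")
  .flatMap(line => line.split("\\s+"))

words.cache()                            // only marks the RDD for caching; nothing is computed yet

val total    = words.count()             // first action: computes the RDD and stores its blocks via BlockManager
val distinct = words.distinct().count()  // later actions reuse the cached blocks instead of re-reading the file
```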

The fault-tolerance mechanism of RDD: making a distributed dataset fault-tolerant

There are two ways to make a distributed dataset fault-tolerant: data checkpointing and logging updates. RDD uses the logging approach, but recording every fine-grained update would be too expensive. RDD therefore only supports coarse-grained transformations: only the single operation performed on a whole block of data is recorded, and these records form the transformation sequence (lineage) of an RDD. The lineage means that every RDD contains the information about how it was derived from other RDDs and how to reconstruct its data, which is why the fault-tolerance mechanism of RDD is also called "lineage" fault tolerance.

To implement this lineage mechanism, the biggest problem is how to express the dependency between a parent RDD and a child RDD. There are two kinds of dependencies, narrow and wide:

Narrow dependency: each data block of the child RDD depends only on a limited, fixed number of data blocks of the parent RDD. For example, in a map transformation, each block of the child RDD depends on exactly one corresponding block of the parent RDD.

Wide dependency: one data block of the child RDD may depend on all data blocks of the parent RDD. For example, in a groupByKey transformation, a block of the child RDD depends on blocks of every partition of the parent RDD, because records with the same key may appear in any block of the parent.

Distinguishing the two kinds of dependencies matters for two reasons. First, with a narrow dependency, a block of the child RDD can be computed directly on one node from the corresponding blocks of the parent RDD; with a wide dependency, the child RDD cannot be computed until all the data of the parent RDD has been computed, hashed, and transmitted to the corresponding nodes. Second, when data is lost, a narrow dependency only requires recomputing the lost block, whereas a wide dependency requires recomputing all the blocks of the ancestor RDDs. Therefore, when the lineage chain is long, and especially when it contains wide dependencies, it is worth setting data checkpoints at appropriate points. These two properties also call for different task-scheduling and fault-recovery mechanisms for the two kinds of dependencies.
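The following is a minimal sketch of these ideas, assuming an existing SparkContext named sc; the input data and checkpoint directory are invented for illustration:

```scala
// Assumes an existing SparkContext `sc`; data and checkpoint directory are illustrative.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

val scaled  = pairs.mapValues(_ * 10)   // narrow dependency: each child partition reads one fixed parent partition
val grouped = scaled.groupByKey()       // wide dependency: a child partition may read from every parent partition

grouped.checkpoint()                    // cut the lineage so recovery does not replay the whole chain
grouped.count()                         // action: materializes the RDD (and writes the checkpoint)
```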

Internal design of RDD

Each RDD has five main properties:

1) A set of partitions (Partition), the basic units of the dataset. Each partition is processed by one computing task, and the partitions determine the granularity of parallelism. The number of partitions can be specified when an RDD is created; if it is not, a default value is used, namely the number of CPU cores allocated to the program. In the partitioned-storage model, the storage of each partition is implemented by BlockManager: every partition is logically mapped to one Block of BlockManager, and that Block is computed by one Task.

2) A function for computing each partition. In Spark, an RDD is computed partition by partition, and every RDD implements the compute function for this purpose. The compute function composes the iterators, so the intermediate results of each computation do not need to be saved.

3) The dependencies between RDDs. Every transformation of an RDD produces a new RDD, so RDDs form pipeline-like dependencies on one another. When part of a partition's data is lost, Spark can recompute the lost partition through these dependencies instead of recomputing all partitions of the RDD.

4) A Partitioner, i.e. the partitioning function of the RDD. Spark currently implements two kinds of partitioning functions: the hash-based HashPartitioner and the range-based RangePartitioner. Only key-value RDDs have a Partitioner; for non-key-value RDDs the Partitioner is None.

The Partitioner determines not only the number of partitions of the RDD itself but also the number of partitions of the parent RDD's shuffle output.

5) A list that stores the preferred location of each Partition. For an HDFS file, this list holds the location of the block each partition resides on. Following the principle that "moving computation is cheaper than moving data", Spark tries, when scheduling tasks, to assign computing tasks to the storage locations of the data blocks they will process.
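These five properties roughly correspond to the members of Spark's abstract RDD class. The following is a simplified, non-authoritative sketch of that interface (names and signatures abbreviated; the real org.apache.spark.rdd.RDD class has more members and different visibility modifiers):

```scala
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified sketch of the five properties; not the real Spark class.
abstract class FivePropertyRDD[T] {
  // 1) the list of partitions making up the dataset
  def getPartitions: Array[Partition]

  // 2) how to compute one partition, returning an iterator over its elements
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // 3) dependencies on parent RDDs (narrow or wide)
  def getDependencies: Seq[Dependency[_]]

  // 4) optional partitioner; only key-value RDDs define one
  val partitioner: Option[Partitioner] = None

  // 5) preferred locations (for example, HDFS block hosts) of a partition
  def getPreferredLocations(split: Partition): Seq[String] = Nil
}
```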

Take several of the RDDs built into Spark as examples:

Info / RDD    | HadoopRDD                    | FilteredRDD                    | JoinedRDD
Partitions    | One per HDFS block           | Same as the parent RDD         | One per reduce task
PreferredLoc  | HDFS block locations         | None (or ask the parent RDD)   | None
Dependencies  | None                         | One-to-one with the parent RDD | Shuffle on each parent RDD
Iterator      | Read the corresponding block | Filter the parent's data       | Join the shuffled data
Partitioner   | None                         | None                           | HashPartitioner

Working principle

Execution is mainly divided into three steps: creating the RDD objects, the DAG scheduler creating the execution plan, and the task scheduler assigning tasks and scheduling Workers to run them.

Take a look at how an RDD works through the following example, which counts the total number of distinct names under each first letter (A to Z).
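A minimal sketch of what such a job might look like (assuming an existing SparkContext named sc; the input path and exact operations are assumptions, chosen to match the four transformations plus collect described below):

```scala
// Assumes an existing SparkContext `sc`; the HDFS path is illustrative.
val lines   = sc.textFile("hdfs:///tmp/names.txt")      // RDD 1: one name per line
val names   = lines.filter(line => line.nonEmpty)       // RDD 2: drop blank lines
val grouped = names.groupBy(name => name.charAt(0))     // RDD 3: reorganize the data by first letter
val counts  = grouped.mapValues(ns => ns.toSet.size)    // RDD 4: number of distinct names per letter
val result  = counts.collect()                          // action: returns Array[(Char, Int)] to the driver
```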

Step 1: create the RDDs. In the example above, except for the final collect, which is an action and does not create an RDD, each of the first four transformations creates a new RDD. So the first step is to create all of these RDDs (each carrying the five pieces of information described above).

Step 2: create an execution plan. Spark pipelines operations as much as possible and splits the plan into stages based on whether the data needs to be reorganized (shuffled). For example, the groupBy() transformation in this case divides the whole execution plan into two stages. Finally, a DAG (directed acyclic graph) is generated as the logical execution plan.

Step 3: schedule tasks. Each stage is divided into tasks, each of which is a combination of data and computation. All tasks in the current stage must be completed before moving on to the next stage, because the first transformation of the next stage has to reorganize the data and therefore must wait until all the output data of the current stage has been computed.
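To see how Spark splits such a plan, the lineage of an RDD can be printed with toDebugString; a hedged sketch continuing the hypothetical example above (exact output varies by Spark version):

```scala
// Prints the lineage of `counts`; indentation marks the shuffle (stage) boundary introduced by groupBy().
println(counts.toDebugString)
```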

This concludes the overview of the characteristics of Spark RDD; hopefully it serves as a useful reference.
