This article covers the RDD knowledge points you need to know when learning Spark. The content is easy to follow and clearly laid out; I hope it resolves your doubts. Let's work through "What RDD knowledge points do you need to know to learn Spark?" together.
Job scheduling
When an action is executed on an RDD, the scheduler builds a directed acyclic graph (DAG) of scheduling stages (Stages) based on the RDD's lineage; each stage contains as many consecutive narrow-dependency transformations as possible. The scheduler then computes the stages in DAG order and finally produces the target RDD.
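As an illustrative sketch (the job and input path below are made up, not from the article), transformations only record lineage; the DAG of stages is built when an action runs, and toDebugString shows how consecutive narrow transformations are grouped into one stage while a shuffle starts a new one:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dag-demo").setMaster("local[*]"))

// Transformations are lazy: they only extend the lineage graph.
val lines  = sc.textFile("hdfs:///tmp/input.txt")      // hypothetical input path
val words  = lines.flatMap(_.split("\\s+"))            // narrow dependency
val pairs  = words.map(w => (w, 1))                    // narrow, pipelined with flatMap
val counts = pairs.reduceByKey(_ + _)                  // wide dependency: shuffle => new stage

// The indentation in toDebugString marks the shuffle (stage) boundary.
println(counts.toDebugString)

// Only the action triggers the scheduler to build and run the DAG of stages.
counts.collect()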
The scheduler uses delay scheduling to assign tasks to nodes according to data locality. If a task needs to process a partition that is already stored in a node's memory, the task is sent to that node; if the partition is not in memory anywhere, the task is sent to one of the preferred locations provided by the RDD (for example, the nodes holding the HDFS blocks from which it was built).
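As a hedged illustration, delay scheduling can be tuned through Spark's spark.locality.wait settings, which control how long the scheduler waits for a more local slot before falling back to a less local one; the values below are illustrative, not recommendations:

import org.apache.spark.SparkConf

// Illustrative settings only; the defaults are usually fine.
val conf = new SparkConf()
  .setAppName("locality-demo")
  // How long to wait before degrading to the next locality level.
  .set("spark.locality.wait", "3s")
  // Per-level overrides: process-local, node-local, rack-local.
  .set("spark.locality.wait.process", "3s")
  .set("spark.locality.wait.node", "3s")
  .set("spark.locality.wait.rack", "3s")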
For wide-dependency (shuffle) operations, Spark materializes the intermediate results on the nodes holding the parent partitions, much as MapReduce materializes map output; this simplifies failure recovery. As shown in the following figure, solid rounded boxes denote RDDs and shaded rectangles denote partitions; partitions already stored in memory are drawn with a black background. Running an action on the final RDD builds one scheduling stage per wide-dependency boundary, and the narrow dependencies inside each stage are pipelined together. In this example the output of Stage 1 is already in memory, so Stage 2 is executed directly, followed by Stage 3.
Figure: how Spark computes the job scheduling stages
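As a rough sketch (the RDDs below are made up for illustration), the narrow/wide distinction can be seen directly on an RDD's dependencies: map-like operations produce a narrow OneToOneDependency, while reduceByKey introduces a ShuffleDependency, which is where the stage boundary falls:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("deps-demo").setMaster("local[*]"))

val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val mapped  = pairs.mapValues(_ * 10)      // narrow dependency: pipelined within one stage
val reduced = mapped.reduceByKey(_ + _)    // wide dependency: forces a shuffle / new stage

// mapped depends on pairs one-to-one; reduced depends on mapped via a shuffle.
println(mapped.dependencies.map(_.getClass.getSimpleName))   // expected: OneToOneDependency
println(reduced.dependencies.map(_.getClass.getSimpleName))  // expected: ShuffleDependency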
For a failed task, as long as the information of the parent stages it depends on is still available, the task is simply re-run on another node. If an entire stage becomes unavailable (for example, because shuffle output was lost on the map side), the corresponding tasks are resubmitted to recompute the missing partitions in parallel. If a task in a job runs unusually slowly (a straggler), the system launches a copy of the task on another node, similar to MapReduce's speculative execution, and takes whichever result finishes first.
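As a hedged sketch, speculative execution is controlled by the spark.speculation settings; the values below are illustrative rather than recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("speculation-demo")
  // Turn on speculative re-execution of slow tasks.
  .set("spark.speculation", "true")
  // How often to check for stragglers.
  .set("spark.speculation.interval", "100ms")
  // A task counts as a straggler if it runs this many times slower than the median.
  .set("spark.speculation.multiplier", "1.5")
  // Only speculate once this fraction of tasks in a stage has finished.
  .set("spark.speculation.quantile", "0.75")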
Scheduler
The RDD model decomposes computation into many independent, fine-grained tasks, which lets it support a variety of resource-sharing algorithms on multi-user clusters. In particular, each Spark application can dynamically adjust the resources it uses while it runs.
Within a single application, Spark can run multiple threads that submit jobs concurrently and share the cluster's resources through a hierarchical fair scheduler, similar to the Hadoop Fair Scheduler. This is mainly used to build multi-user applications over the same in-memory data; for example, the Spark SQL engine offers a server mode that supports parallel queries from multiple users. Fair scheduling ensures that short jobs can finish quickly even when long jobs are occupying the cluster.

Spark's fair scheduling also uses delay scheduling, polling each machine's data so that jobs get high locality while fairness is maintained, and it supports multiple locality levels, including memory, disk, and rack. Because tasks are independent of one another, the scheduler can also cancel running jobs to free resources for higher-priority ones. Finally, Spark can run on YARN to achieve fine-grained resource sharing, letting Spark applications share resources dynamically with each other and with other computing frameworks; running on YARN is also the most common way to deploy Spark in production. A configuration sketch for fair scheduling follows below.
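As a hedged sketch (the pool name and file path are made up for illustration), fair scheduling within an application is enabled by setting spark.scheduler.mode to FAIR, optionally defining pools in an allocation file, and assigning each thread's jobs to a pool:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-demo")
  // Switch the in-application scheduler from FIFO to FAIR.
  .set("spark.scheduler.mode", "FAIR")
  // Optional: XML file defining pools, weights, and minimum shares (path is hypothetical).
  .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")

val sc = new SparkContext(conf)

// Jobs submitted from this thread go into the hypothetical "short-queries" pool.
sc.setLocalProperty("spark.scheduler.pool", "short-queries")
sc.parallelize(1 to 1000).count()

// Clearing the property returns subsequent jobs to the default pool.
sc.setLocalProperty("spark.scheduler.pool", null)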
RDD persistence
Spark provides three storage strategies for persisted RDDs:

1. Deserialized Java objects stored in memory
2. Serialized data stored in memory
3. Data stored on disk
The first option gives the best performance, because the JVM can access the RDD objects in memory directly. When space is limited, the second option lets users store data more compactly than plain Java objects, at the cost of some performance. The third strategy is used when the RDD is too large to keep in memory but recomputing it on every use would incur significant extra overhead (such as I/O).
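As a hedged illustration, these three strategies correspond roughly to the storage levels exposed by persist (the RDD below is made up for the example):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc  = new SparkContext(new SparkConf().setAppName("persist-demo").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 1000000).map(_ * 2)   // hypothetical RDD

rdd.persist(StorageLevel.MEMORY_ONLY)        // deserialized Java objects in memory (cache() default)
// rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized bytes in memory: more compact, more CPU
// rdd.persist(StorageLevel.DISK_ONLY)       // spill everything to disk

rdd.count()        // the first action materializes and stores the partitions
rdd.count()        // later actions reuse the persisted partitions
rdd.unpersist()    // release the storage when no longer needed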
Memory is managed with an LRU eviction policy: when a newly computed RDD partition does not fit, Spark evicts a partition from the least recently used RDD, unless that RDD is the same one the new partition belongs to. In that case Spark keeps the old partition in memory, to prevent partitions of the same RDD from being cycled in and out. This matters because most operations run over all partitions of an RDD, so partitions already in memory are very likely to be needed again.
Checkpoint
Although lineage can be used to recover an RDD after a failure, recovery takes a long time for RDDs with long lineage chains, so such RDDs should be saved to external storage with a checkpoint operation (Checkpoint).
In general, checkpointing is worthwhile for RDDs with long lineage chains that contain wide dependencies. In that case, the failure of a single node in the cluster can lose pieces of data derived from every parent partition, forcing a full recomputation. Conversely, for RDDs with only narrow dependencies, checkpointing is usually unnecessary: if a node fails, the partitions that RDD lost on that node can be recomputed in parallel from other nodes, at a cost that is only a small fraction of replicating the whole RDD.
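A minimal sketch of the checkpoint API (the directory path and the job are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo").setMaster("local[*]"))

// Reliable checkpoints are written to external storage (the HDFS path is hypothetical).
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val base = sc.parallelize(1 to 100000).map(x => (x % 100, x))
val wide = base.reduceByKey(_ + _)     // wide dependency: a good candidate for checkpointing

// Mark the RDD for checkpointing; the data is actually written when the next action runs.
wide.checkpoint()
wide.count()                            // materializes and saves the checkpoint

// After checkpointing, the lineage above this RDD is truncated.
println(wide.toDebugString)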
Spark provides an API for setting checkpoints on RDDs, leaving it to users to decide which data is worth checkpointing. Moreover, because RDDs are read-only, checkpointing them is simpler than checkpointing general shared memory, since there is no need to worry about data consistency. That is all of the content of "What RDD knowledge points do you need to know to learn Spark?". Thank you for reading, and I hope sharing this content has been helpful.