Complete decryption of Spark CheckPoint (41)

Updated 2025-02-23. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

I. What is Checkpoint?

1. In production, Spark often faces a long chain of transformations (for example, a single Job may involve 10,000 RDDs), or a particular transformation produces an RDD that is very complex and expensive to compute (for example, the computation takes more than an hour). In such cases we have to consider persisting the computed results.

2. Spark is good at multi-step iteration and at reuse across Jobs. If the data produced along the way can be reused, efficiency improves greatly.

3. If you use persist to keep the data in memory, access is fastest but also least reliable. If you put the data on disk, it is still not completely reliable, because the disk itself can fail.

4. Checkpoint exists to persist data more reliably. A checkpoint can place the data locally or with multiple replicas, but in a normal production environment it is written to HDFS, relying on the fault tolerance and reliability of HDFS to make the persisted data as safe as possible.

5. Checkpoint is an advanced Spark feature for making the reuse of computed RDD data as reliable as possible: by persisting the data to HDFS, it maximizes the safety of that data.

6. Checkpoint applies a persistence and reuse strategy based on HDFS (or a similar store) to the specific links in an RDD computing chain that most need it, that is, RDDs that will be used repeatedly later. Enabling the checkpoint mechanism on such an RDD provides fault tolerance and high availability.
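
The difference between the reliability levels described above can be sketched in Scala as follows. This is only a minimal illustration: the local master, the object and variable names, and the /tmp path are assumptions made for the example (a production cluster would point the checkpoint directory at HDFS).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistVsCheckpoint {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master and an arbitrary app name, for illustration only.
    val conf = new SparkConf().setAppName("PersistVsCheckpoint").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: a hypothetical local path; in production this would be an HDFS directory.
    sc.setCheckpointDir("/tmp/spark-checkpoint-demo")

    // Stand-in for an expensive intermediate result that later Jobs will reuse.
    val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

    // persist keeps the computed partitions on the executors (memory first, spilling
    // to local disk). It is fast, but the cached data is lost if an executor or its disk fails.
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    // checkpoint marks the RDD to be written to the checkpoint directory,
    // so its data survives the loss of cached blocks.
    squares.checkpoint()

    // The first action computes the RDD and triggers the checkpoint write;
    // the second action reuses the already computed data.
    val total = squares.reduce(_ + _)
    val count = squares.count()
    println(s"total=$total count=$count")

    sc.stop()
  }
}
```

Here persist serves the fast path for reuse, while checkpoint provides the more reliable copy.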

II. Principle and mechanism of Checkpoint

1. Call the SparkContext.setCheckpointDir method to specify where RDDs that perform the Checkpoint operation place their data. In a production cluster this is a directory on HDFS. The method takes a single directory path, and Spark creates a uniquely named subdirectory under it to hold the checkpoint files of the application.

2. Once an RDD has been checkpointed, all the RDDs it depended on are removed from its computing chain.

3. As a best practice, persist the current RDD to memory or disk by calling persist before calling the checkpoint method. This is because checkpoint is lazy: calling it only marks the RDD. A Job must first run on the RDD. After that Job completes, Spark traces back to find which RDDs were marked for Checkpoint, and a new Job is started for each marked RDD to carry out the actual Checkpoint process. Without persist, that new Job has to compute the RDD again from the start.

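4. Checkpoint changes the Lineage of the RDD: after the checkpoint, the RDD reads from the checkpointed data instead of depending on its original parents.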

5. When we call the checkpoint method on an RDD, the framework automatically creates an RDDCheckpointData for it. When a Job is run on that RDD, the checkpoint method of RDDCheckpointData is triggered once the Job completes, and it calls doCheckpoint internally. In a production environment the doCheckpoint that actually runs is the one in ReliableRDDCheckpointData, which in turn calls writeRDDToCheckpointDirectory of ReliableCheckpointRDD. Inside writeRDDToCheckpointDirectory, runJob is triggered to write the data of the current RDD to the Checkpoint directory, and a ReliableCheckpointRDD instance is produced.
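
The lazy behavior and the Lineage change described in the points above can be observed with a short Scala sketch. It is an illustration under stated assumptions rather than a reference implementation: the local master, the object and variable names, and the /tmp checkpoint path are all hypothetical, and the exact text printed by toDebugString varies between Spark versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointLineage {
  def main(args: Array[String]): Unit = {
    // Assumption: local master and arbitrary app name, for illustration only.
    val conf = new SparkConf().setAppName("CheckpointLineage").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: hypothetical local path; a production cluster would use HDFS.
    sc.setCheckpointDir("/tmp/spark-checkpoint-demo")

    val derived = sc.parallelize(1 to 100).map(_ * 2).filter(_ % 3 == 0)

    // Persist first, so the separate checkpoint Job can reuse the cached
    // partitions instead of recomputing the RDD.
    derived.persist()

    // Lazy: this only marks the RDD. Nothing is written yet.
    derived.checkpoint()
    println(s"before action: isCheckpointed=${derived.isCheckpointed}")
    println(derived.toDebugString) // still shows the original parents

    // Running a Job on the RDD triggers the checkpoint once the Job completes.
    derived.count()

    println(s"after action: isCheckpointed=${derived.isCheckpointed}")
    println(s"checkpoint file: ${derived.getCheckpointFile}")
    println(derived.toDebugString) // the lineage now starts from the checkpointed RDD

    sc.stop()
  }
}
```

Comparing the two toDebugString outputs shows the original parent RDDs being replaced by the checkpointed RDD after the first action.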
