An Alternative Interpretation of Spark's RDD

2025-02-24 Update · From: SLTechnology News & Howtos > Internet Technology


Shulou(Shulou.com)06/01 Report--

This article explains in detail an alternative interpretation of Spark's RDD. It is shared as a reference; after reading, you should have a working understanding of the relevant concepts.

1 The RDD of Spark

Any discussion of Spark must start with the RDD (Resilient Distributed Dataset), the core abstraction of Spark. Without a deep understanding of RDDs it is difficult to write good Spark programs, yet most online explanations of the RDD simply repeat one another, parroting without adding any insight of their own. Based on the paper by Spark's original authors, the following sections take a first look at this core concept by unpacking the three words in its name.

1.1 Resilient

The dictionary meaning is "able to recover; elastic, springy." In everyday life, an elastic object is one that does not break easily, such as a ball or a tire; Apple has even applied for an iPhone patent that adds rubber-band-like material at the four corners of the phone to improve its drop resistance. That Spark's core data structure is "resilient" shows that Spark was designed from the start for large-scale distributed clusters, where any server may fail at any time. If the server running a computing subtask (Task) fails, that subtask can no longer execute there. This is where the resilience of the RDD comes into play: the failed subtask can migrate within the cluster, so the whole job (Job) survives the failed machine smoothly.

Some readers may wonder whether there are systems that are not resilient but rigid. There are: many interactive query systems, such as Presto or Impala, run queries with second-level latency, so if a subtask fails the whole query can simply be rerun. Spark jobs, by contrast, often run for minutes or even hours, so rerunning an entire job is very expensive, while rerunning individual tasks is comparatively cheap. Spark's data structure therefore has to be "resilient," with automatic fault tolerance that ensures the job as a whole effectively runs only once.
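To make the idea concrete, here is a minimal, self-contained pure-Python sketch (not Spark code; all names are illustrative) that simulates re-running a failed partition task on another worker instead of restarting the whole job:

```python
def run_task(partition, worker):
    """Simulate running one subtask; the flaky worker always fails."""
    if worker == "flaky-node":
        raise RuntimeError(f"worker {worker} crashed")
    return sum(partition)  # the "computation" for this partition

def run_job(partitions, workers):
    """Run one task per partition, migrating a failed task to another worker."""
    results = []
    for partition in partitions:
        for worker in workers:           # try workers until one succeeds
            try:
                results.append(run_task(partition, worker))
                break
            except RuntimeError:
                continue                  # the task migrates to the next worker
        else:
            raise RuntimeError("no healthy worker left")
    return results

partitions = [[1, 2], [3, 4], [5, 6]]
workers = ["flaky-node", "healthy-node"]
print(run_job(partitions, workers))  # [3, 7, 11]
```

Only the failed subtasks are retried; already-completed partition results are kept, which is the cost argument the paragraph above makes.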

1.2 Distributed

This English word needs little explanation: it means "distributed." But how exactly is Spark's data structure distributed? This involves Spark's concept of the partition, that is, the rules by which data is split. Split according to some specific rule, each subset of the data can be processed by an independent task, and these tasks execute in parallel on multiple servers in the cluster; this is what Distributed means in Spark.

In the Spark source code, RDD is a base class representing data, from which many sub-RDDs are derived. Different sub-RDDs have different functions, but all of them must be partitionable. For example, reading data from HDFS is done through a HadoopRDD. Its splitting rule is: if an HDFS file can be split by block (64 MB or 128 MB), such as a file in txt format, then one block maps to one partition, and Spark generates one task per block to process that block's data; if the file on HDFS is not splittable, such as a compressed zip or gzip file, then one file corresponds to one partition.

If the data arrives in random order but needs to be grouped (group) by key, the data must be redistributed (shuffled) across the cluster according to the keys of the source data, so that records with the same key are "sorted" together. If all keys were placed into a single partition, only one task could do the grouping and performance would be very poor. The grouping process is therefore parallelized by splitting the key space, so that different keys are processed in different partitions.
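A minimal pure-Python sketch of this key-splitting idea (illustrative only, not Spark's actual shuffle implementation): records are assigned to partitions by hashing their key, so each partition can then be grouped by an independent task.

```python
def hash_partition(records, num_partitions):
    """Assign each (key, value) record to a partition by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        index = hash(key) % num_partitions  # same key -> same partition
        partitions[index].append((key, value))
    return partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = hash_partition(records, 2)
# Every record with a given key lands in exactly one partition,
# so each partition can be grouped independently and in parallel.
```

Because the same key always hashes to the same partition, no group is split across tasks, which is what makes the parallel grouping correct.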

1.3 Datasets

Seeing this word, many people mistakenly assume that an RDD is Spark's data storage structure. It is not. The "dataset" in RDD is not a real collection, let alone a Java collection; it represents the logic of data processing in Spark. How to understand this? It helps to combine two concepts: the transform operations on RDDs, and Spark's pipelining. First, look at a transformation diagram from the paper:

[Figure: RDD transformation diagram from the paper]

Each rectangle in the figure is an RDD, and each represents a different view of the data. Note the word "represents" rather than "stores." For example, the lines RDD represents the raw lines of text; the errors RDD represents only the lines beginning with "ERROR"; and the HDFS errors RDD represents only the lines that also contain the keyword "HDFS". This is the "transformation" process of an RDD.

Now return to the two contrasting words above, "represent" and "store," and consider what each would imply. With "store," each RDD would have to be fully materialized after a transform and handed to the next processing step only after all of it was produced (similar to the old way of downloading movies with Xunlei: you had to finish the download before you could watch, two strictly serial steps). The problem is that the next step must wait for the whole batch of data to arrive, which is unnecessary. If instead each step hands a record to the next step as soon as it has processed it, there is no waiting, and overall system performance improves greatly. That is exactly what "represent" expresses (similar to streaming media, which does not download the whole movie first but plays while downloading), and it is the pipeline processing model in Spark.

2 The lineage of Spark

Having unpacked the three words in RDD, readers may still have a question: with pipelined processing, the data at each stage seems to "hang in mid-air" rather than safely landing on disk. Indeed, if that is so, how does Spark ensure that this suspended, streaming data is "recoverable" after a server failure? This leads to another important Spark concept: lineage. The lineage of an RDD is a chain of processing logic like the one in the figure above. Spark records the lineage of every RDD; to borrow from a classic comedy sketch by Fan Wei, Spark knows exactly how each data subset "came to be" and what it was transformed from. When a data subset is lost, Spark recomputes it from its lineage, thereby guaranteeing the resilience of the datasets.
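A self-contained Python sketch of lineage-based recovery (class and method names are illustrative, not Spark's API): each dataset records only its parent and the function that derives it, so a lost result can be recomputed by replaying the chain instead of being read from a stored copy.

```python
class Dataset:
    """A toy RDD: remembers its parent and transform instead of storing data."""
    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # only the root holds real data
        self.parent = parent        # lineage: where this dataset came from
        self.transform = transform  # lineage: how it was derived

    def map(self, fn):
        return Dataset(parent=self, transform=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return Dataset(parent=self, transform=lambda part: [x for x in part if pred(x)])

    def compute(self):
        """Rebuild this dataset from the root by replaying its lineage."""
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.compute())

root = Dataset(source=[1, 2, 3, 4, 5])
derived = root.filter(lambda x: x % 2 == 1).map(lambda x: x * 10)

# No intermediate result was ever stored; if a cached copy of `derived`
# is lost, compute() replays the lineage and rebuilds it from the root.
print(derived.compute())  # [10, 30, 50]
```

The trade-off is the one the article describes: nothing needs to touch disk during normal operation, and recovery costs a recomputation of only the lost piece.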

3 Notes

1) Of course, if an RDD is cached or checkpointed, Spark can be said to "store" that RDD's data; this belongs to the optimization topics explained later.

2) During a transform, an RDD does not hand each record to the next RDD one at a time; records are delivered in small batches. This, too, belongs to the optimization topics.
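The batched hand-off in note 2 can be sketched as a small Python helper (illustrative only): records flow between stages in chunks rather than one at a time, amortizing the per-record cost.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield records from an iterator in small lists instead of one by one."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Downstream logic receives chunks, not individual records.
chunks = list(batched(range(7), 3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```

This keeps the pipelined, no-full-materialization property of the previous section while reducing hand-off overhead between stages.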

This concludes the alternative interpretation of Spark's RDD. I hope the content above is helpful. If you found the article worthwhile, feel free to share it for more people to see.
