Why do Hadoop and Spark sort by key?

2025-03-26 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/02 Report --

This article introduces the relevant knowledge of "Why do Hadoop and Spark sort by key". Many people run into this question when working on real cases, so let the editor walk you through how these situations are handled. I hope you read it carefully and take something away from it!

1. Thinking

If you are familiar with how MapReduce in Hadoop works end to end, you know the process involves at least three sorts: the quick sort before each spill, the merge sort when spill files are merged, and the merge sort when the reduce side pulls data. All of them use the default ordering, that is, the natural order of the key. So why is it designed this way? The conclusion first: sorting makes the framework more stable as a whole, and it makes the output fit most downstream needs. The former is reflected in choosing sort-based shuffle (sortShuffle) rather than hash-based shuffle (hashShuffle); the latter is reflected in pre-computation: sorted data is much more convenient to consume later, much like an index, for example when the reduce side pulls data.

2. Analysis of the MapReduce principle

Before analyzing the design reasons, let's walk through the whole flow. In the map phase, output is partitioned according to the pre-defined partition rules. The map task first writes its output to an in-memory buffer; when the buffer content reaches a threshold, the result is spilled to disk. Each spill generates one spill file on disk, so a single map task may produce multiple spill files, and within each spill the records are sorted by key. Next, in the shuffle phase, once the map task has written its last output, a merge is performed on the map side, merging and sorting by key within each partition (merge + sort); at this point each partition is ordered by key as a whole. Then the second merge starts, this time on the reduce side, during which data lives both in memory and on disk. Strictly speaking, the merge at this stage is not a full sort either; like the previous one it is a merge + sort that combines multiple already-ordered files into one large file, which completes the sorting work. Having gone through the whole flow, ask yourself: if you were implementing the MapReduce framework, would you consider using a HashMap to hold the map output instead?

2.1 Detailed explanation of the MapTask operation mechanism

The whole flow chart is as follows:

Detailed steps:

First of all, the data-reading component InputFormat (TextInputFormat by default) uses its getSplits method to plan the logical slicing of the files in the input directory into splits; the number of splits determines the number of MapTasks started. By default, one split corresponds to one HDFS block.
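
As a rough sketch (the class and split-size values below are illustrative, not from the original article), the input format and split size can be set on the job like this; by default getSplits produces one split per HDFS block:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitConfigSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-config-sketch");
        // TextInputFormat is already the default; set explicitly here only for clarity.
        job.setInputFormatClass(TextInputFormat.class);
        // Bounding the split size bounds the work per MapTask; 128 MB is an illustrative value.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMinInputSplitSize(job, 1L);
    }
}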

After the input file is divided into splits, a RecordReader object (LineRecordReader by default) reads the input data with \n as the delimiter and returns key/value pairs: the key is the byte offset of the first character of each line, and the value is the text content of the line.

Each record read from the split is passed into the user's Mapper subclass, where the user-overridden map function is executed. The RecordReader advances one line per read here.
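
As a minimal sketch of what the user-overridden map function sees (the word-count logic is a hypothetical example, not taken from this article): the key is the LongWritable byte offset of the line and the value is the Text content of the line.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of this line in the split, value = the line itself
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // collected, partitioned and buffered by the framework
            }
        }
    }
}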

After the map logic is completed, each map result is collected by calling context.write. Inside this collect step the record is partitioned first, using HashPartitioner by default. MapReduce provides the Partitioner interface, whose job is to decide, based on the key (or value) and the number of reduce tasks, which reduce task should handle the current pair of output data. By default, the key is hashed and the result taken modulo the number of reduce tasks; this default only balances load evenly across reducers on average. If users have particular requirements for the Partitioner, they can customize it and set it on the job.
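
The default HashPartitioner boils down to hashing the key and taking the result modulo the number of reduce tasks. A custom Partitioner with equivalent behavior (a sketch with hypothetical type parameters) would look like this and be registered via job.setPartitionerClass:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}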

Next, the data is written to memory, into an area called the ring buffer, which collects map results in batches and reduces the impact of disk IO. Both the key/value pair and the result of the Partitioner are written into the buffer. Of course, before being written, the key and value are serialized into byte arrays.

The ring buffer is actually a byte array that stores both the serialized key and value data and the metadata for each record, including the partition, the start position of the key, the start position of the value, and the length of the value. The ring structure is an abstract concept.

The buffer has a size limit, 100MB by default. When a map task produces a lot of output, memory might overflow, so under certain conditions the data in the buffer needs to be written temporarily to disk, and the buffer is then reused. This process of writing data from memory to disk is called a spill, which can be translated as "overflow write" in Chinese. The spill is performed by a separate thread so that it does not affect the thread writing map results into the buffer. Since the spill thread should not block the map output when it starts, the buffer has a spill ratio, spill.percent, which defaults to 0.8. That is, when the data in the buffer reaches the threshold (buffer size * spill percent = 100MB * 0.8 = 80MB), the spill thread starts and locks that 80MB of memory, while the map task can keep writing its output into the remaining 20MB; the two do not interfere with each other.
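
Assuming a Hadoop 2.x/3.x configuration, the buffer size and spill ratio described above correspond to the mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent properties; a sketch of tuning them (the 200 MB figure is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 200);            // ring buffer size in MB, default 100
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // spill threshold ratio, default 0.8
        Job job = Job.getInstance(conf, "spill-config-sketch");
    }
}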

When the spill thread starts, it sorts the keys within this 80MB of space. Sorting is a default behavior of the MapReduce model!
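
The sort compares keys by their natural order, i.e. the compareTo of the key's WritableComparable type. A sketch of a hypothetical composite key that defines such an order (a real key should also override hashCode and equals consistently):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    public void set(int year, int month) { this.year = year; this.month = month; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {
        // The spill sort, merge sort and reduce-side grouping all rely on this ordering.
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(month, other.month);
    }
}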

If the job has a Combiner configured, now is the time it is used. It aggregates key/value pairs that have the same key, reducing the amount of data spilled to disk. The Combiner optimizes the intermediate results of MapReduce, so it may be applied multiple times throughout the process.

Which scenarios allow a Combiner to be used? From this analysis, the output of the Combiner becomes the input of the Reducer, and the Combiner must never change the final calculation result. A Combiner should only be used when the reduce logic's input key/value types match its output key/value types and applying it early cannot affect the final result, for example summation or taking a maximum. Use the Combiner with care: used well, it helps job efficiency; used badly, it changes the final result of the reduce.
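
Summation is the classic safe case: adding up partial sums gives the same total, so the same reducer class can be reused as the Combiner (a sketch; an average, by contrast, must not be combined this way):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get(); // partial sums of partial sums still equal the total sum
        }
        result.set(sum);
        context.write(key, result);
    }
}
// registered with: job.setCombinerClass(IntSumReducer.class);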

Merging spill files: each spill generates a temporary file on disk (and the Combiner, if configured, is applied before writing). If the map output is really large and there are several such spills, there will be several temporary files on disk accordingly. When all the data has been processed, the temporary files on disk are merged into a single final file written to disk, together with an index file that records the offset of the data belonging to each reduce task.

2.2 Detailed explanation of the ReduceTask operation mechanism

The reduce task can be roughly divided into three phases: copy, sort, and reduce, with the emphasis on the first two. The copy phase includes an eventFetcher to obtain the list of completed maps, while Fetcher threads copy the data; during this process two merge threads, inMemoryMerger and onDiskMerger, are started to merge in-memory data onto disk and to merge data already on disk. Once the data copy finishes, the copy phase is complete and the sort phase begins. The sort phase mainly performs the finalMerge operation and is a pure sort phase. After it completes comes the reduce phase, which calls the user-defined reduce function to process the data. Detailed steps:

2.2.1 Copy phase

This phase simply pulls the data. The Reduce process starts some data copy threads (Fetcher) that request the map task outputs over HTTP.

2.2.2 Merge phase

Merge phase. The merge here is like the merge action on the map side, except that what is stored in the buffer are values copied from the various map outputs. The data from the copy phase is first put into an in-memory buffer, and the buffer size here is more flexible than on the map side. There are three forms of merge: memory to memory, memory to disk, and disk to disk. By default, the first form is not enabled. When the amount of data in memory reaches a certain threshold, the memory-to-disk merge starts. Similar to the map side, this is also a spill process, during which the Combiner, if configured, is also applied, and many spill files are generated on disk. The second merge form keeps running until there is no more data coming from the map side, and then the third, disk-to-disk, merge mode starts to generate the final file.
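
Assuming Hadoop 2.x/3.x property names, the reduce-side copy and merge behavior described above is governed by settings like the following (the values are illustrative):

import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);            // Fetcher copy threads, default 5
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f); // share of heap used for the copy buffer
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);        // fill level that triggers the memory-to-disk merge
    }
}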

2.2.3 Merge sort

After the scattered data has been merged into one large, ordered dataset, the framework calls the reduce method on the sorted key-value pairs: one call per group of key-value pairs with equal keys, and each call can produce zero or more output key-value pairs. Finally, these output key-value pairs are written to an HDFS file.
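
Putting the pieces together, a minimal driver (a sketch reusing the hypothetical WordCountMapper and IntSumReducer classes above) wires the mapper, combiner and reducer together and writes the reduce output to an HDFS path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation before the spill
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // final key-value output lands here on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}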

This is the end of the content of "Why do Hadoop and Spark sort by key". Thank you for reading. If you want to learn more about the industry, you can follow the website; the editor will keep producing high-quality, practical articles for you!
