What is the name of the process in which the system performs sorting in Hadoop

2025-07-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains what the process by which Hadoop sorts data is called. It is shared here as a practical reference.

MapReduce guarantees that the input to each reducer is sorted by key. The process by which the system performs this sorting is called shuffle. The shuffle phase mainly covers the partition, sort, and combine steps on the map side and the merge sort and grouping on the reduce side.

The operating environment for this tutorial: Windows 7, Dell G3 computer.

Shuffle can be understood as the whole process that takes the output produced by map and delivers it to reduce as input.

Map side: each map task has a circular memory buffer that holds the task's output. When the buffer contents reach a threshold, a background thread spills them to a newly created spill file in the configured directory on local disk. The data is partitioned, sorted, and run through the Combiner (if one is set) before it is written to disk. After the last record has been written, all spill files are merged into a single partitioned and sorted output file.
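The map-side steps above can be sketched in plain Java. This is only an illustrative simulation, not Hadoop code: the class MapSideSpillSketch, the record-count threshold (the real buffer is measured in bytes), and the in-memory "spill files" are all assumptions made for the example, and the combiner step is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative simulation only: these names are hypothetical, not Hadoop classes.
final class MapSideSpillSketch {
    // One map output record, tagged with the partition (reducer index) it belongs to.
    static final class Rec {
        final int partition;
        final String key;
        final int value;
        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // A spill is ordered by partition first, then by key within each partition.
    static final Comparator<Rec> SPILL_ORDER =
        Comparator.comparingInt((Rec r) -> r.partition).thenComparing(r -> r.key);

    private final int spillThreshold;   // assumption: a record count, not bytes
    private final int numPartitions;
    private final List<Rec> buffer = new ArrayList<>();
    final List<List<Rec>> spills = new ArrayList<>();   // stands in for spill files

    MapSideSpillSketch(int spillThreshold, int numPartitions) {
        this.spillThreshold = spillThreshold;
        this.numPartitions = numPartitions;
    }

    // Buffer one map output record; spill when the threshold is reached.
    void collect(String key, int value) {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        buffer.add(new Rec(partition, key, value));
        if (buffer.size() >= spillThreshold) {
            spill();
        }
    }

    // Sort the buffered records and write them out as one spill.
    void spill() {
        if (buffer.isEmpty()) {
            return;
        }
        List<Rec> run = new ArrayList<>(buffer);
        run.sort(SPILL_ORDER);
        spills.add(run);
        buffer.clear();
    }

    // Flush the remaining records, then combine all spills into one sorted output.
    List<Rec> close() {
        spill();
        List<Rec> merged = new ArrayList<>();
        for (List<Rec> run : spills) {
            merged.addAll(run);
        }
        merged.sort(SPILL_ORDER);   // a re-sort stands in for a streaming merge here
        return merged;
    }
}
```

With a threshold of 2 records and 2 partitions, collecting the keys "c", "a", "b" produces two spills, and close() returns the records in the order "b" (partition 0), "a", "c" (partition 1).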

Reduce side: this can be divided into the copy phase, the sort phase, and the reduce phase.

Copy phase: the map output file sits on the local disk of the tasktracker that ran the map task. It is needed by the tasktracker that runs the reduce task for a given partition, so the reduce task fetches its partition of each output file over HTTP. Map tasks finish at different times, so the reduce task starts copying a map task's output as soon as that map task completes.

Sort phase: it is more accurate to call this the merge phase, because the sorting was already done on the map side. This phase merges the map outputs while maintaining their sort order, and it does so in rounds.
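Merging already-sorted runs while keeping their order can be sketched as a k-way merge. The following is a minimal plain-Java illustration under stated assumptions: MergeSketch and mergeSortedRuns are hypothetical names, keys are plain strings held in memory, and everything is merged in a single pass rather than in rounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative simulation only: not a Hadoop class.
final class MergeSketch {
    // Merge several individually sorted runs of keys into one sorted list.
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            merged.add(runs.get(head[0]).get(head[1]));
            // Advance within the run that supplied the smallest key.
            if (head[1] + 1 < runs.get(head[0]).size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

For example, merging the runs [a, d], [b, c], and an empty run yields [a, b, c, d] without re-sorting any individual run.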

The final stage is the reduce phase, where the reduce function is called for each key in the sorted output. The output of this stage is written directly to the output file system, typically HDFS.

Shuffle stage description

The shuffle phase mainly covers the partition, sort, and combine steps on the map side and the merge sort and grouping on the reduce side. After the map-side shuffle, the output data is stored by reduce partition, and the contents of the file are sorted according to the configured sort order. When the map phase completes, the ApplicationMaster (AM) is notified, and the AM then notifies the reduce tasks to pull the data. The reduce-side shuffle runs while the data is being pulled.

Note: the output data of the map phase is stored on the local disk of the node that ran the map task. It is a temporary file and is not stored on HDFS. After the reduce tasks have pulled the data, the temporary file is deleted. Storing it on HDFS would waste storage space, because three copies would be created.

User-defined Combiner

A Combiner can reduce the volume of intermediate output from the map phase and so reduce network overhead. No Combiner is used by default. A user-defined Combiner must be a subclass of Reducer. It takes the map output as its input, and its output goes on to the reducer in place of that map output, so the Combiner's input and output key/value types must be the same.

The combiner class is set with job.setCombinerClass. The MapReduce framework does not guarantee that the methods of this class will be called, so the job must produce correct results whether or not the combiner runs.

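Note: if the input and output types of reduce are the same, you can use the reduce class directly as the combiner, provided the reduce operation gives the same final result when it is applied to partial groups of values first (as a sum does).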
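To illustrate why a combiner has matching input and output types and may safely run any number of times, here is a small plain-Java sketch of local aggregation for a word count. It is not the Hadoop Reducer API: CombinerSketch and combine are hypothetical names, and a summing operation is assumed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative simulation only: not a Hadoop Reducer subclass.
final class CombinerSketch {
    // Collapse (word, count) pairs from one map task into one pair per word.
    // Input and output are both (String, Integer) pairs, so the result can be
    // fed through combine again, or straight to the reducer, without conversion.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }
}
```

Given the pairs (a, 1), (b, 1), (a, 1), combine returns {a=2, b=1}. Passing that result through combine a second time leaves it unchanged, which is why a sum tolerates the combiner running zero, one, or several times.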

User-defined Partitioner

Partitioner determines which reducer handles each map output record. The default number of reduce tasks for a MapReduce job is 1, and in that case the Partitioner has no effect. When the number of reduce tasks is changed to more than one, the Partitioner determines the index (starting at 0) of the reduce task that each key goes to.

You can specify the Partitioner class with the job.setPartitionerClass method. HashPartitioner is used by default, and it calls the key's hashCode method.
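The hash-based rule can be written in a few lines. The sketch below mirrors the calculation HashPartitioner performs, but as a standalone plain-Java helper: HashPartitionSketch and partitionFor are hypothetical names, not part of Hadoop.

```java
// Illustrative stand-in for hash partitioning; not the Hadoop Partitioner class.
final class HashPartitionSketch {
    // Map a key to a reduce task index in the range [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        // Mask the sign bit so a negative hashCode still yields a non-negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

For example, partitionFor("a", 4) returns 1, because "a".hashCode() is 97 and 97 % 4 is 1. With a single reduce task every key maps to index 0, which is why the Partitioner has no visible effect in that case.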

User-defined Group

GroupingComparator is the class that groups the map output by key. Put simply, it decides whether key1 and key2 belong to the same group. If they do, their map output values are combined and passed to a single reduce call.

A custom class must implement the RawComparator interface, and the comparison class is specified with the job.setGroupingComparatorClass method. WritableComparator is used by default, which ultimately calls the key's compareTo method for the comparison.
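The effect of a grouping comparator can be shown with a plain-Java sketch that walks already-sorted keys and starts a new group whenever the comparator reports a difference. Everything here is an assumption for illustration: GroupingSketch is a hypothetical name, the composite key format "naturalKey#secondaryField" is invented, and java.util.Comparator stands in for Hadoop's RawComparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative simulation only: not a Hadoop RawComparator.
final class GroupingSketch {
    // Hypothetical composite key "naturalKey#secondaryField":
    // two keys are in the same group when their natural-key parts are equal.
    static final Comparator<String> GROUP_BY_NATURAL_KEY =
        Comparator.comparing((String k) -> k.substring(0, k.indexOf('#')));

    // Split sorted keys into groups; each group would feed one reduce call.
    static List<List<String>> group(List<String> sortedKeys, Comparator<String> grouping) {
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String key : sortedKeys) {
            // A non-zero comparison with the previous key opens a new group.
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }
}
```

Grouping the sorted keys 1900#10, 1900#35, 1901#5 with GROUP_BY_NATURAL_KEY gives two groups: [1900#10, 1900#35] and [1901#5]. Only adjacent keys are compared, so the grouping relies on the keys having been sorted first.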

User-defined Sort

SortComparator is the class that sorts the map output by key. Put simply, it decides the order of key1 relative to key2.

A custom class must implement the RawComparator interface, and the comparison class is specified with the job.setSortComparatorClass method. WritableComparator is used by default, which ultimately calls the key's compareTo method for the comparison.
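As a rough illustration of what a custom sort order changes, the sketch below sorts composite keys by one field ascending and a second field descending. It is plain Java, not Hadoop code: SortComparatorSketch is a hypothetical name, the key format "naturalKey#number" is invented for the example, and java.util.Comparator stands in for RawComparator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative simulation only: not a Hadoop RawComparator.
final class SortComparatorSketch {
    // Hypothetical composite key "naturalKey#number":
    // natural key ascending, then the numeric part descending.
    static final Comparator<String> SORT_ORDER =
        Comparator.comparing((String k) -> k.substring(0, k.indexOf('#')))
                  .thenComparing(
                      (String k) -> Integer.valueOf(k.substring(k.indexOf('#') + 1)),
                      Comparator.reverseOrder());

    // Return a sorted copy of the keys, leaving the input list untouched.
    static List<String> sortKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(SORT_ORDER);
        return sorted;
    }
}
```

Sorting the keys 1901#5, 1900#10, 1900#35 gives 1900#35, 1900#10, 1901#5. The default ordering by the key's own compareTo would instead place 1900#10 before 1900#35.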

User-defined Shuffle of Reducer

When the reduce side pulls the map output data, it performs a shuffle (merge sort). The MapReduce framework lets you customize this step through a plug-in mechanism: implement the ShuffleConsumerPlugin interface and set the parameter mapreduce.job.reduce.shuffle.consumer.plugin.class to your class. In general, though, the default class org.apache.hadoop.mapreduce.task.reduce.Shuffle is used directly.

Thank you for reading! This is the end of the article on what the process of sorting in Hadoop is called. I hope the content above is of some help to you.
