What is the Shuffle mechanism of MapReduce?

This article introduces the Shuffle mechanism of MapReduce, walking through the process step by step in a way that is simple and practical. I hope it helps you understand how the shuffle works.

The shuffle begins with the copy phase: each reduce task remotely copies a piece of output from every map task. If a piece exceeds a certain size threshold it is written to disk; otherwise it is kept directly in memory.

The map side

When the map function starts to produce output, it does not simply write it to disk. The process is more involved: for efficiency, output is buffered in memory and pre-sorted before it goes to disk.

Each map task has a circular memory buffer for storing its output. By default the buffer is 100 MB, a size that can be adjusted through the mapreduce.task.io.sort.mb property. Once the buffer content reaches a threshold (mapreduce.map.sort.spill.percent, 80% by default), a background thread begins to spill the content to disk. Map output continues to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map blocks until the spill is complete. Spills are written in round-robin fashion to a job-specific subdirectory of the directories specified by the mapreduce.cluster.local.dir property.

Before writing to disk, the thread first divides the data into partitions corresponding to the reducers the data will ultimately be sent to (a custom partition function can be supplied, but the default partitioner, which partitions by hashing the key, is efficient and usually sufficient). Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the sorted output. Running the combiner makes the map output more compact, so less data is written to disk and transferred to the reducers.
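As a rough illustration of these map-side knobs, the sketch below sets the sort-buffer size and spill threshold and plugs in a custom partitioner. The class names, the job name, and the chosen values are hypothetical; the default HashPartitioner is normally fine, and a custom partitioner is only needed when you want to control how keys map to reducers.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class ShuffleTuningExample {

    // Hypothetical partitioner: records with the same key always land in the
    // same reducer partition, just as the default hash partitioner guarantees.
    public static class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            if (key.getLength() == 0) {
                return 0; // route empty keys to partition 0
            }
            // Route keys by their first character, masked to stay non-negative.
            return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map-side circular buffer: 200 MB instead of the 100 MB default.
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Start spilling when the buffer is 90% full instead of 80%.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);

        Job job = Job.getInstance(conf, "shuffle-tuning-example");
        job.setJarByClass(ShuffleTuningExample.class);
        job.setPartitionerClass(FirstLetterPartitioner.class);
        // ... mapper, reducer, input/output paths would be set here ...
    }
}
```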

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there may be several spill files. Before the task completes, the spill files are merged into a single partitioned and sorted output file. The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams that can be merged at once; the default is 10.

If there are at least three spill files (the number is set by the mapreduce.map.combine.minspills property), the combiner runs again before the output file is written to disk. A combiner may be run repeatedly over the input without affecting the final result. If there are only one or two spill files, the map output is already small enough that the overhead of invoking the combiner is not worthwhile, so it is not run again for that output.
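A minimal sketch of wiring a combiner into a job follows. The summing reducer is the classic word-count-style example and the names are illustrative; summation is commutative and associative, which is exactly what makes it safe to run zero, one, or many times on partial map output.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerExample {

    // Sums the values for each key. Running this over partial map output and
    // again over the merged output produces the same final totals.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void configure(Job job) {
        // The same class serves as both combiner and reducer here.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
    }
}
```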

It is often a good idea to compress the map output as it is written to disk: writing is faster, disk space is saved, and less data is transferred to the reducers. By default the output is not compressed, but compression is easy to enable by setting mapreduce.map.output.compress to true. The compression library to use is specified by mapreduce.map.output.compress.codec.
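A small sketch of enabling map-output compression is shown below. Snappy is used here as an example codec because it favours speed over compression ratio, which suits intermediate shuffle data; any installed codec class could be substituted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    public static Configuration enable() {
        Configuration conf = new Configuration();
        // Compress intermediate map output before it is spilled and shuffled.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Snappy trades compression ratio for speed, a common choice for shuffle data.
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}
```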

Reducers fetch the partitions of the map output files over HTTP. The number of worker threads used to serve the file partitions is controlled by the mapreduce.shuffle.max.threads property; it is set per node manager, not per map task. The default value of 0 sets the maximum number of threads to twice the number of processors on the machine.

The reduce side

Now for the reduce part of the process. Each map output file sits on the local disk of the machine that ran the map task (note that although map output is written to the map task's local disk, reduce output is not). A reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each one completes. This is the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel; the default is 5 threads, which can be changed by setting the mapreduce.reduce.shuffle.parallelcopies property.

If a map output is fairly small, it is copied into the memory of the reduce task's JVM (the buffer size is controlled by the mapreduce.reduce.shuffle.input.buffer.percent property, which specifies the proportion of the heap to use for this purpose); otherwise it is copied to disk. Once the in-memory buffer reaches a threshold size (determined by mapreduce.reduce.shuffle.merge.percent) or a threshold number of map outputs (controlled by mapreduce.reduce.merge.inmem.threshold), it is merged and spilled to disk. If a combiner is specified, it is run during this merge to reduce the amount of data written to disk.
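For reference, the sketch below collects the reduce-side shuffle properties mentioned in the last two paragraphs in one place. The concrete values are examples for illustration, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleTuning {
    public static Configuration tune() {
        Configuration conf = new Configuration();
        // Number of threads a reduce task uses to fetch map output in parallel.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        // Fraction of the reducer's heap used to buffer fetched map output.
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // Merge and spill to disk once the in-memory buffer is this full...
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        // ...or once this many map outputs have accumulated in memory.
        conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);
        return conf;
    }
}
```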

As the copies accumulate on disk, a background thread merges them into larger, sorted files, which saves time in later merges. Note that to be merged, map output that was compressed (by the map task) must first be decompressed in memory.

Once all the map output has been copied, the reduce task moves into the sort phase (more properly called the merge phase, since the sorting was carried out on the map side), which merges the map outputs while maintaining their sort order. This is done in rounds. For example, with 50 map outputs and a merge factor of 10 (the default, set by the mapreduce.task.io.sort.factor property, just as for the map-side merge), there would be five rounds: each round merges 10 files into one, leaving five intermediate files at the end.

Rather than performing a final round that merges these five files into a single sorted file, the merge saves a trip to and from disk by feeding the reduce function directly in the last phase, the reduce phase. This final merge can draw on a mixture of in-memory and on-disk segments.

The number of files merged per round is actually subtler than this example suggests. The goal is to merge the minimum number of files so that the final round exactly satisfies the merge factor. So with 40 files, the merge does not run four rounds of 10 files each to end up with 4 files. Instead, only 4 files are merged in the first round, and full batches of 10 files are merged in the next three rounds. In the final round, the 4 intermediate files plus the 6 remaining (unmerged) files make exactly 10, as illustrated in the sketch below.
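The following small, self-contained program illustrates this scheduling rule: the first round merges just enough files that every later round can merge a full factor's worth, leaving exactly the merge factor of inputs for the final round that feeds the reduce function. The numbers mirror the 40-file example above; the class name is hypothetical and the simulation is only an illustration of the rule described in the text.

```java
import java.util.ArrayList;
import java.util.List;

public class MergePlan {

    // Number of files merged in the first round: just enough so that the
    // remaining rounds can each merge `factor` files and the final round
    // has exactly `factor` inputs.
    static int firstRoundSize(int numFiles, int factor) {
        if (numFiles <= factor) {
            return numFiles;
        }
        int mod = (numFiles - 1) % (factor - 1);
        return mod == 0 ? factor : mod + 1;
    }

    public static void main(String[] args) {
        int files = 40;
        int factor = 10; // mapreduce.task.io.sort.factor

        List<Integer> rounds = new ArrayList<>();
        int remaining = files;   // original spill files not yet merged
        int merged = 0;          // intermediate files produced so far

        while (remaining + merged > factor) {
            int take = rounds.isEmpty()
                    ? firstRoundSize(files, factor)
                    : Math.min(factor, remaining);
            rounds.add(take);
            remaining -= take;
            merged += 1;
        }

        // Prints: rounds=[4, 10, 10, 10], final merge inputs = 10
        System.out.println("rounds=" + rounds
                + ", final merge inputs = " + (remaining + merged));
    }
}
```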

During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS (although this is configurable). With HDFS, because the node manager also runs a datanode, the first block replica is written to the local disk.
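The sketch below shows this last step from the framework's point of view: reduce(...) is called once per key of the merged, sorted shuffle output, and everything it writes goes to the job's output path, typically on HDFS. The reducer logic, class names, and output path are illustrative only.

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReducePhaseExample {

    public static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Called once for each key, in sorted order; values are that key's group.
            int max = Integer.MIN_VALUE;
            for (IntWritable v : values) {
                max = Math.max(max, v.get());
            }
            context.write(key, new IntWritable(max)); // written to the output filesystem
        }
    }

    public static void configure(Job job) {
        job.setReducerClass(MaxReducer.class);
        // Hypothetical output path; resolves to HDFS in a typical deployment.
        FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));
    }
}
```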

That concludes the discussion of the Shuffle mechanism of MapReduce. Thank you for reading.
