Example Analysis of Mapreduce shuffle

2025-04-07 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shares an example analysis of the MapReduce shuffle. The editor thinks it is very practical, so it is shared here for your reference; follow along for the details.

MapReduce shuffle: a detailed explanation

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort, transferring the map output to the reducers as input, is called the shuffle. In many ways, the shuffle is the heart of MapReduce, the place where the magic happens.

The figure above shows the MapReduce shuffle process in detail.

1 Input splitting

How input data is split depends on the storage format. For files stored on HDFS, there are two cases. Splittable files (uncompressed, or compressed with a splittable codec such as bzip2) are divided into splits of a certain size, by default the size of an HDFS block; the exact algorithm is not detailed here. The earlier article on Hive tuning touched on this, and later articles will return to it.

An example of the split-size calculation: the split size is computed from the block size together with the configured minimum and maximum split sizes.
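As a sketch of that calculation, Hadoop's FileInputFormat computes the split size as the block size clamped between the configured minimum and maximum split sizes (the helper class and values below are illustrative, not Hadoop's actual code):

```java
// Sketch of how the split size is derived:
// splitSize = max(minSize, min(maxSize, blockSize)).
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block
        // With the default min (1 byte) and max (Long.MAX_VALUE),
        // the split size equals the block size, as the article states.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}
```

Lowering the maximum split size below the block size yields more, smaller splits; raising the minimum above it yields fewer, larger ones.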

If a file cannot be split, the entire file becomes a single split.

2 Map side

From the figure above, we can see the processing on the map side. The map task reads its input split, but the map function does not simply write its output straight to disk as it is produced. For efficiency, the process is more involved: the output is buffered in memory and pre-sorted before being written out.

Each map task has a circular memory buffer used to store the task's output. By default the buffer is 100MB, a size that can be tuned by changing io.sort.mb. Once the buffer content reaches a threshold (io.sort.spill.percent, default 0.8), a background thread starts to spill the contents to disk. Map output continues to be written to the buffer while the spill is in progress, but if the buffer fills up during that time, the map blocks until the spill completes.
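The arithmetic behind those two defaults can be sketched in a few lines (the class and method names are illustrative, not Hadoop API):

```java
// Sketch: with io.sort.mb = 100 and io.sort.spill.percent = 0.8,
// the background spill thread starts once the buffer holds 80 MB of data.
public class SpillThresholdDemo {
    static long spillThresholdBytes(int ioSortMb, double spillPercent) {
        return (long) (ioSortMb * 1024L * 1024L * spillPercent);
    }

    public static void main(String[] args) {
        // Defaults described in the article.
        System.out.println(spillThresholdBytes(100, 0.8)); // 83886080 (80 MB)
    }
}
```

Raising io.sort.mb (heap permitting) means fewer, larger spills, which is a common first knob when maps spill many times.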

The spill process writes the buffer contents, in round-robin fashion, to a job-specific subdirectory of the directories specified by mapred.local.dir.

Before writing to disk, the thread first divides the data into partitions corresponding to the reducers the data will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner, it runs on the sorted output. Running the combiner makes the map output more compact, reducing the data written to disk and transferred to the reducers.
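The default partitioning logic is a hash of the key modulo the number of reducers (this standalone sketch mirrors the behavior of Hadoop's default HashPartitioner; the class here is illustrative):

```java
// Sketch of hash partitioning: each map output record is assigned to a
// reducer by hashing its key, so equal keys always land on the same reducer.
public class PartitionDemo {
    static int getPartition(Object key, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer " + getPartition(key, reducers));
        }
    }
}
```

The guarantee that matters for the shuffle is determinism: every map task computes the same partition for the same key.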

Each time the memory buffer reaches the spill threshold, a new spill file is created, so after the map task has written its last output record there may be several spill files. Before the task completes, the spill files are merged into a single partitioned and sorted output file. The configuration property io.sort.factor controls the maximum number of streams merged at once; the default is 10.

If there are at least three spill files (set by the min.num.spills.for.combine property), the combiner runs again before the output file is written to disk. As mentioned earlier, a combiner may be run repeatedly over the input without affecting the final result. If there are only one or two spill files, the potential reduction in map output size is not worth the cost of invoking the combiner, so it is not run again for this map output.
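The reason a combiner can run zero, one, or many times is that it must be associative and commutative over its values; a sum-style combiner is the classic case. A standalone sketch (plain Java, not the Hadoop Combiner API) of that property:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of why a sum-style combiner can run repeatedly without changing
// the result: partial sums of partial sums equal the total sum.
public class CombinerDemo {
    // "Combine": collapse (key, count) records into per-key sums.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> records) {
        return records.stream().collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("hadoop", 1), Map.entry("shuffle", 1),
                Map.entry("hadoop", 1), Map.entry("hadoop", 1));
        Map<String, Integer> once = combine(records);
        // Running the combiner again on its own output changes nothing.
        Map<String, Integer> twice = combine(new ArrayList<>(once.entrySet()));
        System.out.println(once.equals(twice)); // true
        System.out.println(once.get("hadoop")); // 3
    }
}
```

A function that is not associative (for example, a mean over raw values) cannot safely be used as a combiner in this way.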

It is often a good idea to compress the map output as it is written to disk: it writes faster, saves space, and reduces the amount of data transferred to the reducers. By default the output is not compressed, but this feature can be enabled simply by setting mapred.compress.map.output to true. The compression library to use is specified by mapred.map.output.compression.codec.
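Using the old-style property names this article uses, enabling map-output compression might look roughly like this in mapred-site.xml (a sketch; the codec choice is an example, and newer Hadoop releases use renamed properties such as mapreduce.map.output.compress):

```xml
<!-- Sketch: compress intermediate map output (old-style property names). -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

A fast, splittability-irrelevant codec is a reasonable choice here, since intermediate map output is consumed whole by the shuffle rather than re-split.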

Reducers fetch their partitions of the map output files over HTTP. MRv2 uses Netty for this data transfer, and the number of Netty worker threads defaults to twice the number of processors. In MRv1 the default is 40, set by tasktracker.http.threads on each tasktracker.

3 Reduce side

In a cluster, an MR job typically has several map tasks and reduce tasks, and the map tasks finish at different times. The reducers cannot afford to wait until every map task has finished, so as soon as any one map task completes, the reduce tasks begin copying its output. The number of copier threads is set by the mapred.reduce.parallel.copies property, which defaults to 5.

How do the reducers know where to find map output? In MRv2, each map task notifies the application master directly when it finishes, so for a given job the application master knows the mapping between map outputs and hosts. Until it has obtained all the map output, a thread on the reduce side periodically asks the application master for map output locations. Hosts do not delete map output as soon as a reducer has retrieved it, because that reducer might fail; instead, they wait for a delete message from the application master before removing it.

The reduce side also tunes its handling according to the size of each map output. If a map output is small enough, it is copied into the reduce task's JVM memory (the buffer size is controlled by the mapred.job.shuffle.input.buffer.percent property, which specifies the fraction of heap space to use for this purpose); otherwise, it is copied to disk. Once the in-memory buffer reaches a threshold (determined by mapred.job.shuffle.merge.percent), or a threshold number of map outputs accumulates (controlled by mapred.inmem.merge.threshold), the outputs are merged and spilled to disk. If a combiner is specified, it runs during the merge to reduce the amount of data written to disk.
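The memory arithmetic those two percentage properties imply can be sketched as follows (the class is illustrative, and the 0.70 / 0.66 values are assumed defaults for the two properties named above):

```java
// Sketch of the reduce-side shuffle memory arithmetic: a fraction of the
// heap buffers incoming map outputs, and a merge-and-spill starts once a
// fraction of that buffer is occupied.
public class ShuffleBufferDemo {
    static long shuffleBufferBytes(long heapBytes, double inputBufferPercent) {
        return (long) (heapBytes * inputBufferPercent);
    }

    static long mergeTriggerBytes(long bufferBytes, double mergePercent) {
        return (long) (bufferBytes * mergePercent);
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024 * 1024; // a 1 GB reduce task heap
        long buffer = shuffleBufferBytes(heap, 0.70);
        System.out.println("shuffle buffer: " + buffer + " bytes");
        System.out.println("merge starts at: " + mergeTriggerBytes(buffer, 0.66) + " bytes");
    }
}
```

The practical consequence: a reduce task with a small heap spills to disk early, so these percentages are usually tuned together with the task heap size.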

As the copies on disk accumulate, background threads merge them into larger, sorted files, saving time in the later merges. Note that to be merged, compressed map output (compressed by the map task) must first be decompressed in memory.

After all the map output has been copied, the reduce task enters the sort phase (more properly called the merge phase, since the sorting was done on the map side), which merges the map outputs while maintaining their sort order. This is done in rounds. For example, with 50 map outputs and a merge factor of 10 (the default, set by the io.sort.factor property, just as in the map-side merge), there would be five rounds: each round merges ten files into one, leaving five intermediate files at the end.

Rather than running one last round to merge those five files into a single sorted file, the final stage, the reduce phase, feeds the data directly into the reduce function, saving a round trip to disk. The final merge can draw from a mixture of in-memory and on-disk segments.

During the reduce phase, the reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output filesystem, typically HDFS.

Note:

The number of files merged in each round actually differs from the example above. The goal is to merge the minimum amount of data in the early rounds so that the last round exactly meets the merge factor. So with 40 files, the merge does not run four rounds of 10 files each to end up with 4 files. Instead, only 4 files are merged in the first round, and each of the next three rounds merges 10 files. In the last round, the 4 merged files plus the remaining 6 files, 10 in total, are merged directly into the reduce. As shown in the following figure:
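The first-round sizing rule behind that example can be sketched as follows (modeled on Hadoop's merge-pass logic, though the class and method names here are illustrative):

```java
// Sketch: choose the first merge round's size so that every later round,
// including the final one feeding the reduce, uses the full merge factor.
public class MergePlanDemo {
    static int firstRoundFiles(int numFiles, int factor) {
        if (numFiles <= factor || factor == 1) {
            return numFiles; // everything fits in a single round
        }
        // After merging k files, (numFiles - k + 1) segments remain; pick k
        // so the remainder shrinks to exactly `factor` via full-size rounds.
        int mod = (numFiles - 1) % (factor - 1);
        return (mod == 0) ? factor : mod + 1;
    }

    public static void main(String[] args) {
        // The example above: 40 files, merge factor 10 -> first round merges 4.
        System.out.println(firstRoundFiles(40, 10)); // 4
        // 37 segments then remain; three rounds of 10 leave exactly 10
        // (4 merged outputs + 6 untouched files) for the final round.
    }
}
```

With 40 files this gives 4, matching the figure: the small first round minimizes the data rewritten to disk without changing the number of rounds.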

Note that this does not change the number of rounds; it is simply an optimization to minimize the amount of data written to disk, since the last round always feeds directly into the reduce.
