Detailed explanation of MapReduce process


MapReduce originates from a Google paper and draws heavily on the idea of "divide and conquer": a data processing job is split into two steps, Map (mapping) and Reduce (reduction). Put simply, MapReduce is "decompose the task, then summarize the results".

MapReduce (MR) is a disk-based computing framework. The main reasons it is slow:

1) MR is process-level: an MR job launches multiple processes (each map task and each reduce task is a process), and process creation and destruction take considerable time.

2) MapReduce jobs are usually data-intensive: a large volume of intermediate results must be written to disk and transmitted over the network, which consumes a lot of time.

Note: the MapReduce 1.x architecture has two daemon processes:

JobTracker: responsible for resource management, job scheduling, and monitoring the TaskTrackers.

TaskTracker: the executor of tasks; runs the map tasks and reduce tasks.

In 2.x, YARN took their place.

MapReduce workflow:

Files on HDFS -> InputFormat -> Map phase -> Shuffle phase (spanning Mapper and Reducer: after the Mapper outputs data and before the Reducer receives it) -> Reduce phase -> OutputFormat -> HDFS: output.txt

InputFormat interface: splits the input data. The size of an input split generally equals the HDFS block size (128 MB).

Map phase: each map task reads one input split and applies the user-defined map logic to it.

Reduce phase: the reduce function is called once for each key in the sorted output. The number of reduce tasks is set via setNumReduceTasks; the corresponding parameter, mapreduce.job.reduces, defaults to 1. The output of this phase is written directly to the output file system, usually HDFS.
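To make the phases concrete, here is a minimal word-count job sketch in the standard Hadoop style. The class names and the command-line paths are illustrative, not from the original article:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: one map task per input split; emits a (word, 1) pair per token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: called once per key, with all values for that key.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1); // default is 1 (mapreduce.job.reduces)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```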

MapReduce Shuffle: detailed explanation

To guarantee that the input to every reducer is sorted by key, the framework sorts the map output and passes it to the reduce tasks according to certain rules. This process is called shuffle.

Part of the shuffle is done inside the map task, called the Map shuffle; the other part is done inside the reduce task, called the Reduce shuffle.

Map Shuffle stage

When producing output, each map task opens a circular (ring) buffer in memory, 100 MB by default (parameter: mapreduce.task.io.sort.mb). The map's output collector gathers all output kv pairs into this ring buffer.

Ring buffer: essentially an array whose ends meet, split in two so that index (metadata) and data are written at the same time. Once the buffer's content reaches the threshold (0.8 by default, parameter: mapreduce.map.sort.spill.percent), a background thread spills the content to disk. While the spill runs, the map keeps writing output into the buffer (writing in the reverse direction; when the threshold is reached again, the direction flips, and so on), but if the buffer fills up in the meantime, the map blocks until the spill completes. The spill writes the buffer contents, in a round-robin fashion, to a job-specific subdirectory under the directories given by mapred.local.dir; these files are removed when the map task ends.
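The two buffer parameters mentioned above can be set per job. A minimal sketch, with an illustrative class name and tuning values:

```java
import org.apache.hadoop.conf.Configuration;

public class MapBufferTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Enlarge the ring buffer from the 100 MB default to spill less often.
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Start spilling when the buffer is 90% full (default 0.80).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        return conf;
    }
}
```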

Understanding of related concepts:

Combiner: a local reducer. Running a combiner makes the map output more compact, reducing both the data written to disk and the data transferred to the reducers. It can be customized programmatically (if none is defined, there is no default). Suitable scenarios: sums, counts, and other associative "+"-style operations; averages, for example, are not suitable.
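As an illustration, a sum-style reducer such as the hypothetical IntSumReducer from the word-count sketch above can double as the combiner:

```java
// In the job driver: sum-of-counts is associative and commutative,
// so the reducer class can be reused as the combiner.
job.setCombinerClass(IntSumReducer.class);
// Averages are NOT safe: avg(avg(1, 2), 6) = 2.25, but avg(1, 2, 6) = 3.
```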

Partitioner: partitioning. Splits the data into different partitions according to certain rules; the partitioner determines which reduce task handles each record output by a map task. Partitioning rules can be customized programmatically, and in general one partition corresponds to one reduce task. By default, records are partitioned by the hash code of the key. Note: the default number of reduce tasks is 1, in which case partitioning is skipped; the partition operation first checks whether the number of reduce tasks is greater than 1.
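A custom partitioner is a small class. This sketch (the name KeyHashPartitioner is hypothetical) simply mirrors the default hash-of-key behavior:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default behavior: route each key by its hash code,
// masking the sign bit so the partition index is non-negative.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It is registered in the driver with job.setPartitionerClass(KeyHashPartitioner.class) and, as noted above, only takes effect when the number of reduce tasks is greater than 1.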

Spill: each overflow produces a spill file, so after the map task writes its last output record there may be several spill files. Before the map task completes, all spill files are merge-sorted into one partitioned, sorted file. This is a multi-way merge; the maximum number of streams merged at once is 10 by default (parameter: mapreduce.task.io.sort.factor). If a combiner is defined and there are at least three spill files (parameter: mapreduce.map.combine.minspills), the combiner runs again before the output file is written to disk. Once the spill files have been merged, the map deletes all temporary spill files and notifies the application master that the map task has completed.

Map-phase compression: compressing the map output as it is written to disk speeds up the write, saves time, and reduces the amount of data transferred to the reducers. Enable this feature by setting mapreduce.map.output.compress to true (default false); the compression codec is specified by the parameter mapreduce.map.output.compress.codec. Note: here it is advisable to prefer a faster compression codec.
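A sketch of enabling this in the driver configuration. Snappy is shown as one common fast codec; the class name is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Compress map output before it is spilled and shuffled (default: false).
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Snappy favors speed over ratio, fitting the "faster codec" advice above.
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        return conf;
    }
}
```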

Reduce Shuffle stage

Each reducer fetches its partition of the map output files over HTTP. Netty is used for the data transfer (as the RPC transport), and the number of netty worker threads defaults to twice the number of processors. One reduce task corresponds to one partition.

Until the reduce side has obtained all map output, a thread on the reduce side periodically asks the application master for map output locations, since the application master knows which hosts hold which map outputs. The hosts do not delete their map output as soon as a reducer has fetched it, because the reduce may fail; deletion waits for the delete message from the application master.

When the fraction of completed map tasks reaches 0.05 of the total (parameter: mapreduce.job.reduce.slowstart.completedmaps), the reduce tasks begin copying their output. The copy phase copies map output into the reducer's memory or onto its disk; the number of copier threads is set by mapreduce.reduce.shuffle.parallelcopies (default 5).

If a map output is small enough, it is copied into the reduce task's JVM memory (the buffer size is controlled by mapreduce.reduce.shuffle.input.buffer.percent, the percentage of heap space reserved for this purpose, default 0.70); if the buffer space is insufficient, the map output is copied to disk. Once the in-memory buffer reaches its threshold (parameter: mapreduce.reduce.shuffle.merge.percent, default 0.66) or the map-output count threshold (parameter: mapreduce.reduce.merge.inmem.threshold, default 1000), the outputs are merged and spilled to disk. If a combiner is specified, running it during the merge reduces the amount of data written to disk. As copies accumulate on disk, a background thread merges them into larger, sorted files. Note: in order to be merged, compressed map output must be decompressed in memory.
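The copy and merge parameters from the last few paragraphs can likewise be set per job. A sketch with illustrative values (class name hypothetical):

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceShuffleTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Start copying once 50% of maps have finished (default 0.05).
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.50f);
        // Use 10 copier threads instead of the default 5.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
        // Heap fraction for buffering copied map output (default 0.70).
        conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
        // Merge and spill when the buffer is 66% full (default 0.66)...
        conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);
        // ...or when 1000 map outputs have accumulated (default 1000).
        conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);
        return conf;
    }
}
```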

After all map output has been copied, the reduce task enters the merge-sort phase, which merges the map outputs while maintaining their sort order. This proceeds in rounds; the goal is to merge the minimum amount of data in the first round so that the final round exactly meets the merge factor (parameter: mapreduce.task.io.sort.factor, default 10).

So with 40 files (on disk and in memory), the merge does not run four rounds of 10 files each to produce 4 files and then merge those 4 into the reduce. Instead, the first round merges only 4 files, and each of the next three rounds merges 10. In the final round, the 4 merged files plus the remaining 6 unmerged files, 10 in total, are merged directly into the reduce.

This does not change the number of rounds; it is merely an optimization to minimize the amount of data written to disk, because the final round always merges directly into the reduce with no disk round trip.
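A small sketch of that arithmetic. It assumes the first-round size formula commonly cited for Hadoop's merger, (n - 1) mod (factor - 1) + 1; this illustrates the round planning and is not Hadoop's actual Merger code:

```java
public class MergePlan {
    // Simulate the merge rounds for n files and a given merge factor.
    public static void plan(int n, int factor) {
        if (n > factor) {
            // First round merges just enough files so that every later round
            // can merge exactly `factor`, and the final round has exactly
            // `factor` inputs feeding straight into the reduce.
            int first = (n - 1) % (factor - 1) + 1;
            System.out.println("round 1: merge " + first + " files");
            n = n - first + 1;
            while (n > factor) {
                System.out.println("round: merge " + factor + " files");
                n = n - factor + 1;
            }
        }
        System.out.println("final round: feed " + n + " files into the reduce");
    }

    public static void main(String[] args) {
        plan(40, 10);
    }
}
```

For plan(40, 10) this prints one round of 4 files, three rounds of 10, and a final round feeding 10 files into the reduce, matching the walkthrough above.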

At this point, the Shuffle phase is over.

Shuffle summary

1) The map task collects the kv pairs emitted by the map() method and puts them into the in-memory ring buffer.

2) The buffer's contents are continually partitioned, sorted, optionally combined, and spilled to local disk as spill files.

3) The multiple spill files are merge-sorted into one large file.

4) Each reduce task fetches, from each map task's machine, the partition of the output matching its own partition number.

5) The reduce task gathers the files for the same partition from the different map tasks, then merges and sorts them.

6) Once they are merged into one large file, the shuffle process ends.

MapReduce tuning

Input phase: deal with the small-file problem (for example, by merging small files beforehand or using an input format that combines them, such as CombineTextInputFormat).

Map phase:

1) Reduce the number of spills.

2) Reduce the number of merges.

3) Add a combiner where it does not affect the business logic.

4) Enable compression.

Reduce phase:

1) Set the numbers of map and reduce tasks reasonably.

2) Configure map/reduce coexistence reasonably (how early reduce tasks may start while maps are still running).

3) Avoid reduce where possible: using a reduce to join datasets incurs heavy network cost.

4) Set the reduce-side buffer reasonably: by default, when the data reaches a threshold, the buffered data is written to disk, and the reduce then reads all of it back from disk. In other words, the buffer and the reduce are not directly connected; there is a write-disk-then-read-disk cycle in between. Given this drawback, a parameter can be configured so that part of the data in the buffer is delivered directly to the reduce, cutting the IO overhead (see the sketch below).
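The parameter in question is mapreduce.reduce.input.buffer.percent. A minimal sketch, with an illustrative class name and value:

```java
import org.apache.hadoop.conf.Configuration;

public class ReduceInputBufferTuning {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Keep map outputs occupying up to 50% of the reducer's heap in memory
        // during the reduce itself, skipping the disk round trip (default 0.0,
        // i.e., everything is spilled to disk before the reduce begins).
        conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.50f);
        return conf;
    }
}
```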
