
What is the working mechanism of MapReduce


This article explains the working mechanism of MapReduce. The material is simple, fast, and practical, so interested readers may wish to follow along as we walk through how MapReduce works.

The Running Mechanism of a MapReduce Job

[Figures omitted: a static diagram and an animated diagram of the MapReduce job run.]

Progress and status updates

Failures

Failures can occur at several levels:

Task failure: user code in a map or reduce task throws a runtime exception.

Task JVM failure: the JVM exits abruptly.

Node manager failure.

Resource manager failure.

For high availability (HA), a pair of resource managers must be run in an active-standby (dual hot standby) configuration.

Shuffle and sorting

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as input is known as the shuffle.

The Map Side

When the map function starts to produce output, it is not simply written to disk. The process is more involved: for efficiency, the output is buffered in memory and pre-sorted.

Each map task has a circular memory buffer in which it stores its output. By default the buffer is 100 MB, a size that can be tuned with the mapreduce.task.io.sort.mb property. As soon as the buffer contents reach the spill threshold (mapreduce.map.sort.spill.percent, default 0.80, or 80%), a background thread starts to spill the contents to disk. Map output continues to be written to the buffer while the spill takes place, but if the buffer fills up in the meantime, the map blocks until the spill is complete. Spills are written, in round-robin fashion, to the directories specified by the mapreduce.cluster.local.dir property, in a job-specific subdirectory.
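As a concrete illustration, here is a minimal sketch of adjusting the spill buffer through Hadoop's standard Configuration API; the class name and the values are illustrative examples, not recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Grow the circular in-memory buffer beyond the 100 MB default
            // so that large map outputs spill less often (example value).
            conf.setInt("mapreduce.task.io.sort.mb", 200);
            // Start spilling when the buffer is 90% full instead of 80%.
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
            Job job = Job.getInstance(conf, "spill-tuning-sketch");
            // ... set mapper, reducer, and input/output paths as usual ...
        }
    }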

Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers the data will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes the map output more compact, reducing both the data written to disk and the data transferred to the reducers.
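For reference, the default partitioning step behaves like the sketch below, which mirrors the logic of Hadoop's default HashPartitioner (the class name and key/value types here are illustrative). A combiner is attached with job.setCombinerClass(...), typically passing the reducer class itself when the reduce function is commutative and associative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Mirrors the logic of Hadoop's default HashPartitioner: the target
    // reducer (partition) is derived from the key's hash code.
    public class SketchPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask the sign bit so the result is non-negative, then take
            // the remainder modulo the number of reducers.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }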

Each time the memory buffer reaches the spill threshold, a new spill file is created, so by the time the map task has written its last output record there may be several spill files. Before the task completes, the spill files are merged into a single, partitioned and sorted output file. The configuration property mapreduce.task.io.sort.factor controls the maximum number of streams that can be merged at once; the default is 10.

If there are at least three spill files (set by the mapreduce.map.combine.minspills property), the combiner runs again before the output file is written. As mentioned earlier, a combiner may be run repeatedly over the input without affecting the final result. If there are only one or two spill files, the potential reduction in map output size is not worth the overhead of invoking the combiner, so it is not run again for that map output.
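Continuing the hypothetical conf object from the earlier sketch, the merge and combiner thresholds just described are set like this (example values):

    // Merge up to 25 spill streams per round instead of the default 10.
    conf.setInt("mapreduce.task.io.sort.factor", 25);
    // Re-run the combiner during the merge only when there are at least
    // this many spill files (3 is also the default).
    conf.setInt("mapreduce.map.combine.minspills", 3);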

It is often a good idea to compress the map output as it is written to disk: writing is faster, disk space is saved, and less data is transferred to the reducers. By default the output is not compressed, but compression is easily enabled by setting mapreduce.map.output.compress to true. The compression codec to use is specified by mapreduce.map.output.compress.codec.
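A minimal sketch of enabling map-output compression via the same Configuration API; Snappy is shown as one common codec choice, not a requirement, and the class name is illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class MapOutputCompressionSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Compress intermediate map output on its way to disk.
            conf.setBoolean("mapreduce.map.output.compress", true);
            // Snappy trades compression ratio for speed, which suits
            // short-lived intermediate data.
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);
        }
    }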

Reducers fetch the partitions of the output file over HTTP. The maximum number of worker threads used to serve the file partitions is controlled by the mapreduce.shuffle.max.threads property, which is set per node manager, not per map task. The default of 0 sets the maximum number of threads to twice the number of processors on the machine.

Some general tuning principles (illustrated in the sketch after this list):

Give the shuffle process as much memory as possible.

Write map and reduce functions that use as little memory as possible; they should not consume memory without bound.

On the map side, the best performance is obtained by avoiding multiple spills to disk.

On the reduce side, the best performance is obtained when all of the intermediate data can reside in memory.
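As an illustration of these principles, here is a sketch touching the main memory-related shuffle properties; the values are examples only, and the class name is illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class ShuffleMemorySketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Map side: a larger sort buffer means fewer spills (example value).
            conf.setInt("mapreduce.task.io.sort.mb", 200);
            // Reduce side: fraction of the reduce task's heap used to buffer
            // map outputs during the copy phase (0.70 is the default).
            conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f);
            // Fraction of the heap that may retain map outputs while the
            // reduce function runs; raising it above the 0.0 default keeps
            // intermediate data in memory instead of spilling it to disk.
            conf.setFloat("mapreduce.reduce.input.buffer.percent", 0.80f);
        }
    }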

The Reduce Side

Configuration tuning

Execution of tasks

Speculative execution: in a parallel job, Hadoop does not try to diagnose or fix slow-running tasks. Instead, when it detects that a task is running more slowly than expected, it launches another, equivalent task as a backup. This is what is meant by "speculative execution".
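Speculative execution is enabled by default and can be toggled per task type; a minimal sketch with example settings (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class SpeculationSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Both properties default to true; disabling reduce-side
            // speculation is common when duplicate reducers are expensive.
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", false);
        }
    }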

OutputCommitters

Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed completely or fail cleanly. This behavior is implemented by the OutputCommitter in use for the job.
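To make the commit protocol concrete, here is a minimal skeleton of the OutputCommitter lifecycle hooks (a sketch, not a production committer; the class name is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class SketchCommitter extends OutputCommitter {
        @Override
        public void setupJob(JobContext context) throws IOException {
            // Runs once before the job, e.g., to create the output directory.
        }
        @Override
        public void setupTask(TaskAttemptContext context) throws IOException {
            // Runs before each task attempt, e.g., to create a temp workspace.
        }
        @Override
        public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
            return true; // Whether this attempt produced output to commit.
        }
        @Override
        public void commitTask(TaskAttemptContext context) throws IOException {
            // Promote the attempt's output; only one attempt per task commits.
        }
        @Override
        public void abortTask(TaskAttemptContext context) throws IOException {
            // Clean up after a failed or killed attempt.
        }
    }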

At this point, you should have a deeper understanding of the working mechanism of MapReduce; the best way to consolidate it is to try things out in practice. Follow us to continue learning more related content.
