What is the process of MapTask and ReduceTask

2025-07-19 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31 report

This article looks at how MapTask and ReduceTask work in Hadoop MapReduce. The walkthrough is short and practical: it follows the map output from the in-memory buffer to disk, and then through the stages of the reduce side.

map -> reduce

The process between map and reduce is called the shuffle, as described in the official diagram. (That description is not very accurate.)

MapTask

Each map task has a circular in-memory buffer that stores the task's output. The default size is 100 MB and can be changed through MRJobConfig.IO_SORT_MB.

Once the buffer fills to the threshold set by MRJobConfig.MAP_SORT_SPILL_PERCENT (default 0.8), a background thread spills its contents to disk, writing them under the directory specified by MRJobConfig.JOB_LOCAL_DIR.
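
To make the arithmetic concrete, here is a minimal sketch of how the spill point follows from the two settings. The class and method names are hypothetical, and the calculation is a simplification: it treats the whole buffer as record data and ignores the record metadata that shares the buffer in the real MapTask.

```java
// Hypothetical helper, not part of Hadoop: estimates the spill point in bytes.
final class SpillThresholdSketch {

    // sortMb corresponds to MRJobConfig.IO_SORT_MB, spillPercent to
    // MRJobConfig.MAP_SORT_SPILL_PERCENT (both passed in explicitly here).
    static long spillThresholdBytes(int sortMb, double spillPercent) {
        long bufferBytes = (long) sortMb << 20; // megabytes to bytes
        return (long) (bufferBytes * spillPercent);
    }

    public static void main(String[] args) {
        // With the defaults, spilling starts at roughly 80 MB of buffered output.
        System.out.println(spillThresholdBytes(100, 0.8)); // prints 83886080
    }
}
```

The remaining 20% of the buffer lets the map function keep writing output while the background thread spills.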

Looking up MRJobConfig.JOB_LOCAL_DIR (in the org.apache.hadoop.mapreduce package) shows that its value is mapreduce.job.local.dir. Open mapred-default.xml (in hadoop-mapreduce-client-core-2.7.1.jar), search for local.dir, and you get the following configuration:

<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.</description>
</property>

Next, search for hadoop.tmp.dir in core-default.xml (in hadoop-common-2.7.1.jar):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

So the temporary path used for spills is /tmp/hadoop-${user.name}/mapred/local.
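
The lookup above is a chain of ${...} substitutions. The sketch below shows how such a chain can be resolved; it is an illustrative model and not Hadoop's Configuration code. The class name, the resolve method, and the user name "alice" are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical resolver that expands ${name} references from a property map.
final class LocalDirResolverSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    static String resolve(String value, Map<String, String> props) {
        String current = value;
        // Bounded loop so a self-referencing property cannot spin forever.
        for (int depth = 0; depth < 20; depth++) {
            Matcher m = VAR.matcher(current);
            if (!m.find()) {
                return current; // nothing left to expand
            }
            String replacement = props.get(m.group(1));
            if (replacement == null) {
                return current; // unknown variable: leave the text as-is
            }
            current = current.substring(0, m.start()) + replacement
                    + current.substring(m.end());
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("hadoop.tmp.dir", "/tmp/hadoop-${user.name}");
        props.put("user.name", "alice"); // assumed user name for the example
        System.out.println(resolve("${hadoop.tmp.dir}/mapred/local", props));
    }
}
```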

Before each spill, the data is first partitioned, and each partition is sorted. If a combiner is configured, it runs on the sorted output.
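
The partition, sort, combine order can be sketched with plain collections. This is an in-memory illustration under assumptions: the class name is hypothetical, keys are Java strings (Hadoop's Text hashes differently), the partition formula mirrors a hash partitioner, and the combiner simply sums values per key as in a word count.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of what happens to buffered records before a spill.
final class SpillSketch {

    // Hash partitioning: mask the sign bit, then take the remainder.
    static int partitionOf(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    static List<List<Map.Entry<String, Integer>>> spill(
            List<Map.Entry<String, Integer>> buffer, int numPartitions) {
        // Step 1: assign every record to a partition.
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> rec : buffer) {
            partitions.get(partitionOf(rec.getKey(), numPartitions)).add(rec);
        }
        List<List<Map.Entry<String, Integer>>> spilled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> part : partitions) {
            // Step 2: sort the partition by key.
            part.sort(Map.Entry.comparingByKey());
            // Step 3: combine runs of equal keys (a summing combiner).
            List<Map.Entry<String, Integer>> combined = new ArrayList<>();
            for (Map.Entry<String, Integer> rec : part) {
                int last = combined.size() - 1;
                if (last >= 0 && combined.get(last).getKey().equals(rec.getKey())) {
                    int sum = combined.get(last).getValue() + rec.getValue();
                    combined.set(last, new SimpleEntry<>(rec.getKey(), sum));
                } else {
                    combined.add(new SimpleEntry<>(rec.getKey(), rec.getValue()));
                }
            }
            spilled.add(combined);
        }
        return spilled;
    }
}
```

Because sorting comes first, the combiner only has to compare each record with the previous one, which is why the combine step runs after the sort rather than before it.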

If there are at least three spill files (JobContext.MAP_COMBINE_MIN_SPILLS, default 3), the combiner runs again when the spill files are merged.

The relevant source code is in MapTask.MapOutputBuffer:

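if (combinerRunner == null || numSpills < minSpillsForCombine) {
  // No combiner, or too few spill files: write the merged records directly.
  Merger.writeFile(kvIter, writer, reporter, job);
} else {
  // Enough spill files: run the combiner over the merged records.
  combineCollector.setWriter(writer);
  combinerRunner.combine(kvIter, combineCollector);
}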
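
Read as a predicate, the condition in that snippet says the combiner runs during the merge only when one is configured and the spill count has reached the minimum. A small sketch, with a hypothetical class and method name, makes the boundary case explicit:

```java
// Hypothetical restatement of the merge-time condition, not Hadoop code.
final class CombineDecisionSketch {

    static boolean runsCombinerOnMerge(boolean hasCombiner, int numSpills,
                                       int minSpillsForCombine) {
        // Negation of: combinerRunner == null || numSpills < minSpillsForCombine
        return hasCombiner && numSpills >= minSpillsForCombine;
    }

    public static void main(String[] args) {
        System.out.println(runsCombinerOnMerge(true, 2, 3)); // prints false
        System.out.println(runsCombinerOnMerge(true, 3, 3)); // prints true
    }
}
```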

Note: when the map output is spilled to disk, you can enable compression to save disk and network I/O.

Set MRJobConfig.MAP_OUTPUT_COMPRESS to true and MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC to the codec class name.

For example:

// conf is an org.apache.hadoop.conf.Configuration instance
conf.set(MRJobConfig.MAP_OUTPUT_COMPRESS, "true");
conf.set(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC, "org.apache.hadoop.io.compress.DefaultCodec");
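
To get a feel for why compressing map output pays off, the sketch below compresses a repetitive, word-count style payload with the JDK's own DEFLATE implementation. It is only an illustration: it uses java.util.zip directly rather than Hadoop's codec classes, and the class name and sample data are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo: DEFLATE applied to map-output-like bytes.
final class MapOutputCompressionSketch {

    // Builds "hadoop<TAB>1" lines, similar to word-count map output.
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("hadoop\t1\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input: stop instead of looping
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(1000);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes raw, " + packed.length + " bytes compressed");
    }
}
```

The saving depends heavily on the data; highly repetitive keys compress well, while already-compressed values gain little and still cost CPU time.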

ReduceTask

ReduceTask reads data from each MapTask. Its processing is generally divided into five stages.

Shuffle

ReduceTask copies data remotely from each MapTask. The copied data is held in memory and written to disk once it exceeds a threshold.

Merge

ReduceTask starts two threads that merge the data held in memory and the data on disk.

Sort

The results from all MapTasks are merged and sorted.

Reduce

The user-defined reduce function runs.

Write

The reduce result is written to HDFS.
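
The merge, sort, and reduce stages above can be sketched in memory as a k-way merge over already-sorted map outputs followed by a per-key reduction. This is a simplified model under assumptions: the class and method names are hypothetical, the remote copy and the memory/disk split are left out, and the reduce function just sums values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical in-memory model of the reduce side.
final class ReduceSideSketch {

    // Tracks the read position inside one sorted map output.
    private static final class Cursor {
        final List<Map.Entry<String, Integer>> run;
        int pos;

        Cursor(List<Map.Entry<String, Integer>> run) {
            this.run = run;
        }

        Map.Entry<String, Integer> head() {
            return run.get(pos);
        }
    }

    // Merge + sort: repeatedly take the smallest head among all runs.
    static List<Map.Entry<String, Integer>> mergeSorted(
            List<List<Map.Entry<String, Integer>>> runs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                (x, y) -> x.head().getKey().compareTo(y.head().getKey()));
        for (List<Map.Entry<String, Integer>> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c); // re-insert with its next record as the new head
            }
        }
        return merged;
    }

    // Reduce: sum the values of each key, keeping the sorted key order.
    static Map<String, Integer> reduceSum(List<Map.Entry<String, Integer>> sorted) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> rec : sorted) {
            result.merge(rec.getKey(), rec.getValue(), Integer::sum);
        }
        return result;
    }
}
```

Since every map output arrives sorted, the reduce side only needs to merge the runs; it never has to sort the full data set from scratch.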
