This article introduces the working process of MapReduce in Hadoop. The following sections walk through each stage of a job, from input to the final output, and point out the configuration items that affect each step. I hope you find it useful.
1. From input to output
A MapReduce job goes through five stages: input, map, combine, reduce, and output. The combine stage does not always occur. The process of partitioning the intermediate results of the map output and transferring them to the reduce tasks is called shuffle. Copying (replication) and sorting also take place during the shuffle phase.
In MapReduce, a job is divided into two computing phases, Map and Reduce, each consisting of one or more Map tasks or Reduce tasks. Viewed from the direction of data flow, a MapReduce job can therefore be divided into Map tasks and Reduce tasks. When a user submits a MapReduce job to Hadoop, the JobTracker assigns appropriate tasks to each TaskTracker based on the heartbeat information the TaskTrackers send periodically, taking into account factors such as each TaskTracker's spare resources, the job's priority, and the job's submission time. By default, Reduce tasks do not start until 5% of the Map tasks have completed.
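For orientation, a job of this shape is normally configured and submitted by a small driver program. The sketch below uses the org.apache.hadoop.mapreduce API (Job.getInstance appears in newer releases; older ones use new Job(conf, ...)) with hypothetical WordCountMapper and WordCountReducer classes and placeholder input/output paths; it only shows where the pieces named above plug in.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // create and name the job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);               // hypothetical Mapper class
        job.setCombinerClass(WordCountReducer.class);            // optional combine stage
        job.setReducerClass(WordCountReducer.class);             // hypothetical Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait for completion
    }
}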
The execution of a Map task can be summarized as follows. First, the input file is split and parsed into key-value pairs by the getSplits method of the user-specified InputFormat class and the next method of its record reader, and these pairs are fed to the map function. The map function then passes its intermediate results to the specified Partitioner, which ensures that each intermediate result is routed to the appropriate Reduce task. If the user has specified a Combiner, the combine operation is performed. Finally, the intermediate results are saved to the local disk.
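A minimal map function, written against the org.apache.hadoop.mapreduce API, might look like the following word-count mapper (the class name and the tokenizing logic are illustrative, not from the original article):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each call receives one key-value pair parsed from the InputSplit:
        // here the byte offset of a line and the line text itself.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate (word, 1) pair
        }
    }
}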
The execution of a Reduce task can be summarized as follows. First, the intermediate results of the completed Map tasks are copied to the node where the Reduce task runs. Once the copying is complete, the data is sorted by key, so that all values with the same key are grouped together and handed to the reduce function. After processing, the results are written directly to HDFS.
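A matching reduce function for the mapper above could be sketched as follows (again, the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // After shuffle and sort, all values sharing the same key arrive together.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // final result, written to HDFS
    }
}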
2. Input
When a file on HDFS is used as the input of a MapReduce job, the MapReduce framework first uses FileInputFormat, a subclass of org.apache.hadoop.mapreduce.InputFormat, to split the input file into InputSplits. Each InputSplit is fed to one Map task and parsed into key-value pairs. The size and number of InputSplits have a significant impact on the performance of a MapReduce job.
An InputSplit only divides the input data logically; it does not physically slice the file stored on disk. An InputSplit records only the metadata of the shard, such as its starting position, its length, and the list of nodes holding it. The split-size calculation determines the number of InputSplits. For files on HDFS, the FileInputFormat class uses the computeSplitSize method to calculate the size of each InputSplit, as sketched below.
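The core of that calculation, as it appears in Hadoop's FileInputFormat (the exact signature varies slightly between Hadoop versions), is essentially:

protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    // Take the block size, but never less than the configured minimum
    // split size and never more than the configured maximum split size.
    return Math.max(minSize, Math.min(maxSize, blockSize));
}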
minSize is determined by the mapred.min.split.size configuration item in mapred-site.xml and defaults to 1; maxSize is determined by the mapred.max.split.size configuration item in mapred-site.xml and defaults to 9,223,372,036,854,775,807 (Long.MAX_VALUE); blockSize is determined by the dfs.block.size configuration item in hdfs-site.xml and defaults to 67,108,864 bytes (64 MB). The size of an InputSplit is therefore:

InputSplit size = max(minSize, min(maxSize, blockSize))

With the default settings this equals the HDFS block size, 64 MB.
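Split sizes can also be adjusted per job rather than cluster-wide. A minimal sketch using the static helpers of the new-API FileInputFormat (the values are illustrative only; newer releases expose the same settings as mapreduce.input.fileinputformat.split.minsize and .maxsize):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "tuned splits");
        // Force each InputSplit to be at least 128 MB, reducing the number of Map tasks.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        // Cap each InputSplit at 256 MB.
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        return job;
    }
}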
3. Output of Map intermediate results

The intermediate results produced by the map function can be compressed before they are written. The compression formats commonly used in Hadoop are listed below.

Compression format | Tool  | Algorithm | File extension | Splittable
DEFLATE*           | N/A   | DEFLATE   | .deflate       | No
Gzip               | gzip  | DEFLATE   | .gz            | No
bzip2              | bzip2 | bzip2     | .bz2           | Yes
LZO                | lzop  | LZO       | .lzo           | No
The intermediate results of the map output are stored in IFile, a storage format that supports compression with the algorithms listed above.
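To enable compression of the map output with one of these codecs, the job configuration can be set roughly as follows (property names from the Hadoop 1.x generation described in this article; newer releases use mapreduce.map.output.compress and mapreduce.map.output.compress.codec):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class MapOutputCompression {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.compress.map.output", true);     // compress map intermediate output
        conf.setClass("mapred.map.output.compression.codec",
                      GzipCodec.class, CompressionCodec.class);  // use the gzip (DEFLATE) codec
        return conf;
    }
}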
Reducers fetch their partitions of the map output files over HTTP. The number of worker threads used to serve the map output to the Reducers is determined by the tasktracker.http.threads configuration item in mapred-site.xml. This setting is per node, not per Map task; the default is 40, and it can be increased according to the job size, the cluster size, and the computing power of the nodes.
4. Shuffle
Shuffle is the data-shuffling step; in some contexts the term refers to the whole process from the map function producing output to the reduce function receiving its input.
4.1 The copy phase
The output of a Map task is written to the local disk of the node where that Map task's TaskTracker runs, but the TaskTracker running the Reduce task needs these partition files (the map output). A Reduce task may need the output of many Map tasks as its particular partition, and the Map tasks may finish at different times; as soon as one Map task completes, the Reduce task starts copying its output. This is the copy phase of shuffle. The Reduce task uses a small number of copier threads so that it can fetch Map outputs in parallel; the default is 5 threads, and it can be changed through the mapred.reduce.parallel.copies configuration item in mapred-site.xml.
If a map output is small enough, it is copied into a memory buffer on the TaskTracker where the Reduce task runs; the size of this buffer is controlled by the mapred.job.shuffle.input.buffer.percent configuration item in mapred-site.xml. Otherwise, the map output is copied to disk. Once the memory buffer reaches its threshold size (set by the mapred.job.shuffle.merge.percent configuration item in mapred-site.xml), or the number of map outputs held in memory reaches its threshold (set by the mapred.inmem.merge.threshold configuration item in mapred-site.xml), the buffered data is merged and spilled to disk.
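The copy-phase knobs mentioned above can also be set per job. The following sketch shows plausible values only, not recommendations:

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("mapred.reduce.parallel.copies", 10);                 // copier threads per Reduce task (default 5)
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);  // share of reduce-task heap used for the copy buffer
        conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f);         // buffer fill level that triggers a merge/spill
        conf.setInt("mapred.inmem.merge.threshold", 1000);                // number of in-memory map outputs that triggers a merge/spill
        return conf;
    }
}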
4.2 The sort phase
As more files are spilled to disk, shuffle enters the sort phase. This phase merges the map output files while preserving their ordering; in effect it is a merge sort. The merging proceeds in rounds. If there are 50 map output files and the merge factor (set by the io.sort.factor configuration item in mapred-site.xml, default 10) is 10, then 5 merge operations are performed, each combining 10 files into one, leaving 5 files. Because these 5 files no longer meet the merge condition (the number of files is not greater than the merge factor), they are not merged again but are handed directly to the reduce function. At this point the shuffle phase is complete.
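A rough illustration of that merge arithmetic, assuming the simplified model described above in which each merge combines at most io.sort.factor files (the real merger plans its passes more carefully):

public class MergeRounds {
    public static void main(String[] args) {
        int files = 50;      // map output files to merge
        int factor = 10;     // io.sort.factor
        int merges = 0;
        while (files > factor) {
            // Merge the files in groups of 'factor'; each group becomes one file.
            int groups = (files + factor - 1) / factor;
            merges += groups;
            files = groups;
        }
        // Prints: 5 merges, 5 files left for the reduce function
        System.out.println(merges + " merges, " + files + " files left for the reduce function");
    }
}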
As can be seen from the shuffle process, a Map task processes a single InputSplit, while a Reduce task processes the intermediate results of all Map tasks for the same partition.
5. Reduce and output of the final result
In essence, the reduce stage calls the reduce function on the files produced by shuffle. Thanks to shuffle, these files are already partitioned by key and sorted, and the files belonging to the same partition are handed to the reduce function in one pass. Unlike the intermediate results of map, the output of reduce is normally written to HDFS.
6. Sort
Sorting runs through both the Map tasks and the Reduce tasks; it is the default behavior of the MapReduce framework, whether or not the application logic needs it. The framework mainly uses two sorting algorithms: quicksort and merge sort. In total, three sorting operations take place during the Map and Reduce tasks:
(1) When the map function produces output, it first writes into an in-memory ring buffer. When the buffer reaches its configured threshold, a background thread divides the buffered data into the appropriate partitions and, within each partition, sorts the data by key before spilling it to disk.
(2) Before the Map task finishes, there are several spill files on disk, each already partitioned and sorted and roughly the size of the buffer. These spill files are then merged into a single partitioned and sorted output file. Since each spill file is already sorted, only one more merge pass is needed to make the final output file ordered as a whole.
(3) In the shuffle phase, the output files of multiple Map tasks must be merged on the reduce side. Since each of those files was already sorted in step (2), again only one merge pass is needed to produce an overall ordered input for reduce.
The first of these three sorts happens in the memory buffer and uses quicksort; the second and third happen during file merging and use merge sort. All of them order records by the intermediate key, as sketched below.
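The key ordering used by all three sorts comes from the key type itself: map output keys must implement WritableComparable (or have a RawComparator registered for them). A hypothetical composite key, not from the original article, might look like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearTemperatureKey implements WritableComparable<YearTemperatureKey> {
    private int year;
    private int temperature;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTemperatureKey other) {
        // This ordering is what the quicksort in the buffer and the
        // merge sorts on disk use when arranging map output records.
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : Integer.compare(temperature, other.temperature);
    }
}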
7. How job progress is composed
When a MapReduce job runs on Hadoop, the client console periodically prints the job's progress, in the form of lines such as map 60% reduce 0%.
For a large MapReduce job the execution time can be long, so it is important to track the job's status and progress through this log. For the Map side, progress is the percentage of the input actually processed; for example, map 60% reduce 0% means the Map tasks have processed 60% of the job's input while the Reduce tasks have not yet started. Reduce progress is more complicated. As described earlier, the reduce stage consists of copy, sort, and reduce, and these three steps together make up the reduce progress, each accounting for 1/3. If the reduce function has processed 2/3 of its input, the overall reduce progress is 1/3 + 1/3 + 1/3 × (2/3) = 8/9, because copy and sort are already complete by the time the reduce function starts processing.
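The same bookkeeping can be expressed as a tiny calculation (weights follow the 1/3 split described above; purely illustrative):

public class ReduceProgress {
    public static void main(String[] args) {
        double copyDone = 1.0;         // copy phase fraction complete
        double sortDone = 1.0;         // sort phase fraction complete
        double reduceDone = 2.0 / 3.0; // fraction of input the reduce function has processed
        // Each of the three steps contributes one third of the overall reduce progress.
        double progress = (copyDone + sortDone + reduceDone) / 3.0;
        System.out.println(progress);  // 0.888..., i.e. 8/9
    }
}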
This concludes the overview of the working process of MapReduce in Hadoop. Thank you for reading.