The MapReduce shuffle process in Hadoop is very important; only by being familiar with the whole flow can you understand how a job really executes.
MapReduce execution process
1) Input and split:
This is not the map or reduce processing itself, but it does take up part of the total job time; its role is to prepare the data for the actual map phase.
Split operation:
The split step logically divides the source file into a series of InputSplits. Each InputSplit records metadata about its shard (for example, the file block it belongs to, the starting offset, the data length, and the list of nodes holding the data); the source file is not physically cut into smaller files. Each InputSplit is subsequently processed by one mapper.
The size of each shard is governed by an important parameter: splitSize, which drives the splitting rule. It is determined by three values:
minSize: the minimum splitSize, set by the mapred.min.split.size parameter (mapreduce.input.fileinputformat.split.minsize in newer releases) in mapred-site.xml or the job configuration.
maxSize: the maximum splitSize, set by the mapred.max.split.size parameter (mapreduce.input.fileinputformat.split.maxsize in newer releases).
blockSize: the block size of file storage in HDFS, set by the dfs.block.size parameter in the hdfs-site.xml configuration file.
Rule for determining splitSize: splitSize = max(minSize, min(maxSize, blockSize))
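As a minimal sketch of this rule (the class name and the numbers below are illustrative, not taken from the Hadoop source), the calculation can be written as:

// Illustrative sketch of the splitSize rule; class name and values are hypothetical.
public class SplitSizeDemo {
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        // splitSize = max(minSize, min(maxSize, blockSize))
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long minSize = 1L;                   // assumed minimum split size
        long maxSize = Long.MAX_VALUE;       // assumed "effectively unlimited" maximum
        long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block, for example
        // With these values the split size equals the block size (134217728 bytes).
        System.out.println(computeSplitSize(minSize, maxSize, blockSize));
    }
}

With this combination of values, each split covers exactly one HDFS block, which is the common case.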
Data format (Format) operation:
The format step turns each InputSplit into key-value pairs, where the key is the byte offset of a line within the file and the value is the content of that line (the default TextInputFormat behaviour).
It is worth noting that formatting happens continuously while the map task runs: each time a key-value pair is produced, it is immediately passed to map for processing. Formatting and mapping are therefore not separate phases in time; they proceed together.
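For example, assuming a small two-line text file read through the default TextInputFormat (the file content below is made up for illustration), the generated pairs would be:

Input file:
Hello Hadoop
Hello Shuffle

Generated key-value pairs (key = byte offset of the line, value = line content):
(0, "Hello Hadoop")
(13, "Hello Shuffle")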
2) Map mapping:
This is where Hadoop's parallelism comes into full play. MapReduce runs the user-specified map program on the machines where the data resides whenever possible. Because file data in HDFS is replicated in multiple copies, the framework can pick the least busy of the nodes that hold a copy of the data.
In this part, the specific processing inside map can be customized by the user.
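As a minimal sketch of a user-defined map, assuming the classic word-count example (the class name is hypothetical and not part of this article), a Mapper could look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a user-defined map: the input key is the line offset, the value is the line text.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the line.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}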
3) Shuffle distribution:
The shuffle process covers everything that happens between the raw output produced by a Mapper and the data finally handed to a Reducer as input. It is the core of MapReduce and can be divided into two stages:
Shuffle on the Mapper side: the results produced by a Mapper are not written to disk directly but are first buffered in memory. When the amount of data in the buffer reaches a configured threshold, it is spilled to local disk in one pass, and sort (sorting), combine (local merging), and partition (assigning records to reducers) are carried out along the way. sort orders the Mapper output by key, combine merges records with the same key to reduce the data volume, and partition decides which Reducer each record is sent to (by default a hash of the key) so that work is spread across the Reducers.
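As an illustrative sketch (the class name is hypothetical; the partitioning rule shown mirrors Hadoop's default HashPartitioner), a job can plug in its own combine and partition steps like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of a custom partition step: each record is routed to a reducer by a hash of its key.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the job driver, assuming the word-count classes sketched in this article:
// job.setCombinerClass(WordCountReducer.class);     // combine: local merge on the map side
// job.setPartitionerClass(WordPartitioner.class);   // partition: choose the target reducer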
Shuffle on the Reducer side: since Mappers and Reducers usually do not run on the same nodes, a Reducer has to fetch the output data of the relevant Mappers from multiple nodes and merge it before the reduce processing can begin.
4) Reduce reduction:
Reducer receives data in the form <key, list of values> and produces output in the form <key, value>. The specific processing can be customized by the user, and the final result is written directly to HDFS. Each reduce task corresponds to one output file whose name begins with part-.
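To round out the word-count sketch above (the class name is again hypothetical), a matching Reducer might look like this; with a single reduce task the result ends up in one part-r-00000 file in the job's output directory on HDFS:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a user-defined reduce: receives <key, list of values>, emits <key, value>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        // The framework writes this pair into the reduce task's output file on HDFS.
        context.write(word, new IntWritable(sum));
    }
}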