
What are Maps and Reduces?


This article explains in detail what map and reduce tasks are. The content is quite practical, so it is shared here as a reference; I hope you find it useful after reading.

Map and reduce are the core functions of Hadoop. Hadoop performs distributed parallel computation by running many map and reduce tasks in parallel.

From this point of view, if the number of map and reduce tasks is set to 1, the user's job will not run in parallel at all; but the number should not be too large either. More map and reduce tasks increase parallelism, yet too many of them will overwhelm the Hadoop framework with system resource overhead. Users should therefore keep the number of map/reduce tasks in a job within a reasonable range, which both improves the system's load balance and reduces the cost of task failures.

Extreme cases: at one extreme, 1 map and 1 reduce means nothing runs in parallel; at the other, 1,000,000 maps and 1,000,000 reduces would exhaust system resources through framework overhead alone.

So choosing a reasonable number of tasks per job can significantly improve Hadoop's performance. Increasing the number of tasks raises framework overhead, but it also improves load balancing and reduces the cost of task failures.

Number of map tasks

The number of map tasks is usually determined by the DFS block size configured on the Hadoop cluster, that is, by the total number of blocks in the input files. The normal degree of parallelism for map tasks is roughly 10-100 per node; for jobs that consume little CPU, it can be raised to about 300 maps per node. However, since every task in Hadoop takes some time to initialize, each map task should run for at least about a minute to be worthwhile.

The data is split as follows. By default, the InputFormat splits the input according to the cluster's DFS block size, and each split is handled by one map task. Users can adjust this from the job-submission client through the mapred.min.split.size parameter. Another relevant parameter is mapred.map.tasks, which is only a hint: it takes effect only when the number of map tasks determined by the InputFormat would otherwise be smaller than the mapred.map.tasks value. The number of map tasks can likewise be set manually with JobConf's conf.setNumMapTasks(int num). This method can increase the number of map tasks, but it cannot set the number below the value Hadoop derives by splitting the input data. To improve cluster concurrency, you can set a relatively large default number of map tasks, so that when a user's own setting is small or below the automatically derived value, the cluster as a whole still runs efficiently.
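To make this concrete, here is a minimal driver sketch using the old mapred API. The class name, the input/output arguments, and the specific values passed to setNumMapTasks and mapred.min.split.size are illustrative assumptions, not settings prescribed by this article.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class SplitTuningDriver {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SplitTuningDriver.class);
        conf.setJobName("split-tuning-example");
        conf.setInputFormat(TextInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // hypothetical input directory
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // hypothetical output directory

        // Hint only: the InputFormat still decides the real split count,
        // so this cannot push the number of maps below the block-derived value.
        conf.setNumMapTasks(100);

        // Raise the lower bound on split size (here 128 MB) to get fewer, larger map tasks.
        conf.set("mapred.min.split.size", String.valueOf(128L * 1024 * 1024));

        JobClient.runJob(conf);
    }
}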

MapReduce reads splits from HDFS and hands them to the Mapper through the InputFormat. A split is the smallest unit of computation in MapReduce, and each split is processed by one map task.
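For reference, a minimal old-API WordCount mapper might look like the sketch below; the framework runs one instance of it per map task and calls map() once for every record in that task's split. The class name is an assumption made for this example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once per input record of the split assigned to this map task.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);
        }
    }
}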

By default, one HDFS block corresponds to one split.

When performing WordCount:

If an input file is smaller than the 64 MB block size, it is stored in one HDFS block, which corresponds to one split and therefore produces one map task.

If an input file is 150 MB with the default block size, it is stored in three HDFS blocks, which correspond to three splits and therefore produce three map tasks.

If there are three input files, each smaller than 64 MB, with the default block size, they are stored in three HDFS blocks, which correspond to three splits and therefore produce three map tasks.
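The arithmetic behind these three cases can be sketched as follows, assuming the default one-split-per-block behaviour and a 64 MB block size; the class and method names are made up for illustration.

public class SplitCountEstimate {

    // Each full or partial block becomes one split, hence one map task.
    static long estimateSplits(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;                                   // 64 MB
        System.out.println(estimateSplits(10L * 1024 * 1024, blockSize));     // 10 MB file  -> 1 map task
        System.out.println(estimateSplits(150L * 1024 * 1024, blockSize));    // 150 MB file -> 3 map tasks
        System.out.println(3 * estimateSplits(10L * 1024 * 1024, blockSize)); // three small files -> 3 map tasks
    }
}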

Users can also control the relationship between blocks and splits: one split can cover multiple HDFS blocks, so the split-to-block relationship is one-to-many.

In summary, the number of map tasks in a MapReduce job is determined by:

The number and size of the input files

The relationship Hadoop establishes between splits and blocks

Number of reduce tasks

At run time, a reduce task usually has to copy data from the relevant map tasks over to its own node before it can process anything, so compared with map tasks, reduce tasks are relatively resource-constrained and slower to run. The appropriate number of reduce tasks is therefore 0.95 or 1.75 × (number of nodes × mapred.tasktracker.reduce.tasks.maximum). With the 0.95 factor, all reduce tasks can start running as soon as the map output transfer finishes; with the 1.75 factor, the faster nodes finish their first round of reduce tasks and start a second round, which is better for load balancing. Note that increasing the number of reduce tasks raises the system's resource overhead, but it also improves load balancing and reduces the impact of task failures. As with map tasks, the number of reduce tasks can be set with JobConf's conf.setNumReduceTasks(int num).
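A small sketch of this heuristic is shown below; the node count and slots-per-node value are assumed example numbers, and in a real driver the chosen count would be applied with conf.setNumReduceTasks.

public class ReduceCountHeuristic {
    public static void main(String[] args) {
        int nodes = 10;              // assumed cluster size
        int reduceSlotsPerNode = 2;  // assumed mapred.tasktracker.reduce.tasks.maximum

        // 0.95: every reduce can start as soon as the map output transfer finishes.
        int singleWave = (int) (0.95 * nodes * reduceSlotsPerNode);
        // 1.75: faster nodes pick up a second wave of reduces, improving load balance.
        int doubleWave = (int) (1.75 * nodes * reduceSlotsPerNode);

        System.out.println("0.95 factor -> " + singleWave + " reduce tasks");
        System.out.println("1.75 factor -> " + doubleWave + " reduce tasks");
    }
}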

Number of reduce tasks set to 0

Some jobs do not need a reduce phase at all, and for them the number of reduce tasks can be set to 0. In this case the job runs noticeably faster: the map output is written directly to the output directory set by setOutputPath(path) rather than being written to local disk as intermediate results, and the framework does not sort it before writing it to the file system.
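A minimal map-only driver sketch, assuming the old mapred API and hypothetical input/output arguments, could look like this:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlyDriver {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(MapOnlyDriver.class);
        conf.setJobName("map-only-example");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // hypothetical input
        // With zero reduces, map output goes straight to this directory, unsorted.
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // hypothetical output

        conf.setNumReduceTasks(0);   // skip the shuffle/sort and reduce phase entirely

        JobClient.runJob(conf);
    }
}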

mapred.tasktracker.map.tasks.maximum is the maximum number of map tasks that a single TaskTracker can run concurrently. The default value is 2.

According to "Pro Hadoop", it is common to set this value to the effective number of CPUs on the node. Hadoop divides a job into map and reduce tasks, and choosing a reasonable number of tasks per job can significantly improve Hadoop's performance. Increasing the number of tasks raises framework overhead but also improves load balancing and reduces the cost of task failures. At one extreme, 1 map and 1 reduce means nothing runs in parallel; at the other, 1,000,000 maps and 1,000,000 reduces would exhaust system resources through framework overhead alone.

Number of Map tasks

The number of map tasks is usually driven by the number of DFS blocks in the input data, which often leads users to adjust the number of map tasks by adjusting the DFS block size. The right degree of parallelism for map tasks seems to be around 10-100 maps per node, although this has been pushed to about 300 maps per node for jobs that use very little CPU. Task initialization takes some time, so each map task should ideally run for at least a minute.

Controlling the number of map tasks is actually rather subtle. The mapred.map.tasks parameter is only a hint to the InputFormat about how many map tasks to create. The InputFormat is expected to divide the total number of input bytes into an appropriate number of fragments; by default, however, the DFS block size acts as an upper bound on the fragment size, and the lower bound can be set with the mapred.min.split.size parameter. So if you have 10 TB of input data and a DFS block size of 128 MB, you end up with at least roughly 82,000 map tasks unless mapred.map.tasks is set even larger. Ultimately, the InputFormat determines the number of map tasks.
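The sketch below reproduces that back-of-the-envelope calculation, mirroring the split-size rule used by the old FileInputFormat (splitSize = max(minSize, min(goalSize, blockSize))); the class name is made up, and the default values assumed for mapred.min.split.size and the mapred.map.tasks hint are assumptions for illustration.

public class MapCountEstimate {
    public static void main(String[] args) {
        long totalBytes    = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
        long blockSize     = 128L * 1024 * 1024;              // 128 MB DFS block size
        long minSplitSize  = 1;                                // mapred.min.split.size (assumed default)
        long numSplitsHint = 1;                                // mapred.map.tasks hint (assumed)

        long goalSize  = totalBytes / numSplitsHint;
        long splitSize = Math.max(minSplitSize, Math.min(goalSize, blockSize));

        System.out.println(totalBytes / splitSize + " map tasks"); // 81920, i.e. roughly 82k
    }
}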

The number of map tasks can also be set manually with JobConf's conf.setNumMapTasks(int num). This method can increase the number of map tasks, but it cannot set the number below the value Hadoop derives by splitting the input data.

Number of Reduce tasks

The appropriate number of reduce tasks is 0.95 or 1.75 × (number of nodes × mapred.tasktracker.reduce.tasks.maximum). With the 0.95 factor, all reduce tasks can start running as soon as the map output transfer finishes; with the 1.75 factor, the faster nodes finish their first round of reduce tasks and start a second round, which is better for load balancing.

Currently, the number of reduce tasks is also limited by the buffer memory needed for the output files (io.buffer.size × 2 × number of reduce tasks must stay well below the available heap size).
