
How Hadoop 1.x MapReduce works


How MapReduce handles load balancing and data skew:

The key stage is the shuffle process that runs after the map operation ends: the map output has to be divided into several parts and handed over to the reduce operation, so that each reduce task gets as even a share of the data to compute as possible.

By default, partitioning is done with a hash-modulo operation (the hash value of the key is taken modulo the number of reduce tasks). This can easily cause data skew, where one reduce task ends up with a huge amount of data to compute while another reduce task has almost nothing to process. To deal with this, the program has to control the partition value itself with a custom partitioner, as sketched below.
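As an illustration, here is a minimal sketch of such a custom partitioner using the Hadoop 1.x (org.apache.hadoop.mapred) API. The class name SkewAwarePartitioner and the literal "hot-key" are invented for this example: the idea is simply to scatter one known hot key over all reduce tasks instead of letting the hash-modulo rule send it to a single one (the scattered records would then need a second aggregation pass).

import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Example only: spread a known "hot" key across all reducers to reduce skew.
public class SkewAwarePartitioner implements Partitioner<Text, IntWritable> {

    private final Random random = new Random();

    public void configure(JobConf job) {
        // nothing to configure in this sketch
    }

    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if ("hot-key".equals(key.toString())) {
            // scatter the hot key over every reducer instead of just one
            return random.nextInt(numReduceTasks);
        }
        // everything else keeps the default behaviour: hash modulo reduce count
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// Registered on the job: conf.setPartitionerClass(SkewAwarePartitioner.class);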

Sort: the default sort order is dictionary order of the keys (by ASCII value).

In the shuffle phase, the sort/merge comparison is performed twice: once after the map task, when its output is sorted, and once on the reduce side, before the partition data copied from disk is merged, where data whose keys fall into the same range is merged together. The key ordering can be customized, as sketched below.
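For instance, a minimal sketch of replacing the default dictionary ordering with a numeric ordering of Text keys, again assuming the Hadoop 1.x old API; NumericKeyComparator is a hypothetical name.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Example only: order Text keys as numbers ("9" before "10") instead of
// dictionary (byte-by-byte) order.
public class NumericKeyComparator extends WritableComparator {

    public NumericKeyComparator() {
        super(Text.class, true); // true = instantiate keys for object comparison
    }

    public int compare(WritableComparable a, WritableComparable b) {
        long x = Long.parseLong(a.toString());
        long y = Long.parseLong(b.toString());
        return Long.compare(x, y);
    }
}

// Registered on the job: conf.setOutputKeyComparatorClass(NumericKeyComparator.class);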

Interview points:

1. Partition: divides the data output by map according to a rule and decides which reduce task each record is assigned to. By default, the hash-modulo operation is used.

2. Spill:

Procedure: when the map's in-memory buffer fills to a set proportion, a separate thread is started to write (spill) that data to the local disk.

Sort: data is sorted by key (the ordering is customizable); the default is dictionary sort.

Combiner: (optional) adds up the values that share the same key on the map side, reducing the amount of data spilled to disk (a sketch follows below).
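For reference, a minimal word-count style combiner under the Hadoop 1.x old API might look like this sketch (SumCombiner is an invented name); it sums the counts for each key on the map side so that less data is spilled and copied.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Example only: locally sums the values of one key before the data leaves
// the map side, so fewer records are written to disk and shuffled.
public class SumCombiner extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

// Enabled on the job: conf.setCombinerClass(SumCombiner.class);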

The second half of the shuffle:

The data produced by map on the map node's local disk is copied into the memory of the reduce node. If the data is small, it is processed by reduce directly; when the amount of data is large, it also has to be spilled to disk, merged by key in the same way as on the map side, and then handed to the designated reduce task for execution.

Modifying the default HDFS block size:

This requires modifying the hdfs-site.xml configuration file to add the global parameter dfs.block.size.

The modified parameter looks as follows.
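This is a minimal sketch only, assuming a 128 MB block size purely as an example; the value is given in bytes.

<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- example: 128 MB, expressed in bytes -->
</property>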

Sync the configuration to other nodes and restart hdfs.
