How to understand shuffle in MapReduce

2025-04-26 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

Many people who are new to MapReduce find the shuffle hard to understand. This article summarizes what the shuffle is and how it works on the Map side and on the Reduce side.

Conceptual explanation:

Shuffle: the literal name is "mix and wash". Shuffle is actually a very simple concept. Put simply, it means shuffling, as with a deck of cards.

Shuffle: as far as [key,value] pairs are concerned, it means redistributing them according to fixed rules.
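To make the "fixed rules" idea concrete, here is a minimal sketch in Python. It assumes the rule is "hash the key, then take it modulo the number of reducers", which is the same principle as Hadoop's default hash partitioner. The function names and the use of zlib.crc32 as the hash are choices made for this illustration, not Hadoop's API.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_reducers: int) -> int:
    """Fixed rule (assumed for this sketch): stable hash of the key modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route every (key, value) pair to the reducer chosen by the rule."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return dict(buckets)


pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("kiwi", 1)]
routed = shuffle(pairs, num_reducers=2)
for reducer_id, items in sorted(routed.items()):
    print(reducer_id, items)
```

Because the rule depends only on the key, every pair with the same key ends up at the same reducer, no matter which Map task produced it.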

Premise

I have only used Hadoop 1 and have not used Hadoop YARN, so for the current shuffle mechanism please refer to the latest underlying API.

1: Whether on the Map side or the Reduce side, and whether in Hadoop MapReduce or Storm, the internal processing of data often requires a reasonable trade-off between the underlying memory and disk.

Map side:

1: The Map output is not simply written straight to disk. The process is more involved: for the sake of efficiency, the output is first buffered in memory and pre-sorted.

2: By default, every Map task has a circular buffer that holds the Map output. As I remember, it is about 100 MB. Once the buffer fills to a fixed ratio (the spill threshold), its contents are written to disk, and while that write is in progress the Map output continues to be written to the buffer.

3: Before the buffered data is written to disk, it is partitioned and sorted once. The partitions correspond to the reducers that will eventually receive the data, and the data is sorted within each partition.

4: The data of each partition is then transferred to the corresponding Reduce side over HTTP.
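The Map-side steps above can be sketched as a toy model in Python. This is a simplified illustration, not Hadoop's implementation: the class name MapOutputBuffer, the capacity counted in records (the text speaks of roughly 100 MB), the 0.8 spill ratio, and the crc32-based partition rule are all assumptions made for the example. The spill here is also synchronous, whereas the text describes the Map continuing to write while the spill is in progress.

```python
import zlib


class MapOutputBuffer:
    """Toy model of a map-side buffer that spills sorted, partitioned runs."""

    def __init__(self, capacity=10, spill_ratio=0.8, num_reducers=2):
        self.capacity = capacity          # max records in memory (assumed unit)
        self.spill_ratio = spill_ratio    # fill ratio that triggers a spill
        self.num_reducers = num_reducers
        self.buffer = []                  # in-memory records: (partition, key, value)
        self.spills = []                  # each entry stands in for one spill file

    def collect(self, key, value):
        """Buffer one map output record, spilling when the threshold is reached."""
        part = zlib.crc32(key.encode("utf-8")) % self.num_reducers
        self.buffer.append((part, key, value))
        if len(self.buffer) >= self.capacity * self.spill_ratio:
            self.spill()

    def spill(self):
        """Sort by (partition, key) and move the buffer contents to 'disk'."""
        if not self.buffer:
            return
        self.buffer.sort(key=lambda record: (record[0], record[1]))
        self.spills.append(self.buffer)
        self.buffer = []

    def close(self):
        """Flush whatever is left when the map task finishes."""
        self.spill()


buf = MapOutputBuffer(capacity=5, spill_ratio=0.8, num_reducers=2)
for word in "the quick brown fox jumps over the lazy dog".split():
    buf.collect(word, 1)
buf.close()
print([len(run) for run in buf.spills])
```

Each spilled run is already grouped by partition and sorted by key, which is what lets the Reduce side get away with merging instead of sorting from scratch.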

Reduce side:

First: the first phase on the Reduce side is the copy phase, in which the data is copied from the Map side to the Reduce side. If a Map output is quite small, it is copied into memory on the Reduce side; otherwise it is copied to disk.
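A rough sketch of that size-based decision, in Python, might look like the following. It assumes the map outputs have already been fetched (the HTTP transfer itself is not modeled), and the function name copy_phase and the memory_limit threshold, counted in records, are hypothetical values chosen for the illustration.

```python
import os
import pickle
import tempfile


def copy_phase(segments, tmp_dir, memory_limit=3):
    """Place each fetched segment in memory or on disk, depending on its size.

    `segments` stands in for map outputs already fetched from the Map side;
    `memory_limit` (in records) is a hypothetical threshold for this sketch.
    """
    in_memory, on_disk = [], []
    for segment in segments:
        if len(segment) <= memory_limit:
            in_memory.append(segment)            # small: keep in memory
        else:
            fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".segment")
            with os.fdopen(fd, "wb") as out:     # large: write to local disk
                pickle.dump(segment, out)
            on_disk.append(path)
    return in_memory, on_disk


with tempfile.TemporaryDirectory() as tmp_dir:
    small = [("a", 1), ("b", 1)]
    large = [(f"k{i:02d}", 1) for i in range(10)]
    in_memory, on_disk = copy_phase([small, large], tmp_dir)
    with open(on_disk[0], "rb") as f:
        restored = pickle.load(f)
    print(len(in_memory), len(on_disk))
```

This is the same memory-versus-disk trade-off mentioned in the premise, applied on the Reduce side.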

Second: after the data has been pulled, we enter the sort phase. The sort phase is more accurately called a merge phase, because the sorting has already been done on the Map side. The Reduce side only needs to merge.
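Why a merge is enough can be shown with a short Python sketch. It assumes each copied run is already sorted by key, as the Map side produces them. The function name merge_and_group is made up for this example, and heapq.merge from the standard library stands in for the merge step.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_and_group(sorted_runs):
    """Merge already-sorted runs by key, then yield (key, [values]) groups."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Three runs, each sorted by key, as if copied from three Map tasks.
run_a = [("apple", 1), ("kiwi", 1)]
run_b = [("apple", 1), ("pear", 1)]
run_c = [("kiwi", 1), ("pear", 1)]

grouped = list(merge_and_group([run_a, run_b, run_c]))
for key, values in grouped:
    print(key, sum(values))
```

The merge only ever compares the current head of each run, so no full re-sort of the combined data is needed before the values for one key are handed to the reduce function.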

With MapReduce, I once implemented a QQ-style friend-circle algorithm, and I was the person who introduced the whole algorithm. But by now I have forgotten all of it. Memories fade, and things that are not used are forgotten.
