I. What exactly is Shuffle?
Shuffle literally means "to reshuffle." The key reason a Shuffle is needed is that data sharing a common characteristic (for example, the same key) must converge on a single compute node before it can be processed.
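For instance, a simple word count already forces a Shuffle. The following is a minimal sketch (the SparkContext setup and app name are illustrative, not from the original article): every pair with the same word has to converge on one node before `reduceByKey` can sum it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch; the master and app name are illustrative assumptions.
val sc = new SparkContext(new SparkConf().setAppName("ShuffleSketch").setMaster("local[*]"))

val words = sc.parallelize(Seq("spark", "shuffle", "spark", "hash", "sort", "spark"))
val counts = words
  .map(word => (word, 1))   // pairs with the same key start out scattered across partitions
  .reduceByKey(_ + _)       // the Shuffle gathers all pairs with the same key onto one node
counts.collect().foreach(println)
```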
II. What problems may Shuffle face?
1. The amount of data can be very large.
2. How to partition the data, i.e. how to compute the target Partition: Hash, Sort, or Tungsten.
3. Load balancing (avoiding data skew).
4. Network transmission efficiency: compression versus decompression is a tradeoff, and serialization and deserialization also have to be considered.
Note: when running a specific Task, do everything possible to give the data Process Locality; the second lever is to increase the number of data partitions so that each Task processes less data (a configuration sketch illustrating these knobs follows below).
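These tradeoffs map directly onto standard Spark settings. A minimal sketch, assuming Spark 1.x-style configuration keys; the codec, partition count, and input path are hypothetical illustrations rather than recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; every number and codec choice below is an assumption.
val conf = new SparkConf()
  .setAppName("ShuffleTuningSketch")
  .setMaster("local[*]")
  .set("spark.shuffle.compress", "true")        // compress map output: less network traffic, more CPU
  .set("spark.io.compression.codec", "snappy")  // pick the compression ratio vs. speed tradeoff
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // cheaper (de)serialization
val sc = new SparkContext(conf)

// More partitions means each Task processes a smaller slice of the data.
val data = sc.textFile("hdfs:///path/to/input")  // hypothetical input path
val finer = data.repartition(400)                // 400 is an arbitrary illustrative value
```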
III. Hash Shuffle
1. The key cannot be an Array.
2. Hash Shuffle does not sort, so in theory it avoids the time Hadoop MapReduce spends sorting during its Shuffle; in real production workloads a large share of Shuffle operations do not need sorting at all.
Think about it: is a Hash Shuffle that skips sorting necessarily faster than a Sorted Shuffle that sorts? Not necessarily! When the data volume is relatively small, Hash Shuffle is (much) faster than Sorted Shuffle; but when the data volume is large, Sorted Shuffle is generally (much) faster than Hash Shuffle.
3. Each ShuffleMapTask computes, from the hash value of the key, which Partition the current key should be written to, and writes the output for each Partition to a separate file. This yields R files per Task (R being the parallelism of the next Stage). If the current Stage has M ShuffleMapTasks, there will be M * R files in total! (See the sketch at the end of this section.)
Note: in most cases the Shuffle has to move data across the network; only when the Mapper and Reducer happen to be on the same machine can the data simply be read from the local disk.
Hash Shuffle has two fatal weaknesses: first, before the Shuffle it produces a huge number of small files on disk, which leads to a large amount of slow, inefficient I/O; second, memory becomes the bottleneck: because a huge number of file handles and temporary cache entries must be kept in memory, the memory cannot cope once the data scale grows, and OOM and other problems appear.
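The partition calculation in point 3 is just a modulo over the key's hash code. Below is a minimal sketch modeled on Spark's HashPartitioner (not the verbatim source), together with the M * R file-count arithmetic; the values of M and R are illustrative assumptions:

```scala
// Sketch of the hash-based partition choice, modeled on Spark's HashPartitioner.
// numPartitions plays the role of R, the parallelism of the next Stage.
class SimpleHashPartitioner(numPartitions: Int) {
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ =>
      // Keep the result in [0, numPartitions) even for negative hash codes.
      val rawMod = key.hashCode % numPartitions
      rawMod + (if (rawMod < 0) numPartitions else 0)
  }
}

// Without consolidation, each of the M ShuffleMapTasks writes one file per target partition.
val m = 1000  // illustrative number of ShuffleMapTasks
val r = 200   // illustrative parallelism of the next Stage
println(s"files produced by Hash Shuffle: ${m * r}")  // 200000
```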
IV. Sorted Shuffle
To alleviate the problems above (too many files open at the same time consuming excessive Writer Handler memory, and the sheer number of files causing massive random reads and writes and extremely inefficient disk I/O), Spark later introduced the Consolidate mechanism to merge small files. With it, the number of files produced by a Shuffle becomes cores * R. When the number of ShuffleMapTasks is significantly larger than the number of Cores that can run in parallel at the same time, the number of files produced by the Shuffle drops dramatically, which greatly reduces the chance of OOM.
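A minimal sketch of turning consolidation on, assuming a Spark 1.x deployment where the hash-based shuffle is still selected; both keys are Spark 1.x settings and the numbers are illustrative:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")            // consolidation applies to the hash-based shuffle
  .set("spark.shuffle.consolidateFiles", "true")   // merge per-task outputs into per-core file groups

// With consolidation the file count drops from M * R to cores * R.
val cores = 16  // illustrative number of cores running ShuffleMapTasks in parallel
val r = 200     // same illustrative next-Stage parallelism as above
println(s"files with consolidation: ${cores * r}") // 3200 instead of 200000
```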
For this reason, Spark introduced the pluggable Shuffle framework, which makes it easy to customize the Shuffle module when the system is upgraded and lets third parties plug in a Shuffle implementation optimized for their specific business scenario. The core interface is ShuffleManager, with HashShuffleManager and SortShuffleManager implemented by default. The relevant configuration in Spark 1.6.1 is as follows:
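The configuration snippet from the original article was not preserved; the sketch below approximates how Spark 1.6.1 resolves the `spark.shuffle.manager` setting to a ShuffleManager class (class names are from the Spark 1.6 codebase, but treat the exact shape of the lookup as a reconstruction rather than the verbatim source):

```scala
import org.apache.spark.SparkConf

// Reconstruction of the shuffle-manager lookup performed inside Spark 1.6.x.
val shortShuffleMgrNames = Map(
  "hash"          -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort"          -> "org.apache.spark.shuffle.sort.SortShuffleManager",
  "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")

val conf = new SparkConf()
// "sort" is the default in Spark 1.6.1; "hash" or "tungsten-sort" can be set instead.
val shuffleMgrName  = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
println(s"shuffle manager class: $shuffleMgrClass")
```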