This article introduces the principle of Shuffle and the corresponding Consolidation optimization mechanism. The content is fairly detailed; interested readers can use it as a reference, and I hope it is helpful to you.
First, what is Shuffle?
Shuffle is the bridge between MapTask and ReduceTask: the output of a MapTask must go through the Shuffle process before it becomes the input of a ReduceTask. In a distributed cluster, ReduceTasks need to pull the output of MapTasks across nodes, which involves network transmission and disk IO, so the quality of the Shuffle directly affects the performance of the whole application. We usually divide the Shuffle process into two parts: writing the results on the MapTask side is called ShuffleWrite, and pulling the data on the ReduceTask side is called ShuffleRead.
The principle of Spark's Shuffle is basically similar to that of MapReduce's Shuffle, and some concepts carry over directly. In the Shuffle process, the side that provides the data is called the Map side, and each task that produces data there is called a Mapper; the side that receives the data is called the Reduce side, and each task that pulls data there is called a Reducer. In essence, the Shuffle process partitions the data produced on the Map side with a partitioner and sends each partition's data to the corresponding Reducer.
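As a concrete illustration, here is a minimal sketch of a typical shuffle operation: a reduceByKey with an explicit HashPartitioner. The map-side tasks partition their (key, value) output by key, and the reduce-side tasks pull the records that hash to their partition. The application name, word list and local master are illustrative assumptions, not from the article.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object ShuffleExample {
  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val conf = new SparkConf().setAppName("shuffle-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 3)

    // Map side: each task emits (word, 1) pairs and partitions them by key.
    // Reduce side: each task pulls the pairs whose key hashes to its partition
    // and sums the counts. The reduceByKey call below triggers the shuffle.
    val counts = words
      .map(w => (w, 1))
      .reduceByKey(new HashPartitioner(3), _ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```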
As mentioned in the previous article, the division of Stages in a Spark job is based on the wide and narrow dependencies between RDDs. In a narrow dependency, the relationship between parent RDD partitions and child RDD partitions is one-to-one (or many-to-one, where each parent partition maps to exactly one child partition); there is no shuffle, and the data of a parent partition goes to a single child partition. In a wide dependency, the relationship between parent and child partitions is one-to-many; a shuffle occurs, and the data of one parent partition goes to several different child partitions.
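A small sketch of this distinction, assuming an existing SparkContext named sc: map keeps a narrow dependency and stays inside one stage, while groupByKey introduces a wide dependency and therefore a shuffle boundary, which toDebugString makes visible in the lineage.

```scala
// Narrow vs. wide dependency (assumes an existing SparkContext `sc`).
val pairs = sc.parallelize(1 to 100, 4).map(i => (i % 10, i)) // map: narrow dependency, same stage

val grouped = pairs.groupByKey()                              // groupByKey: wide dependency, inserts a shuffle

// The lineage printout marks the shuffle boundary, i.e. where a new stage begins.
println(grouped.toDebugString)
```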
In real-world scenarios, 90% of tuning occurs in the shuffle phase, so this type of tuning is very important.
Second, the principle of ordinary Spark HashShuffle
Let's first look at a diagram of how the ordinary Shuffle works; I will explain the basic Shuffle principle using this diagram.
Normal Shuffle execution process:
1. Each ShuffleMapTask creates one bucket for every ResultTask, so the total number of buckets equals the number of ShuffleMapTasks times the number of ResultTasks; with three ShuffleMapTasks and three ResultTasks, as in the diagram, 3 × 3 = 9 buckets are created.
2. The records produced by a ShuffleMapTask are written into the buckets according to the configured partitioning algorithm. The partitioner can be customized; by default, records go to buckets according to the hash of their key (see the hash-bucket sketch after this list), and each bucket eventually lands on disk as a ShuffleBlockFile.
3. Each ShuffleMapTask reports the location of its output ShuffleBlockFiles, packaged as a MapStatus, to the MapOutputTrackerMaster on the driver side, which the DAGScheduler relies on.
4. When a ResultTask starts, it queries the MapOutputTracker for the ShuffleBlockFile locations, using its own task id and the ids of the ShuffleMapTasks it depends on, and then fetches the corresponding ShuffleBlockFiles from the local or remote BlockManager as its input.
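To make step 2 concrete, here is a minimal sketch of what the default hash-based partitioning does: a key is assigned to a bucket by taking its hashCode modulo the number of ResultTasks, which mirrors the behaviour of Spark's HashPartitioner. The helper function below is illustrative, not Spark's actual source.

```scala
// Illustrative helper: which bucket (ShuffleBlockFile) a key lands in
// under hash partitioning with `numResultTasks` reducers.
def bucketFor(key: Any, numResultTasks: Int): Int = {
  val h = key.hashCode % numResultTasks
  if (h < 0) h + numResultTasks else h // keep the result non-negative, as HashPartitioner does
}

// Example: with 3 ResultTasks, each ShuffleMapTask writes its records
// into one of 3 buckets according to the key's hash.
Seq("spark", "shuffle", "hash").foreach { k =>
  println(s"$k -> bucket ${bucketFor(k, 3)}")
}
```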
If there are many ShuffleMapTasks and ResultTasks, a huge number of small files is generated, so ShuffleWrite spends a great deal of time on creating disk files and on disk IO, which puts heavy pressure on the system. Since the diagram above does not show this very well, let me describe it in words:
Case A: suppose there are 4 ShuffleMapTasks and 4 ResultTasks, but my machine has only 2 CPU cores and each task needs one core by default. The 4 ShuffleMapTasks therefore run in two batches, with only two tasks running at a time. The first batch of tasks generates 2 × 4 = 8 ShuffleBlockFiles, the second batch generates another 2 × 4 = 8, for a total of 16 small files.
Case B: again 4 ShuffleMapTasks and 4 ResultTasks, but now the machine has 4 or more CPU cores, so all 4 ShuffleMapTasks run in the same batch. The result is still 4 × 4 = 16 small files.
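A minimal sketch of this arithmetic: with plain HashShuffle the file count depends only on the numbers of ShuffleMapTasks and ResultTasks, not on how many cores (and hence how many batches) run them.

```scala
// Plain HashShuffle: every ShuffleMapTask writes one file per ResultTask,
// so the file count is mapTasks * resultTasks regardless of batching.
def hashShuffleFiles(mapTasks: Int, resultTasks: Int): Int = mapTasks * resultTasks

println(hashShuffleFiles(4, 4)) // 16 files in both case A (2 cores) and case B (4+ cores)
```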
Existing problems:
1. A large number of small files is generated on disk during Shuffle. When ResultTasks pull data across the cluster, this leads to a large number of file creations and disk IO operations.
2. It can lead to OOM and to a large amount of slow, inefficient IO. Writing and reading so many files creates a large number of objects; these objects live in heap memory, so heap memory runs short, which triggers frequent GC, and GC pressure in turn can lead to OOM. In addition, a large number of file handles and temporary buffers must be kept in memory, so when the data volume is large the memory cannot cope and problems such as OOM appear.
Third, the principle of Spark HashShuffle with the Consolidation mechanism enabled
To address the shortcomings of the basic Shuffle described above, later versions of Spark (starting with Spark 0.8.1) introduced the Consolidation mechanism, controlled by the parameter spark.shuffle.consolidateFiles. Setting it to true enables the optimization; a minimal configuration sketch is shown below, after which we look at how the optimized Shuffle is handled.
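This is a minimal sketch, assuming a Spark 1.x application (the flag only affects the hash-based shuffle manager and was removed together with it in later releases); the application name is a placeholder.

```scala
import org.apache.spark.SparkConf

// Enabling the Consolidation mechanism (Spark 1.x, hash-based shuffle only).
val conf = new SparkConf()
  .setAppName("consolidation-demo")
  .set("spark.shuffle.consolidateFiles", "true")
```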
Optimized Shuffle principle:
This essentially optimizes "case B" above: the outputs of the ShuffleMapTasks that run on the same core are merged into the same files, so the number of files becomes cores × ResultTasks ShuffleBlockFiles. Note that ShuffleMapTasks running in the same batch must write to different files; only ShuffleMapTasks in different batches write to the same files. After the first batch of ShuffleMapTasks has run, the ShuffleMapTasks that later run on the same CPU core write into the ShuffleBlockFiles already created by the ShuffleMapTask that ran on that core.
At this point the principle of Spark HashShuffle and its Consolidation mechanism has been explained. However, if the degree of parallelism or the number of data partitions on the Reducer side is very large, cores × ResultTasks is still a big number, and many small files are still generated.
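A small sketch comparing the two file counts, using hypothetical numbers for cores, map tasks and result tasks:

```scala
// HashShuffle file counts with and without the Consolidation mechanism.
def withoutConsolidation(mapTasks: Int, resultTasks: Int): Int = mapTasks * resultTasks
def withConsolidation(cores: Int, resultTasks: Int): Int = cores * resultTasks

// Hypothetical example: 1000 map tasks, 1000 result tasks, 16 cores writing on a node.
println(withoutConsolidation(1000, 1000)) // 1,000,000 files
println(withConsolidation(16, 1000))      // 16,000 files: far fewer, but still many when reducers are numerous
```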
The principles above all concern HashShuffleManager. Since Spark 1.2.x, HashShuffleManager is no longer Spark's default ShuffleManager; the default is SortShuffleManager. HashShuffleManager was removed entirely in Spark 2.0.
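For reference, in Spark 1.x the shuffle implementation could be chosen explicitly via the spark.shuffle.manager setting; this is only an illustrative sketch, since from Spark 2.0 on the sort-based shuffle is the only one available.

```scala
import org.apache.spark.SparkConf

// Selecting the shuffle manager explicitly in Spark 1.x ("hash" or "sort").
// From Spark 1.2 on, "sort" is already the default.
val conf = new SparkConf()
  .setAppName("shuffle-manager-demo")
  .set("spark.shuffle.manager", "sort")
```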
That is all on the principle of Shuffle and the corresponding Consolidation optimization mechanism; I hope the content above is of some help to you.