This article explains the differences between the Shuffle processes of Hadoop and Spark. The explanation is kept simple and clear, so it should be easy to follow and understand.
I. Preface
Distributed computing based on the MapReduce programming paradigm is, in essence, the process of computing intersections, unions, differences, aggregations, sorts, and so on over data. The divide-and-conquer idea behind distributed computing means that each node computes on only part of the data, that is, it processes a single shard. So if you want all of the data that corresponds to a particular key, the records sharing that key must be gathered onto the same Reduce task node for processing. The MapReduce paradigm defines a process called Shuffle to achieve exactly this.
II. The purpose of this article
This article analyzes the Shuffle processes of Hadoop and Spark and compares the differences between them.
III. The Shuffle process of Hadoop
Shuffle describes how data travels from the Map side to the Reduce side. It is roughly divided into sort, spill, merge, copy, and merge sort. The overall process is as follows:
![image](https://yqfile.alicdn.com/e4ccedfb6ccaaa0d3c0ad5b3b7ab83d96dd9fed2.png)
In the figure above, the output of each Map task is divided into red, green, and blue parts. The split is made according to the key, and the partitioning algorithm can be implemented yourself, for example hash or range partitioning. Each Reduce task then pulls only the data of its own color, which achieves the goal of bringing the same key to the same Reduce node. Let's go through the stages of Shuffle one by one.
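As a concrete illustration of the hash option, here is a minimal partitioner sketch (written in Scala; the Text/IntWritable key and value types and the class name are assumptions for a word-count-style job, and Hadoop's built-in HashPartitioner already does essentially the same thing):

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Partitioner

// Hash partitioning: records with the same key always map to the same
// reduce task, which is exactly what the Shuffle needs to guarantee.
class WordHashPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numReduceTasks: Int): Int =
    // Mask the sign bit so the index is always in [0, numReduceTasks).
    (key.hashCode & Int.MaxValue) % numReduceTasks
}
```

A range partitioner would instead compare the key against sampled boundaries and pick the bucket whose range contains it.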
The Map side performs the operations shown in the figure below:
1. Sort on Map
The Map output is first written to a circular memory buffer (kvbuffer). When the buffer reaches a threshold (configurable; the default is 80% of the buffer size), a spill to disk begins, but a sort takes place before the spill. The sort orders the data in the kvbuffer first by partition value and then by key, and only the index data is moved. The result is that the index records in kvmeta are grouped together by partition, and within the same partition they are ordered by key.
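For reference, a sketch of how the buffer size and spill threshold can be tuned (these are the property names used by Hadoop 2.x MapReduce, and the values shown are the documented defaults):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// The ring buffer is 100 MB by default and spills once it is 80% full.
val conf = new Configuration()
conf.set("mapreduce.task.io.sort.mb", "100")
conf.set("mapreduce.map.sort.spill.percent", "0.80")
val job = Job.getInstance(conf, "shuffle-tuning-sketch")
```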
2. Spill (overflow)
Once the sort is complete, the data is flushed to disk. The flush proceeds partition by partition: one partition is written out, then the next, and the data inside each partition is ordered. Over the lifetime of a Map task the buffer usually spills several times, so multiple spill files are generated.
3. Merge
Spilling produces many small files, which would make it inefficient for the Reduce side to pull data, so a merge step follows. Segments belonging to the same partition are merged into a single segment, and all segments are then assembled into one final file, which completes the merge, as shown in the following figure.
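To make the merge step concrete, here is an illustrative sketch of a k-way merge over sorted spill segments (this is not Hadoop's actual merger, only the underlying idea; String keys are assumed for simplicity):

```scala
import scala.collection.mutable

// Merge several key-sorted segments of one partition into a single
// sorted stream using a min-heap keyed on the smallest pending key.
def mergeSegments[V](segments: Seq[Iterator[(String, V)]]): Iterator[(String, V)] = {
  val iters = segments.toIndexedSeq
  val heap = mutable.PriorityQueue.empty[((String, V), Int)](
    Ordering.by[((String, V), Int), String](_._1._1).reverse)
  for ((it, i) <- iters.zipWithIndex if it.hasNext) heap.enqueue((it.next(), i))

  new Iterator[(String, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (String, V) = {
      val (kv, i) = heap.dequeue()
      if (iters(i).hasNext) heap.enqueue((iters(i).next(), i)) // refill from that segment
      kv
    }
  }
}
```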
At this point the work on the Map side is complete, and the Reduce side takes over.
Reduce operation
The overall process is in the red box of the figure below:
![image](https://yqfile.alicdn.com/71a52ed4799d3dbbde4552028f3aea05bc1c98c0.png)
1. Fetch (copy)
Each Reduce task pulls its corresponding shard from every Map task. This is done over HTTP: each Map node starts a resident HTTP server, and the Reduce nodes request data from that server. Because everything is transferred over the network, this is a heavyweight operation.
2. Merge sort
After the Reduce side has pulled the shards corresponding to it from every Map node, the data is sorted again (a merge sort over the already-sorted segments). Once the sort completes, the result is handed to the Reduce function for computation.
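To show what the Reduce function finally sees after the merge sort, here is a minimal word-count reducer sketch (the class name and the Text/IntWritable types are assumptions; the key point is that all values for one key arrive together):

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import scala.jdk.CollectionConverters._

// Thanks to the Shuffle, every occurrence of a given key is delivered
// to a single reduce() call as one Iterable of values.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val sum = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(sum))
  }
}
```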
IV. Summary
Now that the whole shuffle process has been covered, let's summarize a few points:
The shuffle process exists so that data can be aggregated globally by key.
Sorting accompanies the entire shuffle process, so Hadoop's shuffle is sort-based.
V. The Shuffle process of Spark
Spark's shuffle is comparatively simple: because global ordering is not required, there are far fewer sort and merge operations. Spark's shuffle is divided into two phases, write and read. Let's look at shuffle write first.
1. Shuffle write
The processing logic of shuffle write is placed at the end of the ShuffleMapStage (because Spark divides stages according to whether a shuffle occurs, that is, at wide dependencies). Each record of that stage's final RDD is written into the cache bucket of its corresponding partition, as shown in the following figure:
Description:
There are two CPU cores in the figure above, so two ShuffleMapTasks can run at the same time.
Each task writes to a group of bucket buffers; the number of buffers equals the number of reduce tasks.
Each bucket buffer produces a corresponding ShuffleBlockFile.
How does a ShuffleMapTask decide which bucket a record is written to? That is determined by the partitioner, which can be hash-based or range-based (a small sketch follows after this description).
How many ShuffleBlockFiles are generated in the end? The number of ShuffleMapTasks multiplied by the number of reduce tasks, which is huge (for example, 1,000 map tasks and 1,000 reduce tasks would yield 1,000,000 files).
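Returning to the bucket question above, here is a minimal sketch using Spark's HashPartitioner (the sample keys and the partition count of 3 are just for illustration):

```scala
import org.apache.spark.HashPartitioner

// Each key is hashed to one of numReduceTasks buckets; identical keys
// always land in the same bucket, hence in the same reduce task.
val numReduceTasks = 3
val partitioner = new HashPartitioner(numReduceTasks)
Seq("hello", "world", "spark").foreach { key =>
  println(s"$key -> bucket ${partitioner.getPartition(key)}")
}
```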
Is there any way to solve the problem of too many generated files? Yes, you can enable FileConsolidation. The shuffle process after enabling FileConsolidation is as follows:
ShuffleMapTasks that execute one after another on the same CPU core can share a bucket buffer and append to the same ShuffleFile. The ShuffleFile shown above is actually made up of multiple ShuffleBlocks, so the number of files generated per worker drops to the number of CPU cores multiplied by the number of reduce tasks, which greatly reduces the file count.
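As a sketch of how this was switched on in the old hash-based shuffle (these properties applied to Spark 1.x and were removed once sort-based shuffle became the only implementation; they are shown only to illustrate the mechanism described above):

```scala
import org.apache.spark.SparkConf

val consolidatedConf = new SparkConf()
  .setAppName("file-consolidation-sketch")
  .set("spark.shuffle.manager", "hash")           // legacy hash-based shuffle
  .set("spark.shuffle.consolidateFiles", "true")  // share one ShuffleFile per core per reduce
```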
2. Shuffle read
The shuffle write phase writes each data shard to its corresponding shard file. At that point everything is in place; the Reduce side only needs to pull the corresponding data and compute.
So when does shuffle read happen? Does it wait for all ShuffleMapTasks to finish before fetching data? In theory, fetching could start as soon as a single ShuffleMapTask has finished. In practice, however, Spark waits for the parent stage to finish before executing the child stage, so fetching begins only after every ShuffleMapTask has completed. Fetched data is first stored in a buffer, so the FileSegments fetched in one go cannot be too large; if the fetched data exceeds a threshold, it is spilled to disk.
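The size of that in-flight read buffer is bounded by a Spark configuration property; a sketch follows (48m is the documented default, shown only for illustration):

```scala
import org.apache.spark.SparkConf

// Caps how much map output a reducer fetches at once into its buffer.
val readConf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "48m")
```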
After a fetch, a buffer's worth of data can be aggregated. This raises a question: if each fetch brings back only part of the data, how can the aggregation be global? Take the reduceByKey in word count as an example. Suppose there are ten occurrences of the word "hello" but each fetch pulls back only two of them; how is the global aggregation achieved? Spark's approach is to use a HashMap: the aggregation is essentially map.put(key, map.get(key) + 1), i.e., the newly fetched value is combined with what has already been aggregated for that key and put back into the map. Once all the data has been fetched, the global aggregation is complete.
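A minimal sketch of that idea (not Spark's actual aggregator class; the word counts are invented for illustration), followed by the reduceByKey call that triggers it in a real job:

```scala
import scala.collection.mutable

// Each fetched batch is folded into a running HashMap, so the result
// is a correct global aggregation even though data arrives piecemeal.
val counts = mutable.HashMap.empty[String, Int]

def aggregateBatch(batch: Seq[(String, Int)]): Unit =
  for ((word, n) <- batch)
    counts.put(word, counts.getOrElse(word, 0) + n) // map.put(key, map.get(key) + n)

aggregateBatch(Seq(("hello", 1), ("hello", 1))) // first fetch brings two "hello"s
aggregateBatch(Seq(("hello", 1), ("world", 1))) // a later fetch brings more
// counts is now Map("hello" -> 3, "world" -> 1)

// In a real Spark job the same effect comes from:
//   textFile.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
```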
That concludes the discussion of the differences between the Shuffle processes of Hadoop and Spark. Hopefully it has given you a clearer understanding of how the two frameworks move data between the Map and Reduce sides. Thank you for reading.