
How to implement Spark Shuffle


This article explains how Spark implements shuffle. The content is concise and easy to follow, and I hope you will take something useful away from the detailed introduction.

For a big data computing framework, the design of the shuffle phase is one of the key factors that determine performance. This article introduces Spark's current shuffle implementation and briefly compares it with MapReduce's.

(1) Basic concepts and common shuffle implementations

Shuffle is an operator that expresses a many-to-many dependency. In a MapReduce-style computing framework, it is the link between the Map phase and the Reduce phase: each Reduce Task reads one piece of data from the output of every Map Task, so in the extreme case shuffle creates M*R data copy channels (where M is the number of Map Tasks and R is the number of Reduce Tasks). Shuffle is usually divided into two parts: data preparation on the Map side and data copying on the Reduce side. First, the Map phase must split each Map Task's output into shards according to the number of Tasks in the Reduce phase. There are several ways to store these shards (a partitioning sketch follows the list):

1) in memory or on disk (both Spark and MapReduce store them on disk);

2) one file per shard (the way Spark currently does it, and the way MapReduce did years ago), or all shards in a single data file plus an index file that records each shard's offset within the data file (the way MapReduce does it now).
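As a concrete illustration of the first step, here is a minimal Scala sketch (my own toy code, not Spark's implementation) of how a Map Task can decide which Reduce-side shard each record belongs to, by hashing the key modulo R:

```scala
// A minimal sketch, not Spark's actual code: assign each map-output record
// to one of R reduce-side shards by hashing its key.
object PartitionSketch {
  // Mask off the sign bit so negative hashCodes still land in [0, R).
  def shardFor(key: Any, numReduceTasks: Int): Int =
    (key.hashCode & Int.MaxValue) % numReduceTasks

  def main(args: Array[String]): Unit = {
    val r = 3 // hypothetical number of Reduce Tasks (R)
    val records = Seq("spark" -> 1, "shuffle" -> 1, "map" -> 1, "reduce" -> 1)
    // Bucket the map output by destination shard, as a map-side step.
    val shards = records.groupBy { case (k, _) => shardFor(k, r) }
    shards.toSeq.sortBy(_._1).foreach { case (shard, recs) =>
      println(s"shard $shard: $recs")
    }
  }
}
```

Spark's built-in hash partitioner follows the same non-negative-modulo idea.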

On the Map side, the different storage schemes have their own pros, cons, and applicable scenarios. In general, shuffle data on the Map side is kept on disk for fault tolerance, to avoid the huge cost of recomputation (if it were held in the Reduce side's memory, the death of a single Reduce Task would force every Map Task to rerun). But there are still many ways to lay the data out on disk. MapReduce's early design used the scheme Spark currently uses (and keeps improving): each Map Task generates one file per Reduce Task, holding only the data that that particular Reduce Task will process. This produces a very large number of files: if M and R are both large, say 1,000 each, 1,000,000 files are generated, and creating and reading them causes massive random IO, which is very inefficient. The intuitive fix is to reduce the number of files. The common approaches are:

1) merge all the files generated by the Maps on one node into a single large file (the scheme MapReduce currently adopts);

2) have each node generate {(number of slots) * R} files (Spark's optimized scheme). A brief explanation of the latter: in both MapReduce 1.0 and Spark, each node's resources are abstracted into a number of slots, and since one Task occupies one slot, the slot count is the maximum number of Tasks running simultaneously. If a Job has far more Tasks than the limited number of slots, it may need to run in several waves; then only the first wave needs to create the {(number of slots) * R} files, and later waves append their data to the ends of those same files.

The latter scheme therefore reduces the number of files generated by large jobs; the sketch below puts rough numbers on the difference.
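To make this concrete, here is a back-of-the-envelope Scala calculation (my own example; only M = R = 1,000 comes from the text, and the cluster shape of 10 nodes with 8 slots each is a made-up assumption):

```scala
// A toy calculation (not from the article) contrasting the two schemes.
// The cluster shape (10 nodes, 8 slots each) is a made-up assumption.
object FileCountSketch {
  def main(args: Array[String]): Unit = {
    val m = 1000          // number of Map Tasks (from the text)
    val r = 1000          // number of Reduce Tasks (from the text)
    val nodes = 10        // hypothetical node count
    val slotsPerNode = 8  // hypothetical slots per node

    // Naive scheme: one file per (Map Task, Reduce Task) pair.
    val naive = m.toLong * r
    // Consolidated scheme: each node keeps slots * R files, and later
    // task waves append to them instead of creating new ones.
    val consolidated = nodes.toLong * slotsPerNode * r

    println(f"naive scheme:        $naive%,d files")        // 1,000,000
    println(f"consolidated scheme: $consolidated%,d files") //    80,000
  }
}
```

Under these assumptions the consolidated scheme produces 80,000 files instead of 1,000,000, and the gap widens as the number of Map Tasks grows relative to the total slot count.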

On the Reduce side, each Task starts multiple threads that pull data from many Map Tasks concurrently, because the main job of the Reduce phase is to aggregate the data group by group: the data must be divided into groups so that it can be processed one group at a time. There are many ways to group data; the two common ones are Map/HashTable (records with the same key go into the same value list) and Sort (sort by key, so records with the same key end up adjacent and form a group). Each has its trade-offs: the hash approach has low complexity and high efficiency but requires all the data to fit in memory, while the sort approach has higher complexity but can handle large data sets with the help of disk (external sorting). Spark adopted the hash scheme early on and added the sort scheme in recent versions, while MapReduce chose the sort-based scheme from the very beginning. Both strategies are illustrated in the sketch below.
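Here is a toy Scala sketch of both grouping strategies (in-memory simplifications of my own, not the actual Spark or MapReduce code); `sortGroup` only hints at why sorting enables external, group-at-a-time processing:

```scala
// Toy in-memory versions of the two reduce-side grouping strategies.
object GroupingSketch {
  type Record = (String, Int)

  // Strategy 1: hash-based grouping. O(n), but all data must fit in memory.
  def hashGroup(records: Seq[Record]): Map[String, Seq[Int]] =
    records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Strategy 2: sort-based grouping. O(n log n), but because equal keys
  // become adjacent after sorting, the pass can spill to disk (external
  // sort) and stream one group at a time.
  def sortGroup(records: Seq[Record]): Seq[(String, Seq[Int])] = {
    val sorted = records.sortBy(_._1)
    // Walk the sorted run; a group ends wherever the key changes.
    sorted.foldLeft(List.empty[(String, List[Int])]) {
      case ((k, vs) :: rest, (key, v)) if k == key => (k, v :: vs) :: rest
      case (acc, (key, v))                         => (key, List(v)) :: acc
    }.reverse.map { case (k, vs) => k -> vs.reverse }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "b" -> 5)
    println(hashGroup(data)) // Map(a -> List(1, 3), b -> List(2, 5), c -> List(4))
    println(sortGroup(data)) // List((a, List(1, 3)), (b, List(2, 5)), (c, List(4)))
  }
}
```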

(2) The history of MapReduce Shuffle

[Phase 1]: MapReduce Shuffle was not built in one step. In the beginning (before version 0.10.0), it adopted the "R files per Map Task" scheme. As mentioned above, this scheme generates a large amount of random read and write IO, which is very bad for big data processing.

[Phase 2]: to avoid the flood of files produced by Map Tasks, HADOOP-331 optimized the scheme by giving each Map Task a circular memory buffer. Once the buffer fills, its contents are spilled to disk (together with an index file that saves each partition's offset); the resulting spill files are finally merged into one file, again with an index file recording each partition's offset. A simplified illustration of this spill-and-index idea follows.
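The following deliberately simplified Scala sketch illustrates the idea (it counts records rather than managing a byte-level circular buffer as the real implementation does; the file names, the threshold, and the assumption of '\n' line endings are all made up for illustration):

```scala
import java.io.{File, PrintWriter}

// A toy version of the spill-and-index idea from HADOOP-331: buffer map
// output in memory and, when the buffer fills, sort by partition and write
// one spill file plus an index of per-partition start offsets.
class SpillSketch(numPartitions: Int, maxBuffered: Int) {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[(Int, String)]
  private var spillId = 0

  def collect(partition: Int, record: String): Unit = {
    buffer += ((partition, record))
    if (buffer.size >= maxBuffered) spill()
  }

  // Sort buffered records by partition, write them out, and record where
  // each partition's run starts (the "index file" from the text).
  def spill(): Unit = {
    if (buffer.isEmpty) return
    val out = new PrintWriter(new File(s"spill-$spillId.txt"))
    val index = Array.fill(numPartitions)(-1L)
    var offset = 0L
    for ((p, rec) <- buffer.sortBy(_._1)) {
      if (index(p) < 0) index(p) = offset // first record of this partition
      val line = s"$p\t$rec"
      out.println(line)
      offset += line.length + 1 // assumes '\n' line endings
    }
    out.close()
    println(s"spill-$spillId partition offsets: ${index.mkString(",")}")
    buffer.clear()
    spillId += 1
  }
}

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val s = new SpillSketch(numPartitions = 2, maxBuffered = 3)
    Seq((0, "a"), (1, "b"), (0, "c"), (1, "d")).foreach { case (p, r) => s.collect(p, r) }
    s.spill() // flush whatever remains buffered
  }
}
```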

[Phase 3]: this phase did not change the shuffle architecture but optimized shuffle's circular buffer. Before Hadoop 2.0, tuning the Map-phase buffer of a MapReduce job was very complex and involved multiple parameters, because the buffer was split into two parts: one holding index entries (such as the partition, and the offsets and lengths of key and value) and the other holding the actual data, and both affect the number of spill files. Tuning several parameters to match the characteristics of the data was therefore very cumbersome. MAPREDUCE-64 solved this by letting the index and the data share a single circular buffer instead of two independently used regions, so only one parameter is needed to control the spill frequency.

[Phase 4 (in progress)]: shuffle is currently embedded in the Reduce phase as a sub-phase. Since Map Tasks and Reduce Tasks can run at the same time in the MapReduce model, Reduce Tasks started early in a job sit in the shuffle phase until all Map Tasks have finished, occupying resources whose utilization is very low (essentially only IO). A good way to improve resource utilization is to pull shuffle out of the Reduce phase into an independent phase/service, with a dedicated shuffler service responsible for copying data. Baidu has already implemented this (ready for open source?), and the benefits are obvious. For more information, see MAPREDUCE-2354.

(3) The history of Spark Shuffle

So far, the development of Spark Shuffle closely mirrors that of MapReduce. Early on, Spark used the "R files per Map Task" approach on the Map side and hash-based grouping on the Reduce side, but as Spark grew popular, users gradually hit serious bottlenecks when processing big data and began to optimize and improve it. The related work includes External Sorting for Aggregator and CoGroupedRDDs, "Optimizing Shuffle Performance in Spark", and "Consolidating Shuffle Files in Spark"; the motivations and ideas behind these optimizations are very similar to MapReduce's.

Spark's earlier design relied too heavily on memory, so some large jobs that ran fine on MapReduce could not run directly on Spark (they might hit OOM problems). For now, Spark's handling of very large data sets is still imperfect, and users need to migrate jobs to Spark selectively, based on each job's characteristics, rather than migrating everything. As Spark improves, the designs of many of its key internal modules will grow very similar to those of Tez, the upgraded version of MapReduce.

The above is how Spark implements shuffle. Have you picked up any new knowledge or skills? If you want to learn more, you are welcome to follow the industry information channel.
