Shuffle write
In the figure above, four ShuffleMapTasks run on the same worker node, which has 2 CPU cores, so two tasks can run at the same time. ShuffleMapTasks executed one after another on the same core can share a single output file, the ShuffleFile. The ShuffleMapTask that executes first writes a ShuffleBlock i; the ShuffleMapTask that executes later appends its output data directly after it, forming ShuffleBlock i'. Each ShuffleBlock is called a FileSegment.

When does shuffle read fetch data? It does not fetch until all ShuffleMapTasks of the parent stage have finished.

Fetch while processing, or process only after a one-time fetch? Fetch while processing. Using a data structure that supports aggregation, such as a HashMap, each record deserialized out of the buffered FileSegment is put directly into the HashMap. If the key already exists in the HashMap, the record is aggregated in place via func(hashMap.get(Key), Value).

Where is the fetched data stored? FileSegments that have just been fetched are held in a softBuffer, and the processed data is placed in memory + disk.

How does a reducer know where the data to fetch is located? When a reducer shuffles, it asks the MapOutputTrackerMaster in the driver for the locations of the data output by the ShuffleMapTasks. Each ShuffleMapTask, on completion, reports the storage locations of its FileSegments to the MapOutputTrackerMaster.
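As a concrete illustration of the fetch-while-processing aggregation described above, here is a minimal Scala sketch. The names fetchedRecords and aggregateWhileFetching are illustrative, not Spark's actual internal API: each record deserialized from a fetched FileSegment is folded into a HashMap as soon as it arrives.

```scala
import scala.collection.mutable

// Sketch of fetch-while-processing aggregation; names are illustrative,
// not Spark's actual internals.
def aggregateWhileFetching[K, V](
    fetchedRecords: Iterator[(K, V)], // records streaming in from fetch
    func: (V, V) => V                 // combine function, e.g. _ + _ for reduceByKey
): mutable.HashMap[K, V] = {
  val hashMap = mutable.HashMap.empty[K, V]
  for ((key, value) <- fetchedRecords) {
    hashMap.get(key) match {
      case Some(existing) => hashMap(key) = func(existing, value) // key seen: aggregate in place
      case None           => hashMap(key) = value                 // first occurrence: insert
    }
  }
  hashMap
}
```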
HashMap in Shuffle read

HashMap is a frequently used data structure for aggregation during Spark's shuffle read. Spark has two designs: the all-in-memory AppendOnlyMap and the memory-plus-disk ExternalAppendOnlyMap.
AppendOnlyMap

AppendOnlyMap is similar to a HashMap but has no remove(key) method. Its implementation principle is very simple: it opens one large Object array in which keys and values occupy alternating slots (the blue and white cells in the figure). When the utilization of the array reaches 70%, the array is doubled in size and all keys are rehashed, rearranging each key's location.
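A minimal sketch of the AppendOnlyMap idea, assuming linear probing and ignoring details the real Spark class handles (null keys, quadratic probing, destructive sorted iteration): keys and values live in alternating slots of one flat array, and the array doubles and rehashes once it is 70% full.

```scala
// Simplified AppendOnlyMap: insert/update only, no remove(key).
// Keys sit at even indices, values at odd indices, of one flat array.
class SimpleAppendOnlyMap[K, V](initialCapacity: Int = 64) {
  private var capacity = initialCapacity
  private var data = new Array[AnyRef](2 * capacity)
  private var curSize = 0

  def update(key: K, value: V): Unit = {
    var pos = (key.hashCode() & 0x7fffffff) % capacity
    while (true) {
      val curKey = data(2 * pos)
      if (curKey == null) {                        // empty slot: insert new key
        data(2 * pos) = key.asInstanceOf[AnyRef]
        data(2 * pos + 1) = value.asInstanceOf[AnyRef]
        curSize += 1
        if (curSize > 0.7 * capacity) grow()       // past 70% load: double and rehash
        return
      } else if (curKey == key) {                  // existing key: overwrite value
        data(2 * pos + 1) = value.asInstanceOf[AnyRef]
        return
      }
      pos = (pos + 1) % capacity                   // linear probing on collision
    }
  }

  private def grow(): Unit = {
    val oldData = data
    val oldCapacity = capacity
    capacity *= 2
    data = new Array[AnyRef](2 * capacity)
    curSize = 0
    var i = 0
    while (i < oldCapacity) {                      // rehash all keys into the bigger array
      if (oldData(2 * i) != null)
        update(oldData(2 * i).asInstanceOf[K], oldData(2 * i + 1).asInstanceOf[V])
      i += 1
    }
  }
}
```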
ExternalAppendOnlyMap

ExternalAppendOnlyMap holds an AppendOnlyMap inside it. Each (K, V) record coming from shuffle is inserted into the AppendOnlyMap first, and the insert process is exactly the same as for a plain AppendOnlyMap. When the AppendOnlyMap is almost full, it checks whether the remaining memory space is enough to expand; if so, it expands directly in memory, and if not, it sorts the AppendOnlyMap and spills all of its records to disk. Each spill produces one spilledMap file on disk, after which a new AppendOnlyMap is created.

After the last (K, V) record has been inserted, all records from shuffle have been put into the ExternalAppendOnlyMap, but this does not mean the records have been fully processed: on each insert, a new record is aggregated only with the records currently in the in-memory AppendOnlyMap, not with the records that have already been spilled to disk. So when the final aggregation result is needed, a global merge-aggregate has to be performed over the AppendOnlyMap and all the spilledMaps, as sketched below.

The global merge-aggregate works as follows: first, the records in the AppendOnlyMap are sorted, forming a sortedMap. Then a chunk of data (a StreamBuffer) is read from the sortedMap and from each spilledMap and put into a mergeHeap; all records in one StreamBuffer share the same hash(key). The mergeHeap, as the name implies, uses heap ordering to repeatedly extract the StreamBuffers with the same, smallest hash(firstRecord.Key), moves them one by one into the mergeBuffers, and merge-combines them with the StreamBuffers already in the mergeBuffers.
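The following sketch compresses that insert-and-spill flow into a few lines. It is illustrative only: "disk" is modelled as an in-memory buffer of hash-sorted spill files, and the final merge heap over StreamBuffers is collapsed into a simple groupBy, which is far less memory-efficient than the real streaming merge.

```scala
import scala.collection.mutable

// Rough sketch of ExternalAppendOnlyMap's insert/spill flow. Illustrative only:
// "disk" is an in-memory buffer of hash-sorted spill files, and the final
// merge heap is collapsed into a groupBy.
class SimpleExternalMap[K, V](func: (V, V) => V, spillThreshold: Int = 10000) {
  private var currentMap = mutable.HashMap.empty[K, V]          // the in-memory AppendOnlyMap
  private val spilledMaps = mutable.ArrayBuffer.empty[Seq[(K, V)]]

  def insert(key: K, value: V): Unit = {
    // a new record aggregates only with records in the current in-memory map
    currentMap(key) = currentMap.get(key).map(func(_, value)).getOrElse(value)
    if (currentMap.size >= spillThreshold) spill()
  }

  private def spill(): Unit = {
    // sort by hash(key) so spilled files could later be merged stream-wise
    spilledMaps += currentMap.toSeq.sortBy(_._1.hashCode())
    currentMap = mutable.HashMap.empty[K, V]                    // start a fresh map
  }

  // global merge-aggregate: a key's records may be split across several
  // spilled maps plus the current map, so they must be combined at the end
  def result(): Map[K, V] = {
    val all = spilledMaps.flatten ++ currentMap
    all.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }
  }
}
```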
Sort Based Shuffle

During the shuffle write phase of Sort Based Shuffle, tasks on the map side sort records by partition id and key. All results are then written into a single data file, and an index file is generated alongside it; through that index file, tasks on the reduce side can locate the data relevant to them.
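A toy sketch of that write path, with file I/O replaced by in-memory structures (the function name and signature are illustrative, not Spark's): records are sorted by (partition id, key), laid out contiguously as one "data file", and an index records where each partition's records begin.

```scala
// Toy sketch of the sort-based shuffle write path; file I/O is replaced by
// in-memory structures. `partitioner` must map keys into [0, numPartitions).
def sortBasedShuffleWrite[K: Ordering, V](
    records: Seq[(K, V)],
    numPartitions: Int,
    partitioner: K => Int
): (Vector[(K, V)], Array[Int]) = {
  // map-side sort by (partition id, key), then lay records out in one "data file"
  val dataFile = records.sortBy { case (k, _) => (partitioner(k), k) }.toVector

  // index(p) = offset of the first record of partition p; index(numPartitions) = end
  val index = new Array[Int](numPartitions + 1)
  var p = 0
  for (i <- dataFile.indices) {
    val part = partitioner(dataFile(i)._1)
    while (p <= part) { index(p) = i; p += 1 }
  }
  while (p <= numPartitions) { index(p) = dataFile.length; p += 1 }

  (dataFile, index)
}
```

A reduce task responsible for partition p then reads only the slice of the data file between index(p) and index(p + 1).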