In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >
Share
Shulou(Shulou.com)05/31 Report--
This article introduces the relevant knowledge of "what is the use of ort shuffle". In the operation of actual cases, many people will encounter such a dilemma. Then let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!
Spark implements a variety of shuffle methods, which are determined by spark.shuffle.manager. For the time being, there are a total of three: hash shuffle, sort shuffle, and tungsten-sort shuffle, which defaults to sort shuffle since 1.2.0.
From 1.2.0 onwards, the default is sort shuffle (spark.shuffle.manager = sort), the implementation logic is similar to Hadoop MapReduce,Hash Shuffle, each reducers produces a file, but Sort Shuffle only produces a file that can be indexed by reducer id, so that you only need to get the location information about the relevant data blocks in the file, and fseek can read the data of the specified reducer. But for cases where the number of rueducer is relatively small, Hash Shuffle is obviously faster than SortShuffle, so SortShuffle has a "fallback" plan, for the number of reducers is less than "spark.shuffle.sort.bypassMergeThreshold" (200by default), we use the fallback plan, hashing related data to separate files, and then merge these files into one, specifically implemented as BypassMergeSortShuffleWriter.
Sort in map and merge with Timsort [1] on the reduce side. Whether spill is allowed on the map side is set through spark.shuffle.spill. The default is true. Set to false, if there is not enough memory to store the output of map, it will cause OOM errors, so use it with caution.
The memory used to store map output is "JVM Heap Size"\ * spark.shuffle.memoryFraction\ * spark.shuffle.safetyFraction, and the default is "JVM Heap Size"\ * 0.2\ * 0.8 = "JVM Heap Size"\ * 0.16. If you run multiple threads in the same executor (set spark.executor.cores/ spark.task.cpus more than 1), the storage space for each map task is "JVM Heap Size" * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction / spark.executor.cores * spark.task.cpus, with a default of 2 cores, then 0.08 * "JVM Heap Size". Spark uses AppendOnlyMap to store map output data, and uses the open source hash function MurmurHash4 and square detection to store key and value in the same array. This method of saving can be spark for combine. If spill is true, it will sort before spill.
For more details on the source code level of Sort Shuffle memory, please refer to [4], and for the reading and writing process, please refer to [5].
# # advantages
Less files are created by map
A small number of IO random operations, mostly sequential read and write
# # disadvantages
To be slower than Hash Shuffle, you need to set the appropriate value yourself through spark.shuffle.sort.bypassMergeThreshold.
If you use an SSD disk to store shuffle data, then Hash Shuffle may be more appropriate.
This is the end of the content of "what's the use of ort shuffle"? thank you for your reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.