Top 10 Spark Performance Optimization Problems and Their Solutions

This article covers the top 10 Spark performance optimization problems and their solutions. The editor has found them very practical and shares them here in the hope that you will take something away after reading.
Problem 1: the number of reduce tasks is inappropriate
Solution:
Adjust the default according to the actual workload by modifying the parameter spark.default.parallelism. Typically, the number of reduce tasks is set to 2-3 times the total number of cores. If the number is too large, many small tasks are created and task-startup overhead grows; if it is too small, each task runs slowly. Tune the number of reduce tasks accordingly, i.e. adjust spark.default.parallelism.
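As a minimal sketch (assuming a cluster with 16 executors of 4 cores each, i.e. 64 cores total; the figures are purely illustrative), the parameter can be set on the SparkConf, or equally passed via --conf on spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative cluster size: 16 executors x 4 cores = 64 cores.
// Reduce-side parallelism is set to ~3x the core count, per the rule above.
val totalCores = 16 * 4
val conf = new SparkConf()
  .setAppName("parallelism-tuning")
  .set("spark.default.parallelism", (totalCores * 3).toString)
val sc = new SparkContext(conf)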
Problem 2: shuffle disk IO takes too long
Solution:
Set spark.local.dir to a list of directories on multiple disks, preferably disks with fast IO, so that shuffle writes are spread across them and shuffle performance improves through parallel IO.
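A hedged sketch of the setting; the mount points below are placeholders for whatever local disks (ideally fast ones, e.g. SSDs) exist on your worker nodes:

import org.apache.spark.SparkConf

// A comma-separated list spreads shuffle spill files across several disks.
// /mnt/disk1 etc. are hypothetical mount points on the worker machines.
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/ssd1/spark")

Note that under YARN this setting is superseded by the NodeManager's configured local directories.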
Problem 3: large numbers of map and reduce tasks generate many small shuffle files
Solution:
Merge the shuffle intermediate files by setting spark.shuffle.consolidateFiles to true; after consolidation, the number of shuffle files is driven by the number of reduce tasks rather than by map tasks x reduce tasks.
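A sketch of the setting, with one caveat worth hedging: spark.shuffle.consolidateFiles belongs to the hash-based shuffle of Spark 1.x; in Spark 2.0+ it no longer exists, and the default sort-based shuffle already writes one consolidated output file per map task.

import org.apache.spark.SparkConf

// Spark 1.x hash shuffle only: merge per-map shuffle outputs so the file
// count scales with reduce tasks (per core) instead of map x reduce.
val conf = new SparkConf()
  .set("spark.shuffle.consolidateFiles", "true")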
Problem 4: serialization takes a long time and the serialized results are large
Solution:
By default, Spark uses the JDK's built-in ObjectOutputStream, which produces large serialized results and long CPU processing times. Set spark.serializer to org.apache.spark.serializer.KryoSerializer instead.
In addition, if a result is already very large, it is best to distribute it as a broadcast variable.
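A minimal sketch of both fixes; the Record class and the lookup table are stand-ins invented for illustration:

import org.apache.spark.{SparkConf, SparkContext}

case class Record(key: String, value: Double) // hypothetical data type

val conf = new SparkConf()
  .setAppName("kryo-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Record])) // optional, shrinks output further
val sc = new SparkContext(conf)

// Ship a large read-only table to each executor once, as a broadcast
// variable, instead of serializing it into the closure of every task.
val bigTable = Map("a" -> 1.0, "b" -> 2.0) // stand-in for a large table
val bcast = sc.broadcast(bigTable)
val scored = sc.parallelize(Seq("a", "b", "c")).map(k => bcast.value.getOrElse(k, 0.0))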
Problem 5: processing a single record is expensive
Solution:
Replace map with mapPartitions: mapPartitions invokes the function once per partition, while map invokes it once for every record in the partition.
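A runnable sketch of the difference; the local master and the DecimalFormat construction are illustrative stand-ins for genuinely expensive per-record setup such as opening a database connection:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("mapPartitions-demo").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 100000, numSlices = 4)

// map: the function body, including any setup, runs once per record.
val perRecord = rdd.map { n =>
  val fmt = new java.text.DecimalFormat("#,###") // built 100000 times
  fmt.format(n.toLong)
}

// mapPartitions: setup runs once per partition and is amortized over
// all of that partition's records.
val perPartition = rdd.mapPartitions { iter =>
  val fmt = new java.text.DecimalFormat("#,###") // built 4 times
  iter.map(n => fmt.format(n.toLong))
}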
Problem 6: collect is slow when the result set is large
Solution:
Internally, collect gathers all results into driver memory as an Array. Instead, have the job write its output directly to the distributed file system, and then inspect the contents of the files there.
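A sketch of the alternative; the HDFS path is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("collect-demo"))
val rdd = sc.parallelize(1 to 1000000).map(n => s"record-$n")

// Anti-pattern: collect() pulls every result record into driver memory.
// val all = rdd.collect()

// Better: let the executors write straight to the distributed file system.
rdd.saveAsTextFile("hdfs:///tmp/spark-results") // path is a placeholder
// Then inspect with e.g.: hdfs dfs -cat /tmp/spark-results/part-00000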
Problem 7: task execution speed is skewed
Solution:
If the data is skewed, it is usually because the partition key was chosen poorly; consider a different partitioning scheme, or add an intermediate aggregation step. If the skew is on the Worker side, for example executors on certain Workers run slowly, set spark.speculation=true so that persistently slow tasks are speculatively re-executed on other nodes.
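A sketch of the speculation settings; the multiplier and quantile lines show tuning knobs with what I believe are their default values, so adjust to taste:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  // How many times slower than the median a task must be before it is
  // speculated, and what fraction of tasks must finish first.
  .set("spark.speculation.multiplier", "1.5")
  .set("spark.speculation.quantile", "0.75")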
Problem 8: many empty or tiny tasks are generated after multi-step RDD operations
Solution:
Use coalesce or repartition to reduce the number of partitions in the RDD.
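A runnable sketch (the local master and the filter are illustrative): a selective filter leaves most of the original partitions nearly empty, and coalesce merges them without a shuffle.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("coalesce-demo").setMaster("local[4]"))

// 1000 partitions, then a filter that keeps ~1% of the data: most
// partitions end up nearly empty, yielding many tiny tasks downstream.
val sparse = sc.parallelize(1 to 100000, numSlices = 1000).filter(_ % 97 == 0)

// coalesce merges partitions without a shuffle; use repartition (i.e.
// coalesce with shuffle = true) when the data also needs rebalancing.
val compact = sparse.coalesce(16)
println(compact.getNumPartitions) // 16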
Problem 9: Spark Streaming throughput is low
Solution:
Set spark.streaming.concurrentJobs so that several streaming jobs can run concurrently.
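A sketch; spark.streaming.concurrentJobs is an undocumented knob whose default is, to my knowledge, 1, and the value 4 below is purely illustrative:

import org.apache.spark.SparkConf

// Allow up to 4 streaming jobs to run concurrently instead of the
// default 1; beware of ordering assumptions between batches.
val conf = new SparkConf()
  .set("spark.streaming.concurrentJobs", "4")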
Problem 10: Spark Streaming suddenly slows down, with frequent task delays and blocking
Solution:
This happens because the job launch interval (the batch interval) is set too short, so each job cannot complete within its allotted time; in other words, the intervals of the created windows are too dense.
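A sketch of the fix, with illustrative numbers: if each batch takes about 8 seconds to process, a 5-second batch interval can never keep up, while a 10-second one can.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Choose the batch interval so that the processing time per batch stays
// below it; 10 seconds here is illustrative.
val conf = new SparkConf().setAppName("interval-demo")
val ssc = new StreamingContext(conf, Seconds(10))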
Those are the top 10 Spark performance optimization problems and their solutions. Some of these points are bound to come up in daily work, and the editor hopes you have learned something new from this article.