How to tune Spark SQL
This article gives a detailed explanation of how to tune Spark SQL. The editor finds it very practical and shares it here as a reference; I hope you get something out of it after reading.
1. JVM tuning
JVM tuning is an endless topic and hard to generalize about. The practical advice is to give the job as much memory as you can afford; do not blindly tweak JVM parameters when you do not understand the JVM or the data characteristics of your task.
See also the memory and GC tuning article in this Spark tuning series.
2. Memory tuning
Cache table
Spark 2.x uses:
spark.catalog.cacheTable("tableName") to cache a table and spark.catalog.uncacheTable("tableName") to remove it from the cache.
Spark 1.x uses:
sqlContext.cacheTable("tableName") to cache a table and sqlContext.uncacheTable("tableName") to remove it from the cache.
Spark SQL caches only the columns that are actually needed and automatically tunes the compression codec to reduce memory usage and GC pressure.
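As a concrete illustration, here is a minimal sketch of the Spark 2.x API, written for spark-shell, where a SparkSession named spark is predefined; the view name "sales" and the input path are placeholders, not part of the original article.

```scala
// Runs in spark-shell (Spark 2.x), where a SparkSession named `spark` is predefined.
// The path and the view name "sales" are placeholders.
spark.read.parquet("/data/sales").createOrReplaceTempView("sales")

// Cache the table in the in-memory columnar format; only the columns that
// are actually queried get materialized.
spark.catalog.cacheTable("sales")

spark.sql("SELECT count(*) FROM sales").show()

// Release the cached data once it is no longer needed.
spark.catalog.uncacheTable("sales")
```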
Attribute | Default value | Description
spark.sql.inMemoryColumnarStorage.compressed | true | When set to true, Spark SQL automatically selects a compression codec for each column based on statistics of the data.
spark.sql.inMemoryColumnarStorage.batchSize | 10000 | Controls the batch size for columnar caching. Larger batches improve memory utilization and compression, but caching the data carries a risk of OOM.
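A minimal sketch of adjusting these two properties at runtime, assuming the spark session and the "sales" view from the sketch above; the batch size shown is illustrative, not a recommendation.

```scala
// Assumes the `spark` session and the "sales" view from the sketch above.
// Both settings must be in place before the table is cached.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")

// A smaller batch size lowers the OOM risk at the cost of some memory
// utilization and compression ratio; 5000 is purely illustrative.
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "5000")

spark.catalog.cacheTable("sales")
```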
3. Broadcast
When joining a large table with a small one, broadcasting the small table to all worker nodes is a good way to improve performance. Spark exposes two parameters for this, and they differ slightly between versions; this article uses Spark 2.2.1 as the example.
Attribute | Default value | Description
spark.sql.broadcastTimeout | 300 | Timeout in seconds to wait for the broadcast to complete.
spark.sql.autoBroadcastJoinThreshold | 10485760 (10 MB) | Maximum size of a table that will be broadcast to all worker nodes when performing a join. Set to -1 to disable the feature. Statistics are currently only supported for Hive Metastore tables.
In practice, broadcast variables are sometimes of little use. They only show their real value when a large number of tasks, across stages, reuse the same data; if the job has only a few tasks, it hardly matters whether the data is broadcast or not.
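The sketch below shows one way to adjust the two parameters and, alternatively, to force a broadcast with the broadcast() hint from org.apache.spark.sql.functions; the DataFrames, paths, join key and values are placeholders assumed for illustration.

```scala
import org.apache.spark.sql.functions.broadcast

// Assumes a predefined SparkSession named `spark` (e.g. in spark-shell).
// Raise the automatic broadcast threshold to 50 MB and allow the broadcast
// more time to complete; both values are illustrative.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
spark.conf.set("spark.sql.broadcastTimeout", "600")

// The broadcast() hint asks Spark to broadcast the small side regardless of
// the threshold. The paths and the join key "country_id" are placeholders.
val largeDf = spark.read.parquet("/data/events")
val smallDf = spark.read.parquet("/data/dim_country")
val joined  = largeDf.join(broadcast(smallDf), Seq("country_id"))
joined.show()
```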
4. Controlling the number of shuffle partitions
The number of shuffle partitions is set by spark.sql.shuffle.partitions, which defaults to 200.
Some teams process relatively little data with Spark SQL and run with limited resources; for them the default number of shuffle partitions is too high and should be reduced to improve performance.
Other teams process very large offline datasets with ample resources; for them the default is clearly too low and should be increased.
What value is appropriate is largely a matter of experience.
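A minimal sketch of adjusting the setting in spark-shell; the two values are illustrative alternatives for the small-data and large-data cases described above, not recommendations.

```scala
// Assumes a predefined SparkSession named `spark`. The default is 200;
// the values below are illustrative alternatives, not recommendations.

// Small inputs and limited resources: fewer partitions reduce scheduling
// overhead and the cost of many tiny tasks.
spark.conf.set("spark.sql.shuffle.partitions", "50")

// Large offline jobs with ample resources: more partitions keep individual
// tasks from becoming too large.
// spark.conf.set("spark.sql.shuffle.partitions", "1000")
```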
5. Files and partitions
Two parameters can be adjusted here:
One controls how much data a single partition takes in when reading files.
The other is the cost of opening a file, which can be understood as the threshold for merging small files.
The cost of opening a file is expressed in bytes: Spark estimates it as the number of bytes that could be scanned in the same amount of time.
The parameters are described as follows:
Attribute | Default value | Description
spark.sql.files.maxPartitionBytes | 134217728 (128 MB) | The maximum number of bytes to pack into a single partition when reading files.
spark.sql.files.openCostInBytes | 4194304 (4 MB) | The estimated cost of opening a file, measured by the number of bytes that could be scanned in the same time. This is used when multiple files are put into the same partition. It is better to over-estimate; partitions with many small files will then be processed faster than partitions with large files (and are scheduled first).
The value of spark.sql.files.maxPartitionBytes should be tuned together with the degree of parallelism you want and the memory available per task.
spark.sql.files.openCostInBytes is, plainly speaking, the threshold for merging small files: files smaller than this value are merged into the same partition.
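A minimal sketch of adjusting both properties before reading data, assuming an existing SparkSession named spark; the values and the path are illustrative.

```scala
// Assumes a predefined SparkSession named `spark`. Both settings must be
// in place before the files are read; values and path are illustrative.

// Pack at most 64 MB into each input partition (more partitions and higher
// parallelism, at the cost of more tasks).
spark.conf.set("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)

// Treat opening a file as costing the equivalent of scanning 8 MB, so more
// small files get packed together into the same partition.
spark.conf.set("spark.sql.files.openCostInBytes", (8 * 1024 * 1024).toString)

val events = spark.read.parquet("/data/events")
println(events.rdd.getNumPartitions)
```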
6. File format
Parquet or ORC is recommended. Parquet alone already delivers a large part of the possible performance gain.
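For completeness, a minimal sketch of writing and reading Parquet, assuming an existing DataFrame named df and the spark session from the earlier sketches; the paths are placeholders.

```scala
// Assumes an existing DataFrame `df` and a predefined SparkSession `spark`;
// the paths are placeholders.
df.write.mode("overwrite").parquet("/data/events_parquet")

val parquetDf = spark.read.parquet("/data/events_parquet")
parquetDf.printSchema()

// ORC is written and read the same way:
// df.write.mode("overwrite").orc("/data/events_orc")
```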
This concludes the article on how to tune Spark SQL. I hope the content above is helpful and that you learn something from it. If you found the article useful, please share it so more people can see it.