

How to tune Spark SQL


This article explains in detail how to tune Spark SQL. The content is quite practical, so it is shared here as a reference; I hope you get something out of it after reading.

1. JVM tuning

JVM tuning is an endless topic and the options can get messy. The simplest advice is to add memory wherever you can; adjusting JVM parameters rarely helps unless you understand both the JVM itself and the data characteristics of your tasks.

See also the memory and GC tuning article in this Spark tuning series.
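If memory does need to be raised, a minimal sketch of setting it in code is shown below. The application name and the memory value are illustrative only, and in practice executor memory is usually passed to spark-submit (--executor-memory) rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the app name is hypothetical and "8g" is not a recommendation;
// the right value depends entirely on your cluster and data volume.
val spark = SparkSession.builder()
  .appName("sparksql-tuning-demo")
  .config("spark.executor.memory", "8g")   // giving executors more memory is usually the easiest win
  .getOrCreate()
```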

2. Memory tuning

Cache table

Spark 2.x uses:

spark.catalog.cacheTable("tableName") to cache the table, and spark.catalog.uncacheTable("tableName") to remove it from the cache.

Spark 1.x uses:

sqlContext.cacheTable("tableName") to cache and sqlContext.uncacheTable("tableName") to uncache.

Spark SQL caches tables in an in-memory columnar format: it scans only the required columns and automatically tunes the compression to reduce memory usage and GC pressure.
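A minimal sketch of the Spark 2.x API follows. It assumes an existing SparkSession named spark; the table name "events" is hypothetical.

```scala
// Assumes a table or view named "events" (hypothetical) is registered in the catalog.
spark.catalog.cacheTable("events")      // cache the table in the in-memory columnar format

val hits = spark.sql("SELECT count(*) FROM events WHERE status = 200")
hits.show()                             // this and subsequent queries read from the cache

spark.catalog.uncacheTable("events")    // release the cached data when it is no longer needed

// Spark 1.x equivalent:
// sqlContext.cacheTable("events")
// sqlContext.uncacheTable("events")
```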

Attribute: spark.sql.inMemoryColumnarStorage.compressed
Default value: true
Description: When set to true, Spark SQL automatically selects a compression codec for each column based on statistics of the data.

Attribute: spark.sql.inMemoryColumnarStorage.batchSize
Default value: 10000
Description: Controls the batch size for the columnar cache. Larger batches improve memory utilization and compression, but caching large batches carries a higher risk of OOM errors.
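Both settings can be changed at runtime. A minimal sketch follows, assuming an existing SparkSession named spark; the batch size of 20000 is only an example of trading memory (and OOM risk) for better compression.

```scala
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")   // the default, shown for completeness
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")   // example value: larger batches compress better but risk OOM
```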

3. Broadcast

When joining a large table with a small one, broadcasting the small table to all worker nodes is a good way to improve performance. Spark provides two tunable parameters for this, and they differ slightly between versions; this article uses Spark 2.2.1 as the example.

Attribute: spark.sql.broadcastTimeout
Default value: 300
Description: Timeout in seconds for the broadcast wait time in broadcast joins.

Attribute: spark.sql.autoBroadcastJoinThreshold
Default value: 10485760 (10 MB)
Description: The maximum size in bytes of a table that will be broadcast to all worker nodes when performing a join. Set to -1 to disable broadcasting. Statistics are currently only supported for Hive Metastore tables.

In practice, broadcasting is not always a win: it only pays off when many tasks across stages reuse the same data. If a job has only a handful of tasks, whether the data is broadcast or not makes little difference.
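A minimal sketch of a broadcast join follows, assuming an existing SparkSession named spark; the table names orders and dim_products and the 20 MB threshold are hypothetical.

```scala
import org.apache.spark.sql.functions.broadcast

// Raise the automatic broadcast threshold to 20 MB (set to -1 to disable) and
// allow a longer wait for the broadcast itself.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
spark.conf.set("spark.sql.broadcastTimeout", "600")

val orders = spark.table("orders")         // hypothetical large table
val dim    = spark.table("dim_products")   // hypothetical small table

// Explicitly mark the small side for broadcast, regardless of the threshold.
val joined = orders.join(broadcast(dim), Seq("product_id"))
joined.show()
```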

4. Controlling the number of shuffle partitions

The number of shuffle partitions is set by spark.sql.shuffle.partitions, which defaults to 200.

Some teams process relatively little data with Spark SQL and run with few resources; in that case 200 shuffle partitions is too many, and the number should be reduced appropriately to improve performance.

Other teams process very large offline datasets with ample resources; there, 200 shuffle partitions is clearly not enough, and the number should be increased accordingly.

What value is appropriate depends largely on experience.
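As a starting point, the setting can be changed per job at runtime; the values below are illustrative only and assume an existing SparkSession named spark.

```scala
spark.conf.set("spark.sql.shuffle.partitions", "50")     // small data volume, few executors

// ... run the small job ...

spark.conf.set("spark.sql.shuffle.partitions", "1000")   // large offline job with ample resources
```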

5. Files and partitions

There are two parameters that can be adjusted:

One is how much data a partition accepts when reading a file.

The other is the cost of opening files, which is commonly understood as the threshold for merging small files.

The cost of opening a file is measured in bytes: Spark estimates it as the number of bytes of data that could be scanned in the same amount of time it takes to open the file.

The parameters are described as follows:

Attribute: spark.sql.files.maxPartitionBytes
Default value: 134217728 (128 MB)
Description: The maximum number of bytes to pack into a single partition when reading files.

Attribute: spark.sql.files.openCostInBytes
Default value: 4194304 (4 MB)
Description: The estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. Used when multiple files are placed into the same partition. It is better to over-estimate this value: partitions with small files will then be processed faster than partitions with large files (which are scheduled first).

The value of spark.sql.files.maxPartitionBytes should be tuned together with the degree of parallelism you want and the amount of memory available.

Put plainly, spark.sql.files.openCostInBytes is the threshold for merging small files: files smaller than this threshold are merged into the same partition.
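A minimal sketch of adjusting both settings at runtime follows, assuming an existing SparkSession named spark; the 64 MB and 8 MB values are examples, not recommendations.

```scala
spark.conf.set("spark.sql.files.maxPartitionBytes", (64 * 1024 * 1024).toString)  // at most ~64 MB per read partition
spark.conf.set("spark.sql.files.openCostInBytes", (8 * 1024 * 1024).toString)     // treat files under ~8 MB as cheap to merge
```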

6. File format

Parquet or ORC is recommended; switching to Parquet alone already brings a significant performance improvement.
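A minimal sketch of writing and reading Parquet, assuming an existing SparkSession named spark and a hypothetical DataFrame df; the output path is illustrative.

```scala
df.write.mode("overwrite").parquet("/tmp/events_parquet")   // columnar, compressed on disk

val parquetDf = spark.read.parquet("/tmp/events_parquet")
parquetDf.printSchema()
```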

This concludes the article on how to tune Spark SQL. I hope the content above has been helpful and that you have learned something from it; if you found the article useful, please share it so that more people can see it.
