

How to optimize the performance of Spark

2025-01-30 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

How can the performance of Spark be optimized? This article walks through a detailed analysis and solution to that question, in the hope of helping readers who face the same problem find a simple, workable approach.

Spark is especially well suited to workloads that operate on the same data repeatedly, using storage levels such as memory-only and memory-and-disk. Memory-only is highly efficient but occupies a great deal of memory, so the cost is high; memory-and-disk automatically spills to disk once memory is exhausted, which solves the out-of-memory problem but adds the overhead of moving data in and out. Common tools for Spark tuning include nmon, JMeter, and JProfiler. The following is a worked example of Spark tuning.
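The trade-off between the two storage levels can be sketched as a tiny decision rule (pure Python; `choose_storage_level` is a hypothetical helper for illustration, while in real Spark code the choice is made by passing a `StorageLevel` constant to `persist`):

```python
def choose_storage_level(dataset_gb: float, free_mem_gb: float) -> str:
    """Pick a Spark-style storage level: MEMORY_ONLY is fastest but needs
    the whole dataset to fit in memory; MEMORY_AND_DISK spills the
    remainder to disk at the cost of extra I/O."""
    if dataset_gb <= free_mem_gb:
        return "MEMORY_ONLY"
    return "MEMORY_AND_DISK"

print(choose_storage_level(1.9, 24.0))    # a 1.9 GB file fits in 24 GB of memory
print(choose_storage_level(300.0, 24.0))  # a 300 GB table must spill to disk
```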

1. Scenario: precise customer targeting

A query over a 300 GB customer-information table was optimized on Spark. The wide table has more than 1,800 columns, of which only about 20 are actually used.

2. Result of the optimization: query time dropped from 40.232 s to 2.7 s.

3. Analysis of the optimization process.

Step 1: We first noticed heavy iowait on the disks. From the relevant log files we determined the size of a block and calculated that the whole data file was 300 GB, far more than memory could hold, so we turned to compression. Because the file consisted largely of 0s and 1s, we chose the gzip algorithm; the compressed file was only 1.9 GB. This step cut the query from 40.232 s to 20.12 s.
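The extreme 300 GB-to-1.9 GB ratio is plausible precisely because the data is dominated by repeated 0s and 1s; a small pure-Python demonstration (the payload here is synthetic, not the actual customer table) shows how well gzip handles such data:

```python
import gzip

# A payload dominated by repeated 0s and 1s, like the table's sparse
# flag columns, compresses extremely well with gzip.
raw = b"0,1,0,0,1,0,1,1," * 65_536          # 1 MiB of 0/1 CSV-like data
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)
print(f"{len(raw)} -> {len(packed)} bytes (~{ratio:.0f}x smaller)")
```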

Step 2: The wide table has more than 1,800 columns, but only about 20 are actually used, so the table was stored as RCFile, a columnar format that lets the scan load only the useful columns. This cut the query from 20 s to 12 s.
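A back-of-the-envelope calculation shows why column pruning helps so much here (assuming, for simplicity, roughly equal column widths):

```python
# With columnar storage (RCFile), the scan reads only the ~20 useful
# columns out of 1,800+, so the bytes touched shrink proportionally.
total_gb = 300
total_cols, used_cols = 1800, 20
scanned_gb = total_gb * used_cols / total_cols
print(f"~{scanned_gb:.1f} GB scanned instead of {total_gb} GB")
```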

Step 3: JProfiler analysis showed that CPU load was too high. On closer inspection the serialization mechanism was at fault. Spark offers two serialization frameworks: Java's built-in serializer and Kryo. Kryo is a fast and efficient Java object-graph serialization framework whose main strengths are performance, efficiency, and ease of use. After switching to Kryo, the query dropped from 12 s to 7 s.
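Switching Spark to Kryo is typically a one-line configuration change, shown here as a spark-defaults.conf fragment (frequently serialized classes can additionally be registered with Kryo for the best results):

```properties
spark.serializer  org.apache.spark.serializer.KryoSerializer
```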

Step 4: Further analysis showed that the load across CPU cores was very uneven and memory was not full, so the system's resources were not being fully used. How can they be? Two facts matter: (1) the number of partitions in a Spark RDD determines the number of tasks created; (2) for an RDD read from Hadoop, the number of partitions is determined by the number of blocks.

Memory: total system memory = per-worker memory × number of workers = SPARK_WORKER_MEMORY × SPARK_WORKER_INSTANCES

CPU: total number of task slots = number of workers × cores per worker = SPARK_WORKER_INSTANCES × SPARK_WORKER_CORES

From these we calculated the task parallelism and memory allocation and tuned the parameters:

SPARK_WORKER_INSTANCES=4

SPARK_WORKER_CORES = 3

SPARK_WORKER_MEMORY = 6G

On a machine with 12 CPU cores and 24 GB of memory, tuning these parameters reduced the query from 7 s to 5 s.
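The arithmetic behind these settings can be checked directly; the totals should match the box (12 cores, 24 GB) so that neither CPU nor memory is left idle:

```python
# Worker sizing from the tuned parameters above.
SPARK_WORKER_INSTANCES = 4
SPARK_WORKER_CORES = 3
SPARK_WORKER_MEMORY_GB = 6

total_cores = SPARK_WORKER_INSTANCES * SPARK_WORKER_CORES        # task slots
total_memory_gb = SPARK_WORKER_INSTANCES * SPARK_WORKER_MEMORY_GB
print(total_cores, total_memory_gb)  # 12 24
```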

Step 5: We further found obvious full GCs on the Shark server side. Tuning the parameter

export SHARK_MASTER_MEM=2g

reduced this step from 6 s to 3 s.
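A minimal sketch of how such a heap limit is raised in the environment file (the exact file and variable names depend on the Shark/Spark version; the GC logging flags are an optional assumption here, useful for confirming that the full GCs actually disappear):

```shell
# in shark-env.sh / spark-env.sh (version-dependent; shown as an example)
export SHARK_MASTER_MEM=2g
# optional: log GC activity to verify the fix
export SPARK_JAVA_OPTS="-verbose:gc"
```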

Step 6: We also found that when the two tables were joined, CPU became the bottleneck, because the daily table was gzip-compressed. The fix: stop gzip-compressing the daily table and make it an in-memory table. The query dropped from 3 s to 2 s.
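In Shark/Spark SQL terms this fix can be sketched as follows (the table name `daily` is hypothetical; `CACHE TABLE` materializes the table in memory so the join no longer pays the gzip decompression cost):

```sql
-- Keep the daily table uncompressed so the join is not CPU-bound on gzip,
-- then pin it in memory as a cached table.
CACHE TABLE daily;
```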

That is the practical answer to how Spark performance can be tuned. We hope the content above is of some help; if you still have unresolved questions, you can follow the industry information channel to learn more.





© 2024 shulou.com SLNews company. All rights reserved.
