
How to tune the performance of Spark


Many inexperienced readers do not know how to tune the performance of Spark, so this article summarizes the causes of a typical problem and its solution. I hope it helps you solve similar problems.

0. Background

Some Spark jobs in the cluster run very slowly and fail frequently, and changing parameters alone has neither improved their performance nor cured the frequent, seemingly random errors.

Looking at the job's run history, the average run time is about 3 hours, the duration is extremely unstable, and it occasionally reports errors:

1. Optimization ideas

What does the job's running time depend on?

(1) Size of the data source

With fixed compute resources, a job's running time is positively correlated with the amount of input data. In this case the data volume is basically stable, so fluctuations in log volume can be ruled out as the cause:

(2) Logical defects in the code itself

For example: repeatedly creating and initializing variables, environments, or RDD resources inside the code; persisting data indiscriminately; or heavy use of shuffle operators such as reduceByKey and join.

In this job's roughly 100 lines of code there are three shuffle operations, so the Spark driver splits the job into four stages that execute serially. The code locations are as follows:

What we need to do is reduce shuffles and stages as much as possible from an algorithmic and business point of view and improve parallelism. That is a big topic and will not be discussed in detail here, but the sketch below shows the flavor of one such change.
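
As one hedged illustration (not the job's actual code), aggregating with reduceByKey instead of groupByKey combines values map-side first, so far less data crosses the network during the shuffle. The input path and field layout here are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch, assuming a tab-separated log file whose first field is the key.
val spark = SparkSession.builder().appName("shuffle-demo").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.textFile("hdfs:///path/to/logs")          // hypothetical input path
  .map(line => (line.split("\t")(0), 1L))

// Ships every value for a key across the network before summing.
val countsSlow = pairs.groupByKey().mapValues(_.sum)

// Combines locally within each partition first, shuffling only partial sums.
val countsFast = pairs.reduceByKey(_ + _)
```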

(3) Unreasonable parameter settings

This technique is fairly general. Let's first look at the original core parameter settings:

num-executors = 10 or 20, executor-cores = 1 or 2, executor-memory = 10 or 20 (GB), spark.default.parallelism = 64

Suppose our Spark queue resources are as follows:

memory = 1 TB, cores = 400

There are some tricks to setting these parameters. First of all, you need to understand how Spark allocates and uses resources:

In the default, non-dynamic resource allocation mode, Spark applies for resources up front and holds them exclusively from before the first task starts until every task of the whole job has finished. For example, if you start a spark-shell on a gateway machine and never run anything, it will still hold all of the resources it requested. (If num-executors is set explicitly, dynamic resource allocation is disabled.)
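
For completeness, dynamic allocation can be turned on when exclusive, up-front allocation is not what you want. The following is only a hedged sketch of the relevant settings, not part of the original job's configuration; it assumes the cluster's node managers run the external shuffle service, and the min/max values are placeholders.

```scala
import org.apache.spark.SparkConf

// Sketch: let Spark grow and shrink the executor count with the workload.
val dynConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")    // placeholder lower bound
  .set("spark.dynamicAllocation.maxExecutors", "40")   // placeholder upper bound
  .set("spark.shuffle.service.enabled", "true")        // shuffle files must outlive executors
// Passing --num-executors (spark.executor.instances) switches back to static allocation,
// as noted above.
```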

Note that the way Spark uses and allocates resources is very different from MapReduce/Hive; if you do not understand this, you will run into further problems when setting parameters.

For example, how many executor-cores are appropriate? Too little total parallelism and the job cannot make use of the queue's resources; too much and other users' tasks cannot run. The job above, with num-executors=20, executor-cores=1, and executor-memory=10g, occupies 20 cores and 200 GB of memory for 3 hours.

So for a job like this one, given our available resources, how should these five core parameters be set? (A configuration sketch follows the list below.)

1) executor_cores * num_executors should be neither too small nor too large! As a rule it should not exceed 25% of the queue's total cores: with 400 cores in the queue, for example, the product should be at most 100, and going below 40 is not recommended unless the log volume is very small.

2) executor_cores should not be 1! Otherwise each executor process has too few worker threads; 2-4 is generally appropriate.

3) executor_memory is generally 6-10 GB, and at most 20 GB; beyond that, GC costs become too high or resources are badly wasted.

4) spark.default.parallelism is generally 1-4 times executor_cores * num_executors; the system default is 64. If it is not set appropriately, tasks will execute serially in batches, or a large number of cores will sit idle, wasting resources badly.

5) Some people had previously set driver-memory to 20g. In fact, the driver does no real computation or storage; it only dispatches tasks and interacts with the YARN resource manager and the tasks. Unless you are running spark-shell, 1-2 GB is generally enough.
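
Put together, the guidelines above translate into a configuration along the following lines. This is only a sketch under the assumptions stated in the comments, not the article's actual optimized parameters; the numbers would still be tuned against the 400-core, 1 TB queue.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only: 20 executors * 4 cores = 80 cores (<= 25% of 400),
// 8 GB per executor, a small driver, and parallelism ~2x the total core count.
val conf = new SparkConf()
  .setAppName("daily-log-job")                      // hypothetical job name
  .set("spark.executor.instances", "20")            // num-executors
  .set("spark.executor.cores", "4")                 // 2-4 threads per executor
  .set("spark.executor.memory", "8g")               // 6-10 GB, well under the 20 GB ceiling
  .set("spark.driver.memory", "2g")                 // driver only schedules tasks
  .set("spark.default.parallelism", "160")          // ~2x executor_cores * num_executors

val spark = SparkSession.builder().config(conf).getOrCreate()
```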

Spark Memory Manager:

6) spark.shuffle.memoryFraction (default 0.2), also known as ExecutionMemory. This region holds the buffers needed during shuffles, joins, sorts and aggregations, so as to avoid frequent disk I/O. If your program performs many such operations, you can raise it appropriately.

7) spark.storage.memoryFraction (default 0.6), also known as StorageMemory. This region holds cached blocks (data you cache via rdd.cache or rdd.persist), broadcast variables, and task results. If you persist a lot of data or broadcast large variables, you can increase it appropriately.

8) OtherMemory is reserved for the system, since the program itself also needs memory to run (the default fraction is 0.2). Since Spark 1.6 this reserved portion is guaranteed to be at least 300 MB (it can also be set manually via spark.testing.reservedMemory). Subtracting reservedMemory from the executor memory gives usableMemory. ExecutionMemory and StorageMemory then share usableMemory * 0.75, and the 0.75 can be changed with the newer spark.memory.fraction parameter. Because the default value of spark.memory.storageFraction is 0.5, ExecutionMemory and StorageMemory split that shared pool equally by default.
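
As a worked example of that arithmetic, the snippet below applies the fractions quoted above to a hypothetical 10 GB executor heap; the heap size is an assumption, not a figure from the article.

```scala
// Unified memory model as described above, for an assumed 10 GB executor heap.
val executorMemory  = 10L * 1024 * 1024 * 1024            // 10 GB heap
val reservedMemory  = 300L * 1024 * 1024                  // 300 MB reserved for the system
val usableMemory    = executorMemory - reservedMemory
val memoryFraction  = 0.75                                // spark.memory.fraction
val storageFraction = 0.5                                 // spark.memory.storageFraction

val unifiedPool     = (usableMemory * memoryFraction).toLong  // shared by execution + storage
val storageMemory   = (unifiedPool * storageFraction).toLong  // soft half, can be borrowed
val executionMemory = unifiedPool - storageMemory

println(f"usable = ${usableMemory / 1e9}%.2f GB, " +
        f"storage = ${storageMemory / 1e9}%.2f GB, execution = ${executionMemory / 1e9}%.2f GB")
```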

For example, if you need to load a large dictionary file, you can increase the executor's StorageMemory so that the dictionary stays cached instead of being evicted and re-loaded, which also reduces GC; in that case we are essentially trading memory for execution efficiency.
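
A minimal sketch of that idea follows, assuming the `sc` provided by spark-shell and a hypothetical tab-separated dictionary file; it pins the dictionary in memory so later stages do not re-read and re-parse it.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical dictionary path; assumes the SparkContext `sc` from spark-shell.
val dict = sc.textFile("hdfs:///path/to/dictionary")
  .map { line =>
    val Array(k, v) = line.split("\t", 2)   // assume key<TAB>value lines
    (k, v)
  }
  .persist(StorageLevel.MEMORY_ONLY)        // relies on StorageMemory being large enough

dict.count()   // materialize the cache once, before the stages that reuse it
```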

The optimized parameters are as follows:

The effect is as follows:

(4) Analyzing performance bottlenecks from the execution logs

After the parameter tuning the job still takes about an hour, so where is that hour actually being spent? In my experience, if a single day's data is not particularly large and no complex iterative computation is involved, it should not take more than half an hour.

Since the cluster's Spark History Server has not been installed and configured, the visual execution details of historical jobs cannot be viewed through the Spark web UI, so I wrote a small script to extract the per-stage timing information from the logs, which makes it clear which stage is the problem so it can be optimized in a targeted way.
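
The author's script is not shown, so the following is only a rough sketch of the idea: scan a driver log for the DAGScheduler's "finished in ... s" messages and report the time spent in each stage. The log path and the exact message format are assumptions and vary across Spark versions.

```scala
import scala.io.Source

object StageTimes {
  def main(args: Array[String]): Unit = {
    val logPath   = args.headOption.getOrElse("driver.log")
    val stageLine = """.*(ShuffleMapStage|ResultStage) (\d+) .*finished in ([\d.]+) s.*""".r

    // Collect (stage id, seconds) pairs from matching log lines.
    val durations = Source.fromFile(logPath).getLines().collect {
      case stageLine(_, stageId, seconds) => (stageId.toInt, seconds.toDouble)
    }.toList

    // Print the slowest stages first.
    durations.sortBy(-_._2).foreach { case (stage, secs) =>
      println(f"stage $stage%-4d took $secs%10.1f s")
    }
  }
}
```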

You can see that after optimization the bottleneck is mainly in the final stage, which writes to Redis. Writing 60 GB of data, about 2.5 billion result records, into Redis is a challenge for Redis itself; further optimization can only come from reducing the amount of data written and from the choice of KV database.
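
One common mitigation on the writing side, sketched below with heavy assumptions, is to write from each partition through a pipelined connection instead of one round trip per key. `results` is assumed to be an RDD[(String, String)], the Redis endpoint is a placeholder, and the Jedis client is used only for illustration; none of this comes from the original job.

```scala
import redis.clients.jedis.Jedis

// Sketch: batch writes per partition to cut per-record network overhead.
results.foreachPartition { part =>
  val jedis    = new Jedis("redis-host", 6379)      // placeholder endpoint
  val pipeline = jedis.pipelined()
  var n = 0
  part.foreach { case (key, value) =>
    pipeline.setex(key, 86400, value)               // give bulk keys a TTL so they expire
    n += 1
    if (n % 1000 == 0) pipeline.sync()              // flush in batches
  }
  pipeline.sync()
  jedis.close()
}
```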

(5) Other optimization angles

Of course, optimization and high performance are broad and challenging topics. Beyond the code and parameter levels discussed above, there is also the question of how to prevent or mitigate data skew, which has to be analyzed against specific scenarios and logs and will not be expanded on here.

2. Some misconceptions of Spark beginners

To beginners, Spark can seem omnipotent and high-performance; in the eyes of some bloggers and engineers it is about to replace MapReduce, Hive, and Storm any minute and is the silver bullet for big-data batch processing, machine learning, real-time processing, and more. But is that really the case?

From the case above we can see that being able to call the Spark API and being able to use Spark well are two different things. It requires understanding not only the principles but also the business scenario, and matching the right technical solution and tools to that scenario. There is no silver bullet.

As for Spark performance: if you want it to be fast, you must make full use of system resources, especially memory and CPU. The core idea is that if data can be cached in memory, don't spill it to disk; if work can run in parallel, don't run it serially; and if data can stay local, don't shuffle it.

After reading the above, have you mastered how to tune the performance of Spark? Thank you for reading!
