Spark tuning (2): adjusting parallelism

Parallelism refers to the number of tasks in each stage of a Spark job; it determines how many tasks each stage can run concurrently.

What happens if parallelism is not adjusted and is left too low?

Suppose we have already allocated sufficient resources to our Spark job in the spark-submit script: say 50 executors, each with 10 GB of memory and 3 CPU cores, which roughly reaches the resource limit of the cluster or the YARN queue.
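To make that allocation concrete, here is a minimal sketch of the equivalent configuration; the spark-submit flags named in the comments (--num-executors, --executor-memory, --executor-cores) are the usual way to pass these values, and the application name below is just a placeholder:

import org.apache.spark.SparkConf;

// Sketch only: mirrors the 50 executors x 10 GB x 3 CPU cores example above.
// The same settings are normally passed on the command line as
// --num-executors 50 --executor-memory 10g --executor-cores 3.
SparkConf resourceConf = new SparkConf()
        .setAppName("parallelism-tuning-example")   // placeholder application name
        .set("spark.executor.instances", "50")
        .set("spark.executor.memory", "10g")
        .set("spark.executor.cores", "3");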

Now suppose the number of tasks is not set, or is set very low, say 100 tasks. With 50 executors and 3 CPU cores each, any stage of your application has 150 CPU cores available and could run 150 tasks in parallel. But with only 100 tasks, evenly distributed, each executor is assigned 2 tasks, so only 100 tasks run at the same time and each executor runs only 2 tasks in parallel. The remaining CPU core on each executor sits idle and is wasted.

Although you have allocated enough resources, the parallelism does not match those resources, so the resources you allocated go to waste.

A reasonable parallelism setting should be large enough to make full use of your cluster resources. In the example above, the cluster has 150 CPU cores in total and can run 150 tasks in parallel, so you should set the parallelism of your application to at least 150 to fully use the cluster and let 150 tasks execute in parallel. Raising the task count to 150 also reduces the amount of data each task has to process: if there are 150 GB of data in total, then with 100 tasks each task computes 1.5 GB, whereas with 150 tasks running in parallel each task handles only about 1 GB.

Put simply, as long as parallelism is set appropriately, you make full use of your cluster's compute resources, reduce the amount of data each task processes, and ultimately improve the performance and running speed of the whole Spark job.
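As a quick back-of-the-envelope check of that reasoning (all figures are taken from this article's running example, not universal constants):

// Back-of-the-envelope check of the arithmetic above.
int numExecutors = 50;
int coresPerExecutor = 3;
int totalCores = numExecutors * coresPerExecutor;      // 150 concurrent task slots
double totalDataGb = 150.0;
double gbPerTaskAt100Tasks = totalDataGb / 100;        // 1.5 GB per task
double gbPerTaskAt150Tasks = totalDataGb / totalCores; // 1.0 GB per task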

1. The number of tasks should be at least equal to the total number of CPU cores of the Spark application (ideally, for example, with 150 CPU cores in total, 150 tasks run together and finish at roughly the same time).

2. The official recommendation is to set the number of tasks to 2 to 3 times the total number of CPU cores of the Spark application; with 150 CPU cores, for example, the number of tasks should be roughly 300 to 500.

Reality differs from the ideal case: some tasks run faster, say in 50 seconds, while others are slower and take a minute and a half. If the number of tasks exactly equals the number of CPU cores, resources can still be wasted: with 150 tasks, 10 may finish first while the remaining 140 are still running, leaving 10 CPU cores idle. If instead the number of tasks is 2 to 3 times the total number of CPU cores, then as soon as one task finishes another takes its place, keeping the CPU cores busy as much as possible and improving the efficiency, speed, and overall performance of the Spark job.

3. How to set the parallelism of a Spark Application?

spark.default.parallelism

SparkConf conf = new SparkConf()
    .set("spark.default.parallelism", "500");

"heavy Sword without Front": some technologies and points that really carry weight actually look ordinary and not so "cool", but in fact, these are the first things you should adjust every time you finish a spark assignment and enter the stage of performance tuning (most of the time, the resources and parallelism may be in place, the spark homework will be very fast, and you will finish running in a few minutes).

"cool": data tilt (100 spark jobs, up to 10 will have really serious data tilting problems), colds and fever, you can't just use some folk prescriptions (* *, boil soup with toads); JVM tuning
