Today I would like to talk about how to tune Spark on YARN clusters, a topic that many people may not understand well. To help you understand it better, I have summarized the points below; I hope you get something out of this article.
Jar package management: in the spark-defaults.conf file, set spark.yarn.jars to the Spark jars stored on HDFS. Otherwise, on every application submission Spark will distribute the jar packages under SPARK_HOME on the driver side to every node, wasting disk and network resources.
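A minimal sketch of this setup, assuming a Spark 2.x layout (on Spark 1.x the equivalent key is spark.yarn.jar and the jars live in a single assembly); the HDFS path and my_app.py are placeholders:

```bash
# One-time setup: put the Spark jars on HDFS (paths/URIs are placeholders).
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put "$SPARK_HOME"/jars/*.jar /spark/jars/

# Point spark.yarn.jars at them; in practice this key usually lives in
# spark-defaults.conf as:  spark.yarn.jars  hdfs:///spark/jars/*.jar
spark-submit \
  --master yarn \
  --conf spark.yarn.jars="hdfs:///spark/jars/*.jar" \
  my_app.py
```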
Application failures caused by insufficient YARN queue resources. This kind of problem is mainly addressed by tuning how jobs are submitted:
1. In the J2EE middle tier, submit jobs through a thread pool, and set the thread pool size to 1 so that only one job is submitted at a time.
2. If there is only one application, you can give it the maximum resources available in the queue.
3. If some Spark applications are clearly long-running, split the resources owned by Spark into two categories (long-running jobs and fast jobs) and use two thread pools to submit them, each with a thread pool size of 1 (see the sketch after this list).
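The article describes doing this with thread pools of size 1 in the J2EE middle tier (typically Java code). As a language-neutral sketch of the same idea, the shell script below runs two "workers" in parallel, each submitting its own category of jobs strictly one at a time; the queue names and job files are made up for illustration:

```bash
# Two "single-threaded" submitters, one per job category (names are placeholders).
# Each subshell submits its jobs one after another, like a thread pool of size 1.
(
  for job in etl_heavy_1.py etl_heavy_2.py; do        # long-running jobs
    spark-submit --master yarn --queue spark_slow "$job"
  done
) &
(
  for job in report_quick_1.py report_quick_2.py; do  # fast jobs
    spark-submit --master yarn --queue spark_fast "$job"
  done
) &
wait   # wait for both categories to finish
```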
Data locality: distributed storage is at the core of big data technology. To make big data computation efficient, move the computation close to the data and reduce the network I/O and disk I/O caused by shipping large amounts of data around.
The locality levels in Spark are: PROCESS_LOCAL (process-local, the most efficient), NODE_LOCAL (node-local), NO_PREF (no preference), RACK_LOCAL (rack-local), and ANY. In practice, we want most of the computation to run at the PROCESS_LOCAL or NODE_LOCAL level.
Tuning methods:
1. Optimize the algorithm.
2. Set a reasonable number of data replicas.
3. Cache commonly used RDDs.
4. Set the relevant Spark parameters: spark.locality.wait, spark.locality.wait.process, spark.locality.wait.node and spark.locality.wait.rack (see the example after this list).
Run in client mode and observe the logs: if most tasks run at PROCESS_LOCAL while the application runtime also drops, the tuning is effective. It is not acceptable to sacrifice application runtime just to raise the data locality level; that only leads to a large amount of idle resources and long waits.
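As an illustration of point 4, the wait values below are only placeholder starting points; the idea is to run in client mode, watch the locality level reported for tasks in the driver log, and only raise the waits while the overall runtime keeps going down:

```bash
# Example values only; tune while watching the locality levels in the driver log.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.locality.wait=6s \
  --conf spark.locality.wait.process=6s \
  --conf spark.locality.wait.node=3s \
  --conf spark.locality.wait.rack=3s \
  my_app.py
```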
Executors are frequently killed, with errors such as "Container killed by YARN for exceeding memory limits"; running out of memory causes this kind of problem. Possible remedies:
Remove unnecessary RDD caches.
spark.storage.memoryFraction: the fraction of executor memory used for Spark's data cache, 0.6 by default, i.e. 60% of executor memory can be used to persist data. Once the cache reaches this limit, data may no longer be cached or may be spilled to disk. When executors are frequently killed, lower this value.
spark.yarn.executor.memoryOverhead: this parameter controls the off-heap memory requested for each executor in YARN mode. By default it is 10% of the executor memory (with a minimum of 384 MB); raise it when containers are killed for exceeding memory limits (see the example after this list).
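A hedged example of both knobs. The numbers are arbitrary starting points, and spark.storage.memoryFraction only applies to the legacy memory manager used before Spark 1.6's unified memory management:

```bash
# Example only: lower the cache fraction and raise the off-heap overhead (in MB)
# step by step until containers stop being killed for exceeding memory limits.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.storage.memoryFraction=0.4 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my_app.py
```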
JVM memory overflow (PermGen or stack) in yarn-cluster mode:
For the JVM permanent generation (a "PermGen out of memory" error in the logs), enlarge PermGen by setting spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M" (see the example below).
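For example (this only matters on Java 7 and earlier, since Java 8 removed the permanent generation; in yarn-cluster mode the option must be passed at submit time rather than set inside the running application):

```bash
# Enlarge the driver's PermGen in yarn-cluster mode (sizes are examples).
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M" \
  my_app.py
```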
In spark-sql, break complex SQL statements into several simpler SQL statements and process them step by step; deeply nested SQL is what causes the JVM stack overflow (see the example below).
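A hypothetical illustration using the spark-sql CLI and Spark 2.x SQL syntax: rather than one deeply nested statement, stage the intermediate result as a temporary view. The table and column names here are invented:

```bash
# All names below are made up; the point is the staging, not the schema.
spark-sql -e "
CREATE OR REPLACE TEMPORARY VIEW order_totals AS
SELECT user_id, SUM(amount) AS total
FROM orders
GROUP BY user_id;

SELECT u.region, SUM(t.total) AS region_total
FROM users u
JOIN order_totals t ON u.id = t.user_id
GROUP BY u.region;
"
```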
After reading the above, do you have a better understanding of how to tune Spark on YARN clusters? If you want to learn more, keep following related content. Thank you for your support.