Today I'd like to talk about the principles behind Spark JVM tuning, a topic many people find hard to grasp. The following summary should make it clearer; I hope you get something out of this article.
Performance tuning
General performance tuning: resource allocation, parallelism, and so on.
JVM tuning (Java Virtual Machine): JVM-related parameters. Generally speaking, if your hardware and basic JVM configuration are reasonable, the JVM itself does not cause serious performance problems. In troubleshooting, however, the JVM occupies a very important position, because it can cause errors or even failures in online Spark jobs (such as OOM).
Shuffle tuning (very important): tuning the shuffle step that operations such as groupByKey and reduceByKey trigger. Its impact on the performance of a Spark job is substantial. As a rule of thumb, once a Spark job involves shuffle operations, they account for 50%-90% of the job's performance cost: perhaps 10% goes to running operations such as map, and 90% is spent on the shuffle.
Spark operator tuning (also important): some operators perform better than others; for example, foreachPartition can replace foreach.
Applied in the right situations, the effect can be significant; a sketch follows below.
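To make this concrete, here is a minimal Scala sketch contrasting the two operator choices mentioned above: reduceByKey over groupByKey to cut shuffle traffic, and foreachPartition over foreach to pay per-partition setup costs once. The app name and sample data are illustrative only.

import org.apache.spark.{SparkConf, SparkContext}

object OperatorChoices {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("operator-choices").setMaster("local[*]"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Shuffle tuning: reduceByKey combines values on the map side before the
    // shuffle, so far less data crosses the network than with groupByKey.
    val summed = pairs.reduceByKey(_ + _)               // preferred
    // val summed = pairs.groupByKey().mapValues(_.sum) // shuffles every record

    // Operator tuning: foreachPartition does per-partition setup once
    // (e.g. opening one database connection, a hypothetical use) instead of
    // once per record, as a naive foreach-based version would.
    summed.foreachPartition { iter =>
      iter.foreach(println)
    }

    sc.stop()
  }
}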
1. Allocate resources, parallelism, RDD architecture and caching
2. Shuffle tuning
3. Spark operator tuning
4. JVM tuning and broadcasting large variables
Overview of JVM tuning principles
The core official recommendation for JVM tuning is to reduce the fraction of memory reserved for cache operations.
Theoretical basis: Spark is developed in Scala. Don't assume Scala has nothing to do with Java; that is a common mistake. Spark's Scala code calls many Java APIs, Scala runs on the Java Virtual Machine, and therefore Spark runs on the JVM. What kind of problem can the JVM cause? Running out of memory. Our RDD caches and the operator functions that tasks execute may create many objects and occupy a lot of memory; handled poorly, this leads to JVM problems.
Heap memory:
The heap stores the objects we create. It is divided into a young generation and an old generation, and the young generation is further divided into three parts: a relatively large Eden area and two smaller survivor areas. When Spark tasks execute operator functions (the operations we wrote against our RDDs), they may create many objects, all of which are allocated in the JVM's young generation. New objects go into the Eden area; at any given time one of the survivor areas is in use while the other stays free. When Eden and the active survivor area fill up (too many objects created while the Spark job runs), a minor GC, a small garbage collection, is triggered: the collector clears objects that are no longer in use from memory to make room for objects created later.
After the unused objects are cleared, the surviving objects (those that will continue to be used) are moved into the previously free survivor area. A problem can arise here. The default memory split between Eden, survivor 1, and survivor 2 is 8:1:1. If the surviving objects amount to, say, 1.5 times the size of a survivor area, they will not fit. In that case the JVM's guarantee (promotion) mechanism may move the excess objects directly into the old generation (the exact behavior varies across JVM versions). See the sketch below for how this ratio maps onto JVM options.
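For reference, the 8:1:1 split described above corresponds to the HotSpot default -XX:SurvivorRatio=8 (Eden is eight times the size of each survivor space). A hedged sketch of passing young-generation options to the executors follows; the sizes are illustrative, not recommendations, and the exact behavior depends on your JVM:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  // -Xmn sets the young generation size; with SurvivorRatio=8 a 1g young
  // generation gives roughly 800m Eden and 100m per survivor space.
  .set("spark.executor.extraJavaOptions", "-Xmn1g -XX:SurvivorRatio=8")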
If your JVM memory is not large enough, the young generation may fill up frequently, triggering frequent minor GC. With frequent minor GC, some surviving objects never get collected within a short window: objects that are still in use (though not necessarily long-lived) are copied back and forth on every collection, and each collection an object survives makes it one generation older. Once an object has survived too many collections without being reclaimed, it is promoted to the old generation.
To put it bluntly, short-lived objects end up in the old generation. They were supposed to have a short life cycle, yet after bouncing back and forth they land in the old generation, which ideally should hold only a small number of genuinely long-lived objects, such as a database connection pool; connection pool objects are very few.
In short, the old generation may end up hoarding many short-lived objects that should have been collected in the young generation, simply because the young generation ran short of memory. This in turn can make the old generation overflow frequently, triggering frequent full GC (global, full garbage collection). Full GC collects objects in the old generation; because the algorithm is designed around an old generation that holds few objects and overflows rarely, it uses a collection strategy that is not especially sophisticated but is costly in performance and time. Full GC is slow.
Whether fast or slow, both full GC and minor GC cause the JVM's worker threads to stop: stop-the-world. In short, while GC runs, Spark stops working and waits for garbage collection to finish.
When there is not enough memory, the problems are:
Frequent minor GC causes Spark to stop working frequently.
The old generation accumulates lots of surviving short-lived objects, leading to frequent full GC. Full GC takes a long time, anywhere from tens of seconds to minutes or even hours, and may leave Spark stopped for long periods.
This seriously affects the performance and running speed of our Spark jobs.
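To see whether this is happening, one minimal step (assuming Spark 1.x, which the spark.storage.memoryFraction setting used later in this article implies, and Java 8 era HotSpot flags) is to enable GC logging in the executors and read the pause times from the executor logs:

import org.apache.spark.SparkConf

// Standard HotSpot GC-logging flags; output appears in executor stdout.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")

Frequent "Full GC" entries, or minor GC lines appearing every few seconds, match the symptoms just described.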
How do we solve this?
The first point of JVM tuning: reduce the memory fraction used for cache operations
In Spark, heap memory is divided into two regions: one dedicated to caching RDD data, used by RDD cache and persist operations, and the other, as just described, used when Spark operator functions run, to store the objects the functions themselves create.
By default, the memory fraction for RDD cache operations is 0.6: 60% of the memory goes to caching. The problem is that in some cases the cache is not under pressure at all, while the operator functions in tasks create many objects. If the memory left for them is too small, the result is frequent minor GC, or even frequent full GC, making Spark stop working frequently. The performance impact can be significant.
For this situation, check the Spark UI. If you run on YARN, inspect your job's runtime statistics through the YARN web interface; it is simple, just click through layer by layer. You can see how each stage ran, including each task's run time, GC time, and so on. If GC turns out to be too frequent or to take too long, adjust the ratio: reduce the memory fraction for cache operations, use persist to write part of the cached RDD data to disk, or serialize the cached data (in conjunction with the Kryo serializer) to shrink the RDD cache's memory footprint. Reducing the cache fraction correspondingly increases the memory available to operator functions, which can lower the frequency of both minor GC and full GC and helps performance to some extent. In a word: give tasks more memory when they execute operator functions. A sketch follows below.
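Putting those levers together, here is a sketch using the Spark 1.x API (the input path is hypothetical): lower spark.storage.memoryFraction, switch to Kryo serialization, and persist at a level that serializes in memory and spills to disk.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("cache-tuning").setMaster("local[*]")
  .set("spark.storage.memoryFraction", "0.4")  // down from the 0.6 default
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs:///some/input")    // hypothetical input path
// Serialized in memory (a smaller footprint, especially with Kryo), and
// spills to disk when it does not fit, instead of pressuring the heap.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)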
spark.storage.memoryFraction: 0.6 -> 0.5 -> 0.4 -> 0.2
Adjust it yourself, then observe the job's runtime statistics: whether the overall run time improves, whether GC is frequent, how long GC takes, and so on. Tune the ratio step by step according to your needs.
Set ("spark.storage.memoryFraction", "0.5") after reading the above, do you have any further understanding of the tuning principle of spark-JVM? If you want to know more knowledge or related content, please follow the industry information channel, thank you for your support.