
How to Perform Online JVM Tuning

2025-03-12 Update From: SLTechnology News&Howtos


This article mainly explains how to perform online JVM tuning; interested friends may wish to take a look. The method introduced here is simple, fast, and practical. Let's learn how to perform online JVM tuning together!

A case study: investigating and fixing excessively long YGC pauses

After a new version of our advertising service went live, we received a large number of service-timeout alarms. The monitoring showed timeouts suddenly spiking across the board, with some interfaces timing out thousands of times per minute. The following is a detailed account of how we tracked the problem down.

1. Check the monitoring

After receiving the alarms, we checked the monitoring system right away and immediately noticed that YoungGC (YGC) was taking too long. Our new version went live at about 21:50. Before the release, YGC basically completed within tens of milliseconds; after the release, YGC times became noticeably longer, peaking at more than 3 seconds.

Since the application is stopped (stop-the-world) during YGC, and the service timeout our upstream systems set is only a few hundred milliseconds, we inferred that the long YGC pauses were causing the widespread service timeouts.

Following the normal GC troubleshooting routine, we immediately took one node out of service and dumped its heap memory to preserve the scene, using the following command:

jmap -dump:format=b,file=heap pid

Finally, the online service was rolled back. After the rollback the service immediately returned to normal, and a day of troubleshooting and fixing followed.

2. Confirm JVM configuration

We then double-checked the JVM parameters the service was started with.
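
The flags of a running JVM can be inspected with, for example (one common approach; the exact command used at the time is not reproduced here):

jinfo -flags <pid>

or, equivalently, jcmd <pid> VM.flags.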

You can see that the heap is 4G, with the young generation and the old generation each taking 2G, and the young generation uses the ParNew collector.

Running jmap -heap <pid> showed that the Eden region of the young generation is 1.6G, and the S0 and S1 regions are 0.2G each.
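
For reference, a set of startup flags consistent with these numbers (4G heap, 2G young generation, Eden 1.6G, Survivors 0.2G each, ParNew) would look roughly as follows. This is a reconstruction, not the service's actual command line; the old-generation collector is assumed to be CMS, which ParNew is normally paired with.

-Xms4g -Xmx4g -Xmn2g -XX:SurvivorRatio=8 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC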

This release did not modify any JVM-related parameters, and the request volume of our service was basically the same as usual. So we guessed that the problem was most likely related to the newly released code.

3. Code inspection

Going back to how YGC works, the YGC process mainly consists of the following two steps:

Scan from the GC Roots and mark the surviving objects

Copy the surviving objects to the Survivor space (S1) or promote them to the old generation

The monitoring showed that under normal circumstances Survivor-space utilization stayed very low (around 30M), but after the release it started to fluctuate, at times coming close to the full 0.2G. Moreover, YGC time was positively correlated with Survivor utilization. We therefore speculated that a growing number of long-lived objects was making the marking and copying steps take longer.
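
One way to check a hypothesis like this directly (not something shown in this write-up) is to have the JVM print the age distribution of Survivor objects at each YGC:

-XX:+PrintGCDetails -XX:+PrintTenuringDistribution

Objects that keep reaching higher ages instead of dying young are exactly the long-lived objects suspected here.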

Looking at the service as a whole: upstream traffic had not changed significantly, the core interface normally responds within 200ms, and YGC runs roughly once every 8 seconds.

Since requests finish well within a single YGC interval, their local variables should already be unreachable and immediately collectable at each YGC. So why were so many objects surviving YGC?

That narrowed the suspects down to global variables or effectively static variables in the program. But when we diffed the code for this release, we did not find any such variables being introduced.

4. Analyze the dumped heap file

With the code review going nowhere, we started looking for clues in the heap dump. After importing the heap file dumped in step 1 into the MAT tool, we used the Dominator Tree view to see all the large objects currently on the heap.

We immediately noticed that the NewOldMappingService class occupied a lot of space. Locating it in the code showed that the class lives in a third-party client package provided by our company's product team and is used to convert between old and new categories (the product team had recently been reworking the category system and, for compatibility with old business, needed to map old categories to new ones).

Looking further at the code, we found that this class contains a large number of static HashMaps, used to cache the data needed when converting between old and new categories, in order to reduce RPC calls and improve conversion performance.
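
The pattern described is roughly the following. This is a simplified, illustrative sketch; the real NewOldMappingService is not reproduced in this article, and loadMapping is a hypothetical stand-in for its data loading.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class NewOldMappingService {
    // Populated once when the class is loaded (100M+ in total), then only read.
    private static final Map<Long, Long> OLD_TO_NEW = new HashMap<>();

    static {
        // Load the full old/new category mapping up front to avoid per-request RPC calls.
        OLD_TO_NEW.putAll(loadMapping());
    }

    public static Long toNewCategory(Long oldCategoryId) {
        return OLD_TO_NEW.get(oldCategoryId);
    }

    // Hypothetical stand-in for fetching the mapping from the product team's service.
    private static Map<Long, Long> loadMapping() {
        return Collections.emptyMap();
    }
}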

At first this seemed very close to the root cause, but a deeper look showed that all the static variables of this class are initialized when the class is loaded; although they occupy more than 100M of memory, essentially no new data is added afterwards. Moreover, this class had been live since March, and the version of the client package had not changed.

Given the analysis above, the static HashMaps of this class survive indefinitely and, after enough rounds of YGC, end up promoted to the old generation; they should not be the reason YGC takes too long. We therefore set this suspect aside for the time being.

5. Analyze the time YGC spends processing References

The team had very little experience troubleshooting YGC problems and did not know how to dig further. After going through basically every case we could find online, the causes fell into two categories:

Marking surviving objects takes too long: for example, overriding Object.finalize() makes processing FinalReferences slow, or misuse of String.intern() makes YGC spend too long scanning the StringTable.

Too many long-lived objects accumulate: for example, a local cache used improperly piles up too many surviving objects, or lock contention blocks threads and extends the lifetime of local variables.

For the first category, you can have the GC report how long it spends processing References by adding the -XX:+PrintReferenceGC flag. After adding this flag we could see that the processing time for each reference type was very short, so this factor was ruled out.
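
In practice this flag is combined with the usual GC log flags, for example (an illustrative JDK 8-style setting, not the exact command line used here):

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintReferenceGC -Xloggc:/path/to/gc.log

Each YGC entry then includes per-type timings for SoftReference, WeakReference, FinalReference, PhantomReference, and JNI Weak Reference, which is what lets you confirm whether reference processing is the bottleneck.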

6. Back to analyzing long-lived objects

After that, we added various GC flags to look for more clues, but to no avail, and we seemed to be out of ideas. Combining the monitoring with the analysis so far, only long-lived objects could explain the problem.

After several more hours of toil, we finally caught a break: a teammate found a second suspect in the MAT heap analysis.

In the MAT view, the ConfigService class, ranked third among the large objects, caught our attention: an ArrayList field of this class contained about 2.7 million objects, and most of them were duplicates of the same elements.

The ConfigService class comes from the third-party Apollo client package, but its source had been customized by our company's architecture department. From the code, the problem is in the getConfig method: every time it is called, an element is added to the list, with no deduplication.

Our ad service stores a large number of ad-policy configurations in Apollo, and most requests call ConfigService.getConfig to read them, so new objects were constantly being appended to the static namespaces variable, causing this problem.
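
A minimal sketch of the leak pattern as described (illustrative only; the customized ConfigService source is not reproduced here, and fetchConfig is a hypothetical stand-in for the real lookup):

import java.util.ArrayList;
import java.util.List;

public class ConfigService {
    // The static field that grew to roughly 2.7 million, mostly duplicate, entries.
    private static final List<String> namespaces = new ArrayList<>();

    public static String getConfig(String namespace) {
        // Bug: an element is appended on every call, with no duplicate check,
        // so hot config reads grow this list without bound.
        namespaces.add(namespace);
        return fetchConfig(namespace);
    }

    // Hypothetical stand-in for the real cache/remote lookup.
    private static String fetchConfig(String namespace) {
        return "config-for-" + namespace;
    }
}

Because the list is reachable from a static field, every appended element survives each YGC until it is eventually promoted, which matches the growing Survivor utilization and YGC times observed earlier.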

At this point, the whole problem finally came to light. The bug had been introduced accidentally by the architecture department while customizing the Apollo client package; it clearly had not been tested carefully and was pushed to the central repository just one day before our release. Because the versions of the company's base component libraries are managed centrally through a super-pom, the business teams were not even aware of the change.

7. Solution

To quickly verify that the long YGC pauses were indeed caused by this problem, we swapped in the old version of the Apollo client package on one server, restarted the service, and watched it for nearly 20 minutes: YGC returned to normal.

Finally, we had the architecture department fix the bug and re-release the super-pom, which completely solved the problem.
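
The actual fix is not shown here, but continuing the sketch above, one straightforward way to close this kind of leak is to record each namespace at most once, for example with a concurrent set:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConfigService {
    // Hypothetical fix: a concurrent set ignores duplicates and is safe under concurrent requests.
    private static final Set<String> namespaces = ConcurrentHashMap.newKeySet();

    public static String getConfig(String namespace) {
        namespaces.add(namespace); // no-op if the namespace has already been recorded
        return fetchConfig(namespace);
    }

    // Hypothetical stand-in for the real cache/remote lookup.
    private static String fetchConfig(String namespace) {
        return "config-for-" + namespace;
    }
}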

At this point, I believe everyone has a deeper understanding of how to perform online JVM tuning; now go and try it out in practice!
