
How to diagnose and solve long young-generation GC pauses

2025-04-17 Update, from SLTechnology News&Howtos (Shulou.com)

Long young-generation GC pauses can be hard to diagnose, especially for engineers who have not met the problem before. This article walks through the cause of one such pause and a way to work around it.

Problem description

One of the company's rule engine systems is warmed up manually after every release. When traffic is switched in, there is an occasional young-generation GC pause of 1 to 2 seconds (traffic is not heavy, and this happens on every instance behind the load balancer).

After this long pause, every subsequent young-generation GC pause returns to the normal range of 20-100 ms.

Two seconds may not look long, but it is unacceptable next to the rule engine's usual response time of about 10 ms, and when the rule engine times out, the order request times out and fails as well.

Problem analysis

The system's GC log shows that the 2 s pause occurs in the Young GC phase, and that every long-pause Young GC is accompanied by the promotion of young-generation objects to the old generation.

Core JVM parameters (Oracle JDK7)

-Xms10G

-Xmx10G

-XX:NewSize=4G

-XX:PermSize=1g

-XX:MaxPermSize=4g

-XX:+UseConcMarkSweepGC


The first young generation GC log after startup

2020-04-23T16:28:31.108+0800: [GC2020-04-23T16:28:31.108+0800: [ParNew2020-04-23T16:28:31.229+0800: [SoftReference, 0 refs, 0.0000950 secs] 2020-04-23T16:28:31.229+0800: [WeakReference, 1156 refs, 0.0001040 secs] 2020-04-23T16:28:31.229+0800: [FinalReference, 10410 refs, 0.0103720 secs] 2020-04-23T16:28:31.240+0800: [PhantomReference, 286 refs, 2 refs 0.0129420 secs] 2020-04-23T16:28:31.253+0800: [JNI Weak Reference, 0.0000000 secs]

Desired survivor size 214728704 bytes, new threshold 1 (max 15)

-age 1: 315529928 bytes, 315529928 total

-age 2: 40956656 bytes, 356486584 total

-age 3: 8408040 bytes, 364894624 total

: 3544342K-> 374555K (3774912K), 0.1444710 secs] 3544342K-> 374555K (10066368K), 0.1446290 secs] [Times: user=1.46 sys=0.09, real=0.15 secs]


Long pause young generation GC log

2020-04-23T17:18:28.514+0800: [GC2020-04-23T17:18:28.514+0800: [ParNew2020-04-23T17:18:29.975+0800: [SoftReference, 0 refs, 0.0000660 secs] 2020-04-23T17:18:29.975+0800: [WeakReference, 1224 refs, 0.0001400 secs] 2020-04-23T17:18:29.975+0800: [FinalReference, 8898 refs, 0.0149670 secs] 2020-04-23T17:18:29.990+0800: [PhantomReference, 1224 refs, 1 refs 0.0344300 secs] 2020-04-23T17:18:30.025+0800: [JNI Weak Reference, 0.0000210 secs]

Desired survivor size 214728704 bytes, new threshold 15 (max 15)

-age 1: 79203576 bytes, 79203576 total

: 3730075K-> 304371K (3774912K), 1.5114000 secs] 3730075K-> 676858K (10066368K), 1.5114870 secs] [Times: user=6.32 sys=0.58, real=1.51 secs]


This long-pause GC log shows that promotion occurred: after the Young GC, about 363 MB of objects were promoted to the old generation, and this promotion is the likely cause of the long pause. (The safepoint statistics were also checked and showed nothing abnormal.)

Because -XX:+PrintHeapAtGC was not enabled for this log, the promoted size is calculated by hand:

Promoted size = reduction in young-generation usage - reduction in whole-heap usage

(3730075K - 304371K) - (3730075K - 676858K) = 372487K (about 363 MB)
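The arithmetic above can be sketched as a small helper. This is only an illustration: the class and method names are hypothetical, and the inputs are the "before->after" figures copied by hand from the ParNew log lines shown earlier.

```java
// Hypothetical helper (not from the original article): derives the promoted
// volume from the two "before->after" pairs of a ParNew log line.
final class PromotionSize {

    /** All arguments are in KB, exactly as printed in the GC log. */
    static long promotedKb(long youngBefore, long youngAfter,
                           long heapBefore, long heapAfter) {
        long youngFreed = youngBefore - youngAfter; // everything that left the young generation
        long heapFreed = heapBefore - heapAfter;    // what was actually reclaimed
        return youngFreed - heapFreed;              // the remainder moved to the old generation
    }

    public static void main(String[] args) {
        // First GC after startup: young and heap figures are identical.
        long first = promotedKb(3544342, 374555, 3544342, 374555);
        // Long-pause GC.
        long longPause = promotedKb(3730075, 304371, 3730075, 676858);

        System.out.println("first GC promoted KB:      " + first);
        System.out.println("long-pause GC promoted KB: " + longPause);
        System.out.println("long-pause GC promoted MB: " + longPause / 1024);
    }
}
```

Because the log rounds every figure to whole kilobytes, a result that is off by a kilobyte or so should be read as "no promotion".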

The next young-generation GC log

2020-04-23T17:23:39.749+0800: [GC2020-04-23T17:23:39.749+0800: [ParNew2020-04-23T17:23:39.774+0800: [SoftReference, 0 refs, 0.0000500 secs] 2020-04-23T17:23:39.774+0800: [WeakReference, 3165 refs, 0.0002720 secs] 2020-04-23T17:23:39.774+0800: [FinalReference, 3520 refs, 0.0021520 secs] 2020-04-23T17:23:39.776+0800: [PhantomReference, 150 refs, 1 refs 0.0051910 secs] 2020-04-23T17:23:39.782+0800: [JNI Weak Reference, 0.0000100 secs]

Desired survivor size 214728704 bytes, new threshold 15 (max 15)

-age 1: 17076040 bytes, 17076040 total

-age 2: 40832336 bytes, 57908376 total

: 3659891K-> 90428K (3774912K), 0.0321300 secs] 4032378K-> 462914K (10066368K), 0.0322210 secs] [Times: user=0.30 sys=0.00, real=0.03 secs]


At first glance nothing looks wrong, but on closer inspection something is abnormal: why were objects promoted at only the second GC after the program started?

The cause is dynamic age determination. The promotion (tenuring) age threshold is not fixed at 15; the JVM recalculates it after every young GC.

Promotion mechanism of the younger generation

To adapt to the memory behavior of different programs, the virtual machine does not always require an object's age to reach MaxTenuringThreshold before promoting it to the old generation. If the total size of all objects of the same age in the Survivor space exceeds half of the Survivor space, objects of that age or older can enter the old generation directly, without waiting to reach MaxTenuringThreshold.

This is how the book "In-depth Understanding of the Java Virtual Machine" describes it: the promotion age threshold of an object is determined dynamically.

However, other material and my own verification show that the actual behavior differs somewhat from the book's explanation (or the book's explanation is not precise enough).

In practice, objects are grouped by age, and each age group carries a cumulative total: the combined size of all objects whose age is less than or equal to that age. The groups are examined in ascending order of age, and as soon as a cumulative total exceeds half of the survivor space, the promotion age threshold is set to that age. If no total exceeds it, the threshold stays at the maximum (15 in the logs above).

Note: exceeding half of the survivor space does not promote anything by itself. It only resets the promotion threshold, and the new threshold is not used until the next GC.
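The rule described above can be sketched in a few lines of Java. This is a simplified model for illustration, not the JVM's actual source: the class and method names are hypothetical, and it assumes the target is half of the survivor space, which matches the "Desired survivor size" printed in the logs.

```java
// Illustrative model only; names are hypothetical and this is not JVM source code.
final class TenuringThresholdSketch {

    /**
     * bytesByAge[0] is the total size of age-1 objects, bytesByAge[1] of age-2 objects, and so on.
     * Returns the promotion age threshold that the NEXT young GC would use.
     */
    static int newThreshold(long[] bytesByAge, long desiredSurvivorBytes, int maxThreshold) {
        long cumulative = 0;
        for (int age = 1; age <= bytesByAge.length && age <= maxThreshold; age++) {
            cumulative += bytesByAge[age - 1];
            if (cumulative > desiredSurvivorBytes) {
                return age; // first age whose cumulative total exceeds the target
            }
        }
        return maxThreshold; // the target was never exceeded
    }

    public static void main(String[] args) {
        long desired = 214728704L; // "Desired survivor size" from the logs
        int max = 15;

        long[] firstGc = {315529928L, 40956656L, 8408040L};
        long[] longPauseGc = {79203576L};
        long[] nextGc = {17076040L, 40832336L};

        System.out.println("after first GC:      " + newThreshold(firstGc, desired, max));
        System.out.println("after long-pause GC: " + newThreshold(longPauseGc, desired, max));
        System.out.println("after next GC:       " + newThreshold(nextGc, desired, max));
    }
}
```

Fed with the age tables from the three logs, the model returns 1, 15 and 15, the same "new threshold" values the JVM printed.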

3544342K-> 374555K (3774912K), 0.1444710 secs] Young generation

3544342K-> 374555K (10066368K), 0.1446290 secs] full heap


The first GC log above confirms this. In that GC, whole-heap usage changed by exactly the same amount as young-generation usage, so no objects were promoted.

The first GC only set the threshold to 1: half of the survivor space is 214728704 bytes, and the total size of age-1 objects is 315529928 bytes, which exceeds the desired survivor size, so the threshold was set to age 1 after this GC.

This is where the promotion age threshold was updated to 1:

Desired survivor size 214728704 bytes, new threshold 1 (max 15)

-age 1: 315529928 bytes, 315529928 total

-age 2: 40956656 bytes, 356486584 total

-age 3: 8408040 bytes, 364894624 total


As an aside, here is how to read this age distribution output:

-age 1: 315529928 bytes, 315529928 total


"age 1" identifies the group of objects of age 1, and "315529928 bytes" is the amount of memory occupied by objects of that age.

"315529928 total" is a cumulative value: the total size of all objects whose age is less than or equal to the age of the current group. Objects are first grouped by age. The total for age 1 is the size of age 1 alone (the preceding "xxx bytes"), the total for age 2 is the size of age 1 + age 2, and the total for age n is the size of age 1 + age 2 + ... + age n. The accumulation rule is shown in the following figure.

When the cumulative total of a group exceeds survivor/2, the promotion threshold is updated.
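As a quick check of the accumulation rule, the sketch below parses age-distribution lines and verifies that each printed "total" equals the running sum of the "bytes" column. The class name, method name and regular expression are assumptions made for this illustration; the sample lines are the ones from the first GC log.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker for -XX:+PrintTenuringDistribution output; not part of the original article.
final class AgeTableCheck {

    // Matches lines such as "-age 1: 315529928 bytes, 315529928 total".
    private static final Pattern AGE_LINE =
            Pattern.compile("age\\s+(\\d+):\\s+(\\d+) bytes,\\s+(\\d+) total");

    /** Returns true when every printed "total" equals the running sum of the "bytes" column. */
    static boolean totalsAreCumulative(String[] lines) {
        long running = 0;
        for (String line : lines) {
            Matcher m = AGE_LINE.matcher(line);
            if (!m.find()) {
                continue; // skip lines that are not part of the age table
            }
            running += Long.parseLong(m.group(2));
            if (running != Long.parseLong(m.group(3))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] firstGc = {
            "Desired survivor size 214728704 bytes, new threshold 1 (max 15)",
            "-age 1: 315529928 bytes, 315529928 total",
            "-age 2: 40956656 bytes, 356486584 total",
            "-age 3: 8408040 bytes, 364894624 total"
        };
        System.out.println("totals are cumulative: " + totalsAreCumulative(firstGc));
    }
}
```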

In the second young-generation GC (the "long pause young generation GC log" above), the promotion age threshold is now 1, so every object that has already survived one GC and is still reachable is promoted.

This GC promoted about 363 MB of objects, which caused the long pause.

Thinking

Is the JVM's "dynamic object age determination" really reasonable? Personally, I think the mechanism is a good one that adapts well to the memory behavior of different programs, but it does not suit every scenario. The case in this article, a GC shortly after startup, is one where it causes problems.

Right after the program starts, most objects have an age of 0 or 1, so it is easy to end up with a large volume of live objects of age 1. Dynamic age determination then sets the new promotion threshold to 1, which promotes objects that should not be promoted.

For example, suppose a Young GC occurs while the program is still initializing and loading resources. The loading logic is still running, so many newly created objects are still reachable at the time of this GC.

After this GC the age of these objects becomes 1, and because of dynamic age determination the promotion age threshold is set to the age of this largest group, that is, the objects that have just survived one GC.

Shortly after this GC, resource initialization completes and the objects involved are likely to become unreachable. But the promotion age threshold has just been set to 1, so the next ordinary Young GC promotes this group of age-1 objects directly, leading to premature or mistaken promotion.

Solution

The documentation and other material I consulted indicate that dynamic age determination cannot be disabled, so the only way to solve this problem is to work around the calculation rule.

Dynamic age determination triggers when the cumulative size of objects in the Survivor space exceeds half of the Survivor space, so the workaround follows directly from the mechanism.

We know our system well and know roughly how much memory resource loading needs, so we can make the survivor space large enough that these temporarily reachable objects fit within half of it.

For example, in the log above the objects of age 1 after the first GC total 315529928 bytes (about 300 MB), and the desired survivor size (survivor size / 2) is 214728704 bytes (about 204 MB), so the survivor space should be set to more than 600 MB.

To be safe, set the survivor space to 800 MB, so that the desired survivor size is about 400 MB. After the first Young GC, the total size of age-1 objects then no longer exceeds the desired survivor size, so the promotion age threshold is not lowered, and the long GC pause caused by premature or mistaken promotion does not occur.

The survivor size cannot be specified directly, but it can be adjusted through the ratio set by -XX:SurvivorRatio.

-XX:SurvivorRatio=8

SurvivorRatio is the ratio of Eden to a single Survivor space. A value of 8 means the two Survivor spaces and Eden are in the ratio 2:8, that is, one Survivor space takes up 1/10 of the young generation.

The method of calculation is:

Survivor Size (1) = Young Generation Size / (2+SurvivorRatio)

Eden Size = Young Generation Size / (2+SurvivorRatio) * SurvivorRatio
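The two formulas above can be turned into a small sizing helper. This is a sketch under stated assumptions: the class and method names are hypothetical, the 4 GB young generation and the 800 MB target come from this article, and the real JVM aligns region sizes, so actual values can differ slightly from the plain division used here.

```java
// Hypothetical sizing helper based on the two formulas above; not a JVM API.
final class SurvivorSizing {

    /** Size of ONE survivor space for a given young generation size and SurvivorRatio. */
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (2 + survivorRatio);
    }

    /** Size of Eden for a given young generation size and SurvivorRatio. */
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (2 + survivorRatio) * survivorRatio;
    }

    /**
     * Largest SurvivorRatio (at least 1) for which one survivor space is still
     * at least minSurvivorBytes, or -1 if no such ratio exists.
     */
    static int largestRatioFor(long youngBytes, long minSurvivorBytes) {
        if (minSurvivorBytes <= 0) {
            throw new IllegalArgumentException("minSurvivorBytes must be positive");
        }
        long parts = youngBytes / minSurvivorBytes; // survivor-sized parts that fit in the young generation
        return parts < 3 ? -1 : (int) Math.min(parts - 2, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long young = 4L << 30;           // -XX:NewSize=4G
        long wantedSurvivor = 800L << 20; // 800 MB target from the text

        System.out.println("survivor at ratio 8: " + survivorBytes(young, 8) + " bytes");
        int ratio = largestRatioFor(young, wantedSurvivor);
        System.out.println("largest ratio giving >= 800 MB: " + ratio);
        System.out.println("survivor at that ratio: " + survivorBytes(young, ratio) + " bytes");
        System.out.println("eden at that ratio:     " + edenBytes(young, ratio) + " bytes");
    }
}
```

At ratio 8 the helper gives roughly 410 MB per survivor space, close to twice the 214728704-byte "Desired survivor size" in the logs. For a 4 GB young generation, an 800 MB survivor space needs a ratio of 3 or lower, at the cost of a smaller Eden.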

Extended reading

Why is promoting about 300 MB so many times slower than reclaiming 3 GB from the young generation?

Because of how the copying algorithm works, its cost depends mainly on the size of the live objects, not on the total size of the space.

For example, for the 4 GB young generation above (of which only Eden + S0 is actually usable), the GC only needs to traverse the object graph from the GC roots and copy the reachable objects to S1. It does not need to scan the entire young generation.

In the long-pause GC log above, about 363 MB was promoted and about 300 MB of surviving objects stayed in the young generation. The first GC kept a similar volume of survivors and took only about 0.14 s, so the 1.5 s was spent almost entirely on promotion.

So why is promotion so time-consuming?

I have not studied in depth how the Oracle JVM implements young-generation promotion. Promotion involves copying between generations; both generations live in the heap, so the copy itself is no different (essentially a memcpy), but promotion has additional logic to handle.

That additional logic, such as pointer updates, is more complex, so it is understandable that it takes more time.

Local code simulation

The following code reproduces the problem locally; it can be run directly under Oracle JDK7.

// JDK 7
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PromotionTest {

    public static void main(String[] args) throws IOException {
        // Simulate the resource-initialization scenario: these objects stay reachable
        List<InnerObject> dataList = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            dataList.add(new InnerObject());
        }

        // Simulate incoming traffic: short-lived objects that become garbage immediately
        for (int i = 0; i < 73; i++) {
            if (i == 72) {
                System.out.println("Execute young gc...Adjust promotion threshold to 1");
            }
            new InnerObject();
        }

        System.out.println("Execute full gc...dataList has been promoted to cms old space");
        // Note: the objects in dataList enter the old generation after this Full GC
        System.gc();
    }

    // Allocates a 4 MB byte array and touches every byte
    public static byte[] createData() {
        int dataSize = 1024 * 1024 * 4; // 4 MB
        byte[] data = new byte[dataSize];
        for (int j = 0; j < dataSize; j++) {
            data[j] = 1;
        }
        return data;
    }

    static class InnerObject {
        private Object data;

        public InnerObject() {
            this.data = createData();
        }
    }
}

JVM options

-server -Xmn400M -XX:SurvivorRatio=9 -Xms1000M -Xmx1000M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintHeapAtGC -XX:+PrintReferenceGC -XX:+PrintGCApplicationStoppedTime -XX:+UseConcMarkSweepGC
