
Example Analysis of Troubleshooting an Out-of-Heap Memory Leak in the JVM


This article walks through an example analysis of troubleshooting an out-of-heap memory leak in the JVM. It is concise and easy to follow, and I hope you get something out of the detailed walkthrough.

Preface

This is a record of the troubleshooting process and reasoning behind an online JVM memory leak, including some analysis of how the JVM allocates memory and a tour of common JVM troubleshooting methods and tools. I hope it is helpful to you.

During the investigation I took quite a few detours, but I will still write out the complete train of thought, detours included, as a lesson for future reference. At the end, the article also summarizes several principles for quickly troubleshooting memory leaks.

Fault description

During the lunch break on August 12, our business service received an alert: the service process's usage of the container's physical memory (16 GB) had exceeded the 80% threshold and was still rising.

Pulling up the chart in the monitoring system:

It looked like the Java process had a memory leak, but our heap is capped at 4 GB, and the process was consuming far more than 4 GB and approaching the container limit, so it had to be a leak outside the JVM heap.

We confirmed the startup configuration of the service process at the time:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80

Although no new code had gone live that day, that morning we were using a message queue to push a historical-data repair script, and this task calls one of our service's interfaces heavily, so we initially suspected the interface was involved.

The following chart shows how the call volume of that interface changed on the day of the incident:

The call volume at the time of the incident (5000+ calls per minute) was much higher than normal (200+ calls per minute).

"We temporarily asked the script to stop sending messages, the call to the interface dropped to 200 + times per minute, the container memory no longer rose at a very high slope, and everything seemed to be back to normal. "

The next step was to check whether this interface had a memory leak.

Investigation process

First, let's review how a Java process's memory is laid out, which will make the troubleshooting below easier to follow.

Take JDK 1.8, the version we use online, as an example. There are already plenty of good summaries of JVM memory allocation online, so I won't reinvent them here.

JVM memory is divided into two parts: the heap area and the non-heap area.

Heap area: the young generation and old generation we are all familiar with. Non-heap area: as shown in the figure, it includes the metaspace and direct memory.

"what needs to be noted here is that permanent generation (native to JDK8) holds classes used by JVM runtime, and permanent objects are garbage collected during full GC. "

After reviewing the memory allocation of JVM, let's get back to the fault.

Heap memory analysis

Although we were fairly sure from the start that heap memory was not involved, since the leaked footprint exceeded the 4 GB heap limit, we looked for clues in the heap anyway, just to be safe.

We looked at the young- and old-generation memory usage curves and the collection counts; everything looked normal. We then dumped the JVM heap on the container at the scene of the incident.

Heap memory Dump

Heap memory snapshot dump command:

jmap -dump:live,format=b,file=xxxx.hprof pid

Aside: you can also use jmap -histo:live pid to view the live objects in the heap directly.

After exporting it, download the dump file locally; you can then open it with Eclipse MAT (Memory Analyzer) or the JVisualVM that ships with the JDK.

Open the file using MAT as shown in the figure:

"you can see that in the heap memory, there are some large objects related to nio, such as nioChannel that is receiving message queue messages, and nio.HeapByteBuffer, but the number is not large enough to be used as a basis for judgment. "

Next, I started reading the interface code. Its main logic is to call the group's WCS client, looking up data in a database table and writing it to WCS. There is no other logic.

Finding no special logic, I began to wonder whether the WCS client wrapper itself leaked memory. The reason for this suspicion is that the WCS client is built on top of the SCF client, and as an RPC framework, its underlying transport may allocate direct memory.

"is it my code that starts the Bug of the WCS client, resulting in constantly applying for direct memory calls and eventually filling up memory? "

I contacted the WCS on-call engineers and described the problem; they replied that they would run a local stress test of the write operation to see whether they could reproduce it.

Since waiting for their feedback would take time, we decided to keep investigating on our own.

"I focus my skepticism on direct memory. I suspect that it is due to excessive interface calls and improper use of nio by the client, resulting in using ByteBuffer to request too much direct memory. "

Voiceover: the final result proved that this preconceived idea led to a detour in the investigation process. In the process of troubleshooting problems, it is possible to narrow the scope of the investigation with reasonable guesses, but it is best to make a clear list of each possibility first, and when you find that you have failed to go deep into one possibility, you should go back and carefully examine other possibilities in time. "

Reproduction in the sandbox environment

To recreate the failure scenario, I requested a stress-test machine in the sandbox environment, configured to match the online environment.

"first of all, let's simulate the situation of memory overflow (a large number of calling interfaces):"

We let the script keep pushing data and calling our interface while we watched the memory footprint.

Once the calls started, memory grew continuously and did not seem to hit any limit (no Full GC was triggered by a limit being reached).

"then let's simulate the normal amount of normal usage (call the API for normal quantity):"

We routed the interface's normal call volume (relatively small, with a batch of calls every 10 minutes) to the stress-test machine and obtained the old-generation and physical memory trends shown below:

"the question is: why does the memory keep going up and filling up? "

Our guess at the time was that because the JVM process did not limit the direct memory size (-XX:MaxDirectMemorySize), the growing off-heap memory would never trigger a Full GC.
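To check a hypothesis like this directly, the JDK exposes the NIO buffer pools over JMX; the sketch below (my own addition, not something we ran during the original investigation) prints how much direct memory has been reserved by ByteBuffer.allocateDirect.

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectMemoryUsage {
    public static void main(String[] args) {
        // The "direct" pool tracks ByteBuffer.allocateDirect allocations;
        // the "mapped" pool tracks memory-mapped files.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-8s count=%d used=%d KB capacity=%d KB%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed() / 1024, pool.getTotalCapacity() / 1024);
        }
    }
}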

"the above picture can draw two conclusions:"

1. When the leaking interface is called at high volume, if the old generation and other conditions in the heap do not meet the Full GC criteria, no Full GC happens and direct memory just keeps piling up. 2. At low call volume, the leak accumulates slowly, and periodic Full GCs always arrive in time to reclaim the leaked portion, which is why the service had run normally for so long without any problem.

"since as mentioned above, there is no direct memory limit in the startup parameters of our process, we added the-XX:MaxDirectMemorySize configuration and tested it again in a sandbox environment. "

It turned out that the physical memory consumed by the process kept rising past the limit we had set; the configuration seemingly had no effect.

This surprised me: is there something wrong with the JVM's memory limits?

"when I get here, I can see that my mind is obsessed with direct memory leaks in the process of troubleshooting, gone forever. "

Voiceover: we should believe in JVM's mastery of memory. If we find that the parameters are invalid, we should find the reasons from ourselves to see if we are using the parameters incorrectly. "

Direct memory analysis

To investigate further what was in direct memory, I turned my attention to it. Unlike heap memory, direct memory does not let you easily see all the objects occupying it, so we need some commands to probe it. I used several approaches to look for the problem.

View process memory information pmap

pmap - report memory map of a process (view a process's memory mapping information)

The pmap command is used to report the memory mapping of a process and is a good tool for Linux debugging and operation and maintenance.

pmap -x pid | sort -n -k3    (pipe to sort on the third column if you need a ranking)

After running it, I got the following output (truncated):

..

00007fa2d4000000 8660 8660 8660 rw--- [anon]

00007fa65f12a000 8664 8664 8664 rw--- [anon]

00007fa610000000 9840 9832 9832 rw--- [anon]

00007fa5f75ff000 10244 10244 10244 rw--- [anon]

00007fa6005fe000 59400 10276 10276 rw--- [anon]

00007fa3f8000000 10468 10468 10468 rw--- [anon]

00007fa60c000000 10480 10480 10480 rw--- [anon]

00007fa614000000 10724 10696 10696 rw--- [anon]

00007fa6e1c59000 13048 11228 0 r-x-- libjvm.so

00007fa604000000 12140 12016 12016 rw--- [anon]

00007fa654000000 13316 13096 13096 rw--- [anon]

00007fa618000000 16888 16748 16748 rw--- [anon]

00007fa624000000 37504 18756 18756 rw--- [anon]

00007fa62c000000 53220 22368 22368 rw--- [anon]

00007fa630000000 25128 23648 23648 rw--- [anon]

00007fa63c000000 28044 24300 24300 rw--- [anon]

00007fa61c000000 42376 27348 27348 rw--- [anon]

00007fa628000000 29692 27388 27388 rw--- [anon]

00007fa640000000 28016 28016 28016 rw--- [anon]

00007fa620000000 28228 28216 28216 rw--- [anon]

00007fa634000000 36096 30024 30024 rw--- [anon]

00007fa638000000 65516 40128 40128 rw--- [anon]

00007fa478000000 46280 46240 46240 rw--- [anon]

0000000000f7e000 47980 47856 47856 rw--- [anon]

00007fa67ccf0000 52288 51264 51264 rw--- [anon]

00007fa6dc000000 65512 63264 63264 rw--- [anon]

00007fa6cd000000 71296 68916 68916 rwx-- [anon]

00000006c0000000 4359360 2735484 2735484 rw--- [anon]

As you can see, the last line is the heap memory mapping, which occupies 4 GB; the rest are many smaller mappings. This information still did not reveal the problem.

Out-of-heap memory tracking: NativeMemoryTracking

Native Memory Tracking (NMT) is a HotSpot VM feature for analyzing the VM's internal memory usage. We can use jcmd (shipped with the JDK) to access NMT data.

NMT must first be enabled through the VM's startup parameters; note that enabling it causes a 5% to 10% performance loss.

-XX:NativeMemoryTracking=[off | summary | detail]

# off: disabled by default

# summary: only count the memory usage of each category.

# detail: Collect memory usage by individual call sites.

Then start the process; once it is running, you can view its native memory usage with the following command:

jcmd pid VM.native_memory [summary | detail | baseline | summary.diff | detail.diff | shutdown] [scale=KB | MB | GB]

# summary: classifies memory usage.

# detail: details of memory usage, including virtual memory usage in addition to summary information.

# baseline: create a snapshot of memory usage for comparison with later

# summary.diff: compare with the summary of the last baseline

# detail.diff: compare with the detail of the last baseline

# shutdown: close NMT

We use:

jcmd pid VM.native_memory detail scale=MB > temp.txt
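If you suspect a slow native leak, the baseline/diff subcommands listed above can also help; a typical workflow (a sketch, with pid standing for your process id) might be:

jcmd pid VM.native_memory baseline

# ... wait while the suspected leak grows ...

jcmd pid VM.native_memory summary.diff scale=MB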

The result is as shown in the figure:

Nothing in the output above pointed clearly to the problem; at least, I could not see it from this information at the time.

The investigation seems to have reached an impasse.

Where hills bend, streams wind and the pathway seems to end

Just as the troubleshooting stalled, we got responses from both WCS and SCF: both confirmed there was no memory leak in their wrappers. WCS does not use direct memory at all, and although SCF sits underneath as the RPC protocol, it would not leave such an obvious memory bug, otherwise there would already be plenty of reports online.

View JVM memory information jmap

At this point, unable to find the problem, I started a new sandbox container, ran the service process in it, and then used jmap to look at the actual JVM memory configuration:

jmap -heap pid

The results are obtained:

Attaching to process ID 1474, please wait...

Debugger attached successfully.

Server compiler detected.

JVM version is 25.66-b17

Using parallel threads in the new generation.

Using thread-local object allocation.

Concurrent Mark-Sweep GC

Heap Configuration:

MinHeapFreeRatio = 40

MaxHeapFreeRatio = 70

MaxHeapSize = 4294967296 (4096.0MB)

NewSize = 2147483648 (2048.0MB)

MaxNewSize = 2147483648 (2048.0MB)

OldSize = 2147483648 (2048.0MB)

NewRatio = 2

SurvivorRatio = 8

MetaspaceSize = 21807104 (20.796875MB)

CompressedClassSpaceSize = 1073741824 (1024.0MB)

MaxMetaspaceSize = 17592186044415 MB

G1HeapRegionSize = 0 (0.0MB)

Heap Usage:

New Generation (Eden + 1 Survivor Space):

Capacity = 1932787712 (1843.25MB)

Used = 1698208480 (1619.5378112792969MB)

Free = 234579232 (223.71218872070312MB)

87.86316621615607% used

Eden Space:

Capacity = 1718091776 (1638.5MB)

Used = 1690833680 (1612.504653930664MB)

Free = 27258096 (25.995346069335938MB)

98.41346682518548% used

From Space:

Capacity = 214695936 (204.75MB)

Used = 7374800 (7.0331573486328125MB)

Free = 207321136 (197.7168426513672MB)

3.4349974840697497% used

To Space:

Capacity = 214695936 (204.75MB)

Used = 0 (0.0MB)

Free = 214695936 (204.75MB)

0.0% used

Concurrent mark-sweep generation:

Capacity = 2147483648 (2048.0MB)

Used = 322602776 (307.6579818725586MB)

Free = 1824880872 (1740.3420181274414MB)

15.022362396121025% used

29425 interned Strings occupying 3202824 bytes

In this output we can see that the old generation and the young generation both look quite normal, the metaspace only occupies about 20 MB, and direct memory also seems to be around 2 GB.

Wait, why is MaxMetaspaceSize = 17592186044415 MB? That looks like no limit at all.

Take a closer look at our startup parameters:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:PermSize=256m -XX:MaxPermSize=512m -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80

The parameters -XX:PermSize=256m -XX:MaxPermSize=512m configure the permanent generation's memory. But since 1.8, the HotSpot virtual machine has removed the permanent generation and replaced it with the metaspace. Because we run JDK 1.8 online, we effectively had no limit on the metaspace's maximum size at all: -XX:PermSize=256m and -XX:MaxPermSize=512m are simply obsolete parameters under 1.8.
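For the record, on JDK 1.8 the corresponding limits would be set with the metaspace flags instead; the line below is only an illustration of what corrected parameters might look like (the actual sizes would have to be tuned for the service), not the configuration we ultimately deployed:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m -XX:MaxDirectMemorySize=2g -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80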

The following figure describes the change in the permanent generation from 1.7 to 1.8:

"could it be that metaspace memory has been leaked? "

I chose to test locally, which made it easy to change parameters and to use JVisualVM to watch the memory changes visually.

Use JVisualVM to watch the process run

First, limit the metaspace with the parameters -XX:MetaspaceSize=64m -XX:MaxMetaspaceSize=128m, then call the problematic interface locally.

The result is as follows:

"as you can see, when the meta-space is exhausted, the system starts Full GC, the meta-space memory is reclaimed, and many classes are unloaded. "

Then we removed the metaspace limit, i.e. used the parameters that had gone wrong before:

-Xms4g -Xmx4g -Xmn2g -Xss1024K -XX:ParallelGCThreads=20 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80 -XX:MaxDirectMemorySize=2g -XX:+UnlockDiagnosticVMOptions

The result is as follows:

You can see that the metaspace keeps growing, and the number of loaded classes rises with the call volume, showing a clear positive correlation.

Past dark willows and flowers in bloom lies another village

The problem suddenly became clear: very likely, each call to the interface was creating a new class, eating up metaspace memory.

Observing JVM class loading: -verbose

When debugging a program, you sometimes need to see which classes it loads, how memory is collected, which native interfaces are called, and so on. This is where the -verbose option comes in. In MyEclipse you can configure it via right-click, or you can pass java -verbose on the command line.

-verbose:class to view class loading

-verbose:gc to view memory recovery in a virtual machine

-verbose:jni to view local method calls

In the local environment, we added the startup parameter -verbose:class and called the interface in a loop.

You can see that countless com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto classes were generated:

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

[Loaded com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto from file:/C:/Users/yangzhendong01/.m2/repository/com/alibaba/fastjson/1.2.71/fastjson-1.2.71.jar]

After many calls had accumulated a certain number of classes, we manually triggered a Full GC to reclaim the classloaders and found that a large number of fastjson-related classes were collected.
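(The article does not show how the Full GC was triggered; one common way to force it on a running JDK 8 HotSpot process is the diagnostic command below.)

jcmd pid GC.run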

"if you use jmap to check class loading before recycling, you can also find a large number of fastjson-related classes:"

jmap -clstats 7984

Now we had a direction. This time we read the code much more carefully, looked for where fastjson is used in the logic, and found the following code:

/**
 * Returns a JSON string, converting camelCase field names to snake_case.
 * @param bean the entity class
 */
public static String buildData(Object bean) {
    try {
        SerializeConfig CONFIG = new SerializeConfig();
        CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
        return JSON.toJSONString(bean, CONFIG);
    } catch (Exception e) {
        return null;
    }
}

Root cause of the problem

Before calling WCS, we serialize the entity class's camelCase fields into snake_case fields. This requires fastjson's SerializeConfig, and we were instantiating it inside the static method. When a SerializeConfig is created, fastjson by default generates an ASM proxy class to serialize the target object; this is exactly the com.alibaba.fastjson.serializer.ASMSerializer_1_WlkCustomerDto class being created over and over above. If we reuse the same SerializeConfig, fastjson looks up the proxy class it has already generated and reuses it. But if we call new SerializeConfig() each time, the previously generated proxy class can never be found, so a new WlkCustomerDto proxy class is generated on every call.

The following two screenshots show the fastjson source code at the problem location:

We made SerializeConfig a static field of the class, and the problem was solved:

private static final SerializeConfig CONFIG = new SerializeConfig();

static {
    CONFIG.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
}
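As a sanity check, the leak can be reproduced outside the service with a few lines of fastjson code. The sketch below is my own minimal reproduction, with a made-up DemoDto standing in for WlkCustomerDto; run it with -verbose:class and a new ASMSerializer_1_... class should be loaded on every iteration, while the fixed pattern above loads it only once.

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.PropertyNamingStrategy;
import com.alibaba.fastjson.serializer.SerializeConfig;

public class FastjsonMetaspaceLeak {
    // Hypothetical DTO standing in for WlkCustomerDto.
    public static class DemoDto {
        private String customerName = "test";
        public String getCustomerName() { return customerName; }
        public void setCustomerName(String customerName) { this.customerName = customerName; }
    }

    public static void main(String[] args) {
        DemoDto dto = new DemoDto();
        for (int i = 0; i < 100_000; i++) {
            // Buggy pattern: a fresh SerializeConfig per call cannot find the ASM
            // serializer generated on the previous call, so fastjson generates and
            // loads a brand-new proxy class every time, slowly filling the metaspace.
            SerializeConfig config = new SerializeConfig();
            config.propertyNamingStrategy = PropertyNamingStrategy.SnakeCase;
            JSON.toJSONString(dto, config);
        }
    }
}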

What does fastjson's SerializeConfig do?

SerializeConfig introduction:

The main job of SerializeConfig is to configure and record the serializer class (an implementation of the ObjectSerializer interface) for each Java type. For example, Boolean.class uses BooleanCodec (a class that implements serialization and deserialization together) as its serializer, and float[].class uses FloatArraySerializer. Some of these serializers are built into FastJSON (for example, those for the basic Java types), some are generated by the ASM framework (for example, those for user-defined classes), and some are user-defined serializers (for example, the framework's default handling of Date converts to milliseconds, while an application may need seconds). This raises the question of whether to generate the serializer class with ASM or to serialize via JavaBean reflection; the decision is based, among other things, on whether the code is running on Android (the environment variable java.vm.name being "dalvik" or "lemur" indicates Android), but that is not the only check, and more specific checks follow later.

In principle, as long as the same SerializeConfig instance keeps serializing the same class, it will find the previously generated proxy class and reuse it. Each time the interface is called, however, our service instantiated a new ParseConfig object to configure fastjson's deserialization settings without disabling the ASM proxy; because every call created a brand-new instance, the already-generated proxy class could never be found, so fastjson kept creating new proxy classes and loading them into the metaspace, which kept expanding until memory ran out.

Why the problem only appears after upgrading to JDK 1.8

The cause is worth a closer look: why did this problem not occur before the upgrade? To answer that, we need to examine the differences between the HotSpot virtual machines shipped with JDK 1.8 and JDK 1.7.

Starting with JDK 1.8, the bundled HotSpot virtual machine drops the old permanent generation and introduces the metaspace. Functionally, the metaspace can be considered similar to the permanent generation, its main role being to store class metadata, but the underlying mechanism is quite different.

First, the default maximum size of the metaspace is the physical memory of the whole machine, so a continuously expanding metaspace lets the Java program eat into the system's available memory until none is left; the permanent generation, by contrast, had a fixed default size and would not grow to fill the machine. When their allocated memory is exhausted, both trigger a full GC, but they differ in what can be reclaimed. During a full GC, the class metadata (Class objects) in the permanent generation is collected with a mechanism similar to heap collection: anything unreachable from the GC roots can be reclaimed. The metaspace, however, decides whether class metadata can be reclaimed based on whether the ClassLoader that loaded it can be reclaimed; as long as the ClassLoader stays alive, the class metadata it loaded will not be collected. This explains why our two services only ran into trouble after upgrading to 1.8: in earlier JDK versions, although every fastjson call created many proxy classes and loaded many Class instances into the permanent generation, those Class instances were created during the method call and became unreachable afterwards, so when the permanent generation filled up and a full GC was triggered, they were reclaimed.

Under 1.8, however, these proxy classes are loaded through the main thread's ClassLoader, which is never collected while the program is running, so the proxy classes loaded through it are never collected either; the metaspace keeps expanding and eventually exhausts the machine's memory.
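A quick way to watch this effect, on either JDK version, is to track the class-loading counters while the leaking code runs; the sketch below (my own illustration) uses the standard ClassLoadingMXBean. If the loaded count keeps climbing while the unloaded count stays flat even across Full GCs, the leaked classes are being kept alive by a reachable classloader.

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadWatcher {
    public static void main(String[] args) throws InterruptedException {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        while (true) {
            // loaded = classes currently loaded; unloaded = classes whose
            // classloader became unreachable and was collected.
            System.out.printf("loaded=%d totalLoaded=%d unloaded=%d%n",
                    bean.getLoadedClassCount(),
                    bean.getTotalLoadedClassCount(),
                    bean.getUnloadedClassCount());
            Thread.sleep(5000);
        }
    }
}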

The problem is not limited to fastjson; it can occur anywhere a program loads classes dynamically. Frameworks in particular make heavy use of tools such as ASM and javassist for bytecode enhancement. By the analysis above, before JDK 1.8 the dynamically loaded Class objects could in most cases be reclaimed during a full GC, so problems rarely surfaced, and many frameworks and toolkits never bothered to handle this. Once you upgrade to 1.8, these problems may be exposed.

With the problem solved, a review of the whole troubleshooting process exposed quite a few issues. The most important was being unfamiliar with how memory allocation differs across JVM versions, which led to misjudging the permanent generation versus the metaspace, taking many detours, and wasting a lot of time digging into direct memory.

Second, an investigation needs to be both careful and comprehensive: it is best to lay out all the possibilities first, otherwise it is easy to get trapped in a scope you set for yourself and never find the way out.

Finally, let me summarize the takeaways from this problem:

1. Starting with JDK 1.8, the bundled HotSpot virtual machine drops the permanent generation and introduces the metaspace, which is functionally similar (it stores class metadata) but works quite differently under the hood.

2. Every kind of JVM memory should be limited at startup, including the familiar heap memory, but also direct memory and the metaspace; this is the last line of defense that keeps an online service healthy.

3. When using third-party libraries, pay attention to how you write the calling code and avoid obvious memory leaks.

4. Be especially careful with libraries that use bytecode-enhancement tools such as ASM, particularly after JDK 1.8.

That concludes this example analysis of troubleshooting an out-of-heap memory leak in the JVM; I hope you were able to take something away from it.
