
Troubleshooting and Analysis of a Java Memory Leak


This article introduces how to troubleshoot and analyze a Java memory leak. Many people run into exactly this kind of dilemma in real-world operations, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!

I. Origin

A few days ago, the team set up an on-call rotation to look after our services, mainly handling alarm emails, investigating bugs, and dealing with operations issues. Weekdays are fine, since I have to be at work anyway, but if my turn falls on a weekend, that day is ruined.

I don't know whether it's because the company's network is so sprawling or because the network operations team is weak, but there are always problems with the network: either a switch here drops off the network or a router there breaks, and all kinds of timeouts show up from time to time. And our sensitive probe service can always accurately catch these occasional little problems and add some material to our otherwise good work. Several times the on-call colleagues grumbled together, discussing how to dodge the liveness mechanism and quietly shut down the probe service without being caught (not that anyone dared).

I got handed one of these probe-service incidents just last weekend.

II. The Problem

1. Network problems?

Starting after seven o'clock in the evening, I began to receive alarm emails constantly, showing that several probe interfaces had timed out. Most of the execution stacks showed the request thread blocked waiting for the interface response.

I have seen this kind of thread stack many times. The HTTP DNS timeout we set is 1s, the connect timeout is 2s, and the read timeout is 3s. This kind of error means the probe service sent the HTTP request normally and the server responded normally after receiving it, but the response was lost somewhere in the network's hop-by-hop forwarding, so the request thread's stack is stuck at the point where it reads the interface response. The typical signature of this case is that the corresponding log records can be found on the server, and the logs show the server responded completely normally. By contrast, if the thread stack is stuck at the socket connect, the connection itself failed to be established and the server is completely unaware of the request.
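
As a rough illustration of where those timeouts live, here is a minimal sketch using java.net.HttpURLConnection. The class and values are illustrative assumptions, not the probe service's actual client, and the separate 1s DNS timeout would be configured in whatever HTTPDNS/resolver layer is in use rather than here.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ProbeClientSketch {
    public static String probe(String endpoint) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(2_000); // connect timeout: 2s, as in the article
        conn.setReadTimeout(3_000);    // read timeout: 3s; a lost response surfaces here
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            // When the response is dropped in transit, the thread blocks inside
            // readLine() until the read timeout fires -- the stack signature
            // described above.
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            return body.toString();
        } finally {
            conn.disconnect();
        }
    }
}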

I noticed that one interface reported errors much more often than the others. This interface has to upload a 4 MB file to the server, run it through a series of business logic, and then return 2 MB of text data, while the other interfaces only have simple business logic. My guess was that with so much data being uploaded and downloaded, a timeout caused by packet loss was simply more likely.

Following this guess, I logged on to the server and searched the recent service logs with the request's request_id. Sure enough, it was an interface timeout caused by network packet loss.

Of course, the leader wasn't going to be satisfied with that; someone has to own the conclusion. So I quickly contacted the network operations team to confirm the state of the network at the time. They replied that the switch in the machine room hosting our probe service is old and has an unidentified forwarding bottleneck that is being optimized. That put me more at ease, so I gave a brief explanation in the department group chat and considered the task done.

2. The problem breaks out

I thought that would be the only small bump of this on-call shift, but after eight o'clock in the evening, alarm emails from all kinds of interfaces came flooding in, catching me off guard just as I was getting ready to pack up and enjoy what was left of my Sunday.

This time almost every interface was timing out, and our interfaces with heavy network I/O were timing out on every single probe. Could the whole machine room be having problems?

Once again, through the server logs and the monitoring, I could see that the metrics of every interface were normal. I tested an interface by hand and it was completely fine. Since the online services weren't affected, I intended to stop the probe tasks through the probe service's API and then troubleshoot slowly.

It was only when my request to the interface that pauses probe tasks went unanswered for a long time that I realized it wasn't that simple.

III. Solving the Problem

1. Memory leak

So I quickly logged in to the probe server and ran the usual trio of top, free, and df, and sure enough found some anomalies.

The CPU usage of our probe process was extremely high, reaching 900%.

Our Java process doesn't do much CPU-heavy computation; normally its CPU usage sits between 100% and 200%. When CPU usage soars like this, the process has either entered an endless loop or is doing a lot of GC.

Using the jstat -gc pid [interval] command to check the GC status of the Java process, sure enough, it was doing a FULL GC about once per second.

With that many FULL GCs, a memory leak was clearly the prime suspect. So I saved the thread-stack scene with jstack pid > jstack.log, saved the heap scene with jmap -dump:format=b,file=heap.log pid, then restarted the probe service, and the alarm emails finally stopped.
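
As an aside, and not something done in this incident: if jmap is unavailable, or you want the service to snapshot itself, the JDK's HotSpotDiagnostic MXBean can trigger the same kind of heap dump from inside the process. A minimal sketch:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        // Proxy for the com.sun.management:type=HotSpotDiagnostic MBean.
        HotSpotDiagnosticMXBean mx = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // live=true dumps only reachable objects, similar to jmap's live option.
        mx.dumpHeap(path, true);
    }

    public static void main(String[] args) throws Exception {
        dump("heap.hprof");
    }
}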

jstat

jstat is a very powerful JVM monitoring tool. Its general usage is: jstat [-options] pid [interval]

The options it supports include:

-class: class loading information

-compiler: JIT compilation statistics

-gc: garbage collection information

-gcXXX: detailed GC information for a specific region, for example -gcold

It is very helpful for locating memory problems in the JVM.
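
If you would rather watch GC from inside the process than with jstat, the standard GarbageCollectorMXBean exposes the same cumulative counts and times. This is a small illustrative sketch, not from the original troubleshooting session; collector names depend on the GC algorithm and JDK version.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // Counts and times are cumulative since JVM start; an old-generation
                // collector whose count climbs by ~1 per second matches the
                // "FULL GC once per second" symptom seen with jstat.
                System.out.printf("%s: count=%d time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(1000);
        }
    }
}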

IV. Investigation

Although the immediate problem had been dealt with, it was still necessary to find the root cause to keep it from happening again.

1. Analyze the stack

Analyzing the stack is straightforward: check whether there are too many threads and what most of the stacks are doing.

There were only a little over 400 threads, nothing unusual.

The thread states also looked normal, so the next step was to analyze the heap file.

2. Download the heap dump file

Heap dump files are binary data, which is very inconvenient to inspect on the command line, and the tools Java provides for this are all graphical and can't be used on the Linux server, so the file has to be downloaded locally first.

Because we set the heap to 4 GB, the dumped heap file is also very large, which makes downloading it a real pain, but we can compress it first.

gzip is a very powerful compression command. In particular, you can pass -1 through -9 to set the compression level: the higher the level, the better the compression ratio and the longer it takes. -6 to -7 is recommended; -9 is too slow for too little gain, and in the time it spends compressing you could have downloaded the extra bytes anyway.

3. Use MAT to analyze the JVM heap

MAT (Eclipse Memory Analyzer) is a powerful tool for analyzing Java heap memory. Open the heap file with it (after changing the file suffix to .hprof); it will ask which kind of analysis to run, and for this problem the Leak Suspects report is the obvious choice.

The resulting pie chart showed that most of the heap memory was occupied by one and the same object. Drilling into the heap details and tracing back up the reference chain quickly revealed the culprit.

4. Analyze the code

Having found the leaking object, I searched the project globally for its class name. It turned out to be a Bean object, and the leak traced to one of its properties, of type Map.

This Map stores the response results of each probe interface in an ArrayList, keyed by type: every time a probe completes, its result is stuffed into the corresponding ArrayList for later analysis. Since the Bean object is never reclaimed and there is no logic that clears this property, the Map keeps growing until, after about ten days of uptime, it fills the entire heap.
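
For illustration only, with hypothetical class and field names rather than the project's actual code, the leaky pattern and one straightforward "drain on analysis" fix look roughly like this:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical long-lived bean; in the real project this was a Bean that
// lived for the whole life of the process and was never reclaimed.
public class ProbeResultHolder {

    // Leaky version: results are appended on every probe and never removed,
    // so each list grows without bound over the life of the process.
    private final Map<String, List<String>> resultsByType = new ConcurrentHashMap<>();

    public void record(String type, String result) {
        resultsByType
            .computeIfAbsent(type, k -> Collections.synchronizedList(new ArrayList<>()))
            .add(result);
    }

    // One simple fix: drain the accumulated results when they are analyzed,
    // so the map never holds more than one analysis window's worth of data.
    public List<String> drain(String type) {
        List<String> drained = resultsByType.remove(type);
        return drained != null ? drained : new ArrayList<>();
    }
}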

Once the heap was full, no more memory could be allocated for HTTP response data, so threads stayed stuck in readLine forever. And our heavy-I/O interfaces raised far more alarms, probably because their larger responses needed more memory.

A PR was sent to the code owner, and the problem was satisfactorily resolved.

5. Summary

In fact, I should reflect on my own handling of this: the very first alarm emails already contained this kind of thread stack.

I saw that kind of error stack but didn't think it through. TCP guarantees the integrity of the message, and the data would never have been assigned to the variable unless it had been fully received, so this was clearly an internal error of our own service. If I had paid attention to it, the problem could have been found earlier. When troubleshooting, you really can't afford to skip a single link.

That concludes "Troubleshooting and Analysis of a Java Memory Leak". Thank you for reading. If you want to learn more about the industry, you can follow this site; the editor will keep putting out more high-quality, practical articles for you!
