This article is a detailed walk-through of troubleshooting a Java memory leak, from the first alarms through heap analysis to the final garbage-collector tuning. I hope you find something useful in it.
On a bleak midnight
Just after midnight, I was woken up by an alert from our monitoring system. Adventory, the application responsible for indexing ads in our PPC (pay-per-click) advertising system, had apparently restarted several times in a row. In a cloud environment, an instance restart is normal and does not trigger an alert, but this time the number of restarts exceeded the threshold within a short period. I opened my laptop and dove into the project's logs.
It must be the network.
I saw that the service had timed out several times while connecting to ZooKeeper. We use ZooKeeper (ZK) to coordinate indexing between multiple instances and rely on it for robustness. Clearly, a ZooKeeper failure would prevent indexing from continuing, but it should not cause the whole application to die. Besides, this situation is very rare (it was the first time I had seen ZK go down in production), so I suspected the problem might not be easy to fix. I woke up the ZooKeeper on-call engineer and showed him what was going on.
Meanwhile, I checked our configuration and found that the ZooKeeper connection timeouts were set on the order of seconds. Apparently ZooKeeper was really down, and since other services were using it too, this would mean a very serious problem. I sent messages to a few other teams; they obviously knew nothing about it.
My colleague from the ZooKeeper team replied that, from his point of view, the system was running fine. Since other users did not seem to be affected, I slowly realized it was not a ZooKeeper problem. The logs clearly showed network timeouts, so I woke up the colleague responsible for the network.
The team responsible for the network checked their monitoring and found nothing unusual. Since a single network segment, or even a single node, could be cut off from the rest, they checked the machines our instances were running on and found nothing wrong either. In the meantime I tried several other ideas, none of which worked, and I was reaching the limits of my wits. It was getting late (or rather early), and, independently of anything I had tried, the restarts were becoming less frequent. Since this service only refreshes the data and does not affect its availability, we decided to leave the problem until the morning.
It must be GC.
Sometimes it is a good idea to put a problem aside, get some sleep, and come back to it with a clear head. Nobody knew what was happening; the service behaved very strangely. Suddenly it hit me. What is the main cause of strange behaviour in a Java application? Garbage collection, of course.
Anticipating exactly this kind of situation, we had been printing GC logs all along. I immediately downloaded the GC log and opened Censum to analyze it. Before I even looked closely, I spotted something terrifying: a full GC every 15 minutes, each one causing a 20-second stop-the-world pause. No wonder the connection to ZooKeeper timed out, even though neither ZooKeeper nor the network had any problem.
These pauses also explained why the whole service kept dying rather than just logging a single timeout error. Our service runs on Marathon, which periodically checks the health of each instance; if an endpoint fails to respond within a set time, Marathon restarts that instance.
Knowing the cause is half the battle, so I was confident the problem would be solved quickly. To explain the reasoning that follows, I need to describe how Adventory works, because it is not your standard microservice.
Adventory indexes our ads into ElasticSearch (ES). This takes two steps. The first is acquiring the data it needs: the service receives events sent from several other systems via Hermes and saves them to a MongoDB cluster. The traffic is at most a few hundred requests per second, and each operation is so lightweight that even the garbage collection it triggers costs little. The second step is indexing the data. It runs on a schedule (roughly once every two minutes): all the data stored in MongoDB is pulled through an RxJava stream, combined into denormalized records, and sent to ElasticSearch. This part of the job looks more like an offline batch task than a service.
Because the data receives frequent updates, maintaining the index incrementally is not worth it, so every scheduled run rebuilds the entire index. That means a whole chunk of data passes through the system, which causes a lot of garbage collection. Despite the streaming approach, we were forced to grow the heap to 12 GB. Because the heap is so large (and G1 is now fully supported), our GC of choice was G1.
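The article does not list the exact JVM options in use, so the following is only a hypothetical reconstruction of a JDK 8-era setup matching the description above: a 12 GB heap, G1, and GC logging detailed enough for Censum to analyze. The log path and jar name are invented for illustration.

```
# Hypothetical option set (JDK 8 syntax); log path and jar name are made up.
java -Xms12g -Xmx12g \
  -XX:+UseG1GC \
  -Xloggc:/var/log/adventory/gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -jar adventory.jar
```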
In services I had worked on before, which also churned through lots of short-lived objects, I had increased the defaults of -XX:G1NewSizePercent and -XX:G1MaxNewSizePercent so that the young generation would be larger and young GC could handle more data before pushing it into the old generation. Censum also showed a lot of premature tenuring, which was consistent with the full GCs appearing after some time. Unfortunately, these settings did not help at all.
Then I thought that perhaps the producer was generating data too fast for the consumer to keep up, so that records piled up in memory before they could be processed. I tried to reduce the number of threads producing the data and slow down its generation, while keeping the size of the batches the consumer sends to ES unchanged. This was essentially a crude attempt at back pressure, but it did not help either.
It must be a memory leak.
At this point, a colleague with a cooler head suggested we do what we should have done in the first place: look at what is actually in the heap. We prepared an instance in the development environment with the same amount of data and roughly the same heap size as production. By attaching jvisualvm to it and sampling memory, we could see the approximate number and size of objects on the heap. At a glance we saw that the number of Ad objects from our domain was abnormally high and kept growing during indexing, up to the order of the number of ads we were processing. But... it should not be that way. After all, we were streaming this data through RX precisely so that it would not all be loaded into memory.
With growing suspicion I examined that part of the code. It had been written two years earlier and never carefully revisited since. Sure enough, we were in fact loading all the data into memory. It was not intentional, of course. Not understanding RxJava well enough at the time, we had wanted part of the code to run in parallel in a particular way. To split some of the work off from the main RX flow, we decided to run CompletableFutures on a separate executor. But that meant we had to wait for all those CompletableFutures to finish, by storing their references and then calling join() on them. This kept the references to every future, and to the data they referred to, alive until the end of the indexing run, which prevented the garbage collector from cleaning them up earlier.
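The article does not show the original code, but the anti-pattern it describes can be sketched roughly as follows. All names here (rebuildIndex, toIndexableForm, the thread-pool size, the Ad and IndexedAd types) are invented for illustration and are not the actual Adventory code; the point is only that collecting every CompletableFuture into a list and joining them at the end keeps every future, and the data it references, reachable until the whole index is rebuilt.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class IndexingJobSketch {

    // Separate executor, as described in the article; the pool size is a guess.
    private final ExecutorService indexExecutor = Executors.newFixedThreadPool(8);

    // Hypothetical reconstruction of the anti-pattern: every ad gets its own
    // CompletableFuture, and all of them are kept in a list until the very end,
    // so nothing becomes eligible for garbage collection early.
    void rebuildIndex(List<Ad> allAds) {
        List<CompletableFuture<IndexedAd>> futures = allAds.stream()
                .map(ad -> CompletableFuture.supplyAsync(() -> toIndexableForm(ad), indexExecutor))
                .collect(Collectors.toList());

        // join() on every future: the whole result set (and the source ads the
        // futures still reference) stays alive until indexing completes.
        futures.forEach(CompletableFuture::join);
    }

    private IndexedAd toIndexableForm(Ad ad) {
        return new IndexedAd(ad);   // placeholder for the denormalization step
    }

    // Minimal placeholder types so the sketch compiles.
    static class Ad { }
    static class IndexedAd { IndexedAd(Ad ad) { } }
}
```

Because nothing is released until the last join() returns, the entire dataset is effectively materialized in memory, which is exactly what the heap samples showed.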
Is it really that bad?
Of course it was a silly mistake, and we were annoyed to have found it so late. I even remembered a brief discussion, long before, about whether this application really needed a 12 GB heap. A 12 GB heap did seem a bit large. On the other hand, the code had run for almost two years without any problems. We could fix it relatively easily now, but two years earlier it would probably have taken us much more time, and back then we had far more important work to do than saving a few gigabytes of memory.
So while, from a purely technical point of view, it is embarrassing that the problem went unsolved for so long, from a strategic point of view leaving the memory-wasting problem alone may have been the more pragmatic choice. Another consideration, of course, is what impact the problem would have if it ever struck. In the end our users were barely affected, but the outcome could have been much worse. Software engineering is about weighing trade-offs, and deciding the priority of different tasks is no exception.
Still not working
Now that we had more experience with RX, the CompletableFuture problem was easy to solve: rewrite the code to use only RX, upgrade to RxJava 2 along the way, and truly stream the data instead of collecting it in memory. These changes passed code review and were deployed to the development environment for testing. To our surprise, the memory required by the application did not drop at all. Memory sampling did show that the number of Ad objects in memory had decreased compared to before, and their number no longer grew constantly but sometimes fell, so they were not all being held in memory. Yet the old problem persisted: it seemed the data was still not being streamed end to end.
So what's going on now?
The relevant keyword has already come up: back pressure. When data is streamed, it is common for the producer and the consumer to work at different speeds. If the producer is faster than the consumer and cannot slow down, it keeps producing more and more data that the consumer cannot process at the same rate. The visible symptom is a growing buffer of unprocessed data, and that is exactly what was happening in our application. Back pressure is a mechanism that allows a slower consumer to tell a faster producer to slow down.
Our indexing pipeline had no notion of back pressure, which had never been a problem before, because we were keeping the whole index in memory anyway. Once we had fixed the earlier problem and started truly streaming the data, the lack of back pressure became obvious.
I have seen this pattern many times while solving performance problems: fixing one problem surfaces another you had never even heard of, because the first one was hiding it. If your house floods regularly, you will not notice that it is also a fire hazard.
Fixing the problem caused by the fix
In RxJava 2, the old Observable class was split into Observable, which does not support back pressure, and Flowable, which does. Fortunately, there are simple ways to turn a non-back-pressured source into a back-pressure-aware Flowable out of the box, including creating a Flowable from a non-reactive source such as an Iterable. Composing such Flowables produces Flowables that also support back pressure, so once you fix a single spot, the whole pipeline has back pressure support.
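As a minimal sketch of this idea (the AdRepository and EsIndexer interfaces, the batch size, and the scheduler choice are assumptions made for illustration, not the actual Adventory code), an RxJava 2 pipeline built from a plain Iterable is back-pressure-aware end to end:

```java
import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;

import java.util.List;

class StreamingIndexerSketch {

    // Placeholder interfaces standing in for the real data source and ES client.
    interface AdRepository { Iterable<Ad> findAll(); }
    interface EsIndexer { void bulkIndex(List<Ad> batch); }

    void rebuildIndex(AdRepository repository, EsIndexer indexer) {
        Flowable.fromIterable(repository.findAll())   // back-pressure-aware source: items are pulled on demand
                .buffer(1000)                         // group ads into bulk requests (batch size is a guess)
                .observeOn(Schedulers.io())           // bounded hand-off buffer between producer and consumer
                .doOnNext(indexer::bulkIndex)         // send each batch to ElasticSearch
                .blockingSubscribe();                 // run the pipeline to completion
    }

    static class Ad { }
}
```

Because Flowable.fromIterable only pulls items as downstream demand arrives, and observeOn hands batches over through a bounded buffer, the producer can never run far ahead of the ElasticSearch consumer.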
With this change we reduced the heap from 12 GB to 3 GB while keeping the system at the same speed. We still had a 2-second full GC pause every few hours, but that was much better than the 20-second pauses (and crashes) we had seen before.
Optimize GC again
But the story does not end there. Looking at the GC logs, we noticed a large number of premature tenurings, at a rate of about 70%. Although performance was acceptable, we tried to tackle this too, hoping it might get rid of the full GCs at the same time.
If an object has a very short life cycle but is nevertheless promoted to the old generation, we call this premature tenuring (or premature promotion). Old-generation objects are usually long-lived and bulky, and the old generation uses a different GC algorithm than the young generation; prematurely promoted objects take up old-generation space and therefore hurt GC performance. That is why we try to avoid premature promotion as much as possible.
Our application creates lots of short-lived objects during indexing, so some premature tenuring is normal, but it should not have been this severe. When an application creates lots of short-lived objects, the first idea that comes to mind is simply to enlarge the young generation. By default, G1 sizes the young generation automatically, letting it use between 5% and 60% of the heap. I had noticed that in the live application the ratio of young to old generation kept changing over a very wide range, but I still went ahead and set two parameters, -XX:G1NewSizePercent=40 and -XX:G1MaxNewSizePercent=90, to see what would happen. This did not work, and it even made things worse: the application hit a full GC right after startup. I tried other ratios too, but the best case was merely increasing G1MaxNewSizePercent without touching the minimum. That worked, performing about the same as the defaults, but no better.
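One detail worth adding (it is not mentioned in the article): on HotSpot JDK 8, G1NewSizePercent and G1MaxNewSizePercent are experimental flags and must be unlocked explicitly, so the experiment above would have required an invocation roughly like the following. The heap size is taken from the 3 GB figure mentioned earlier and the jar name is invented.

```
# Hypothetical invocation for the young-generation sizing experiment.
java -Xms3g -Xmx3g -XX:+UseG1GC \
  -XX:+UnlockExperimentalVMOptions \
  -XX:G1NewSizePercent=40 \
  -XX:G1MaxNewSizePercent=90 \
  -jar adventory.jar
```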
After trying many things without much success, I gave up and sent an email to Kirk Pepperdine, a well-known Java performance expert whom I had happened to meet at a training course during the Devoxx conference organized by Allegro. After looking at the GC logs and exchanging a few emails, Kirk suggested trying the setting -XX:G1MixedGCLiveThresholdPercent=100. That setting should force G1's mixed GCs to clean all old-generation regions regardless of how full they are, and thereby also collect objects that had been prematurely promoted from the young generation. That should keep the old generation from filling up and triggering a full GC. However, after running for a while, we were surprised to see full GCs again. Kirk reasoned that he had seen this in other applications and that it was a bug in G1: the mixed GCs apparently did not clean up all the garbage, letting it pile up until a full GC occurred. He said he had reported the problem to Oracle, but they insisted that the behaviour we observed was not a bug but normal.
Conclusion
In the end, all we did was increase the application's memory a bit (from 3 GB to 4 GB), and the full GCs disappeared. We still see a lot of premature tenuring, but since performance is fine, we no longer care. One option we could try is switching to the CMS (Concurrent Mark Sweep) collector, but since it is deprecated, we would rather not use it.
So what is the moral of this story? First, performance problems can easily lead you astray. At first it looked like a ZooKeeper or network problem, but in the end it was our own code. Even once I realized this, my first steps were not well thought out: to prevent the full GCs, I started tuning GC before checking what was actually going on. This is a common trap, so remember: even if you have a gut feeling about what to do, check what is happening first, and then check it again, so you do not waste time on the wrong problem.
Second, performance problems are hard. Our code had good test coverage and ran very well, yet it failed to meet performance requirements that had not been clearly defined at the start. Performance problems did not surface until long after deployment. Since it is usually difficult to faithfully reproduce your production environment, you are often forced to test performance in production, even though that sounds like a terrible idea.
Third, fixing one problem can surface another, latent one, forcing you to dig much deeper than you expected. The fact that we had no back pressure was enough to break the system, but it did not show up until we had fixed the memory leak.