
Case Analysis of an Outage Caused by JVM FullGC

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This article presents a detailed case analysis of an outage caused by a JVM FullGC. Interested readers can use it as a reference; I hope you find it helpful.

Business scenario introduction

First, a brief sketch of the online production system's background. Since this article uses it only as a case study, much of the business context has been stripped away.

Simply put, this is a distributed system in which system A must transfer a very core, critical piece of data over the network to another system, B.

An obvious question arises in this design: if system A has just handed the core data to system B, and system B then inexplicably goes down, won't the data be lost?

Therefore, the architecture of this distributed system adopts the classic Quorum algorithm.

Simply put, system B must be deployed on an odd number of nodes: at least three machines, or five, seven, and so on.

Then, every time system A transmits a piece of data, it sends a request carrying a copy of that data to every machine on which system B is deployed.

To determine whether a write to system B has succeeded, system A must receive successful responses from at least a Quorum of system B's machines within a specified time window.

For example, suppose system B is deployed on three machines. Its Quorum is then 3 / 2 + 1 = 2 (integer division). In general, system B's Quorum is: total machines / 2 + 1.

So if system B is deployed on three machines in total, system A can consider a piece of core data successfully written only once it has received successful responses from two of system B's machines within the specified time. This is called the Quorum mechanism.
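The quorum arithmetic above can be captured in a one-line helper. This is a minimal sketch (the class and method names are invented for illustration):

```java
// Minimal sketch of the quorum calculation: total machines / 2 + 1,
// using Java's integer division. Names are illustrative assumptions.
public class QuorumMath {
    // Minimum number of successful responses system A must collect.
    static int quorum(int totalNodes) {
        return totalNodes / 2 + 1;
    }

    public static void main(String[] args) {
        System.out.println(quorum(3)); // 2
        System.out.println(quorum(5)); // 3
        System.out.println(quorum(7)); // 4
    }
}
```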

In other words, in a distributed architecture, if one system wants to guarantee that data it transmits to another system will not be lost, it must receive responses from a majority (a Quorum) of the other system's machines within a specified time window.

This mechanism is widely used in many distributed systems and middleware. Our online distributed system also uses the Quorum mechanism to transfer data between the two systems.

[Figure: system A sends a copy of the data to every node of system B and waits for a Quorum of successful responses.]

Next, let's look at what this Quorum write mechanism looks like at the code level.

PS: the real mechanism involves a great deal of low-level network transmission, communication, fault tolerance, and optimization; the code below has been greatly simplified to convey only the core idea.
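The original code listing did not survive extraction, so the following is a hypothetical, heavily simplified reconstruction of the Quorum write the article describes: sender threads, an atomic ack counter, and a deadline-bounded wait loop. All class, method, and node names are invented for illustration:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the Quorum write described in the text.
// sendToNode() and all names here are illustrative assumptions.
public class QuorumWriter {
    static final long EXPIRE_TIME_MS = 30_000; // deadline for collecting responses

    // Successful acks from system B's nodes, updated by the sender threads.
    final AtomicInteger ackCount = new AtomicInteger(0);

    boolean write(byte[] data, List<String> nodesOfSystemB) throws InterruptedException {
        int quorum = nodesOfSystemB.size() / 2 + 1;

        // Asynchronously send a copy of the data to every node of system B.
        for (String node : nodesOfSystemB) {
            new Thread(() -> {
                if (sendToNode(node, data)) { // network call, simplified away
                    ackCount.incrementAndGet();
                }
            }).start();
        }

        // Spin until a Quorum of responses arrives or the deadline passes.
        long deadline = System.currentTimeMillis() + EXPIRE_TIME_MS;
        while (ackCount.get() < quorum) {
            if (System.currentTimeMillis() > deadline) {
                // Treat system B's cluster as failed: system A exits.
                System.exit(1);
            }
            Thread.sleep(1000); // pause 1s before re-checking
        }
        return true;
    }

    boolean sendToNode(String node, byte[] data) {
        return true; // placeholder for the real network transmission
    }

    public static void main(String[] args) throws InterruptedException {
        boolean ok = new QuorumWriter().write(new byte[]{42}, List.of("b1", "b2", "b3"));
        System.out.println(ok); // true here, since every stubbed send succeeds
    }
}
```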

Even in this heavily streamlined form, the core idea is clear; read it through a couple of times and it is easy to understand.

The code does something very simple: it starts threads to asynchronously send the data to all of system B's machines, then enters a while loop waiting for a Quorum of system B's machines to return successful responses.

If the expected number of responses has not arrived by the specified timeout, system A concludes that system B's cluster has failed and exits directly, which amounts to system A taking itself down.

That's what the whole code means!

The problem emerges

The code is not hard to read; the problem is that running it online is never as easy as it looks when you write it.

At one point, while the online production system was running with a perfectly stable overall load and nothing should have gone wrong, we suddenly received an alert: system A had shut down.

Troubleshooting began. After checking back and forth, we found that system B's cluster was fine; nothing seemed wrong there.

System A likewise showed nothing wrong anywhere else. Finally, by combining system A's application logs with its JVM FullGC garbage-collection logs, we worked out the specific cause of the failure.

Locating the problem

The cause turned out to be simple: after running online for a while, system A would occasionally perform a long Stop-the-World JVM FullGC, i.e., a large-scale garbage collection.

During such a pause, all worker threads inside system A stall and stop doing work; they do not resume until the FullGC finishes.

Look again at the timeout check inside the wait loop:
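The snippet the article refers to is missing, so here is a small, self-contained demo (all names are hypothetical) that simulates the failure mode: a long pause inside the wait loop pushes the elapsed time past the deadline, so the timeout branch fires even though responses would have arrived:

```java
// Self-contained demo of the failure mode. The Thread.sleep() inside the
// loop stands in for a Stop-the-World FullGC pause; names are illustrative.
public class GcPauseDemo {
    static boolean waitForQuorum(long expireTimeMs, long simulatedPauseMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + expireTimeMs;
        int acks = 0, quorum = 2;
        while (acks < quorum) {
            Thread.sleep(simulatedPauseMs); // simulated GC pause each iteration
            if (System.currentTimeMillis() > deadline) {
                return false; // the real code called System.exit(1) here
            }
            acks++; // pretend one ack arrives per iteration
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitForQuorum(5000, 100));  // true: short pauses, quorum reached
        System.out.println(waitForQuorum(1000, 3000)); // false: "GC pause" blew the deadline
    }
}
```

The second call shows the exact trap from the outage: nothing is wrong with system B, yet the deadline check misdiagnoses the pause as a cluster failure.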

System A's inexplicable downtime now makes sense: without the JVM FullGC, the if condition would never have been triggered.

Normally the loop pauses for one second, enters the next iteration, receives the Quorum of responses from system B, breaks out of the while loop, and the system continues running.

But a FullGC pause lasting several tens of seconds pushed the elapsed time past the deadline, inexplicably triggering the if branch, and system A exited and went down.

Long JVM FullGC pauses are truly one of the invisible killers of online system stability.

Solving the problem

The stability fix for the code above is also simple: add a check that monitors whether a JVM FullGC has occurred during the wait.

If a FullGC has occurred, automatically extend expireTime accordingly.

For example, consider the following improvement:
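The improved listing is also missing from the source, so below is one hypothetical way to implement the idea, using the standard `GarbageCollectorMXBean` API to count old-generation collections and pushing the deadline out whenever that count rises during the wait. The name filter for old-generation collectors is an assumption (bean names vary by GC algorithm), and the ack arrival is simulated:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical sketch: extend the deadline instead of exiting when a
// FullGC is observed mid-wait. Names are illustrative assumptions.
public class GcAwareWait {
    // Sums collection counts of old-generation collectors. The name
    // filter is an assumption; exact bean names depend on the GC in use.
    static long fullGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            String n = gc.getName();
            if (n.contains("Old") || n.contains("MarkSweep")) {
                total += gc.getCollectionCount();
            }
        }
        return total;
    }

    static boolean waitWithGcExtension(long expireTimeMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + expireTimeMs;
        long gcSeen = fullGcCount();
        int acks = 0, quorum = 2;
        while (acks < quorum) {
            long gcNow = fullGcCount();
            if (gcNow > gcSeen) {
                // A FullGC paused us: push the deadline out rather than exiting.
                deadline = System.currentTimeMillis() + expireTimeMs;
                gcSeen = gcNow;
            }
            if (System.currentTimeMillis() > deadline) {
                return false; // genuine timeout: system B really is unresponsive
            }
            Thread.sleep(100);
            acks++; // simulated ack arrival
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitWithGcExtension(5000));
    }
}
```

An alternative, collector-agnostic approach is a watchdog thread that sleeps for a fixed interval and measures how much longer it actually slept; a large overshoot indicates a Stop-the-World pause.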

With this improvement, the stability of the online system is effectively hardened: a JVM FullGC will no longer cause an arbitrary abnormal exit.

That concludes this case analysis of the outage caused by a JVM FullGC. I hope the content above has been helpful; if you think the article is good, feel free to share it so more people can see it.


© 2024 shulou.com SLNews company. All rights reserved.
