Source: http://t.cn/EI9JdBu
Anyone who has dealt with production issues has probably run into problems such as the system suddenly running slowly, CPU usage hitting 100%, or too many Full GCs. The visible symptoms of these problems are that the system runs slowly and generates a large number of alarms. For the problem of a slow-running system, this article provides a troubleshooting approach for locating the offending code and then working out how to fix it.
For a production system that suddenly becomes slow, if the problem makes the system unavailable, the first thing to do is export the jstack and memory information and then restart the system, so as to restore availability as quickly as possible. There are two main possible causes for this situation:
A large amount of data is read at some point in the code, exhausting the system's memory and causing excessive Full GCs, which makes the system slow.
There are CPU-intensive operations in the code, driving CPU usage too high and slowing the system down.
Relatively speaking, these are the two most frequent kinds of production problem, and they directly render the system unavailable. There are also several situations that make a particular feature slow without making the whole system unavailable:
A blocking operation at some point in the code makes the call time-consuming as a whole, but it occurs relatively randomly.
A thread enters the WAITING state for some reason; the feature as a whole becomes unavailable, but the problem cannot be reproduced.
Due to improper use of locks, multiple threads enter a deadlock, making the system slow as a whole.
For these three cases, looking at CPU and memory usage alone cannot reveal the specific problem, because they all involve some blocking operation: CPU and memory usage are not high, yet the function is very slow. Below we identify each of the above problems step by step by examining the system's logs.
1. Too many Full GCs
Relatively speaking, this situation is the most likely to occur, especially when new features go online. When Full GCs are too frequent, there are mainly two symptoms:
Multiple threads show high CPU usage, and through the jstack command we can see that these threads are mainly garbage collection threads.
Monitoring GC with the jstat command shows that the Full GC count is very high and keeps increasing.
First, we can use the top command to check the system's CPU usage. Here is an example from a system with high CPU usage:
top - 08:31:10 up 30 min,  0 users,  load average: 0.73, 0.58, 0.34
KiB Mem:   2046460 total,  1923864 used,   122596 free,    14388 buffers
KiB Swap:  1048572 total,        0 used,  1048572 free.  1192352 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    9 root      20   0 2557160 288976  15812 S  98.0 14.1   0:42.60 java
We can see a Java process whose CPU usage reaches 98%. We note its process id, 9, and use the following command to check the running status of each thread in that process:
top -Hp 9
The threads under this process run as follows:
top - 08:31:16 up 30 min,  0 users,  load average: 0.75, 0.59, 0.35
Threads:  11 total,   1 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.5 us,  0.6 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   2046460 total,  1924856 used,   121604 free,    14396 buffers
KiB Swap:  1048572 total,        0 used,  1048572 free.  1192532 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   10 root      20   0 2557160 289824  15872 R  79.3 14.2   0:41.49 java
   11 root      20   0 2557160 289824  15872 S  13.2 14.2   0:06.78 java
Here we can see the CPU usage of each thread in the Java process with pid 9. Next we can use the jstack command to find out why the thread with id 10 consumes the most CPU. Note that in the output of the jstack command, thread ids are shown in hexadecimal; you can convert them with the following command, or with any calculator:
root@a39de7e7934b:/# printf "%x\n" 10
a
The output shows that this thread will appear as 0xa in the jstack output. Through the jstack command we can see the following information:
"main" # 1 prio=5 os_prio=0 tid=0x00007f8718009800 nid=0xb runnable [0x00007f871fe41000]
Java.lang.Thread.State: RUNNABLE
At com.aibaobei.chapter2.eg2.UserDemo.main (UserDemo.java:9)
"VM Thread" os_prio=0 tid=0x00007f871806e000 nid=0xa runnable
Here the VM Thread line ends with nid=0xa, where nid means the operating system thread id, and VM Thread is the thread that performs garbage collection. At this point we can basically confirm that the main reason the current system is slow is that garbage collection happens too frequently, leading to long GC pauses. We can check the GC situation with the following command:
root@8d36124607a0:/# jstat -gcutil 9 1000 10
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00   0.00   0.00  75.07  59.09  59.60   3259    0.919   6517   7.715    8.635
  0.00   0.00   0.00   0.08  59.09  59.60   3306    0.930          7.822    8.752
  0.00   0.00   0.00   0.08  59.09  59.60   3351    0.943   6701   7.924    8.867
  0.00   0.00   0.00   0.08  59.09  59.60   3397    0.955   6793   8.029    8.984
As you can see, FGC here is the number of Full GCs, which has reached 6793 and is still growing. This further confirms that the system is slow because memory is being exhausted. Having confirmed that, how do we find out which objects are causing it? We can dump the heap and examine it with Eclipse's MAT tool. The following shows an object tree structure produced by that analysis:
After analysis with the MAT tool, we can basically determine which objects consume the most memory, then find where those objects are created and deal with them. The most common object here is PrintStream, but we can also see that it accounts for only 12.2% of memory. In other words, it is not enough to cause this many Full GCs, so we need to consider another case: an explicit System.gc() call in our code or in a third-party dependency. In that case, we can tell by checking the GC log, because the GC cause is printed there:
[Full GC (System.gc()) [Tenured: 262546K->262546K(349568K), 0.0014879 secs] 262546K->262546K(506816K), [Metaspace: 3109K->3109K(1056768K)], 0.0015151 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
[GC (Allocation Failure) [DefNew: 2795K->0K(157248K), 0.0001504 secs] [Tenured: 262546K->402K(349568K), 0.0012949 secs] 265342K->402K(506816K), [Metaspace: 3109K->3109K(1056768K)], 0.0014699 secs] [Times: user=0.00
For example, the first GC here was triggered by an explicit call to System.gc(), while the second was initiated by the JVM. To sum up, there are two main causes of excessive Full GCs:
The code obtains a large number of objects at once, causing memory to run out. In this case you can use Eclipse's MAT tool to check which objects dominate the heap.
Memory usage is not particularly high, yet Full GCs are still frequent. In this case an explicit System.gc() call, in our own code or in a third-party dependency, may be triggering the extra GCs; this can be suppressed by adding -XX:+DisableExplicitGC, which makes the JVM ignore explicit GC requests (see the sketch after this list).
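To make the second cause concrete, here is a minimal Java sketch. It is not taken from the article, and the class name is made up: explicit System.gc() calls produce "Full GC (System.gc())" log entries like the one shown above. Running it with -verbose:gc prints those entries; running it again with -XX:+DisableExplicitGC makes them disappear.

// Minimal sketch (not from the article): explicit System.gc() calls trigger Full GCs.
// Run with:          java -verbose:gc ExplicitGcDemo
// Compare against:   java -verbose:gc -XX:+DisableExplicitGC ExplicitGcDemo
public class ExplicitGcDemo {
    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            byte[] garbage = new byte[10 * 1024 * 1024]; // allocate some short-lived data
            System.gc();        // explicit GC request -> "Full GC (System.gc())" in the log
            Thread.sleep(500);
        }
    }
}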
2. CPU usage is too high
In the first section above we mentioned that high CPU may simply be the result of frequent Full GCs slowing the system down. But we also often encounter time-consuming calculations in the code that drive CPU usage too high, and in that case the way to investigate is very similar. First, use the top command to see which process is consuming too much CPU and note its process id; then use top -Hp to see which threads in that process have high CPU usage. Generally speaking, sustained usage above 80% is on the high side, while around 80% or below is reasonable. This gives us the ids of the CPU-hungry threads. Finally, convert the thread id to hexadecimal and look up that thread's stack in the jstack output.
At this point we can distinguish whether the high CPU is caused by too many Full GCs or by time-consuming calculations in the code. If it is too many Full GCs, the busy thread obtained through jstack will be something like VM Thread; if there is a time-consuming calculation in the code, we instead get the specific stack of an ordinary thread. The following is an example in which a time-consuming calculation in the code drives CPU usage too high:
Here you can see that while UserController is handling a request, the thread's CPU stays at 100% because the controller performs an expensive calculation. Based on the stack information, we can go straight to line 34 of UserController and see what in the code causes so much computation.
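The article does not show the UserController source, so as a purely hypothetical illustration, a tight CPU-bound loop like the one below is the kind of code that keeps a thread RUNNABLE at close to 100% CPU. Running top -Hp against such a process shows one pinned thread, and its jstack stack ends in this method.

// Hypothetical illustration (not the article's UserController): a CPU-bound loop
// keeps the thread RUNNABLE at ~100% CPU, and jstack points straight at this method.
public class CpuHeavyDemo {
    public static void main(String[] args) {
        double result = 0;
        while (true) {                       // never blocks, so the thread never leaves the CPU
            for (int i = 0; i < 1_000_000; i++) {
                result += Math.sqrt(i);      // stand-in for an expensive calculation
            }
        }
    }
}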
3. Intermittently slow interface calls
A typical example of this case is that a particular interface sometimes takes around 2 seconds to return. This situation is troublesome because it generally neither consumes much CPU nor occupies much memory, so it cannot be located with the two approaches above. Moreover, because the slowness occurs only from time to time, even if we capture thread stacks with jstack we cannot tell which thread is performing the time-consuming operation.
For interface slowness that occurs irregularly, our approach to locating it is basically as follows: first, find the interface and keep increasing the load on it with a stress-testing tool. If there is a time-consuming point somewhere in the interface, then because the request rate is very high, most threads will eventually block at that point, and many threads will show the same stack. From that we can basically locate the time-consuming code in the interface. The following is a thread stack log, captured while stress testing code that contains a relatively time-consuming blocking operation:
"http-nio-8080-exec-2" # 29 daemon prio=5 os_prio=31 tid=0x00007fd08cb26000 nid=0x9603 waiting on condition [0x00007000031d5000]
Java.lang.Thread.State: TIMED_WAITING (sleeping)
At java.lang.Thread.sleep (Native Method)
At java.lang.Thread.sleep (Thread.java:340)
At java.util.concurrent.TimeUnit.sleep (TimeUnit.java:386)
At com.aibaobei.user.controller.UserController.detail (UserController.java:18)
"http-nio-8080-exec-3" # 30 daemon prio=5 os_prio=31 tid=0x00007fd08cb27000 nid=0x6203 waiting on condition [0x00007000032d8000]
Java.lang.Thread.State: TIMED_WAITING (sleeping)
At java.lang.Thread.sleep (Native Method)
At java.lang.Thread.sleep (Thread.java:340)
At java.util.concurrent.TimeUnit.sleep (TimeUnit.java:386)
At com.aibaobei.user.controller.UserController.detail (UserController.java:18)
"http-nio-8080-exec-4" # 31 daemon prio=5 os_prio=31 tid=0x00007fd08d0fa000 nid=0x6403 waiting on condition [0x00007000033db000]
Java.lang.Thread.State: TIMED_WAITING (sleeping)
At java.lang.Thread.sleep (Native Method)
At java.lang.Thread.sleep (Thread.java:340)
At java.util.concurrent.TimeUnit.sleep (TimeUnit.java:386)
At com.aibaobei.user.controller.UserController.detail (UserController.java:18)
As you can see from the log above, multiple threads are blocked at line 18 of UserController, indicating that this is the blocking point and the reason the interface is slow.
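The UserController source itself is not shown in the article; the following hypothetical sketch (class and method names are illustrative) reconstructs the shape suggested by the stacks above: a single blocking call, here a sleep standing in for a slow downstream dependency, is enough to make every stressed worker thread pile up on the same line in TIMED_WAITING.

import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the kind of handler the stacks above point at: a blocking
// call inside the request method shows up as TIMED_WAITING in every worker thread.
public class UserControllerSketch {

    public String detail() throws InterruptedException {
        TimeUnit.SECONDS.sleep(2);   // blocking point; under load all request threads queue up here
        return "user detail";
    }
}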
4. A thread enters the WAITING state
This situation is relatively rare, but it does happen, and because of its "non-reproducibility" it is very hard to find during an investigation. I once encountered a similar case. The specific scenario was this: when using CountDownLatch, the main thread is not awakened to continue until every parallel task has finished. At the time, we used a CountDownLatch to control multiple threads that each connected to and exported a user's Gmail mailbox data. One thread connected to the user's mailbox, but the connection was left hanging by the server, so that thread kept waiting for a response. In the end, our main thread and several other threads were all stuck in the WAITING state.
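The author's original export code is not shown; the following minimal sketch (all names are my own) reproduces the shape of that incident: one worker hangs on a call that never returns and therefore never reaches countDown(), so the main thread stays parked in await() and appears in jstack as WAITING (parking).

import java.util.concurrent.CountDownLatch;

// Minimal sketch (not the author's original code): one worker never reaches
// countDown(), so the main thread stays parked in await() forever and appears
// in jstack as WAITING (parking).
public class LatchHangDemo {
    public static void main(String[] args) throws InterruptedException {
        int workers = 3;
        CountDownLatch latch = new CountDownLatch(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    if (id == 2) {
                        Thread.sleep(Long.MAX_VALUE); // simulates a connection the server never answers
                    }
                    // ... normal export work would happen here ...
                } catch (InterruptedException ignored) {
                }
                latch.countDown();                    // worker 2 never reaches this line
            }, "export-worker-" + id).start();
        }
        latch.await();                                // main thread blocks here -> WAITING (parking)
        System.out.println("all exports finished");   // never printed
    }
}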
For such a problem, readers who have looked through jstack logs will know that most threads in a production system are normally in the TIMED_WAITING state, and the problematic thread here is in exactly the same state, which makes it very easy to confuse our judgment. The main ideas for solving this problem are as follows:
Use grep to find all the threads in the TIMED_WAITING state in the jstack log and export them to a file, such as a1.log. The following is an example of an exported log file:
"Attach Listener" # 13 daemon prio=9 os_prio=31 tid=0x00007fe690064000 nid=0xd07 waiting on condition [0x0000000000000000] "DestroyJavaVM" # 12 prio=5 os_prio=31 tid=0x00007fe690066000 nid=0x2603 waiting on condition [0x0000000000000000] "Thread-0" # 11 prio=5 os_prio=31 tid=0x00007fe690065000 nid=0x5a03 waiting on condition [0x0000700003ad4000] "C1 CompilerThread3" # 9 daemon prio=9 os_prio=31 tid=0x00007fe68c00a000 nid=0xa903 waiting on condition [0x0000000000000000]
After waiting for a period of time, such as 10s, grep the jstack log again and export it to another file, such as a2.log. The result is as follows:
"DestroyJavaVM" # 12 prio=5 os_prio=31 tid=0x00007fe690066000 nid=0x2603 waiting on condition [0x0000000000000000] "Thread-0" # 11 prio=5 os_prio=31 tid=0x00007fe690065000 nid=0x5a03 waiting on condition [0x0000700003ad4000] "VM Periodic Task Thread" os_prio=31 tid=0x00007fe68d114000 nid=0xa803 waiting on condition
Repeat step 2. After exporting three or four files, compare them and find the user threads that appear in every file. These are almost certainly the threads stuck in a waiting state, because a normal request thread would not still be waiting after 20 to 30 seconds.
Having narrowed the list to these threads, we then examine their stacks. If a thread is supposed to be waiting, for example an idle thread in a user-created thread pool, its stack will contain no user-defined classes and it can be ruled out. The remaining threads are almost certainly the problematic ones we are looking for, and from their stack information we can see exactly which line of code caused the thread to wait.
Note that when deciding whether a thread is a user thread, we can look at the thread name at the front of the entry: frameworks generally name their threads in a very standardized way, so we can tell directly from the name that a thread belongs to some framework and basically exclude it. The rest, such as Thread-0 above and any custom thread names we recognize, are the ones we need to examine.
After this process, we can basically conclude that Thread-0 here is the thread we are looking for. By examining its stack information we can find out exactly where it is waiting; in the following example, line 8 of SyncTask is what causes the thread to wait.
"Thread-0" #11 prio=5 os_prio=31 tid=0x00007f9de08c7000 nid=0x5603 waiting on condition [0x0000700001f89000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
        at com.aibaobei.chapter2.eg4.SyncTask.lambda$main$0(SyncTask.java:8)
        at com.aibaobei.chapter2.eg4.SyncTask$$Lambda$1/1791741888.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)

5. Deadlock
Deadlocks are basically easy to detect, because jstack checks for them and prints the details of the deadlocked threads in its output. The following is an example of a jstack log showing a deadlock:
As you can see, at the bottom of the jstack output it directly tells us which deadlocks exist and gives the thread stack information for each. Here two user threads are each waiting for the other to release a lock, and the blocked location is line 5 of ConnectTask, so we can go straight to that location and analyze the code to find the cause of the deadlock.
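The ConnectTask source and the jstack screenshot are not reproduced here, so the following is a hypothetical sketch of the classic pattern behind such a report: two threads take the same pair of locks in opposite order, and jstack run against the process prints "Found one Java-level deadlock" at the bottom of its output together with both thread stacks.

// Hypothetical sketch (not the article's ConnectTask): two threads acquire the same
// pair of locks in opposite order; jstack run against this process reports
// "Found one Java-level deadlock" with both thread stacks.
public class DeadlockDemo {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (LOCK_A) {
                pause();
                synchronized (LOCK_B) { }     // waits forever for LOCK_B
            }
        }, "conn-task-1").start();

        new Thread(() -> {
            synchronized (LOCK_B) {
                pause();
                synchronized (LOCK_A) { }     // waits forever for LOCK_A
            }
        }, "conn-task-2").start();
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
    }
}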
6. Summary
This article has explained five situations that may cause a production system to slow down and analyzed in detail the symptoms of each, so that the cause can be identified from what is observed. Briefly, an online log analysis can be divided into the following steps:
Check CPU with the top command. If CPU is high, use top -Hp to check the running status of each thread in the process, find the thread with excessive CPU usage, convert its thread id to hexadecimal, and look up what that thread is doing in the jstack output. There are two cases here:
If it is a normal user thread, examine its stack information to see where in the user code it is consuming so much CPU.
If it is the VM Thread, monitor the GC state of the current system with jstat -gcutil, dump the current heap with jmap -dump:format=b,file=<filename> <pid>, load the dump into Eclipse's MAT tool for analysis, find out which objects consume the most memory, and then deal with the relevant code.
If the top command shows that CPU is not high and system memory usage is low, consider whether the problem is one of the other three situations and analyze it according to the specific symptoms:
If interface calls are intermittently slow, increase the load with a stress test so that the blocking point is hit more often, then examine the stack information through jstack to find the blocking point.
If a function suddenly stalls and the situation cannot be reproduced, export the jstack log several times and compare them to find which user threads have been waiting the whole time; these are the threads that may have the problem.
If jstack shows a deadlock, examine the specific blocking points of the two deadlocked threads and handle the problem accordingly.
This article has presented five common problems that make online functionality slow, along with ideas for troubleshooting them. Of course, online problems appear in many forms and are not limited to these situations; as long as we carefully analyze the scenario in which a problem occurs, we can analyze it according to the specific situation and solve the corresponding problem.