Locating a 1600% CPU usage problem: a summary of the troubleshooting process

2025-07-09 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/03 report

Cause

After a fairly large revision, the system went live. Testing in production turned up no problems, but the next day users reported that the system was lagging and dropping offline.

We checked the system, optimized the response time of the interfaces, and deployed again. Nothing looked wrong, yet the next day it was lagging again. This time we observed CPU usage at 1600%. Only then did we think to roll back, and we did not preserve the state of the failing process.

In the test environment we found that the CPU sat at 100% even when the system was idle, and we tracked down and fixed that problem. It was clear, however, that the 100% issue was not the cause of the 1600%.

We went live again and watched CPU usage manually in real time. The 1600% usage appeared again, and this time we listed the threads of the process that was using 1600% CPU.

6619, 6625, and so on were the IDs of the threads using the most CPU.

We printed the JVM stack information and wrote it to a file.

Converted to hexadecimal, 6619 is 19db. We searched the stack file for that thread ID.
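The lookup described above can be sketched in a few lines. This is only an illustration: the class and method names are made up, the two dump lines are hypothetical samples, and it assumes the thread dump prints the native thread ID as a hexadecimal `nid=0x...` field, as the dump in this article did.

```java
import java.util.ArrayList;
import java.util.List;

final class NidLookup {
    /** Formats a decimal native thread ID as a hex nid value, e.g. 6619 -> "0x19db". */
    static String toNid(long threadId) {
        return "0x" + Long.toHexString(threadId);
    }

    /** Returns the dump lines whose nid field matches the given decimal thread ID. */
    static List<String> findThreads(List<String> dumpLines, long threadId) {
        String needle = "nid=" + toNid(threadId);
        List<String> hits = new ArrayList<>();
        for (String line : dumpLines) {
            // Compare whole tokens so that nid=0x19db does not also match nid=0x19dbf.
            for (String token : line.trim().split("\\s+")) {
                if (token.equals(needle)) {
                    hits.add(line);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, not taken from the real incident.
        List<String> dump = List.of(
            "\"GC task thread#0 (ParallelGC)\" os_prio=0 tid=0x00007f0000000001 nid=0x19db runnable",
            "\"main\" #1 prio=5 os_prio=0 tid=0x00007f0000000002 nid=0x19c0 waiting on condition");
        System.out.println(toNid(6619)); // prints 0x19db
        findThreads(dump, 6619).forEach(System.out::println);
    }
}
```

In practice the dump lines would come from the stack file written in the previous step, for example by reading it with `Files.readAllLines`.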

In the end, every thread with high CPU usage turned out to be a GC thread. At that point we could conclude that some code path was using too much memory, or that there was a memory leak.
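A complementary check, not used in the original investigation, is to ask the JVM itself how much time it has spent collecting. The sketch below uses the standard `java.lang.management` beans; the class name `GcShare` is made up, and the result is an approximate share of wall-clock uptime rather than of CPU time, so treat it as a rough signal only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcShare {
    /** Sums the approximate accumulated GC time (ms) over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 means the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Rough fraction of JVM uptime spent in GC since startup. */
    static double gcFraction() {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime <= 0 ? 0.0 : (double) totalGcMillis() / uptime;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("GC share of uptime: " + gcFraction());
    }
}
```

A share that keeps climbing while the heap stays nearly full is consistent with the memory pressure described above.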

Finding the problem

By now the release had failed three times in a row, and testing in production was no longer an option. The plan was to reproduce the behavior in a grayscale environment and then dump the heap, which would surely reveal the cause.

On the first day, we routed a small amount of traffic from other systems and a small share of user traffic to the grayscale environment. No problems appeared.

On the second day, traffic from other systems stayed the same and more user traffic was added. The problem did not recur.

On the third day, we increased some of the traffic from other systems. The problem did not recur.

On day n, we increased the traffic from other systems and reduced the memory allocation. The problem did not recur.

On day n + 1, the grayscale environment split traffic evenly with the production environment, and we kept adding users. Still no problem.

By this point the modified part of the core-process code had been reviewed n times without finding anything wrong.

So the question became: why does the grayscale environment work while production fails? What is different about their users?

We then noticed that every user in the grayscale environment had the lowest privilege level, and that administrators were not working in the grayscale environment. This felt very close to the truth. The problem was essentially located, and we only needed to verify the conjecture.

One feature lets a user view the data of the people they manage. We had not looked in this direction at first because it is not a core feature and receives very few requests.

Its logic is: find the user's subordinates and, if any exist, keep searching down the hierarchy. The database happened to contain one abnormal record: a user whose subordinate was himself. That produced an infinite loop, which kept accumulating data in memory.

And only that one abnormal user could trigger it!

Because the operation is IO-intensive, the loop itself used very little CPU, so it did not show up among the high-CPU threads in the thread stack.
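To make the failure mode concrete, here is a minimal sketch of a subordinate lookup that tolerates such a record. It is not the article's actual code: the class name is made up, and a `Map` of user ID to direct-report IDs stands in for the database query. The guard is a visited set, so a user who reports to himself (or any longer cycle) is expanded only once.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class Subordinates {
    /** Collects all direct and indirect subordinates of root, expanding each user at most once. */
    static Set<Long> collect(Map<Long, List<Long>> directReports, long root) {
        Set<Long> visited = new LinkedHashSet<>();
        visited.add(root);
        Deque<Long> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            long current = pending.pop();
            for (long child : directReports.getOrDefault(current, List.of())) {
                // add() returns false for users already seen, which breaks any cycle.
                if (visited.add(child)) {
                    pending.push(child);
                }
            }
        }
        visited.remove(root); // the root is not its own subordinate
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical data: user 3 is recorded as his own subordinate.
        Map<Long, List<Long>> reports = Map.of(
            1L, List.of(2L, 3L),
            3L, List.of(3L));
        System.out.println(collect(reports, 1L));
    }
}
```

Without the `visited` check, the entry for user 3 would be pushed again on every pass and the pending collection would grow without bound, which matches the memory growth described above.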

Solve the problem

Once the cause is known, the fix is straightforward, so it is not described in detail here.

Reflection

The first time the system lagged, the correct way to handle it would have been:

CPU usage is found to be high. Check what the threads of that process are doing.

A large number of threads are found to be performing GC. Dump the heap at this point.

Use a tool such as jmap to see which objects occupy the most memory.

Find the corresponding code and fix the problem.
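The heap dump step can also be triggered from inside the JVM, which is handy when shell access to the machine is limited. The sketch below is an assumption-laden illustration rather than what was done in this incident: it relies on the HotSpot-specific `com.sun.management.HotSpotDiagnosticMXBean`, and the class name, file naming, and use of a temporary directory are made up.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumper {
    /** Writes a heap dump into a fresh temporary directory and returns the file path. */
    static Path dumpToTempDir(boolean liveOnly) {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            // The target file must not exist yet; recent JDKs also require the .hprof suffix.
            Path file = dir.resolve("heap-" + System.currentTimeMillis() + ".hprof");
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // liveOnly = true dumps only reachable objects.
            bean.dumpHeap(file.toString(), liveOnly);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = dumpToTempDir(true);
        System.out.println("Heap dump written to " + file);
    }
}
```

The resulting `.hprof` file can then be opened in a heap analyzer to see which objects hold the most memory. On a large production heap the dump can pause the application and produce a file roughly the size of the heap, so free disk space should be checked first.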

A bug like this should not exist in the first place. But when a problem does occur, do not panic: quickly preserve whatever information can be preserved.

Big changes should go through a grayscale release before full launch, so that a small number of users try them first.
