This article introduces how to troubleshoot abnormal CPU and load problems on Linux. It is meant as a practical reference; interested readers are welcome to follow along, and I hope you learn a lot from it.
1. Top command
Since we are talking about cpu and load, we first need a way to observe them. Without monitoring we have no idea what cpu and load look like, and everything else is out of the question.
The top command is the most common way to check cpu and load. On the Ubuntu system installed in my virtual machine, run top (by default it refreshes every 3 seconds; -d specifies a different refresh interval):
The table below explains the meaning of each part of the output in detail, with the important fields highlighted in red and bold. The memory and SWAP rows have the same output format, so they are described together.
2. How cpu is calculated
When we run top, the values (mainly cpu and load) change constantly, so it is worth taking a brief look at how cpu is calculated on a Linux system.
Cpu statistics are split into system-wide cpu and per-process / per-thread cpu. The system-wide cpu counters live in /proc/stat (the screenshot below is truncated):
The numbers that follow cpu, cpu0 and so on correspond to us, sy, ni and the other fields we saw earlier. Which number maps to which field is not important here; if you are interested, look it up in the kernel documentation.
The cpu counters of a process live in /proc/{pid}/stat:
The cpu counters of a thread live in /proc/{pid}/task/{threadId}/stat:
All of these values are cumulative from system boot up to the current moment. Therefore, the usual way to compute cpu usage is to take two samples at times T1 and T2 that are sufficiently close together:
Sum all the cpu time fields at T1 to get S1
Sum all the cpu time fields at T2 to get S2
S2 - S1 is the total cpu time totalCpuTime over the interval
idle2 (from the second sample) - idle1 (from the first sample) is the idle time over the interval
CPU usage = 100 * (totalCpuTime - idle) / totalCpuTime
Other figures such as us, sy and ni are calculated in a similar way. To sum up, the cpu value reflects cpu usage over a sampling interval. So when cpu is high but the thread stacks you print show the high-cpu thread waiting on a database query, do not be surprised: cpu is a statistic over the sampling interval, not an instantaneous snapshot.
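To make the formula concrete, here is a minimal sketch (not from the original article) that samples the aggregate cpu line of /proc/stat twice and applies the calculation above; it assumes the standard /proc/stat layout, where the fourth numeric field on the cpu line is idle time.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Sketch of the sampling calculation described above, for a Linux /proc/stat whose
// first line looks like "cpu  user nice system idle iowait irq softirq ...".
public class CpuUsageSample {

    // Returns {totalJiffies, idleJiffies} parsed from the aggregate "cpu" line.
    private static long[] readCpuLine() throws Exception {
        String cpuLine = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        long[] fields = Arrays.stream(cpuLine.trim().split("\\s+"))
                .skip(1)                       // skip the "cpu" label
                .mapToLong(Long::parseLong)
                .toArray();
        long total = Arrays.stream(fields).sum();
        long idle = fields[3];                 // 4th numeric field is idle time
        return new long[]{total, idle};
    }

    public static void main(String[] args) throws Exception {
        long[] t1 = readCpuLine();             // sample S1 / idle1
        Thread.sleep(1000);                    // a sufficiently short interval
        long[] t2 = readCpuLine();             // sample S2 / idle2

        long totalCpuTime = t2[0] - t1[0];     // S2 - S1
        long idleTime = t2[1] - t1[1];         // idle2 - idle1
        double usage = 100.0 * (totalCpuTime - idleTime) / totalCpuTime;
        System.out.printf("cpu usage over the last second: %.1f%%%n", usage);
    }
}
```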
If top shows that user-space cpu (us) stays high over a period of time, it means user programs kept occupying the cpu during that period.
3. Understanding load
As for what load means, some articles compare it to traffic driving across a bridge, which is a fairly apt and easy way to understand it:
A single-core processor can be pictured as a single-lane bridge: vehicles cross one after another, and a car can only get on once the car in front has passed.
If there is no traffic ahead, you cross smoothly; if there are many cars, you have to wait for the ones in front of you.
So we need some numbers to describe the current traffic flow, for example:
Equal to 0.00 means there is currently no traffic on the bridge. In fact anything between 0.00 and 1.00 is the same story: traffic is flowing freely and vehicles can cross without waiting at all.
Equal to 1.00 means the bridge is exactly at its carrying capacity. Things are still fine for now, but any more traffic and crossing will get slower and slower.
Greater than 1.00 means the bridge is overloaded and traffic is seriously congested. How bad is it? A value of 2.00 means the traffic is twice what the bridge can carry, i.e. a whole extra bridge's worth of vehicles is waiting anxiously.
But a metaphor is only a metaphor. It tells us that load represents a kind of system capacity, but not which tasks get counted into it. To see exactly what is included in the load calculation, run man uptime and read Linux's own explanation of load:
Roughly speaking, the system load is the average number of processes in the running or uninterruptible state. A running process is either using the cpu or waiting to use the cpu; an uninterruptible process is waiting for IO, such as disk IO. The load average is reported over three intervals, which are the 1-minute, 5-minute and 15-minute figures we see. The load value has to be read relative to the number of cpu cores: load=1 on a single-core cpu means the system is fully loaded the whole time, while load=1 on a 4-core cpu means the system is 75% idle.
Note in particular that load is a figure for all cores combined, which is different from how the cpu value is reported.
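As a small illustration (not from the original article), the load figure can be normalised by the core count before judging it. getSystemLoadAverage() below returns the same 1-minute value as uptime and top, and returns a negative value on platforms where it is unavailable.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// The same load figure means different things depending on the number of cores,
// so divide by the core count before deciding whether it is alarming.
public class LoadPerCore {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load1m = os.getSystemLoadAverage();   // same value as the first field of uptime
        int cores = os.getAvailableProcessors();
        System.out.printf("load(1m)=%.2f over %d cores -> %.0f%% of capacity%n",
                load1m, cores, 100.0 * load1m / cores);
    }
}
```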
Another important point found while reading up on this: although the wording above keeps saying "process", the threads inside a process are counted as if they were separate processes. If a process spawns 1000 threads that are all runnable at the same time, the run queue length is 1000 and the load average is 1000.
4. The relationship between the number of requests and load
I used to have a misconception: when thousands of requests arrive and queue up, the load must rise because the later requests have not been processed yet. On reflection this view is simply wrong, so I am writing it up as its own section to share.
Take Redis as an example. We all know that Redis uses a single-threaded model, which means that any number of requests may arrive at once, but only one command is processed at a time.
Image source: https://www.processon.com/view/5c2ddab0e4b0fa03ce89d14f
A separate thread receives ready commands and hands them to the event dispatcher, which executes the corresponding handling logic according to the command type. Because there is only this one thread, as long as enough commands are queued to keep it processing one command after another, the load comes out as roughly 1.
Looking back at the load value over this whole process, it has nothing to do with the number of requests. What load really depends on is the number of working threads: the main thread is a working thread, a timer thread is a working thread, GC threads are working threads. Load is a statistic counted per thread/process. No matter how many requests arrive, they ultimately have to be handled by threads, and the throughput of those working threads directly determines the final load value.
For example, suppose a service has a thread pool with a fixed size of 64 threads:
Normally a task takes 10ms to execute: a thread takes a task, finishes it within 10ms, and quickly returns to the pool to wait for the next one. At any moment very few threads are running or waiting on IO, so over a statistics period the load is very low.
During some period, because of a system problem, each task takes 10 seconds to process, which means every thread is busy processing its task the whole time. The load over that statistics period then works out to about 64 (ignoring anything outside these 64 threads).
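Here is a toy sketch of that scenario (not from the original article): 64 pool threads are kept permanently runnable with cpu-bound busy loops, and the 1-minute load average is printed as it climbs toward roughly 64 over the next minute or two. It deliberately never terminates, so only run it on a throwaway machine.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Keep 64 pool threads busy with cpu-bound work and watch the load average rise.
public class ThreadPoolLoadDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(64);
        for (int i = 0; i < 64; i++) {
            pool.execute(() -> {
                long x = 0;
                while (true) { x++; }          // a task that never finishes: always runnable
            });
        }
        while (true) {
            // getSystemLoadAverage() reports the same 1-minute figure as uptime/top.
            double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            System.out.printf("1-minute load average: %.2f%n", load);
            Thread.sleep(5000);
        }
    }
}
```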
So, in a word, understanding how the load value relates to the number of requests and to the number of threads is important; only then can we take the next step correctly.
5. Troubleshooting high load with high cpu
First, a point of view this article takes: high cpu is not in itself the problem; high load caused by high cpu is the problem, and load is the indicator by which we judge system capacity.
Why say that? Take a single-core cpu as an example. If our day-to-day cpu sits at 20% or 30%, we are actually wasting cpu resources: most of the time the cpu is doing nothing. In theory a system could push cpu utilisation to 100%, meaning the cpu is fully occupied with compute-intensive work such as for loops, md5 hashing, allocating objects and so on. In practice that is impossible, because almost no application is free of IO that consumes no cpu, such as reading from a database or reading files. So cpu is not "the higher the better"; usually there is an empirical threshold at which it should raise an alert.
Note the wording "raise an alert": high cpu is not necessarily a problem, but it does need a look, especially in quiet daytime periods, because daily traffic is usually small and should not be able to drive cpu that high. If ordinary code is simply handling normal business, fine. But if the code contains an infinite loop (for example, the classic problem caused by concurrent HashMap resizing in JDK 1.7), then a few threads will keep occupying the cpu and eventually push load up.
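For reference, a sketch of that classic misuse (my own illustration, not from the article): several threads racing to put into a plain HashMap. On JDK 1.7 the resize race can corrupt a bucket's linked list into a cycle, after which the affected threads spin forever at 100% cpu; the fix is ConcurrentHashMap. Do not run this in production.

```java
import java.util.HashMap;
import java.util.Map;

// Anti-pattern demo: concurrent puts into an unsynchronized HashMap.
public class HashMapSpinDemo {
    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>(2); // tiny capacity to force many resizes

        Runnable writer = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                map.put((int) (Math.random() * Integer.MAX_VALUE), i);
            }
        };

        // Several threads racing on the same unsynchronized map.
        for (int t = 0; t < 4; t++) {
            new Thread(writer, "unsafe-writer-" + t).start();
        }
        // If a thread gets stuck, top -H -p <pid> will show it pinned on one core,
        // and jstack will show it inside HashMap.put / transfer.
    }
}
```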
For a Java application, the routine for troubleshooting high cpu is usually simple and fairly fixed:
ps -ef | grep java, to find the pid of the Java application process
top -H -p <pid>, to find the thread pid that occupies the most cpu
Convert the decimal thread pid to hexadecimal, e.g. 2000 = 0x7d0 (a one-line conversion is sketched after this list)
jstack <pid> | grep -A 20 '0x7d0', find the thread whose nid matches, inspect its stack, and locate the cause of the high cpu
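For the conversion in step 3, a trivial helper (my own illustration) is enough:

```java
// Converts a decimal thread id from top -H into the hex nid that jstack prints.
public class TidToNid {
    public static void main(String[] args) {
        int tid = 2000;                                        // thread pid from top -H
        System.out.println("0x" + Integer.toHexString(tid));   // prints 0x7d0
    }
}
```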
Many articles online stop here, but in practice that is not enough. Because cpu is a statistic over a period of time while a jstack dump only records an instantaneous state, the two are not even in the same dimension, so it is entirely possible that the line numbers printed in the stack point at code like the following:
Network IO that does not consume cpu
for (int i = 0, size = list.size(); i < size; i++) { ... }
A call into a native method
If you follow the recipe above to the letter, this situation will leave you dumbfounded: you stare at it for a long time and cannot understand why such code could cause high cpu. For exactly this reason, when actually troubleshooting, it is recommended to run jstack five times, or at least three times, and locate the cause of the high cpu by analysing the stack contents together with the surrounding code. The high cpu may be caused by a bug somewhere in that code path rather than by the exact lines the stack happens to print.
There is another possible cause of high cpu. Suppose on a 4-core cpu server the total cpu reaches 100%; pressing 1 in top to show each cpu's us, we see only one core at about 90% while the others are around 1% (the screenshot below only illustrates the effect of pressing 1 in top; it is not a real scenario):
In this case, consider whether frequent FullGC is the cause. We know that FullGC includes a Stop The World phase: on a multi-core cpu server, every thread except the GC threads hangs until Stop The World ends. Take a few old-generation collectors as examples:
Serial Old collector: Stop The World for the whole collection
Parallel Old collector: Stop The World for the whole collection
CMS collector: Stop The World only during the initial mark and remark phases, in order to mark the objects to be reclaimed accurately; the pause time is greatly reduced compared with the previous two
Either way, when Stop The World actually happens, the GC threads are doing cpu work while all other threads hang, which naturally shows up as one cpu with high us and the other cpus with low us.
For FullGC problems, the troubleshooting routine is usually:
ps -ef | grep java, to find the pid of the Java application process
jstat -gcutil <pid> 1000 1000, print memory statistics every second, 1000 times in total; watch the utilisation of the old generation (O) and metaspace (MU) and the number of FullGCs
Once frequent FullGC is confirmed, check the GC log (the configured GC log path differs per application)
jmap -dump:format=b,file=<filename> <pid>, to preserve the scene with a heap dump
Restart the application to stop the bleeding quickly and avoid a bigger online incident
Analyse the dumped heap with a tool such as MAT to find out why FullGC is happening
If the FullGC only involves the old generation, a reasonably experienced developer can usually find the problem, which is typically some code bug. FullGC triggered by MetaSpace tends to be stranger and more obscure; many cases come down to misuse of an imported third-party framework, or a bug in that framework, and take much longer to track down.
So how does frequent FullGC end up changing load? I have not verified this with concrete data, but reasoning it through: if all other threads were idle and only the GC threads kept running FullGC, load would approach 1. In reality that cannot happen, because if no other threads were running, what would be triggering frequent FullGC? More realistically, other threads are processing tasks when Stop The World hits; their work stalls, tasks go unprocessed, and the more likely outcome is that load keeps rising.
Finally, a side note: we have been talking about FullGC, but frequent YoungGC can also drive load up. In one earlier case, code converting Object to xml and xml back to Object created a new XStream() at every call site; objects were allocated faster than they could be reclaimed, and the YoungGC count shot up.
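A sketch of that allocation pattern and its fix (assuming the XStream library is on the classpath; the class and method names around it are illustrative):

```java
import com.thoughtworks.xstream.XStream;

// Left: the costly habit of constructing a converter per call; right: reusing one instance.
public class XmlCodec {

    // Anti-pattern: a brand new XStream (and all of its internal converters/caches)
    // is allocated on every serialization, flooding the young generation.
    public static String toXmlWasteful(Object obj) {
        return new XStream().toXML(obj);
    }

    // Better: an XStream instance can generally be built once and shared.
    private static final XStream SHARED = new XStream();

    public static String toXml(Object obj) {
        return SHARED.toXML(obj);
    }

    public static Object fromXml(String xml) {
        return SHARED.fromXML(xml);
    }
}
```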
6. Troubleshooting high load with low cpu
From the discussion of load above, we can see the factors that push load up:
The thread is using cpu
The thread is waiting to use cpu
The thread is performing an uninterruptible IO operation.
Since cpu is not high but load is, threads must either be doing IO or waiting to use the cpu. I have my doubts about the latter, though. For example, if a thread pool has 10 threads and tasks come in very slowly, so that only one thread is in use at a time, the other 9 threads are simply waiting; they obviously occupy no system resources and consume no cpu, so I do not consider that case here.
So when cpu is low and load is high, IO is most likely the culprit: it makes tasks run for a long time, processing is delayed, and threads cannot return to the thread pool. Let us first briefly cover disk IO. Since wa represents the percentage of cpu time spent waiting for disk IO, we can check wa to see whether disk IO is the cause:
If it is, print the stack just as in the high-cpu case, look at the file IO parts of the stack, analyse them and find the reason, for example many threads reading a very large local file into memory.
That said, disk IO causing high load should be the minority of cases. Given how Java applications are typically built, the IO-heavy ones are mostly handling network requests, for example:
Get data from the database
Get data from Redis
Call an HTTP interface to get data from Alipay
Get data from another service through Dubbo
For this situation, I think the first thing is to be familiar with the dependencies in the whole system architecture; for example, here is a sketch I drew:
Any slow call to a dependency will raise the load of your own system. The recommended way to troubleshoot high load is:
Check the logs. Whether the call goes to HBase, MySQL or Redis, or out over http or dubbo, a call timeout or a connection-pool timeout will usually produce an error log. As long as the system does not catch the exception and silently swallow it without logging, you can generally find the relevant error there.
For dubbo and http calls, it is recommended to add monitoring instrumentation that records the interface name, the method input parameters (with their size kept under control), whether the call succeeded, and the call duration. Sometimes nothing actually times out, but a call that takes 2 or 3 seconds will still push load up, so in that case you need to look at call durations before deciding the next step.
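A minimal sketch of such an instrumentation point (my own illustration; the interface name, logging target and truncation limit are arbitrary choices, not from the article):

```java
import java.util.concurrent.Callable;

// Wraps a remote call, recording interface name, trimmed arguments, success and duration.
public class CallMonitor {

    public static <T> T timedCall(String interfaceName, String args, Callable<T> call) throws Exception {
        long start = System.nanoTime();
        boolean success = false;
        try {
            T result = call.call();
            success = true;
            return result;
        } finally {
            long costMs = (System.nanoTime() - start) / 1_000_000;
            // Control the size of logged input parameters, as suggested above.
            String shortArgs = args.length() > 200 ? args.substring(0, 200) + "..." : args;
            System.out.printf("rpc=%s args=%s success=%s cost=%dms%n",
                    interfaceName, shortArgs, success, costMs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Example usage with a stand-in for a dubbo/http call.
        String data = timedCall("UserService.getById", "id=42",
                () -> { Thread.sleep(120); return "user-42"; });
        System.out.println(data);
    }
}
```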
If the steps above still do not help, or the call sites are not instrumented, then fall back on the all-purpose stack dump: run jstack five or ten times in a row and see whether most of the stacks point at calls to the same interface. Where network IO is involved, the last few lines of the stack usually contain something like at java.net.SocketInputStream.read(SocketInputStream.java:129).
Summary of common causes of high load in Java applications
Having said all that, here are some possible causes of high load:
An infinite loop or an unreasonably heavy amount of looping; if it is not a loop, then at the speed of a modern cpu even a large chunk of code runs in an instant and consumes essentially nothing
Frequent YoungGC
Frequent FullGC
High disk IO
High network IO.
High system load usually comes down to a bug in some piece of code or improper use of an imported third-party jar. The first step is to distinguish whether the high load is driven by high cpu or by high IO, and then use the approach appropriate to that scenario to locate the problem.
When at a loss, print the stack with jstack and analyse it; you may well spot the cause at a glance.
Thank you for reading this article carefully. I hope this discussion of how to troubleshoot CPU and Load anomalies in Linux is helpful to you.