2025-01-30 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
In this issue, the editor walks you through a bug troubleshooting process that leads into the Linux kernel. The article is detailed and analyzes the problem from a professional point of view. I hope you get something out of it.
Writing code is only part of a programmer's work, and debugging often takes even more time than writing the code in the first place. Earlier articles covered systems, architecture, programming and other topics. This one shows a complete kernel-related bug troubleshooting process from start to finish.
Find a problem
One day the company's servers raised an alarm. After logging on to the machine, the blogger found that a process had gotten "stuck". Attaching GDB in the usual way got no response, and the logs offered no clues. The problem looked unsolvable.
Just then a string of famous detectives flashed through the blogger's mind: Ikkyu and Conan from Japan, Bao Qingtian and Di Renjie from China, Sherlock Holmes from abroad, and so on. There must be a way!
Analyze the problem
First, let's analyze carefully. The process appears to be stuck. If it is stuck in user mode (in an endless loop, for example), its CPU utilization will usually be very high. If it is stuck in kernel mode, it is most likely waiting on I/O or network communication, and its CPU utilization should be very low. The process ID can still be found, so run the top command against it and take a look.
Notice the CPU column: the process's CPU usage is 0%. It is consuming almost no CPU, which strongly suggests that it is stuck in kernel mode. A process enters kernel mode by making a system call, so it has most likely been suspended by the operating system inside a blocking system call. How do we find out which system call the process made?
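To make the "almost no CPU" check concrete, here is a minimal Python sketch of roughly what top measures for one process: it reads the CPU tick counters from /proc/PID/stat twice and converts the difference into a percentage. The function names are made up for this illustration, and the field positions (utime and stime as fields 14 and 15) follow the proc(5) manual page. This is not the command the blogger ran.

```python
import os
import time


def parse_cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from the text of /proc/PID/stat."""
    # The command name sits in parentheses and may contain spaces or ')',
    # so split on the last ')' before splitting the remaining fields.
    fields = stat_text.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """Convert a tick delta over an interval into a CPU percentage."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def sample_cpu_percent(pid, interval_s=1.0):
    """Sample /proc/PID/stat twice and return the CPU usage in between."""
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, os.sysconf("SC_CLK_TCK"))
```

A process spinning in user mode would report a value near 100 per busy core, while a process blocked in the kernel would stay near 0.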
Trace the process's system calls
This is exactly what the strace command is for. Run strace against the process to see which system calls it is making.
Oops! The strace command gets stuck too. With no help there, we have to think of another way.
Trace the process's user-mode runtime stack
Next, try the pstack command, which prints the runtime stack of a process. It cannot trace into the kernel, but it shows which function was called last in user mode, and from that you can infer which system call was made. Let's run it.
Like strace, pstack gets stuck.
Where else can we look for clues?
The old ps command never goes out of date
We can use the ps command to look at the running state of the process and its WCHAN (wait channel).
What does WCHAN mean?
In the Linux world, when you have a question, ask man. This is the all-knowing man command. Let's use it to see what the columns in the ps output mean:
$ man ps
Run man and search for "WCHAN". Aha! The meaning of WCHAN is given in the "STANDARD FORMAT SPECIFIERS" section.
It states clearly that WCHAN is the name of the kernel function in which the process is currently sleeping.
OK, let's run the ps command.
One thing is worth noting: ps prints the state of the process only at the moment ps runs, so each run is just a single sample. Run ps several times and make sure the result does not change. If you run it only once and the timing happens to be unlucky, you may well get a misleading clue.
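The "sample several times" advice can be sketched in Python as follows. This is only an illustration, not what the blogger ran: it reads the State: line from /proc/PID/status (one of the places the kernel exposes process state) a few times and counts what it sees. The function names, sample count and delay are arbitrary choices made for this example.

```python
import time
from collections import Counter


def parse_state(status_text):
    """Return the one-letter state code from the text of /proc/PID/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tD (disk sleep)".
            return line.split()[1]
    raise ValueError("no State: line found")


def sample_states(pid, samples=5, delay_s=0.5):
    """Read the process state several times and count each state seen."""
    seen = Counter()
    for _ in range(samples):
        with open(f"/proc/{pid}/status") as f:
            seen[parse_state(f.read())] += 1
        time.sleep(delay_s)
    return seen
```

If every sample comes back with the same state, the reading can be trusted more than a single run. A mix of states suggests the process is still making progress.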
Two blocking states of a process
The ps output shows that the process is in state D. What does state D mean? We ask man again and find the explanation.
State D means uninterruptible sleep: the process is sleeping and will not wake up even if you slap it. It does not respond to external signals, and even the kill command cannot kill it (unless the kernel has put it in a sleep that still allows fatal signals). To an observer, the process looks "stuck".
In contrast, interruptible sleep is shown as state S. In that state the process is blocked waiting for an event (such as the arrival of network data), but it can still receive signals, which means it is still responsive.
The ps command shows that the process is in state D, which further confirms that it really is "stuck". This is most likely also why GDB, strace and pstack hung: they attach through ptrace, and attaching needs the target to respond to a stop request.
So where is the process stuck?
Fortunately, the WCHAN column can tell you the answer.
Which kernel function is the process blocked in?
The WCHAN column in the ps output above shows rpc_wa. Hmm, rpc_wa what? The name has evidently been truncated, but that doesn't matter: we can get the full wchan value from the source. In fact, commands such as ps read their information from this same source, the proc file system, which exposes runtime information about the kernel and about every process. The simplest way is to cat /proc/PID/wchan, where PID is the process ID.
Aha, now we know where the process is stuck!
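The same lookup can be done from a script. A minimal Python sketch, assuming the usual /proc layout: read_wchan and its proc_root parameter are names invented for this example (proc_root exists only so the function can be tried against a scratch directory), and treating "0" or "-" as "not sleeping in the kernel" is an assumption about how the file reports that case.

```python
from pathlib import Path


def read_wchan(pid, proc_root="/proc"):
    """Return the untruncated wait-channel name for a process, or None."""
    # proc_root is a hypothetical parameter added for testing; on a real
    # system leave it at "/proc".
    text = Path(proc_root, str(pid), "wchan").read_text().strip()
    # Assumption: an empty value, "0" or "-" means the process is not
    # currently sleeping in a kernel function.
    return None if text in ("", "0", "-") else text
```

Unlike the ps column, the value read this way is not cut off to fit a fixed column width.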
It looks like the process is waiting on an RPC call. RPC (remote procedure call) is essentially one process communicating with another over the network. We now know where the process is stuck, but we still don't know why it is stuck there.
At this point, the trail seems to have gone cold.
Light at the end of the tunnel
Let's think it through again.
The process is stuck and is not running in user mode, so it must be in kernel mode. How does a process get into kernel mode? The obvious answer is by making a system call.
So how do we know which system call a process is currently in?
You're in luck: say hi to /proc/PID/syscall. Again a simple cat will do. The path is /proc, followed by the process ID, followed by syscall.
Huh? What on earth is this jumble of numbers?
It turns out that this seemingly inexplicable string describes the system call. The first number is the system call number. The values after it are the call's arguments, followed by the stack pointer and program counter, none of which we need to worry about here.
The output shows that system call number 262 is in progress. A bare number means little, though. Which system call does it stand for?
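For readers who would rather not eyeball the raw line, here is a small Python sketch that splits the contents of /proc/PID/syscall into named parts. The layout it assumes (system call number, six argument registers, then stack pointer and program counter, with the special cases "running" and -1) is the one described in the proc(5) manual page. The function name and dictionary keys are invented for this example.

```python
def parse_syscall(text):
    """Parse the contents of /proc/PID/syscall into a dict, or None."""
    fields = text.split()
    if not fields or fields[0] == "running":
        # The process is not blocked, so there is no system call to report.
        return None
    number = int(fields[0])
    if number == -1:
        # Blocked, but not inside a system call: only sp and pc follow.
        return {"number": -1, "args": [],
                "sp": int(fields[1], 16), "pc": int(fields[2], 16)}
    values = [int(v, 16) for v in fields[1:]]
    return {"number": number, "args": values[:6],
            "sp": values[6], "pc": values[7]}
```

Only the "number" entry matters for the next step. The arguments are raw register values and need the system call's signature to interpret.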
Look up the system call in the kernel headers
To find out what this number means, we need to consult the kernel source. On Linux systems the necessary kernel header files are generally located under /usr/include, and that is where the blogger found the relevant file on a 64-bit Linux machine.
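The header lookup can also be scripted. The sketch below assumes the header lists system calls as "#define __NR_name N" lines, which is how the x86-64 unistd headers are usually written. The article does not name the exact file, so header_path is left for the reader to supply, and the function names are made up for this illustration.

```python
import re

# Matches lines such as "#define __NR_example 123" (plain decimal values only).
DEFINE_RE = re.compile(r"^\s*#\s*define\s+__NR_(\w+)\s+(\d+)\b", re.MULTILINE)


def syscall_names(header_text):
    """Map system call numbers to names from '#define __NR_name N' lines."""
    return {int(num): name for name, num in DEFINE_RE.findall(header_text)}


def lookup_syscall(header_path, number):
    """Return the name of a system call number, or None if it is not listed."""
    with open(header_path) as f:
        return syscall_names(f.read()).get(number)
```

System call numbers differ between architectures, so the header has to match the machine where the number was observed.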
Gotcha! Number 262 is the newfstatat system call. What does this system call do? Let's ask man again:
$ man newfstatat
The manual page gives the answer.
Ah! It is really fstatat, which reads a file's metadata.
Now we know which system call was made, but a new question arises: why does making this call end with the process stuck waiting on an RPC?
Clearly we need the call stack to confirm what happened.
Trace the kernel runtime stack
OK, time to bring out a heavyweight tool: /proc/PID/stack. Simply reading this file (which usually requires root) shows the call stack of the process inside the kernel. Isn't the design of Linux great?
This kernel call stack finally reveals all the secrets.
The truth comes out
First, look at the top of the call stack. It is the same function that ps printed in the WCHAN column, and the process is stuck in the kernel because this function was called.
Next, at the bottom of the call stack we find the system call, which confirms that it is this system call, made by the process, that got stuck.
So what happens after the system call is made? Moving on, notice the lines in between.
Finally! The call stack contains a series of NFS-related functions. NFS stands for Network File System. Mounting a remote file system is commonly done through NFS, and it is NFS's communication over the network that leads to the wait on the RPC.
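Spotting the NFS frames by eye works for one stack. For many stuck processes, a small script may help. The Python sketch below assumes each line of /proc/PID/stack looks like "[<address>] function+0xOFFSET/0xSIZE [module]", with the module part optional. The module set and name prefixes used to flag frames are a heuristic chosen for this example, not an exhaustive list, and all names are hypothetical.

```python
import re

FRAME_RE = re.compile(
    r"^\[<[0-9a-fA-F]+>\]\s+"                 # address (may be shown as 0)
    r"(?P<func>[^\s+]+)"                      # symbol name
    r"(?:\+0x[0-9a-fA-F]+/0x[0-9a-fA-F]+)?"   # optional offset/size
    r"(?:\s+\[(?P<module>[^\]]+)\])?\s*$"     # optional [module]
)

# Heuristic assumption: frames from these modules indicate NFS/RPC activity.
NETWORK_FS_MODULES = {"nfs", "nfsv3", "nfsv4", "sunrpc"}


def parse_stack(stack_text):
    """Return (function, module) pairs, top of the stack first."""
    frames = []
    for line in stack_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m.group("func"), m.group("module")))
    return frames


def nfs_frames(frames):
    """Return the function names that look NFS- or RPC-related."""
    return [func for func, module in frames
            if module in NETWORK_FS_MODULES
            or func.startswith(("nfs_", "nfs4_", "rpc_"))]
```

A non-empty result from nfs_frames on a process in state D is a strong hint that it is waiting on a remote file system.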
The kernel call stack tells us that the process got stuck because of a network problem while querying the metadata of a file on a remote host.
With this clue, we finally pinned down the problematic code.
That is the kernel-related bug troubleshooting process the editor wanted to share. If you run into a similar puzzle, the analysis above may serve as a reference. If you want to know more, you are welcome to follow the industry information channel.
© 2024 shulou.com SLNews company. All rights reserved.