2025-02-23 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains how to diagnose an infinite loop (a "dead loop") in a user-mode process on a Linux system, following the analysis from the first symptoms to the confirmed cause.
1. Symptoms
The business process (a user-mode multithreaded program) hangs, the operating system responds slowly, and the system log shows nothing unusual. Judging from the kernel-mode stacks of the process, every thread appears to be stuck at the following point in kernel mode:
[root@vmc116 ~]# cat /proc/27007/task/11825/stack
[] retint_careful+0x14/0x32
[] 0xffffffffffffffff
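Checking one thread at a time with cat becomes tedious when a process has dozens of threads. The following is a minimal sketch, not part of the original investigation, that reads the kernel stack of every thread under /proc/<pid>/task and counts how many threads share the same topmost frame. The function names and the proc_root parameter are assumptions introduced for illustration; reading the stack files normally requires root.

```python
import os
from collections import Counter


def read_thread_stacks(pid, proc_root="/proc"):
    """Return {tid: [stack lines]} for every thread of the given process.

    proc_root is a hypothetical parameter so the function can be pointed
    at a copy of the directory tree; on a live system it is /proc.
    """
    task_dir = os.path.join(proc_root, str(pid), "task")
    stacks = {}
    for tid in sorted(os.listdir(task_dir), key=int):
        path = os.path.join(task_dir, tid, "stack")
        try:
            with open(path) as f:
                stacks[int(tid)] = [line.strip() for line in f if line.strip()]
        except OSError:
            # The thread may have exited, or we may lack permission.
            continue
    return stacks


def top_frames(stacks):
    """Count how many threads share each topmost stack frame."""
    return Counter(lines[0] if lines else "<empty>" for lines in stacks.values())
```

Calling top_frames(read_thread_stacks(27007)).most_common() on the affected machine would list each distinct top frame with the number of threads sitting on it, which makes it easy to see whether all threads really share one location.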
2. Problem analysis
1) Kernel stack analysis
According to the kernel stack, every thread is stopped at retint_careful, which is part of the interrupt return path. The assembly code, from entry_64.S, is as follows:
ret_from_intr:
	DISABLE_INTERRUPTS(CLBR_NONE)
	TRACE_IRQS_OFF
	decl PER_CPU_VAR(irq_count)
	/* Restore saved previous stack */
	popq %rsi
	CFI_DEF_CFA rsi,SS+8-RBP	/* reg/off reset after def_cfa_expr */
	leaq ARGOFFSET-RBP(%rsi), %rsp
	CFI_DEF_CFA_REGISTER rsp
	CFI_ADJUST_CFA_OFFSET RBP-ARGOFFSET
	...
retint_careful:
	CFI_RESTORE_STATE
	bt $TIF_NEED_RESCHED,%edx
	jnc retint_signal
	TRACE_IRQS_ON
	ENABLE_INTERRUPTS(CLBR_NONE)
	pushq_cfi %rdi
	SCHEDULE_USER
	popq_cfi %rdi
	GET_THREAD_INFO(%rcx)
	DISABLE_INTERRUPTS(CLBR_NONE)
	TRACE_IRQS_OFF
	jmp retint_check
This is the path taken when a user-mode process is interrupted and then returns from the interrupt. Disassembling around retint_careful+0x14/0x32 confirms that the stopping point is:
SCHEDULE_USER
This is a call to schedule(). On the interrupt return path, the process finds that it needs to be rescheduled (TIF_NEED_RESCHED is set), so a context switch happens here.
One question: why is there no stack frame for schedule() in the stack?
Because schedule() is called directly from assembly here, without the usual stack frame setup and context saving, so no frame for it is recorded.
2) State information analysis
According to top, the threads involved stay in the R (running) state the whole time, the CPU is almost fully consumed, and most of that time is spent in user mode:
[root@vmc116 ~]# top
top - 09:42:23 up 16 days, 2:21, 23 users, load average: 84.08, 84.30, 83.62
Tasks: 1037 total, 85 running, 952 sleeping, 0 stopped, 0 zombie
Cpu(s): 97.6%us, 2.2%sy, 0.2%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32878852k total, 32315464k used, 563388k free, 374152k buffers
Swap: 35110904k total, 38644k used, 35072260k free, 28852536k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27074 root 20 0 5316m 163m 14m R 10.2 0.5 321:06.17 z_itask_templat
27084 root 20 0 5316m 163m 14m R 10.2 0.5 296:23.37 z_itask_templat
27085 root 20 0 5316m 163m 14m R 10.2 0.5 337:57.26 z_itask_templat
27095 root 20 0 5316m 163m 14m R 10.2 0.5 327:31.93 z_itask_templat
27102 root 20 0 5316m 163m 14m R 10.2 0.5 306:49.44 z_itask_templat
27113 root 20 0 5316m 163m 14m R 10.2 0.5 310:47.41 z_itask_templat
25730 root 20 0 5316m 163m 14m R 10.2 0.5 283:03.37 z_itask_templat
30069 root 20 0 5316m 163m 14m R 10.2 0.5 283:49.67 z_itask_templat
13938 root 20 0 5316m 163m 14m R 10.2 0.5 261:24.46 z_itask_templat
16326 root 20 0 5316m 163m 14m R 10.2 0.5 150:24.53 z_itask_templat
6795 root 20 0 5316m 163m 14m R 10.2 0.5 100:26.77 z_itask_templat
27063 root 20 0 5316m 163m 14m R 9.9 0.5 337:18.77 z_itask_templat
27065 root 20 0 5316m 163m 14m R 9.9 0.5 314:24.17 z_itask_templat
27068 root 20 0 5316m 163m 14m R 9.9 0.5 336:32.78 z_itask_templat
27069 root 20 0 5316m 163m 14m R 9.9 0.5 338:55.08 z_itask_templat
27072 root 20 0 5316m 163m 14m R 9.9 0.5 306:46.08 z_itask_templat
27075 root 20 0 5316m 163m 14m R 9.9 0.5 316:49.51 z_itask_templat
...
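The state and the user/system split that top reports per thread come from /proc/<pid>/task/<tid>/stat. As a rough sketch of how the same check could be scripted, the following parses one stat line and computes the fraction of CPU time spent in user mode. The function names are assumptions for illustration; the field positions follow proc(5), where state is field 3 and utime and stime are fields 14 and 15, counted in clock ticks.

```python
def parse_stat_line(line):
    """Parse one /proc/<pid>/task/<tid>/stat line.

    Returns (tid, comm, state, utime, stime). The comm field is wrapped in
    parentheses and may itself contain spaces or ')', so the line is split
    on the first '(' and the last ')'.
    """
    head, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    state = fields[0]          # field 3: R, S, D, Z, ...
    utime = int(fields[11])    # field 14: user-mode time, in clock ticks
    stime = int(fields[12])    # field 15: kernel-mode time, in clock ticks
    return int(head), comm, state, utime, stime


def user_fraction(utime, stime):
    """Fraction of the thread's CPU time that was spent in user mode."""
    total = utime + stime
    return utime / total if total else 0.0
```

A thread that is always in state R with a user fraction close to 1.0 matches the pattern seen in the top output above: it is burning CPU in its own code rather than in the kernel.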
3) Process scheduling information
Sampling the scheduling statistics of one of the threads several times:
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15681811525768 129628804592612 3557465
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15682016493013 129630684625241 3557509
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15682843570331 129638127548315 3557686
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15683323640217 129642447477861 3557793
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15683698477621 129645817640726 3557875
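The three fields of schedstat are the time the thread has spent running on a CPU (in nanoseconds), the time it has spent waiting on a run queue (in nanoseconds), and the number of timeslices it has run. All three keep increasing, which shows that the thread is continually being scheduled and run. Combined with its state always being R, this suggests an infinite loop (or a deadlock that spins rather than sleeps) in user mode.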
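The schedstat samples can also be turned into numbers. The sketch below, with function names chosen here for illustration, subtracts two samples and estimates the share of time the thread actually spent on a CPU. It assumes the thread was runnable for the whole interval, so that run time plus run-queue wait time equals elapsed time; that assumption fits a thread whose state is always R, but not one that sleeps.

```python
def parse_schedstat(text):
    """Split a schedstat line into (run_ns, wait_ns, timeslices)."""
    run_ns, wait_ns, timeslices = (int(x) for x in text.split())
    return run_ns, wait_ns, timeslices


def schedstat_delta(before, after):
    """Difference between two schedstat samples of the same thread.

    cpu_share assumes the thread never slept between the samples, so
    elapsed time is approximated as run time plus run-queue wait time.
    """
    b, a = parse_schedstat(before), parse_schedstat(after)
    run, wait, slices = (y - x for x, y in zip(b, a))
    elapsed = run + wait
    return {
        "run_ns": run,
        "wait_ns": wait,
        "timeslices": slices,
        "cpu_share": run / elapsed if elapsed else 0.0,
    }


# First and last samples from the listing above.
first = "15681811525768 129628804592612 3557465"
last = "15683698477621 129645817640726 3557875"
delta = schedstat_delta(first, last)
```

For these two samples the thread ran for about 1.9 seconds and waited for about 17 seconds, a CPU share of roughly 10%, which agrees with the per-thread figure reported by top.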
This raises another question: why does top show each thread using only about 10% of a CPU, rather than the 100% usually seen with an infinite loop?
Because there are many such threads, all with the same priority. Under the CFS scheduler they receive equal shares of CPU time, and no single thread is allowed to monopolize a CPU. The threads are therefore scheduled in turn, and together they consume all of the CPU.
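The even split can be approximated with simple arithmetic. The sketch below is a simplification of what CFS does: it assumes every thread has the same weight and is always runnable, and it ignores all other load on the machine. The function name is introduced here, and the CPU count used in the example is an assumption, since the article does not state how many CPUs the machine has.

```python
def fair_share_percent(n_cpus, n_runnable):
    """Approximate per-thread %CPU when equal-weight, always-runnable
    threads share n_cpus CPUs evenly (a simplification of CFS)."""
    if n_cpus <= 0 or n_runnable <= 0:
        raise ValueError("n_cpus and n_runnable must be positive")
    # A single thread can never use more than one full CPU.
    return 100.0 * min(1.0, n_cpus / n_runnable)


# 85 running tasks is taken from the top output; 8 CPUs is a hypothetical
# value chosen only to illustrate the calculation.
share = fair_share_percent(n_cpus=8, n_runnable=85)
```

With those inputs the result is a little under 10% per thread, the same order as the 9.9% to 10.2% shown by top. With fewer runnable threads than CPUs the function returns 100%, which is the familiar picture of a single looping process.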
Another question: why does the kernel not detect a soft lockup in this case?
Because the business process does not run at a high priority, it does not prevent the watchdog kernel thread (a real-time thread with the highest priority) from being scheduled, so no soft lockup is reported.
Another question: why does the thread stack show retint_careful every time, rather than some other location?
Because the interrupt return path is the point at which these threads are rescheduled; leaving other cases aside, they are not switched out anywhere else. Viewing a thread's stack also depends on scheduling: the cat command has to get onto a CPU, and it does so when a looping thread is preempted on interrupt return. As a result, the point we happen to see every time is retint_careful.
4) User-mode analysis
The analysis above suggests an infinite loop in user mode.
To confirm this in user mode:
Install the debug information, attach gdb to the process, examine the thread stacks, and analyze the code logic.
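The gdb step can be run non-interactively so that the process is paused only briefly. The following sketch builds and runs a gdb command that attaches to a process, prints a backtrace for every thread, and exits. The function names and the timeout value are assumptions made here; the gdb options used (-p, -batch, -ex) and the "thread apply all bt" command are standard, but the article does not say which commands were actually used.

```python
import subprocess


def gdb_backtrace_cmd(pid):
    """Build a gdb command line that dumps all user-mode thread stacks."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_user_stacks(pid, timeout=60):
    """Attach gdb to pid, collect every thread's backtrace, then detach.

    The target is stopped while gdb is attached, so this should be used
    with care on a production process. Requires gdb and permission to
    ptrace the target.
    """
    result = subprocess.run(
        gdb_backtrace_cmd(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

Running dump_user_stacks a few times in a row and comparing the output is one way to spot the loop: threads that keep showing the same small set of frames in user code are the ones to inspect.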
In the end, this confirmed that the problem was an infinite loop in the user-mode code of the process.