
Diagnosing an infinite loop in a user-mode process on Linux

2025-02-23 Update From: SLTechnology News&Howtos



This article walks through diagnosing and resolving an infinite loop in a user-mode process on a Linux system. The method is simple, fast, and practical.

1. Problem phenomenon

The business processes (user-mode multithreaded programs) hang, the operating system responds slowly, and there is nothing unusual in the system log. Judging from the kernel-mode stacks of the process, all threads appear to be stuck in the following flow in kernel mode:

[root@vmc116 ~]# cat /proc/27007/task/11825/stack
[] retint_careful+0x14/0x32
[] 0xffffffffffffffff
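When many threads are involved, it helps to dump the kernel stack of every thread in one pass. A minimal sketch; in practice you would substitute the hung process's PID (27007 above) for `$$`, which is used here only so the loop is self-contained:

```shell
#!/bin/sh
# Dump the kernel-mode stack of every thread in a process.
# Replace $$ with the target PID (e.g. 27007) in real use.
pid=$$
for task in /proc/"$pid"/task/*; do
    tid=${task##*/}
    echo "== thread $tid =="
    # /proc/<pid>/task/<tid>/stack usually requires root to read
    cat "$task/stack" 2>/dev/null || echo "(stack unreadable without root)"
done
```

Reading the stacks of all threads at once makes a common blocking point, such as retint_careful here, stand out immediately.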

2. Problem analysis

1) Kernel stack analysis

From the kernel stacks, all threads are blocked at retint_careful, which is part of the interrupt-return path. The (assembly) code is as follows:

entry_64.S:

ret_from_intr:
        DISABLE_INTERRUPTS(CLBR_NONE)
        TRACE_IRQS_OFF
        decl PER_CPU_VAR(irq_count)

        /* Restore saved previous stack */
        popq %rsi
        CFI_DEF_CFA rsi,SS+8-RBP        /* reg/off reset after def_cfa_expr */
        leaq ARGOFFSET-RBP(%rsi), %rsp
        CFI_DEF_CFA_REGISTER rsp
        CFI_ADJUST_CFA_OFFSET RBP-ARGOFFSET

        ...

retint_careful:
        CFI_RESTORE_STATE
        bt $TIF_NEED_RESCHED,%edx
        jnc retint_signal
        TRACE_IRQS_ON
        ENABLE_INTERRUPTS(CLBR_NONE)
        pushq_cfi %rdi
        SCHEDULE_USER
        popq_cfi %rdi
        GET_THREAD_INFO(%rcx)
        DISABLE_INTERRUPTS(CLBR_NONE)
        TRACE_IRQS_OFF
        jmp retint_check

This is the path a user-mode process takes when returning from an interrupt. Combining the offset retint_careful+0x14/0x32 with the disassembly, we can confirm that the actual blocking point is:

SCHEDULE_USER

This macro calls schedule() to perform scheduling: on the interrupt-return path, the kernel finds that the thread needs to be rescheduled (TIF_NEED_RESCHED is set), so scheduling happens here.

One question: why is there no stack frame for schedule() itself in the stack? Because SCHEDULE_USER calls it directly from assembly, so there is no corresponding stack-frame push or context-saving operation.

2) Run-state analysis

From the output of the top command, the relevant threads have in fact been in the R state the whole time, the CPUs are almost completely saturated, and most of the time is spent in user mode:

[root@vmc116 ~]# top
top - 09:42:23 up 16 days, 2:21, 23 users, load average: 84.08, 84.30, 83.62
Tasks: 1037 total, 85 running, 952 sleeping, 0 stopped, 0 zombie
Cpu(s): 97.6%us, 2.2%sy, 0.2%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32878852k total, 32315464k used, 563388k free, 374152k buffers
Swap: 35110904k total, 38644k used, 35072260k free, 28852536k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

27074 root 20 0 5316m 163m 14m R 10.2 0.5 321:06.17 z_itask_templat

27084 root 20 0 5316m 163m 14m R 10.2 0.5 296:23.37 z_itask_templat

27085 root 20 0 5316m 163m 14m R 10.2 0.5 337:57.26 z_itask_templat

27095 root 20 0 5316m 163m 14m R 10.2 0.5 327:31.93 z_itask_templat

27102 root 20 0 5316m 163m 14m R 10.2 0.5 306:49.44 z_itask_templat

27113 root 20 0 5316m 163m 14m R 10.2 0.5 310:47.41 z_itask_templat

25730 root 20 0 5316m 163m 14m R 10.2 0.5 283:03.37 z_itask_templat

30069 root 20 0 5316m 163m 14m R 10.2 0.5 283:49.67 z_itask_templat

13938 root 20 0 5316m 163m 14m R 10.2 0.5 261:24.46 z_itask_templat

16326 root 20 0 5316m 163m 14m R 10.2 0.5 150:24.53 z_itask_templat

6795 root 20 0 5316m 163m 14m R 10.2 0.5 100:26.77 z_itask_templat

27063 root 20 0 5316m 163m 14m R 9.9 0.5 337:18.77 z_itask_templat

27065 root 20 0 5316m 163m 14m R 9.9 0.5 314:24.17 z_itask_templat

27068 root 20 0 5316m 163m 14m R 9.9 0.5 336:32.78 z_itask_templat

27069 root 20 0 5316m 163m 14m R 9.9 0.5 338:55.08 z_itask_templat

27072 root 20 0 5316m 163m 14m R 9.9 0.5 306:46.08 z_itask_templat

27075 root 20 0 5316m 163m 14m R 9.9 0.5 316:49.51 z_itask_templat

...

3) Process scheduling information

Look at the scheduling statistics of one of the threads, sampled several times:

[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15681811525768 129628804592612 3557465
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15682016493013 129630684625241 3557509
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15682843570331 129638127548315 3557686
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15683323640217 129642447477861 3557793
[root@vmc116 ~]# cat /proc/27007/task/11825/schedstat
15683698477621 129645817640726 3557875

The thread's scheduling statistics keep increasing, which shows it is continually being scheduled to run. Combined with its state being R the whole time, we can speculate that there is an infinite loop (or a non-sleeping deadlock, i.e. a spin) in user mode.
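The three schedstat fields are cumulative on-CPU time (ns), runqueue wait time (ns), and the number of timeslices run. Computing deltas from the first two samples above makes the pattern concrete:

```shell
#!/bin/sh
# /proc/<pid>/task/<tid>/schedstat fields:
#   on-CPU time (ns), runqueue wait time (ns), timeslice count.
# The first two samples captured above:
run1=15681811525768; wait1=129628804592612; slices1=3557465
run2=15682016493013; wait2=129630684625241; slices2=3557509
echo "on-CPU delta : $((run2 - run1)) ns"
echo "wait delta   : $((wait2 - wait1)) ns"
echo "slices delta : $((slices2 - slices1))"
```

Between the two samples the thread ran for about 0.2 s but waited about 1.9 s on the runqueue: it is runnable far more often than it actually runs, exactly what you expect when many spinning threads compete for the CPUs.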

Another question: why does top show only about 10% CPU per thread, rather than the 100% a typical infinite-loop process would show?

Because there are many threads at the same priority, the CFS scheduler distributes time slices evenly and does not let any one thread monopolize a CPU. The result is that the threads are scheduled in turn and together consume all of the CPU.
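A rough sanity check of the 10% figure, using only the numbers in the top snapshot (the core count is not stated in the snapshot; it is inferred here):

```shell
#!/bin/sh
# From the top snapshot: 85 runnable threads, each at roughly 10% CPU.
running=85
per_thread=10   # observed %CPU per thread
total=$((running * per_thread))
echo "aggregate CPU demand ~ ${total}% => about $((total / 100)) cores fully busy"
```

An aggregate demand of ~850% matches the Cpu(s) line showing essentially 0% idle: the spinning threads saturate roughly 8 cores between them, while each individual thread gets only its fair ~10% share.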

Another question: why doesn't the kernel detect a softlockup in this case?

Because the business process's priority is not high enough to interfere with scheduling of the watchdog kernel thread (a highest-priority real-time thread), the softlockup condition never arises.

Another question: why does every look at the thread stack show it blocked at retint_careful rather than somewhere else?

Because interrupt return is where scheduling takes place (leaving other cases aside), a thread that never enters the kernel voluntarily can only be scheduled out on that path. Sampling another thread's stack (with the cat command) itself depends on scheduling, so every time we look, the target thread has just been scheduled out at interrupt return, and the blocking point we see is always retint_careful.

4) User-mode analysis

From the analysis above, we suspect an infinite loop (or spin) in user mode.

How to confirm it in user mode: deploy the debug symbols, attach gdb to the relevant processes, examine the stacks, and analyze the code logic.

In the end, the problem was confirmed to be an infinite loop in the user-mode process.
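The symptom is easy to reproduce with a deliberate user-mode busy loop, a minimal sketch: a spinning thread never sleeps, so it stays in the R state while its CPU time keeps climbing, just like the threads analyzed above:

```shell
#!/bin/sh
# Spawn a deliberate user-mode infinite loop, then sample its state.
sh -c 'while :; do :; done' &
pid=$!
sleep 1
# Field 3 of /proc/<pid>/stat is the task state; a busy loop stays R
# (running/runnable) because it never blocks.
state=$(awk '{print $3}' "/proc/$pid/stat")
echo "state=$state"
kill "$pid"
```

Sampling this process's /proc state or schedstat repeatedly reproduces exactly the pattern seen in section 3): always R, statistics always growing.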

At this point, you should have a deeper understanding of how to diagnose and resolve an infinite loop in a user-mode process on Linux. Try the steps yourself in practice.
