How to implement process D-state deadlock detection in Linux

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how the Linux kernel detects processes that are deadlocked in the D state.

A Linux process can be in one of several states, such as TASK_RUNNING (running or runnable), EXIT_DEAD (an exit state), and TASK_INTERRUPTIBLE (sleeping, but able to be woken by a signal); the full list is in include/linux/sched.h. There is also the TASK_UNINTERRUPTIBLE wait state, known as the D state, in which the process does not respond to signals and can only be woken by wake_up. A process can enter this state in many situations. For example, waiting on a mutex may put the process into this state, and a process may also enter it while waiting for an I/O resource to become ready (the wait_event mechanism). Normally a process does not stay in this state for long, but if an I/O device fails or the process deadlocks, it may remain in the D state indefinitely and never return to TASK_RUNNING. To detect this situation, the kernel provides the hung task mechanism, which finds processes that have been in the D state for a long time and raises an alarm. This article analyzes the kernel source code of the hung task mechanism and gives a demonstration.

I. Analysis of hung task mechanism

The hung task mechanism was introduced into the kernel in a very early version. This article uses the Linux 4.1.15 source code as an example. There is not much code, and it all lives in kernel/hung_task.c.

The overall flow chart and design idea are given first.

Figure: D-state deadlock detection flow chart

The core idea is to create a kernel monitoring thread that periodically scans every process (task) in the D state and counts how many times each one has been scheduled between two consecutive checks. If a process has not been scheduled at all between two checks, it has stayed in the D state the whole time and is likely deadlocked, so the kernel prints an alarm log containing the basic information of the process, a stack backtrace, and register contents for kernel developers to use when locating the problem.
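The two-check idea can be sketched in userspace C. This is only an illustrative model, not kernel code: struct task_model and scan_tasks() are hypothetical names invented for this sketch.

```c
#include <stddef.h>

/* Hypothetical userspace model of a task; not the kernel's task_struct. */
struct task_model {
    int in_d_state;                  /* 1 if the task is in uninterruptible sleep */
    unsigned long switch_count;      /* context switches so far */
    unsigned long last_switch_count; /* value recorded by the previous pass */
};

/*
 * One detection pass. Returns how many tasks look hung, i.e. are in the
 * D state and have not been scheduled since the previous pass.
 */
size_t scan_tasks(struct task_model *tasks, size_t n)
{
    size_t hung = 0;

    for (size_t i = 0; i < n; i++) {
        struct task_model *t = &tasks[i];

        if (!t->in_d_state)
            continue;
        if (t->switch_count != t->last_switch_count) {
            /* The task ran since the last pass: remember the new count. */
            t->last_switch_count = t->switch_count;
            continue;
        }
        hung++;
    }
    return hung;
}
```

Calling scan_tasks() twice on the same array without changing switch_count in between makes a D-state task count as hung on the second call, which mirrors the "no scheduling between two checks" rule.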

Here is a detailed analysis of the implementation:

    static int __init hung_task_init(void)
    {
        atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
        watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");

        return 0;
    }
    subsys_initcall(hung_task_init);

First, if this mechanism is enabled in the kernel configuration, hung_task_init() is called during the kernel's subsys initialization phase. It first registers a callback on the kernel's panic_notifier_list notification chain:

    static struct notifier_block panic_block = {
        .notifier_call = hung_task_panic,
    };

hung_task_panic() is called when the kernel panics; what it does is covered later. Initialization then calls kthread_run() to create a kernel thread named khungtaskd, which runs the watchdog() function and is woken up immediately so that it can be scheduled. This is the background kernel thread that detects processes deadlocked in the D state.

    /*
     * kthread which checks for tasks stuck in D state
     */
    static int watchdog(void *dummy)
    {
        set_user_nice(current, 0);

        for ( ; ; ) {
            unsigned long timeout = sysctl_hung_task_timeout_secs;

            while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
                timeout = sysctl_hung_task_timeout_secs;

            if (atomic_xchg(&reset_hung_task, 0))
                continue;

            check_hung_uninterruptible_tasks(timeout);
        }

        return 0;
    }

The thread first sets its nice value to 0, the normal priority, so that it does not affect other processes. It then enters the main loop, which runs once per timeout. Each iteration first puts the thread to sleep for sysctl_hung_task_timeout_secs seconds. That value defaults to CONFIG_DEFAULT_HUNG_TASK_TIMEOUT, which can be changed through the kernel configuration; the default is 120s. When the sleep ends, the thread checks the atomic flag reset_hung_task. If the flag is set, it is cleared and the current round of monitoring is skipped. The flag is set through the reset_hung_task_detector() function (no other code in the kernel currently uses this interface):

    void reset_hung_task_detector(void)
    {
        atomic_set(&reset_hung_task, 1);
    }
    EXPORT_SYMBOL_GPL(reset_hung_task_detector);
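The set-then-test-and-clear pattern used for the reset flag can be tried in userspace with C11 atomics. The sketch below is an analogue only: reset_flag, request_reset() and should_skip_round() are hypothetical names, and atomic_exchange() stands in for the kernel's atomic_xchg().

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the kernel's reset_hung_task flag. */
static atomic_int reset_flag;

/* Analogue of reset_hung_task_detector(): ask the monitor to skip one round. */
void request_reset(void)
{
    atomic_store(&reset_flag, 1);
}

/*
 * Analogue of the atomic_xchg() test in watchdog(): read the flag and clear
 * it in a single atomic step. Returns 1 if the current round should be
 * skipped, 0 otherwise.
 */
int should_skip_round(void)
{
    return atomic_exchange(&reset_flag, 0);
}
```

Because the read and the clear happen in one atomic operation, a request made between two rounds is seen exactly once, so only one round is skipped per request.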

The last step of the loop is the monitoring function check_hung_uninterruptible_tasks(), whose input parameter is the monitoring timeout.

    /*
     * Check whether a TASK_UNINTERRUPTIBLE does not get woken up for
     * a really long time (120 seconds). If that happens, print out
     * a warning.
     */
    static void check_hung_uninterruptible_tasks(unsigned long timeout)
    {
        int max_count = sysctl_hung_task_check_count;
        int batch_count = HUNG_TASK_BATCHING;
        struct task_struct *g, *t;

        /*
         * If the system crashed already then all bets are off,
         * do not report extra hung tasks:
         */
        if (test_taint(TAINT_DIE) || did_panic)
            return;

        rcu_read_lock();
        for_each_process_thread(g, t) {
            if (!max_count--)
                goto unlock;
            if (!--batch_count) {
                batch_count = HUNG_TASK_BATCHING;
                if (!rcu_lock_break(g, t))
                    goto unlock;
            }
            /* use "==" to skip the TASK_KILLABLE tasks waiting on NFS */
            if (t->state == TASK_UNINTERRUPTIBLE)
                check_hung_task(t, timeout);
        }
    unlock:
        rcu_read_unlock();
    }

First, the function checks whether the kernel has already hit DIE or panic. If so, the kernel has already crashed, there is no point in monitoring, and the function returns immediately. Note that the did_panic flag is set by hung_task_panic(), the panic notification chain callback mentioned earlier:

    static int hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
    {
        did_panic = 1;

        return NOTIFY_DONE;
    }

If the kernel has not crashed, the function starts monitoring and checks the processes (tasks) in the system one by one. The scan runs under the RCU read lock, so to avoid holding the lock too long when there are many processes, batch_count limits each batch to at most HUNG_TASK_BATCHING processes. The maximum number of processes to check is also configurable (max_count = sysctl_hung_task_check_count). Its default value is the maximum number of PIDs, PID_MAX_LIMIT, and it can be changed with the sysctl command.
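The interaction of the two counters is easier to see in a small simulation. The following userspace sketch copies only the counting logic of the loop; struct scan_result and simulate_scan() are hypothetical names, and the sketch assumes that every lock break succeeds (in the kernel, a failed rcu_lock_break() ends the scan).

```c
#include <stddef.h>

/* Result of one simulated scan (hypothetical type, for illustration only). */
struct scan_result {
    size_t checked;     /* tasks actually examined */
    size_t lock_breaks; /* times the lock would have been dropped and retaken */
};

/*
 * Walk ntasks tasks the way check_hung_uninterruptible_tasks() does:
 * stop once max_count tasks have been seen, and pause after every
 * batch tasks (batch must be at least 1).
 */
struct scan_result simulate_scan(size_t ntasks, long max_count, long batch)
{
    struct scan_result r = {0, 0};
    long batch_count = batch;

    for (size_t i = 0; i < ntasks; i++) {
        if (!max_count--)
            break;                 /* overall limit reached */
        if (!--batch_count) {
            batch_count = batch;   /* start a new batch */
            r.lock_breaks++;       /* the kernel calls rcu_lock_break() here */
        }
        r.checked++;
    }
    return r;
}
```

For example, simulate_scan(10, 100, 4) examines all 10 tasks and pauses twice, while simulate_scan(10, 5, 4) stops after 5 tasks with one pause.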

The function uses the for_each_process_thread() macro to iterate over all processes (tasks) in the system and only checks for a timeout on those in the TASK_UNINTERRUPTIBLE state, by calling check_hung_task(). The input parameters are the task_struct structure and the timeout (120s):

    static void check_hung_task(struct task_struct *t, unsigned long timeout)
    {
        unsigned long switch_count = t->nvcsw + t->nivcsw;

        /*
         * Ensure the task is not frozen.
         * Also, skip vfork and any other user process that freezer should skip.
         */
        if (unlikely(t->flags & (PF_FROZEN | PF_FREEZER_SKIP)))
            return;

        /*
         * When a freshly created task is scheduled once, changes its state to
         * TASK_UNINTERRUPTIBLE without having ever been switched out once, it
         * musn't be checked.
         */
        if (unlikely(!switch_count))
            return;

        if (switch_count != t->last_switch_count) {
            t->last_switch_count = switch_count;
            return;
        }

        trace_sched_process_hang(t);

        if (!sysctl_hung_task_warnings)
            return;

        if (sysctl_hung_task_warnings > 0)
            sysctl_hung_task_warnings--;

First, the total number of times the process has been scheduled since its creation is computed as the sum of t->nvcsw and t->nivcsw, where t->nvcsw is the number of times the process gave up the CPU voluntarily and t->nivcsw is the number of times it was preempted involuntarily. The function then checks two conditions: (1) if the process is frozen, it is skipped; (2) if the switch count is 0, it is not checked.
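The same two counters are visible from userspace as the voluntary_ctxt_switches and nonvoluntary_ctxt_switches lines of /proc/<pid>/status. As a rough sketch, the sum can be computed from the text of that file as follows; total_ctxt_switches() is a hypothetical helper name, and the caller is assumed to have already read the file into a string.

```c
#include <stdio.h>
#include <string.h>

/*
 * Sum the voluntary and involuntary context-switch counters found in the
 * text of /proc/<pid>/status. Returns 0 on success, -1 if either field
 * is missing.
 */
int total_ctxt_switches(const char *status_text, unsigned long *total)
{
    unsigned long vol = 0, nonvol = 0;
    int have_vol = 0, have_nonvol = 0;
    const char *line = status_text;

    while (line && *line) {
        /* Each sscanf() only matches if the line starts with that key. */
        if (sscanf(line, "voluntary_ctxt_switches: %lu", &vol) == 1)
            have_vol = 1;
        else if (sscanf(line, "nonvoluntary_ctxt_switches: %lu", &nonvol) == 1)
            have_nonvol = 1;

        line = strchr(line, '\n');   /* advance to the next line */
        if (line)
            line++;
    }
    if (!have_vol || !have_nonvol)
        return -1;
    *total = vol + nonvol;
    return 0;
}
```

Sampling this sum twice for a process that stays in the D state and seeing no change is the userspace counterpart of the comparison the kernel makes.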

Next, the function compares the switch count saved by the previous check with the current one. If they differ, the process was scheduled during this timeout period (120s), so the saved value is updated and the function returns. Otherwise the process has not been scheduled for the whole timeout period (120s) and has stayed in the D state. The trace_sched_process_hang() call that follows is a tracepoint and is not analyzed here. The function then checks sysctl_hung_task_warnings, which is the number of alarms that may still be reported; users can also configure it through the sysctl command. The default value is 10, so if a detected process stays in the D state, by default an alarm is issued every 2 minutes, 10 times in total, and no more alarms are issued after that. Here is the alarm code:

        /*
         * Ok, the task did not get scheduled for more than 2 minutes,
         * complain:
         */
        pr_err("INFO: task %s:%d blocked for more than %ld seconds.\n",
            t->comm, t->pid, timeout);
        pr_err("      %s %s %.*s\n",
            print_tainted(), init_utsname()->release,
            (int)strcspn(init_utsname()->version, " "),
            init_utsname()->version);
        pr_err("\"echo 0 > /proc/sys/kernel/hung_task_timeout_secs\""
            " disables this message.\n");
        sched_show_task(t);
        debug_show_held_locks(t);

        touch_nmi_watchdog();

Here the name of the deadlocked task, its PID, the timeout, the kernel tainted information, sysinfo, the kernel stack backtrace, and register information are printed to the console and the log. If lock debugging is enabled, the locks held by the task are printed as well. Finally, nmi_watchdog is touched to prevent an nmi_watchdog timeout (nmi_watchdog is not a concern in my ARM environment).
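Whether this report is printed at all depends on the sysctl_hung_task_warnings check that precedes it in check_hung_task(). A minimal userspace model of that check is sketched below; consume_warning() is a hypothetical name, and the logic follows the two if statements in the kernel function.

```c
/*
 * Hypothetical model of the sysctl_hung_task_warnings check.
 * Returns 1 if a report should be printed, 0 if the budget is used up.
 * A positive budget is decremented on each report; a negative budget is
 * never decremented, so it never runs out.
 */
int consume_warning(int *warnings)
{
    if (!*warnings)
        return 0;
    if (*warnings > 0)
        (*warnings)--;
    return 1;
}
```

Starting from the default of 10, the function returns 1 ten times and 0 afterwards, which matches the behavior of ten reports followed by silence.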

        if (sysctl_hung_task_panic) {
            trigger_all_cpu_backtrace();
            panic("hung_task: blocked tasks");
        }
    }

Finally, if sysctl_hung_task_panic is set, a panic is triggered directly (this value can be configured through the kernel configuration file or set through sysctl).
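The sysctl values discussed above appear as files under /proc/sys/kernel/, for example hung_task_timeout_secs, hung_task_warnings, hung_task_check_count and hung_task_panic. A small helper for reading one of them from userspace might look like the sketch below; read_long_from_file() is a hypothetical name, and the file is assumed to hold a single integer.

```c
#include <stdio.h>

/*
 * Read one integer from a file such as
 * /proc/sys/kernel/hung_task_timeout_secs.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_long_from_file(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

On a kernel without the hung task mechanism the files do not exist, so the helper returns -1, which also serves as a quick check that the feature is available.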

II. Sample demonstration

Demo environment: Raspberry Pi B (Linux 4.1.15)

1. First check the kernel configuration to confirm that the hung task mechanism is enabled (CONFIG_DETECT_HUNG_TASK)

    /*
     * The header names were lost in the original listing; the three
     * below are assumed, as they are what this code needs.
     */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/mutex.h>

    DEFINE_MUTEX(dlock);

    static int __init dlock_init(void)
    {
        mutex_lock(&dlock);
        mutex_lock(&dlock);    /* second lock never succeeds: deadlock */

        return 0;
    }

    static void __exit dlock_exit(void)
    {
        return;
    }

    module_init(dlock_init);
    module_exit(dlock_exit);
    MODULE_LICENSE("GPL");

This sample module defines a mutex and then locks it twice in the module's init function, which causes a deadlock (mutex_lock() calls __mutex_lock_slowpath(), which puts the process into the TASK_UNINTERRUPTIBLE state). Once the process enters the D state it cannot exit. You can see it with the ps command:

    root@apple:~# busybox ps
    PID   USER     TIME   COMMAND
    ...
      521 root       0:00 insmod dlock.ko
    ...

Then check the status of the process; it has entered the D state:

    root@apple:~# cat /proc/521/status
    Name:   insmod
    State:  D (disk sleep)
    Tgid:   521
    Ngid:   0
    Pid:    521
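The same check can be done programmatically. The sketch below extracts the state letter from the text of /proc/<pid>/status; task_state_code() and is_d_state() are hypothetical helper names, and the caller is assumed to have read the file into a string beforehand.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the one-letter state code from the text of /proc/<pid>/status,
 * or '\0' if no "State:" line is present.
 */
char task_state_code(const char *status_text)
{
    const char *line = status_text;
    char code;

    while (line && *line) {
        /* Matches only when the current line starts with "State:". */
        if (sscanf(line, "State: %c", &code) == 1)
            return code;
        line = strchr(line, '\n');   /* advance to the next line */
        if (line)
            line++;
    }
    return '\0';
}

/* 'D' is the letter reported for a task in uninterruptible sleep. */
int is_d_state(const char *status_text)
{
    return task_state_code(status_text) == 'D';
}
```

For the output above, task_state_code() returns 'D' and is_d_state() returns 1.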

After about two minutes, the debug serial port prints the following information, and the same report is printed again every two minutes:

    [360.625466] INFO: task insmod:521 blocked for more than 120 seconds.
    [360.631878] Tainted: G O 4.1.15 #5
    [360.637042] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [360.644986] [] (__schedule) from [] (schedule+0x40/0xa4)
    [360.652129] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
    [360.660570] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
    [360.670142] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
    [360.678432] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
    [360.686480] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
    [360.694976] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
    [360.703170] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
    [360.711284] [] (load_module) from [] (SyS_init_module+0xdc/0x130)
    [360.719239] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)
    [480.725351] INFO: task insmod:521 blocked for more than 120 seconds.
    [480.731759] Tainted: G O 4.1.15 #5
    [480.736917] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [480.744842] [] (__schedule) from [] (schedule+0x40/0xa4)
    [480.752029] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
    [480.760479] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
    [480.770066] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
    [480.778363] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
    [480.786402] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
    [480.794897] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
    [480.803085] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
    [480.811188] [] (load_module) from [] (SyS_init_module+0xdc/0x130)
    [480.819113] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)
    [600.825353] INFO: task insmod:521 blocked for more than 120 seconds.
    [600.831759] Tainted: G O 4.1.15 #5
    [600.836916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [600.844865] [] (__schedule) from [] (schedule+0x40/0xa4)
    [600.852005] [] (schedule) from [] (schedule_preempt_disabled+0x18/0x1c)
    [600.860445] [] (schedule_preempt_disabled) from [] (__mutex_lock_slowpath+0x6c/0xe4)
    [600.870014] [] (__mutex_lock_slowpath) from [] (mutex_lock+0x44/0x48)
    [600.878303] [] (mutex_lock) from [] (dlock_init+0x20/0x2c [dlock])
    [600.886339] [] (dlock_init [dlock]) from [] (do_one_initcall+0x90/0x1e8)
    [600.894835] [] (do_one_initcall) from [] (do_init_module+0x6c/0x1c0)
    [600.903023] [] (do_init_module) from [] (load_module+0x1690/0x1d34)
    [600.911133] [] (load_module) from [] (SyS_init_module+0xdc/0x130)
    [600.919059] [] (SyS_init_module) from [] (ret_fast_syscall+0x0/0x54)
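When such reports need to be collected automatically, the first line of each report carries the task name, PID and timeout. The following sketch pulls those fields out of one log line; parse_hung_task_line() is a hypothetical helper, and it assumes the task name contains no colon, which is true for insmod here but not for every kernel thread name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the task name, PID and timeout from a hung task report line.
 * comm must have room for at least 16 bytes (15 characters plus NUL).
 * Returns 0 on success, -1 if the line is not a hung task report.
 */
int parse_hung_task_line(const char *line, char *comm, int *pid, long *secs)
{
    /* Skip any timestamp prefix by searching for the fixed text. */
    const char *p = strstr(line, "INFO: task ");

    if (!p)
        return -1;
    if (sscanf(p, "INFO: task %15[^:]:%d blocked for more than %ld seconds.",
               comm, pid, secs) != 3)
        return -1;
    return 0;
}
```

Applied to the first line of the log above, the helper yields the name insmod, PID 521 and a timeout of 120 seconds; the backtrace lines are rejected with -1.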
