2025-04-01 Update. From: SLTechnology News & Howtos > Servers
Shulou (Shulou.com) 05/31 report
Many newcomers are unclear on how to analyze a NUMA load-balance deadlock. To help, this article walks through one such analysis in detail; readers who need it can follow along and will hopefully gain something.
Background: this is a crash encountered on kernel 3.10.0-957.el7.x86_64. The following records how we troubleshot and solved the problem.
I. Failure phenomenon
OPPO Cloud intelligent monitoring found that a machine had gone down:
 KERNEL: /usr/lib/debug/lib/modules/3.10.0-957.el7.x86_64/vmlinux
  PANIC: "Kernel panic - not syncing: Hard LOCKUP"
    PID: 14
COMMAND: "migration/1"
   TASK: ffff8f1bf6bb9040  [THREAD_INFO: ffff8f1bf6bc4000]
    CPU: 1
  STATE: TASK_INTERRUPTIBLE (PANIC)
crash> bt
PID: 14  TASK: ffff8f1bf6bb9040  CPU: 1  COMMAND: "migration/1"
 #0 [ffff8f4afbe089f0] machine_kexec at ffffffff83863674
 #1 [ffff8f4afbe08a50] __crash_kexec at ffffffff8391ce12
 #2 [ffff8f4afbe08b20] panic at ffffffff83f5b4db
 #3 [ffff8f4afbe08ba0] nmi_panic at ffffffff8389739f
 #4 [ffff8f4afbe08bb0] watchdog_overflow_callback at ffffffff83949241
 #5 [ffff8f4afbe08bc8] __perf_event_overflow at ffffffff839a1027
 #6 [ffff8f4afbe08c00] perf_event_overflow at ffffffff839aa694
 #7 [ffff8f4afbe08c10] intel_pmu_handle_irq at ffffffff8380a6b0
 #8 [ffff8f4afbe08e38] perf_event_nmi_handler at ffffffff83f6b031
 #9 [ffff8f4afbe08e58] nmi_handle at ffffffff83f6c8fc
#10 [ffff8f4afbe08eb0] do_nmi at ffffffff83f6cbd8
#11 [ffff8f4afbe08ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: native_queued_spin_lock_slowpath+462]
    RIP: ffffffff839121ae  RSP: ffff8f1bf6bc7c50  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: 0000000000000082  RCX: 0000000000000001
    RDX: 0000000000000001  RSI: 0000000000000001  RDI: ffff8f1afdf55fe8  <- the lock
    RBP: ffff8f1bf6bc7c50   R8: 0000000000000101   R9: 0000000000000400
    R10: 0000000000499e    R11: 0000000000499f    R12: ffff8f1afdf55fe8
    R13: ffff8f1bf5150000  R14: ffff8f1afdf5b488  R15: ffff8f1bf5187818
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff8f1bf6bc7c50] native_queued_spin_lock_slowpath at ffffffff839121ae
#13 [ffff8f1bf6bc7c58] queued_spin_lock_slowpath at ffffffff83f5bf4b
#14 [ffff8f1bf6bc7c68] _raw_spin_lock_irqsave at ffffffff83f6a487
#15 [ffff8f1bf6bc7c80] cpu_stop_queue_work at ffffffff8392fc70
#16 [ffff8f1bf6bc7cb0] stop_one_cpu_nowait at ffffffff83930450
#17 [ffff8f1bf6bc7cc0] load_balance at ffffffff838e4c6e
#18 [ffff8f1bf6bc7da8] idle_balance at ffffffff838e5451
#19 [ffff8f1bf6bc7e00] __schedule at ffffffff83f67b14
#20 [ffff8f1bf6bc7e88] schedule at ffffffff83f67bc9
#21 [ffff8f1bf6bc7e98] smpboot_thread_fn at ffffffff838ca562
#22 [ffff8f1bf6bc7ec8] kthread at ffffffff838c1c31
#23 [ffff8f1bf6bc7f50] ret_from_fork_nospec_begin at ffffffff83f74c1d
crash>
II. Failure analysis
A hard lockup generally means interrupts were disabled for too long. From the stack, migration/1 above is spinning to acquire a spinlock: _raw_spin_lock_irqsave first calls arch_local_irq_disable, a common interrupt-disabling primitive, and then tries to take the lock, so interrupts stay off for as long as the acquisition spins. Let's find out who holds the lock this process wants.
On x86, rdi carries the first function argument, so at entry to native_queued_spin_lock_slowpath it holds the lock address.
crash> arch_spinlock_t ffff8f1afdf55fe8
struct arch_spinlock_t {
  val = {
    counter = 257
  }
}
Next, we need to know which lock this is. Following the call chain idle_balance -> load_balance -> stop_one_cpu_nowait -> cpu_stop_queue_work, disassemble the code where cpu_stop_queue_work is blocked:
crash> dis -l ffffffff8392fc70
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/stop_machine.c: 91
0xffffffff8392fc70:  cmpb   $0x0,0xc(%rbx)
 85 static void cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
 86 {
 87         struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
 88         unsigned long flags;
 89
 90         spin_lock_irqsave(&stopper->lock, flags);   <- the lock in question
 91         if (stopper->enabled)
 92                 __cpu_stop_queue_work(stopper, work);
 93         else
 94                 cpu_stop_signal_done(work->done, false);
 95         spin_unlock_irqrestore(&stopper->lock, flags);
 96 }
So the lock belongs to the per-cpu variable cpu_stopper, selected by cpu number. That cpu number comes from the busiest rq found in load_balance:
6545 static int load_balance(int this_cpu, struct rq *this_rq,
6546                         struct sched_domain *sd, enum cpu_idle_type idle,
6547                         int *should_balance)
6548 {
...
6735         if (active_balance) {
6736                 stop_one_cpu_nowait(cpu_of(busiest),
6737                                     active_load_balance_cpu_stop, busiest,
6738                                     &busiest->active_balance_work);
6739         }
...
6781 }
crash> dis -l load_balance | grep stop_one_cpu_nowait -B 6
0xffffffff838e4c4d:  callq  0xffffffff83f6a0e0
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/sched/fair.c: 6736
0xffffffff838e4c52:  mov    0x930(%rbx),%edi   <- cpu number taken from rbx; rbx is the busiest rq
0xffffffff838e4c58:  lea    0x908(%rbx),%rcx
0xffffffff838e4c5f:  mov    %rbx,%rdx
0xffffffff838e4c62:  mov    $0xffffffff838de690,%rsi
0xffffffff838e4c69:  callq  0xffffffff83930420
Then the data we fetched from the stack is as follows:
The busiest rq is:

crash> rq.cpu ffff8f1afdf5ab80
  cpu = 26
In other words, cpu 1 is waiting for cpu 26's per-cpu cpu_stopper lock.
We then searched for the lock address in the stacks of other processes and found the following:
ffff8f4957fbfab0: ffff8f1afdf55fe8      <- found on PID 355608's stack

crash> kmem ffff8f4957fbfab0
    PID: 355608
COMMAND: "custom_exporter"
   TASK: ffff8f4aea3a8000  [THREAD_INFO: ffff8f4957fbc000]
    CPU: 26                              <- which happens to be the process running on cpu 26
  STATE: TASK_RUNNING (ACTIVE)
Next, we need to analyze why the process custom_exporter at cpu 26 has been holding ffff8f1afdf55fe8 for a long time.
Let's analyze the stack of cpu 26:
crash> bt -f 355608
PID: 355608  TASK: ffff8f4aea3a8000  CPU: 26  COMMAND: "custom_exporter"
...
 #3 [ffff8f1afdf48ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: try_to_wake_up+114]
    RIP: ffffffff838d63d2  RSP: ffff8f4957fbfa30  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff8f1bf6bb9844  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000003  RDI: ffff8f1bf6bb9844
    RBP: ffff8f4957fbfa70   R8: ffff8f4afbe15ff0   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R13: ffff8f1bf6bb9040
    R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff8f4957fbfa30] try_to_wake_up at ffffffff838d63d2
    ffff8f4957fbfa38: 000000000001ab80 0000000000000086
    ffff8f4957fbfa48: ffff8f4afbe15fe0 ffff8f4957fbfb48
    ffff8f4957fbfa58: 0000000000000001 ffff8f4afbe15fe0
    ffff8f4957fbfa68: ffff8f1afdf55fe0 ffff8f4957fbfa80
    ffff8f4957fbfa78: ffffffff838d6705
 #5 [ffff8f4957fbfa78] wake_up_process at ffffffff838d6705
    ffff8f4957fbfa80: ffff8f4957fbfa98 ffffffff8392fc05
 #6 [ffff8f4957fbfa88] __cpu_stop_queue_work at ffffffff8392fc05
    ffff8f4957fbfa90: 000000000000001a ffff8f4957fbfbb0
    ffff8f4957fbfaa0: ffffffff8393037a
 #7 [ffff8f4957fbfaa0] stop_two_cpus at ffffffff8393037a
...
    ffff8f4957fbfbb8: ffffffff838d3867
 #8 [ffff8f4957fbfbb8] migrate_swap at ffffffff838d3867
    ffff8f4957fbfbc0: ffff8f4aea3a8000 ffff8f1ae77dc100   <- migration_swap_arg on the stack
    ffff8f4957fbfbd0: 000000010000001a 0000000080490f7c
    ffff8f4957fbfbe0: ffff8f4aea3a8000 ffff8f4957fbfc30
    ffff8f4957fbfbf0: 0000000000000076 0000000000000076
    ffff8f4957fbfc00: 0000000000000371 ffff8f4957fbfce8
    ffff8f4957fbfc10: ffffffff838dd0ba
 #9 [ffff8f4957fbfc10] task_numa_migrate at ffffffff838dd0ba
    ffff8f4957fbfc18: ffff8f1afc121f40 000000000000001a
    ffff8f4957fbfc28: 0000000000000371 ffff8f4aea3a8000   <- here ffff8f4957fbfc30 is the address of task_numa_env stored on the stack
    ffff8f4957fbfc38: 000000000000001a 000000010000003f
    ffff8f4957fbfc48: 000000000000000b 000000000000022c
    ffff8f4957fbfc58: 00000000000049a0 0000000000000012
    ffff8f4957fbfc68: 0000000000000001 0000000000000003
    ffff8f4957fbfc78: 000000000000006f 000000000000499f
    ffff8f4957fbfc88: 0000000000000012 0000000000000001
    ffff8f4957fbfc98: 0000000000000070 ffff8f1ae77dc100
    ffff8f4957fbfca8: 00000000000002fb 0000000000000001
    ffff8f4957fbfcb8: 0000000080490f7c ffff8f4aea3a8000   <- rbx saved on the stack here, so this is current
    ffff8f4957fbfcc8: 0000000000017a48 0000000000001818
    ffff8f4957fbfcd8: 0000000000000018 ffff8f4957fbfe20
    ffff8f4957fbfce8: ffff8f4957fbfcf8 ffffffff838dd4d3
#10 [ffff8f4957fbfcf0] numa_migrate_preferred at ffffffff838dd4d3
    ffff8f4957fbfcf8: ffff8f4957fbfd88 ffffffff838df5b0
...
crash>
Overall, cpu 26 is also performing a NUMA balancing action. Let's briefly introduce how NUMA balancing is driven from the task_tick_fair function:
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &curr->se;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                entity_tick(cfs_rq, se, queued);
        }

        if (numabalancing_enabled)      <- if NUMA balancing is enabled, task_tick_numa is called
                task_tick_numa(rq, curr);

        update_rq_runnable_avg(rq, 1);
}
Following the code, task_tick_numa queues a work item when NUMA balancing is needed. That work calls change_prot_numa, which sets all PTEs mapping the VMA to PAGE_NONE, so the process's next access to those pages triggers a page fault. In handle_pte_fault, the kernel uses this hint-fault opportunity to choose a better node for the page; we won't expand on that here.
In cpu 26's call chain stop_two_cpus -> cpu_stop_queue_two_works -> __cpu_stop_queue_work, cpu_stop_queue_two_works is inlined, and it calls __cpu_stop_queue_work twice, so we must determine from the stack addresses which of the two calls is involved.
227 static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1,
228                                     int cpu2, struct cpu_stop_work *work2)
229 {
230         struct cpu_stopper *stopper1 = per_cpu_ptr(&cpu_stopper, cpu1);
231         struct cpu_stopper *stopper2 = per_cpu_ptr(&cpu_stopper, cpu2);
232         int err;
233
234         lg_double_lock(&stop_cpus_lock, cpu1, cpu2);
235         spin_lock_irq(&stopper1->lock);     <- note that stopper1's lock is already held here
236         spin_lock_nested(&stopper2->lock, SINGLE_DEPTH_NESTING);
...
243         __cpu_stop_queue_work(stopper1, work1);
244         __cpu_stop_queue_work(stopper2, work2);
...
251 }
According to the addresses on the stack:
 #5 [ffff8f4957fbfa78] wake_up_process at ffffffff838d6705
    ffff8f4957fbfa80: ffff8f4957fbfa98 ffffffff8392fc05
 #6 [ffff8f4957fbfa88] __cpu_stop_queue_work at ffffffff8392fc05
    ffff8f4957fbfa90: 000000000000001a ffff8f4957fbfbb0
    ffff8f4957fbfaa0: ffffffff8393037a
 #7 [ffff8f4957fbfaa0] stop_two_cpus at ffffffff8393037a
    ffff8f4957fbfaa8: 0000000100000001 ffff8f1afdf55fe8
crash> dis -l ffffffff8393037a 2
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/stop_machine.c: 244
0xffffffff8393037a:  lea    0x48(%rsp),%rsi
0xffffffff8393037f:  mov    %r15,%rdi
That is, the return address on the stack points to line 244, which means the __cpu_stop_queue_work call at line 243 was in progress.
Then analyze the corresponding input parameters:
crash> task_numa_env ffff8f4957fbfc30
struct task_numa_env {
  p = 0xffff8f4aea3a8000,
  src_cpu = 26,
  src_nid = 0,
  dst_cpu = 63,
  dst_nid = 1,
  src_stats = {
    nr_running = 11,
    load = 556,                    <- high load
    compute_capacity = 18848,      <- capacity roughly equal on both sides
    task_capacity = 18,
    has_free_capacity = 1
  },
  dst_stats = {
    nr_running = 3,
    load = 111,                    <- low load with equal capacity, so migrate here
    compute_capacity = 18847,
    task_capacity = 18,
    has_free_capacity = 1
  },
  imbalance_pct = 112,
  idx = 0,
  best_task = 0xffff8f1ae77dc100,  <- task to be swapped, found via task_numa_find_cpu -> task_numa_compare -> task_numa_assign
  best_imp = 763,
  best_cpu = 1                     <- cpu 1 is the best swap target
}
crash> migration_swap_arg ffff8f4957fbfbc0
struct migration_swap_arg {
  src_task = 0xffff8f4aea3a8000,
  dst_task = 0xffff8f1ae77dc100,
  src_cpu = 26,
  dst_cpu = 1     <- the selected dst cpu is 1
}
From the cpu_stop_queue_two_works code, try_to_wake_up is called while cpu 26's cpu_stopper lock is held, and the task being woken is the migration kernel thread:
static void __cpu_stop_queue_work(struct cpu_stopper *stopper,
                                  struct cpu_stop_work *work)
{
        list_add_tail(&work->list, &stopper->works);
        wake_up_process(stopper->thread);   /* usually wakes the migration thread */
}
Because the best cpu is 1, the migration thread on cpu 1 is needed to pull the process over.
crash> p cpu_stopper:1
per_cpu(cpu_stopper, 1) = $33 = {
  thread = 0xffff8f1bf6bb9040,   <- the task to be woken
  lock = {
    {
      rlock = {
        val = {
          counter = 1
        }
      }
    }
  },
  enabled = true,
  works = {
    next = 0xffff8f4957fbfac0,
    prev = 0xffff8f4957fbfac0
  },
  stop_work = {
    list = {
      next = 0xffff8f4afbe16000,
      prev = 0xffff8f4afbe16000
    },
    fn = 0xffffffff83952100,
    arg = 0x0,
    done = 0xffff8f1ae3647c08
  }
}

crash> kmem 0xffff8f1bf6bb9040
CACHE             NAME         OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE
ffff8eecffc05f00  task_struct     4152       1604   2219    317    32k
SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
fffff26501daee00  ffff8f1bf6bb8000     1      7          7     0
FREE / [ALLOCATED]
[ffff8f1bf6bb9040]

    PID: 14
COMMAND: "migration/1"          <- the wake target is the migration task on the corresponding cpu
   TASK: ffff8f1bf6bb9040  [THREAD_INFO: ffff8f1bf6bc4000]
    CPU: 1
  STATE: TASK_INTERRUPTIBLE (PANIC)

PAGE              PHYSICAL    MAPPING  INDEX  CNT  FLAGS
fffff26501daee40  3076bb9000  0        0      0    6fffff00008000 tail
The question now is: although we know the cpu 26 process wakes the migration thread on cpu 1 while holding the lock, why does it release the lock so late that cpu 1 waits long enough to trigger the hard-lockup panic?
Let's analyze why it has held the lock for so long:
 #3 [ffff8f1afdf48ef0] end_repeat_nmi at ffffffff83f6bd69
    [exception RIP: try_to_wake_up+114]
    RIP: ffffffff838d63d2  RSP: ffff8f4957fbfa30  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffff8f1bf6bb9844  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: 0000000000000003  RDI: ffff8f1bf6bb9844
    RBP: ffff8f4957fbfa70   R8: ffff8f4afbe15ff0   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R13: ffff8f1bf6bb9040
    R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff8f4957fbfa30] try_to_wake_up at ffffffff838d63d2
    ffff8f4957fbfa38: 000000000001ab80 0000000000000086
    ffff8f4957fbfa48: ffff8f4afbe15fe0 ffff8f4957fbfb48
    ffff8f4957fbfa58: 0000000000000001 ffff8f4afbe15fe0
    ffff8f4957fbfa68: ffff8f1afdf55fe0 ffff8f4957fbfa80
crash> dis -l ffffffff838d63d2
/usr/src/debug/kernel-3.10.0-957.el7/linux-3.10.0-957.el7.x86_64/kernel/sched/core.c: 1790
0xffffffff838d63d2:  mov    0x28(%r13),%eax
1721 static int
1722 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
1723 {
...
1787          * If the owning (remote) cpu is still in the middle of schedule() with
1788          * this task as prev, wait until its done referencing the task.
1789          */
1790         while (p->on_cpu)        <- spinning in this loop
1791                 cpu_relax();
...
1814         return success;
1815 }
Let's use a simple diagram to represent this hard lockup:
CPU 1                                    CPU 26

schedule(.prev=migrate/1)                task_numa_migrate()
  pick_next_task()                         migrate_swap()
    ...                                      stop_two_cpus()
    idle_balance()                             spin_lock(stopper1->lock)
      load_balance()                           spin_lock(stopper26->lock)
        active_balance                         try_to_wake_up(migrate/1)
          stop_one_cpu(26)                       while (p->on_cpu)
            spin_lock(stopper26->lock)             cpu_relax()
              -- waits for stopper lock          -- waits for schedule() to finish
Checking the upstream patch:
 static void __cpu_stop_queue_work(struct cpu_stopper *stopper,
-					struct cpu_stop_work *work)
+					struct cpu_stop_work *work,
+					struct wake_q_head *wakeq)
 {
 	list_add_tail(&work->list, &stopper->works);
-	wake_up_process(stopper->thread);
+	wake_q_add(wakeq, stopper->thread);
 }
III. Failure reproduction
Because this hard lockup comes from a race condition and the logical analysis holds up, we did not spend time reproducing it. The environment ran a DPDK node that, for performance reasons, was pinned to a single NUMA node, which frequently causes NUMA imbalance. Anyone wanting to reproduce the problem can likewise run DPDK pinned to a single NUMA node, which raises the probability of hitting it.
IV. Failure avoidance or resolution
Our solution is:
1. Turn off automatic NUMA balancing.
2. Manually backport the upstream Linux patch 0b26351b910f.
3. Upgrade the kernel: CentOS incorporated this patch in 3.10.0-974.el7:
[kernel] stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock (Phil Auld) [1557061]
Red Hat also backported the fix to 3.10.0-957.27.2.el7.x86_64, so upgrading the CentOS kernel to 3.10.0-957.27.2.el7.x86_64 is another option.