This article shows how to use trace events to track down why a system cannot enter deep sleep (deep C-states). The content is simple and clear; I hope it resolves your doubts. Let me walk you through it.
Recently I ran into a problem: the system could not sleep down to C7s, only down to C3. (This refers to CPU C-states: C0 is the running state, all other states are idle states, and the deeper the sleep, the higher the C-state number.)
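A quick way to confirm which C-states the CPUs actually reach is the cpuidle counters in sysfs. The snippet below is my own addition (state numbering and names vary by CPU model); usage is the number of times a state was entered and time is the total time spent in it, in microseconds:

for s in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
    # print name, entry count and accumulated residency for each idle state
    printf '%-5s usage=%-10s time=%s us\n' "$(cat $s/name)" "$(cat $s/usage)" "$(cat $s/time)"
done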
My first instinct was that the system was busy, so I used perf top to look at the processes and hot functions consuming CPU:
perf top -E 100 --stdio > perf-top.txt

 19.85%  perf             [.] symbols__insert
  7.68%  perf             [.] rb_next
  4.60%  libc-2.26.so     [.] __strcmp_sse2_unaligned
  4.20%  libelf-0.168.so  [.] gelf_getsym
  3.92%  perf             [.] dso__load_sym
  3.86%  libc-2.26.so     [.] _int_malloc
  3.60%  libc-2.26.so     [.] __libc_calloc
  3.30%  libc-2.26.so     [.] vfprintf
  2.95%  perf             [.] rb_insert_color
  2.61%  [kernel]         [k] prepare_exit_to_usermode
  2.51%  perf             [.] machine__map_x86_64_entry_trampolines
  2.31%  perf             [.] symbol__new
  2.22%  [kernel]         [k] do_syscall_64
  2.11%  libc-2.26.so     [.] __strlen_avx2
It turns out that the only noticeable CPU consumer in the system is the perf tool itself.
Then I wondered whether some process in the system was waking the CPUs so often that they never reached C7s. This is a job for trace events: use the trace-cmd tool to record the sched_switch (process switch) event on all CPUs for 30 seconds:
# trace-cmd record -e sched:sched_switch -M -1 sleep 30
CPU0 data recorded at offset=0x63e000
    102400 bytes in size
CPU1 data recorded at offset=0x657000
    8192 bytes in size
CPU2 data recorded at offset=0x659000
    20480 bytes in size
CPU3 data recorded at offset=0x65e000
    20480 bytes in size
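For reference, the same sched_switch event can also be enabled by hand through the tracefs interface, without trace-cmd. This is a minimal sketch of my own, assuming tracefs is mounted at the usual /sys/kernel/debug/tracing:

cd /sys/kernel/debug/tracing
echo 1 > events/sched/sched_switch/enable    # start recording sched_switch events
sleep 30                                     # collect for 30 seconds
echo 0 > events/sched/sched_switch/enable    # stop recording
head trace                                   # inspect the raw trace buffer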
Use trace-cmd report to look at the recording. Raw data like this is not very readable, though, and it carries no statistics about how often each process was switched in:
# trace-cmd report
cpus=4
      trace-cmd-19794 [001] 225127.464466: sched_switch: trace-cmd:19794 [120] S ==> swapper/1:0 [120]
      trace-cmd-19795 [003] 225127.464601: sched_switch: trace-cmd:19795 [120] S ==> swapper/3:0 [120]
          sleep-19796 [002] 225127.464792: sched_switch: sleep:19796 [120] S ==> swapper/2:0 [120]
         <idle>-0     [003] 225127.471948: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
      rcu_sched-11    [003] 225127.471950: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
         <idle>-0     [003] 225127.479959: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
      rcu_sched-11    [003] 225127.479960: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
         <idle>-0     [003] 225127.487959: sched_switch: swapper/3:0 [120] R ==> rcu_sched:11 [120]
      rcu_sched-11    [003] 225127.487961: sched_switch: rcu_sched:11 [120] W ==> swapper/3:0 [120]
         <idle>-0     [002] 225127.491959: sched_switch: swapper/2:0 [120] R ==> kworker/2:2:19735 [120]
    kworker/2:2-19735 [002] 225127.491972: sched_switch: kworker/2:2:19735 [120] W ==> swapper/2:0 [120]
So filter the trace-cmd report output with a regular expression and count how often each task was switched in:
# trace-cmd report | grep -o '==> [^ ]\+' | sort | uniq -c
      3 ==> irqbalance:1034
      3 ==> khugepaged:43
     20 ==> ksoftirqd/0:10
      1 ==> ksoftirqd/1:18
     18 ==> ksoftirqd/3:30
      1 ==> kthreadd:19798
      1 ==> kthreadd:2
      4 ==> kworker/0:0:19785
      1 ==> kworker/0:1:19736
      5 ==> kworker/0:1:19798
      5 ==> kworker/0:1H:364
     53 ==> kworker/0:2:19614
     19 ==> kworker/1:1:7665
     30 ==> tuned:19498
    ...
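When there are many tasks, a slightly more convenient variant (my own addition, not part of the original run) sorts the counts numerically so the most frequently scheduled tasks come first:

# count switch-in targets and show the busiest ones at the top
trace-cmd report | grep -o '==> [^ ]\+' | sort | uniq -c | sort -rn | head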
This reveals a suspicious thread, tuned: it was switched in 30 times during the 30-second window, roughly once per second, while the others are ordinary kernel threads.
At this point, check to see if the tuned service is enabled in the system:
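(The exact command is not shown in the original; a typical check, whose output differs from system to system, is something like:)

systemctl status tuned    # is the tuned service active and enabled?
ps -C tuned -o pid,cmd    # is a tuned process/thread running?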
Indeed, the tuned service is enabled on this system and has started a thread named tuned.
Check the configuration file for the tuned service:
localhost:/home/jeff # tuned-adm active
Current active profile: sap-hana
localhost:/home/jeff # cat /usr/lib/tuned/sap-hana/tuned.conf
[main]
summary=Optimize for SAP NetWeaver, SAP HANA and HANA based products

[cpu]
force_latency = 70
In the [cpu] section, the forced latency is set to 70 microseconds (force_latency = 70), an optimization for the HANA database.
How does force_latency take effect? A bit of searching shows that tuned writes this value into /dev/cpu_dma_latency.
Running lsof /dev/cpu_dma_latency confirms that the tuned thread is indeed holding this file open:
# lsof /dev/cpu_dma_latency
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
tuned   18734 root    9w   CHR  10,60      0t0 11400 /dev/cpu_dma_latency
The Linux kernel documentation also describes /dev/cpu_dma_latency: a process that writes a value to it must keep the file descriptor open for the request to stay in effect, and as soon as the descriptor is released the value falls back to the default. That is consistent with tuned showing up in the lsof output above.
https://github.com/torvalds/linux/blob/v5.8/Documentation/trace/coresight/coresight-cpu-debug.rst

    As specified in the PM QoS documentation the requested parameter will stay
    in effect until the file descriptor is released. For example:

    # exec 3<> /dev/cpu_dma_latency; echo 0 >&3
    ...
    Do some work...
    ...
    # exec 3<>-
Check the contents of /dev/cpu_dma_latency: it is indeed 70, matching force_latency = 70:
localhost:/home/jeff # cat /dev/cpu_dma_latency | hexdump -Cv
00000000  46 00 00 00                                       |F...|
localhost:/home/jeff # echo $((0x46))
70
Now look at the description and exit latency of each CPU idle state in the system:
# cd /sys/devices/system/cpu/cpu0/cpuidle/
# for state in *; do
>     echo -e "STATE: $state\tDESC: $(cat $state/desc)\tNAME: $(cat $state/name)\tLATENCY: $(cat $state/latency)\tRESIDENCY: $(cat $state/residency)"
> done
The exit latency of the C3 state is 33 microseconds, while the next deeper state (state4, C6) needs 133 microseconds. With force_latency = 70 the system can therefore sleep no deeper than C3. (The latency value is the time it takes to go from the idle state back to the running state.)
STATE: state0  DESC: CPUIDLE CORE POLL IDLE  NAME: POLL  LATENCY: 0    RESIDENCY: 0
STATE: state1  DESC: MWAIT 0x00              NAME: C1    LATENCY: 2    RESIDENCY: 2
STATE: state2  DESC: MWAIT 0x01              NAME: C1E   LATENCY: 10   RESIDENCY: 20
STATE: state3  DESC: MWAIT 0x10              NAME: C3    LATENCY: 33   RESIDENCY: 100
STATE: state4  DESC: MWAIT 0x20              NAME: C6    LATENCY: 133  RESIDENCY: 400
STATE: state5  DESC: MWAIT 0x32              NAME: C7s   LATENCY: 166  RESIDENCY: 500
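To make the comparison explicit, here is a small helper sketch of my own, built only on the sysfs files shown above, that lists which idle states fit under the current /dev/cpu_dma_latency bound:

# read the current PM QoS bound (a 32-bit value, in microseconds)
limit=$(hexdump -e '1/4 "%u"' /dev/cpu_dma_latency)
for s in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
    lat=$(cat $s/latency)
    if [ "$lat" -le "$limit" ]; then
        echo "$(cat $s/name): exit latency ${lat} us <= ${limit} us, allowed"
    else
        echo "$(cat $s/name): exit latency ${lat} us >  ${limit} us, blocked"
    fi
done

With the bound at 70, only POLL, C1, C1E and C3 are allowed, which matches the behaviour observed above.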
Now turn off the tuned profile and check /dev/cpu_dma_latency again: it returns to the default value of 2000 seconds (2,000,000,000 microseconds):
localhost:/home/jeff # tuned-adm off
localhost:/home/jeff # cat /dev/cpu_dma_latency | hexdump -Cv
00000000  00 94 35 77                                       |..5w|
localhost:/home/jeff # echo $((0x77359400))
2000000000
A final check confirms that the system can now sleep down to C7s. Problem solved.
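If the rest of the sap-hana tuning is still wanted, an alternative to switching tuned off entirely (a sketch of my own, not part of the original fix; the profile name is hypothetical) is a custom profile that inherits sap-hana and overrides force_latency:

mkdir -p /etc/tuned/sap-hana-deep-idle
cat > /etc/tuned/sap-hana-deep-idle/tuned.conf <<'EOF'
[main]
include=sap-hana

[cpu]
# allow idle states with exit latency up to 200 us, which covers C7s (166 us)
force_latency=200
EOF
tuned-adm profile sap-hana-deep-idle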
To solve this problem we relied mainly on the trace-event facility that the Linux kernel itself provides.

So no feature should be underestimated. The kernel is full of facilities that look unglamorous at first glance, but once they have been polished by engineers who take them seriously, their potential is enormous.
That is all of "how to use trace events to solve the problem of the system not reaching deep sleep". Thank you for reading, and I hope it was helpful.