
How to use Perf under Linux


In this article, the editor shares how to use Perf under Linux. Most people do not know much about it, so this article is offered for your reference; I hope you learn a lot from it. Let's get started!

1. Background knowledge

1.1 Hardware features related to performance tuning

Hardware feature: cache

Memory reads and writes are fast, but they still cannot keep up with the speed at which the processor executes instructions. To read instructions and data from memory, the processor has to wait, and measured in processor time the wait is very long. A cache is a type of SRAM that reads and writes very quickly and can match the processor's speed, so keeping frequently used data in the cache spares the processor the wait and improves performance. The cache is generally very small, and making full use of it is an important part of software tuning.

Hardware features: pipelining, superscalar architecture, out-of-order execution

One of the most effective ways to improve performance is parallelism. Processors are therefore designed to exploit parallelism as much as possible, for example through pipelining, superscalar architecture, and out-of-order execution.

The processor handles an instruction in several steps: it first fetches the instruction, then performs the operation, and finally writes the result to the bus. While the first instruction is in its execution stage, the second is already being fetched; when the first writes back its result, the second can execute and the third has already been fetched. Over time it looks as if three instructions are being executed at once. Inside the processor, this can be viewed as a three-stage pipeline.

Superscalar refers to a pipelined architecture that issues more than one instruction per clock cycle. Intel's Pentium processor, for example, has two execution units and can execute two instructions in one clock cycle.

Inside the processor, different instructions need different processing steps and numbers of clock cycles. If the program's order of execution were followed strictly, the pipeline could not be fully utilized, so instructions may be executed out of order.

All three parallel techniques above place a basic requirement on the instructions being executed: adjacent instructions must not depend on each other. If an instruction needs the result of the previous one, pipelining brings no benefit, because the second instruction must wait for the first to complete. Good software should therefore avoid generating this kind of code as far as possible.
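A quick way to see whether the pipeline is being kept busy is to compare instructions with cycles. A minimal sketch, assuming a workload binary named ./prog (a placeholder) and a CPU/perf combination that exposes the generic stalled-cycles counters (not all do):

$ perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend ./prog
# A low instructions-per-cycle (IPC) ratio, or a large share of stalled cycles,
# suggests the pipeline is often waiting, for example on dependent instructions
# or on memory.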

Hardware feature: branch prediction

Branch instructions have a large impact on software performance, especially on a pipelined processor. Suppose the pipeline has three stages and the instruction currently entering it is a branch. If the processor fetches instructions sequentially and the branch actually jumps elsewhere, the next two instructions already prefetched into the pipeline have to be discarded, which hurts performance. Many processors therefore provide branch prediction: based on the history of the same instruction, they fetch the instruction most likely to be executed next rather than the next one in sequence.

Branch prediction places some requirements on program structure. For repetitive branch sequences the prediction hardware does well, but for structures such as switch/case it often cannot produce good predictions.

The processor features described above have a great impact on software performance, but a profiler that relies only on periodic clock sampling cannot reveal how a program uses them. For this reason, processor vendors added a Performance Monitoring Unit (PMU) to the hardware. The PMU lets software set up a counter for a hardware event; the processor then counts how often that event occurs and raises an interrupt when the count exceeds the value programmed into the counter. For example, the PMU can generate an interrupt when the number of cache misses reaches a given value.

By capturing these interrupts, you can examine the program's efficiency in using these hardware features.
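For example, the following is a minimal sketch of letting perf program the PMU for a few hardware events; ./prog stands for whatever workload you want to measure, and the event names can be checked against perf list on your system:

$ perf stat -e cache-references,cache-misses,branches,branch-misses ./prog
# perf asks the kernel to program PMU counters for these hardware events and
# reports how often each one fired while ./prog was running.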

1.2 Tracepoints

Tracepoints are hooks scattered through the kernel source code. Once enabled, they are triggered when a particular piece of code runs, and they can be used by various trace and debug tools. Perf is one user of this feature.

If you want to know how the kernel memory management module behaves while an application runs, you can use the tracepoints placed in the slab allocator. When the kernel reaches these tracepoints, Perf is notified; it records the events they generate and produces a report. By analyzing the report, the tuner can understand what the kernel was doing while the program ran and diagnose performance symptoms more accurately.
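As a sketch (assuming root privileges and a kernel with tracepoints enabled; kmem:kmalloc is one of the slab-related tracepoints shown by perf list):

$ perf record -e kmem:kmalloc -a sleep 5   # sample the kmalloc tracepoint system-wide for 5 seconds
$ perf report                              # see which code paths allocated most often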

2. Introduction to Perf

Perf is a system performance optimization tool included in the Linux kernel. The principle is as follows:

It reads and programs the performance counters in the CPU's PMU registers to obtain information such as instructions executed, cache misses suffered, and branches mispredicted. The Linux kernel provides a set of abstractions over these registers, so sample information can be viewed per process, per CPU, or per counter group.

With Perf, applications can use the PMU, tracepoints, and special counters in the kernel to gather performance statistics. It can analyze the performance of a specified application (per thread), analyze the performance of the kernel itself, or analyze application code and kernel at the same time to give a full picture of the application's performance bottlenecks.

Perf can be used for performance statistics and analysis of both applications and kernel code. Thanks to its sound architecture, more and more features have been added to it, turning Perf into a multi-purpose performance statistics tool set.

3. Basic use of perf

The basic principle behind performance tuning tools such as perf and OProfile is to sample the monitored object. In the simplest case, sampling is driven by the tick interrupt: a sampling point fires inside the tick interrupt and the program's context is examined at that point. If a program spends 90% of its time in the function foo(), then roughly 90% of the sampling points should fall within foo()'s context. This is statistical, but as long as the sampling frequency is high enough and the sampling period long enough, the inference is reliable. Sampling on the tick therefore tells us which parts of the program consume the most time, so we know where to focus the analysis. Extending the idea a little, changing what triggers the sampling yields different statistics: triggering on a point in time (such as the tick) gives the distribution of the program's running time; triggering on cache-miss events shows where in the program cache misses tend to occur; and so on.
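For instance, the same record/report workflow answers different questions depending on the event chosen as the trigger; a sketch, with ./prog standing in for the program under study:

$ perf record -e cycles ./prog         # time-based profile: where does the time go?
$ perf report
$ perf record -e cache-misses ./prog   # cache profile: where do cache misses cluster?
$ perf report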

So let's first take a look at the events in perf that trigger sampling.

3.1 Perf list, perf events

Use the perf list command to list all the events that can trigger perf sampling points, for example:

# perf list

List of pre-defined events (to be used in -e):

  cpu-cycles OR cycles                 [Hardware event]
  instructions                         [Hardware event]
  ...
  cpu-clock                            [Software event]
  task-clock                           [Software event]
  context-switches OR cs               [Software event]
  ...
  ext4:ext4_allocate_inode             [Tracepoint event]
  kmem:kmalloc                         [Tracepoint event]
  module:module_load                   [Tracepoint event]
  workqueue:workqueue_execution        [Tracepoint event]
  sched:sched_{wakeup,switch}          [Tracepoint event]
  syscalls:sys_{enter,exit}_epoll_wait [Tracepoint event]
  ...

Different systems list different results. On the 2.6.35 kernel the list is already quite long, but however many events there are, they fall into three categories:

Hardware events are generated by the PMU hardware, such as cache hits. You need to sample these events when you want to know how the program uses hardware features.

Software events are generated by the kernel, such as process switches and tick counts.

Tracepoint events are triggered by static tracepoints in the kernel. They reveal details of the kernel's behavior while the program runs, such as the number of allocations made by the slab allocator.

Each of the events above can be used for sampling and produces its own statistics. To this day there is no documentation explaining the meaning of each event in detail.
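On newer perf versions (an assumption; very old builds may not accept these arguments), perf list also takes a category name or a glob, which makes the long list easier to explore:

$ perf list hw           # hardware events only
$ perf list 'sched:*'    # only the scheduler tracepoints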

3.2 Perf stat

Faced with a problem program, it is best to use a top-down strategy: first get an overall view of the various statistics while the program runs, then drill down into particular areas. Do not dive into trivial details right away, or you will lose sight of the bigger picture.

Some programs are slow because they compute a lot, so they spend most of their time on the CPU; these are called CPU bound. Others are slow because of too much IO, and their CPU utilization is correspondingly low; these are called IO bound. Tuning a CPU bound program is different from tuning an IO bound one.

If you accept these claims, perf stat should be the first tool you reach for. It gives a concise overview and summary statistics of the program being examined.

The following shows the output of perf stat for program t1:

$ perf stat ./t1

Performance counter stats for './t1':

    262.738415  task-clock-msecs     #   0.991 CPUs
             2  context-switches     #   0.000 M/sec
             1  CPU-migrations       #   0.000 M/sec
            81  page-faults          #   0.000 M/sec
       9478851  cycles               #  36.077 M/sec  (scaled from 98.24%)
          6771  instructions         #   0.001 IPC    (scaled from 98.99%)
     111114049  branches             # 422.908 M/sec  (scaled from 99.37%)
          8495  branch-misses        #   0.008 %      (scaled from 95.91%)
      12152161  cache-references     #  46.252 M/sec  (scaled from 96.16%)
       7245338  cache-misses         #  27.576 M/sec  (scaled from 95.49%)

   0.265238069  seconds time elapsed

The output above tells us that program t1 is CPU bound, because the CPU utilization shown next to task-clock-msecs (0.991 CPUs) is close to 1.

To tune t1, we should find the hotspot (the most time-consuming piece of code) and see whether its efficiency can be improved.

By default, in addition to task-clock-msecs, perf stat gives several other statistics that are most commonly used:

Task-clock-msecs: CPU utilization. A high value means the program spends most of its time on CPU computation rather than IO.
Context-switches: the number of process switches during the run. Frequent process switching should be avoided.
Cache-misses: the program's overall cache usage during the run. If this value is too high, the program is not making good use of the cache.
CPU-migrations: how many times process t1 was migrated during its run, that is, moved by the scheduler from one CPU to another.
Cycles: processor clock cycles; a single machine instruction may take several cycles.
Instructions: the number of machine instructions executed.
IPC: the Instructions/Cycles ratio. The higher, the better; a high value means the program makes good use of the processor.
Cache-references: the number of cache hits.
Cache-misses: the number of cache misses.

By specifying the -e option, you can change the default events counted by perf stat (events were explained in the previous section and can be listed with perf list). If you already have plenty of tuning experience, you may use -e to watch the particular events you care about.
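For example, a sketch that reuses the t1 program from above (event names should be checked against perf list on your system):

$ perf stat -e cycles,instructions,cache-references,cache-misses,branch-misses ./t1
# Only the listed events are counted, replacing the default set shown earlier.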

3.3 Perf top

Perf top displays performance statistics for the current system in real time. It is mainly used to observe the state of the whole system; for example, its output shows the most time-consuming kernel functions or user processes on the system right now.

Here is the possible output of perf top:

PerfTop:  705 irqs/sec  kernel:60.4%  [1000Hz cycles]
--------------------------------------------------------------------------------
  samples  pcnt   function                 DSO
  1503.00  49.2%  t2
    72.00   2.2%  pthread_mutex_lock       /lib/libpthread-2.12.so
    68.00   2.1%  delay_tsc                [kernel.kallsyms]
    55.00   1.7%  aes_dec_blk              [aes_i586]
    55.00   1.7%  drm_clflush_pages        [drm]
    52.00   1.6%  system_call              [kernel.kallsyms]
    49.00   1.5%  __memcpy_ssse3           /lib/libc-2.12.so
    48.00   1.4%  __strstr_ia32            /lib/libc-2.12.so
    46.00   1.4%  unix_poll                [kernel.kallsyms]
    42.00   1.3%  __ieee754_pow            /lib/libm-2.12.so
    41.00   1.2%  do_select                [kernel.kallsyms]
    40.00   1.2%  pixman_rasterize_edges   libpixman-1.so.0.18.0
    37.00   1.1%  _raw_spin_lock_irqsave   [kernel.kallsyms]
    36.00   1.1%  _int_malloc              /lib/libc-2.12.so
^C

It is easy to see that t2 is the suspicious program that needs attention. Its modus operandi is simple: it shamelessly wastes CPU, so we do not need to do anything else to find the problem. In real life, though, programs that hurt performance are usually not so foolish, and we often need other perf tools for further analysis.

By adding the -e option you can list the top processes or functions responsible for other events. For example, -e cache-misses shows who causes the most cache misses.
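A minimal sketch (perf top usually needs root; press q to quit):

$ perf top -e cache-misses        # rank functions by cache misses instead of cycles
$ perf top -e context-switches    # or rank by a software event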

3.4 Using perf record and interpreting the report

lenny@hbt:~/test$ perf record ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.015 MB perf.data (~656 samples) ]
lenny@hbt:~/test$ perf report
Events: 12  cycles
 62.29%  a.out  a.out              [.] longa
 35.17%  a.out  [kernel.kallsyms]  [k] unmap_vmas
  1.76%  a.out  [kernel.kallsyms]  [k] _schedule
  0.75%  a.out  [kernel.kallsyms]  [k] _cache_alloc
  0.03%  a.out  [kernel.kallsyms]  [k] native_write_msr_safe

Adding the -g option to perf record also records function call relationships, which are then shown as follows:

lenny@hbt:~/test$ perf record -g ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (~701 samples) ]
lenny@hbt:~/test$ perf report
Events: 14  cycles
-  87.12%  a.out  a.out              [.] longa
   - longa
      - 52.91% fun1
             main
             __libc_start_main
      - 47.09% fun2
             main
             __libc_start_main
+   9.12%  a.out  [kernel.kallsyms]  [k] vm_normal_page
+   3.48%  a.out  [kernel.kallsyms]  [k] _cond_resched
+   0.28%  a.out  [kernel.kallsyms]  [k] native_write_msr_safe
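On newer perf versions (an assumption; the option does not exist in very old builds), you can also choose how call chains are collected, which helps when the program was compiled without frame pointers:

$ perf record --call-graph dwarf ./a.out   # unwind call chains using DWARF debug info
$ perf report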

Finally, here is an example of using the PMU. The test program t3 below is drawn from the article "Branch and Loop Reorganization to Prevent Mispredicts" and examines how well a program uses the Pentium processor's branch prediction. As mentioned earlier, branch prediction can significantly improve processor performance, while mispredictions significantly reduce it. First, an example with BTB (branch target buffer) failures:

Listing 3. An example program with BTB failure

// test.c
#include <stdio.h>
#include <stdlib.h>

void foo()
{
    int i, j;
    for (i = 0; i < 10; i++)
        j += 2;
}

int main(void)
{
    int i;
    for (i = 0; i < 100000000; i++)
        foo();
    return 0;
}

Compile it with gcc to produce the test program t3:

gcc -o t3 -O0 test.c

Use perf stat to examine how branch prediction is used:

[lm@ovispoly perf]$ ./perf stat ./t3

Performance counter stats for './t3':

   6240.758394  task-clock-msecs     #  0.995 CPUs
           126  context-switches     #  0.000 M/sec
            12  CPU-migrations       #  0.000 M/sec
            80  page-faults          #  0.000 M/sec
      17683221  cycles               #  2.834 M/sec   (scaled from 99.78%)
      10218147  instructions         #  0.578 IPC     (scaled from 99.83%)
    2491317951  branches             # 399.201 M/sec  (scaled from 99.88%)
     636140932  branch-misses        # 25.534 %       (scaled from 99.63%)
     126383570  cache-references     # 20.251 M/sec   (scaled from 99.68%)
     942937348  cache-misses         # 151.093 M/sec  (scaled from 99.58%)

   6.271917679  seconds time elapsed

You can see that branch-misses is quite severe, around 25%. The processor in my test machine is a Pentium 4, whose BTB holds 16 entries, while the loop in test.c iterates 20 times; the BTB overflows, so the processor's branch prediction becomes inaccurate. I will explain this statement briefly; for details about the BTB, please read reference [6]. Compiled into IA assembly, the for loop looks like this:

Listing 4. Assembly of the loop

// C code
for (i = 0; i < 20; i++)
{ ... }

// Assembly code
    mov esi, data
    mov ecx, 0
ForLoop:
    cmp ecx, 20
    jge EndForLoop
    ...
    add ecx, 1
    jmp ForLoop
EndForLoop:

As you can see, every loop iteration contains a branch instruction, jge, so 20 branch decisions are made at run time. Each branch decision is written to the BTB, but the BTB is a ring buffer: once its 16 slots are full, older entries start being overwritten. If the number of iterations were exactly 16, or fewer, the whole loop would fit in the BTB.

Listing 5. Code without BTB failure

#include <stdio.h>
#include <stdlib.h>

void foo()
{
    int i, j;
    for (i = 0; i < 10; i++)
        j += 2;
}

int main(void)
{
    int i;
    for (i = 0; i < 100000000; i++)
        foo();
    return 0;
}

Sampling again with perf stat now gives the following result:

[lm@ovispoly perf]$ ./perf stat ./t3

Performance counter stats for './t3':

   2784.004851  task-clock-msecs     #  0.927 CPUs
            90  context-switches     #  0.000 M/sec
             8  CPU-migrations       #  0.000 M/sec
            81  page-faults          #  0.000 M/sec
      33632545  cycles               # 12.081 M/sec   (scaled from 99.63%)
         42996  instructions         #  0.001 IPC     (scaled from 99.71%)
    1474321780  branches             # 529.569 M/sec  (scaled from 99.78%)
         49733  branch-misses        #  0.003 %       (scaled from 99.35%)
       7073107  cache-references     #  2.541 M/sec   (scaled from 99.42%)
      47958540  cache-misses         # 17.226 M/sec   (scaled from 99.33%)

   3.002673524  seconds time elapsed

Branch-misses has dropped. This example only demonstrates how perf uses the PMU and is not meaningful in itself. For guidance on making full use of the processor when tuning, refer to the optimization manuals published by Intel; other processors may require different approaches, so readers should judge for themselves.

3.5 Using tracepoints

When perf samples on tick points, you can find the hot spots in kernel code. So when do you need to sample with tracepoints? The basic reason to use tracepoints is an interest in the kernel's runtime behavior. As mentioned earlier, some kernel developers focus on a particular subsystem, such as the memory management module, and need statistics on how the relevant kernel functions run. The kernel's behavior also has a non-negligible effect on application performance. Taking an earlier regret as an example: if I could turn back the clock, what I would want to know is how many system calls actually occurred while the application was running, and where they occurred. Below I use the ls command to demonstrate the sys_enter tracepoint:

[root@ovispoly /]# perf stat -e raw_syscalls:sys_enter ls
bin dbg etc lib media opt root selinux sys usr
boot dev home lost+found mnt proc sbin srv tmp var

Performance counter stats for 'ls':

           101  raw_syscalls:sys_enter

   0.003434730  seconds time elapsed

[root@ovispoly /]# perf record -e raw_syscalls:sys_enter ls
[root@ovispoly /]# perf report
Failed to open .lib/ld-2.12.so, continuing without symbols
# Samples: 70
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
   97.14%   ls       ld-2.12.so     [.] 0x0000000001629d
    2.86%   ls       [vdso]         [.] 0x00000000421424
#
# (For a higher level overview, try: perf report --sort comm,dso)
#

This report shows how many system calls occurred while ls was running (101 in the example above) and where most of them happened (97% in ld-2.12.so). With this report I may be able to find more places to tune: if too many system calls are made in some function foo(), I can think about whether any of the unnecessary ones can be eliminated. You might say that strace can do the same thing, and indeed counting system calls can be done entirely with strace, but perf can do other things as well; all you need to do is change the string after the -e option. Listing every tracepoint here would not be appropriate, and this article will not do that; but learning each tracepoint is worthwhile, rather like memorizing vocabulary when learning English: slow and painful, yet something that has to be done.

3.6 Perf probe

Tracepoints are static checkpoints: once one is in a given place it stays there, and you cannot move it a single step. How many lines of kernel code are there? I do not know, but at least a million; and how many tracepoints are there at present? My bold guess is no more than a thousand. So the value of being able to insert a dynamic probe point wherever you want to look is self-evident.

Perf is not the first piece of software to offer this capability; SystemTap implemented it long ago. But unless you run a Red Hat distribution, installing SystemTap is not a pleasant experience. Perf is part of the kernel source package, which makes it very convenient to use and maintain.

The Linux version I use is 2.6.33, so the command parameters may differ when you run the experiments yourself.

[root@ovispoly perftest]# perf probe schedule:12 cpu
Added new event:
probe:schedule (on schedule+52 with cpu)

You can now use it on all perf tools, such as:
        perf record -e probe:schedule -a sleep 1

[root@ovispoly perftest]# perf record -e probe:schedule -a sleep 1
Error, output file perf.data exists, use -A to append or -f to overwrite.
[root@ovispoly perftest]# perf record -f -e probe:schedule -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.270 MB perf.data (~11811 samples) ]
[root@ovispoly perftest]# perf report
# Samples: 40
#
# Overhead  Command          Shared Object  Symbol
# ........  ...............  .............  ......
#
   57.50%   init             0              [k] 0000000000000000
   30.00%   firefox          [vdso]         [.] 0x0000000029c424
    5.00%   sleep            [vdso]         [.] 0x00000000ca7424
    5.00%   perf.2.6.33.3-8  [vdso]         [.] 0x00000000ca7424
    2.50%   ksoftirqd/0      [kernel]       [k] 0000000000000000
#
# (For a higher level overview, try: perf report --sort comm,dso)
#

The example above uses the probe command to add a dynamic probe point at line 12 of the kernel function schedule(). It behaves just like a tracepoint: once the kernel reaches the probe point, perf is notified. You can think of it as dynamically adding a new tracepoint. From then on you can select the probe point with the -e option of the record command and finally view the report with perf report. How to interpret that report is up to you; since you added the probe at line 12 of schedule(), presumably you know why you wanted to measure it.
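A few probe-management commands round out the workflow; this is a sketch assuming root privileges (do_sys_open is used here only as an arbitrary example target, and line-level probes like the one above additionally need kernel debug information):

$ perf probe --list                   # list the dynamic probes currently defined
$ perf probe --add do_sys_open        # add a probe at the entry of do_sys_open()
$ perf record -e probe:do_sys_open -a sleep 1
$ perf probe --del 'probe:*'          # remove the dynamic probes when you are done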
3.7 Perf sched

The quality of the scheduler directly affects a system's overall efficiency. Kernel hackers often argue in this area, and one important reason is that for different schedulers everyone produces a different benchmark report, often with opposite conclusions. An authoritative, unified evaluation tool would help end such disputes, and perf sched is an attempt at that.

Perf sched has five subcommands:

perf sched record   # low-overhead recording of arbitrary workloads
perf sched latency  # output per task latency metrics
perf sched map      # show summary/map of context-switching
perf sched trace    # output finegrained trace
perf sched replay   # replay a captured workload using simulated threads

Users generally run 'perf sched record' to collect scheduling-related data and then 'perf sched latency' to view statistics such as scheduling latency. The other three commands also read the data collected by record and present it from different angles. Each is demonstrated below.

perf sched record sleep 10     # record full system activity for 10 seconds
perf sched latency --sort max  # report latencies sorted by max

 -------------------------------------------------------------------------------------
  Task                  |  Runtime ms | Switches | Average delay ms | Maximum delay ms |
 -------------------------------------------------------------------------------------
  :14086:14086          |    0.095 ms |        2 | avg:  3.445 ms   | max:  6.891 ms   |
  gnome-session:13792   |   31.713 ms |      102 | avg:  0.160 ms   | max:  5.992 ms   |
  metacity:14038        |   49.220 ms |      637 | avg:  0.066 ms   | max:  5.942 ms   |
  gconfd-2:13971        |   48.587 ms |      777 | avg:  0.047 ms   | max:  5.793 ms   |
  gnome-power-man:14050 |  140.601 ms |      434 | avg:  0.097 ms   | max:  5.367 ms   |
  python:14049          |  114.694 ms |      125 | avg:  0.120 ms   | max:  5.343 ms   |
  kblockd/1:236         |    3.458 ms |      498 | avg:  0.179 ms   | max:  5.271 ms   |
  Xorg:3122             | 1073.107 ms |     2920 | avg:  0.030 ms   | max:  5.265 ms   |
  dbus-daemon:2063      |   64.593 ms |      665 | avg:  0.103 ms   | max:  4.730 ms   |
  :14040:14040          |   30.786 ms |      255 | avg:  0.095 ms   | max:  4.155 ms   |
  events/1:8            |    0.105 ms |       13 | avg:  0.598 ms   | max:  3.775 ms   |
  console-kit-dae:2080  |   14.867 ms |      152 | avg:  0.142 ms   | max:  3.760 ms   |
  gnome-settings-:14023 |  572.653 ms |      979 | avg:  0.056 ms   | max:  3.627 ms   |
  ...
 -------------------------------------------------------------------------------------
  TOTAL:                | 3144.817 ms |    11654 |
 -------------------------------------------------------------------------------------

The example above shows the statistics collected while GNOME was starting up. The columns mean the following:

Task: the process name and pid
Runtime: the actual running time
Switches: the number of process switches
Average delay: the average scheduling delay
Maximum delay: the maximum scheduling delay

The most noteworthy column here is Maximum delay: it shows the property that affects interactivity the most, scheduling delay. If the scheduling delay is large, the user will experience stuttering video or audio.

The other three subcommands provide different views and are generally used by scheduler developers or people interested in the scheduler's internal implementation.

First, map:

 N1 O1 . . . S1 . . . B0 . *I0 C1 . M1 .    23002.773423 secs
 N1 O1 . *Q0 . S1 . . . B0 . I0 C1 . M1 .   23002.773423 secs
 N1 O1 . Q0 . S1 . . . B0 . *R1 C1 . M1 .   23002.773485 secs
 N1 O1 . Q0 . S1 . *S0 . B0 . R1 C1 . M1 .  23002.773478 secs
 *L0 O1 . Q0 . S1 . S0 . B0 . R1 C1 . M1 .  23002.773523 secs
 L0 O1 . *. . S1 . S0 . B0 . R1 C1 . M1 .   23002.773531 secs
 L0 O1 . . . S1 . S0 . B0 . R1 C1 *T1 M1 .  23002.773547 secs
  T1 => irqbalance:2089
 L0 O1 . . . S1 . S0 . *P0 . R1 C1 T1 M1 .  23002.773549 secs
 *N1 O1 . . . S1 . S0 . P0 . R1 C1 T1 M1 .  23002.773566 secs
 N1 O1 . . . *J0 . S0 . P0 . R1 C1 T1 M1 .  23002.773571 secs
 N1 O1 . . . J0 . S0 *B0 P0 . R1 C1 T1 M1 . 23002.773592 secs
 N1 O1 . . . J0 . *U0 B0 P0 . R1 C1 T1 M1 . 23002.773582 secs
 N1 O1 . . . *S1 . U0 B0 P0 . R1 C1 T1 M1 . 23002.773604 secs

The asterisk indicates the CPU where the scheduling event occurs.

The dot indicates that the CPU is IDLE.

The advantage of map is that it provides an overview, summarizing hundreds of scheduling events and showing how the system's tasks are distributed among the CPUs. A bad scheduling migration, for example a task that is not moved to an idle CPU in time but instead migrated to another busy CPU, shows up at a glance in the map report. While map offers a high-level summary, trace provides the most detailed report, at the lowest level.

Perf sched replay is a tool designed specifically for scheduler developers: it attempts to replay the scheduling scenario recorded in the perf.data file. In many cases, an ordinary user who notices strange scheduler behavior cannot describe precisely the scenario in which it occurred, some test scenarios are hard to reproduce, or it is simply a matter of convenience. With perf sched replay, perf simulates the scenario stored in perf.data, so developers do not have to spend a lot of effort reproducing the past. This is especially helpful for debugging, which requires running the scenario over and over to check whether each new change improves the problems found in the original scheduling scenario.

The following is an example of a replay execution:

$ perf sched replay
run measurement overhead: 3771 nsecs
sleep measurement overhead: 66617 nsecs
the run test took 999708 nsecs
the sleep test took 1097207 nsecs
nr_run_events:        200221
nr_sleep_events:      200235
nr_wakeup_events:     100130
task      0 (           perf: 13519), nr_events: 148
task      1 (           perf: 13520), nr_events: 200037
task      2 ( pipe-test-100k: 13521), nr_events: 300090
task      3 (    ksoftirqd/0:     4), nr_events: 8
task      4 (        swapper:     0), nr_events: 170
task      5 (gnome-power-man:  3192), nr_events: 3
task      6 (gdm-simple-gree:  3234), nr_events: 3
task      7 (           Xorg:  3122), nr_events: 5
task      8 (hald-addon-stor:  2234), nr_events: 27
task      9 (          ata/0:  2234), nr_events: 29
task     10 (      scsi_eh_4:   704), nr_events: 37
task     11 (       events/1:     8), nr_events: 3
task     12 (       events/0:     7), nr_events: 6
task     13 (      flush-8:0:  6980), nr_events: 20
------------------------------------------------------------
#1  : 2038.157, ravg: 2038.16, cpu: 0.09 / 0.09
#2  : 2042.153, ravg: 2038.56, cpu: 0.11 / 0.09
^C

3.8 perf bench

Besides the scheduler, people often need to measure how their work affects system performance. Benchmarking is the standard way to measure performance, and having a recognized benchmark for a given goal is very helpful to the work of improving kernel performance.

So far, as far as I know, perf bench provides three benchmarks:

1. Sched messaging

Sched messaging is ported from the classic test program hackbench and measures the scheduler's performance, overhead, and scalability. The benchmark starts N pairs of reader/sender processes or threads that read and write concurrently through IPC (sockets or pipes); N is usually increased to measure the scheduler's scalability. Its usage is the same as that of hackbench. (How all three benchmarks are invoked is sketched after the third one below.)

2. Sched pipe

Sched pipe is ported from Ingo Molnar's pipe-test-1m.c. Ingo's original program was designed to test the performance and fairness of different schedulers. It works very simply: two processes frantically pass 1,000,000 integers back and forth through a pipe, process A to B and B to A. Because A and B depend on each other, if the scheduler is unfair and favors A over B, then A and B together will take longer to finish.

3. Mem memcpy

This benchmark, written by Hitoshi Mitake, the author of perf bench, exercises memcpy. It measures how long it takes memcpy() to copy 1 MB of data. I still do not understand the usage scenario for this benchmark; perhaps it is meant as an example of how to use the perf bench framework to develop more benchmarks.
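All three are invoked as subcommands of perf bench; a sketch of typical runs with default parameters (option support varies somewhat between perf versions):

$ perf bench sched messaging   # hackbench-style: groups of tasks exchanging messages over sockets
$ perf bench sched pipe        # two tasks ping-ponging integers through a pipe
$ perf bench mem memcpy        # time memcpy() on a fixed-size buffer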

These three benchmarks point to a possible future: people of different languages, skin colors, and backgrounds all using the same benchmarks, as long as they have a copy of the Linux kernel source.

3.9 perf lock

Locks are one of the kernel's synchronization mechanisms. Once a lock is held, other kernel execution paths that need the same lock must wait, which reduces parallelism. Analyzing lock usage specifically should therefore be an important part of tuning.
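A sketch of the basic workflow (it assumes root privileges and a kernel built with the lock events enabled, which many distribution kernels are not; ./prog is a placeholder for the workload):

$ perf lock record ./prog   # record lock events while ./prog runs
$ perf lock report          # per-lock contention statistics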

3.10 Perf kmem

Perf kmem collects events related to the kernel slab allocator, such as memory allocation and freeing. It can be used to study where a program allocates large amounts of memory, where fragmentation occurs, and other memory-management issues.

Perf kmem and perf lock are really special cases of perf's tracepoint support; you could achieve much the same with perf record -e kmem:* or perf record -e lock:*. What matters is that these tools summarize and analyze the raw data internally, producing statistical reports that are clearer and more useful.
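A sketch of both workflows side by side (option names can be confirmed with perf kmem --help on your system; ./prog is a placeholder):

$ perf kmem record ./prog         # record slab allocator events while ./prog runs
$ perf kmem stat --caller         # summarize the allocations by calling site
$ perf record -e 'kmem:*' ./prog  # the raw tracepoint equivalent mentioned above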

3.11 Perf timechart

Many perf commands are aimed at debugging a single program or serve a single purpose. Sometimes, however, a performance problem has no single cause and needs to be examined from several angles, which traditionally means juggling a variety of tools such as top, vmstat, oprofile, or perf. That is cumbersome.
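Perf timechart tries to give that whole-system view in a single picture; a sketch (the output file defaults to output.svg unless changed with -o):

$ perf timechart record sleep 10   # record system-wide activity for 10 seconds
$ perf timechart                   # turn perf.data into output.svg
# Open output.svg in a browser or image viewer to see per-process CPU activity and I/O waits.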

That is the whole of "How to use Perf under Linux". Thank you for reading! I hope the content shared here has given you a solid understanding and helps you in practice; if you want to learn more, you are welcome to follow the industry information channel.
