[Linux] Detailed explanation of the perf tool

Updated 2025-07-03. Source: SLTechnology News&Howtos (Shulou.com)

Original address: http://blog.csdn.net/zhangskd/article/details/37902159

Starting with version 2.6.31, the Linux kernel ships with a performance analysis tool, perf, which can locate hotspots at both the function level and the instruction level.

Perf

Performance analysis tools for Linux.

Performance counters for Linux are a new kernel-based subsystem that provide a framework for all things performance analysis. It covers hardware level (CPU/PMU, Performance Monitoring Unit) features and software features (software counters, tracepoints) as well.

Perf is a performance profiling tool built into the Linux kernel source tree.

It works by sampling performance events, and supports analysis of both processor-related and operating-system-related performance metrics.

It is often used to find performance bottlenecks and locate hot code.

The CPU cycle (cpu-cycles) is the default performance event. A CPU cycle, also known as a clock tick, is the smallest unit of time the CPU recognizes: one period of the processor clock, typically around a nanosecond or less on modern processors. It is roughly the time the CPU needs to execute the simplest instructions, such as reading the contents of a register.
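To give a sense of scale, the length of one cycle is simply the reciprocal of the clock frequency. The sketch below uses a hypothetical helper, cycle_ns, which is not part of perf; the 2.486 GHz input is only an example value, taken from the perf stat sample later in this article.

```shell
# cycle_ns GHZ: print the duration of one clock cycle in nanoseconds.
# Hypothetical helper for illustration only; it is not part of perf.
cycle_ns() {
    awk -v ghz="$1" 'BEGIN { printf "%.3f\n", 1 / ghz }'
}

cycle_ns 2.486   # one cycle at 2.486 GHz, about 0.402 ns
```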

Perf is a toolset with 22 subtools. Here are the 5 most commonly used tools:

Perf-list

Perf-stat

Perf-top

Perf-record

Perf-report

Perf-list

Perf-list is used to view the performance events supported by perf, both software and hardware.

List all symbolic event types.

perf list [hw | sw | cache | tracepoint | event_glob]

(1) Distribution of performance events

hw: Hardware event, 9 in total

sw: Software event, 9 in total

cache: Hardware cache event, 26 in total

tracepoint: Tracepoint event, 775 in total

sw events are kernel software counters and have nothing to do with hardware.

hw and cache events are tied to the CPU architecture and depend on the specific hardware.

tracepoint events are built on the kernel's ftrace infrastructure and are only supported by mainline kernels 2.6.3x and later.
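The per-category counts above depend on the kernel and the CPU, so they will differ from machine to machine. As a rough sketch, they can be recomputed by tallying the bracketed type tag that perf list prints at the end of each event line. The helper name count_by_type and the sample lines are made up for illustration, so that the sketch runs even without perf installed.

```shell
# count_by_type: read `perf list` output on stdin and count events per type.
# Hypothetical helper; it relies on the "[... event]" tag that perf list
# prints at the end of each event line.
count_by_type() {
    awk 'match($0, /\[[A-Za-z ]+ event\]/) {
             n[substr($0, RSTART + 1, RLENGTH - 2)]++
         }
         END { for (t in n) print n[t], t }' | LC_ALL=C sort -k2
}

# Made-up sample in the same layout as perf list output.
sample_counts=$(count_by_type <<'EOF'
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cpu-clock                                          [Software event]
  L1-dcache-loads                                    [Hardware cache event]
  kmem:kmem_cache_alloc                              [Tracepoint event]
EOF
)
echo "$sample_counts"

# On a machine with perf installed: perf list 2>/dev/null | count_by_type
```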

(2) Specify a performance event (with its modifiers)

-e <event>:u // userspace

-e <event>:k // kernel

-e <event>:h // hypervisor

-e <event>:G // guest counting (in KVM guests)

-e <event>:H // host counting (not in KVM guests)
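A modifier is appended to the event name after a colon, and several modifiers can be combined. The following is a minimal sketch of building such an event string; event_with_modifiers is a hypothetical helper, and it deliberately accepts only the five modifiers listed above, although perf itself understands more.

```shell
# event_with_modifiers EVENT MODS: print "EVENT:MODS" for use with -e.
# Hypothetical helper; only the modifiers u, k, h, G and H are accepted.
event_with_modifiers() {
    event=$1 mods=$2
    case $mods in
        ''|*[!ukhGH]*)
            echo "unsupported modifier: '$mods'" >&2
            return 1 ;;
    esac
    printf '%s:%s\n' "$event" "$mods"
}

event_with_modifiers cycles k    # cycles:k, kernel-only cycles
event_with_modifiers cycles uk   # cycles:uk, user and kernel cycles
```

The resulting string can be passed directly to a command such as perf top -e.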

(3) Usage examples

Display the functions in the kernel and modules that consume the most CPU cycles:

# perf top -e cycles:k

Display the functions that allocate the most slab cache objects:

# perf top -e kmem:kmem_cache_alloc

Perf-top

For a specified performance event (CPU cycles by default), displays the functions or instructions that consume the most.

System profiling tool.

Generates and displays a performance counter profile in real time.

perf top [-e <EVENT> | --event=EVENT] [<options>]

perf top analyzes in real time how hot each function is for a given performance event, and can quickly locate hot functions, including application functions, module functions, and kernel functions. It can even locate hot instructions. The default performance event is cpu-cycles.

(1) Output format

# perf top

Samples: 1M of event 'cycles', Event count (approx.): 73891391490
  5.44%  perf          [.] 0x0000000000023256
  4.86%  [kernel]      [k] _spin_lock
  2.43%  [kernel]      [k] _spin_lock_bh
  2.29%  [kernel]      [k] _spin_lock_irqsave
  1.77%  [kernel]      [k] __d_lookup
  1.55%  libc-2.12.so  [.] __strcmp_sse42
  1.43%  nginx         [.] ngx_vslprintf
  1.37%  [kernel]      [k] tcp_poll

The first column: the percentage of samples of the performance event attributed to the symbol; by default, its share of CPU cycles.

The second column: the DSO (Dynamic Shared Object) that contains the symbol, which can be an application, the kernel, a dynamic link library, or a module.

The third column: the type of the DSO. [.] means the symbol belongs to a user-mode ELF file, either an executable or a dynamic link library. [k] means the symbol belongs to the kernel or a module.

The fourth column: the symbol name. Some symbols cannot be resolved to function names and are shown as addresses.
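The four columns can be pulled apart mechanically, which is handy when post-processing saved output. The sketch below uses a hypothetical helper, parse_top_line, and handles only the two DSO markers described above ([.] and [k]); anything other than [k] is reported as user space.

```shell
# parse_top_line: split perf top output lines (on stdin) into the four
# columns described above. Hypothetical helper for illustration.
parse_top_line() {
    awk '{
        space = ($3 == "[k]") ? "kernel" : "user"
        sym = $4
        for (i = 5; i <= NF; i++) sym = sym " " $i   # symbols may contain spaces
        printf "overhead=%s dso=%s space=%s symbol=%s\n", $1, $2, space, sym
    }'
}

echo '4.86%  [kernel]      [k] _spin_lock'    | parse_top_line
echo '1.43%  nginx         [.] ngx_vslprintf' | parse_top_line
```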

(2) Commonly used interactive commands

h: show help.

UP/DOWN/PGUP/PGDN/SPACE: move up and down and page through the list.

a: annotate the current symbol. Shows the annotated assembly together with the percentage of samples for each instruction.

d: filter out all symbols that do not belong to the current DSO. Convenient for viewing symbols of the same category.

P: save the current information to perf.hist.N.

(3) Common command-line parameters

-e <event>: the performance event to analyze.

-p <pid>: profile events on existing process IDs (comma-separated list). Analyzes only the target processes and the threads they create.

-k <path>: path to vmlinux, the kernel image with a symbol table. Required for annotation.

-K: hide symbols that belong to the kernel or modules.

-U: hide symbols that belong to user-mode programs.

-d <n>: the refresh interval of the display. The default is 2 s, because by default perf top reads performance data from the mmap memory area every 2 s.

-G: collect the call graph of functions.

perf top -G [fractal]: path probabilities are relative and add up to 100%; the call order is from bottom to top.

perf top -G graph: path probabilities are absolute and add up to the heat of the function.
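The difference between the two call-graph modes is only a change of scale. As a worked example with made-up numbers: if a function has 20% heat and is reached through two call paths that fractal mode shows as 75% and 25%, graph mode would show those paths as 15% and 5%, which add up to the function's 20%. The helper name to_absolute is hypothetical.

```shell
# to_absolute HEAT RELATIVE: convert a relative (fractal) path share into
# an absolute (graph) share. Both arguments and the result are percentages.
to_absolute() {
    awk -v heat="$1" -v rel="$2" 'BEGIN { printf "%.2f\n", heat * rel / 100 }'
}

# Hypothetical numbers: a function with 20% heat, reached via two call paths.
to_absolute 20 75   # path A: 75% relative, 15.00% absolute
to_absolute 20 25   # path B: 25% relative, 5.00% absolute
```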

(4) Usage examples

# perf top // default configuration

# perf top -G // collect call graphs

# perf top -e cycles // specify the performance event

# perf top -p 23015,32476 // view the CPU cycle usage of these two processes

# perf top -s comm,pid,symbol // show the process name and PID that called each symbol

# perf top --comms nginx,top // show only symbols that belong to the specified processes

# perf top --symbols kfree // show only the specified symbol

Perf-stat

Used to analyze the performance profile of the specified program.

Run a command and gather performance counter statistics.

perf stat [-e <EVENT> | --event=EVENT] [-a] <command>

perf stat [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>]

(1) Output format

# perf stat ls

Performance counter stats for 'ls':

      0.653782 task-clock               #    0.691 CPUs utilized
             0 context-switches         #    0.000 K/sec
             0 CPU-migrations           #    0.000 K/sec
           247 page-faults              #    0.378 M/sec
       1625426 cycles                   #    2.486 GHz
       1050293 stalled-cycles-frontend  #   64.62% frontend cycles idle
        838781 stalled-cycles-backend   #   51.60% backend cycles idle
       1055735 instructions             #    0.65  insns per cycle
                                        #    0.99  stalled cycles per insn
        210587 branches                 #  322.106 M/sec
         10809 branch-misses            #    5.13% of all branches

   0.000945883 seconds time elapsed

The output includes the execution time of ls, as well as statistics for 10 performance events.

task-clock: the processor time actually used by the task, in ms. CPUs utilized = task-clock / time elapsed, i.e. the CPU utilization.

context-switches: the number of context switches.

CPU-migrations: the number of processor migrations. To keep the load balanced across processors, Linux migrates a task from one CPU to another under certain conditions.

page-faults: the number of page faults. A page fault is triggered when the page requested by the application has not been created yet, when the requested page is not in memory, or when the page is in memory but the mapping between its physical and virtual addresses has not been established. TLB misses (on some architectures) and page access permission mismatches can also trigger a page fault.

cycles: the number of processor cycles consumed. If the CPU cycles used by ls are treated as coming from a single processor, that processor's clock frequency is 2.486 GHz, computed as cycles / task-clock.

stalled-cycles-frontend: cycles in which the pipeline front end was idle; not covered in detail here.

stalled-cycles-backend: cycles in which the pipeline back end was idle; not covered in detail here.

instructions: the number of instructions executed. IPC (instructions per cycle) is the average number of instructions executed per CPU cycle.

branches: the number of branch instructions encountered. branch-misses is the number of branch instructions that were mispredicted.
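The derived figures in the right-hand column of the perf stat output follow directly from the raw counters. The sketch below recomputes four of them from the numbers in the ls sample; the variable names are chosen here for illustration, and the elapsed time is written in milliseconds so that it shares a unit with task-clock.

```shell
# Raw counter values copied from the `perf stat ls` sample output.
task_clock_ms=0.653782
elapsed_ms=0.945883        # 0.000945883 seconds time elapsed
cycles=1625426
instructions=1055735
branches=210587
branch_misses=10809

# CPUs utilized = task-clock / time elapsed
cpus_utilized=$(awk -v t="$task_clock_ms" -v e="$elapsed_ms" 'BEGIN { printf "%.3f", t / e }')
# Clock rate in GHz = cycles per nanosecond of task-clock
ghz=$(awk -v c="$cycles" -v t="$task_clock_ms" 'BEGIN { printf "%.3f", c / (t * 1000000) }')
# IPC = instructions / cycles
ipc=$(awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "%.2f", i / c }')
# Branch miss rate = branch-misses / branches, as a percentage
miss_pct=$(awk -v m="$branch_misses" -v b="$branches" 'BEGIN { printf "%.2f", 100 * m / b }')

echo "CPUs utilized: $cpus_utilized"
echo "GHz:           $ghz"
echo "IPC:           $ipc"
echo "branch misses: $miss_pct%"
```

The results match the values perf printed for this run: 0.691 CPUs utilized, 2.486 GHz, 0.65 instructions per cycle, and 5.13% of all branches mispredicted.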

(2) Common parameters

-p <pid>: stat events on existing process IDs (comma-separated list). Analyzes only the target processes and the threads they create.

-a: system-wide collection from all CPUs.

-r <n>: repeat the command and print average + stddev (max: 100).

-C <cpu>: count only on the list of CPUs provided (comma-separated list).

-v: be more verbose (show counter open errors, etc.).

-n: null run; do not start any counters, and show only the execution time of the task.

-x SEP: specifies the delimiter between output columns.

-o file: specifies the output file; --append selects append mode.

--pre <cmd>: a program to run before the target program.

--post <cmd>: a program to run after the target program.

(3) Usage examples

Run the program 10 times and report the mean together with the standard deviation relative to the mean:

# perf stat -r 10 ls > /dev/null

Display more detailed information:

# perf stat -v ls > /dev/null

Display only the task execution time, not the performance counters:

# perf stat -n ls > /dev/null

Report the counts for each CPU separately:

# perf stat -a -A ls > /dev/null

Count how many system calls the ls command made (on newer kernels this tracepoint is named raw_syscalls:sys_enter):

# perf stat -e syscalls:sys_enter ls

Perf-record

Collect sampling information and record it in a data file.

The data file can then be analyzed by other tools (perf-report), and the result is similar to that of perf-top.

Run a command and record its profile into perf.data.

This command runs a command and gathers a performance counter profile from it, into perf.data, without displaying anything. This file can then be inspected later on, using perf report.

(1) Common parameters

-e: select the PMU event.

-a: system-wide collection from all CPUs.

-p: record events on existing process IDs (comma-separated list).

-A: append to the output file to do incremental profiling.

-f: overwrite the existing data file.

-o: output file name.

-g: do call-graph (stack chain/backtrace) recording.

-C: collect samples only on the list of CPUs provided.

(2) Usage examples

Record the performance data of the nginx processes:

# perf record -p `pgrep -d ',' nginx`

Record the performance data of ls, including call graphs:

# perf record -g ls

Record the system calls made while ls runs, to see which system calls are the most frequent:

# perf record -e syscalls:sys_enter ls
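The -p option expects a comma-separated list of PIDs, which is why the nginx example asks pgrep to join its matches with a comma delimiter. The same list can be built from any source that prints one PID per line. This is a small sketch with a hypothetical helper, join_pids, and made-up PIDs.

```shell
# join_pids: read one PID per line on stdin and print them comma-separated,
# which is the list format that -p accepts. Hypothetical helper.
join_pids() {
    paste -s -d, -
}

printf '%s\n' 23015 32476 | join_pids   # 23015,32476
```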

Perf-report

Read the data file created by perf record and give the results of hot spot analysis.

Read perf.data (created by perf record) and display the profile.

This command displays the performance counter profile information recorded via perf record.

(1) Common parameters

-i: input file name (default: perf.data).

(2) Usage examples

# perf report -i perf.data.2
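perf record and perf report are normally used as a pair: record writes the samples to a file, and report reads that file back. The sketch below strings the two steps together using only options described in this article. The run and workflow helpers and the DRY_RUN switch are conventions invented for this example; by default the script only prints the commands, so it is safe to run on a machine without perf.

```shell
# Record-then-report workflow. DRY_RUN=1 (the default here) only prints the
# commands; set DRY_RUN=0 on a machine with perf installed to execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command, and execute it unless in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = 0 ]; then
        "$@"
    fi
}

data_file=perf.data.2   # example output file name

workflow() {
    run perf record -g -o "$data_file" ls    # sample ls with call graphs
    run perf report -i "$data_file"          # analyze the recorded samples
}

workflow
```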

In addition to the above five common tools, there are also some tools suitable for more special scenarios, such as kernel locks, slab allocators, and schedulers.

Custom probe points are also supported.

Perf-lock

Performance analysis of kernel locks.

Analyze lock events.

perf lock {record | report | script | info}

Requires the kernel to be built with the options CONFIG_LOCKDEP and CONFIG_LOCK_STAT.

CONFIG_LOCKDEP defines the acquire and release events.

CONFIG_LOCK_STAT defines the contended and acquired events.

(1) Common options

-i: input file

-k: sort key; the default is acquired, and contended, wait_total, wait_max, and wait_min are also available.

(2) Usage examples

# perf lock record ls // record

# perf lock report // report

(3) Output format

               Name  acquired  contended  total wait (ns)  max wait (ns)  min wait (ns)

&mm->page_table_...       382          0                0              0              0
&mm->page_table_...        72          0                0              0              0
           mm->lock        64          0                0              0              0
        dcache_lock        62          0                0              0              0
      vfsmount_lock        43          0                0              0              0
&newf->file_lock...        41          0                0              0              0

Name: name of the kernel lock.

acquired: the number of times the lock was acquired directly, without waiting, because no other kernel path held it.

contended: the number of times the lock was acquired only after waiting, because another kernel path held it.

Total wait: the total wait time to acquire the lock.

Max wait: the maximum wait time to acquire the lock.

Min wait: the minimum wait time to acquire the lock.

Finally, there is a Summary:

=== output for debug===

bad: 10, total: 246
bad rate: 4.065041 %
histogram of events caused bad sequence
    acquire: 0
   acquired: 0
  contended: 0
    release: 10

Perf-kmem

Performance analysis of slab allocator.

Tool to trace/measure kernel memory (slab) properties.

perf kmem {record | stat} [<options>]

(1) Common options

-i: input file

--caller: show per-callsite statistics, i.e. where in the kernel kmalloc and kfree are called.

--alloc: show per-allocation statistics, i.e. the addresses of the allocated memory.

-l <num>: print only num lines.

-s <key[,key2...]>: sort the output (default: frag,hit,bytes)

(2) Usage examples

# perf kmem record ls // record

# perf kmem stat --caller --alloc -l 20 // report

(3) Output format

-----------------------------------------------------------------------------------
 Callsite            | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag
-----------------------------------------------------------------------------------
 perf_event_mmap+ec  |     311296/8192 |   155952/4104 |  38 |         0 | 49.902%
 proc_reg_open+41    |           64/64 |         40/40 |   1 |         0 | 37.500%
 __kmalloc_node+4d   |       1024/1024 |       664/664 |   1 |         0 | 35.156%
 ext3_readdir+5bd    |           64/64 |         48/48 |   1 |         0 | 25.000%
 load_elf_binary+8ec |         512/512 |       392/392 |   1 |         0 | 23.438%

Callsite: the place in kernel code where kmalloc and kfree are called.

Total_alloc/Per: the total amount of memory allocated, and the average amount allocated per call.

Total_req/Per: the total amount of memory requested, and the average amount requested per call.

Hit: the number of calls.

Ping-pong: the number of times kmalloc and kfree were not executed on the same CPU, which hurts cache efficiency.

Frag: the fragmentation percentage. The fragment is the allocated memory minus the requested memory, which is wasted; Frag expresses it as a fraction of the allocated memory.

If you use the --alloc option, you will also see Alloc Ptr, the address of the allocated memory.
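The Frag column can be checked by hand from the allocated and requested totals of a call site. The sketch below applies the definition above, (allocated - requested) / allocated, to three rows of the sample output; frag_pct is a hypothetical helper name.

```shell
# frag_pct ALLOCATED REQUESTED: percentage of the allocated bytes that was
# never requested, printed with three decimals. Hypothetical helper.
frag_pct() {
    awk -v alloc="$1" -v req="$2" 'BEGIN { printf "%.3f\n", (alloc - req) * 100 / alloc }'
}

frag_pct 311296 155952   # perf_event_mmap+ec
frag_pct 64 40           # proc_reg_open+41
frag_pct 64 48           # ext3_readdir+5bd
```

These reproduce the 49.902%, 37.500% and 25.000% shown for those call sites.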

Finally, there is a Summary:

SUMMARY
=======
Total bytes requested: 290544
Total bytes allocated: 447016
Total bytes wasted on internal fragmentation: 156472
Internal fragmentation: 35.003669%
Cross CPU allocations: 2/509

Perf-sched

Scheduling module analysis.

Trace/measure scheduler properties (latencies)

perf sched {record | latency | map | replay | script}

(1) Usage examples

# perf sched record sleep 10 // record

# perf sched latency --sort max // report

(2) output format

TASK: process name and pid.

Runtime: the time the process actually spent running.

Switches: the number of times the process has switched.

Average delay: average scheduling delay.

Maximum delay: maximum scheduling delay.

Maximum delay at: the time when the maximum scheduling delay occurs.

Perf-probe

Probe points can be customized.

Define new dynamic tracepoints.

Usage examples

(1) Display which lines in schedule() can be probed

# perf probe --line schedule

Lines prefixed with a line number can be probed; lines without one cannot.

(2) Add a probe at line 12 of the schedule() function

# perf probe -a schedule:12

