How to analyze the performance of Linux

Updated 2025-07-09. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to analyze the performance of a Linux system. Many readers may not be familiar with the topic, so the editor has summarized the following material in the hope that you find it useful.

You log in to a Linux server to investigate a performance issue: what do you check in the first minute?

At Netflix, we have many EC2 Linux machines and a number of performance analysis tools to monitor and investigate them. These include Atlas for cloud-wide monitoring and Vector for on-demand instance analysis. Although these tools help us solve most problems, we sometimes need to log in to an instance and run some standard Linux performance tools.

The first 60 seconds: summary

In this article, Netflix's performance engineering team shows you the first 60 seconds of a performance investigation at the command line, using standard Linux tools that should already be available. By running the following ten commands, you can get a high-level idea of system resource usage and running processes within 60 seconds. Look for errors and saturation metrics first, because both are easy to interpret, and then look at resource utilization. Saturation means that a resource has more load than it can handle; it shows up either as the length of a request queue or as time spent waiting.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Some of these commands require the sysstat package to be installed. The metrics they expose help you complete part of the USE method (Utilization, Saturation, Errors), a methodology for locating performance bottlenecks. It involves checking utilization, saturation, and error metrics for every resource (CPU, memory, disk, and so on). Also take note when you have checked a resource and ruled it out: eliminating a resource narrows the scope of the analysis and guides any follow-up investigation.

The following sections walk through these commands with examples from a production system. For more information about each tool, see its man page.
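The ten commands can also be collected in one pass. The following is only a sketch under some assumptions: a trailing count of 5 is added to the interval tools so that they exit on their own, top is run once in batch mode (`-b -n 1`), and `dmesg | tail` is emulated by keeping the last 10 lines in Python. The names `COMMANDS`, `run_command`, and `first_look` are made up for this example.

```python
import shlex
import subprocess

# Bounded variants of the ten commands (assumption: 5 samples per interval tool).
COMMANDS = [
    "uptime",
    "dmesg",
    "vmstat 1 5",
    "mpstat -P ALL 1 5",
    "pidstat 1 5",
    "iostat -xz 1 5",
    "free -m",
    "sar -n DEV 1 5",
    "sar -n TCP,ETCP 1 5",
    "top -b -n 1",
]


def run_command(command, timeout=30, tail=None):
    """Run one command and return its output, optionally only the last `tail` lines."""
    argv = shlex.split(command)
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        # mpstat, pidstat, iostat and sar come from the sysstat package.
        return f"{argv[0]}: not installed (is sysstat missing?)"
    except subprocess.TimeoutExpired:
        return f"{command}: timed out after {timeout}s"
    lines = (result.stdout or result.stderr).splitlines()
    if tail is not None:
        lines = lines[-tail:]
    return "\n".join(lines)


def first_look():
    """Print the output of every command in order."""
    for command in COMMANDS:
        tail = 10 if command == "dmesg" else None  # emulate `dmesg | tail`
        print(f"$ {command}")
        print(run_command(command, tail=tail))
        print()
```

Calling `first_look()` on a Linux host prints each command followed by its output. Nothing runs at import time, so the commands are only executed when you ask for them.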

1. uptime

$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux, these numbers include processes waiting to run on a CPU as well as processes blocked in uninterruptible I/O (usually disk I/O). This gives a high-level idea of resource load (or demand), but it cannot be properly understood without other tools. It is worth a quick look only.

The three numbers are exponentially damped moving averages with time constants of 1, 5, and 15 minutes. They show how load is changing over time. For example, if you have been asked to check a problem server and the 1-minute value is much lower than the 15-minute value, you may have logged in too late and missed the issue.

In the example above, the load averages show a recent increase: the 1-minute value is about 30, compared with about 19 for the 15-minute value. Numbers this large mean a lot of something: probably CPU demand. vmstat and mpstat, commands 3 and 4 in this sequence, will help confirm what it is.
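The comparison between the 1-minute and 15-minute values can be sketched in a few lines. This is an illustration only: the function names and the 10% tolerance are assumptions of this example, not something uptime provides.

```python
import re


def parse_load_averages(uptime_output):
    """Return the 1-, 5- and 15-minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+)[,\s]+([\d.]+)[,\s]+([\d.]+)", uptime_output
    )
    if match is None:
        raise ValueError("no load averages found")
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


def load_trend(one, fifteen, tolerance=0.1):
    """Classify the trend by comparing the 1-minute and 15-minute averages."""
    if one > fifteen * (1 + tolerance):
        return "rising"
    if one < fifteen * (1 - tolerance):
        return "falling"
    return "steady"


sample = "23:51:26 up 21:31, 1 user, load average: 30.02, 26.43, 19.02"
one, five, fifteen = parse_load_averages(sample)
print(load_trend(one, fifteen))
```

A "falling" result on a server reported as slow suggests the problem may already have passed by the time you logged in.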

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.

This shows the last 10 system messages, if there are any. Look for errors that can cause performance problems. The example above includes the OOM killer terminating a process and TCP dropping a request.

Don't skip this step: dmesg is always worth checking.
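Scanning kernel messages for the two kinds of error seen above can be sketched as follows. The pattern list is illustrative and far from complete, and `classify_kernel_messages` is a name invented for this example.

```python
import re

# Illustrative patterns only, taken from the messages shown in the example.
PATTERNS = {
    "oom": re.compile(r"oom-killer|Out of memory|Killed process", re.IGNORECASE),
    "syn_flood": re.compile(r"SYN flooding", re.IGNORECASE),
}


def classify_kernel_messages(lines):
    """Group kernel log lines by the first-look error category they match."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line)
    return hits


sample = [
    "[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0",
    "[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child",
    "[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB",
    "[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP counters.",
]
for category, matched in classify_kernel_messages(sample).items():
    print(category, len(matched))
```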

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd      free   buff  cache   si   so    bi    bo    in   cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5     6   10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284 4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0  9501 2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900 2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898 4840 98  1  1  0  0
^C

Short for virtual memory statistics, vmstat is a commonly available tool (originally created for BSD decades ago). It prints a summary of key server statistics on each line.

When vmstat is run with an argument of 1, it prints a summary every second. In this version of vmstat, the first line of output shows averages since boot rather than statistics for the previous second. For now, skip the first line, unless you want to learn which column is which.

Columns to check:

r: the number of processes running on a CPU or waiting for a turn. This is a better signal of CPU saturation than the load averages because it does not include I/O. To interpret it: an "r" value greater than the CPU count indicates saturation.

free: free memory in kilobytes. If this number is large, you have plenty of free memory. The "free -m" command, the seventh command below, explains the state of free memory in more detail.

si, so: swap-ins and swap-outs. If these are non-zero, the system is out of memory.

us, sy, id, wa, st: the breakdown of CPU time, averaged across all CPUs. They are user time, system (kernel) time, idle time, I/O wait time, and stolen time (taken by other guests or, with Xen, by the guest's own isolated driver domain).

The CPU time breakdown shows whether the CPUs are busy (add user time and system time). A constant level of I/O wait points to a disk bottleneck. In that case the CPUs are relatively idle because tasks are blocked waiting for disk I/O. You can treat I/O wait as another form of CPU idle, one that gives a clue as to why the CPUs are idle.

System time is necessary for I/O processing. A high average system time, over 20% for example, is worth exploring further: the kernel may be processing the I/O inefficiently.

In the example above, CPU time is almost entirely at user level, which points to application-level usage. The CPUs are also well over 90% utilized on average. This is not necessarily a problem; use the "r" column to check the degree of saturation.
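The checks described for vmstat can be written down as a small rule set. This is a sketch: the finding labels are invented for this example, and the thresholds are the ones named in the text (r greater than the CPU count, any swapping, user plus system above 90%, system above 20%).

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat_row(row):
    """Turn one vmstat data line into a dict keyed by column name."""
    values = [int(value) for value in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError(f"expected {len(VMSTAT_COLUMNS)} columns, got {len(values)}")
    return dict(zip(VMSTAT_COLUMNS, values))


def vmstat_findings(sample, cpu_count):
    """Apply the first-look rules to one parsed vmstat sample."""
    findings = []
    if sample["r"] > cpu_count:
        findings.append("cpu_saturated")      # more runnable tasks than CPUs
    if sample["si"] or sample["so"]:
        findings.append("swapping")           # any swap activity: out of memory
    if sample["us"] + sample["sy"] > 90:
        findings.append("cpu_busy")           # user + system time
    if sample["sy"] > 20:
        findings.append("high_system_time")   # kernel time worth investigating
    return findings


row = "34 0 0 200889792 73708 591828 0 0 0 5 6 10 96 1 3 0 0"
print(vmstat_findings(parse_vmstat_row(row), cpu_count=32))
```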

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00    0.00    0.00    0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00    0.00    0.00    0.00   0.99
07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   2.00
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00    0.00    0.00    0.00   3.03
[...]

This command prints the CPU time breakdown for each CPU, which lets you check whether load is balanced across CPUs. A single CPU with much higher utilization than the others can be evidence of a single-threaded application.
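One way to look for such an imbalance is to compare each CPU's busy percentage with the median of all CPUs. The sketch below does that; the 50-point margin and the name `hot_cpus` are arbitrary choices for this example, and the first sample is hypothetical.

```python
from statistics import median


def hot_cpus(idle_by_cpu, margin=50.0):
    """Return CPUs whose busy % exceeds the median busy % by at least `margin` points."""
    busy = {cpu: 100.0 - idle for cpu, idle in idle_by_cpu.items()}
    typical = median(busy.values())
    return sorted(cpu for cpu, value in busy.items() if value - typical >= margin)


# Hypothetical %idle values: CPU 0 is busy while the others are mostly idle.
print(hot_cpus({0: 1.0, 1: 95.0, 2: 97.0, 3: 96.0}))

# %idle for CPUs 0-3 from the mpstat example: evenly busy, so nothing stands out.
print(hot_cpus({0: 0.99, 1: 2.00, 2: 1.00, 3: 3.03}))
```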

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat

07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
07:41:04 PM   108      6718    1.00    0.00    0.00    1.00 [...]  snmp-pass
07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
^C

pidstat is a little like top's per-process summary, but it prints a rolling summary instead of clearing the screen on every update. This is useful for watching patterns over time, and for recording what you saw (copy and paste) in your investigation notes.

The example above identifies two java processes as responsible for consuming CPU. The %CPU column is the total across all CPUs; 1591% means that this java process is consuming almost 16 CPUs.
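The conversion from pidstat's %CPU to a CPU count is simple arithmetic, shown here as a sketch. The helper names are invented for this example; the 32-CPU figure comes from the header of the sample output.

```python
def cores_used(percent_cpu):
    """Convert pidstat's %CPU (summed over all CPUs) into a number of CPUs."""
    return percent_cpu / 100.0


def machine_share(percent_cpu, cpu_count):
    """Fraction of the whole machine's CPU capacity that the process consumes."""
    return cores_used(percent_cpu) / cpu_count


print(round(cores_used(1591.0), 2))         # CPUs used by the first java process
print(round(machine_share(1591.0, 32), 3))  # share of a 32-CPU instance
```

Together, the two java processes in the example (1591% and 1583%) account for nearly all 32 CPUs, which agrees with the utilization reported by vmstat and mpstat.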

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb       [...]    [...]    1.02    8.94   127.97   598.53    [...]     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23    5.62   1.78   0.03
[...]
^C

This is a great tool for understanding block devices (disks): it shows both the workload applied and the resulting performance. Look for:

r/s, w/s, rkB/s, wkB/s: the number of reads and writes per second, and the kilobytes read and written per second, on the device. Use these to characterize the workload. A performance problem may simply be due to excessive load being applied.

await: the average time for an I/O in milliseconds. This is the time the application experiences, since it includes both time spent queued and time being serviced. Larger than expected average times can indicate device saturation or a device problem.

avgqu-sz: the average number of requests issued to the device. A value greater than 1 can be evidence of saturation (although devices can usually handle requests in parallel, especially virtual devices that front multiple back-end disks).

%util: device utilization. This is really a busy percentage, showing the share of each second during which the device was doing work. Values greater than 60% typically lead to poor performance (which should show up in await), although this depends on the device. Values close to 100% usually indicate saturation.

If the storage device is a logical disk device fronting many back-end disks, 100% utilization may only mean that some I/O is being processed 100% of the time; the back-end disks may be far from saturated and able to handle much more work.

Keep in mind that poorly performing disk I/O is not necessarily an application problem. Many techniques are commonly used to perform I/O asynchronously, so that the application does not block and suffer the latency directly (for example, read-ahead for reads and buffering for writes).
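The rules of thumb for avgqu-sz and %util can be expressed as a small check. This is a sketch: the 60% threshold and the queue length of 1 come from the text, while the 95% cut-off for "close to 100%" and the finding labels are assumptions made for this example.

```python
def iostat_findings(avgqu_sz, util_pct, saturation_pct=95.0):
    """Flag a device row from `iostat -xz` using first-look rules of thumb."""
    findings = []
    if avgqu_sz > 1:
        findings.append("queue_above_1")         # possible saturation
    if util_pct >= saturation_pct:
        findings.append("util_near_saturation")  # close to 100% busy
    elif util_pct > 60:
        findings.append("util_high")             # await is likely to suffer
    return findings


# xvda from the example: almost idle, so nothing is flagged.
print(iostat_findings(avgqu_sz=0.00, util_pct=0.09))

# A hypothetical overloaded device.
print(iostat_findings(avgqu_sz=2.5, util_pct=98.0))
```

For a logical device that fronts several disks, treat these flags as hints rather than conclusions, for the reason given above.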

7. free -m

$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0

The right two columns show:

buffers: the buffer cache, used for block device I/O.

cached: the page cache, used by file systems.

We just want to check that these are not close to zero in size. Near-zero values can lead to higher disk I/O (confirm with iostat) and worse performance. The example above looks fine, with many megabytes in each.

The "-/+ buffers/cache" line gives less confusing values for used and free memory. Linux uses free memory for caches, but can reclaim it quickly if applications need it. So in a sense the cached memory should be counted as free memory, which is what this line does. There is even a website dedicated to this confusion: linuxatemyram.

It can be even more confusing if you use ZFS on Linux, as we do for some services, because ZFS has its own file system cache that is not properly reflected in the "free -m" columns. The system can appear to be low on free memory when that memory is in fact available for use from the ZFS cache as needed.
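The arithmetic behind the "-/+ buffers/cache" line can be checked directly. This is a sketch using the megabyte values from the example; the function name is invented, and the ZFS cache is not accounted for here.

```python
def free_including_caches(free_mb, buffers_mb, cached_mb):
    """Memory applications could obtain: free plus reclaimable buffers and page cache."""
    return free_mb + buffers_mb + cached_mb


# Values from the "Mem:" line of the example, in megabytes.
print(free_including_caches(free_mb=221453, buffers_mb=59, cached_mb=541))
```

The result matches the free column of the "-/+ buffers/cache" line in the example output (222053).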

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  [...]
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
^C

Use this tool to check network interface throughput, rxkB/s and txkB/s, as a measure of workload, and to check whether any limit has been reached. In the example above, eth0 receive is reaching 22 Mbytes/s, which is 176 Mbits/s (well under, say, a 1 Gbit/s limit).

This version also has %ifutil for device utilization (the maximum of both directions for full duplex), which we also measure with Brendan's nicstat tool. As with nicstat, this value is hard to get right, and it does not seem to be working in this example (0.00).
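Since %ifutil may not be reliable, the utilization can be estimated from rxkB/s by hand. The sketch below repeats the article's arithmetic; it treats 1 kB as 1000 bytes to match the rounded figures in the text (an assumption, exposed as a parameter), and the 1 Gbit/s link speed is the illustrative limit mentioned above.

```python
def to_mbits_per_sec(kbytes_per_sec, bytes_per_kb=1000):
    """Convert a kB/s figure from sar into Mbit/s."""
    return kbytes_per_sec * bytes_per_kb * 8 / 1_000_000


def link_utilization(kbytes_per_sec, link_bits_per_sec=1_000_000_000, bytes_per_kb=1000):
    """Fraction of the link speed used in one direction."""
    return kbytes_per_sec * bytes_per_kb * 8 / link_bits_per_sec


rx_kb = 21999.10  # eth0 rxkB/s from the example
print(round(to_mbits_per_sec(rx_kb)))     # roughly 176 Mbit/s
print(round(link_utilization(rx_kb), 3))  # fraction of an assumed 1 Gbit/s link
```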

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_  (32 CPU)

12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00     [...]

12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00

12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      1.00      0.00   8359.00   6039.00

12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
^C

This is a summarized view of some key TCP metrics. These include:

active/s: the number of locally initiated TCP connections per second (for example, via connect()).

passive/s: the number of remotely initiated TCP connections per second (for example, via accept()).

retrans/s: the number of TCP retransmits per second.

The active and passive counts are often useful as a rough measure of server load: the number of new accepted connections (passive) and the number of downstream connections (active). It can help to think of active as outbound and passive as inbound, but this is not strictly true (consider, for example, a localhost-to-localhost connection).

Retransmits are a sign of a network or server problem: the network may be unreliable (for example, the public Internet), or the server may be overloaded and dropping packets. The example above shows just one new TCP connection per second.

10. top

$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2  29812:[...] java
  4213 root      20   0 2722544  64640  44232 S  23.5  0.0  233:[...] [...]
 [...]                                        R   1.0  0.0   0:00.07 top
  5235 root      20   0 38.227g 547004  49996 S [...]  0.2   2:02.74 java
  4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1  33:14.42 java
     1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
     3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
     8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched

The top command includes many of the metrics we checked earlier. It can be handy to run it to see whether anything looks wildly different from the earlier commands, which would indicate that the load is variable.

A downside of top is that it is harder to see patterns over time; these may be clearer in tools like vmstat and pidstat, which provide rolling output. Evidence of intermittent issues can also be lost if you do not pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue) before the screen clears.

Follow-on Analysis

There are many more commands and methodologies you can apply to dig deeper. See Brendan's 2015 tutorial on Linux performance tools, which works through more than 40 commands covering observability, benchmarking, tuning, static performance tuning, profiling, and tracing.
