How to diagnose the performance of Linux

2025-07-06 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/02 Report--

This article explains in detail how to diagnose performance problems on Linux.

When you find a performance problem on a Linux server, which system metrics do you look at in the first minute?

Netflix has a large EC2 cluster on AWS, as well as a variety of performance analysis and monitoring tools. For example, we use Atlas to monitor the entire platform and Vector to analyze the performance of EC2 instances in real time. These tools help us solve most problems, but sometimes we still need to log in to an instance and run standard Linux performance tools to locate the problem.

In this article, the Netflix performance engineering team introduces the standard Linux command-line tools we use to investigate a problem in the first 60 seconds. In those 60 seconds, the following ten commands give a high-level picture of system resource usage and of the running processes. We look at error and saturation metrics first, since both are easy to interpret, and then at resource utilization. Saturation means that the load on a resource (CPU, memory, disk) is more than it can handle; it shows up as a growing request queue or as longer wait times.

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top

Some of these commands require the sysstat package. The metrics they expose help you work through the USE method, a methodology for locating performance bottlenecks: for every resource (CPU, memory, disk, and so on), check utilization, saturation, and errors. Along the way, keep track of the resources you have checked and ruled out; each exclusion narrows the search and gives the next step a clearer direction.
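Because several of these tools ship in the sysstat package, it can help to check which ones are installed before an incident. The following is a minimal sketch, not part of the original checklist: the names FIRST_60_SECONDS and missing_tools are hypothetical, and the check only looks for each executable on PATH.

```python
import shutil

# The ten commands from the checklist, in the order they are run.
# mpstat, pidstat, iostat and sar are provided by the sysstat package.
FIRST_60_SECONDS = [
    "uptime",
    "dmesg | tail",
    "vmstat 1",
    "mpstat -P ALL 1",
    "pidstat 1",
    "iostat -xz 1",
    "free -m",
    "sar -n DEV 1",
    "sar -n TCP,ETCP 1",
    "top",
]


def missing_tools(commands, which=shutil.which):
    """Return the executables named in `commands` that are not on PATH.

    `which` is injectable so the function can be exercised without
    depending on what is installed on the current machine.
    """
    missing = []
    for command in commands:
        binary = command.split()[0]  # first word is the executable name
        if which(binary) is None and binary not in missing:
            missing.append(binary)
    return missing


# An empty list means every tool in the checklist is available.
print(missing_tools(FIRST_60_SECONDS))
```

If the list is not empty, installing sysstat through the distribution's package manager is usually the fix for mpstat, pidstat, iostat, and sar.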

The following sections describe each command, using data from our production environment as examples. For more detail on any of these tools, see their man pages.

1. uptime

$ uptime
 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02

This is a quick way to view the load averages, which indicate the number of tasks (processes) wanting to run. On Linux, these numbers include tasks that want to run on a CPU as well as tasks blocked in uninterruptible I/O (usually disk I/O). This gives a high-level idea of resource load, but it cannot be properly understood without other tools.

The three numbers are exponentially damped moving averages with time constants of 1, 5, and 15 minutes; as a simplification, they can be read as the average load over those periods. Together they show how the load has changed over time. For example, if you are asked to check a problem server and the 1-minute value is much lower than the 15-minute value, you may have logged in too late and missed the issue.
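To see why the three values react at different speeds, here is a small simulation of an exponentially damped moving average. It is a sketch under stated assumptions: the 5-second update interval and floating-point arithmetic are simplifications chosen for illustration (the kernel uses fixed-point math), and update_load and simulate are hypothetical names.

```python
import math


def update_load(previous, runnable, window_seconds, interval_seconds=5.0):
    """One update step of an exponentially damped moving average."""
    decay = math.exp(-interval_seconds / window_seconds)
    return previous * decay + runnable * (1.0 - decay)


def simulate(runnable, seconds, window_seconds, start=0.0, interval_seconds=5.0):
    """Apply a constant number of runnable tasks for `seconds` seconds."""
    load = start
    for _ in range(int(seconds // interval_seconds)):
        load = update_load(load, runnable, window_seconds, interval_seconds)
    return load


# Starting from an idle system, hold 30 runnable tasks for one minute.
for window in (60, 300, 900):
    print(window, round(simulate(30, 60, window), 2))
```

After one minute of constant load, the 1-minute average has covered only about 63% of the distance to the true value (1 - 1/e), and the 5- and 15-minute averages lag much further behind. That lag is what makes the three numbers useful as a trend indicator.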

In the example above, the load has been rising: the 1-minute value is about 30, compared with 19 for the 15-minute value. A load this high can have many causes; CPU demand is a likely one, and vmstat or mpstat can confirm it.
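The comparison of the 1-minute and 15-minute values can be scripted. This is a minimal sketch: parse_load_averages and load_trend are hypothetical helpers, and the 10% tolerance is an arbitrary assumption rather than a recommended threshold.

```python
import re


def parse_load_averages(uptime_line):
    """Extract the 1-, 5- and 15-minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+)[,\s]+([\d.]+)[,\s]+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) for value in match.groups())


def load_trend(one, five, fifteen, tolerance=0.1):
    """Classify the trend by comparing the 1-minute and 15-minute values."""
    if one > fifteen * (1 + tolerance):
        return "rising"
    if one < fifteen * (1 - tolerance):
        return "falling"
    return "steady"


line = " 23:51:26 up 21:31,  1 user,  load average: 30.02, 26.43, 19.02"
averages = parse_load_averages(line)
print(averages, load_trend(*averages))
```

A "falling" result on a server that was reported as slow suggests the event may already be over, which matches the reasoning in the text.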

2. dmesg | tail

$ dmesg | tail
[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[...]
[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child
[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB
[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request. Check SNMP count

This command shows the last 10 kernel messages, if there are any. Look for errors that can cause performance problems. The example above includes the oom-killer and TCP dropping a request.

Don't skip this step! dmesg is always worth checking.
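Scanning kernel messages for known trouble signs is easy to automate. The sketch below is illustrative only: the two patterns cover just the messages shown in the example, and PATTERNS and flag_kernel_messages are hypothetical names, not a complete catalogue of kernel errors.

```python
import re

# Patterns for the two kinds of message seen in the example output.
PATTERNS = {
    "oom": re.compile(r"oom-killer|Out of memory", re.IGNORECASE),
    "syn_flood": re.compile(r"SYN flooding", re.IGNORECASE),
}


def flag_kernel_messages(lines):
    """Return (label, message) pairs for lines matching a known pattern."""
    findings = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((label, line.strip()))
                break  # one label per line is enough
    return findings


sample = [
    "[1880957.563150] perl invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0",
    "[1880957.563400] Out of memory: Kill process 18694 (perl) score 246 or sacrifice child",
    "[1880957.563408] Killed process 18694 (perl) total-vm:1972392kB, anon-rss:1953348kB, file-rss:0kB",
    "[2320864.954447] TCP: Possible SYN flooding on port 7001. Dropping request.",
]
for label, message in flag_kernel_messages(sample):
    print(label, message)
```

In practice the input would come from the output of dmesg; a fixed sample is used here so the sketch runs anywhere.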

3. vmstat 1

$ vmstat 1
procs ---------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
34  0    0 200889792  73708 591828    0    0     0     5    6   10 96  1  3  0  0
32  0    0 200889920  73708 591860    0    0     0   592 13284 4282 98  1  1  0  0
32  0    0 200890112  73708 591860    0    0     0     0 9501 2154 99  1  0  0  0
32  0    0 200889568  73712 591856    0    0     0    48 11900 2459 99  0  0  0  0
32  0    0 200890208  73712 591860    0    0     0     0 15898 4840 98  1  1  0  0
^C

vmstat reports virtual memory and CPU statistics. The argument 1 makes it print a summary every second. In this version of vmstat, the first line of output shows averages since boot rather than for the previous second, so skip it for now.

Metrics to view:

r: the number of runnable tasks, both those running on a CPU and those waiting for a turn. This is a better signal of CPU saturation than the load averages because it does not include tasks waiting on I/O. An r value greater than the CPU count means the CPUs are saturated.
free: free memory, in kilobytes.
si, so: pages swapped in and out. Non-zero values mean the system is short of memory.
us, sy, id, wa, st: the breakdown of CPU time, averaged across all CPUs: user time, system (kernel) time, idle, waiting on I/O, and stolen time (time taken by other tenants in a virtualized environment).

Adding user time (us) and system time (sy) confirms whether the CPUs are busy. A constant level of I/O wait (wa) points to a disk bottleneck; the CPUs are idle during that time because tasks are blocked on pending disk I/O. You can treat I/O wait as another form of CPU idle, one that gives a clue as to why the CPUs are idle.

Processing I/O necessarily consumes system time (sy). A high system time average, for example over 20%, is worth exploring further: the kernel may be processing I/O inefficiently.

In the example above, almost all CPU time is in user mode, which points to application-level code. Average CPU utilization (us + sy) is well over 90%. This is not necessarily a problem; compare the r column with the CPU count to check the degree of saturation.
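The checks described above (r against the CPU count, us + sy, and non-zero si or so) can be expressed directly. This is a sketch that assumes the 17-column layout shown in the example; other vmstat versions may print different columns, and parse_vmstat_row and assess_cpu are hypothetical names.

```python
# Column names in the order this version of vmstat prints them (an assumption).
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat_row(row):
    """Turn one data row of vmstat output into a dict keyed by column name."""
    values = [int(value) for value in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(VMSTAT_COLUMNS), len(values)))
    return dict(zip(VMSTAT_COLUMNS, values))


def assess_cpu(sample, cpu_count):
    """Apply the rules of thumb from the text to one vmstat sample."""
    return {
        "busy_percent": sample["us"] + sample["sy"],
        "saturated": sample["r"] > cpu_count,  # more runnable tasks than CPUs
        "swapping": sample["si"] > 0 or sample["so"] > 0,
    }


row = "32 0 0 200889920 73708 591860 0 0 0 592 13284 4282 98 1 1 0 0"
print(assess_cpu(parse_vmstat_row(row), cpu_count=32))
```

With 32 CPUs and r equal to 32, the CPUs are fully used but not yet over-subscribed; an r of 34 on the same machine would count as saturated under this rule.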

4. mpstat -P ALL 1

$ mpstat -P ALL 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

07:38:49 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
07:38:50 PM  all  98.47   0.00   0.75    0.00   0.00   0.00    0.00    0.00    0.00   0.78
07:38:50 PM    0  96.04   0.00   2.97    0.00   0.00   0.00    0.00    0.00    0.00   0.99
07:38:50 PM    1  97.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   2.00
07:38:50 PM    2  98.00   0.00   1.00    0.00   0.00   0.00    0.00    0.00    0.00   1.00
07:38:50 PM    3  96.97   0.00   0.00    0.00   0.00   0.00    0.00    0.00    0.00   3.03
[...]

This command prints the CPU time breakdown for each CPU, which shows whether the load is evenly balanced. A single hot CPU, for example, can be evidence of a single-threaded application.
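The "single hot CPU" pattern can be detected from per-CPU busy percentages (100 minus %idle). The following is a rough sketch: hot_cpus is a hypothetical helper, and the 90% and 20% thresholds are assumptions picked for illustration, not values from the original article.

```python
def hot_cpus(busy_by_cpu, hot_threshold=90.0, quiet_threshold=20.0):
    """Return (hot CPU ids, imbalanced flag) from per-CPU busy percentages.

    The load counts as imbalanced when at least one CPU is hot while
    every other CPU is nearly idle.
    """
    hot = sorted(cpu for cpu, busy in busy_by_cpu.items() if busy >= hot_threshold)
    quiet = [cpu for cpu, busy in busy_by_cpu.items() if busy <= quiet_threshold]
    imbalanced = bool(hot) and bool(quiet) and len(hot) + len(quiet) == len(busy_by_cpu)
    return hot, imbalanced


# Busy percentages computed as 100 - %idle for CPUs 0-3 of the example.
evenly_loaded = {0: 99.01, 1: 98.00, 2: 99.00, 3: 96.97}
print(hot_cpus(evenly_loaded))

# A hypothetical single-threaded workload pinned to CPU 0.
single_threaded = {0: 99.0, 1: 3.0, 2: 2.0, 3: 4.0}
print(hot_cpus(single_threaded))
```

In the example output all CPUs are busy, so the load is balanced; the second, made-up sample shows what a single-threaded bottleneck would look like.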

5. pidstat 1

$ pidstat 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

07:41:02 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:03 PM     0         9    0.00    0.94    0.00    0.94     1  rcuos/0
07:41:03 PM     0      4214    5.66    5.66    0.00   11.32    15  mesos-slave
07:41:03 PM     0      4354    0.94    0.94    0.00    1.89     8  java
07:41:03 PM     0      6521 1596.23    1.89    0.00 1598.11    27  java
07:41:03 PM     0      6564 1571.70    7.55    0.00 1579.25    28  java
07:41:03 PM 60004     60154    0.94    4.72    0.00    5.66     9  pidstat

07:41:03 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
07:41:04 PM     0      4214    6.00    2.00    0.00    8.00    15  mesos-slave
07:41:04 PM     0      6521 1590.00    1.00    0.00 1591.00    27  java
07:41:04 PM     0      6564 1573.00   10.00    0.00 1583.00    28  java
07:41:04 PM   108      6718    1.00    0.00    0.00    1.00 [...]  snmp-pass
07:41:04 PM 60004     60154    1.00    4.00    0.00    5.00     9  pidstat
^C

pidstat is a little like top's per-process summary, but it prints a rolling summary at each interval instead of clearing the screen. This makes it easy to watch patterns over time and to copy and paste the output into a record of your investigation.

The example above identifies two Java processes as responsible for most of the CPU consumption. The %CPU column is the total across all CPUs; 1591% means that Java process is consuming almost 16 CPUs.
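The conversion from %CPU to a CPU count is a division by 100, which the short sketch below makes explicit. The function names and the one-CPU cutoff are hypothetical; the process rows are taken from the second sample of the example.

```python
def cpus_consumed(percent_cpu):
    """Convert pidstat's %CPU (summed over all CPUs) to a number of CPUs."""
    return percent_cpu / 100.0


def top_consumers(rows, minimum_cpus=1.0):
    """Return (pid, command, cpus) for processes using at least `minimum_cpus`,
    sorted from the largest consumer down."""
    heavy = [
        (pid, command, cpus_consumed(percent))
        for pid, command, percent in rows
        if cpus_consumed(percent) >= minimum_cpus
    ]
    return sorted(heavy, key=lambda item: item[2], reverse=True)


# (PID, Command, %CPU) from the second interval of the example output.
rows = [
    (4214, "mesos-slave", 8.00),
    (6521, "java", 1591.00),
    (6564, "java", 1583.00),
    (60154, "pidstat", 5.00),
]
for pid, command, cpus in top_consumers(rows):
    print(pid, command, round(cpus, 2))
```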

6. iostat -xz 1

$ iostat -xz 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          73.96    0.00    3.73    0.03    0.06   22.21

Device:   rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvda        0.00     0.23    0.21    0.18     4.52     2.08    34.37     0.00    9.98   13.80    5.42   2.44   0.09
xvdb       [...]    [...]    1.02    8.94   127.97   598.53    [...]     0.00    0.43    1.78    0.28   0.25   0.25
xvdc        0.01     0.00    1.02    8.86   127.79   595.94   146.50     0.00    0.45    1.82    0.30   0.27   0.26
dm-0        0.00     0.00    0.69    2.32    10.47    31.69    28.01     0.01    3.23    0.71    3.98   0.13   0.04
dm-1        0.00     0.00    0.00    0.94     0.01     3.78     8.00     0.33  345.84    0.04  346.81   0.01   0.00
dm-2        0.00     0.00    0.09    0.07     1.35     0.36    22.50     0.00    2.55    0.23 [...]

iostat is an important tool for understanding block devices (disks): both the workload applied and the resulting performance. Look for:

r/s, w/s, rkB/s, wkB/s: the reads, writes, kilobytes read, and kilobytes written per second that the system delivers to the device. Use these to characterize the workload; a performance problem may simply be due to excessive load.
await: the average time for an I/O request, in milliseconds. It includes both the time the request spends queued and the time spent servicing it. Averages larger than expected can indicate device saturation or a device problem.
avgqu-sz: the average length of the device request queue. Values greater than 1 can indicate saturation.
%util: device utilization, the percentage of each second that the device spent processing I/O. Values above 60% typically lead to poor performance (visible in await), although this depends on the device. Values close to 100% usually indicate saturation.

If the block device is a logical device fronting many physical disks, then 100% utilization may only mean that some I/O is being processed 100% of the time. The back-end disks may be far from saturated and able to handle much more work.

Keep in mind that poor disk I/O performance is not necessarily an application problem. Many techniques perform I/O asynchronously so that the application does not block and suffer the latency directly, for example read-ahead for reads and buffering for writes.
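The rules of thumb for avgqu-sz and %util can be turned into a simple check. This is a sketch, not a definitive health test: disk_warnings is a hypothetical helper, the 60% and queue-length-1 limits come from the text, and the 99% cutoff for "close to 100%" is an assumption. As noted above, logical devices can legitimately exceed these limits.

```python
def disk_warnings(stats, util_warn=60.0, util_saturated=99.0, queue_warn=1.0):
    """Return warnings for one device, given its iostat -x columns as a dict."""
    warnings = []
    if stats["avgqu-sz"] > queue_warn:
        warnings.append("queue length above 1: possible saturation")
    if stats["%util"] >= util_saturated:
        warnings.append("utilization close to 100%: device likely saturated")
    elif stats["%util"] > util_warn:
        warnings.append("utilization above 60%: check await for latency")
    return warnings


# xvda from the example output: lightly loaded, so no warnings are expected.
xvda = {"await": 9.98, "avgqu-sz": 0.00, "%util": 0.09}
print(disk_warnings(xvda))

# A made-up busy device for comparison.
busy = {"await": 45.0, "avgqu-sz": 2.5, "%util": 75.0}
print(disk_warnings(busy))
```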

7. free -m

$ free -m
             total       used       free     shared    buffers     cached
Mem:        245998      24545     221453         83         59        541
-/+ buffers/cache:      23944     222053
Swap:            0          0          0

The two columns on the right show:

buffers: the buffer cache, used for block device I/O.
cached: the page cache, used by file systems.

We just want to check that these are not near zero in size, which can lead to higher disk I/O (confirm with iostat) and worse performance. The example above looks fine, with many megabytes in each.

The "-/+ buffers/cache" line gives a more accurate view of used and free memory than the first line. Linux uses otherwise idle memory for caches and reclaims it as soon as applications need it, so memory used for caching should really be counted as free. There is even a website devoted to explaining this: http://www.linuxatemyram.com/.
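The "-/+ buffers/cache" figures can be reproduced from the first line by moving the buffers and cached amounts from "used" to "free". A minimal sketch, using a hypothetical helper name and the megabyte values from the example:

```python
def memory_excluding_cache(used, free, buffers, cached):
    """Return (used, free) with buffer and page cache counted as free memory."""
    cache = buffers + cached
    return used - cache, free + cache


# Values in MB from the "Mem:" line of the example.
used_mb, free_mb = memory_excluding_cache(used=24545, free=221453, buffers=59, cached=541)
print(used_mb, free_mb)
```

The free figure matches the example (222053 MB). The used figure comes out 1 MB higher than the 23944 shown, presumably because free rounds each column to whole megabytes independently.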

If you use ZFS, this can be a little more confusing. ZFS has its own file system cache, which is not reflected in the free -m columns; the system can appear to be low on free memory when much of that memory is in fact available from the ZFS cache.

8. sar -n DEV 1

$ sar -n DEV 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

12:16:48 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:49 AM      eth0  18763.00   5032.00  20686.42    478.30      0.00      0.00      0.00      0.00
12:16:49 AM        lo     14.00     14.00      1.36      1.36      0.00      0.00      0.00      0.00
12:16:49 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:16:49 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
12:16:50 AM      eth0  19763.00   5101.00  21999.10    482.56      0.00      0.00      0.00      0.00
12:16:50 AM        lo     20.00     20.00      3.25      3.25      0.00      0.00      0.00      0.00
12:16:50 AM   docker0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
^C

This tool reports network interface throughput: rxkB/s and txkB/s measure the workload and show whether a limit has been reached. In the example above, eth0 is receiving about 22 Mbytes/s, or about 176 Mbits/sec, well under a 1 Gbit/sec limit.
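The arithmetic behind the 176 Mbits/sec figure is shown below as a small sketch. The helper names are hypothetical, and the default of 1000 bytes per kB is an assumption chosen to match the rounding used in the text; pass 1024 if your sar version reports kibibytes.

```python
def throughput_mbit_per_s(kb_per_s, bytes_per_kb=1000):
    """Convert a sar rxkB/s or txkB/s value to megabits per second."""
    return kb_per_s * bytes_per_kb * 8 / 1_000_000


def link_utilization(kb_per_s, link_mbit_per_s, bytes_per_kb=1000):
    """Fraction of the link capacity used by the given throughput."""
    return throughput_mbit_per_s(kb_per_s, bytes_per_kb) / link_mbit_per_s


rx_kb_per_s = 21999.10  # eth0 rxkB/s from the second sample of the example
print(round(throughput_mbit_per_s(rx_kb_per_s), 1))
print(round(link_utilization(rx_kb_per_s, link_mbit_per_s=1000), 3))
```

Against an assumed 1 Gbit/sec link, this receive rate uses less than a fifth of the capacity, which agrees with the conclusion in the text.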

This version of sar also reports %ifutil for device utilization, which is also what Brendan's nicstat tool measures. As with nicstat, this value is hard to get right, and it does not seem to be working in the example above.

9. sar -n TCP,ETCP 1

$ sar -n TCP,ETCP 1
Linux 3.13.0-49-generic (titanclusters-xxxxx)  07/14/2015  _x86_64_ (32 CPU)

12:17:19 AM  active/s passive/s    iseg/s    oseg/s
12:17:20 AM      1.00      0.00  10233.00  18846.00

12:17:19 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:20 AM      0.00      0.00      0.00      0.00      0.00

12:17:20 AM  active/s passive/s    iseg/s    oseg/s
12:17:21 AM      0.00      0.00   8359.00   6039.00

12:17:20 AM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
12:17:21 AM      0.00      0.00      0.00      0.00      0.00
^C

This is a summary of key TCP metrics, including:

active/s: the number of locally initiated TCP connections per second, that is, connections opened by a local program through connect().
passive/s: the number of remotely initiated TCP connections per second, that is, connections a local program takes in through accept().
retrans/s: the number of TCP retransmits per second.

The active and passive counts are often useful as a rough measure of server load: the number of new accepted connections (passive) and the number of downstream connections (active). It can help to think of active as outbound and passive as inbound, but this is not strictly true; consider a connection from localhost to localhost.

Retransmits are a sign of a network or server problem: the network may be unreliable, or the server may be overloaded and dropping packets. The example above shows only about one new TCP connection per second.
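A retransmit count is easier to judge relative to the number of segments sent. The sketch below computes that ratio from the retrans/s and oseg/s columns; retransmit_ratio is a hypothetical helper, and no particular threshold is implied by the original article.

```python
def retransmit_ratio(retrans_per_s, oseg_per_s):
    """Fraction of outgoing TCP segments that were retransmitted.

    Returns 0.0 when nothing was sent, to avoid dividing by zero.
    """
    if oseg_per_s <= 0:
        return 0.0
    return retrans_per_s / oseg_per_s


# First sample of the example: no retransmits while sending 18846 segments/s.
print(retransmit_ratio(retrans_per_s=0.00, oseg_per_s=18846.00))

# A made-up unhealthy sample for comparison: 0.5% of segments retransmitted.
print(retransmit_ratio(retrans_per_s=50.0, oseg_per_s=10000.0))
```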

10. top

$ top
top - 00:15:40 up 21:56,  1 user,  load average: 31.09, 29.87, 29.92
Tasks: 871 total,   1 running, 868 sleeping,   0 stopped,   2 zombie
%Cpu(s): 96.8 us,  0.4 sy,  0.0 ni,  2.7 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  25190241+total, 24921688 used, 22698073+free,    60448 buffers
KiB Swap:        0 total,        0 used,        0 free.   554208 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 20248 root      20   0  0.227t 0.012t  18748 S  3090  5.2 29812[...] java
  4213 root      20   0 2722544  64640  44232 S  23.5  0.0 233[...] mesos-slave
 [...]                                        R   1.0  0.0 [...]00.07 top
  5235 root      20   0 38.227g 547004  49996 S   0.7  0.2   2:02.74 java
  4299 root      20   0 20.015g 2.682g  16836 S   0.3  1.1  33:14.42 java
     1 root      20   0   33620   2920   1496 S   0.0  0.0   0:03.82 init
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
     3 root      20   0       0      0      0 S   0.0  0.0   0:05.35 ksoftirqd/0
     5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     6 root      20   0       0      0      0 S   0.0  0.0   0:06.94 kworker/u256:0
     8 root      20   0       0      0      0 S   0.0  0.0   2:38.05 rcu_sched

The top command includes many of the metrics we checked earlier. It can be handy to run it to see whether anything looks very different from the earlier commands, which would indicate that the load is variable.

A downside of top is that patterns over time are harder to see; tools such as vmstat and pidstat, which provide rolling output, make them clearer. Evidence of intermittent issues can also be lost if you don't pause the output quickly enough (Ctrl-S to pause, Ctrl-Q to continue) before the screen clears.
