In this issue we look at the performance parameters of a Linux server. The content is fairly dense and approaches the topic from a practitioner's point of view; I hope you get something out of reading it.
When a server running a Linux operating system is in operation, it also exposes all kinds of parameter information. Generally speaking, operations staff and system administrators are extremely sensitive to this data, but these parameters matter to developers as well, especially when your program is not working properly: these clues often help locate and track down the problem quickly.
What follows are just some simple tools for viewing the relevant system parameters; many of them also work by reading the data under /proc and /sys. More detailed and professional performance monitoring and tuning may require more specialized tools (perf, systemtap, etc.) and techniques; after all, system performance monitoring is a deep subject in its own right.
1. CPU and memory
1.1 top
➜ ~ top
The three values at the end of the first line are the average load of the system over the previous 1, 5 and 15 minutes, from which you can also tell whether the load is rising, steady or falling. When this value exceeds the number of CPU execution units (cores), CPU performance has become a bottleneck.
The second line summarizes the task states of the system. running naturally needs no explanation; it includes tasks executing on a CPU and tasks queued to be scheduled. sleeping usually means tasks waiting for an event (such as an IO operation) to complete, and can be subdivided into interruptible and uninterruptible types. stopped covers paused tasks, usually paused by sending SIGSTOP or by pressing Ctrl-Z on a foreground task. As for zombie tasks: although a process's resources are reclaimed automatically when it terminates, the task descriptor of the exited task can only be released after the parent process reaps it, so such a process shows up as defunct. Whether the parent exited early or simply never called wait(), these processes deserve special attention, as they often point to a design error in the program.
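A quick way to spot defunct tasks and their parents (a minimal sketch; the fields are standard ps output columns):
➜ ~ ps -eo stat,ppid,pid,comm | awk '$1 ~ /^Z/'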
The third line breaks CPU usage down into the following categories:
√ (us) user: time the CPU spends in user mode at a low nice value (high priority, nice<=0). A newly started process defaults to nice=0 and is counted here; its time is charged to the separate nice (ni) field only if the program's nice value is raised via renice or setpriority().
√ (id) idle: time the CPU spends in the idle state (running the kernel idle handler)
√ (wa) iowait: time spent waiting for IO operations to complete
√ (hi) irq: the time taken by the system to handle hardware interrupts
√ (si) softirq: the time the system spends handling soft interrupts. Remember that soft-interrupt work is divided into softirqs, tasklets (actually a special case of the former) and work queues; I am not sure which of these is counted here, since work queues no longer execute in an interrupt context anyway.
√ (st) steal: only meaningful on a virtual machine. Because a virtual machine shares the physical CPU with others, this value is the time the virtual machine spends waiting for the hypervisor to schedule a CPU for it; in other words, during that time the hypervisor has scheduled the CPU to some other virtual machine, and those CPU resources are "stolen". This value is not zero on my KVM VPS, though only on the order of 0.1%. Can it be used to judge whether a VPS is oversold?
In many cases, a consistently high CPU usage in one of these categories is meaningful, and it points to the corresponding troubleshooting direction when server CPU utilization is too high:
√ when user usage is too high, it is usually a few individual processes hogging the CPU, which are easy to find with top. If you suspect the program is abnormal, you can then use perf or similar tools to find the hot functions and investigate further.
√ when system usage is too high, and there is a lot of IO (including terminal IO), a high share in this category can be normal, for example on file servers or database servers; otherwise (say, above 20%) some kernel or driver module likely has a problem.
√ when nice usage is too high, it is usually intentional: the party that started a CPU-hungry process raised its nice value so that other processes' requests for CPU are not crowded out.
√ when iowait usage is too high, it usually means some program's IO is very inefficient, or the underlying IO device performs so poorly that reads and writes take a long time to complete.
√ when irq/softirq usage is too high, some peripheral is likely misbehaving and generating a flood of interrupt requests; check the /proc/interrupts file to investigate (a sketch is given after this list).
√ when steal usage is too high, an unscrupulous provider has oversold the virtual machines!
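For the irq/softirq case above, the interrupt counters can be inspected directly; a sketch, where watch -d highlights the counters that are climbing:
➜ ~ watch -d -n 1 'cat /proc/interrupts'
➜ ~ cat /proc/softirqs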
The fourth and fifth lines are information about physical memory and virtual memory (swap partition):
Total = free + used + buff/cache. The buffers and cached Mem figures are now added together, but the relationship between the two is poorly explained in many places. Comparing the numbers shows that these two values are the Buffers and Cached fields in /proc/meminfo: Buffers is the block cache for raw disk blocks, mainly caching file system metadata (such as superblock information) in raw-block form, and it is generally small (around 20 MB); Cached is the read cache for particular files, used to improve file access efficiency, so it can be regarded as the file cache of the file system.
Avail Mem is a newer field that indicates how much memory can be handed to a newly started program without swapping; it is roughly equal to free + buff/cache, which confirms the point above that free + buffers + cached Mem is the truly available physical memory. Also, using the swap partition is not necessarily a bad thing, so swap usage by itself is not an alarming figure; frequent swap in/out, however, is not good and usually indicates a shortage of physical memory.
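Since these figures come from /proc/meminfo, they can be cross-checked directly; a minimal sketch using the field names as they appear there:
➜ ~ grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree):' /proc/meminfo
➜ ~ free -h    # the same numbers in human-readable form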
Finally, there is the per-process resource usage list, where a process's CPU usage is the sum of its usage across all CPU cores. Usually while top is running, the top program itself issues a large number of reads of /proc, so top itself tends to sit near the head of the list.
Although top is very powerful, it is usually used for real-time monitoring of system information at a console; it is not suited to monitoring system load over long periods (days or months), and it cannot report on short-lived processes at all.
1.2 vmstat
vmstat is another commonly used system inspection tool besides top. The output below shows the system load while I was compiling boost with -j4.
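For reference, the sampling invocation is simply the following; the argument is the interval in seconds, and an optional count can be appended to stop automatically:
➜ ~ vmstat 1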
r indicates the number of runnable processes, and the figure roughly matches expectations; b indicates the number of processes in uninterruptible sleep; swpd indicates the amount of virtual memory (swap) in use, which means the same as top's Swap used value; and, as the manual says, buffers is usually much smaller than cached Mem, typically on the order of 20 MB. In the io section, bi and bo are the numbers of blocks received from and sent to disk per second (blocks/s). In the system section, in is the number of interrupts per second (including clock interrupts) and cs is the number of context switches caused by process switching.
Speaking of which, many people used to agonize over whether the -j parameter for compiling the linux kernel should be CPU Core or CPU Core+1. By changing the -j value while compiling boost and the linux kernel with vmstat monitoring enabled, I found that the context switch rate barely changed in either case, and only rose significantly once -j was pushed well beyond the core count. So there seems to be no need to obsess over this parameter, although I did not measure the actual compilation times. It is said that, outside of system startup and benchmarking, a context switch rate above 100000 means something is wrong with the program.
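One way to repeat that comparison is sketched below. The build command and source tree are assumptions rather than the author's exact setup; column 12 of vmstat's output is the cs field.
➜ ~ vmstat 1 > cs.log &
➜ ~ make -j4    # rebuild with -j8, -j16 ... and compare
➜ ~ kill %1; awk 'NR > 2 { sum += $12; n++ } END { print "avg cs/s:", sum / n }' cs.log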
1.3 pidstat
If you want to track one process comprehensively and in detail, nothing is more suitable than pidstat: stack space, page faults, voluntary and involuntary context switches and more are all in view. The most useful argument of this command is -t, which lists details for every thread of the process.
-r: displays page faults and memory usage. A page fault occurs when a program accesses a page that is mapped in its virtual address space but has not yet been loaded into physical memory. The two main types of page fault are
√ minflt/s refers to minor faults: the physical page being accessed already exists in physical memory for some reason (shared pages, the cache mechanism and so on) but has no entry in the current process's page table, so the MMU only needs to set up the corresponding entry; the cost is quite small.
√ majflt/s refers to major faults: the MMU must allocate a free physical page from the currently available physical memory (if no page is free, other physical pages must first be swapped out to swap space to free one), load the data into that page from external storage, and then set up the page table entry. The cost is quite high, several orders of magnitude greater than the former.
-s: stack usage, where StkSize is the stack space reserved for the thread and StkRef is the stack space actually used. Checking with ulimit -s shows that the default stack size on CentOS 6.x is 10240 KB, while on CentOS 7.x and the Ubuntu series it is 8192 KB.
-u: CPU usage; the fields are similar to those of the tools above.
-w: the number of context switches per thread, subdivided into cswch/s, voluntary switches caused by waiting for resources and the like, and nvcswch/s, involuntary switches caused by the thread's CPU time slice expiring.
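A combined sketch covering the options above; the pid 1234 is hypothetical and the trailing 1 is the sampling interval in seconds.
➜ ~ pidstat -r -s -u -w -t -p 1234 1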
Running ps to find the program's pid before every pidstat call is tedious, so the killer option -C lets you specify a string: any task whose Command contains that string has its information printed and counted, and -l displays the full program name and arguments.
➜ ~ pidstat -w -t -C "ailaw" -l
In this way, when examining a single task, especially a multithreaded one, pidstat works better than the commonly used ps!
1.4 other
When you need to monitor individual CPUs separately, besides htop you can also use mpstat to check whether the workload of each Core of an SMP processor is balanced and whether some hot thread is monopolizing a Core.
➜ ~ mpstat -P ALL 1
If you want to watch the resources consumed by a single process directly, you can either use top -u taozj to filter out other users' processes, or use the approach below; the ps command lets you customize which fields are printed:
while :; do ps -eo user,pid,ni,pri,pcpu,psr,comm | grep 'ailawd'; sleep 1; done
To sort out parent-child relationships, the following commonly used parameters display the process tree, which is much more detailed and better looking than pstree:
➜ ~ ps axjf
2. Disk IO
iotop can intuitively display the real-time disk read/write rate of each process and thread; lsof shows open-file information not only for ordinary files (and their users) but also for device files such as /dev/sda1. For example, when a partition cannot be umounted, lsof can show how the partition is being used, and adding the +fg parameter also displays the file open flags.
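Two hedged invocations along those lines; /dev/sda1 just follows the example above, and iotop's -o shows only tasks actually doing IO while -P aggregates threads into processes.
➜ ~ sudo iotop -oP
➜ ~ lsof +fg /dev/sda1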
2.1 iostat
➜ ~ iostat -xz 1
In fact, whether you use iostat -xz 1 or sar -d 1, the important parameters for a disk are:
√ avgqu-sz: the average queue length of the I/O requests issued to the device. For a single disk a value greater than 1 indicates the device is saturated; this does not apply to logical volumes backed by multi-disk arrays.
√ await (r_await, w_await): the average wait time (ms) of each device I/O request, i.e. the sum of the time the request spent queued and being serviced.
√ svctm: the average service time (ms) of I/O requests sent to the device. If svctm is close to await, there is almost no I/O queueing and disk performance is good; if await is much larger, the I/O queue wait is long and the device backing the application is responding poorly.
√ %util: device utilization, the percentage of each second the device spends doing I/O work. The performance of a single disk degrades once %util exceeds 60%, and the device is close to saturation as it approaches 100%; again, this does not apply to logical volumes backed by multi-disk arrays.
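To watch a single device rather than all of them, a device name can be appended; sda here is an assumed name.
➜ ~ iostat -xz sda 1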
Also, even if the measured disk performance is poor, it does not necessarily affect the application's response time, because the kernel usually uses asynchronous I/O and read/write caching techniques to improve performance, though these are constrained by the physical memory limits discussed above.
The above parameters are also useful for network file systems.
3. Network
The importance of network performance to a server is self-evident. The tool iptraf can intuitively show the send and receive rates of each network card, and similar throughput information can also be obtained with sar -n DEV 1. Since network cards come with a rated maximum speed, such as 100 Mbit or gigabit cards, it is easy to check device utilization.
Generally, though, the card's transmission rate is not the main concern in network development; what matters more are the packet loss rate, retransmission rate, network latency and similar figures for specific UDP and TCP connections.
3.1 netstat
➜ ~ netstat -s
This displays the cumulative statistics of each protocol since the system started. Although the information is rich and useful, cumulative values cannot tell you the current state of the network unless you take the difference between two runs, or wrap the command in watch to eyeball the trend of the numbers.
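A minimal sketch of the watch approach; -d highlights the values that changed between refreshes.
➜ ~ watch -d -n 1 'netstat -s'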
For that reason netstat is usually used to inspect port and connection information instead:
netstat --all (-a) --numeric (-n) --tcp (-t) --udp (-u) --timers (-o) --listening (-l) --program (-p)
Among these, --numeric (-n) skips reverse DNS lookups and speeds up the display. The most commonly used combinations are
➜ ~ netstat -antp    # list all TCP connections
➜ ~ netstat -nltp    # list all locally listening TCP sockets; there is no need to add the -a parameter
3.2 sar
sar is almost too powerful a tool: CPU, disk, page swapping, it covers everything. Here -n is used mainly to analyze network activity. Although sar further breaks the network down into protocol-level statistics for NFS, IP, ICMP, SOCK and so on, we only care about TCP and UDP. The commands below show, in addition to the usual sending and receiving of segments and datagrams, the following:
TCP
➜ ~ sudo sar -n TCP,ETCP 1
√ active/s: locally initiated TCP connections per second, e.g. via connect(), where the TCP state goes from CLOSED -> SYN-SENT
√ passive/s: remotely initiated TCP connections per second, e.g. via accept(), where the TCP state goes from LISTEN -> SYN-RCVD
√ retrans/s (tcpRetransSegs): the number of TCP retransmissions per second. Retransmission usually happens, following TCP's acknowledgement and retransmission mechanism, when network quality is poor or packets are dropped because the server is overloaded.
√ isegerr/s (tcpInErrs): the number of erroneous segments received per second (e.g. checksum failures)
UDP
➜ ~ sudo sar -n UDP 1
√ noport/s (udpNoPorts): the number of datagrams received per second for which no application was listening on the destination port
√ idgmerr/s (udpInErrors): the number of datagrams received by this machine per second that could not be delivered for reasons other than the above
Of course, these figures reflect network reliability only to a degree; they become meaningful only in the context of a specific business scenario.
3.3 tcpdump
tcpdump is definitely a good thing. Everyone likes to use wireshark when debugging locally, but what do you do when a problem appears on a production server?
The references in the appendix give the idea: recreate the environment and capture packets with tcpdump; when the problem recurs (for example a particular log line or state shows up), end the capture. tcpdump itself has -C and -W parameters that limit the size and number of the capture files, and when the limit is reached the saved packet data rotates automatically, so the total amount captured stays under control. Afterwards you can pull the capture file off the server and inspect it in wireshark however you like. Wouldn't that be nice? Although tcpdump has no GUI, its capture capability is by no means weak: you can specify filters on network card, host, port, protocol and more, and the captured packets are complete and timestamped, so packet analysis of an online program can be just this simple.
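A hedged sketch of such a rotating capture; the interface eth0, the port 80 filter and the file name are assumptions, while -C rotates at roughly 100 MB per file and -W keeps at most 10 files.
➜ ~ sudo tcpdump -i eth0 -C 100 -W 10 -w /tmp/capture.pcap 'tcp port 80'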
Below is a small test. You can see that when Chrome starts it automatically initiates and establishes three connections to the web server. Because the dst port filter is applied here, the server's response packets are filtered out; if you take the capture down and open it in wireshark, the SYN and ACK of connection establishment are still plain to see. When using tcpdump, configure the capture filters as tightly as possible: first, it makes later analysis easier; second, running tcpdump affects the performance of the network card and the system, which in turn can affect the online service.
These are the main performance parameters of a Linux server. If you have had similar questions, hopefully the analysis above helps you understand them.