How to evaluate, monitor, locate and optimize the performance of disk IO and network IO
In production we often encounter system throughput degradation and slow response times caused by long IO latency: switch failures, packet loss and retransmission caused by aging network cables, and IO latency caused by insufficient stripe width on the storage array, insufficient cache, QoS limits, improper RAID level settings, and so on.
I. The premise of evaluating IO capability
The premise of evaluating a system's IO capability is to find out what its IO model looks like. So what is an IO model, and why refine it?
(1) IO model
In real business processing, IO is generally mixed: the read/write ratio, IO size, and so on all fluctuate. So when we refine the IO model, we usually build a model for a specific scenario and use it for IO capacity planning and problem analysis.
The most basic models include:
IOPS, bandwidth, and IO size
If it is disk IO, you also need to pay attention to:
which disk the IO lands on; the ratio of read IO to write IO; whether the reads are sequential or random; and whether the writes are sequential or random.
(2) Why to refine the IO model
Under different models, the maximum IOPS, MBPS, and response time that the same storage, or even the same LUN, can deliver are different.
When a storage vendor quotes maximum IOPS, the test is generally done with random small IO, which consumes very little bandwidth but has a much longer response time than sequential IO. If the random small IO is changed to sequential small IO, the IOPS will be even higher. When testing with sequential large IO, bandwidth consumption is very high, but IOPS is very low.
Therefore, for capacity planning and performance tuning of IO, it is necessary to analyze what the IO model of the business is.
II. Assessment tools
(1) disk IO assessment tool
There are many tools to evaluate disk IO capability, such as orion, iometer, dd, xdd, iorate, iozone and postmark. Different tools support different operating system platforms and have their own characteristic application scenarios.
Some tools can simulate application scenarios. For example, orion is produced by Oracle and simulates the IO load of an Oracle database (it uses the same IO software stack as Oracle).
That is, it simulates Oracle reading and writing files or disk partitions (you can specify the read/write ratio, IO size, and sequential or random access), so you need to know your IO model in advance. If you do not, you can use automatic mode and let orion run through by itself; you will get the highest IOPS and MBPS, and the corresponding response times, at different read/write concurrency levels.
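As a minimal sketch of this automatic mode (hedged: option names are from common orion versions, and mytest.lun is a file you create listing the devices to test, one per line):
./orion -run simple -testname mytest -num_disks 2
Here -num_disks tells orion how many spindles to scale the load for; "-run advanced" additionally lets you specify the read/write ratio, IO size, and sequential or random access explicitly.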
Compared with orion, dd only reads and writes files and does not simulate application, business, or scenario effects.
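For instance, a hedged dd sketch (GNU dd on Linux; oflag=direct/iflag=direct bypass the page cache and are not available in AIX's dd; /tmp/ddtest is a scratch file):
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct
dd if=/tmp/ddtest of=/dev/null bs=1M iflag=direct
The first line measures sequential write bandwidth with 1 MB IOs, the second sequential read; neither says anything about random small-IO behavior.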
Postmark can read, write, create and delete files, and is suitable for testing small-file application scenarios.
(2) Network IO assessment tools
Ping: the most basic check; you can specify the packet size.
Iperf, ttcp: test the maximum bandwidth, delay and packet loss of tcp and udp protocols.
There are many tools for measuring bandwidth capability on the Windows platform: NTttcp, LANBench, pcattcp, LAN Speed Test (Lite), NETIO, NetStress.
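As an illustrative sketch (classic iperf2 option names; server.example.com is a placeholder):
iperf -s (run on the server side)
iperf -c server.example.com -P 4 -t 30 (client side: 4 parallel TCP streams for 30 seconds, reporting aggregate bandwidth)
iperf -c server.example.com -u -b 100M (UDP at a 100 Mbit/s target rate, reporting loss and jitter)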
III. Main monitoring indicators and common monitoring tools
(1) disk IO
On UNIX and Linux platforms, nmon and iostat are good tools for storage IO.
Nmon is suited to post-mortem analysis; iostat can be used for real-time viewing, or scripted to collect data for post-mortem analysis.
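A minimal collection sketch (option names as in common nmon builds and AIX iostat; the intervals are examples only):
nmon -f -s 30 -c 120 (record to a .nmon file every 30 seconds, 120 samples, about one hour, for post-mortem analysis)
iostat -Dl 5 3 (extended per-disk statistics every 5 seconds, 3 samples, for live viewing)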
1. IOPS
Total IOPS: Nmon DISK_SUMM sheet: IO/Sec
Read IOPS per disk: Nmon DISKRIO sheet
Write IOPS per disk: Nmon DISKWIO sheet
Total IOPS: command line iostat -Dl: tps
Read IOPS per disk: command line iostat -Dl: rps
Write IOPS per disk: command line iostat -Dl: wps
2. Bandwidth
Total bandwidth: Nmon DISK_SUMM sheet: Disk Read KB/s, Disk Write KB/s
Read bandwidth per disk: Nmon DISKREAD sheet
Write bandwidth per disk: Nmon DISKWRITE sheet
Total bandwidth: command line iostat -Dl: bps
Read bandwidth per disk: command line iostat -Dl: bread
Write bandwidth per disk: command line iostat -Dl: bwrtn
3. Response time
Read response time per disk: command line iostat -Dl: read avg serv, max serv
Write response time per disk: command line iostat -Dl: write avg serv, max serv
4. Other
Disk busy rate, queue depth, number of queue-full events per second, and so on.
(2) Network IO
1. Bandwidth
It is best to view traffic directly on the network device (more accurate); failing that, you can also view it on the business server.
Nmon: NET sheet
Command line topas: Network: BPS, B-In, B-Out
2. Response time
As a simple method, you can use the ping command to check whether ping latency is within a reasonable range and whether there is packet loss.
Some switches give the ping command a lower priority and may delay replying to or forwarding ping packets, so ping results do not necessarily reflect the real situation. If a more accurate measurement is needed, a probe can capture the time difference between a server sending the SYN packet when establishing a TCP connection and receiving the TCP SYN/ACK sent back by the peer.
A more accurate method, also convenient for later analysis, is to use professional network equipment to capture and analyze packets at the ports of the network devices.
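A quick sketch (Linux ping syntax; on AIX the packet size is given differently; target-host is a placeholder):
ping -c 100 -s 1024 target-host
This sends 100 pings with a 1024-byte payload; the summary shows min/avg/max latency and the packet loss rate.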
IV. Performance localization and optimization
(1) What are the tuning ideas for disk IO contention?
Typical question: in scenarios where the main contention is over IO, what are the tuning ideas? What are the main techniques or methods?
First, find out whether the IO contention is caused by an excessive amount of IO at the application level, or by the system level being unable to carry that IO volume.
If there are too many unnecessary reads and writes at the application level, solve the application problem first.
Example 1: the buffer used for sorting in the database is too small; during sorts there is a lot of data exchange between memory and disk, so this kind of IO can be reduced or avoided by enlarging the sort buffer.
Example 2: from the application point of view, some logs are not important at all and do not need to be written; the log level can be lowered or the logging skipped entirely, and at the database level a "no logging" hint can be added.
Second, an approach to analyzing storage problems
Storage IO problems may occur at any point along the IO path; analyze which link in the host / network / storage chain causes the IO bottleneck.
An IO travels a long path: application -> memory cache -> block device layer -> HBA card -> driver -> switched network -> storage front end -> storage cache -> RAID group -> disk.
Segment-by-segment analysis is required:
1. Host side: application -> memory cache -> block device layer -> HBA card -> driver
2. Network side: switched network
3. Storage side: storage front end -> storage cache -> RAID group -> disk
Analysis ideas:
1. Host side
When the delay observed on the host side is large but the delay on the storage side is small, the problem may lie on the host side or in the network.
The host is the initiator of the I/O, and its behavior is determined first by the host's business software and by its software and hardware configuration, for example the queue_depth parameter (queue length) introduced in the "service queue full" section, and of course many other parameters (such as the maximum I/O size the driver can send to the storage, the size of the HBA card's DMA memory area, the concurrency of the block devices, and the concurrency of the HBA card).
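As a hedged AIX illustration of checking and adjusting the host-side queue depth (hdisk1 and the value 32 are examples only; confirm safe values with the storage vendor first):
lsattr -El hdisk1 | grep queue_depth (show the disk's current queue depth)
iostat -Dl 5 1 (the sqfull field counts service-queue-full events per second)
chdev -l hdisk1 -a queue_depth=32 -P (stage a larger queue depth; -P applies it at the next device reconfiguration or reboot)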
If, after this troubleshooting, the performance problem still exists, move on to the networking, link, and storage side.
2. Network side
When the delay observed on the host side is large, the delay on the storage side is small, and there is no problem on the host side, the performance problem may lie on the link.
The possible problems include: a bandwidth bottleneck, improper switch configuration, switch failure, multipath routing errors, electromagnetic interference on the lines, damaged optical fibre, loose interfaces, and so on.
3. Storage side
If the delay on the host side is large, and the delay on the storage side is also large with little difference between them, the problem may lie in the storage. First, understand the IO model the storage currently carries and how the storage resources are configured, then collect performance data from the storage side and locate the problem along the I/O path.
Common causes include:
1) hard disk performance ceiling or mirroring bandwidth ceiling;
2) storage planning problems (e.g. stripes that are too small);
3) disk domain and storage pool partitioning (e.g. low-speed disks were chosen);
4) thin LUN vs. thick LUN, and the cache settings for the LUN (cache size; cache type: memory or SSD);
5) QoS limits on the LUN's IO bandwidth, or the LUN priority setting;
6) too few storage interface modules;
7) RAID level (performance roughly RAID10 > RAID5 > RAID6), stripe width, and stripe depth;
8) value-added features such as snapshots, clones, and remote replication slowing performance;
9) reconstruction, rebalancing, or similar operations in progress;
10) high CPU utilization on the storage controller;
11) short-term performance problems from a LUN that has not been formatted;
12) the parameters for flushing cache to disk (high/low watermark settings);
13) even whether the data sits at the center or the edge of the platters, and so on.
Each step has some specific methods, commands, and tools to view the performance, which I will not repeat here.
(2) What are the tuning ideas and suggestions for low-latency, high-speed trading applications in terms of IO?
Typical question: for the low-latency, high-speed trading requirements recently encountered in the securities industry, what tuning ideas and suggestions are there along the IO model and IO path?
For low-latency trading, analyze whether the business needs to persist its logs, and to what protection level, to decide what kind of IO to use.
1. From a business perspective
For example, if the business does not need to keep a log, the write IO can be dropped entirely.
Or, if the required protection level is not high, only one copy of the data can be written; logs with a higher protection level generally need to be written in duplicate or more.
2. From the perspective of storage media
1) Use all-SSD storage.
2) Or use SSD as a secondary cache (the primary cache is memory).
3) Or use storage tiering on the storage server (migrate hot data to better-performing drives such as SSD or SAS).
4) Use a RAM disk (memory used as disk).
5) Increase the cache on the storage server for the LUN.
3. From the perspective of configuration
For a LUN on ordinary disks, set a RAID level appropriate to the business scenario (such as RAID10).
Make the stripe depth greater than or equal to the IO size, with enough stripe width to support concurrent writes.
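A worked example with assumed numbers: if the typical IO is 256 KB, a stripe depth of 256 KB lets one IO be served by a single stripe unit instead of being split across disks, and with a stripe width of 8 disks up to 8 such IOs can be serviced in parallel. With a stripe depth of only 64 KB, each 256 KB IO would touch 4 disks and concurrent IOs would contend with each other.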
4. From the perspective of the IO path
Adopt high-speed networking technology in place of slower approaches such as iSCSI.
(3) Ideas and methods for locating network IO problems
Like disk IO, network IO must be analyzed segment by segment. Use network capture and analysis tools to diagnose in which segment the network delay, packet loss, or other anomaly occurs, and then analyze that segment specifically.
At the same time, capturing an iptrace on the host side can help diagnose many network problems.
If network delay, jitter, packet loss, interruption, or other anomalies have been identified from the application level or by means such as ping, in-depth diagnosis and analysis is needed.
At this point, the best method is to use professional network capture equipment and the vendor's corresponding tools for analysis and diagnosis. This does not affect the performance of the server itself, and diagnostic conclusions can be reached relatively quickly. There are many vendors and tools of this kind on the market.
However, this section introduces another means of preliminary analysis: iptrace. By collecting an iptrace log on the business system and then analyzing it with the wireshark tool, a rough judgment of the problem can be made.
1. Capturing with iptrace
Here is an example:
Enable tracing: startsrc -s iptrace "-a -s <target IP> -b -S 1500 -L 1073741824 /var/trace/iptrace.out"
This command means: iptrace records traffic between the local machine and the target machine in both directions; at most the first 1500 bytes of each packet are captured; and the log file is limited to 1073741824 bytes (1 GB).
Disable tracing: stopsrc -s iptrace
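Optionally, on AIX the binary trace can also be rendered as text with ipreport before, or instead of, loading it into wireshark (a sketch; flag combinations vary by AIX level):
ipreport -rsn /var/trace/iptrace.out > /var/trace/iptrace.txt
Here -r decodes RPC information, -s prefixes each line with the protocol, and -n numbers the packets.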
2. Introduction to Wireshark
Wireshark is a tool for viewing network packets (commonly used on the Windows platform). Its core capability is quickly filtering out the information you need and quickly locating the problem. Open the iptrace.out file with Wireshark; the main columns are:
No. - sequence number of the captured packet
Time - timestamp
Source - source IP
Destination - destination IP
Length - total byte length of the packet
Protocol - protocol type of the packet
Info - basic information about the packet
The detail pane below the list corresponds to the OSI layered model:
Frame: physical layer overview of the data frame
Ethernet II: data link layer Ethernet frame header
Internet Protocol Version 4: Internet layer IP packet header
Transmission Control Protocol: transport layer TCP segment header
WebSphere MQ: application layer information, here MQ
Common filter commands
You can filter and display a list of network packets according to your needs. The common filter criteria are as follows:
1. Filter by packet length (the Length value): frame.len == <length>
2. Filter by source IP: ip.src eq 10.x.x.x
3. Filter by source IP and protocol type at the same time: ip.src eq 10.x.x.x && mq
4. Filter by the required protocol types: mq && tcp
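The same display filters can also be applied non-interactively with tshark, wireshark's command-line companion (assuming it is installed; the filter syntax is identical):
tshark -r iptrace.out -Y "tcp.analysis.retransmission"
tshark -r iptrace.out -Y "ip.src == 10.x.x.x && mq"
The -r option reads the capture file and -Y applies a display filter to the printed packets.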
3. Analyze an example
After opening the iptrace file, first look at the rows highlighted in black, which mark the network transfers wireshark considers problematic.
Taking an example from the author's practice: network delay in a system environment was very unstable. When the captured iptrace was opened with wireshark, a large number of black stripes were found, and a period free of errors could hardly be found.
There are a variety of problems, such as:
A large number of tcp keep-alive / tcp keep-alive ack packets; reply packets whose SEQ value does not equal the expected ACK; RST ACK; a large number of Retransmission; Previous segment uncaptured; ACKed unseen segment; a lot of Dup ack; Destination unreachable.
Finding so many problems in just a few minutes of trace was quite dismaying.
3.1 A large number of tcp keep-alive / tcp keep-alive ack
Out of roughly every 10 packets, 2 were tcp keep-alive / tcp keep-alive ack packets, which takes up a lot of network bandwidth.
First of all, the keep-alive related parameter settings on the server were slightly problematic:
no -a | grep tcp
tcp_keepcnt = 8
tcp_keepidle = 20
tcp_keepinit = 50
tcp_keepintvl = 2
These parameters mean: if a tcp connection is idle for 10 seconds with no message transmitted (tcp_keepidle = 20, in units of 0.5 seconds), start sending probes (tcp keep-alive). If a probe does not succeed, send one every second (tcp_keepintvl = 2, same units); if 8 consecutive probes (tcp_keepcnt = 8) get no response from the peer, close the tcp connection.
It is true that this probing is more frequent than the default values, but consider: if a probe succeeds, no further probing should be needed. Why were there so many probes?
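If the probe frequency itself needs relaxing, the parameters can be changed with the AIX no command (the values below are the usual AIX defaults, shown for illustration; units are 0.5 seconds):
no -o tcp_keepidle=14400 (start probing only after 2 hours of idle)
no -o tcp_keepintvl=150 (then probe every 75 seconds)
Note, though, that the real issue in this case, analyzed next, was that the probes kept getting unexpected replies, not the frequency setting alone.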
3.2 The SEQ value of a large number of reply packets does not equal the ACK
A normal reply's SEQ value should equal the ACK value of the previous request, yet in the trace the SEQ value in almost every keep-alive ACK equals the keep-alive initiator's ACK+1. In other words, each probe's reply was not the expected result, so probing continued, producing the large number of keep-alive packets.
The Seq and Ack fields are key to TCP's reliable transport service. Seq is the sequence number of the packet (TCP treats the data as an ordered byte stream and implicitly numbers every byte of it); Ack is the Seq value expected in the next reply packet.
3.3 RST ACK
An RST may mean that B is sending too much and A is telling B to stop sending until A notifies B that it may resume; or that A is closing an abnormal connection and clearing its buffers.
Since only one RST ACK was found, there was no in-depth analysis.
3.4 A large number of Retransmission
Retransmission indicates poor network quality, possibly caused by congestion, delayed replies from the target, packet loss in network equipment, poor network cable quality, loose interfaces, electromagnetic interference, and so on.
3.5 Previous segment uncaptured
Previous segment uncaptured: when the sequence number of the currently received packet is higher than the next expected sequence number on the connection, it indicates that one or more earlier packets were not captured. This also appears from time to time in normal communication, and it was not frequent here, so it was not analyzed in depth.
3.6 ACKed Unseen segment
TCP ACKed unseen segment means a reply acknowledges a segment that was not captured, similar to the previous case of TCP Previous segment not captured. Because it also appears in normal communication and did not occur many times, it was not analyzed in depth.
3.7 A large number of Dup ACK
A Dup ACK means the expected data was not received and the peer is asked to send it again; more than 3 Dup ACKs trigger retransmission and slow down the transfer. Analysis found that packets sent by the other server in the environment were arriving out of order at the interface on the core-switch side near the host.
3.8 Destination unreachable
Destination unreachable: the destination cannot be reached. Since this error occurred only once, it was not analyzed in depth.
To sum up, the iptrace results show many problems, but after preliminary analysis the main ones boil down to: 1) the SEQ value of the peer's TCP replies is not the expected value; and 2) the network quality is poor. The next step is to identify the problematic peer node and use network capture equipment to determine which segment of the network has poor quality.
(4) Cases misjudged as IO problems
In many cases, the application's response time is very slow and it looks like an IO problem, but it is not. Here are two examples.
1. [Case sharing]: Oracle buffer waits account for the majority of total time
In one scenario, the first of the top 10 events in Oracle's AWR report was: buffer busy waits.
Buffer busy waits is a relatively general wait event, caused by a session waiting for some buffer, but it does not say which kind of buffer. For example, log sync waits can also cause buffer busy waits.
This is an associated indicator; leave its analysis aside for the moment and look at the events adjacent to it.
The second of the AWR top 10 events was enq: TX - index contention.
The neighboring event here is enq: TX - index contention. Index contention is often caused by index splits under a large number of concurrent INSERTs: the B-tree grows as the index is continually updated, and when a block needs to split, other sessions must wait. (The analysis here requires some database knowledge.)
In the subsequent tuning, the index was partitioned to avoid the contention. After adjustment and retesting, both index contention and buffer busy waits disappeared from the top 10 events.
Database wait events of this kind are very common: they look like waiting on IO, but the real problem lies in the planning and design of the database.
2. [Case sharing]: intermittent increases in ping delay
The response time of a business system was very unstable. The system consists of two types of servers, call them A and B: A is the client and B is the server, and the business response time on A was very unstable.
Step one:
Trace the cause across the various resources (CPU, memory, network IO, disk IO). It was finally found that the network delay between A and B was very unstable. When A pings B in a LAN environment, the delay should be between 0 and 1 ms, but at business peaks there were short stretches with delays of 100-200 ms, and even with no business running ping showed delays of 30-40 ms.
Step 2:
So we set out to locate the network problem.
We started checking the network. After a series of measures such as changing A's physical port, changing the switch, changing the network cable, and changing the physical port on the peer end, the problem still remained.
Step 3:
Using network probing equipment, packets were captured at the ports on both sides of the switch to analyze where the time in establishing a tcp connection was being spent. The analysis showed the 200 ms delay occurred on the B side; connection establishment consumed little time on the A side and at the switch.
Step 4:
Multiple partitions on the B side share one physical machine. We guessed the cause was too many partitions: with only one LPAR started there was no ping delay, with some of the LPARs started the delay was smaller, and with all LPARs started the ping delay was larger.
Root cause of the problem:
At this point the problem came to light: too many partitions were delaying B's ping replies to A. Why does this happen? The CPU resources of one physical machine are limited (3 CPUs in this environment). Even with a single LPAR, its N processes take turns using the CPUs, let alone M LPARs whose M*N processes take turns using the three CPUs. Of course the scheduling algorithm is not that simple; this is only a theoretical explanation.
Assuming each CPU time slice is 10 ms, in the extreme case a process may wait as long as (M*N - 1) * 10 ms / 3 for a CPU.
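Plugging in purely illustrative numbers: with M = 5 LPARs of N = 60 processes each, the worst case is (5 * 60 - 1) * 10 ms / 3, roughly 1 second, so intermittent delays of 100-200 ms are entirely plausible.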
Moreover, with so many LPAR processes rotating over the CPUs, the data in the CPU caches has presumably long been evicted, and reloading it costs still more time.
Coping methods:
Previously the LPARs did have guaranteed CPU (a guaranteed amount of MIPS), but that guarantees only quantity, not quality (the CPU cache, i.e. affinity, problem mentioned above).
The remedy was to assign the important LPARs dedicated CPUs, guaranteeing the quality of the CPU resource: with as few parties as possible rotating over those CPUs, the data in the CPU cache is evicted as little as possible. In practice the ping delay basically disappeared, proving the method effective.
This case seems to be a network problem, but it is actually a problem of resource scheduling.
Incidentally, in many cases instability in client response time is caused by instability in the server side's service capability, generally due to application or database problems. In this case the intermittent ping delay occurred at the operating-system level, which can easily mislead analysis and judgment.