Summary of slow-troubleshooting problems in MySQL database (compiled from "MySQL troubleshooting" Ye Jinrong) 07/01 Update SLTechnology News&Howtos

Summary of slow-troubleshooting problems in MySQL database (compiled from "MySQL troubleshooting" Ye Jinrong)

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

1. Common bottlenecks

1) SQL is inefficient

(2) Improper configuration of options

(3) Interview questions soar

(4) Low hardware performance

(5) Other processes compete for resources

2. How to confirm that MySQL has a bottleneck

top/free/vmstat/sar/mpstat Confirm

- Confirm CPU consumption ratio of mysqld process

- Confirm if CPU consumption of mysqld process is %user or %sys high

- Confirm if physical memory is insufficient

- Confirm if swap occurs

- Confirm if there are a lot of interrupts (or uneven interrupts) on the CPU

2.1 View CPU-top

2.2 View Memory-Free

free -m

Free related commands

2.3, View IO, CPU, Memory, Swap Partition, Interrupt-vmstat

vmstat -S m 1

As can be seen from the above, CPU and I/O pressure is not small.

Output Description:

Interpretation of the vmstat command:

1）procs

r This column shows how many processes are waiting on the CPU

Column b shows how many processes are uninterruptedly dormant (usually meaning they are waiting for I/O, such as disk, network, user input, and so on).

2）memory

swpd The size of virtual memory used (showing how many blocks have been swapped out to disk (page swap)). If it is greater than 0, it means that your machine is running low on physical memory. If it is not the cause of the program memory leak, then you should upgrade the memory or migrate the memory-consuming task to another machine.

free Size of free physical memory

buff How many blocks are being used as buffers

cache How much is being used as a cache for the operating system

swap shows page swap activity: how many blocks are being swapped in (from disk) and out (to disk) per second

si reads the size of virtual memory from disk per second. If this value is greater than 0, it means that physical memory is not enough or memory leaks. It is necessary to find a memory-consuming process to solve it.

so The size of virtual memory written to disk per second. If this value is greater than 0, the same as above.

In general, the values of si and so are 0. If the values of si and so are not 0 for a long time, it means that the system memory is insufficient, and it is necessary to consider whether to increase the system memory.

4) IO shows how many blocks are read (bi) and written (bo) from the block device

bi The number of blocks received per second by a block device. Block devices here refer to all disks and other block devices on the system. The default block size is 1024 bytes. I don't have any IO operations on my own machine, so it's always 0. However, I've seen it on machines that handle large amounts of copied data (2-3T). The disk write speed is almost 140M per second.

bo The number of blocks a device sends per second, e.g. we read a file, bo must be greater than 0. bi and bo should be close to 0, otherwise IO is too frequent and needs to be adjusted.

The bi+bo reference value set here is 1000. If it exceeds 1000 and the wa value is relatively large, it indicates the system disk IO performance bottleneck.

5) system shows the number of interrupts (in) and context switches (cs) per second

in Number of CPU interrupts per second, including time interrupts

cs Number of context switches per second, for example, we call system functions, context switches, thread switches, process context switches, this value should be as small as possible, too large, to consider reducing the number of threads or processes. System calls are also, every time we call system functions, our code will enter the kernel space, resulting in context switching, this is very resource-intensive, but also try to avoid frequent calls to system functions. Too many context switches means that most of your CPU is wasted on context switches, resulting in less time for CPU to do serious work, CPU is not fully utilized, it is not desirable.

The higher these two values are, the more CPU time you see consumed by the kernel.

6）CPU

us user CPU time. When the value of us is relatively high, it means that the user process consumes more CPU time, but if it is used for more than 50% for a long time, then we should consider optimizing the program algorithm or other measures.

sy system CPU time, if too high, means that the system call time is long, such as IO operations frequently.

If sys is too high, it means that the system kernel consumes too many cpu resources. This is not a benign performance. We should check the reason.

id idle CPU time, in general, id + us + sy = 100, generally I think id is idle CPU usage, us is user CPU usage, sy is system CPU usage.

wa Wait IO CPU time.

When Wa is too high, it indicates that io wait is serious, which may be caused by a large number of random accesses to the disk, or it may be a bottleneck in the bandwidth of the disk.

st column is generally not concerned, percentage of time virtual machines consume

2.4 View CPU and IO-sar

view CPU

sar -u 1

Output Item Description:

CPU: all indicates that the statistics are the average of all CPUs.

%user: Displays the percentage of total CPU time running at the user level (application).

%nice: Percentage of CPU time consumed in user mode by processes whose scheduling priorities were changed by nice

%system: Percentage of total CPU time spent running at kernel level.

%iowait: Displays the percentage of total CPU time spent waiting for I/O operations.

%steal: Percentage of hypervisor waiting for virtual CPU to service another virtual process.

%idle: Displays the percentage of CPU idle time consumed by the total CPU time.

1. If %iowait is too high, the hard disk has an I/O bottleneck

2. If the %idle value is high but the system response is slow, it is possible that the CPU is waiting to allocate memory, and the memory capacity should be increased.

3. If the %idle value is consistently lower than 1, the CPU processing power of the system is relatively low, indicating that the CPU is the resource that needs to be solved most in the system.

View IO status

tps: Number of I/Os per second from physical disks. Multiple logical requests are combined into a single I/O disk request, the size of a single transfer is indeterminate.

rd_sec/s: Number of sector reads per second.

wr_sec/s: Number of sectors written per second.

avgrq-sz: Average data size (sectors) per device I/O operation.

avgqu-sz: Average length of disk request queue.

await: Average elapsed time per request from request to disk operation to system completion, including queue wait time, in milliseconds (1 second =1000 milliseconds).

svctm: The average time the system takes to process each request, excluding time spent in the request queue.

%util:I/O requests as a percentage of CPU, the higher the ratio, the more saturated.

1. Low avgqu-sz values result in higher utilization of equipment.

2. When the %util value is close to 1%, the device bandwidth is full.

2.5 View interruptions

mpstat -P ALL -I SUM 1 100

3.1 Show which threads are running

show processlist; or show full processlist

Status 1: Sending data

mysql> show processlist\G

As can be seen from the above, it is a long-term sending data

Sending data: Indicates the status of data read from the engine layer and returned to the Server side

Reasons for long-term existence:

(1) No proper index, low query efficiency

(2) Read a lot of data, read slowly

(3) High system load, slow reading

Solution:

(1) Add appropriate index

(2) Or rewrite SQL to improve efficiency

(3) Increase LIMIT to limit the amount of data read at a time

(4) Check & upgrade I/O device performance

Waiting for table metadata lock

show processlist; or show full processlist

As can be seen from the above: long wait for MDL lock

Reason:

(1) DDL is blocked, which in turn blocks his subsequent SQL

(2) SQL before DDL is not finished for a long time

Solution:

(1) Improve the efficiency of each SQL

(2) Kill long-running SQL

(3) Place DDL at low points such as midnight

(4) Execute DDL with pt-osc

State 3: Sleep

As can be seen from the above: Sleep

It looks harmless, but it could be a big pest.

(1) Number of connections occupied

(2) Memory consumption is not released

(3) There may be row locks (or even table locks) that are not released

Solution:

(1) Reduce timeout appropriately

(2) Active Kill timeout inactive connection

(3) Regularly check lock and lock waiting

(4) The pt-kill tool can be used

State 4: Other States

(1)Status: Copy to tmp table

Reason:

1) Executing alter table to modify table structure, temporary table needs to be generated

2) It is recommended to perform at night trough, or use pt-osc

(2)Copying to tmp table [on disk]

Creating tmp table

Common in cases where group by has no index

Need to copy data to temporary table [memory/disk]

The Using temporary keyword appears in the execution plan

It is recommended to create appropriate indexes and eliminate temporary tables

（3) Creating sort index

Common in case of order by without index

Filesort sorting is required

The Using filesort keyword appears in the execution plan

It is recommended that you create a sort index

(4) Other states

Waiting for global read lock

Waiting for query cache lock

Waiting for table level lock

Waiting for table metadata lock

3.2 Check lock- mysql lock troubleshooting process

mysql> select * from information_schema.innodb_trx;

mysql> select * from information_schema.innodb_locks;

View lock waiting

mysql> select * from information_schema.innodb_lock_waits;

mysql> select * from sys.innodb_lock_waits;

mysql lock checking process

1) Check the current lock waiting situation

INNODB_TRX lock situation:

mysql> SELECT * FROM INNODB_TRX\G;

2) Look at the correlation between lock waiting and lock holding

mysql> SELECT * FROM INNODB_LOCK_WAITS\G;

3) See why the lock is waiting

mysql> SELECT * FROM INNODB_LOCKS\G;

3.3 Check the status of Innodb.

show engine innodb status\G

View MySQL thread status

3.4 View slow logs 4. How to prevent 4.1. Before business goes online

(1) Eliminate garbage SQL in advance,

(2) In development or pressure testing environments

Lower the value of long_query_time, even to 0

Open log_queries_not_using_indexes

Analyze slow query logs and eliminate potential SQL hazards

4.2 with better facilities.

(1) CPU is faster and more cores

(2) Memory is faster and larger

(3) Use faster I/O devices

(4) Use better network equipment

4.3 Disk File System and Scheduling Algorithm

(1)xfs/ext4 file system

(2) Use noop/deading io scheduler

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.