Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Using sys schema to solve a weird sentence hang problem

2025-04-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)06/01 Report--

Walker science and technology Luo Xiaobo

Guide reading

1. Background of the story

2. Reappearance and analysis

3. Solution

4. Summary 1. Before the story begins, list the operating environment information of the database. Operating system: redhat 7.2x8room64 file system: xfs database version: MySQL 5.7.17 host configuration:

* CPU:32 vcpus

* memory: 128 GB

* disk: single disk intel SSD 320g (only data and binlog for mysql) the main configuration parameters are set: innodb_buffer_pool_size = 96g _ innodbrecords _ log _ blank filesizes = 2g _ innodbrecords _ flushcards filesize = 2g _ method = _ Binlog_rows_query_log_events=ON this instance is connected to a highly available mechanism: when the HA heartbeat detection mechanism 60s discovers that the instance continues to fail in detection (once every 5 seconds), directly try to shut down the failed MySQL instance and host, and switch the storyline with high availability: a customer reported a strange problem a few days ago, and a MySQL instance is accessible and inaccessible. Looking at the relevant logs, it is found that the instance repeatedly performs high availability switchover. During the preliminary troubleshooting, it is found that the disk load is high at the time of server failure, but it is not fully utilized. Before the high availability switch, there are some insert running in the database. Select statements and a large number of heartbeat detection statements, while heartbeat detection statements are independent of the business table, and heartbeat detection statements only update a row of data, how can a bunch of heartbeat detection statements be blocked? After a lot of trouble, finally found the reason, the specific process, please see below! 2. Reappearance and analysis

After a brief analysis, although business SQL and heartbeat SQL operate different tables, there is no lock waiting problem, but from the results of show processlist;, the execution time of business SQL is the longest, and new heartbeat statements are blocked constantly (of course, the switching mechanism of HA must be turned off first, otherwise it may be switched after 60s when the operation is repeated later), it seems that it is the business SQL that blocks heartbeat SQL. And the data volume of these business SQL operations is up to 30 million rows (because it is insert … Select statement, so the amount of data you can see from slow logs or innodb_ trx tables or sys.session views is actually as much as 6KW, which doubles. But the specific connection between the two cannot be seen intuitively. We have followed the following steps to reproduce:

2.1. First of all, prepare the reproduction environment according to the online environmental standards, and build a testing machine.

2.2. Create a heartbeat table and insert a row of data

Click (here) to collapse or open

Root@localhost: test:33: > CREATE TABLE `xx_ heartbeat` (

`server_ id` int (10) unsigned NOT NULL

`hb_ time`datetime NOT NULL

PRIMARY KEY (`server_ id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin

Root@localhost: test:33: > insert into xx_heartbeat values (@ @ server_id,now ())

Root@localhost: test:33: > select * from xx_heartbeat

+-+ +

| | server_id | hb_time |

+-+ +

| | 3306103 | 2017-10-11 12:33:26 |

+-+ +

2 rows in set (0.00 sec) 2.3. Turn on waiting event collection in performance_schema (to save trouble, use the function operation under sys schema instead of using update statement to directly modify the configuration table of performance_schema)

Click (here) to collapse or open

Root@localhost: test:34: > use sys

Database changed

Root@localhost: sys:36: > call ps_setup_enable_instrument ('wait')

+-+

| | summary |

+-+

| | Enabled 303 instruments |

+-+

1 row in set (0.04 sec)

Query OK, 0 rows affected (0.04 sec)

Root@localhost: sys:36: > call ps_setup_enable_consumer ('wait')

+-+

| | summary |

+-+

| | Enabled 3 consumers |

+-+

1 row in set (0.01 sec)

Query OK, 0 rows affected (0.01 sec) 2.4. Open 4 MySQL session connections and execute insert on 4 different tables in turn. Select operation, the amount of table data is all 3KW (the test data are 4 tables of 3KW data created by sysbench)

Click (here) to collapse or open

# session 1

ADMIN@127.0.0.1: (none): 15: > use sbtest

Database changed

ADMIN@127.0.0.1: sbtest:37: > insert sbtest1 (kmaec padpad) select kpje padpad from sbtest1

# session 2

ADMIN@127.0.0.1: (none): 15: > use sbtest

Database changed

ADMIN@127.0.0.1: sbtest:37: > insert sbtest3 (kmaec padpad) select kpje padpad from sbtest3

# session 3

ADMIN@127.0.0.1: (none): 18: > use sbtest

Database changed

ADMIN@127.0.0.1: sbtest:37: > insert sbtest2 (kmaec padpad) select kpje padpad from sbtest2

# session 4

ADMIN@127.0.0.1: (none): 18: > use sbtest

Database changed

ADMIN@127.0.0.1: sbtest:37: > insert sbtest4 (kmam cpadt) select kpje cpadt from sbtest42.5. Update the heartbeat table every 5 seconds. After tens of minutes, it is found that the heartbeat SQL is blocked. If the HA switching mechanism is not off, the blocked heartbeat SQL will be switched over 60s. After the HA switching mechanism is turned off, the automatic heartbeat will stop detection, so here we manually simulate a heartbeat SQL and use the script loop to update every 5 seconds. The script prints both show processlist and sys.session view information when it finds blocking. Script link: http://5d096a11.wiz03.com/share/s/1t2mEh0a-kl_2c2NZ33kSiac2wQRmx1ykAEE23yLIT3aKVmx)

Click (here) to collapse or open

# heartbeat SQL statement (this sentence can be executed manually every 5 seconds, but it is sure that the hands will be sore. However, in order to facilitate everyone to read, the continuous detection with script is used in the recurrence process, and the heartbeat statement is executed manually when it is found to be blocked, in order to facilitate interception)

ADMIN@127.0.0.1: test:01: > update xx_heartbeat set hb_time=now () where server_id=@@server_id

Query OK, 0 rows affected (0.00 sec)

Rows matched: 1 Changed: 0 Warnings: 0

ADMIN@127.0.0.1: test:01: > update xx_heartbeat set hb_time=now () where server_id=@@server_id2.6. Immediately open a new session connection to see what the database is doing, and you can find the insert for 4 sessions. Select statement, and heartbeat SQL update xx_heartbeat...

Click (here) to collapse or open

ADMIN@127.0.0.1: sys:25: > show processlist

+- -+

| | Id | User | Host | db | Command | Time | State | Info |

+- -+

| | 25 | ADMIN | 127.0.0.1 Query 35458 | sys | Query | 0 | starting | show processlist |

| | 26 | ADMIN | 127.0.0.1 Query 35459 | sbtest | Query | 1353 | query end | insert sbtest1 (KMagazine cMagnePad) select KMagazine cMagpad from sbtest1 |

| | 27 | ADMIN | 127.0.0.1 Query 35460 | sbtest | Query | 1352 | query end | insert sbtest3 (KMagazine cMagnePad) select KMagazine cMagpad from sbtest3 |

| | 30 | ADMIN | 127.0.0.1 Query 35463 | sbtest | Query | 1352 | query end | insert sbtest2 (KMagazine cMagnePad) select KMagazine cMagpad from sbtest2 |

| | 31 | ADMIN | 127.0.0.1 Query 35464 | sbtest | Query | 1351 | query end | insert sbtest4 (KMagnec padpad) select KMagazine padpad from sbtest4 |

| | 52 | ADMIN | 127.0.0.1 Query 35485 | test | Query | 62 | query end | update xx_heartbeat set hb_time=now () where server_id=@@server_id |

| | 151 | qfha | 10.10.40.167 Query 51328 | NULL | Query | 562 | starting | SHOW BINARY LOGS |

.

| | 160 | qfha | 10.10.40.167 Query 51337 | NULL | Query | 22 | starting | SHOW BINARY LOGS |

+- -+

16 rows in set (0.00 sec) 2.7. Check the server load.

Click (here) to collapse or open

# top

Top-22:51:21 up 3 days, 6:54, 9 users, load average: 5.74,3.39,1.70

Tasks: 980 total, 1 running, 979 sleeping, 0 stopped, 0 zombie

Cpu (s): 0.4%us, 0.4%sy, 0.0%ni, 97.3%id, 1.9%wa, 0.0%hi, 0.0%si, 0.0%st

Mem: 131804428k total, 129297568k used, 2506860k free, 282052k buffers

Swap: 7974908k total, 106080k used, 7868828k free, 33743992k cached

PID USER PR NI VIRT RES SHR S CPU MEM TIME+ COMMAND

22080 mysql 20 0 86.1g 71g 12m S 23.8 57.0 602:42.90 mysqld

29121 qemu 20 0 17.5g 8.2g 6160 S 6.9 6.5 140:51.21 qemu-kvm

20475 root 20 0 0 0 S 0.7 0.0 0:28.71 xfs-cil/dm-6

18272 root 20 0 15692 1952 924 R 0.3 0.0 0:00.33 top

1 root 20 0 19356 1408 1228 S 0.0 0.0 0:02.14 init

2 root 20 00 00 S 0.0 0.0 0:00.01 kthreadd

3 root RT 00 00 S 0.0 0.0 0:00.28 migration/0

.

# free

# free-m

Total used free shared buffers cached

Mem: 128715 121616 7099 0 275 28429

-/ + buffers/cache: 92911 35803

Swap: 7787 88 7699

# vmstat

[root@10-10-66-229] # vmstat 1 10

Procs-memory--swap---io-----system---cpu-

R b swpd free buff cache si so bi bo in cs us sy id wa st

2 1 90392 2484712 281996 33758712 0 0 0 266326 14803 19717 9 1 87 3 0

7 1 90392 2454332 281996 33787004 00 0 263063 14746 18893 10 1 87 2 0

4 10 90392 2427820 281996 33814484 0 0 0 266977 13522 18440 10 1 87 20

40 90392 2399888 281996 33840928 0 0 0 254064 14727 18820 10 1 87 20

4 1 90392 2373260 281996 33865532 0 0 0 255281 15479 19072 10 0 88 2 0

4 1 90392 2344072 281996 33894036 0 0 16 261650 13296 18970 10 0 87 2 0

5 0 90392 2316840 281996 33920892 0 0 0 253947 12292 18442 10 1 88 20

4 0 90392 2289436 281996 33947748 0 0 0 273755 13894 19790 10 1 87 3 0

4 0 90392 2258064 281996 33977544 0 0 0 265602 12351 18488 10 0 87 2 0

5 1 90392 2230940 281996 34005292 00 0 269967 12720 19439 90 88 20

.

# iostat

[root@10-10-66-229] # vmstat-x 1 10

.

Avg-cpu:% user nice% system% iowait% steal% idle

10.28 0.00 0.50 2.51 0.00 86.72

Device: rrqm/s wrqm/s rUnip s wdeband s rsec/s wsec/s avgrq-sz avgqu-sz await svctm% util

Sdc 0.00 1031.00 0.00 6041.00 0.00 537654.00 89.00 13.20 2.19 0.16 94.10

.

Avg-cpu:% user nice% system% iowait% steal% idle

9.83 0.00 0.53 2.54 0.00 87.10

Device: rrqm/s wrqm/s rUnip s wdeband s rsec/s wsec/s avgrq-sz avgqu-sz await svctm% util

Sdc 0.00 891.00 0.00 6034.00 0.00 524171.00 86.87 12.78 2.12 0.16 94.70

.

2.8. From the information seen in steps 2.6and 2.7, the CPU and memory of the server are not the bottleneck. Although the disk load is high and occasionally uses swap and about 20.3% IOWAIT, it does not completely block a single row of updated heartbeat SQL. Moreover, the business SQL and heartbeat SQL are different tables, and there is no problem that table locks and transaction locks block each other. (checked with sys.innodb_lock_waits view The lock wait information is empty). From the show processlist information of mysql, there is no connection between these SQL. We can only see that these SQL have been in the state of query end for a long time, which is obviously abnormal. In addition, there are some show binary logs statements, which are used by the monitoring system to monitor the size of binlog, and this statement is increasing, which seems to be a problem with the access of binlog.

2.9. At this point, the usual troubleshooting methods are overshadowed by the fact that we enable sys schema, open a new session connection, and use the session view to find out what the SQL that these active sessions are doing.

Click (here) to collapse or open

ADMIN@127.0.0.1: sys:25: > select * from session where connexidshaped connectionroomid ()\ G

* * 1. Row *

Thd_id: 60

Conn_id: 26

User: ADMIN@127.0.0.1

Db: sbtest

Command: Query

State: query end

Time: 6355

Current_statement: insert sbtest1 (kMagnec Pad) select kMagnec Pad from sbtest1 # Business SQL for operating sbtest1 tables

Statement_latency: 22.58 m

Progress: NULL

Lock_latency: 1.25 ms

Rows_examined: 60000000

Rows_sent: 0

Rows_affected: 0

Tmp_tables: 1

Tmp_disk_tables: 1

Full_scan: YES

Last_statement: NULL

Last_statement_latency: NULL

Current_memory: 0 bytes

Last_wait: wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done # found that a session with a connection ID of 26 was waiting for this event when executing an insert...select statement (binlog's COND_done mutex)

Last_wait_latency: Still Waiting # the field value of "Still Waiting" indicates that the thread is currently waiting for the event displayed in the last_wait field

Source: the number of source files and lines of code for waiting events in binlog.cc:1949 # here

Trx_latency: NULL

Trx_state: NULL

Trx_autocommit: NULL

Pid: 4571

Program_name: mysql

* 2. Row * *

Thd_id: 61

Conn_id: 27

.

Current_statement: insert sbtest3 (kmae cjingpad) select kje cje padpad from sbtest3

Statement_latency: 22.57 m

.

Last_wait: wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_log # found that a session with a connection ID of 27 was waiting for this event when executing an insert...select statement (binlog's LOCK_log mutex)

Last_wait_latency: Still Waiting # interprets the same fields as the first line of information

Source: binlog.cc:8587 # interprets the same fields as the first line of information

.

* 3. Row * *

Thd_id: 64

Conn_id: 30

.

Current_statement: insert sbtest2 (kMagnec Pad) select kMagnec Pad from sbtest2 # Business SQL for operating sbtest2 tables

Statement_latency: 22.57 m

.

Last_wait: wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done # found that a session with a connection ID of 30 was waiting for this event (binlog's COND_done object) while executing the insert...select statement

Last_wait_latency: Still Waiting # interprets the same fields as the first line of information

Source: binlog.cc:1949 # interprets the same fields as the first line of information

.

* * 4. Row *

Thd_id: 65

Conn_id: 31

User: ADMIN@127.0.0.1

Db: sbtest

Command: Query

State: query end

Time: 6353

Current_statement: insert sbtest4 (kMagnec Pad) select kMagnec Pad from sbtest4 # Business SQL for operating sbtest4 tables

Statement_latency: 22.55 m

Progress: NULL

Lock_latency: 1.55 ms

Rows_examined: 60000000

Rows_sent: 0

Rows_affected: 0

Tmp_tables: 1

Tmp_disk_tables: 1

Full_scan: YES

Last_statement: NULL

Last_statement_latency: NULL

Current_memory: 0 bytes

# found that a session with a connection ID of 31 is waiting for this event when executing an insert...select statement (performing a file IO operation of binlog, indicating that you are writing from binlog cache to binlog file, but this is only a buffer to the file system, not sync)

Last_wait: wait/io/file/sql/binlog

Last_wait_latency: Still Waiting # interprets the same fields as the first line of information

Source: mf_iocache.c:1566 # interprets the same fields as the first line of information

Trx_latency: NULL

Trx_state: NULL

Trx_autocommit: NULL

Pid: 4881

Program_name: mysql

.

* * 5. Row * *

Thd_id: 185

Conn_id: 151

User: qfha@10.10.40.167

Db: sys

Command: Query

State: starting

Time: 564

Current_statement: SHOW BINARY LOGS

Statement_latency: 9.40 m

Progress: NULL

Lock_latency: 0 ps

Rows_examined: 0

Rows_sent: 0

Rows_affected: 0

Tmp_tables: 0

Tmp_disk_tables: 0

Full_scan: NO

Last_statement: NULL

Last_statement_latency: NULL

Current_memory: 0 bytes

Last_wait: wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_log

Last_wait_latency: Still Waiting

Source: rpl_master.cc:707

Trx_latency: NULL

Trx_state: NULL

Trx_autocommit: NULL

Pid: NULL

Program_name: NULL

* 6. Row * *

.

* 11. Row * *

Thd_id: 86

Conn_id: 52

User: ADMIN@127.0.0.1

Db:test

Command: Query

State: query end

Time: 244

Current_statement: update xx_heartbeat set hb_... () where server_id=@@server_id # SQL for operating the heartbeat table

Statement_latency: 1.07 m

Progress: NULL

Lock_latency: 219.00 us

Rows_examined: 1

Rows_sent: 0

Rows_affected: 0

Tmp_tables: 0

Tmp_disk_tables: 0

Full_scan: NO

Last_statement: NULL

Last_statement_latency: NULL

Current_memory: 0 bytes

Last_wait: wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done # found that a session with a connection ID of 52 was waiting for this event when executing a heartbeat SQL statement

Last_wait_latency: Still Waiting # interprets the same fields as the first line of information

Source: binlog.cc:1949 # interprets the same fields as the first line of information

Trx_latency: NULL

Trx_state: NULL

Trx_autocommit: NULL

Pid: 6772

Program_name: mysql

* 12. Row * *

.

15 rows in set (0.15 sec) 2.10. From the information in step 2.9, it is almost certain that it is caused by the problem that the writing binlog cannot respond in time, but what exactly is the problem with binlog? What does it have to do with the blockage of the heartbeat SQL? At this time, after the normal execution of all the above SQL is completed (the waiting time is about a few minutes, exceeding the high available switching time of 60s), we parse all the binlog files containing the above business SQL and heartbeat SQL for analysis (binlog is about 22G), check the order of submission of these statements, and find that they are all in the same group, and the heartbeat SQL ranks last in the business SQL. Details are as follows:

Click (here) to collapse or open

[root@10-10-66-229 binlog] # mysqlbinlog-vv mysql-bin.000175 | grep-E 'insert sbtest | insert into sbtest | last_committed | update xx_heartbeat | GTID_NEXT' > a.sql

[root@10-10-66-229 binlog] # cat a.sql

.

# 171011 15:01:42 server id 3306103 end_log_pos 29975 CRC32 0x85a80516 GTID last_committed=76 sequence_number=77

SET @ @ SESSION.GTID_NEXT= '6d362bfe-ae21-11e7 color a22f color fabb9a69be00pur90292According to the price /

# update xx_heartbeat set hb_time=now () where server_id=@@server_id

# 171011 15:01:43 server id 3306103 end_log_pos 30366 CRC32 0x14c9d915 GTID last_committed=77 sequence_number=78

SET @ @ SESSION.GTID_NEXT= '6d362bfe-ae21-11e7 color a22f color fabb9a69be00pur90293 "Compact colors" /

# update xx_heartbeat set hb_time=now () where server_id=@@server_id

# the following is the key information we are looking for. Pay attention to the time of the event that records the group information, from which you can see that the order in which insert...select statements occur is sorted by table name.

# sbtest1, sbtest3, sbtest2, sbtest4,xx_ heartbeat table (the same chronological order as when you hit SQL on the client). From the message "last_committed=78", we can know that these SQL have been put into\

# for binlog submission in the same queue, we can know from the message "sequence_number=79" that the thread operating table sbtest4 is the highest in the binlog submission queue, that is to say, it is leader, and the others are follower.

# 171011 14:39:56 server id 3306103 end_log_pos 30757 CRC32 0xb2c44222 GTID last_committed=78 sequence_number=79

SET @ @ SESSION.GTID_NEXT= '6d362bfe-ae21-11e7 color a22f color fabb9a69be00pur90294 "Compact"

# insert sbtest4 (kMagnec from sbtest4) select kMagnec from sbtest4 # business SQL

# 171011 14:39:55 server id 3306103 end_log_pos 1429482395 CRC32 0x00c7685d GTID last_committed=78 sequence_number=80

SET @ @ SESSION.GTID_NEXT= '6d362bfe-ae21-11e7 color a22f color fabb9a69be00pur90295mm /

# insert sbtest3 (kMagnec from sbtest3) select kMagnec from sbtest3 # business SQL

# 171011 14:39:55 server id 3306103 end_log_pos 2858934033 CRC32 0xb0e9f488 GTID last_committed=78 sequence_number=81

SET @ @ SESSION.GTID_NEXT= '6d362bfe-ae21-11e7 color a22f color fabb9a69be00pur90296mm /

# insert sbtest2 (kMagnec from sbtest2) select kMagnec from sbtest2 # business SQL

# 171011 14:39:54 server id 3306103 end_log_pos 4288385671 CRC32 0x52cf3dfa GTID last_committed=78 sequence_number=82

SET @ @ SESSION.GTID_NEXT= '6d362bfe-ae21-11e7 color a22f color fabb9a69be00pur90297mm /

# insert sbtest1 (kMagnec from sbtest1) select kMagnec from sbtest1 # business SQL

# 171011 16:21:45 server id 3306103 end_log_pos 1422870013 CRC32 0xd3caff89 GTID last_committed=78 sequence_number=83

SET @ @ SESSION.GTID_NEXT= '6d362bfe-ae21-11e7 color a22f color fabb9a69be00pur90298mm /

# the following sentence SQL is the blocked heartbeat SQL. You can see that it ranks behind the previous business SQL, while the previous business SQL generates a lot of binlog data, which takes a lot of time to write to the binlog file.

# and because the heartbeat SQL is in the same group as the previous business SQL, you must wait for the binlog data of the previous business SQL to be written before you can write the binlog, so you can know from here,\

# heartbeat SQL is because the transaction of the heartbeat SQL cannot be committed because it is blocked by the business SQL during the commit binlog phase.

# update xx_heartbeat set hb_time=now () where server_id=@@server_id

SET @ @ SESSION.GTID_NEXT= 'AUTOMATIC' / * added by mysqlbinlog * / *! * / PS: at this point, we know why the heartbeat SQL is blocked, but in step 2.9, what does the event shown in the last_wait column of the output line of the session view mean? Interested students can look at the source code 3 according to the source column value of the information row in step 2.9. the solution to the problem of mutual blockage of unrelated table sentences caused by mutual exclusion when writing binlog needs to be analyzed according to the business requirements. Here, the blocking is heartbeat SQL, and the HA high availability program depends on the detection results of heartbeat SQL. The heartbeat SQL has a response time requirement, and the result must be returned on time. For the scenario we demonstrated, the function of the heartbeat SQL is to check whether the MySQL instance is alive and writable. So, for the DML of heartbeat SQL, recording binlog is not required, so our temporary solution is to dynamically turn off the function of recording BINLOG in heartbeat statement, so that it will not be blocked because of writing binlog, as follows:

Click (here) to collapse or open

Set sql_log_bin=0 Update xx_heartbeat set hb_time=now () where server_id=@@server_id4, summary of DML operations for large transactions, considering the two key performance indicators of the database (concurrency and response time), it is strongly recommended to split the amount of data (such as using limit to split by id or program split). Although it has been repeatedly emphasized in our development specification, for some people, he thinks it doesn't matter to slow down, so Here in HA's highly available heartbeat detection scene, it falls into the pit.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report