
Fault analysis from start to finish: a MySQL restart caused by a crash in the semi-sync plugin

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article walks through the full analysis of a MySQL restart caused by a crash in the semi-sync plugin. Author: Liu An, a senior test engineer with extensive experience in test automation development. He previously worked as a test engineer at Pactera and Splunk.

First, the cause: while testing the company's MySQL high-availability component, we found an anomaly. After the slave was stopped, the high-availability component automatically started the slave again, and the master restarted as well. Under normal circumstances the master should not restart.

Second, the environment:
1. OS: CentOS release 6.7 (Final)
2. MySQL: Ver 14.14 Distrib 5.7.13, for linux-glibc2.5 (x86_64) using EditLine wrapper
3. Semi-sync is enabled on both the master and slave instances (see the MySQL semi-sync configuration documentation).

Third, the analysis:

(1) First, analyze the log files. The master's mysql-error.log shows the following:

1. Just before the crash, semi-sync on the master was stopped and then started:

    2017-07-25T16:02:38.636061+08:00 40 [Note] Semi-sync replication switched OFF.
    2017-07-25T16:02:38.636105+08:00 40 [Note] Semi-sync replication disabled on the master.
    2017-07-25T16:02:38.636137+08:00 0 [Note] Stopping ack receiver thread
    2017-07-25T16:02:38.638008+08:00 40 [Note] Semi-sync replication enabled on the master.
    2017-07-25T16:02:38.638131+08:00 0 [Note] Starting ack receiver thread

2. Immediately after semi-sync started on the master, the semi-sync plugin crashed with an assertion failure:

    mysqld: /export/home/pb2/build/sb_0-19016729-1464157976.67/mysql-5.7.13/plugin/semisync/semisync_master.cc:844: int ReplSemiSyncMaster::commitTrx(const char*, my_off_t): Assertion `entry' failed.
    08:02:38 UTC - mysqld got signal 6 ;

3. The next part of the log is an important clue for reproducing the fault. The SQL statement it records is issued by the high-availability middleware, which continuously writes timestamps to the database in order to measure the data lag between master and slave:

    Trying to get some variables.
    Some pointers may be invalid and cause the dump to abort.
    Query (7f408c0054c0): update universe.u_delay set real_timestamp=now(), logic_timestamp = logic_timestamp + 1 where source = 'ustats'
    Connection ID (thread ID): 61
    Status: NOT_KILLED

The preliminary judgment is that the fault is related to toggling the rpl_semi_sync_master_enabled switch while a transaction is being committed.

(2) Second, verify further. Here only a set of MySQL instances with semi-sync enabled is deployed, without the high-availability component, and a bash script continuously inserts data into the master:

    #!/usr/bin/env bash
    # Create the test schema and table on the master.
    /opt/mysql/base/bin/mysql -uroot -p1 -S /opt/mysql/data/3306/mysqld.sock -e "create database if not exists test;use test;drop table if exists t1;create table t1(id int)"
    i=0
    # Insert rows forever so that a transaction is always committing.
    while true
    do
        /opt/mysql/base/bin/mysql -uroot -p1 -S /opt/mysql/data/3306/mysqld.sock -e "insert into test.t1 values($i)"
        i=$((i+1))
    done

On the master host, repeatedly run the following command to stop and start semi-sync master. The fault reproduces within 5 attempts:

    /opt/mysql/base/bin/mysql -uroot -p1 -S /opt/mysql/data/3306/mysqld.sock -e 'SET GLOBAL rpl_semi_sync_master_enabled = OFF;SET GLOBAL rpl_semi_sync_master_enabled = ON'

So two conditions are necessary to reproduce the fault:
1. semi-sync master is stopped and started;
2. the database has transactions committing at that moment.

(3) Finally, analyze the MySQL source code. To understand why toggling semi-sync master does not trigger the fault every time, we have to read the MySQL source. Fortunately, mysql-error.log states clearly where the assertion failed:

    mysqld: /export/home/pb2/build/sb_0-19016729-1464157976.67/mysql-5.7.13/plugin/semisync/semisync_master.cc:844: int ReplSemiSyncMaster::commitTrx(const char*, my_off_t): Assertion `entry' failed.
    08:02:38 UTC - mysqld got signal 6 ;
    This could be because you hit a bug.
    It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured.
    This error can also be caused by malfunctioning hardware.
    Attempting to collect some information that could help diagnose the problem.
    As this is a crash and something is definitely wrong, the information collection process might fail.

Locate the place in the MySQL source where the assertion fails (mysql-5.7.13 semisync_master.cc):

    /* wait for the position to be ACK'ed back */
    assert(entry);
    entry->n_waiters++;
    wait_result= mysql_cond_timedwait(&entry->cond, &LOCK_binlog_, &abstime);

The assertion fails because 'entry' is NULL, but that alone does not answer my question. Judging by the error message, MySQL itself suspects a bug, so has the bug been fixed? I looked at the latest release, MySQL 5.7.19, and checked the file's history (mysql-5.7.19 semisync_master.cc history). It contains exactly the fix commit I was looking for. Opening it to see whether it analyzes the cause of the failure leads to Bug#22202516: ENABLING SEMI-SYNC DURING COMMIT CAN CAUSE MASTER TO ASSERT.

To make the bug easier to explain, here is a brief introduction to how MySQL commits to the binlog. MySQL introduced Binary Log Group Commit in version 5.6, so committing to the binary log can be simplified into three stages:

Flush stage: write the transaction's log to the cache of the binlog file.
Sync stage: write the cached binlog data to disk.
Commit stage: commit the transactions in the storage engine, in order.

MYSQL_BIN_LOG::ordered_commit is the core function of transaction commit during the binlog phase. This function writes the transaction log to the binlog file, triggers the dump thread to send the binlog to the slave, and finally marks the transaction as committed.

In fact, the binlog part of the transaction commit described above is the same with or without semi-sync. Semi-sync only adds an acknowledgement step between master and slave: the master waits for the slave to confirm that the binlog up to the relevant position has been synchronized to the slave. Until that acknowledgement arrives, the transaction commit blocks at that step.

In semi-synchronous replication, so that the master can wait for the slave's acknowledgement when it commits a transaction, semi-sync maintains an active transaction list made up of 'entry' nodes. The bug analysis explains that, first, in the flush stage, semi-sync creates an 'entry' and associates it with one transaction or a group of transactions in the flush stage. This 'entry' is inserted into the active transaction linked list. Reference: create entry.

    int ReplSemiSyncMaster::writeTranxInBinlog(const char* log_file_name,
                                               my_off_t log_file_pos)
    {
      ...
      if (is_on())
      {
        assert(active_tranxs_ != NULL);
        if (active_tranxs_->insert_tranx_node(log_file_name, log_file_pos))
        {
          /*
            if insert tranx_node failed, print a warning message
            and turn off semi-sync
          */
          sql_print_warning("Semi-sync failed to insert tranx_node for binlog file: %s, position: %lu",
                            log_file_name, (ulong)log_file_pos);
          switch_off();
        }
      }
      ...
    }

Next comes the sync stage. Each thread in this stage sets trx_wait_binlog_name and trx_wait_binlog_pos to the position of its transaction in the binlog. Finally, in the commit stage, in order to wait for the slave's acknowledgement, semi-sync uses trx_wait_binlog_name and trx_wait_binlog_pos to look up the associated 'entry'. Reference: find entry.

    int ReplSemiSyncMaster::commitTrx(const char* trx_wait_binlog_name,
                                      my_off_t trx_wait_binlog_pos)
    {
      ...
      TranxNode* entry= NULL;
      mysql_cond_t* thd_cond= NULL;
      if (active_tranxs_ != NULL && trx_wait_binlog_name)
      {
        entry=
          active_tranxs_->find_active_tranx_node(trx_wait_binlog_name,
                                                 trx_wait_binlog_pos);
        if (entry)
          thd_cond= &entry->cond;
      }
      ...
    }

There are two scenarios in which the 'entry' cannot be found:
1. The slave has already acknowledged a binlog position beyond the position this transaction is waiting on.
2. Semi-sync was not enabled when the transaction entered the flush stage, so no 'entry' was created and inserted into the active transaction linked list.

Case 1 never enters the waiting phase, because the slave has already acknowledged the position. Case 2 triggers the assertion failure above, because the corresponding 'entry' cannot be found in the active transaction linked list.

At this point my question is finally answered. The timing of enabling semi-sync master matters: it has to fall exactly in the window after a transaction has passed the flush stage but before it has reached the commit stage.

Fourth, the conclusion: the resulting fix is straightforward. If the 'entry' is not found and the slave has not acknowledged the position, the transaction commit is treated as an asynchronous commit. Next, check which versions contain the fix:

So upgrading MySQL to a version that contains the fix solves the problem.

Fifth, the review: to summarize my diagnosis path:
1. Observe the symptoms of the fault and analyze the MySQL error log.
2. Through conjecture and experiment, build a simple scenario that reproduces the fault.
3. Map the log to the MySQL source code, search the code history, and locate the bug number.
4. Read the bug analysis to understand why the failure occurs and the exact conditions for reproducing it.
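
As a rough illustration of the race described in the analysis, the following C++ sketch models the flush and commit stages with a toy set of active transactions. It is not MySQL code: SemiSyncModel, Outcome, simulate_toggle_during_commit, and simulate_no_toggle are hypothetical names, the binlog file name and position are made-up values, and slave acknowledgements (case 1) are left out. The "fixed" path only mirrors the behavior stated in the conclusion, where a missing entry is treated as an asynchronous commit.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Possible results of the commit stage in this toy model.
enum class Outcome { WaitedForAck, AsyncCommit, AssertFailed };

// Toy stand-in for the semi-sync master state (hypothetical, not MySQL's class).
struct SemiSyncModel {
    bool enabled = false;
    // Active transactions, keyed by (binlog file name, binlog position).
    std::set<std::pair<std::string, unsigned long>> active;

    // Flush stage: an entry is created only if semi-sync is on right now.
    void flush_stage(const std::string& file, unsigned long pos) {
        if (enabled) active.emplace(file, pos);
    }

    bool has_entry(const std::string& file, unsigned long pos) const {
        return active.count(std::make_pair(file, pos)) > 0;
    }

    // Commit stage: look up the entry created during the flush stage.
    // fixed == false models the pre-fix assert(entry);
    // fixed == true models falling back to an asynchronous commit.
    Outcome commit_stage(const std::string& file, unsigned long pos,
                         bool fixed) const {
        if (!enabled) return Outcome::AsyncCommit;
        if (has_entry(file, pos)) return Outcome::WaitedForAck;
        return fixed ? Outcome::AsyncCommit : Outcome::AssertFailed;
    }
};

// Semi-sync is OFF during flush and switched ON before commit (the race).
inline Outcome simulate_toggle_during_commit(bool fixed) {
    SemiSyncModel m;
    m.enabled = false;
    m.flush_stage("binlog.000001", 100);  // no entry is created
    m.enabled = true;                     // toggled between the two stages
    return m.commit_stage("binlog.000001", 100, fixed);
}

// Semi-sync stays ON for the whole commit (the normal path).
inline Outcome simulate_no_toggle(bool fixed) {
    SemiSyncModel m;
    m.enabled = true;
    m.flush_stage("binlog.000001", 100);  // entry is created
    return m.commit_stage("binlog.000001", 100, fixed);
}
```

In this model, simulate_toggle_during_commit(false) corresponds to the assertion failure seen in mysql-error.log, while simulate_toggle_during_commit(true) corresponds to the commit quietly proceeding as asynchronous. The model also suggests why the fault does not reproduce on every toggle: the switch has to change between the two stages of the same transaction.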
