2025-03-05 Update From: SLTechnology News & Howtos > Database
Background
This article analyzes a deadlock encountered while taking a physical backup on the slave in a MySQL 5.6 master/slave setup. I hope you find it helpful.
The physical backup tool here is Percona XtraBackup (PXB). Some readers may not be familiar with its backup process, so briefly, PXB performs a backup in the following steps:
1. Copy the InnoDB redo log; a separate thread keeps copying it until the end of the backup.
2. Copy all InnoDB .ibd files.
3. Acquire the global read lock: FLUSH TABLES WITH READ LOCK (FTWRL).
4. Copy the .frm, .MYD, .MYI and other files.
5. Record position information: SHOW SLAVE STATUS and SHOW MASTER STATUS.
6. Release the lock: UNLOCK TABLES.
7. Do some finishing work; the backup ends.
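The ordering of the SQL statements in these steps can be sketched as follows. This is a minimal illustration, not PXB's actual code: `run_sql` is a hypothetical stand-in for a real client connection that here only records each statement, and the file-copy steps appear only as comments.

```python
# Minimal sketch of the PXB backup sequence described above.
# run_sql is a hypothetical stand-in for a real MySQL connection;
# here it just records statements so the ordering is explicit.

def pxb_backup(run_sql):
    # 1. A background thread starts copying the InnoDB redo log and
    #    keeps copying until the end of the backup (not modeled here).
    # 2. Copy all InnoDB .ibd files (plain file copy, no SQL involved).
    run_sql("FLUSH TABLES WITH READ LOCK")  # 3. global read lock (FTWRL)
    # 4. Copy .frm, .MYD, .MYI and other non-InnoDB files.
    run_sql("SHOW SLAVE STATUS")            # 5. replication position
    run_sql("SHOW MASTER STATUS")           #    binlog position
    run_sql("UNLOCK TABLES")                # 6. release the global lock
    # 7. Final bookkeeping; the redo-copy thread stops.

statements = []
pxb_backup(statements.append)
print(statements)
```

The key point for the deadlocks below is that SHOW SLAVE STATUS runs while FTWRL is still held.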
If there are many MyISAM tables, the global read lock is held for a long time, so backups are generally taken on a slave.
In addition, FLUSH TABLES WITH READ LOCK acquires two MDL locks: the global read lock (MDL_key::GLOBAL) and the global COMMIT lock (MDL_key::COMMIT). For more on MDL locks, see the earlier monthly report on the MDL implementation.
Deadlock Analysis
CASE 1
Let's take a look at what the scene looks like during the deadlock:
mysql> show processlist;
+----+-------------+-----------------+------+---------+------+----------------------------------------+-------------------+
| Id | User        | Host            | db   | Command | Time | State                                  | Info              |
+----+-------------+-----------------+------+---------+------+----------------------------------------+-------------------+
|  1 | root        | 127.0.0.1:53309 | NULL | Query   |  278 | init                                   | show slave status |
|  2 | system user |                 | NULL | Connect |  381 | Queueing master event to the relay log | NULL              |
|  3 | system user |                 | NULL | Connect |  311 | Waiting for commit lock                | NULL              |
|  4 | root        | 127.0.0.1:53312 | NULL | Query   |    0 | init                                   | show processlist  |
+----+-------------+-----------------+------+---------+------+----------------------------------------+-------------------+
You can see that show slave status has been blocked for a long time, and the SQL thread is in "Waiting for commit lock", i.e. waiting for the COMMIT lock.
At this point, if we open another connection and run show slave status again, it also blocks; even if we Ctrl-C to kill the statement, the thread is still there.
mysql> show processlist;
+----+-------------+-----------------+------+---------+------+----------------------------------------+-------------------+
| Id | User        | Host            | db   | Command | Time | State                                  | Info              |
+----+-------------+-----------------+------+---------+------+----------------------------------------+-------------------+
|  1 | root        | 127.0.0.1:53309 | NULL | Query   |  753 | init                                   | show slave status |
|  2 | system user |                 | NULL | Connect |  856 | Queueing master event to the relay log | NULL              |
|  3 | system user |                 | NULL | Connect |  786 | Waiting for commit lock                | NULL              |
|  4 | root        | 127.0.0.1:53312 | NULL | Killed  |  188 | init                                   | show slave status |
|  5 | root        | 127.0.0.1:53314 | NULL | Query   |    0 | init                                   | show processlist  |
|  8 | root        | 127.0.0.1:53318 | NULL | Killed  |  125 | init                                   | show slave status |
| 11 | root        | 127.0.0.1:53321 | NULL | Killed  |  123 | init                                   | show slave status |
| 14 | root        | 127.0.0.1:53324 | NULL | Query   |  120 | init                                   | show slave status |
+----+-------------+-----------------+------+---------+------+----------------------------------------+-------------------+
Using pstack to look at the backtraces of the relevant threads, the show slave status thread is clearly waiting on a mutex; the corresponding code is mysql_mutex_lock(&mi->rli->data_lock):
#0  __lll_lock_wait
#1  _L_lock_974
#2  __GI___pthread_mutex_lock
#3  inline_mysql_mutex_lock
#4  show_slave_status
#5  mysql_execute_command
#6  mysql_parse
#7  dispatch_command
#8  do_command
#9  do_handle_one_connection
#10 handle_one_connection
...
The backtrace of the SQL thread is as follows, waiting for the COMMIT lock:
#0  pthread_cond_timedwait
#1  inline_mysql_cond_timedwait
#2  MDL_wait::timed_wait
#3  MDL_context::acquire_lock
#4  ha_commit_trans
#5  trans_commit
#6  Xid_log_event::do_commit
#7  Xid_log_event::do_apply_event
#8  Log_event::apply_event
#9  apply_event_and_update_pos
#10 exec_relay_log_event
#11 handle_slave_sql
...
If we attach with gdb and inspect the SQL thread inside MDL_context::acquire_lock:
(gdb) p (MDL_key::enum_mdl_namespace) lock->key->m_ptr[0]
$24 = MDL_key::COMMIT
(gdb) p ((THD*) lock->m_granted.m_list.m_first->m_ctx->m_owner)->thread_id
$25 = 1
You can see that the COMMIT lock is held by thread 1.
The SQL thread acquires rli_ptr->data_lock before entering Xid_log_event::do_commit.
So the picture is now clear: thread 1 (the backup thread) and thread 3 (the SQL thread) deadlock with each other. Reconstructing the sequence:
1. The backup thread executes FTWRL and acquires the COMMIT lock.
2. The SQL thread reaches an Xid event, prepares to commit the transaction, and requests the COMMIT lock; it is blocked by the backup thread.
3. To record the slave's replication position, the backup thread executes show slave status, which needs rli->data_lock; it is blocked by the SQL thread.
In this way, two threads hold and wait for each other, forming a deadlock.
We know that MDL has deadlock detection, so why isn't the deadlock detected here? Because rli->data_lock is a plain mutex and does not belong to the MDL system. In this scenario the MDL system can only see the wait for the COMMIT lock, and that alone contains no cycle.
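This blind spot can be pictured as a wait-for graph. The sketch below is illustrative Python, not MySQL code: the full dependency graph, which includes the mutex edge, has a cycle, while the subgraph visible to the MDL detector does not.

```python
# Wait-for edges in CASE 1: "A -> B" means thread A waits for a lock
# held by thread B.
full_graph = {
    "backup": {"sql"},     # backup waits on rli->data_lock (a mutex)
    "sql":    {"backup"},  # SQL thread waits on the COMMIT MDL lock
}
# The MDL deadlock detector only sees waits on MDL locks, so the
# mutex edge backup -> sql is invisible to it:
mdl_graph = {
    "backup": set(),
    "sql":    {"backup"},
}

def has_cycle(graph):
    """Detect a cycle with a depth-first search over the wait-for edges."""
    def visit(node, path):
        if node in path:
            return True
        return any(visit(n, path | {node}) for n in graph.get(node, ()))
    return any(visit(n, frozenset()) for n in graph)

print(has_cycle(full_graph))  # True: the real dependencies form a cycle
print(has_cycle(mdl_graph))   # False: the MDL-visible part has no cycle
```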
The subsequent show slave status statements are blocked because a mutex is acquired before show slave status executes:
mysql_mutex_lock(&LOCK_active_mi);
res = show_slave_status(thd, active_mi);
mysql_mutex_unlock(&LOCK_active_mi);
The earlier deadlocked show slave status never exits, so the later ones naturally block on this mutex, and because thd->killed is not checked while waiting on it, they cannot be killed either.
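The effect of such an uninterruptible wait can be sketched with a toy model (plain Python threading as a stand-in for the server's mutexes; all names are illustrative): a kill only sets a flag, and a thread parked inside a blocking acquire never looks at the flag until the lock is granted.

```python
import threading
import time

# Stand-in for LOCK_active_mi, currently held by the stuck
# show slave status thread from the deadlock.
lock_active_mi = threading.Lock()
lock_active_mi.acquire()

killed = threading.Event()         # stand-in for thd->killed
state = {"phase": "blocked"}

def later_show_slave_status():
    # Like mysql_mutex_lock(&LOCK_active_mi): a plain blocking acquire
    # that never checks the killed flag while waiting.
    lock_active_mi.acquire()
    # Only after getting the mutex can the thread notice it was killed.
    state["phase"] = "killed" if killed.is_set() else "running"
    lock_active_mi.release()

t = threading.Thread(target=later_show_slave_status)
t.start()
killed.set()                       # KILL / Ctrl-C merely sets the flag...
time.sleep(0.2)                    # ...the thread stays stuck in acquire()
blocked_snapshot = state["phase"]  # still "blocked"
lock_active_mi.release()           # the original holder finally releases
t.join()
print(blocked_snapshot, state["phase"])  # blocked killed
```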
The root cause of the deadlock is that the SQL thread holds rli->data_lock while committing, which is unnecessary. MySQL fixed this upstream in a patch.
CASE 2
After the fix above, a deadlock occurred again, but this time the situation is different. The show processlist result is as follows:
mysql> show processlist;
+----+-------------+-----------------+------+---------+------+----------------------------------+-------------------+
| Id | User        | Host            | db   | Command | Time | State                            | Info              |
+----+-------------+-----------------+------+---------+------+----------------------------------+-------------------+
|  2 | system user |                 | NULL | Connect |  436 | Waiting for master to send event | NULL              |
|  3 | system user |                 | NULL | Connect |  157 | Waiting for commit lock          | NULL              |
|  6 | root        | 127.0.0.1       | NULL | Query   |   86 | init                             | show slave status |
|  7 | root        | 127.0.0.1       | NULL | Query   |   96 | Killing slave                    | stop slave        |
|  8 | root        | 127.0.0.1:42789 | NULL | Query   |    0 | init                             | show processlist  |
+----+-------------+-----------------+------+---------+------+----------------------------------+-------------------+
The SQL thread is again waiting for the COMMIT lock, and show slave status is again blocked and does not return. The difference is the extra stop slave. Let's look at stop slave's backtrace:
#0  pthread_cond_timedwait
#1  inline_mysql_cond_timedwait
#2  terminate_slave_thread
#3  terminate_slave_threads
#4  stop_slave
#5  mysql_execute_command
#6  mysql_parse
#7  dispatch_command
#8  do_command
#9  do_handle_one_connection
#10 handle_one_connection
...
Matching this against the code, stop slave is waiting for the SQL thread to exit, while the SQL thread is waiting for the COMMIT lock held by the backup thread (id=6). The whole deadlock forms like this:
1. The backup thread executes FTWRL and acquires the COMMIT lock.
2. The SQL thread reaches an Xid event, prepares to commit the transaction, and requests the COMMIT lock; it is blocked by the backup thread.
3. A user executes stop slave, which prepares to stop the replication threads and waits for the SQL thread to exit.
4. To record the replication position, the backup thread executes show slave status, which needs LOCK_active_mi; it is blocked by the user thread running stop slave.
This time, the backup thread, the SQL thread and the user thread hold and wait for each other, forming a deadlock.
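The three-party cycle can be sketched with the same kind of toy model (illustrative Python; the lock names are stand-ins for the COMMIT MDL lock, the SQL thread's liveness, and LOCK_active_mi, not real server objects). Each party holds one resource and waits for the next; with real blocking waits none would ever return, so the sketch uses timeouts to stay terminating.

```python
import threading

# Stand-ins for the three resources involved in CASE 2:
commit_lock    = threading.Lock()  # COMMIT MDL lock, held by the backup thread
sql_running    = threading.Lock()  # "SQL thread still alive", held by the SQL thread
lock_active_mi = threading.Lock()  # LOCK_active_mi, held by the stop slave thread

# Initial "already held" state, acquired on behalf of each party:
commit_lock.acquire()
sql_running.acquire()
lock_active_mi.acquire()

progressed = []

def wants(name, wanted):
    # Each party waits for a resource held by the next one in the cycle.
    # The real waits block forever; a timeout keeps this sketch finite.
    if wanted.acquire(timeout=0.3):
        progressed.append(name)
        wanted.release()

threads = [
    threading.Thread(target=wants, args=("sql",    commit_lock)),     # SQL -> backup
    threading.Thread(target=wants, args=("stop",   sql_running)),     # stop slave -> SQL
    threading.Thread(target=wants, args=("backup", lock_active_mi)),  # backup -> stop slave
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(progressed)  # []: every party times out, nobody can make progress
```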
This time it is not a code bug but a usage problem: in day-to-day operations, if we see the SQL thread in "Waiting for commit lock", we should not run stop slave.
Deadlock resolution
If such a deadlock does occur anyway, how do we resolve it?
From the analysis above, in both case 1 and case 2 the backup thread and the user thread no longer respond, so the only way to break the deadlock is to kill the SQL thread. Is a direct kill risky?
The SQL thread is executing an Xid event, which means it is committing a transaction on a transactional engine, so killing it should be fine (the transaction rolls back and is redone later). However, there is a bug: a SQL thread killed while waiting for the COMMIT lock simply skips the transaction, leaving the slave one transaction behind the master. After such a kill you need to compare master and slave data and apply the missing transaction.
If you are running a version of MySQL with this bug fixed, i.e. 5.6.21 or later, then killing the SQL thread is safe.
Deadlock recurrence
What if you want to reproduce the deadlock in order to test or study the code? Simply executing FTWRL on the slave will most likely not reproduce it, because FTWRL takes two locks, the global read lock and the global COMMIT lock, and the SQL thread is far more likely to be blocked by the global read lock (Waiting for global read lock) than by the COMMIT lock (Waiting for commit lock).
One way is to write a testcase, using the debug sync facility to set synchronization points so that threads stop at the specified places, but this requires mysqld to run in debug mode and some familiarity with MySQL source development.
Another way is to change the code to stretch out do_commit, for example by sleeping for a while, which leaves enough time for FTWRL to complete before the SQL thread requests the COMMIT lock; but this means modifying the code, recompiling and reinstalling.
What if we don't know how to use debug sync, don't want to change the code and recompile, and just want to test in the existing environment? SystemTap!
SystemTap initially supported probing only in kernel space, but since version 0.6 it can also probe in user space. Using it requires debug information in the program (compiled with the -g option).
List all the places in mysqld where we can probe:
sudo stap -L 'process("/usr/sbin/mysqld").function("*")'
List all the probe points in the Xid_log_event class:
sudo stap -L 'process("/usr/sbin/mysqld").function("*Xid_log_event::*")'
If we want Xid_log_event::do_commit to run with a bit of delay, we can do this:
sudo stap -v -g -d /usr/bin/mysqld --ldd -e 'probe process(16011).function("Xid_log_event::do_commit") {printf("got it\n") mdelay(3000)}'
Here 16011 is the PID of the running slave's mysqld. After the stap command above is started, every time the slave executes Xid_log_event::do_commit, stap prints "got it" and the SQL thread pauses for 3 seconds, which gives us plenty of time to execute FTWRL and grab the COMMIT lock before the SQL thread commits.
Source: http://www.kancloud.cn/taobaomysql/monthly/117960