Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

How to understand MYSQL-GroupCommit and 2pc submission

2025-03-31 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)05/31 Report--

This article will explain in detail how to understand the submission of MYSQL-GroupCommit and 2pc. The content of the article is of high quality, so the editor shares it for you as a reference. I hope you will have some understanding of the relevant knowledge after reading this article.

Group commit (group commit) is an optimized way for MYSQL to deal with logs, mainly to solve the problem of frequent disk scrubbing when writing logs. Group submission has been continuously optimized with the development of MYSQL, from initially only supporting redo log group submission to the current official version 5.6 supporting both redo log and binlog group submission. The implementation of group commit greatly improves the transaction processing performance of mysql. Taking innodb storage engine as an example, the implementation principle of group commit in each stage is introduced in detail.

Group submission for redo log

WAL (Write-Ahead-Logging) is a common technology to achieve transaction persistence. The basic principle is that when committing a transaction, in order to avoid random writing of disk pages, it is only necessary to ensure that the redo log of the transaction is written to disk, so that the random writing of pages can be replaced by the sequential writing of redo log, and the persistence of transactions can be guaranteed, and the performance of the database system is improved. Although WAL uses sequential writes instead of random writes, each transaction commit still requires a log flush action, which is limited by disk IO, which is still the bottleneck of transaction concurrency.

The idea of group commit is to merge the flushing actions of multiple transactions redo log to reduce the disk write order. In Innodb's log system, each redo log has a LSN (Log Sequence Number), and LSN is monotonously increasing. Each transaction performs an update operation that contains one or more redo log, and when each transaction copies the log to log_sys_buffer (log_sys_buffer through log_mutex

Protection), the current maximum LSN is obtained, so it is guaranteed that the LSN of different transactions will not be duplicated. Then assuming that the maximum LSN of the log of the three transactions Trx1,Trx2 and Trx3 is LSN1,LSN2,LSN3 (LSN1lsn), it must be guaranteed that the transaction redo log that will be brushed with binlog must have been closed. By delaying writing redo, the purpose of redo log group submission is realized, and the competition of log_sys- > mutex is reduced. At present, this strategy has been introduced by the official mysql5.7.6.

Two-phase submission

In the stand-alone case, redo log group submission solves the log disk failure problem very well, so when binlog is enabled, can binlog enable group submission as well as redo log? After opening binlog first, one of the problems we need to solve is how to ensure the consistency of binlog and redo log. In Prepare stage, innodb brushes redo log, and sets rollback segment to Prepared state, binlog does not do any operation; in commit phase, innodb releases lock, releases rollback segment, sets submission status, and binlog brushes binlog log. When an exception occurs and fault recovery is needed, if it is found that the transaction is in the Prepare phase and binlog exists, it will be committed, otherwise it will be rolled back.

Group submission for binlog

Going back to the question in the previous section, after enabling binlog, how to achieve group submission on the basis of ensuring redo log-binlog consistency. Because of this problem, before 5.6, mysql could not implement group submission with binlog enabled. Redo log and binlog were serialized through a notorious prepare_commit_mutex, and the purpose of serialization was only to ensure the consistency of redo log-Binlog, but this implementation sacrificed performance. This situation is obviously intolerable, so various mysql branches, mariadb,facebook,perconal and so on have issued patches to improve this problem, and the official version 5.6 of mysql has finally solved this problem. As the solutions of each branch version are similar, I mainly explain the implementation method by analyzing the implementation of 5.6.

The basic idea of binlog group commit is to introduce a queue mechanism to ensure that the innodb commit order is consistent with the binlog order, and to group transactions, and the binlog disk brushing action in the group is handed over to a transaction to achieve the purpose of group commit. Binlog commit divides the submission into three phases, the FLUSH phase, the SYNC phase and the COMMIT phase. There is a queue in each stage, and each queue has a mutex protection. It is agreed that the first thread to enter the queue is leader, and the other thread is follower. Everything is left to leader to do. After leader has done all the actions, inform follower that the flushing ends. The basic process of binlog group submission is as follows:

FLUSH stage

1) hold Lock_log mutex [leader hold, follower wait]

2) get a set of binlog in the queue (all transactions in the queue)

3) transfer binlog buffer to Ibank O cache

4) notify the dump thread dump binlog

SYNC stage

1) release Lock_log mutex and hold Lock_sync mutex [leader hold, follower wait]

2) set a set of binlog to disk (sync action is the most time-consuming, assuming that sync_binlog is 1)

COMMIT stage

1) release Lock_sync mutex and hold Lock_commit mutex [leader hold, follower wait]

2) traverse the transactions in the queue and innodb commit them one by one

3) release Lock_commit mutex

4) Wake up the waiting thread in the queue

Note: because there are multiple queues, each queue has its own mutex protection, the queues are sequential, and a thread that is agreed to enter the queue is leader, so the leader of the FLUSH phase may be the follower of the SYNC phase, but the follower is always the follower.

From the above analysis, we know that MYSQL's current group submission approach solves consistency and performance issues. Consistency is resolved through two-phase commit, and disk IO performance is resolved through group commit of redo log and binlog.

Reference documentation

Http://mysqlmusings.blogspot.com/2012/06/binary-log-group-commit-in-mysql-56.html

Http://www.lupaworld.com/portal.php?mod=view&aid=250169&page=all

Http://www.oschina.net/question/12_89981

Http://kristiannielsen.livejournal.com/12254.html

Http://blog.chinaunix.net/uid-26896862-id-3432594.html

Http://www.csdn.net/article/2015-01-16/2823591

The transaction commit logic of MySQL is mainly completed in the function ha_commit_trans. Transaction commit involves transaction commit of binlog and specific storage engine. So MySQL uses 2PC to guarantee the integrity of the transaction. The 2PC process for MySQL is as follows:

| all.ha_list: all=1 thd- 4: | debug: stmt.unsafe_rollback_flags: Troup4: | debug: all.unsafe_rollback_flags: Troup4: | > trans_checkT@4: | ha_commit_trans Troup4: | | info: all=1 thd- > in_sub_stmt=0 ha_info=0x0 is_real_trans=1 Troup4: | | > MYSQL_BIN_LOG::commit Thum4: | | enter: thd: 0x2b9f4c07beb0 | All: yes, xid: 0, cache_mngr: 0x0 Tunable 4: | > ha_commit_lowT@4: | > THD::st_transaction::cleanupT@4: | > free_rootT@4: | enter: root: 0x2b9f4c07d660 flags: 1 Tunable 4: | pos_in_file

...

}

The function innobase_commit commits the transaction, gets the location of the current binlog, and then writes to the transaction system PAGE (trx_commit_off_kernel à trx_sys_update_mysql_binlog_offset).

InnoDB records the location of the MySQL binlog to trx system header:

/ / trx0sys.h

/ * The offset of the MySQL binlog offset info in the trx system header * /

# define TRX_SYS_MYSQL_LOG_INFO (UNIV_PAGE_SIZE-1000)

# define TRX_SYS_MYSQL_LOG_MAGIC_N_FLD 0 / * magic number which shows

If we have valid data in the

MySQL binlog info; the value

Is... _ MAGIC_N if yes * /

# define TRX_SYS_MYSQL_LOG_OFFSET_HIGH 4 / * high 4 bytes of the offset

Within that file * /

# define TRX_SYS_MYSQL_LOG_OFFSET_LOW 8 / * low 4 bytes of the offset

Within that file * /

# define TRX_SYS_MYSQL_LOG_NAME 12 / * MySQL log file name * /

5.3.2 transaction recovery process

When Innodb resumes, transactions in different states will be handled differently (see trx_rollback_or_clean_all_without_sess function):

For TRX_COMMITTED_IN_MEMORY transactions, clear the rollback segment, and then set the transaction to TRX_NOT_STARTED

For TRX_NOT_STARTED transactions, indicates that the transaction has been committed, skip

For TRX_PREPARED transactions, it is necessary to decide the fate of the transaction according to binlog. Skip temporarily.

For TRX_ACTIVE transactions, roll back.

When MySQL opens binlog, it checks the status of the binlog (TC_LOG_BINLOG::open). If the binlog is not shut down normally (LOG_EVENT_BINLOG_IN_USE_F is 1), the restore operation is performed as follows:

Scan binlog, read XID_EVENT transactions, and get a list of all committed XA transactions (in fact, transactions may be in prepare or commit in innodb)

For each XA transaction, call handlerton::recover to check whether the storage engine has the transaction in the prepare state (see innobase_xa_recover), that is, to check the state of the XA transaction in the storage engine

If there is a XA transaction in the prepare state, handlerton::commit_by_xid is called to commit the transaction

Otherwise, call handlerton::rollback_by_xid to roll back the XA transaction.

5.3.3 discussion on several parameters

(1) sync_binlog

Mysql calls MYSQL_LOG::write to complete the write binlog when committing the transaction, and decides whether or not to flush the disk according to sync_binlog. The default value is 0, which means that the disk is not brushed, thus giving control to OS. If set to 1, a flush occurs every time the transaction is committed; this has an impact on performance (binlog group is already supported in 5. 6), so many people set it to 100.

Bool MYSQL_LOG::flush_and_sync ()

{

Int err=0, fd=log_file.file

Safe_mutex_assert_owner & LOCK_log)

If (flush_io_cache (& log_file))

Return 1

If (+ + sync_binlog_counter > = sync_binlog_period & & sync_binlog_period)

{

Sync_binlog_counter= 0

Err=my_sync (fd, MYF (MY_WME))

}

Return err

}

(2) innodb_flush_log_at_trx_commit

This parameter controls the behavior of innodb brushing redo log when committing a transaction. The default value is 1, that is, every time a transaction is committed, the disk is brushed. To reduce the impact on performance, it is set to 2 or even 0 in many production environments.

< in: lsn up to which logs are to be flushed. */ { switch (srv_flush_log_at_trx_commit) { case 0: /* Do nothing */ break; case 1: /* Write the log and optionally flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, srv_unix_file_flush_method != SRV_UNIX_NOSYNC); break; case 2: /* Write the log but do not flush it to disk */ log_write_up_to(lsn, LOG_WAIT_ONE_GROUP, FALSE); break; default: ut_error; }} If the value of innodb_flush_log_at_trx_commit is 0, the log buffer is written out to the log file once per second and the flush to disk operation is performed on the log file, but nothing is done at a transaction commit. When the value is 1 (the default), the log buffer is written out to the log file at each transaction commit and the flush to disk operation is performed on the log file. When the value is 2, the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it. However, the flushing on the log file takes place once per second also when the value is 2. Note that the once-per-second flushing is not 100% guaranteed to happen every second, due to process scheduling issues. The default value of 1 is required for full ACID compliance. You can achieve better performance by setting the value different from 1, but then you can lose up to one second worth of transactions in a crash. With a value of 0, any mysqld process crash can erase the last second of transactions. With a value of 2, only an operating system crash or a power outage can erase the last second of transactions. (3) innodb_support_xa 用于控制innodb是否支持XA事务的2PC,默认是TRUE。如果关闭,则innodb在prepare阶段就什么也不做;这可能会导致binlog的顺序与innodb提交的顺序不一致(比如A事务比B事务先写binlog,但是在innodb内部却可能A事务比B事务后提交),这会导致在恢复或者slave产生不同的数据。 int innobase_xa_prepare( /*================*/ /* out: 0 or error number */ THD* thd, /* in: handle to the MySQL thread of the user whose XA transaction should be prepared */ bool all) /* in: TRUE - commit transaction FALSE - the current SQL statement ended */ { … if (!thd->

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report