Example Analysis of the Internal Mechanism of InnoDB in MySQL 07/03 Update SLTechnology News&Howtos

Example Analysis of the Internal Mechanism of InnoDB in MySQL

2025-07-03 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article mainly introduces the example analysis of the internal mechanism of InnoDB in MySQL, which has a certain reference value, and interested friends can refer to it. I hope you will gain a lot after reading this article.

Read view

InnoDB supports multiple versions of MVCC, where RC (Read Committed) and RR (Repeatable Read) isolation levels are supported using consistent read view (consistent read View). The so-called consistent read view is to take a snapshot (snapshot) to the transaction system trx_sys at a certain time, write down the state of the trx_sys at that time (including the active read and write transaction array), and then compare all the read operations according to the state of its transaction ID (i.e. trx_id) and the trx_sys in snapshot, so as to judge the visibility of read view to the transaction.

The trx_sys states saved in Read view mainly include

Low_limit_id:high water mark, transactions greater than or equal to view- > low_limit_id are not visible to view

Up_limit_id:low water mark, transactions less than view- > up_limit_id must be visible to view

Undo log with low_limit_no:trx_no less than view- > low_limit_no can be purge for view.

Rw_trx_ids: read-write transaction array

The difference between the RR isolation level (except for Gap locks) and the RC isolation level is the timing when the snapshot is created. The RR isolation level creates the read view at the beginning of the transaction, or rather, the first read operation; the RC isolation level creates the read view at the beginning of the statement.

Creating / closing read view requires holding trx_sys- > mutex, which will degrade system performance. Version 5.7 optimizes this, and session will cache the read view of read-only transactions when the transaction commits.

The next time a read view is created, it is judged that if it is a read-only transaction and the state of the read-write transaction of the system has not changed, that is, if the max_trx_id of the trx_sys is not pushed forward and no new read-write transaction is generated, the last read view can be reused.

After the Read view is created, the visibility can be determined by comparing the high/low water mark and read-write transaction arrays of the last updated trx_id and view when reading the data.

As mentioned earlier, if recording the latest data is the result of an update of the current transaction trx, the corresponding previous read view must be visible.

In addition, you can quickly judge through high/low water mark:

Trx_id

< view->

The record of up_limit_id must be visible to the current read view

The record of trx_id > = view- > low_limit_id must not be visible to the current read view

If the trx_id falls on [up_limit_id, low_limit_id), you need to find out whether the trx_id exists in the array of active read and write transactions, and if so, the record is not visible to the current read view.

Because the secondary index of InnoDB only stores the trx_id last updated by page, when querying with the secondary index, if the trx_id of page is less than view- > up_limit_id, you can directly judge that all records of page are visible to the current view, otherwise you need to go back to the clustered index to judge.

If the record is not visible to the view, the current view visible version data needs to be constructed by traversing the history list with the recorded DB_ROLL_PTR pointer.

Rollback segment

InnoDB also uses rollback segments to build old version records, which is similar to Oracle.

The recorded DB_ROLL_PTR points to the rollback segment created by the most recent update; each undo log also points to an earlier version of undo log, forming an update chain. Through this update chain, different transactions can find their corresponding versions of undo log to form old version records, which is the history list of records.

Assign rollback segment

MySQL 5.6 defaults to read-write transactions if the specified READ ONLY transaction is not displayed. The trx_id and rollback segments are allocated at the start of the transaction, and the current transaction is added to the read-write transaction array of trx_sys.

Version 5.7 defaults to read-only transactions for all transactions. When the first write operation is encountered, the read-only transaction is switched to the read-write transaction allocation trx_id and rollback segment, and the current transaction is added to the read-write transaction array of trx_sys.

The work of assigning the rollback segment is carried out in the function trx_assign_rseg_low, and the allocation strategy is in round-robin mode.

Independent undo tablespaces are supported since 5.6. InnoDB supports 128undo rollback segments, please refer to the first article.

Rseg0: reserved in the system tablespace ibdata

Rseg1~rseg32: these 32 rollback segments are stored in the system tablespace of the temporary table

Rseg33~rseg127: stored in a separate undo table space according to configuration (or in ibdata if the separate Undo table space is not open)

Trx_assign_rseg_low determines that if a separate undo table space is supported, avoid using the rollback segment of the system table space if the undo table space has available rollback segments.

Rseg- > skip_allocation for TRUE indicates that the tablespace where rseg is to be truncate should be avoided using this rseg to allocate rollback segments. In this case, you must ensure that there are at least 2 active undo table spaces and at least 2 active undo slot.

When the allocation is successful, increment rseg- > trx_ref_count to ensure that the tablespace of rseg will not be truncate.

The temporary table operation does not remember redo log, and the get_next_noredo_rseg function is finally called for allocation; other cases call get_next_redo_rseg.

The rollback segment is actually a undo file organization. Each rollback segment maintains a segment header page (segment header). The page is divided into 1024 slot (TRX_RSEG_N_SLOTS), and each slot corresponds to a undo log object.

In theory, InnoDB supports up to 96 (12832 / * temp-tablespace * /) * 1024 general transactions.

However, if it is a temporary table transaction, you may need to allocate an additional slot (the system tablespace of the temporary table).

The read-only phase is assigned to temporary tables and is allocated in the system tablespace of temporary tables

The read and write phase is allocated in the undo tablespace

Assign undo log

Insert data is visible only to the current transaction or after commit, so insert's undo log can be released after the transaction commit.

Update/delete 's undo records are usually used to maintain old version records and provide services for queries; they can be released only if no view in the trx_sys needs to access that old version's data.

InnoDB assigns different undo slot to insert and update/delete

The undo slot of insert is recorded in trx- > rsegs.m_redo.insert_undo, which is assigned by calling trx_undo_assign_undo

The undo slot of update is recorded in trx- > rsegs.m_redo.undate_undo, which is assigned by calling trx_undo_assign_undo

Trx_undo_assign_undo

i. Check whether the cached queue has cached undo log (the in-memory data structure is trx_undo_t)

If present, remove the undo log from the cached queue

The logic of reuse:

A.insert undo: reinitialize the header information (trx_undo_insert_header_reuse) of undo page and record a MLOG_UNDO_HDR_REUSE log in redo log

B.update undo: assign a new undo header (trx_undo_header_create) on undo page's header and record a MLOG_UNDO_HDR_CREATE log in redo log

Reserve xid space

Reinitialize undo (trx_undo_mem_init_for_reuse) to set undo- > state to TRX_UNDO_ACTIVE and write undo- > state to the TRX_UNDO_SEG_HDR+TRX_UNDO_STATE location of the first undo page

Note 1:TRX_UNDO_SEG_HDR indicates that there is an one-to-one correspondence between the segment header initiation offset note 2:undo segment and the transaction trx, and the TRX_UNDO_STATE of the undo segment header corresponds to the current state of the transaction.

The following figure (quoted from the first article)

Undo segment is a separate segment, and each undo segment contains 1 header page (the first undo page) and several undo page that record undo logs.

The first undo page stores meta-information: the first is the meta-information of undo page, located between TRX_UNDO_PAGE_HDR and TRX_UNDO_SEG_HDR.

TRX_UNDO_PAGE_START: point to the first undo log TRX_UNDO_PAGE_FREE in page: point to the location of the next undo log in the page to write to TRX_UNDO_PAGE_NODE:undo segment all page form a two-way linked list, the TRX_UNDO_PAGE_NODE field of each page is used as the connector, and the TRX_UNDO_PAGE_LIST in the first undo page is used as the header

/ * undo page header * / # define TRX_UNDO_PAGE_HDR FSEG_PAGE_DATA # define TRX_UNDO_PAGE_TYPE 0 / *!

< TRX_UNDO_INSERT or TRX_UNDO_UPDATE */ #define TRX_UNDO_PAGE_START 2 /*!< Byte offset where the undo log records for the LATEST transaction start on this page (remember that in an update undo log, the first page can contain several undo logs) */ #define TRX_UNDO_PAGE_FREE 4 /*!< On each page of the undo log this field contains the byte offset of the first free byte on the page */ #define TRX_UNDO_PAGE_NODE 6 /*!< The file list node in the chain of undo log pages */ /*-------------------------------------------------------------*/ #define TRX_UNDO_PAGE_HDR_SIZE (6 + FLST_NODE_SIZE) /*!< Size of the transaction undo log page header, in bytes */ 之后是undo segment的元信息，位于TRX_UNDO_SEG_HDR到TRX_UNDO_SEG_HDR+TRX_UNDO_SEG_HDR_SIZE TRX_UNDO_STATE：表示undo segment的状态，一个undo segment可以包含多个undo log，但至多只有1个active undo log，也就是最近的undo log TRX_UNDO_LAST_LOG：指向最近的undo log的header信息 TRX_UNDO_FSEG_HEADER：存储的是undo segment对应的file segment信息，在fseg_create_general中设置（4字节space id，4字节的page no，2字节的page offset) undo segment从buffer pool移除被persist到磁盘时，就写到file segment指定的位置上 #define TRX_UNDO_SEG_HDR (TRX_UNDO_PAGE_HDR + TRX_UNDO_PAGE_HDR_SIZE) #define TRX_UNDO_STATE 0 /*!< TRX_UNDO_ACTIVE, ... */ #define TRX_UNDO_LAST_LOG 2 /*!< Offset of the last undo log header on the segment header page, 0 if none */ #define TRX_UNDO_FSEG_HEADER 4 /*!< Header for the file segment which the undo log segment occupies */ #define TRX_UNDO_PAGE_LIST (4 + FSEG_HEADER_SIZE) /*!< Base node for the list of pages in the undo log segment; defined only on the undo log segment's first page */ /*-------------------------------------------------------------*/ /** Size of the undo log segment header */ #define TRX_UNDO_SEG_HDR_SIZE (4 + FSEG_HEADER_SIZE + FLST_BASE_NODE_SIZE) 再之后是undo log header信息，所有的undo log header都存储在第一个undo page上。 II. 从cached队列分配undo失败时，需要真正分配一个undo segment（trx_undo_seg_create）首先要从rseg分配一个slot（trx_rsegf_undo_find_free），每个rseg至多支持1024个slot。找到空slot返回index。如果当前rseg已满，trx_undo_seg_create返回DB_TOO_MANY_CONCURRENT_TRXS向上层报错，表示并发事务太多无法创建undo segment。然后在rseg对应的table space创建一个新的file segment，file segment信息记在segment header的TRX_UNDO_FSEG_HEADER（fseg_create_general）。 trx_undo_seg_create在创建file segment之后，把新创建segment的page no写到rseg对应slot上建立映射关系，并返回新创建segment的page。 file segment与undo segment的映射关系，还有rseg[slot]与file segment对应page的映射关系都是在trx_undo_seg_create绑定的。cached undo不会更新这两个映射关系。 III. trx_undo_seg_create返回的page上创建新的undo header；上层负责初始化trx_undo_t数据结构 trx_undo_create为新创建的undo header创建内存数据结构trx_undo_t（trx_undo_mem_create），把undo->

State is set to TRX_UNDO_ACTIVE.

iv. The assigned trx_undo_t will be added to the insert_undo_list or update_undo_list queue of the transaction

Write to undo log

After trx_undo_assign_undo allocates undo, you can write undo records to it. The page written comes from undo- > last_page_no, which is initially equal to hdr_page_no.

Update undo contains an important part: the pointer to the current rollback segment of the record is written to the undo log in order to maintain the historical data link of the record.

When read view needs to read the old version of the data, it starts to find the visible version of the data forward through the current rollback segment pointer in the record.

When the Undo log write is complete, build a new rollback segment pointer and return (trx_undo_build_roll_ptr), which is the DB_ROLL_PTR recorded by the clustered index.

The rollback segment pointer includes rseg- > id, the page no where the log resides, and the offset within the page, which needs to be recorded in the clustered index record. Here rseg- > id is used to determine rseg- > space. What is really used to locate the undo log is space, undo- > page,undo- > page_offset > triple.

Transaction prepare

Set undo- > state to TRX_UNDO_PREPARED and write this state to the (TRX_UNDO_SEG_HDR+TRX_UNDO_STATE) location of the first undo page.

In addition, the xid information is updated during the prepare phase.

Transaction commit

During the transaction commit phase, you need to set undo- > state to the completed state and add undo to the history list of undo segment. The undo header being committed is pointed to the first item of history list, which represents the most recent undo of the current transaction history list.

There are 3 kinds of undo- > state completion status, which are set in trx_undo_set_state_at_finish.

Undo occupies only one page, and the space used by the first undo page is less than 3 undo page 4 (TRX_UNDO_PAGE_REUSE_LIMIT): status is set to TRX_UNDO_CACHED

If 1 is not satisfied, if it is insert_undo (TRX_UNDO_INSERT): status is set to TRX_UNDO_TO_FREE

If 1 and 2 are not satisfied, the status is set to TRX_UNDO_TO_PURGE, indicating that undo may require purge thread cleanup

Cached undo will be added to the cached queue, which is the cached queue that trx_undo_assign_undo mentioned

After setting undo- > state, you need to write this state to the (TRX_UNDO_SEG_HDR+TRX_UNDO_STATE) location of the first undo page

Add undo to undo segment header's history list

The old version of Insert has no practical meaning, so insert undo can be released when transacting commit.

There is a cached strategy in trx_undo_set_state_at_finish. If there is only one undo page, and there is not enough space in undo page, pagesize's 3 pagesize 4 can be reuse. In fact, this is the case with most insert undo.

Update undo needs to maintain history list. Here I first mention trx- > no, which maintains the trx commit order of transactions, which, like the trx_id of transactions, is generated incrementally using max_trx_id.

In addition, purge_sys (the global data structure of purge) maintains a minimum heap, and each rollback segment inserts data into the minimum heap when the first transaction commits, in order to find the rollback segment with the smallest trx_no for purge. After processing one rseg each time, the trx_no of the next undo record is pushed into the smallest heap as the cursor of the rseg.

When commit a transaction, in the order of trx- > no, hang the current undo log of the transaction to the header of the undo segment history list and point to the nearest undo log of the transaction.

All the undo in the History list are committed transactions, and the undo log modified by the current firm is recorded here, arranged in the new-> old way, with the oldest undo log at the end.

Undo is added to the history list by using the TRX_UNDO_HISTORY_NODE of undo log as the connector and adding the TRX_RSEG_HISTORY of the first undo page.

In general, undo is added to the history list every time trx_purge_add_update_undo_to_history is called, and the history list size is updated only if the undo page cannot be reuse (it can be considered an optimization, the last update of history length).

After that, trx_purge_add_update_undo_to_history will update the TRX_UNDO_TRX_NO of undo log header to trx_no.

If undo- > del_marks is FALSE, this function also updates TRX_UNDO_DEL_MARKS (undo segment creation or reuse is initialized to TRUE), clarifying that this is not delete marker.

If you have the first transaction commit since the undo segment was created (or possibly since the last purge was completed), you also need to update some parameters related to the purge to point to where the next purge execution starts.

Old version of data purge

The old version of data can be deleted without any view access. Version 5.6 and above supports independent purge threads. Users can set the number of purge threads through the parameter Innodb_purge_threads.

There are two types of purge threads:

Coordinator thread:srv_purge_coordinator_thread, there is only 1 global

Worker thread:srv_worker_thread, the system has innodb_purge_threads-1

Coordinator thread is responsible for launching worker thread to participate in purge work.

The strategy for adding purge threads is that trx_sys- > rseg_history_len is larger than the last loop or rseg_history_len exceeds a certain threshold, and more worker thread needs to be introduced.

The strategy for reducing purge threads is that if you previously used multiple purge threads, trx_sys- > rseg_history_len did not get larger, and you may need to reduce worker thread.

Before doing purge, you first need to determine what the purge thread is going to do, that is, which undo log can be purged.

Purge also determines the scope of work through read view, which is called purge view. If the system has an active read view, select the oldest read view as the purge view.

If it does not exist, type snapshot to the status of the trx_sys. As a purge view, the trx_no of the undo log that can be purge must be less than the trx- > no of all committed transactions in the system.

Here, when transacting commit, the resulting trx- > no is added to the trx_sys- > serialisation_list linked list, which is sorted in ascending order trx- > no, that is, the trx commit order is maintained.

When InnoDB initializes, it initializes the purge_sys data structure, and one of the tasks is to create a purge graph.

This is a diagram of a total of three layers:

Layer 1 is the fork node

The second time is the thrd node (for purge thread)

Layer 3 is the node node (representing purge task)

All thrd nodes are linked to the fork- > thrs list; the fork address is stored in purge_sys- > query and can be accessed directly through purge_sys.

When executing purge, always iterate through the purge_sys- > query- > thrs linked list, assigning each purge thread a purge task (trx_purge_attach_undo_recs).

The calling path for resolving undo log is as follows:

Srv_purge_coordinator_thread-> srv_do_purge-> trx_purge-> trx_purge_attach_undo_recs-> trx_purge_fetch_next_rec-> trx_purge_get_next_rec

When purge_sys- > next_stored is FALSE, the rseg that rseg_iter is currently pointing to is invalid. You need to move the rseg_iter to the next valid rseg (TrxUndoRsegsIterator::set_next).

Purge_sys- > purge_queue maintains a minimum heap, and each time you pop the top element, you get the smallest rollback segment (TrxUndoRsegsIterator::set_next) of trx_no.

5.7The rollback segment,set_next of noredo that supports temporary tables will pop both redo rollback segment and noredo rollback segment into the purge_sys- > rseg_iter- > m_trx_undo_rsegs array, which is also implemented in TrxUndoRsegsIterator::set_next.

If no rollback segment requires purge, the purge_sys- > rseg setting to NULL,purge thread will trx_purge_choose_next_log.

In general, there is a rollback segment to deal with, purge_sys- > rseg is updated to purge_sys- > rseg_iter- > m_trx_undo_rsegs item 1 (up to 2 items).

The corresponding members in the purge_sys also need to be updated to point to the location (TrxUndoRsegsIterator::set_next) where the current rseg was last purge.

The del_ marks domain of update undo is normally TRUE, because all update/delete operations require the old value tag to be deleted.

If purge_sys- > rseg- > last_del_marks is FALSE, it means that this is a undo log of dummy and does not need to be physically deleted. In this case, set purge_sys- > offset to 0 and make a tag to indicate that the undo log does not need to be purged (trx_purge_read_undo_rec).

Normally, purge_sys- > rseg- > last_del_marks is TRUE. You can read undo log records (trx_purge_read_undo_rec) through rseg- > space, purge_sys- > hdr_page_no, purge_sys- > hdr_offset >.

And set the following four fields of purge_sys to undo log to record the corresponding information (trx_purge_read_undo_rec).

Purge_sys- > offset = offset; / * undo log record offset * / purge_sys- > page_no = page_no; / * undo log record pageno * / purge_sys- > iter.undo_no = undo_no; / * undo log record undo_no,trx internal undo sequence number * / purge_sys- > iter.undo_rseg_space = undo_rseg_space; / * undo log tablespace * /

In order to ensure that the above four domains of purge_sys must point to the next valid undo log, each time you read the undo log, you will read the next undo log, and update the above four fields to the information of the next undo log (trx_purge_get_next_rec).

If it is dummy undo,trx_purge_get_next_rec, it will read prev_undo (trx_purge_rseg_get_next_history_log) and update the next purge information in rseg with prev_log information.

After that, the rseg- > last_trx_no will be pressed into the minimum heap, and the rseg will be processed later. Then call trx_purge_choose_next_log to select the next processed rseg and read the first undo log (trx_purge_get_next_rec).

In this way, there is a big loop in reading undo log,trx_purge_attach_undo_recs one by one. After each call to trx_purge_fetch_next_rec to read a undo log, it is stored in the node- > undo_recs array of the purge node (the third node of purge graph), and the loop switches to the next thr (purge thread) next time.

The end condition of the loop is:

There is no new undo log

The processed undo log reaches batch size (usually 300)

When the end of the loop condition is reached, trx_purge_attach_undo_recs returns. If n_purge_threads > 1 (requires the worker thread to participate in the purge), the coordinator thread starts n_purge_threads-1 worker thread in round-robin mode.

The purge,coordinator thread invokes que_run_threads (in the trx_purge context) to process the purge task with or without the worker thread participating.

How to handle the purge task? Generally speaking, purge is to delete the entry of the marked delete marker.

The general process is as follows:

Srv_purge_coordinator_thread-> srv_do_purge-> trx_purge-> que_run_threads-> que_run_threads_low-> que_thr_step row_purge_step-> row_purge-> row_purge_record-> row_purge_del_mark-> row_purge_remove_sec_if_poss

The general principle of deletion is to delete the secondary index before deleting the clustered index (row_purge_del_mark).

Another situation is that the clustered index in-place is updated, but the order of records on the secondary index may change, and the update of the secondary index is always marked delete + insert, so it is necessary to check whether the record order of the secondary index has changed according to the rollback segment record and perform a cleanup operation (row_purge_upd_exist_or_extern).

As mentioned earlier, when you are in parse undo log, you may encounter dummy undo log. When you return to row_purge execution, you need to determine whether it is dummy undo, and if so, do nothing.

Truncate undo space

After trx_purge has finished processing a batch (usually 300), it calls trx_purge_truncate_historypurge_sys to try to release undo log (trx_purge_truncate_rseg_history) for each rseg.

The general process is to remove each purge-passed undo log from the history list, and if all the undo log in the undo segment is released, you can try to release the undo segment, where the implicit release of the file segment achieves the purpose of freeing storage space.

Thank you for reading this article carefully. I hope the article "sample Analysis of the Internal Mechanism of InnoDB in MySQL" shared by the editor will be helpful to you. At the same time, I also hope that you will support and pay attention to the industry information channel. More related knowledge is waiting for you to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.