How to understand MySQL persistence and rollback

2025-07-08 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how MySQL guarantees persistence and supports rollback, focusing on the redo log and the undo log of the InnoDB engine.

Redo log

Transaction support is one of the important features that distinguish a database from a file system. Transactions have four major characteristics:

Atomicity: the operations in a transaction are either all done or all not done; they are indivisible.

Consistency: the database moves from one consistent state to another. For example, when A transfers 500 to B, A ends up with 500 less and B ends up with 500 more, but the sum A + B remains the same.

Isolation: transactions are isolated from each other and do not interfere with one another.

Persistence (durability, the D in ACID): once a transaction is committed, its changes to the data are permanent.

This article focuses on persistence-related knowledge.

When we update a record in a transaction, such as:

UPDATE user SET age=11 WHERE user_id=1;

The process goes something like this:

First determine whether the page holding the user_id=1 row is in memory; if not, read it from disk and load it into memory.

Modify the age in memory to 11.

Write the redo log; the redo log record is in the prepare state.

Write the binlog.

Commit the transaction; the redo log record moves to the commit state.

There are several key questions here: what is the redo log? Why is it needed? What is a redo log record in the prepare state? Can we keep only one of the redo log and the binlog? With these questions in mind, let's look at the redo log.

Why update the in-memory data first, not the disk data directly?

Why not update the data on disk directly on every change? Disk I/O is slow and memory is fast; the two differ by orders of magnitude. Indexes exist to cope with slow disk I/O: even with millions of rows on disk, an index lets us locate a row quickly. But an index is not immutable and has to be maintained. When we insert a new row A that must be placed before an existing row B, B has to be moved to make room for A, which has some overhead.

Worse, the page the row should go into may be full, so a new page has to be allocated and some of the rows moved onto it. This is called a page split, and it is even more expensive. If every SQL change modified the data on disk directly and ran into these cases, it would be very inefficient and in bad cases could cause timeouts. This is why the update process above first loads the data page into memory and updates it there. In MySQL, every change first updates the data in the buffer pool, and the dirty pages in the buffer pool are then flushed to disk at a certain frequency (the checkpoint mechanism). The buffer pool bridges the speed gap between CPU and disk so that overall performance does not degrade too quickly.

Why do I need redo log?

The buffer pool bridges the gap between CPU and disk, and the checkpoint mechanism ensures that the data eventually reaches the disk. However, a checkpoint is not triggered on every change; the master thread runs it at intervals. So in the worst case the database goes down right after the buffer pool has been updated, and that change is lost and cannot be recovered. That would not satisfy the D in ACID. To solve this persistence problem, transactions in the InnoDB engine use WAL (Write-Ahead Logging): write the log first, then write the data pages to disk. A transaction is considered successful only once its log has been written successfully. The log here is the redo log. If the database goes down before the data has been flushed to disk, the data can be restored from the redo log, which guarantees the D in ACID. This is the function of the redo log.
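The write-ahead idea can be sketched with a toy model. This is only an illustration under simplifying assumptions, not InnoDB's actual data structures: the class `ToyEngine` and its methods are hypothetical names, the "disk" and the "log" are plain Python containers, and only memory is lost in a crash.

```python
# Toy model of write-ahead logging (hypothetical names, not InnoDB internals).
class ToyEngine:
    def __init__(self):
        self.disk_pages = {}   # durable data pages
        self.redo_log = []     # durable, append-only log
        self.buffer_pool = {}  # volatile in-memory pages

    def commit_update(self, key, value):
        self.buffer_pool[key] = value       # 1. change the page in memory
        self.redo_log.append((key, value))  # 2. log the change before reporting success

    def checkpoint(self):
        self.disk_pages.update(self.buffer_pool)  # flush dirty pages to disk
        self.redo_log.clear()                     # these log records are no longer needed

    def crash(self):
        self.buffer_pool = {}  # memory is lost; disk pages and log survive

    def recover(self):
        for key, value in self.redo_log:  # replay the log in order
            self.disk_pages[key] = value
        self.buffer_pool = dict(self.disk_pages)


engine = ToyEngine()
engine.commit_update("user:1:age", 11)
engine.crash()      # the crash happens before any checkpoint
engine.recover()
print(engine.disk_pages)
```

Even though the dirty page never reached the disk, the committed change is present again after recovery because the log record was written first.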

How is redo log implemented?

The redo log is not written straight to disk either. It has its own buffer, the redo log buffer. When writing redo log, the InnoDB engine first writes to the redo log buffer and then flushes it to the actual redo log file at a certain frequency. The redo log buffer generally does not need to be very large; it is only a temporary container, and the master thread flushes it to the redo log file every second. So we only need to make sure that the redo log buffer can hold the changes made by transactions within 1 second. In MySQL 5.7.23, for example, the default is 16 MB.

mysql> show variables like '%innodb_log_buffer_size%';
+------------------------+----------+
| Variable_name          | Value    |
+------------------------+----------+
| innodb_log_buffer_size | 16777216 |
+------------------------+----------+

The 16 MB buffer is sufficient for most applications. The main strategies for synchronizing the buffer to the redo log are as follows:

The master thread flushes the buffer to the redo log every second.

The buffer is flushed to the redo log when each transaction commits.

When the free space in the buffer falls below 1/2, the buffer is flushed to the redo log.
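The three triggers listed above can be written as a single predicate. This is a minimal sketch: the function name `should_flush_log_buffer` and its parameters are hypothetical, and the real engine decides this inside its own threads rather than through one function.

```python
def should_flush_log_buffer(seconds_since_last_flush, transaction_committing,
                            free_bytes, buffer_size):
    """Return True if any of the three flush triggers applies."""
    if seconds_since_last_flush >= 1:   # periodic flush by the master thread
        return True
    if transaction_committing:          # flush when a transaction commits
        return True
    return free_bytes * 2 < buffer_size  # less than half of the buffer is free


BUFFER_SIZE = 16 * 1024 * 1024  # 16 MB default mentioned in the text
print(should_flush_log_buffer(0.2, False, 7 * 1024 * 1024, BUFFER_SIZE))
```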

Note that flushing the redo log buffer to the redo log does not actually put the data on disk; it only reaches the OS cache. This is an optimization modern operating systems make to speed up file writes, and the actual write to disk is left to the system to decide. This poses a problem for InnoDB: if the fsync is left to the operating system and the system goes down, the data is lost (although the probability of the whole system going down is relatively small). To deal with this, InnoDB provides the innodb_flush_log_at_trx_commit setting and lets the user decide which policy to use.

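mysql> show variables like 'innodb_flush_log_at_trx_commit';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_flush_log_at_trx_commit | 1     |
+--------------------------------+-------+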

0: at commit the redo log is neither written to the redo log file nor fsynced; it is written and fsynced about once per second.

1: the default; the redo log is written and fsynced synchronously every time a transaction commits.

2: at commit the redo log is written to the OS cache only; the fsync happens about once per second.

Comparing the three flush policies:

0 is the most efficient, but also the least safe: if the MySQL process crashes, up to about one second of committed transactions still sitting in the redo log buffer is lost. In this case the D in ACID cannot be satisfied.

2 is a compromise. In theory its I/O efficiency is higher than 1 and lower than 0, and its data safety is lower than 1 and higher than 0. A crash of the MySQL process alone loses nothing, because the log is already in the OS cache, but if the operating system goes down or power fails, the data in the OS cache is lost, so D is still not guaranteed.

1 is the default value. It guarantees D, since a committed transaction is not lost, but it is the least efficient. The default is recommended. Although the operating system goes down less often than the database does, transactions are being used precisely because the safety of the data matters.
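The trade-off between the three settings can be laid out as a small lookup. This sketch assumes the semantics given in the MySQL reference manual for innodb_flush_log_at_trx_commit (0 leaves the log in the log buffer at commit, 2 writes it to the OS cache, 1 also fsyncs it); the names `LOCATION_AT_COMMIT`, `LOST_ON` and `commit_survives` are hypothetical.

```python
# Where a transaction's redo records sit at the moment COMMIT returns,
# per value of innodb_flush_log_at_trx_commit.
LOCATION_AT_COMMIT = {
    0: "log buffer",  # written and fsynced about once per second
    1: "disk",        # written and fsynced at every commit
    2: "os cache",    # written at commit, fsynced about once per second
}

# Which locations each kind of failure wipes out.
LOST_ON = {
    "mysqld crash": {"log buffer"},
    "os crash": {"log buffer", "os cache"},
}


def commit_survives(policy, failure):
    """True if a just-committed transaction survives the given failure."""
    return LOCATION_AT_COMMIT[policy] not in LOST_ON[failure]


for policy in (0, 1, 2):
    print(policy, {failure: commit_survives(policy, failure) for failure in LOST_ON})
```

Only policy 1 survives both kinds of failure; policy 2 survives a crash of the MySQL process but not of the operating system; policy 0 survives neither for the most recent second of commits.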

The redo log records physical modifications to pages, in the form "offset y of page x is changed to value z", such as:

page (2, 4), offset 64, value 2

In the InnoDB engine, the redo log is stored in 512-byte units, and each unit is called a redo log block. If the redo log generated for a page is larger than 512 bytes, it is logically split across multiple blocks.

A redo log block consists of a log header, a log body, and a log tail. The log header takes up 12 bytes and the log tail takes 8 bytes, so the data a block can actually store is 512 - 12 - 8 = 492 bytes.
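The block arithmetic above can be checked in a few lines. The constant names and the helper `blocks_needed` are hypothetical; the sizes are the ones given in the text.

```python
BLOCK_SIZE = 512    # bytes per redo log block
HEADER_SIZE = 12    # log header
TRAILER_SIZE = 8    # log tail
PAYLOAD_SIZE = BLOCK_SIZE - HEADER_SIZE - TRAILER_SIZE  # 492 bytes of log body


def blocks_needed(log_bytes):
    """Number of blocks required to hold log_bytes of log body (ceiling division)."""
    return -(-log_bytes // PAYLOAD_SIZE)


print(PAYLOAD_SIZE, blocks_needed(492), blocks_needed(493), blocks_needed(1000))
```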

Multiple redo log blocks make up the redo log.

The default size of each redo log file is 48 MB:

mysql> show variables like 'innodb_log_file_size';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| innodb_log_file_size | 50331648 |
+----------------------+----------+

By default InnoDB uses 2 redo log files to form a log group, and it is this log group that does the actual work.

When ib_logfile0 is full, InnoDB writes to ib_logfile1; when ib_logfile1 is full, it goes back and overwrites ib_logfile0, and so on in a loop.
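The circular use of the two files amounts to modular arithmetic on the write position. This is a simplified sketch: the function `locate` is hypothetical, and it assumes the whole file is usable for log data, ignoring any per-file header space the real files reserve.

```python
LOG_FILE_SIZE = 48 * 1024 * 1024            # default innodb_log_file_size
LOG_FILES = ["ib_logfile0", "ib_logfile1"]  # default log group of 2 files


def locate(write_position):
    """Map an ever-growing byte position to (file name, offset within the file)."""
    group_size = LOG_FILE_SIZE * len(LOG_FILES)
    position_in_group = write_position % group_size  # wrap around at the end of the group
    file_index = position_in_group // LOG_FILE_SIZE
    return LOG_FILES[file_index], position_in_group % LOG_FILE_SIZE


print(locate(0))
print(locate(LOG_FILE_SIZE))
print(locate(2 * LOG_FILE_SIZE + 5))  # wrapped back into ib_logfile0
```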

Why is a block designed to be 512 bytes?

This is related to disk sectors. The default sector size of a mechanical disk is 512 bytes. If the data to be written is larger than 512 bytes, more than one sector has to be written, which involves rotating the disk and finding the next sector. Suppose two sectors A and B need to be written. If sector A is written successfully and sector B fails, the write is not atomic. If each write is 512 bytes, the same size as a sector, then each write is atomic.

Why two-phase commit?

As we saw above, committing a transaction first writes the redo log (prepare), then writes the binlog, and then commits (commit). Why is there a prepare step? Can't the redo log go straight to the commit state? Suppose the redo log commits directly and a crash occurs while the binlog is being written. The binlog then has no corresponding data, so every slave that relies on the binlog to replay data will be missing it, resulting in inconsistency between master and slave.

So the consistency of the redo log and the binlog has to be ensured through a two-phase commit (2PC). The specific steps are as follows: the redo log in the prepare state records the XID of the 2PC; the binlog, when it is written, also records the XID of the 2PC; finally the commit mark is written to the redo log.
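The three steps, and the states a crash can leave behind, can be sketched as follows. This is an illustrative model only: `two_phase_commit` and its `crash_after` parameter are hypothetical, the redo log is reduced to a dict from XID to state, and the binlog to a set of XIDs.

```python
def two_phase_commit(xid, redo_log, binlog, crash_after=None):
    """Run the three steps; return early to imitate a crash after the named step."""
    redo_log[xid] = "prepare"   # phase 1: redo log in prepare state, tagged with the XID
    if crash_after == "prepare":
        return
    binlog.add(xid)             # the binlog event carries the same XID
    if crash_after == "binlog":
        return
    redo_log[xid] = "commit"    # phase 2: mark the redo log as committed


redo_log, binlog = {}, set()
two_phase_commit(1, redo_log, binlog)                         # completes normally
two_phase_commit(2, redo_log, binlog, crash_after="binlog")   # binlog written, no commit mark
two_phase_commit(3, redo_log, binlog, crash_after="prepare")  # binlog never written
print(redo_log, binlog)
```

Transactions 2 and 3 both end in the prepare state; only the presence of the XID in the binlog tells them apart, which is what makes the shared XID necessary.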

Can we keep only one of the redo log and the binlog?

No. The size of the redo log is fixed; once it is full, writing starts again from the beginning and overwrites old data. Because the redo log cannot hold all the data, it cannot be used to synchronize data to a slave in master-slave mode, so the binlog is required. The binlog is generated by the server layer of MySQL and is independent of the storage engine. The binlog is also called the archive log: when one binlog file is full, writing continues in a new binlog file.

So is the binlog all we need? Can we do without the redo log? Of course not. The role of the redo log is to provide crash safety. A modification first changes the data page in the buffer pool, and at that point the modified data has not actually reached the disk. This is mainly because random reads and writes on disk are inefficient; the actual flushing to disk is handled by the master thread periodically, which has the advantage that several modifications can be written to disk at once.

This creates a problem: after the transaction commits, the data sits in a dirty page in the buffer pool and has not yet been flushed to disk. If the database crashes at that moment, the data of this commit cannot be restored even after the database comes back up, and the D in ACID is not satisfied. That is why the redo log exists. A transaction commit must ensure that the redo log is written successfully, and only then is the transaction considered committed. The redo log is mostly written to disk sequentially, so it is much more efficient. When a crash occurs after commit, the data can be recovered from the redo log, which is why the redo log is needed.

But a transaction commit also requires the binlog to be written successfully, so why can't the unflushed data be recovered from the binlog? Because the binlog does not know which data has already been flushed to disk, it does not know which data needs to be recovered. For the redo log, once the data has been flushed to disk the corresponding redo log records are released, so after a restart only the records remaining in the redo log need to be replayed.

How do you recover after crash?

From the two-phase commit we know that the redo log is marked with prepare or commit at each stage and that the XID of the transaction is recorded in both the redo log and the binlog. With this information, when the database restarts it first checks all transactions in the redo log. If a transaction's redo log is in the commit state, the crash happened after commit, and the data can be restored directly from the redo log. If the redo log is in the prepare state, the crash happened before commit, and the binlog decides the fate of the transaction. If the binlog contains the corresponding XID, the binlog was written successfully and only the commit step is missing, so the commit can be executed again. If the XID is not found in the binlog, the crash happened before the binlog was written, and the transaction should be rolled back.
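The recovery rules above reduce to a small decision function. This is a sketch of the decision logic only; `recovery_action` and its return strings are hypothetical names, not part of MySQL.

```python
def recovery_action(redo_state, xid_in_binlog):
    """Decide what to do with one transaction found in the redo log after a crash."""
    if redo_state == "commit":
        return "redo"  # crash came after commit: replay the redo log
    if redo_state == "prepare":
        # The binlog is the tie-breaker for prepared transactions.
        return "commit" if xid_in_binlog else "rollback"
    raise ValueError(f"unknown redo log state: {redo_state!r}")


for state, in_binlog in [("commit", True), ("prepare", True), ("prepare", False)]:
    print(state, in_binlog, "->", recovery_action(state, in_binlog))
```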

Undo log

The redo log guarantees transaction persistence, and the undo log guarantees transaction atomicity. Before a transaction updates data, the change is first recorded in an undo log.

Under what circumstances will undo log be generated?

The undo log serves two purposes: MVCC (multi-version concurrency control) and rollback. Here we mainly discuss rollback. When we insert, update, or delete data in a transaction, a corresponding undo log is generated. When we roll back, the undo log lets us return to the state at the start of the transaction. Note that a rollback does not restore the physical page to its earlier image; it logically reverses the changes. For example, suppose a transaction changes a piece of data from A to B while another transaction changes other data on the same page to C. If the rollback simply restored the page to its old image, C would be overwritten.

For the InnoDB engine, each row record has several hidden columns in addition to its own data:

DB_ROW_ID: if no primary key is explicitly defined for the table, and there is no unique index defined in the table, InnoDB automatically adds a hidden column of row_id as the primary key for the table.

DB_TRX_ID: each transaction allocates a transaction ID, and when a change occurs to a record, the transaction ID for that transaction is written to trx_id.

DB_ROLL_PTR: the rollback pointer, which is essentially a pointer to the undo log.

When we execute INSERT:

BEGIN; INSERT INTO user (name) VALUES ("tom");

Every inserted row generates an insert undo log, and the row's rollback pointer points to it. The undo log records its own sequence number (undo no) and the primary key columns and values of the inserted row, so on rollback the row can be deleted directly by its primary key.

Update operations generate an update undo log and fall into two cases: updates that change the primary key and updates that do not. Suppose you now execute:

UPDATE user SET name="Sun" WHERE id=1;

The old record is written to a new undo log whose undo no is 1, the row's rollback pointer is set to this new undo log, and the new undo log points to the older undo log (undo no=0).

Suppose you execute now:

UPDATE user SET id=2 WHERE id=1;

When the primary key is updated, the original row is first delete-marked rather than physically deleted; the actual deletion is decided later by the purge thread. Then a new row is inserted. The new row also generates an undo log, and the undo log sequence number keeps increasing.

As you can see, every change to the data generates an undo log. When a record is changed many times, there are multiple undo logs, each recording the state before its change, and the sequence number of each undo log increases. On rollback, the undo logs are applied backwards by sequence number until the original data is reached.

How does undo log roll back?

In the example above, if a rollback is executed, the process looks like this:

Delete the id=2 row using the undo no=3 log.

Reset the delete mark of the id=1 row to 0 using the undo no=2 log.

Restore the name of the id=1 row to tom using the undo no=1 log.

Delete the id=1 row using the undo no=0 log.
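The sequence of changes and the reverse-order rollback can be imitated with a toy model. This is only a sketch: the table is a dict, each undo record stores the inverse operation, and the helper names (`log_undo`, `insert`, `update_name`, `update_pk`, `rollback`) are hypothetical rather than anything InnoDB exposes.

```python
table = {}     # primary key -> {"name": ..., "deleted": bool}
undo_log = []  # list of (undo_no, action, args), oldest first


def log_undo(action, *args):
    undo_log.append((len(undo_log), action, args))  # undo no increases by one each time


def insert(pk, name):
    table[pk] = {"name": name, "deleted": False}
    log_undo("delete_row", pk)                      # undo of an insert: delete by primary key


def update_name(pk, name):
    log_undo("restore_name", pk, table[pk]["name"])  # remember the old value
    table[pk]["name"] = name


def update_pk(old_pk, new_pk):
    log_undo("clear_delete_mark", old_pk)
    table[old_pk]["deleted"] = True                 # delete-mark the old row, keep it in place
    insert(new_pk, table[old_pk]["name"])           # then insert the row under the new key


def rollback():
    steps = []
    while undo_log:                                 # apply undo records newest first
        undo_no, action, args = undo_log.pop()
        if action == "delete_row":
            del table[args[0]]
        elif action == "restore_name":
            table[args[0]]["name"] = args[1]
        elif action == "clear_delete_mark":
            table[args[0]]["deleted"] = False
        steps.append((undo_no, action))
    return steps


insert(1, "tom")
update_name(1, "Sun")
update_pk(1, 2)
steps = rollback()
print(steps, table)
```

The steps come out in the order undo no 3, 2, 1, 0, and the table ends up empty, as it was before the transaction began.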

Where is the undo log stored?

InnoDB manages undo logs in segments, called rollback segments. Each rollback segment records 1024 undo log segments, and the InnoDB engine supports 128 rollback segments by default.

mysql> show variables like 'innodb_undo_logs';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_undo_logs | 128   |
+------------------+-------+

So the maximum number of concurrent transactions that can be supported is 128 * 1024 = 131072. Each rollback segment is like an array of 1024 elements.

When we start a transaction and need to write undo log, we first look for a free slot in the rollback segment. When a slot is available, we request an undo page and then write the undo log into that page. We know that the default page size in MySQL is 16 KB.

mysql> show variables like '%innodb_page_size%';
+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| innodb_page_size | 16384 |
+------------------+-------+

So allocating a whole page per transaction is actually very wasteful (unless the transaction is very long). Suppose your application's TPS is 1000: it would need 1000 pages per second, about 16 MB of storage, and about 1 GB per minute. At that rate, unless MySQL cleans up very diligently, disk usage grows quickly over time and a lot of space is wasted.
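The estimate can be reproduced with a short calculation. It assumes, as the text does, one 16 KB undo page per transaction and no reuse or cleanup; the function name `undo_bytes` is hypothetical.

```python
PAGE_SIZE = 16384  # default innodb_page_size in bytes


def undo_bytes(tps, seconds):
    """Bytes of undo pages allocated if every transaction takes one whole page."""
    return tps * seconds * PAGE_SIZE


per_second = undo_bytes(1000, 1)
per_minute = undo_bytes(1000, 60)
print(per_second / 2**20, "MiB per second")
print(per_minute / 2**30, "GiB per minute")
```

The exact figures are a little under 16 MiB per second and a little under 1 GiB per minute, which matches the rounded numbers in the text.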

So undo pages are designed to be reused. When a transaction commits, its undo page is not deleted immediately. Because of reuse, an undo page may not be clean: it may hold undo logs from other transactions as well. After commit, the undo log is put on a linked list, and InnoDB then checks whether the used space of the undo page is less than 3/4. If it is, the page can be reused: it is not reclaimed, and the undo logs of other transactions can be appended after the existing ones. Because undo logs are scattered in this way, cleaning up the corresponding disk space is not very efficient.
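The reuse rule is a simple threshold check. This sketch takes the 3/4 threshold and the 16 KB page size from the text; the function `can_reuse_undo_page` is a hypothetical name and ignores any page header overhead.

```python
PAGE_SIZE = 16384  # default innodb_page_size in bytes


def can_reuse_undo_page(used_bytes, page_size=PAGE_SIZE):
    """True if less than 3/4 of the undo page is in use after commit."""
    return used_bytes * 4 < page_size * 3  # integer form of used / page_size < 3/4


print(can_reuse_undo_page(4000))   # lightly used page: kept for reuse
print(can_reuse_undo_page(13000))  # mostly full page: not reused
```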
