What is the crash-safe principle of MySQL?

This article explains the crash-safe principle of MySQL. The explanation is kept simple and clear, so it should be easy to follow and understand.
As the most popular open-source relational database, MySQL has a key, foundational capability: it must guarantee that data is never lost. How is MySQL designed so that, no matter when it crashes, data is not lost after recovery? What key technologies support this capability? This article reveals them one by one.
I. Preface
MySQL's guarantee that data is not lost shows up in two ways:
The database can be restored to any point in time.
No matter when MySQL suddenly crashes, transactions committed before the crash are not lost after restart.
For the first point, restoring MySQL to any point in time: as many people know, as long as enough binlog has been retained, this can be achieved by replaying the binlog.
The second capability is the crash-safe of this article's title: in the InnoDB storage engine, MySQL may crash at any stage of the transaction commit process, yet after restart transaction integrity is guaranteed. Committed data is not lost, and incomplete, uncommitted transactions are rolled back automatically. This capability relies on the redo log and the undo log.
Because crash-safe is about preserving transaction integrity after a sudden crash during transaction execution, before explaining the underlying principles we should first understand the key stages of MySQL transaction execution, and then analyze each stage. Let's take the execution of an update statement as an example (a hypothetical schema for the running example is sketched right after the step list below).
The execution of an update statement in MySQL can be briefly summarized as follows:
Find the data record in memory and update it
Record the change to the data page in the redo log
Record the logical operation in the binlog
Data pages and logs held in memory are flushed to disk asynchronously by background threads when the flush rules are triggered
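To make the walk-through concrete, here is a minimal, hypothetical setup for the running example (the table T and the row with id=2 are assumptions made for illustration, not taken from a real workload):

```sql
-- Hypothetical example table used throughout this article
CREATE TABLE T (
    id INT PRIMARY KEY,
    c  INT
) ENGINE = InnoDB;

INSERT INTO T VALUES (2, 0);

-- The update statement whose execution we trace
UPDATE T SET c = c + 1 WHERE id = 2;
```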
The above shows the detailed execution of an update statement. Next, we analyze the design principles behind crash-safe in question-and-answer form.
II. WAL mechanism
Question: why not modify the data on disk directly? Why go through the complicated process of changing it in memory, writing a log, and only then flushing to disk?
As many readers can guess, the main reason MySQL does not write directly to the data files on disk is performance. Writing the data files directly means random I/O: the overhead is high and the throughput low, which cannot meet MySQL's performance requirements. That is why MySQL is designed to change the data in memory first and flush it to disk asynchronously. Memory alone, however, is unreliable: if the server loses power before the pages are flushed, the in-memory data is gone. So a log-writing step is added, and after a power failure and restart, the data can be recovered from the log records.
Although the log is also written to disk, it is written sequentially, which costs far less than random writes and therefore improves statement execution performance. (As for why sequential writing is faster than random writing: think of a notebook. Writing page after page in order is much faster than hunting for the right page before writing each word.)
This is the WAL (Write-Ahead Logging) technique used by most storage systems, also known as log-first: changes must be recorded in the log before the data files are modified. It guarantees data consistency and durability while improving statement execution performance.
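As a concrete knob, MySQL exposes when InnoDB writes and fsyncs the redo log at commit through the server variable innodb_flush_log_at_trx_commit; a minimal inspection sketch:

```sql
-- innodb_flush_log_at_trx_commit controls redo-log flushing at commit:
--   1 (default): write and fsync the redo log at every commit, full crash safety
--   0 or 2: flush roughly once per second, trading durability for speed
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- keep the crash-safe setting
```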
III. The core log module
Question: executing an update SQL statement writes three logs in total. Are all three really needed, or can the process be simplified?
Updating a row involves three core logs in MySQL's log module: the redo log, the undo log (rollback log), and the binlog (archive log). As a preview: the crash-safe capability relies mainly on these three logs.
Next, we describe the role of each log in turn and then evaluate whether any of them can be dropped.
1. Redo log
The redo log, also known as the transaction log, is generated by the InnoDB storage engine layer. It records the modifications made to each physical page in the database, not how individual rows change, and it can be used to restore the physical data pages after a commit. (A page can only be restored to its last committed state, because later changes overwrite earlier ones.)
The redo log is a typical application of the WAL technique described above. When a transaction commits a change, MySQL only modifies the corresponding data page in memory and records the change in the redo log; once that is done, the commit has succeeded, and updating the data files on disk is handled asynchronously by background threads. The redo log thus guarantees the consistency and durability of MySQL data (even if MySQL crashes before the pages are flushed, the change records in the redo log can be replayed after restart) and also improves statement execution performance (writing the redo log is sequential, which is significantly faster and more concurrent than randomly writing the data files). Clearly, the redo log is essential.
The redo log has a fixed size, so it is written in a circular fashion: writing starts at the beginning and wraps around at the end, like a ring. When the log is full, old records must be erased, but before erasing, the data pages corresponding to those records must have been flushed to disk. While the redo log is full and old records are being erased to make room for new ones, no new update requests can be accepted, which may cause MySQL to stall. For systems with high concurrency, it is therefore very important to size the redo log files properly!
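The redo-log ring is sized by real server variables; a minimal inspection sketch (note that MySQL 8.0.30+ replaces these two variables with a single capacity setting):

```sql
-- Total redo-log space = innodb_log_file_size * innodb_log_files_in_group
SHOW VARIABLES LIKE 'innodb_log_file_size';
SHOW VARIABLES LIKE 'innodb_log_files_in_group';
-- On MySQL 8.0.30+ use the consolidated knob instead:
-- SHOW VARIABLES LIKE 'innodb_redo_log_capacity';
```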
2. Undo log (rollback log)
As the name implies, the undo log mainly provides rollback, which guarantees the atomicity of transactions; it also serves another major purpose, multi-version concurrency control (MVCC). While data is being modified, a logical log opposite to the current operation is recorded in the undo log (think of it this way: when a record is deleted, a corresponding insert is recorded in the undo log, and vice versa; when a record is updated, a corresponding reverse update is recorded). If a transaction fails for any reason, the undo log can roll it back and preserve transaction integrity. So the undo log is also essential.
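A minimal sketch of the undo log's rollback function in action, using the hypothetical table T from earlier:

```sql
-- Uncommitted changes are undone via the undo log's inverse records
START TRANSACTION;
UPDATE T SET c = c + 1 WHERE id = 2;  -- the undo log records the inverse update
ROLLBACK;                             -- InnoDB replays the undo records
SELECT c FROM T WHERE id = 2;         -- the value is unchanged
```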
3. Binlog (archive log)
The binlog is generated at the MySQL server layer and does not belong to any storage engine. It mainly records the SQL statements that users execute against the database (except query statements). The binlog is called an archive log because, unlike the redo log, it does not cyclically erase earlier records; it keeps appending (records are cleaned up only when they expire). When a file exceeds the single-log maximum (1 GB by default, configurable via the max_binlog_size variable), logging continues in a new file. However, because the log may be transaction-based (for example with InnoDB tables), and a transaction absolutely must not be recorded across files, if the binlog file reaches the maximum while the current transaction has not yet committed, MySQL does not switch to a new file but keeps growing the current one. The value of max_binlog_size is therefore not necessarily equal to the actual binlog file size.
It is precisely this archiving property that makes the binlog the basis for master-slave replication and point-in-time database restore.
So back to the question: can the binlog be dropped? It depends on the scenario:
In master-slave mode, the binlog is necessary, because data synchronization on the slave depends on it.
In stand-alone mode, if point-in-time restore is not a concern, the binlog is not strictly necessary, because the redo log alone guarantees crash safety. Without the binlog, however, there is no way to roll the database back to an earlier point in time, so it is recommended to keep the binlog enabled at all times.
Based on the detailed explanation of the three logs above, we can answer the question: in master-slave mode, all three logs are necessary; in stand-alone mode, the binlog can be enabled case by case.
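A few statements for checking whether the binlog is enabled and how it is sized; a minimal inspection sketch:

```sql
SHOW VARIABLES LIKE 'log_bin';          -- ON if the binlog is enabled
SHOW VARIABLES LIKE 'max_binlog_size';  -- single-file cap, 1 GB by default
SHOW BINARY LOGS;                       -- list the binlog files kept so far
```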
IV. Two-phase commit
Question: why is the redo log written in two steps, with the binlog write sandwiched in between?
As shown above, the redo log determines the data on the master, while the binlog determines the data on the slaves. To keep master and slave data consistent, the redo log and the binlog must therefore agree; that is the premise.
Readers with development experience will recognize distributed transactions here. The redo log and binlog form a very typical distributed-transaction scenario: they are two independent participants, and keeping them consistent requires a distributed-transaction solution. Splitting the redo-log write into two steps is in fact an application of the two-phase commit protocol (2PC).
Simplifying the execution of the update statement, MySQL's two-phase commit works as follows: the transaction commit has two phases, that is, the write to the redo log is split into two steps, prepare and commit, with the binlog write in between.
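As a hedged illustration (the comments are a simplified sketch of the server's internal steps, not its exact code path), here is the running example annotated with the two-phase commit sequence:

```sql
BEGIN;
UPDATE T SET c = c + 1 WHERE id = 2;  -- modify the page in memory, write the undo log
COMMIT;
-- Internally, COMMIT performs roughly:
--   1. write the redo log record and mark it PREPARE, then fsync  (phase one)
--   2. write the transaction's events to the binlog, then fsync
--   3. mark the redo log record COMMIT                            (phase two)
```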
If you are wondering why two-phase commit is necessary, consider what would happen without it: write the redo log first and then the binlog, or the binlog first and then the redo log. Let's argue by contradiction.
Continue with the example update T set c=c+1 where id=2, assuming the initial value of c for id=2 is 0. Suppose the redo log has been written but the binlog has not when the MySQL process crashes and restarts. Since the redo log was written, the data is recovered through it after reboot, so the recovered value of c is 1. But because the binlog was never finished, this update was not recorded in the binlog at the time of the crash. Consequently, neither the current slave nor any temporary instance later restored from this binlog contains the update: there, the value of c is still 0, which differs from the original library.
By the same token, writing the binlog first and then the redo log also leads to inconsistency if the system crashes in between, so we will not repeat the argument.
Therefore, splitting the redo-log write into two steps, i.e., two-phase commit, keeps the contents of the redo log and the binlog consistent, and thereby keeps master and slave data consistent.
Although two-phase commit guarantees that the two logs agree for a single transaction, it cannot by itself guarantee the same commit order across multiple transactions. Consider the following example, where three transactions commit concurrently:
T1 (--prepare------binlog----------------------commit)
T2 (----prepare------binlog----commit)
T3 (------prepare------binlog------commit)
Parsing:
redo log prepare order: T1 -> T2 -> T3
binlog write order: T1 -> T2 -> T3
redo log commit order: T2 -> T3 -> T1
Conclusion: because the binlog write order differs from the redo-log commit order, the binlog and the redo log record different transaction commit orders, and the end result is master-slave inconsistency.
Therefore, on top of the two-phase commit process, a lock is needed to make commits atomic, so that with multiple transactions the two logs record the same commit order. Early versions of MySQL used the prepare_commit_mutex lock to serialize transaction commits: a transaction could enter prepare only after acquiring the lock, and released it only when its commit finished, at which point the next transaction could begin its prepare. Locking solves the ordering problem perfectly, but under high concurrency the contention for the lock hurts performance badly. And beyond lock contention there is an even bigger cost: every transaction commit performs two fsyncs (disk flushes), one for the redo log and one for the binlog. As everyone knows, writing to disk is expensive; an ordinary disk sustains only a few hundred such operations per second.
V. Group commit
Question: is there a better solution to the performance bottleneck of controlling transaction commit order with a lock in two-phase commit?
Of course there is. MySQL 5.6 introduced binlog group commit, or BLGC (Binary Log Group Commit). Its basic idea is to introduce queues that keep the InnoDB commit order consistent with the binlog order, group transactions together, and hand the binlog flush for a whole group to a single transaction, thereby achieving group commit. The process is as follows:
The first phase (prepare phase):
Hold the prepare_commit_mutex, write and fsync the redo log to disk, and set the transaction to the prepared state; then release the prepare_commit_mutex. Nothing is done to the binlog yet.
The second phase (commit phase) is split into three stages, each with its own queue and its own task:
Flush Stage (write the binlog from the cache to the file)
① Hold the Lock_log mutex [the leader holds it; followers wait]
② Take the batch of binlog in the queue (all transactions in the queue)
③ Write it to the binlog file
Sync Stage (fsync the binlog to disk)
① Release the Lock_log mutex and hold the Lock_sync mutex [leader holds; followers wait]
② Fsync the batch of binlog in one go (the fsync is the most time-consuming action; assume sync_binlog is 1)
Commit Stage (InnoDB commit; clean up undo information)
① Release the Lock_sync mutex and hold the Lock_commit mutex [leader holds; followers wait]
② Traverse the transactions in the queue and have InnoDB commit them one by one
③ Release the Lock_commit mutex
Each stage has its own queue. The first transaction in a queue is called the leader; the others are followers, and the leader controls the followers' behavior. Each queue is protected by its own mutex, and the queues are processed in order: only after flush completes can a group enter the sync-stage queue, and only after sync completes can it enter the commit-stage queue. The three stages can run concurrently, however: while one group of transactions is in the commit stage, newer transactions can already be in the flush stage. This achieves true group commit and greatly reduces disk IOPS consumption.
To briefly summarize why group commit performs better than the locked two-phase commit: although each queue in group commit is still protected by a lock, the lock granularity is much smaller; it is no longer the single big prepare_commit_mutex of the original two-phase commit, so lock contention drops sharply. In addition, group commit flushes the binlog in batches, which greatly reduces disk I/O compared with flushing one record at a time.
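Group-commit behavior can be observed and tuned through server variables available since MySQL 5.7; a minimal sketch:

```sql
SHOW VARIABLES LIKE 'sync_binlog';  -- 1 = fsync the binlog for every commit group
-- Wait up to N microseconds before syncing, so more transactions join a group:
SHOW VARIABLES LIKE 'binlog_group_commit_sync_delay';
-- ...but sync early once this many transactions have queued:
SHOW VARIABLES LIKE 'binlog_group_commit_sync_no_delay_count';
```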
VI. Data recovery process
Question: suppose the MySQL process suddenly crashes during a transaction commit. How is it ensured that no data is lost after restart?
After a restart, MySQL performs the following data recovery process before it starts serving requests:
In brief: after a crash and restart, MySQL checks the redo log for transactions that are complete and in the prepare state, then looks up the corresponding transaction in the binlog by XID (transaction ID). If it cannot be found, the transaction is rolled back; if it is found and is complete, the redo log entry is committed again, completing the transaction.
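The XID matching can be observed in the binlog itself: every committed transaction's events end with an Xid event. A minimal inspection sketch (the file name binlog.000001 is an assumption; pick a real one from SHOW BINARY LOGS):

```sql
SHOW BINARY LOGS;  -- find an actual binlog file name first
-- Crash recovery matches the Xid events below against the prepared
-- transactions found in the redo log.
SHOW BINLOG EVENTS IN 'binlog.000001' LIMIT 20;
```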
Next, following the transaction commit process, let's see how the data is recovered by the procedure above when MySQL crashes at each of the following moments.
Moment A (crash right after changing the data page in memory, before the redo log write has started):
The dirty pages in memory have not been flushed, and neither the redo log nor the binlog has been written; the transaction has not yet begun to commit, so crash recovery has nothing to do for this transaction.
Moment B (while writing the redo log, or after the redo log has been written and flushed to disk in the prepare state, but before the binlog write has started):
After recovery, MySQL checks whether the transaction's redo log is complete. If not, the transaction is rolled back via the undo log; if it is complete and in the prepare state, MySQL further checks whether the corresponding binlog transaction is complete, and since it is not, the transaction is likewise rolled back via the undo log.
Moment C (while writing the binlog, or after the binlog has been written and flushed to disk, but before the redo-log commit has started):
After recovery, just as at moment B, MySQL first finds the complete, prepared transaction in the redo log and then checks whether the corresponding binlog transaction is complete. If it is incomplete, the transaction is rolled back via the undo log; if it is complete, the redo log is committed again.
Moment D (while committing the redo log, or after the transaction has committed but before success has been reported to the client):
After recovery, essentially the same as at moment C: MySQL decides whether to roll back or re-commit based on the completeness of the transaction in the redo log and the binlog.
VII. Summary
At this point, the details of MySQL's crash-safe principle have been covered. A brief review:
First, the article introduced the write-ahead logging (WAL) technique: its definition, process, and purpose. WAL is a common design pattern by which most database systems achieve consistency and durability.
Then MySQL's log module was introduced in detail: the redo log, the undo log, and the binlog, along with two-phase commit and group commit.
Finally, the data recovery process was explained and verified against crashes at different moments.
Thank you for reading. That is the whole of the crash-safe principle of MySQL; after studying this article, you should have a deeper understanding of it, and the specifics are best verified in practice.