How does MySQL ensure that data is not lost

2025-02-24 Update From: SLTechnology News&Howtos


This article explains how MySQL ensures that data is not lost. The approach described here is simple and practical; let's work through it together.

Fig. 1 schematic diagram of two-phase commit

Looking at this diagram, you may ask: isn't this just the execution flow of an update statement? Why is it being described as a commit process? The confusion comes from mixing up two senses of "commit":

The commit statement in everyday usage is the MySQL command that commits a transaction; it is usually paired with begin or start transaction.

The "commit step" in our figure is the small, final step inside the transaction commit process. Once this step completes, the transaction is committed.

Executing the commit statement includes performing this commit step.

In our example no transaction is opened explicitly, so the update statement is itself a transaction, and this commit step is performed when that transaction commits after the statement executes.

Next, let's analyze what happens if MySQL restarts abnormally at different points during two-phase commit.

If a crash occurs at time A in the figure, that is, after the redo log has been written in the prepare phase but before the binlog is written, the transaction is rolled back during crash recovery, because the binlog has not been written and the redo log has not been committed. Since the binlog was never written, nothing is replicated to the standby either. This case is easy to understand.

But if the binlog has been fully written and the crash happens before the redo log commit, what does MySQL do when it recovers?

Let's first look at the rules MySQL applies during crash recovery.

1. If the transaction in the redo log is complete, that is, it already carries the commit mark, commit it directly.

2. If the transaction in the redo log has only a complete prepare, check whether the transaction's corresponding binlog exists and is complete:

a. if so, commit the transaction;

b. otherwise, roll back the transaction.

A crash at time B corresponds to case 2(a), so the transaction is committed during crash recovery.

Now let's extend the discussion with a series of follow-up questions.

Question 1: how does MySQL know that the binlog is complete?

Answer: a transaction's binlog has a well-defined complete format:

a statement-format binlog ends with a COMMIT;

a row-format binlog ends with an XID event.

In addition, MySQL 5.6.2 introduced the binlog_checksum parameter to verify the correctness of binlog content. If a binlog is corrupted in the middle, say by a disk problem, MySQL can detect the damage by verifying the checksum. So MySQL does have means to verify the integrity of a transaction's binlog.
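As a rough illustration (not MySQL's actual implementation), the completeness check can be modeled like this. The event names, the serialization, and the CRC32 checksum here are simplified stand-ins for the real binlog event format and binlog_checksum:

```python
import zlib

def binlog_txn_complete(events, fmt, expected_crc=None):
    """Toy model of checking that one transaction's binlog is complete.

    events: list of event-name strings in write order (simplified stand-ins).
    fmt: "statement" or "row".
    expected_crc: optional checksum over the serialized events, standing in
    for the binlog_checksum verification that catches torn or corrupted writes.
    """
    if not events:
        return False
    # A complete statement-format binlog ends with COMMIT;
    # a complete row-format binlog ends with an XID event.
    terminator = "COMMIT" if fmt == "statement" else "XID"
    if events[-1] != terminator:
        return False
    # If a checksum is recorded, the content must also verify.
    if expected_crc is not None:
        if zlib.crc32("|".join(events).encode()) != expected_crc:
            return False
    return True
```

In this model, a row-format transaction cut off before its XID event is treated as incomplete, which is exactly the case that gets rolled back during crash recovery.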

Question 2: how are the redo log and the binlog associated with each other?

Answer: they share a common data field called the XID. During crash recovery, the redo log is scanned sequentially:

if an entry has both a prepare and a commit, commit it directly;

if an entry has only a prepare and no commit, take its XID to the binlog and look up the corresponding transaction.
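Putting the recovery rules and the XID linkage together gives the following sketch. The data structures (dicts with "xid" and "state" keys, a set of complete-binlog XIDs) are invented for illustration; they are not InnoDB's internal representation:

```python
def crash_recover(redo_log, complete_binlog_xids):
    """Sketch of the crash-recovery decision rule.

    redo_log: list of entries like {"xid": 7, "state": "prepare"} or
    {"xid": 7, "state": "commit"}, scanned sequentially.
    complete_binlog_xids: set of XIDs whose binlog exists and is complete.
    """
    decisions = {}
    for entry in redo_log:
        xid, state = entry["xid"], entry["state"]
        if state == "commit":
            # Rule 1: prepare and commit both present -> commit directly.
            decisions[xid] = "commit"
        elif state == "prepare":
            # Rule 2: prepare only -> take the XID to the binlog and check.
            decisions[xid] = ("commit" if xid in complete_binlog_xids
                              else "rollback")
    return decisions
```

A crash at time B leaves the transaction in prepare with a complete binlog, so this rule commits it (case 2a); a crash at time A leaves prepare with no binlog, so it is rolled back.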

Question 3: a redo log in the prepare phase plus a complete binlog is enough for the transaction to be restored on restart. Why did MySQL design it this way?

Answer: this comes back to the consistency between the data and its backups that we discussed earlier. At time B, that is, after the binlog has been fully written, MySQL crashes; by then the binlog already exists, so it will be used by the standby (or by any database restored from this binlog).

Therefore this transaction must also be committed on the primary. With this strategy, the data on the primary and the standby stays consistent.

Question 4: if that is the case, why use two-phase commit at all? Why not simply finish writing the redo log first and then write the binlog, and at crash recovery require both logs to be complete? Isn't the logic the same?

Answer: two-phase commit is a classic distributed-systems problem; it is not unique to MySQL.

If one scenario must be cited to show why it is needed, it is transaction durability.

For the InnoDB engine, once the redo log commit has completed, the transaction can no longer be rolled back (if rollback were still allowed at this point, updates made by other transactions could be overwritten). So if the redo log were committed first and the binlog write then failed, InnoDB could not roll back, and the data would be inconsistent with the binlog.

The point of two-phase commit is to let everyone commit together only after each participant has said "I'm OK".

Question 5: without two logs there would be no need for two-phase commit. Couldn't the binlog alone support both crash recovery and archiving?

Answer: the suggestion here is to keep only the binlog and change the commit flow to: … -> "update data in memory" -> "write binlog" -> "commit transaction". Could that also provide crash recovery?

The answer is no.

One reason is historical: InnoDB is not MySQL's native storage engine. MySQL's native engine is MyISAM, which was not designed to support crash recovery in the first place.

InnoDB already provided crash recovery and transaction support before it joined the MySQL engine family as a plug-in.

When InnoDB was integrated into MySQL, the binlog had no crash-recovery capability, so it made sense to keep using InnoDB's existing redo log.

There are also implementation reasons, and many of them. Following the question's proposal, I drew a diagram of a crash-recovery process that uses only the binlog; there is no redo log here.

Figure 2 crash recovery supported only by binlog

Under this process, the binlog still cannot support crash recovery. Let me point out one capability it lacks: the binlog cannot restore a "data page".

Suppose MySQL crashes at the position marked in the diagram: binlog2 has been fully written, but the transaction as a whole has not yet committed.

After restart, the engine rolls back its internal transaction 2, and applying binlog2 can then repair it. But for transaction 1, the system already considers the commit complete, so binlog1 will not be applied again.

The InnoDB engine, however, uses WAL: a transaction counts as complete once memory and the log have been written, and if a crash happens later, the log is used to recover the data pages.

In other words, with a crash at this position in the figure, transaction 1 may also be lost, and the loss is at the data-page level. The binlog does not record the details of data-page updates, so it cannot make up for that loss.

You might ask: couldn't the binlog be optimized to record data-page changes too? It could, but that would in effect be building another redo log.

So, at least for now, the binlog is not capable of supporting crash recovery.

Question 6: could it be the other way around, keeping only the redo log and dropping the binlog?

Answer: from the standpoint of crash recovery alone, yes. You can turn off binlog; there is then no two-phase commit, but the system remains crash-safe.

However, if you look at how companies across the industry actually use MySQL, you will find binlog enabled on production databases, because binlog has capabilities the redo log cannot replace.

One is archiving. The redo log is written in a circular fashion: when it reaches the end, it wraps around and overwrites the beginning. It cannot retain history, so it cannot serve as an archive.

Another is that the MySQL ecosystem depends on binlog. Binlog has been a MySQL feature from the very beginning and is used in many places; in particular, binlog replication is the foundation of MySQL high availability.

Many companies also run heterogeneous systems (such as data-analysis systems) that keep themselves up to date by consuming MySQL's binlog. Turning binlog off would cut off the data feed to these downstream systems.

In short, because so many mechanisms, including MySQL high availability, depend on binlog, the redo log cannot yet take its place. You can see how important it is to grow an ecosystem.

Question 7: how large should the redo log generally be set?

Answer: if the redo log is too small, it fills up quickly and forces constant flushing, so the WAL mechanism cannot deliver its full benefit.

So with today's common multi-terabyte disks, don't be stingy: set the redo log to four files of 1GB each.

Question 8: on a normally running instance, when data is finally written to disk, does it come from the redo log or from the buffer pool?

Answer: this is actually a very good question, and it comes down to "what exactly does the redo log record?"

The redo log does not record the complete data of a data page, so it cannot update a disk data page by itself. There is no such thing as "the data that finally lands on disk is written there by the redo log".

On a normally running instance, once a data page in memory has been modified it differs from the page on disk; we call it a dirty page. When the data finally lands on disk, it is the in-memory data page that is written out, and this process has essentially nothing to do with the redo log.

In a crash-recovery scenario, if InnoDB judges that a data page may have lost updates, it reads the page into memory and lets the redo log bring the in-memory content up to date. After that, the page is dirty, and we are back to the first case.

Question 9: what is the redo log buffer? Do we modify memory first, or write the redo log file first?

During a transaction's execution, the log is written multiple times. For example, consider the following transaction:

begin;
insert into t1 ...
insert into t2 ...
commit;

This transaction inserts rows into two tables. While the rows are being inserted, the generated redo logs must be kept somewhere, but they cannot be written directly into the redo log file before commit.

That is what the redo log buffer is for: a piece of memory that holds the redo logs first. When the first insert executes, the data is modified in memory and the redo log is written into the redo log buffer.

Actually writing the log into the redo log files (named ib_logfile followed by a number) happens when the commit statement executes.

(To be precise, MySQL does not "actively flush to disk" while a transaction is executing, which avoids unnecessary IO. But "passive writes to disk" can still happen, for example when memory is tight or when another transaction commits.)

When a standalone update statement executes, InnoDB starts a transaction itself and commits when the statement finishes. The process is the same as above, just "compressed" into a single statement.
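The buffering behavior above can be sketched as a toy model. This is purely an illustration of "accumulate in memory, write to the file at commit"; it is not InnoDB's data structure:

```python
class RedoLogDemo:
    """Toy model: redo records accumulate in an in-memory buffer during
    the transaction and reach the log file only at commit."""

    def __init__(self):
        self.buffer = []    # the redo log buffer (memory)
        self.logfile = []   # stands in for ib_logfile0, ib_logfile1, ...

    def log(self, record):
        # Called as each row change is made: memory only, no file IO yet.
        self.buffer.append(record)

    def commit(self):
        # Executing commit writes the buffered records to the log file.
        self.logfile.extend(self.buffer)
        self.buffer.clear()

trx = RedoLogDemo()
trx.log("insert into t1 ...")
trx.log("insert into t2 ...")
before_commit = list(trx.logfile)   # nothing has reached the "file" yet
trx.commit()
after_commit = list(trx.logfile)    # both records land together at commit
```

If the process died before commit(), everything in buffer would vanish with it, and, as the answer above notes, nothing is lost because the transaction never committed.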

The questions above are a consolidated answer to what you have asked about the redo log and the binlog. If you have further questions, keep leaving them in the comments.

The full name of WAL is Write-Ahead Logging; its key point is to write the log before writing data to disk. Its core performance gain does indeed come from minimizing random disk writes.

Conclusion: as long as the redo log and the binlog are guaranteed to be persisted to disk, data can be recovered after an abnormal MySQL restart.

But what does the writing process of these logs actually look like, and how do we make sure they really reach disk?

The Writing Mechanism of binlog

The writing logic of the binlog is fairly simple: during a transaction's execution, the log is first written into the binlog cache, and at transaction commit the binlog cache is written into the binlog file.

A transaction's binlog cannot be split up: however large the transaction is, it must be written in one piece. This is why the binlog cache needs careful handling.

The system allocates each thread its own binlog cache, and the parameter binlog_cache_size limits how much memory a single thread's binlog cache may use. If a transaction exceeds this size, the excess is temporarily spilled to disk.

At transaction commit, the executor writes the complete transaction from the binlog cache into the binlog file and clears the binlog cache. This state is shown in figure 1.

Figure 1 binlog write disk status

As you can see, each thread has its own binlog cache, but they all share one binlog file.

The write in the figure puts the log into the file system's page cache without persisting it to disk, so it is relatively fast.

The fsync in the figure is what actually persists the data to disk. It is generally fsync that consumes the disk's IOPS.

The timing of write and fsync is controlled by the parameter sync_binlog:

when sync_binlog=0, every transaction commit only does write, never fsync;

when sync_binlog=1, every transaction commit does fsync;

when sync_binlog=N (N > 1), every transaction commit does write, but fsync happens only after N transactions have accumulated.

Therefore, in IO-bottlenecked scenarios, raising sync_binlog can improve performance. In real business settings, given the need to bound how many logs can be lost, setting it to 0 is generally not recommended; a value in the range 100 to 1000 is more common.

The risk of sync_binlog=N is that if the host restarts abnormally, the binlog of up to the last N transactions is lost.
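The three settings can be condensed into a small decision sketch. This is a simplified per-commit model for illustration; the real server's accounting interacts with group commit:

```python
def binlog_flush_plan(sync_binlog, commit_index):
    """For the commit_index-th committed transaction (1-based), return a
    (write, fsync) pair under a given sync_binlog setting. Toy model."""
    write = True  # every commit writes the binlog into the page cache
    if sync_binlog == 0:
        fsync = False                 # persistence is left to the OS
    elif sync_binlog == 1:
        fsync = True                  # fsync on every single commit
    else:
        # fsync only once N transactions have accumulated
        fsync = (commit_index % sync_binlog == 0)
    return write, fsync
```

With sync_binlog=100, only one commit in a hundred pays for an fsync, which is exactly why it helps when IO is the bottleneck, and exactly why up to 100 transactions' binlog can be lost on a host crash.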

The Writing Mechanism of redo log

Next, the writing mechanism of the redo log. As introduced above, during a transaction's execution the generated redo log is first written into the redo log buffer.

A natural follow-up question is whether the contents of the redo log buffer are persisted to disk every time a record is generated. The answer is no.

If MySQL restarts abnormally while the transaction is still executing, this part of the log is simply lost. Since the transaction has not committed, losing the log at this point loses nothing.

Then the opposite question: can some of the logs in the redo log buffer be persisted to disk before the transaction commits?

The answer is yes, that can happen.

To explain this, we need the three possible states of a redo log record, which correspond to the three colored blocks in figure 2.

Figure 3 redo log group commit

As the figure shows:

trx1 arrives first and is chosen as the leader of this group;

by the time trx1 begins writing to disk, three transactions are already in the group, and the LSN (log sequence number, which grows monotonically as redo log records are written) has reached 160;

when trx1 writes to disk it carries LSN=160, so by the time trx1 returns, all redo log records with LSN less than or equal to 160 have been persisted to disk;

at this point, trx2 and trx3 can return directly.

So in one group commit, the more members the group has, the more disk IOPS are saved. In a single-threaded stress test, though, each transaction can only perform its own persistence operation.

In a concurrent-update scenario, the later the leader calls fsync after the first transaction has written its redo log buffer, the more members the group may gather, and the more IOPS are saved.
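The trx1/trx2/trx3 example can be sketched as follows. This is a toy model of the mechanism (invented names and structures, not server code):

```python
def group_commit(pending):
    """pending: list of (txn_name, lsn) in arrival order. The first waiter
    becomes the leader; a single fsync persists every record with LSN up
    to the group's largest LSN, so the followers return without flushing."""
    leader, _ = pending[0]
    flush_upto = max(lsn for _, lsn in pending)   # e.g. 160 in the example
    finished = [name for name, lsn in pending if lsn <= flush_upto]
    return {"leader": leader, "flush_upto": flush_upto,
            "fsync_calls": 1, "finished": finished}
```

For example, group_commit([("trx1", 50), ("trx2", 120), ("trx3", 160)]) persists everything up to LSN 160 with a single fsync, and all three transactions return: three commits for the IOPS cost of one flush.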

To let each fsync carry more group members, MySQL uses an interesting optimization: procrastination. When introducing two-phase commit I drew a diagram for you; here is a refined version of it.

Figure 5 two-phase commit refinement

In this way, the binlog can also be group-committed. When step 4 in figure 5 fsyncs the binlog to disk, if several transactions' binlogs have already been written, they are persisted together, which also reduces IOPS consumption.

In practice, though, step 3 usually executes very quickly, so the interval between the binlog's write and fsync is short. Few binlogs can be grouped into one persistence, and binlog group commit is therefore usually less effective than redo log group commit.

If you want to improve the effect of binlog group commit, you can set binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count:

the binlog_group_commit_sync_delay parameter says how many microseconds to wait before calling fsync;

the binlog_group_commit_sync_no_delay_count parameter says how many transactions to accumulate before calling fsync.

The two conditions are in an OR relationship: fsync is called as soon as either one is satisfied.

So when binlog_group_commit_sync_delay is set to 0, binlog_group_commit_sync_no_delay_count has no effect either.
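The OR relationship between the two parameters can be written down directly. This is a schematic check, not the server's actual code path; the argument names are invented for illustration:

```python
def should_fsync(waited_us, queued_txns, sync_delay_us, no_delay_count):
    """fsync fires as soon as EITHER condition holds: the configured delay
    has elapsed, or enough transactions have queued up. Note that with
    sync_delay_us == 0 the first condition is satisfied immediately, so
    the count threshold never gets a chance to matter."""
    return waited_us >= sync_delay_us or queued_txns >= no_delay_count
```

This also makes the last point above concrete: should_fsync(0, 1, 0, 100) is already true at the first commit, so a delay of 0 renders the count setting irrelevant.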

Earlier, a reader asked in the comments: the WAL mechanism is supposed to reduce disk writes, yet every transaction commit has to write both the redo log and the binlog. Doesn't that mean more disk writing rather than less?

Now you can understand that the WAL mechanism mainly benefits from two aspects:

Both redo log and binlog are written sequentially, and sequential writing on disk is faster than random writing.

The group commit mechanism can greatly reduce the IOPS consumption of the disk.

At this point, let's answer this question: if your MySQL now has a performance bottleneck, and the bottleneck is on IO, what can you do to improve performance?

To solve this problem, you can consider the following three ways:

Set the binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count parameters to reduce the number of binlog flushes. This approach is based on "deliberate extra waiting", so it may increase statement response time, but it carries no risk of losing data.

Set sync_binlog to a value greater than 1 (100 to 1000 is common). The risk is losing the most recent binlog entries if the host loses power.

Set innodb_flush_log_at_trx_commit to 2. The risk is losing data if the host loses power.

I do not recommend setting innodb_flush_log_at_trx_commit to 0. With 0 the redo log lives only in memory, so even an abnormal restart of MySQL itself, without any host failure, loses data, which is too risky. Writing the redo log to the file system's page cache is also very fast, so setting the parameter to 2 performs about the same as 0, yet an abnormal MySQL restart will not lose data, making the risk smaller.

Summary

In articles 2 and 15 of the column, I analyzed with you how MySQL can guarantee crash safety provided the redo log and the binlog are complete. In today's article, I focused on how MySQL ensures that the redo log and the binlog are complete in the first place.

I hope these three articles give you a clearer understanding of the concept of crash-safe.

After the 15th Q&A article was published, some readers continued to leave log-related questions. To make review and study easier, I answer those questions again here in one place.

Question 1: after executing an update statement, I ran the hexdump command to inspect the contents of the ibd file directly. Why don't I see any change in the data?

Answer: this is likely the WAL mechanism at work. After the update statement executes, InnoDB only guarantees that the redo log and memory have been written; the data may not have been written to disk yet.

Question 2: why is the binlog cache maintained per thread, while the redo log buffer is shared globally?

Answer: the main reason is that the binlog cannot be interleaved. A transaction's binlog must be written contiguously, so it is collected per transaction and written into the file in one piece when the transaction completes.

The redo log has no such requirement; records can go into the redo log buffer as they are generated. Contents of the redo log buffer can also "hitch a ride", being written to disk when some other transaction commits.

Question 3: during transaction execution, before the commit stage, if a crash occurs the redo log is certainly lost. Does this lead to inconsistency between primary and standby?

Answer: no. At that point the binlog is still in the binlog cache and has not been sent to the standby. After the crash, both the redo log and the binlog are gone; from the business point of view the transaction was never committed, so the data is still consistent.

Question 4: suppose a crash occurs right after the binlog has been written to disk; MySQL restarts without replying to the client. When the client reconnects, it finds the transaction was committed successfully. Is that a bug?

Answer: no.

Imagine an even more extreme case: the whole transaction commits successfully, the redo log commit completes, and the standby receives and applies the binlog. But the network between the primary and the client drops, so the success packet never arrives and the client receives a "network disconnected" exception. This can only be counted as a successful transaction, not a bug.

In fact, what a database's crash-safe guarantee promises is:

if the client receives a transaction-success message, the transaction is definitely durable;

if the client receives a transaction-failure message (such as a primary key conflict or a rollback), the transaction definitely failed;

if the client receives an "execution exception" message, the application must reconnect and determine the state by querying before continuing. Here the database only needs to guarantee internal consistency (between the data and the logs, and between the primary and the standby).
