What are the features of the MySQL engine

2025-02-27 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article introduces how the MySQL engine handles write-ahead logging and dirty-page flushing. Many people run into these issues in practice, so let's walk through how each version deals with them. I hope you read it carefully and get something out of it!

Basic knowledge

If a user modifies data in the database, the engine must ensure that the log reaches disk before the data does. Once the log is durable, the operation can be acknowledged to the user; there is no need to guarantee that the corresponding data changes are also on disk at that moment. If the database crashes before the log reaches disk, the corresponding data modifications are rolled back; once the log is on disk, a crash can no longer lose those modifications. Note that although the operation can be acknowledged as soon as the log is durable, there is a small window between the log reaching disk and the success packet reaching the client, so even if the user never receives a success message, the modification may already have succeeded; in that case the user needs to query the database again to determine the current state.

Before write-ahead logging, the database only needed to flush the modified data pages back to disk. With this technique, it must write a log record in addition to the modified data, so the total amount written to disk increases. However, because the log is sequential and is usually staged in memory and flushed to disk in batches, the cost of writing it is small compared with the scattered random writes of data pages. Write-ahead logging leaves two problems to solve in engineering:
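The write-ahead rule above can be sketched in a few lines. This is a toy model with illustrative names (WalDemoDB, durable_log, and so on are not InnoDB's), showing only the invariant: a change is acknowledged after its log record is durable, so the data page itself may be flushed later because the log can replay it.

```python
class WalDemoDB:
    def __init__(self):
        self.durable_log = []      # log records already on "disk"
        self.durable_pages = {}    # page contents already on "disk"
        self.memory_pages = {}     # in-memory (possibly dirty) pages

    def modify(self, page, value):
        self.memory_pages[page] = value
        self.durable_log.append((page, value))  # the log goes down first
        return "ok"                              # only now is it safe to ack

    def crash_and_recover(self):
        self.memory_pages = dict(self.durable_pages)  # memory state is lost
        for page, value in self.durable_log:          # replay the log
            self.memory_pages[page] = value

db = WalDemoDB()
db.modify("A", 1)          # acked: the log record is durable
db.crash_and_recover()     # the dirty page was never flushed...
assert db.memory_pages["A"] == 1   # ...but the change survives via the log
```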

Log flushing. Since every change to the data must be logged, very high concurrency inevitably produces a large volume of log writes. For performance, the log is first written to a log buffer and then flushed to disk according to certain rules. The log buffer is limited in size, users keep producing log, and the database must keep flushing the buffered log to disk so the buffer can be reused: a classic producer-consumer model. Every modern database must face this problem head-on; under high concurrency it is bound to be a performance bottleneck and a hot spot for lock contention.
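The producer-consumer pattern described above can be sketched with a bounded buffer and a condition variable. All names here are illustrative, not InnoDB's: user threads append records, and a background flusher drains them to "disk" in batches.

```python
import threading

class LogBuffer:
    def __init__(self, capacity):
        self.buf, self.capacity = [], capacity
        self.disk = []
        self.cv = threading.Condition()
        self.closed = False

    def append(self, rec):            # producer: a user thread
        with self.cv:
            while len(self.buf) >= self.capacity:
                self.cv.wait()        # buffer full: wait for the flusher
            self.buf.append(rec)
            self.cv.notify_all()

    def flusher(self):                # consumer: background flush thread
        with self.cv:
            while not (self.closed and not self.buf):
                if self.buf:
                    self.disk.extend(self.buf)   # batched sequential write
                    self.buf.clear()
                    self.cv.notify_all()         # buffer reusable again
                else:
                    self.cv.wait()

    def close(self):
        with self.cv:
            self.closed = True
            self.cv.notify_all()

lb = LogBuffer(capacity=4)
t = threading.Thread(target=lb.flusher)
t.start()
for i in range(100):
    lb.append(i)       # producers block whenever the small buffer fills
lb.close()
t.join()
assert lb.disk == list(range(100))   # every record reached "disk", in order
```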

Data flushing. When the user receives a success response, the data itself may not yet be persistent; very likely the modified pages have not reached disk. The database therefore needs a mechanism for writing data back, technically called the dirty-page flushing algorithm. Dirty pages (pages modified in memory but not yet written to disk) are continuously generated and continuously flushed to disk, forming another producer-consumer model that affects database performance. If dirty pages have not been flushed when the database crashes abnormally, they must be recovered. The specific process: before accepting user requests, scan the log from the checkpoint (the data pages covered by log records before this point are guaranteed to be on disk), reapply the log to reconstruct the updates lost from memory, and finally flush the pages back to disk. A crucial question during normal operation is how the checkpoint is advanced: advance it too slowly and recovery takes too long, hurting availability; advance it too aggressively and flushing pressure becomes excessive, and mistakes can even lose data.
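The recovery step described here can be sketched as a replay loop. This is a hedged toy model (the log format and names are invented for illustration): every record at or after the checkpoint lsn is reapplied on top of what is already on disk.

```python
def recover(pages_on_disk, redo_log, checkpoint_lsn):
    """redo_log: list of (lsn, page, value); replay everything >= checkpoint."""
    pages = dict(pages_on_disk)
    for lsn, page, value in redo_log:
        if lsn >= checkpoint_lsn:
            pages[page] = value     # reapplying an already-flushed change is harmless
    return pages

log = [(100, "B", "b1"), (120, "A", "a1"), (150, "C", "c1")]
disk = {"A": "a0", "B": "b1", "C": "c0"}    # B was flushed before the crash
state = recover(disk, log, checkpoint_lsn=100)
assert state == {"A": "a1", "B": "b1", "C": "c1"}   # all updates recovered
```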

To solve these two problems, MySQL adopts the following mechanisms:

When a user thread generates log, it is first cached in a thread-private structure (the mini-transaction, mtr); only when an atomic operation (such as an index split or merge) completes is the log submitted to the global log buffer, whose size is configurable (innodb_log_buffer_size). When the thread's transaction finishes, the current configuration (innodb_flush_log_at_trx_commit) determines whether the log must be flushed from the buffer to disk.

When the log has been copied into the global log buffer, the dirty pages it modified are added to a global dirty-page list (the flush list). This list has one key property: it is ordered by the time of each page's first modification. For example, suppose pages A, B, and C are first modified at 09:00, 09:01, and 09:02 respectively; then on the list, A is at the front, B in the middle, and C at the end. Even if page A is modified again after 09:02, it stays ahead of B and C. Each data page has a field recording its earliest modification: oldest_modification. Its unit is not wall-clock time but the lsn, i.e. how many bytes of log have been written since the database was initialized. Because it only increases, it can be understood as a generalized notion of time: the lsn of log generated by data written earlier is always smaller than that of log written later. The pages on the flush list are thus sorted by oldest_modification from small to large, and flushing starts from the small end. The checkpoint is the smallest oldest_modification on the flush list, because this mechanism guarantees that all changes below that minimum have already been flushed to disk.

The crucial point here is the ordering of the flush list; if that ordering is broken and the database crashes abnormally, data is lost. For example, suppose the oldest_modification values of pages A, B, and C are 120, 100, and 150 respectively, but the flush list order is still A→B→C, with A at the front and C at the end. Page A is flushed to disk and the checkpoint advances to 120, but pages B and C have not been flushed. If the database crashes now, recovery after restart scans the log from checkpoint 120 and reapplies it. We will find that the changes to page C are recovered, but the changes to page B (lsn 100, below the checkpoint) are lost.
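The ordering invariant and the data-loss scenario above can be checked mechanically. This sketch (illustrative names, values taken from the example: A=120, B=100, C=150) computes which pages survive a crash: a page's changes survive if the page was flushed before the crash, or if its oldest_modification is at or above the checkpoint so its log is replayed.

```python
def surviving_pages(flush_order, oldest, flushed, checkpoint):
    """Pages whose changes survive a crash: either flushed beforehand,
    or replayable because oldest_modification >= checkpoint."""
    return {p for p in flush_order
            if p in flushed or oldest[p] >= checkpoint}

oldest = {"A": 120, "B": 100, "C": 150}

# Correct order: sorted by oldest_modification -> B, A, C.
# After flushing B, the checkpoint may advance to the minimum of the rest (120).
ok = surviving_pages(["B", "A", "C"], oldest, flushed={"B"}, checkpoint=120)
assert ok == {"A", "B", "C"}    # nothing is lost

# Broken order: A was put first and flushed, so the checkpoint jumps to 120.
bad = surviving_pages(["A", "B", "C"], oldest, flushed={"A"}, checkpoint=120)
assert "B" not in bad           # B (lsn 100 < 120) is lost
assert "C" in bad               # C (lsn 150 >= 120) is replayed from the log
```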

In the first point we mentioned the private mtr structure. It stores not only the log and dirty pages produced by the changes, but also the locks taken while those pages were modified. The locks can be released at the appropriate time (for example, once the log has been submitted and the dirty pages have been added to the flush list).

Next, we walk through the implementations of each version to analyze the specific details. Note that what follows assumes some familiarity with the MySQL source code and is aimed at MySQL kernel developers and experienced DBAs.

How MySQL version 5.1 handles it

Version 5.1 is an early version of MySQL, from when InnoDB was still a plug-in, so the design is relatively crude. The simplified pseudo code is as follows:

The log enters the global cache:

    mutex_enter(log_sys->mutex)
    copy local redo log to global log buffer
    mtr.start_lsn = log_sys->lsn
    mtr.end_lsn = log_sys->lsn + log_len + log_block_head_or_tail_len
    increase global lsn: log_sys->lsn, log_sys->buf_free
    for every lock in mtr
      if (lock == share lock)
        release share lock directly
      else if (lock == exclusive lock)
        if (lock page is dirty)
          if (page.oldest_modification == 0)  // this page is not in the flush list
            page.oldest_modification = mtr.start_lsn
            add to flush list                 // there is only one flush list
        release exclusive lock
    mutex_exit(log_sys->mutex)

The log is written to disk:

    mutex_enter(log_sys->mutex)
    log_sys->write_lsn = log_sys->lsn
    write log to log file
    mutex_exit(log_sys->mutex)

Update checkpoint:

    page = get_first_page(flush_list)
    checkpoint_lsn = page.oldest_modification
    write checkpoint_lsn to log file

Crash recovery:

    read checkpoint_lsn from log file
    start parsing and applying redo log from checkpoint_lsn

As the pseudo code shows, because copying the log into the global buffer happens inside the critical section, both the order of log copies and the order in which dirty pages enter the flush list are guaranteed. To get checkpoint_lsn, it suffices to read the oldest_modification of the first page on the flush list, and crash recovery simply scans from the recorded checkpoint. Under high concurrency, however, many threads need to copy their local logs into the global buffer, which makes the mutex a lock hotspot; writing the global log to the log file also takes the same lock, further increasing contention. In addition, the buffer pool has only a single flush list, which limits performance. This approach existed in early InnoDB code and is easy to understand, but it clearly does not scale on today's multi-core systems.

How MySQL versions 5.5 / 5.6 / 5.7 handle it

These three versions are the current mainstream MySQL versions. Many branches have made further optimizations on top of them, but the main processing logic has not changed much:

The log enters the global cache:

    mutex_enter(log_sys->mutex)
    copy local redo log to global log buffer
    mtr.start_lsn = log_sys->lsn
    mtr.end_lsn = log_sys->lsn + log_len + log_block_head_or_tail_len
    increase global lsn: log_sys->lsn, log_sys->buf_free
    mutex_enter(log_sys->log_flush_order_mutex)
    mutex_exit(log_sys->mutex)
    for every page in mtr
      if (lock == exclusive lock)
        if (page is dirty)
          if (page.oldest_modification == 0)  // this page is not in the flush list
            page.oldest_modification = mtr.start_lsn
            add to flush list according to its buffer pool instance
    mutex_exit(log_sys->log_flush_order_mutex)
    for every lock in mtr
      release all locks directly

The log is written to disk:

    mutex_enter(log_sys->mutex)
    log_sys->write_lsn = log_sys->lsn
    write log to log file
    mutex_exit(log_sys->mutex)

Update checkpoint:

    for every flush list:
      page = get_first_page(curr_flush_list)
      if (current_oldest_modification > page.oldest_modification)
        current_oldest_modification = page.oldest_modification
    checkpoint_lsn = current_oldest_modification
    write checkpoint_lsn to log file

Crash recovery:

    read checkpoint_lsn from log file
    start parsing and applying redo log from checkpoint_lsn

The most important optimization in the mainstream versions is the introduction of a second lock, log_sys->log_flush_order_mutex, alongside log_sys->mutex. Adding dirty pages to the flush lists no longer requires log_sys->mutex, only log_sys->log_flush_order_mutex, which shrinks the critical section of log_sys->mutex and reduces the hotspot. In addition, multiple flush lists are introduced to reduce the contention caused by a single list. The mainstream branches also make a number of other optimizations, such as:

Dual global log buffers. With only one global log buffer, user threads cannot copy log into it while it is being written to disk and must wait until the flush ends. With two buffers, one receives log submitted by users while the other flushes the earlier log, so user threads do not have to wait.
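The double-buffering idea can be sketched in a few lines (illustrative names; real InnoDB swaps buffers under a mutex as the later pseudo code shows): appenders always write to the active buffer, and a flush first switches the two so appends can continue while the old buffer drains.

```python
class DualLogBuffer:
    def __init__(self):
        self.active = []      # receives log from user threads
        self.disk = []

    def append(self, rec):
        self.active.append(rec)          # never blocked by a flush

    def flush(self):
        flushing, self.active = self.active, []   # switch buffers
        self.disk.extend(flushing)                # drain the old one to "disk"

buf = DualLogBuffer()
buf.append("r1")
buf.append("r2")
buf.flush()
buf.append("r3")     # could proceed even while the flush above was running
buf.flush()
assert buf.disk == ["r1", "r2", "r3"]
```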

Automatic log buffer expansion. If the log that currently needs to be copied exceeds half of the global log buffer, the buffer is automatically doubled. Note that once expanded, it never shrinks.

Log alignment. Early disks wrote atomically in 512-byte units, while most modern SSDs write atomically in 4K units. If a write is smaller than 4K, the device must first read the 4K block, modify it in memory, and write it back, which hurts performance. With the log-alignment optimization, the log can be flushed in units of the specified size, zero-padded when it falls short, which improves write efficiency. Here is pseudo code for the optimized log write:

    mutex_enter(log_sys->write_mutex)
    check if another thread has done the write for us
    mutex_enter(log_sys->mutex)
    calculate the range of log that needs to be written
    switch log buffer so that user threads can still copy log during writing
    mutex_exit(log_sys->mutex)
    align log to specified size if needed
    write log to log file
    log_sys->write_lsn = log_sys->lsn
    mutex_exit(log_sys->write_mutex)
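The "align log to specified size" step amounts to rounding a write up to the device's atomic write unit and zero-filling the tail. A minimal sketch, assuming a 4K block size (the function name is invented for illustration):

```python
BLOCK = 4096   # atomic write size of most modern SSDs

def align_log_write(payload: bytes) -> bytes:
    """Round the write up to a multiple of BLOCK, zero-filling the tail,
    so the device never has to read-modify-write a partial block."""
    pad = (-len(payload)) % BLOCK
    return payload + b"\x00" * pad

out = align_log_write(b"x" * 5000)
assert len(out) == 8192              # 5000 bytes rounded up to two blocks
assert out[5000:] == b"\x00" * 3192  # the tail is zero padding
```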

You can see that the critical section of log_sys->mutex shrinks further: log_sys->mutex no longer protects the write to the log file. With these optimizations, MySQL's logging subsystem does not become a bottleneck in most scenarios. However, the two operations of copying user threads' logs into the global log buffer and adding dirty pages to the flush lists are still lock-based, so it is hard to exploit the full performance of multi-core systems.

How MySQL version 8.0 handles it

Although the previous versions made many optimizations, none of them was truly lock-free, and under high concurrency many lock conflicts are still visible. So the MySQL developers overhauled this area completely; for details, refer to last month's monthly report. In brief: during the log write phase, space is reserved by advancing an atomic variable, and since an atomic increment needs no lock, this step is lock-free. Once space is reserved, the log can be copied; because each thread's range was reserved in the previous step, multiple threads can copy concurrently without their logs overlapping. However, there is no guarantee about the order in which copies complete: a later copy may finish first, so a mechanism is needed to establish that all log before a certain point has been copied into the global log buffer. For this, MySQL 8.0 introduces a new lock-free data structure, Link_buf, an array used to mark copy completion. After each user thread finishes its copy, it marks it in the array; a background thread then computes whether a contiguous prefix of blocks has completed copying, and only then can that log be flushed to disk.

For inserting dirty pages into the flush lists, MySQL 8.0 also proposes an interesting algorithm, again based on the lock-free Link_buf. The basic idea is that the ordering of the flush list may be partially relaxed: it can be out of order within a bounded range while remaining ordered overall, and the degree of disorder is controlled. Suppose the oldest_modification of the first page on the flush list is A. In previous versions, the oldest_modification of every subsequent page is strictly greater than or equal to A; that is, no page is older than the first one. In MySQL 8.0, a subsequent page's oldest_modification may be smaller than A, but must be greater than or equal to A - L, where L can be understood as the bound on the disorder and is a fixed value. The question then becomes: if the flush list is out of order, how is the checkpoint determined, or rather, from which checkpoint_lsn should recovery scan the log to guarantee no data is lost? The official solution is that the checkpoint is still determined by the oldest_modification of the first page on the flush list, but crash recovery starts scanning from checkpoint_lsn - L (this value may not fall on an mtr boundary, so it needs to be adjusted). So the link_buf data structure cleverly solves both problems: copying local logs into the global log buffer and inserting dirty pages into the flush lists. Because both algorithms are lock-free, scalability is better. However, in our actual tests the official implementation did not perform as expected, seemingly because too many condition-variable events are used; we will analyze the reasons further in the future.
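The Link_buf idea described above can be sketched as follows. This is a toy single-threaded model, not the actual lock-free MySQL 8.0 implementation: threads reserve lsn ranges, copy concurrently, then mark completion; the log can only be flushed up to the largest lsn below which every reserved copy has finished.

```python
class LinkBufSketch:
    def __init__(self):
        self.tail = 0        # everything below this lsn is known copy-complete
        self.done = set()    # finished but not yet absorbed ranges: (start, end)

    def add_link(self, start, end):
        """A thread reports that its copy of [start, end) has finished."""
        self.done.add((start, end))
        self._advance()

    def _advance(self):
        # slide the tail forward over contiguous completed ranges
        changed = True
        while changed:
            changed = False
            for (s, e) in list(self.done):
                if s == self.tail:
                    self.tail = e
                    self.done.remove((s, e))
                    changed = True

lb = LinkBufSketch()
lb.add_link(10, 20)      # the copy of [10, 20) finished first...
assert lb.tail == 0      # ...but [0, 10) is still in flight, so nothing flushable
lb.add_link(0, 10)
assert lb.tail == 20     # now the whole prefix is copy-complete
```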

How POLARDB for MySQL handles it

POLARDB is Aliyun's next-generation relational cloud database, and we have naturally made many optimizations in the InnoDB log subsystem, including in the areas above. Here is a brief introduction to our approach:

Each buffer pool instance gets an additional read-write lock (rw_lock), used mainly to control access to the global log buffer. In addition, two sets of dirty-page information are introduced, referred to here as the in-flight set and the ready-to-process set; they temporarily hold dirty-page information.

The log enters the global cache:

    release all share locks held by this mtr's pages
    acquire log_buf s-locks for all buf_pool instances for which we have dirty pages
    reserve enough space on log_buf via increasing atomic variables  // just like MySQL 8.0
    copy local log to global log buffer
    add all pages dirtied by this mtr to the in-flight set
    release all exclusive locks held by this mtr's pages
    release log_buf s-locks for all buf_pool instances

The log is written to disk:

    mutex_enter(log_sys->write_mutex)
    check if another thread has done the write for us
    mutex_enter(log_sys->mutex)
    acquire log_buf x-locks for all buf_pool instances
    update log_sys->lsn to newest
    switch log buffer so that user threads can still copy log during writing
    mutex_exit(log_sys->mutex)
    release log_buf x-locks for all buf_pool instances
    align log to specified size if needed
    write log to log file
    log_sys->write_lsn = log_sys->lsn
    mutex_exit(log_sys->write_mutex)

Dirty-page flushing threads (one per buffer pool instance):

    acquire log_buf x-lock for the specific buffer pool instance
    toggle the in-flight set with the ready-to-process set  // only this thread toggles the two
    release log_buf x-lock for the specific buffer pool instance
    for each page in the ready-to-process set
      add page to flush list
    do normal flush page operations

Update checkpoint:

    for every flush list:
      acquire log_buf x-lock for the specific buffer pool instance
      ready_to_process_lsn = minimum oldest_modification in the ready-to-process set
      flush_list_lsn = get_first_page(curr_flush_list).oldest_modification
      min_lsn = min(ready_to_process_lsn, flush_list_lsn)
      release log_buf x-lock for the specific buffer pool instance
      if (current_oldest_modification > min_lsn)
        current_oldest_modification = min_lsn
    checkpoint_lsn = current_oldest_modification
    write checkpoint_lsn to log file

Crash recovery:

    read checkpoint_lsn from log file
    start parsing and applying redo log from checkpoint_lsn

For copying local logs into the global log buffer, POLARDB is similar to official MySQL 8.0: we first allocate space with an atomic increment. But while MySQL 8.0 uses link_buf to establish copy completion, POLARDB uses a read-write lock mechanism: each thread takes a shared lock before copying and releases it after copying, and before the log is written to disk the writer takes the exclusive lock. By the mutual exclusion of the write lock against the read locks, acquiring the write lock guarantees that all read locks have been released, i.e. that all copies are complete.

For adding dirty pages to the flush lists, official MySQL allows the flush list to be out of order to a bounded degree (also guaranteed by link_buf) and ensures consistency by scanning from checkpoint_lsn - L during recovery. In POLARDB, our solution is to first add dirty pages to a temporary set; before the flushing thread works, it takes the write lock to ensure the whole set is complete, then adds the pages to the flush list in order. In other words, if the smallest oldest_modification in the dirty-page set is A, it is guaranteed that any dirty page not yet added to the set has an oldest_modification greater than or equal to A. We do not lock the move from the dirty-page set to the flush list, so when computing the checkpoint we use min(ready_to_process_lsn, flush_list_lsn) as checkpoint_lsn. Crash recovery then scans directly from checkpoint_lsn. On top of this, POLARDB makes additional optimizations:
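The read-write-lock idea described above can be sketched with a reader count and a condition variable (a stand-in for a real rw-lock; all names are illustrative, not POLARDB's code): each copier holds the "shared" side while copying, and the log writer's "exclusive" acquisition waits until every copy has finished.

```python
import threading

class CopyTracker:
    def __init__(self):
        self.cv = threading.Condition()
        self.readers = 0         # copies currently in progress (s-lock holders)
        self.copied = []

    def copy(self, rec):
        with self.cv:
            self.readers += 1    # take the s-lock: a copy is in progress
        self.copied.append(rec)  # copies from different threads run concurrently
        with self.cv:
            self.readers -= 1    # release the s-lock
            self.cv.notify_all()

    def write_to_disk(self):
        with self.cv:            # take the x-lock: wait out all copiers
            while self.readers > 0:
                self.cv.wait()
            return list(self.copied)   # safe: every reserved copy is complete

tr = CopyTracker()
threads = [threading.Thread(target=tr.copy, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(tr.write_to_disk()) == list(range(8))   # all copies visible
```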

Release pages' shared locks early. If a page was only taken with a shared lock, it was merely read, not modified, so we can release the lock ahead of time, which helps reduce lock conflicts on hot pages.

When log enters the global buffer, we do not update log_sys->lsn immediately; instead we update another variable, and only update log_sys->lsn before the log is written to disk, i.e. after acquiring the log_buf write lock. This mainly reduces conflicts.

Finally, we tested performance: under non_index_updates's fully in-memory high-concurrency test, performance improved by about 10%.

Upstream 5.6.40: 71K
MySQL 8.0: 132K
PolarDB (master): 162K
PolarDB (master + mtr_optimize): 178K
