2025-01-17 Update From: SLTechnology News&Howtos > Database
Shulou(Shulou.com)06/01 Report--
Brief introduction
Xtrabackup is a free database hot-backup tool open-sourced by Percona. It provides non-blocking backups of InnoDB and XtraDB storage-engine databases and has the following advantages:
1) Fast and reliable physical backups
2) The backup process does not interrupt ongoing transactions
3) Saves disk space and network traffic through features such as compression
4) Automatic backup verification
5) Fast restore
6) Backups can be transferred to another machine
7) Backups add little load to the server
The principles of Xtrabackup hot backup and recovery are shown in the following figure:
When the backup starts, a background detection thread is launched to watch for changes to the MySQL redo log in real time. As soon as new log writes appear in the redo log, they are immediately recorded in the background log file xtrabackup_log. Xtrabackup then copies the InnoDB data files and the system tablespace file ibdata1. Once the copy finishes, it executes FLUSH TABLES WITH READ LOCK, copies the .frm, .MYI, .MYD and other files, and finally issues UNLOCK TABLES and stops writing xtrabackup_log.
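The backup sequence above can be sketched as follows. This is an illustrative toy model, not xtrabackup's actual implementation; the callback names (run_sql, copy_innodb_files, copy_non_innodb_files, log_tailer) are hypothetical stand-ins.

```python
def hot_backup(run_sql, copy_innodb_files, copy_non_innodb_files, log_tailer):
    """Sketch of the XtraBackup hot-backup sequence described above."""
    log_tailer.start()                       # background thread records new redo entries
    copy_innodb_files()                      # .ibd files and the ibdata1 system tablespace
    run_sql("FLUSH TABLES WITH READ LOCK")   # block writes so non-InnoDB files are consistent
    copy_non_innodb_files()                  # .frm, .MYI, .MYD, ...
    run_sql("UNLOCK TABLES")
    log_tailer.stop()                        # xtrabackup_log now covers the whole copy window
```

The key property is that only the short window between FTWRL and UNLOCK TABLES blocks writes; InnoDB file copying happens while the server stays fully available.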
The recovery phase starts the InnoDB instance embedded in Xtrabackup, replays the log file xtrabackup_log, applies the changes of committed transactions to the InnoDB data files and tablespaces, and rolls back uncommitted transactions (a process similar to InnoDB crash recovery). As shown in the figure:
Xtrabackup's incremental backup process is similar to a full backup, except that only the "increments" of InnoDB are handled; for MyISAM and other storage engines an incremental backup is still a full copy.
Looking at the backup and recovery process, backup speed is mainly determined by file copying and log generation, i.e. by disk IO, network, and system load, while recovery is mainly bound by IO and concurrency control. This article focuses on optimizing Xtrabackup's recovery phase.
Current situation
Xtrabackup's recovery is actually implemented by calling the recovery logic of the embedded InnoDB (with the default values of some parameters changed, such as the number of buffer pool pages used during recovery). InnoDB recovery has never been particularly efficient, and the community has proposed many optimizations for the InnoDB crash-recovery process.
In real production environments, instances with terabytes of data usually generate tens of gigabytes or more of log during an Xtrabackup hot backup, and restores are limited by the configuration of the restore virtual machine. Such a backup often takes several hours to restore, with an average recovery speed of only 1-4 MB/s (depending on hot-data distribution). This speed causes great trouble for operating production instances.
Problems
During InnoDB recovery, memory is allocated with type MEM_HEAP_BUFFER, i.e. a region is carved out of the buffer pool to store log records. When the log file to be recovered is very large, this memory may be insufficient. Log processing therefore takes one of two paths depending on whether memory suffices:
1. The allocated memory is enough to hold all log records.
With sufficient memory, log parsing is serial while log playback is parallel; possible participants in playback include the main thread and the IO threads, and in extreme cases the log_checkpoint thread and other worker threads.
2. The allocated memory is not enough to hold all log records.
With insufficient memory, log parsing requires two rounds. In the first round, once parsing reaches a certain lsn and memory runs out, subsequent parsing stops saving log records into the hash table and simply continues until all logs are parsed; the hash table built in this round is then emptied. What the first round leaves for the second is the set of tablespaces that need to be opened and all DDL-related information, used to rebuild the tablespaces at the start of recovery. In the second round, whenever memory runs out, all parsed logs are applied to their pages; at this point ibuf merge is forbidden (no new logs may be generated), which requires flushing all dirty pages and invalidating all pages in the buffer pool after applying the logs, and finally emptying the hash table for further parsing and playback. The remaining logic is the same as case 1.
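The "parse until memory is full, apply everything, empty the hash table, continue" loop of the second round can be sketched as follows. This is a deliberate simplification: `capacity` stands in for the buffer-pool region, and `apply_batch` stands in for applying records and flushing/invalidating pages.

```python
def replay_in_batches(records, capacity, apply_batch):
    """Sketch of batched log application when the hash table cannot
    hold all log records at once (hypothetical simplification)."""
    table = []                      # stands in for the recovery hash table
    for rec in records:
        table.append(rec)
        if len(table) == capacity:  # "memory" exhausted: apply and empty
            apply_batch(list(table))
            table.clear()
    if table:                       # final partial batch
        apply_batch(list(table))
```

Each `apply_batch` call corresponds to one expensive round of flushing all dirty pages, which is why large logs with small buffers recover slowly.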
In practice, overall log recovery is slow, averaging 1-4 MB/s. For crash recovery over hundreds of megabytes of log, or backup-log recovery over even larger logs, this speed is far from enough.
From the above analysis, the following parts of log parsing and playback during recovery can be optimized:
1. Log parsing
2. Page flushing when log memory is insufficient
3. Parallelism of log parsing and playback
Solutions
Log parsing
A log record carries no length information. Because the log is tightly formatted, parsing must pass a simple metadata structure into the processing function in order to compute the boundary of each record; recovery scans from the last checkpoint lsn until no legal log remains. This metadata structure is in fact the dict_index_t and dict_table_t pair. It is needed during both parsing and playback, InnoDB handles it extensively, and every record parsed or replayed requires at least one malloc and free of such a pair of structures.
We raised these two issues in the MySQL community along with proposed solutions:
1. Bug#82937: add length information to the log record header, as shown in the following figure:
With the length field, record boundaries can be obtained directly from the length, eliminating a large number of malloc/free calls for metadata structures as well as the function calls for parsing the log format. This optimization improves parsing performance by about 60%.
2. Bug#82176: a log record does need a metadata structure during playback, but it requires far less information than at runtime. Analysis shows that tables with the same number of columns can share this data structure, reinitializing some properties before each use, so introducing a metadata cache removes unnecessary malloc and free. In our product tests the cache improves single-thread parsing by more than 30%, and the community has a similar optimization contributed by the Ali team.
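The shared-metadata-cache idea from point 2 can be sketched as follows. This is a toy model: `DictIndexCache` and its dictionary entries are hypothetical stand-ins for the dict_index_t/dict_table_t pair, keyed by column count and reinitialized before each use.

```python
class DictIndexCache:
    """Sketch of the per-thread metadata cache (the Bug#82176 idea):
    tables with the same column count share one dummy metadata entry."""

    def __init__(self):
        self._cache = {}

    def get(self, n_cols):
        entry = self._cache.get(n_cols)
        if entry is None:
            # stands in for allocating a dict_index_t + dict_table_t pair once
            entry = {"n_cols": n_cols}
            self._cache[n_cols] = entry
        entry["table_id"] = None    # reinitialize per-use fields before handing out
        return entry
```

Because the cache is thread-local in the actual design, no locking is needed; repeated lookups for the same column count return the same structure instead of a fresh malloc.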
However, looking at parsing alone, a single core reaches 60-80 MB/s before optimization and 120-160 MB/s after, so the absolute speed is already considerable.
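The record-length idea from Bug#82937 above can be sketched as follows. The layout here (a 2-byte big-endian length followed by the body) is illustrative only, not the proposed on-disk format; the point is that boundary detection becomes a constant-time skip instead of a full format parse.

```python
import struct

def split_records(buf):
    """Split a byte stream of length-prefixed records into record bodies.
    Illustrative layout: u16 big-endian length, then `length` body bytes."""
    out, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)  # read the length field
        out.append(buf[pos + 2 : pos + 2 + length])     # slice out the body
        pos += 2 + length                               # jump straight to the next record
    return out
```

Without the length field, the parser must interpret every record's format (allocating metadata structures along the way) just to find where the next record begins.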
Xtrabackup's recovery is based on InnoDB 5.6, which normally scans, i.e. parses, the log file three times. Even at 120 MB/s, repeated scanning wastes part of the time. If higher parsing speed is required, multi-threaded parallel parsing can be introduced; the key to parallel parsing is how to split the log into complete slices.
The feasibility of parallel parsing depends on whether complete log fragments can be carved out of a log file organized as consecutive LOG_BLOCKs. InnoDB's read-ahead buffer for log parsing is RECV_SCAN_SIZE (64 KB), so parsing already proceeds in stages; records crossing a 64 KB boundary are handled via boundary calculation and re-read in the next 64 KB chunk, which means consecutive reads overlap.
We therefore split the log into slices of a fixed size (an integer multiple of the LOG_BLOCK size, e.g. 10 MB). The starting position within the first block of the first slice is located via the checkpoint lsn, and for every other slice via LOG_BLOCK_FIRST_REC_GROUP. If a record in one slice cannot be parsed completely, parsing moves into the next slice until the record is complete. This movement may cause two slices to parse the same record; since log playback is idempotent, duplicated records do no harm as long as they are ordered by lsn. The slicing window over the log file is shown below:
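In code, the slice computation can be sketched as follows. `first_rec_group` is a hypothetical callback standing in for reading LOG_BLOCK_FIRST_REC_GROUP from a block header (or the checkpoint lsn for the first slice); each returned pair is the byte range one parsing thread starts from, with the understanding that a thread may run past its end to finish a straddling record.

```python
LOG_BLOCK_SIZE = 512  # InnoDB log block size in bytes

def slice_starts(file_size, blocks_per_slice, first_rec_group):
    """Split a log file into fixed-size slices aligned to LOG_BLOCK
    boundaries; each slice's parse start is the first complete record
    inside its first block (illustrative sketch)."""
    step = blocks_per_slice * LOG_BLOCK_SIZE
    slices = []
    for start in range(0, file_size, step):
        end = min(start + step, file_size)
        # enter the slice at the offset of the first record that BEGINS here
        slices.append((start + first_rec_group(start), end))
    return slices
```

Records that straddle a boundary are parsed by both neighboring threads, which is safe because replay ordered by lsn is idempotent.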
With parallel parsing applied, the recovery process spends much less time on parsing, as shown in the following figure:
Page Flush
In the analysis above, we found that when the allocated buffer pool cannot hold all log records (which happens for most large instances), logs are parsed several times and replayed in batches. Pages finished in each batch cannot undergo ibuf merge, so they can only be flushed to disk by the page cleaner, and when hot pages are scattered, the pages touched per round far exceed Xtrabackup's default buffer of 512 pages. This causes massive single-page evictions, each requiring one fil_flush (fsync), creating a serious performance bottleneck, especially on large instances.
Combined with how Xtrabackup hot backup is used in production, we found that the entire backup-restore process is atomic as a whole: a backup (full or incremental) only counts as successful after a complete restore. Page flushing can therefore be optimized neatly: during recovery, every page flush only writes to the OS file cache instead of calling fil_flush (fsync), leaving batching to the operating system. In other words, synchronous flushing becomes asynchronous. When the whole recovery completes, fil_close flushes down all dirty pages not yet persisted. Page eviction is then no longer a bottleneck, and each round's playback speed improves greatly.
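The delayed-flush idea can be sketched as follows. This is an illustrative model (function name, page map, and page size are all hypothetical): writes go only to the OS file cache, and a single fsync at the end stands in for the fil_close step.

```python
import os

def restore_pages(path, pages, page_size=4096):
    """Sketch of delayed flushing during recovery: per-page writes hit
    only the OS cache; one fsync at the end persists everything,
    instead of one fil_flush/fsync per evicted page."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for page_no, data in pages:
            os.pwrite(fd, data, page_no * page_size)  # no fsync here
        os.fsync(fd)  # single flush at the very end ("fil_close" step)
    finally:
        os.close(fd)
```

The trade-off is exactly the atomicity observation above: if the machine crashes mid-restore, the files are inconsistent, but that is fine because the restore would be redone from the backup anyway.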
Parsing and playback parallelism
As shown in the figure below, the general scheme of parallel parsing and playback in InnoDB differs from the serial mode: parsing no longer runs on its own but concurrently with the playback threads (which write new logs), the IO threads, and the checkpoint thread. Such concurrency is constrained by several existing InnoDB mechanisms, such as memory management, the dirty-page flushing mechanism, tablespaces, and the checkpoint mechanism, analyzed one by one below:
1. Memory management
InnoDB's recovery phase requests memory of type MEM_HEAP_BUFFER, which carves a region out of the buffer pool and is limited in size, hence the two-phase parsing mentioned earlier. Because MEM_HEAP_BUFFER allocations are made individually but freed all at once, if parsing runs in parallel with playback, then once memory reaches its limit parsing must stop, wait for all log application to finish, and resume only after the memory is reclaimed.
Log parsing and playback can be seen as producer and consumer: playback consumes log records, which can be reclaimed as soon as they are replayed. Setting the memory type to MEM_HEAP_DYNAMIC, malloc-ing each log record as it is parsed and freeing it after playback, keeps memory usage roughly stable because playback runs concurrently.
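The producer/consumer scheme can be sketched as follows. This is an illustrative model only (function and thread structure are hypothetical): the parser allocates each record individually and hands it to replay workers; a bounded queue keeps the parser's lead, and thus memory use, roughly constant.

```python
import queue
import threading

def parse_and_replay(records, n_workers=4):
    """Sketch of parsing as producer and replay as consumers: each
    record is 'allocated' at parse time and dropped after replay
    (the MEM_HEAP_DYNAMIC idea)."""
    q = queue.Queue(maxsize=64)     # bounds how far parsing can run ahead
    applied, lock = [], threading.Lock()

    def worker():
        while True:
            rec = q.get()
            if rec is None:         # sentinel: no more records
                return
            with lock:
                applied.append(rec)  # stands in for applying rec to its page

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for rec in records:             # producer: parse one record, hand it off
        q.put(rec)
    for _ in threads:               # one sentinel per worker
        q.put(None)
    for t in threads:
        t.join()
    return applied
```

Replay order across workers is not the log order, which is why the flush-list ordering problem discussed below has to be solved separately.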
2. New log generation
InnoDB still manages logs through log_sys during recovery, and the logs generated by ibuf merge must be written to the same log file. In general, however, the system's persisted lsn is not known until the parsing thread finishes parsing, so the starting lsn of new logs and their write offset in the log file cannot be determined; hence new logs normally cannot be generated during the parsing phase.
Not knowing the persisted lsn at the start of recovery is thus an obstacle to generating new logs, but InnoDB recovery has exceptions. If InnoDB requires two-phase parsing, the persisted lsn is already known after the first phase ends; for Xtrabackup, the end lsn of the copied log (i.e. the final persisted lsn) is known when the copy finishes. So for Xtrabackup recovery there is no obstacle to generating new logs.
Finally, some attributes of log_sys are also used by the recovery logic during the InnoDB recovery phase, such as its buffer, which conflicts with the log-writing logic. The conflicting attributes must be moved from log_sys into recv_sys.
3. Flushing mechanism and incremental checkpoint
InnoDB manages dirty pages with the flush list. Pages are ordered in the flush list by the lsn at which they first became dirty, and a page is removed from the list when it is flushed. The incremental-checkpoint mechanism scans the smallest lsn in the flush list to advance the checkpoint lsn. The lsn chosen for the checkpoint must satisfy the rule that "in the flush list, every page with a smaller lsn lies in front of the page owning this lsn", which depends directly on ordering pages by their first-dirty lsn.
During InnoDB recovery, the order of pages in the flush list is not maintained at parse time; a page is inserted into the flush list only after its logs have been replayed. Because playback is multithreaded (the main thread replays in hash-table bucket order, or on demand when a page is read), the dirty pages in the flush list are not exactly in first-modification order. The flush list only reaches a fully correct state after all pages have replayed their logs, which is why in InnoDB recovery log_checkpoint happens only after all pages have replayed their log records.
Running parsing and playback in parallel inevitably generates new logs, but the log buffer and log file sizes are limited. If there is no room for new logs and no checkpoint can be taken at that moment, recovery may get stuck. Moreover, for the dirty pages produced while parsing and playback run in parallel, it is better (IO permitting) to persist them and advance the checkpoint promptly, to avoid redoing the whole recovery after an abnormal exit.
Whether a checkpoint can be taken while logs are still being parsed comes down to maintaining the order of the flush list at all times. A page's modification order is its order of appearance in the log, which is exactly its first-modification order. So during parsing we can check whether a page is in the buffer pool and, if not, "load" it: the page need not actually be read at this point; it only needs to occupy a slot in the flush list. If such a page still has not been loaded when it is flushed from the flush list, one synchronous IO is required. This keeps the flush list ordered throughout log parsing and removes the checkpoint limitation in the recovery phase.
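The placeholder idea can be sketched as follows (a toy model; the class and method names are hypothetical): at parse time each page reserves a flush-list slot at its first-modification lsn, so the minimum lsn in the list is always a valid checkpoint lsn.

```python
class FlushList:
    """Sketch of keeping the flush list ordered during parsing: each
    page occupies a slot at its FIRST modification lsn, before the
    page is actually read or replayed."""

    def __init__(self):
        self.entries = []     # (first_dirty_lsn, page_id), append-only => lsn-ordered
        self._seen = set()

    def note_modification(self, lsn, page_id):
        # logs are parsed in lsn order, so only the first sighting of a
        # page determines its position; later modifications don't reorder it
        if page_id not in self._seen:
            self._seen.add(page_id)
            self.entries.append((lsn, page_id))

    def checkpoint_lsn(self):
        # the head of the list is the smallest first-dirty lsn
        return self.entries[0][0] if self.entries else None
```

Because parsing visits records in lsn order, appending on first sight automatically yields the "every smaller lsn lies in front" invariant the checkpoint rule requires.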
4. Tablespace
During InnoDB recovery, fil_space information is obtained from MLOG_FILE_NAME records in the log, because records in the SYS_TABLESPACE system table may be incomplete during recovery. An MLOG_FILE_NAME record is written to the log the first time a tablespace becomes dirty after each checkpoint, so that all required tablespaces can be opened during recovery (an optimization introduced in MySQL 5.7.5; earlier versions opened every ibd file to load tablespaces).
As shown in the figure below, suppose the last checkpoint occurs at lsn 1000 and table T1 is still being modified around the checkpoint; T1's MLOG_FILE_NAME record is written before MLOG_CHECKPOINT but after T1's last log record. If parsing and playback are concurrent, then when T1's last log record needs to be replayed, its tablespace has not been loaded because T1's MLOG_FILE_NAME record has not yet been parsed, and playback and subsequent IO can go wrong. The system tables may not have been recovered at this point either, so dict_load cannot be used.
In addition, if a table is modified after the checkpoint and then dropped, as shown in the figure below, recovery can ignore that table's logs, since recovering it is neither necessary nor possible (the physical file has been deleted and there is no tablespace). This is done in recv_init_crash_recovery_spaces(), which requires that all logs be parsed first to build a complete file-name table, and then reconciles the tables that need no recovery.
If parsing and playback run in parallel, tablespace loading can be done before parsing the log, either by scanning all ibd files and loading every existing tablespace, or by loading all tables currently in the system tables via dict_load, even if the tablespace set is incomplete at that point. In addition, for dropped or truncated tables, if fil_space does not exist when a log record is replayed, or the page number exceeds the tablespace size, the record is simply not replayed.
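The replay filter described above can be sketched as follows. The record and map shapes here are hypothetical (`space_id`/`page_no` keys, a map from space id to size in pages); the real check operates on fil_space structures.

```python
def should_replay(rec, spaces):
    """Skip a log record if its tablespace no longer exists (dropped
    or truncated) or its page number lies beyond the tablespace's
    current size. `spaces` maps space_id -> size in pages (sketch)."""
    size = spaces.get(rec["space_id"])
    return size is not None and rec["page_no"] < size
```

Dropped tables thus cost nothing during parallel replay: their records fail the lookup and are discarded.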
Implementation
From the feasibility analysis above, parallel parsing and playback is achievable in InnoDB in theory, but it requires adapting several key mechanisms, touches a lot of code, and is highly complex. Weighing this against the performance benefit, we chose which optimizations to actually implement.
In terms of benefit, suppose the total recovery time is 10a, with parsing taking X·a and playback taking (10 − X)·a. After introducing the dict_index cache and parallel log parsing, the overall parsing-plus-playback time improves by a factor of 10 / (10 − (1 − 0.7/N)·X), where 0.7 is the single-thread improvement factor of the dict_index cache and N is the number of concurrent parsing threads. The formula shows that when parsing takes a large share of the time, adding parsing threads greatly improves recovery efficiency; when playback dominates, even fully parallel parsing brings only limited benefit.
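A worked example of this estimate, assuming the reconstruction above (total time 10a, parse time X·a reduced to 0.7·X·a/N by the cache and N parsing threads; function name is ours):

```python
def recovery_speedup(x, n_threads, cache_factor=0.7):
    """Speedup = 10 / (10 - (1 - cache_factor/N) * X): new total time is
    (10 - X) for playback plus cache_factor * X / N for parsing."""
    return 10.0 / (10.0 - (1.0 - cache_factor / n_threads) * x)
```

For example, if parsing is half the total (X = 5) and N = 1, the cache alone gives 10/8.5 ≈ 1.18x; with parsing dominant (X = 8) and N = 8, the estimate rises to roughly 3.7x, matching the observation that parallel parsing pays off only when parsing dominates.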
In summary, given the high complexity and limited benefit of fully parallel parsing and playback, and the relative costs of parsing versus playback, the current recovery optimizations focus on single-thread parsing optimization and page-flushing optimization. The concrete implementation is as follows:
1. dict_index cache
Each independent parsing thread gets a thread-level cache (avoiding unnecessary lock overhead). The cache's lookup key is the column count: two tables with the same number of columns share one dict_index/dict_table pair of structures, with some fields reinitialized before each use.
Thread-level caches are also added for the main thread, the IO threads, and the checkpoint thread, all of which may perform playback tasks, to optimize the playback phase.
2. Controlling the number of parsing passes
We first reduced the InnoDB 5.6 mechanism of scanning logs up to three times to at most two, essentially by eliminating the role of MLOG_CHECKPOINT, and provided a configuration parameter: when a hot backup generates a large amount of log, the first round skips adding log records to the log hash table, turning the first scan into a quick construction of tablespaces. This helps hot backups of small, heavily loaded instances; the choice depends on the size of the generated log file relative to the size of the buffer pool.
3. Delayed flushing
A configuration parameter selects the dirty-page flushing mode in the recovery phase and enables asynchronous flushing.
Test environment
development machine: Dev-VD2
Database
Database parameters
port=3306
max_connections=100
innodb_buffer_pool_size=4G
innodb_buffer_pool_instances=2
innodb_file_per_table=1
innodb_flush_log_at_trx_commit=0
innodb_log_buffer_size=512M
innodb_log_file_size=1G
Test
Currently, versions up to 5.6 parse all logs three times by default. Given Xtrabackup's characteristics, two passes are enough, so this chapter does not compare three-pass parsing. (Unit: ms)
The tests above were all run on small instances and virtual machines. The optimizations vary by scenario, mostly improving recovery by 30% to 75%, affected by system load, IO, hot-data distribution, and other factors. For a 2 TB production instance whose hot backup generated 20 GB of log, recovery time after optimization dropped from 4 hours to 10 minutes, a speedup of more than 20x.
Evolution
Further optimizations are possible for instances in different scenarios, such as the parallel parsing mentioned earlier, introducing length information into the log format (compatibility must be considered), and optimizing tablespace construction using Xtrabackup's own tablespace handling. The log-related memory management during InnoDB recovery is rather coarse and leaves room for optimization, as do the locks in the recovery phase, such as recv_sys's mutex, fil_system's mutex, the flush list mutex, and the buffer pool mutex.
No matter how parsing is optimized, when instances and logs are very large its effect shrinks and it stops being the bottleneck of the whole recovery; the bottleneck generally shifts to IO. Subsequent optimization must therefore be targeted based on concrete analysis of specific scenarios.
© 2024 shulou.com SLNews company. All rights reserved.