What are the knowledge points of MySQL's InnoDB IO subsystem?

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article introduces the InnoDB IO subsystem of MySQL. Many people run into these questions in practice, so let's walk through how InnoDB handles IO. I hope you read carefully and get something out of it!

Basic knowledge

WAL technology: write-ahead logging, used by almost all databases. In short, when a data block needs to be modified, the database foreground thread first writes the corresponding log to disk (a batched, sequential write) and only then tells the client the operation succeeded, while the actual write of the data block (a scattered, random write) is handed to a background IO thread. With this technique there is one extra disk write, but because the logs are written sequentially in batches it is very efficient, so the client gets its response quickly. Moreover, if the database crashes before the data block itself reaches disk, it can replay the log on restart and recover from the crash without losing data.
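The idea can be sketched in a few lines of Python (a minimal illustration, not InnoDB code; the file name and record format are made up):

```python
import os
import tempfile

def wal_write(log_fd: int, record: bytes) -> None:
    """Append a log record and force it to disk before acknowledging."""
    os.write(log_fd, record)
    os.fsync(log_fd)  # the record is durable before the client sees "success"

def recover(log_path: str) -> bytes:
    """After a crash, the log can be replayed to rebuild the data blocks."""
    with open(log_path, "rb") as f:
        return f.read()

# The foreground thread only does this cheap sequential write; the
# random write of the real data page is left to a background IO thread.
log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT, 0o644)
wal_write(log_fd, b"set k=1;")
os.close(log_fd)
assert recover(log_path) == b"set k=1;"
```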

Data pre-reading: when block A is read, its "adjacent" blocks B and C also have a high probability of being read soon, so they can be read into memory in advance while reading A; this is data pre-reading (prefetching). Adjacency here has two meanings: physical adjacency and logical adjacency. Being adjacent in the underlying data file is physical adjacency. If data are not adjacent in the file but are adjacent logically (the rows with id=1 and id=2 are logically adjacent, but not necessarily physically adjacent; they may sit at different locations in the same file), this is logical adjacency.

File open modes: the open system call has three commonly used modes: O_DIRECT, O_SYNC, and the default mode. O_DIRECT means subsequent operations on the file bypass the file system cache: user space operates on the device file directly, skipping the kernel's cache and optimizations. Viewed another way, if a write in O_DIRECT mode returns success, the data has really reached the disk (ignoring the disk's own onboard cache), and every read in O_DIRECT mode really comes from the disk rather than from the file system cache. O_SYNC means reads and writes still go through the kernel cache, but this mode also guarantees that the data reaches the disk on every write. The default mode is similar to O_SYNC except that there is no such guarantee after a write returns: the data may still sit only in the file system cache, and if the host goes down, it may be lost.

In addition, a write must persist not only the modified or appended data but also the file metadata; only when both are on disk is the data truly safe. O_DIRECT mode does not guarantee that the file metadata is persisted (though most file systems do; see Bug #45892), so after writing a file in O_DIRECT mode there is still, strictly speaking, a risk of loss. O_SYNC guarantees that both data and metadata are persisted, while the default mode guarantees neither.

Calling fsync ensures that both the data and the metadata reach the disk, so files opened in O_DIRECT or default mode should be followed by a call to fsync after writing.
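A sketch of this fsync discipline (a hypothetical helper, not InnoDB code; the directory fsync for newly created files is an extra precaution of my own, not something the text above mandates):

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and fsync so both the data and the file metadata reach disk."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flushes data and file metadata
    finally:
        os.close(fd)
    # Extra precaution (an assumption, not from the text): for a newly
    # created file, fsync the directory so the name itself is durable.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

path = os.path.join(tempfile.mkdtemp(), "demo.dat")
durable_write(path, b"hello")
assert open(path, "rb").read() == b"hello"
```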

Synchronous IO: this is the familiar read/write function family (on Linux). Its characteristic is that the caller waits for the call to complete, and there is no notification mechanism: when the function returns, the operation is done, and the return value tells you whether it succeeded. This kind of IO is simple to program, since everything happens in one thread, but the caller has to wait. In a database it suits operations where the data is needed urgently; for example, the WAL log must be on disk before control returns to the client, so it is written with synchronous IO.

Asynchronous IO: in a database, the background IO threads that flush data blocks mostly use asynchronous IO. The foreground thread only has to submit the flush request to the asynchronous IO queue and can then go do other work, while the background IO threads periodically check whether the submitted requests have completed and, if so, perform the follow-up processing. Asynchronous IO requests are also often submitted in batches, so if different requests access the same file at contiguous offsets they can be merged into one IO request. For example, a first request reads 200 bytes of file 1 starting at offset 100, and a second reads 100 bytes of file 1 starting at offset 300; the two can be merged into a single request for 300 bytes of file 1 starting at offset 100. Asynchronous IO is also commonly used for the logical pre-reading part of data pre-reading.

At present, the asynchronous IO library on Linux requires the file to be opened in O_DIRECT mode, and the memory address of the data buffer, the file offset, and the amount of data read or written must all be integer multiples of the file system's logical block size, which can be queried with a command like sudo blockdev --getss /dev/sda5. If any of the three is not a multiple of the logical block size, the read/write call fails with EINVAL; but if the file was not opened with O_DIRECT, the program still runs, only it degrades to synchronous IO, blocking in the io_submit call.
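The alignment rule can be expressed as a small check (illustrative only; the 512-byte default is an assumption, and real code should query the logical block size as shown above):

```python
def aio_params_valid(buf_addr: int, offset: int, nbytes: int,
                     block_size: int = 512) -> bool:
    """True if all three values are multiples of the logical block size;
    otherwise io_submit on an O_DIRECT file fails with EINVAL."""
    return all(v % block_size == 0 for v in (buf_addr, offset, nbytes))

assert aio_params_valid(0x1000, 4096, 512)
assert not aio_params_valid(0x1001, 4096, 512)  # misaligned buffer address
assert not aio_params_valid(0x1000, 100, 512)   # misaligned file offset
```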

InnoDB general IO operation and synchronous IO

In InnoDB, if the system has the pread/pwrite functions, they are used for reading and writing (os_file_read_func and os_file_write_func); otherwise the lseek+read/write scheme is used. This is InnoDB's synchronous IO. The pread/pwrite documentation shows that these two functions do not change the file handle's offset and are thread-safe, so they are recommended in multi-threaded environments, whereas the lseek+read/write scheme needs its own mutex protection and, under high concurrency, frequently drops into kernel mode, which costs some performance.
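A quick demonstration of why pread/pwrite need no seek mutex (os.pread/os.pwrite wrap the same system calls; the file path here is a throwaway):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

# pread/pwrite take an explicit offset and never move the file position,
# so many threads can share one fd without a seek mutex.
os.pwrite(fd, b"hello", 0)
os.pwrite(fd, b"world", 5)
assert os.pread(fd, 5, 5) == b"world"
assert os.lseek(fd, 0, os.SEEK_CUR) == 0  # position is untouched
os.close(fd)
```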

In InnoDB, files are opened with the open system call (os_file_create_func). Besides O_RDONLY (read-only), O_RDWR (read-write), and O_CREAT (create the file), the flags O_EXCL (ensure this thread alone creates the file) and O_TRUNC (truncate the file) are also used. By default (when the database is not in read-only mode), all files are opened in O_RDWR mode. innodb_flush_method is an important parameter, so let's look at it in detail:

If innodb_flush_method is set to O_DSYNC, the log files (ib_logfileXXX) are opened with O_SYNC, so fsync is not needed to flush written log data, while the data files (ibd) are opened in default mode, so fsync is needed to flush the data.

If innodb_flush_method is set to O_DIRECT, the log files (ib_logfileXXX) are opened in default mode, so fsync is needed after writing to flush the data; the data files (ibd) are opened in O_DIRECT mode and, after writing, also need a call to fsync to flush.

If innodb_flush_method is set to fsync, or not set at all, both data files and log files are opened in default mode, and fsync is needed after writing to flush the data.

If innodb_flush_method is set to O_DIRECT_NO_FSYNC, files are opened the same way as in O_DIRECT mode, except that fsync is not called after writing a data file. This is intended for file systems where O_DIRECT already guarantees that file metadata reaches the disk.
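The four settings above can be summarized in a small sketch (a restatement of the text, not the InnoDB source):

```python
# Mapping of innodb_flush_method to open modes and fsync requirements,
# as described in the preceding paragraphs:
#   method -> (log open mode, data open mode, fsync log?, fsync data?)
FLUSH_METHODS = {
    "fsync":             ("default", "default",  True,  True),
    "O_DSYNC":           ("O_SYNC",  "default",  False, True),
    "O_DIRECT":          ("default", "O_DIRECT", True,  True),
    "O_DIRECT_NO_FSYNC": ("default", "O_DIRECT", True,  False),
}

assert FLUSH_METHODS["O_DSYNC"][0] == "O_SYNC"
assert FLUSH_METHODS["O_DIRECT_NO_FSYNC"][3] is False
```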

InnoDB currently does not support opening log files in O_DIRECT mode, nor does it support opening data files in O_SYNC mode.

Note that if you use Linux native aio (see the next section for details), innodb_flush_method must be set to O_DIRECT, otherwise IO silently degrades to synchronous IO (with no hint of this in the error log).

InnoDB uses file system file locks to ensure that only one process reads and writes a given file (os_file_lock), using advisory locking rather than mandatory locking, because mandatory locking has bugs on many systems, including Linux. In non-read-only mode, every file is locked with a file lock after it is opened.

Directories in InnoDB are created recursively (os_file_create_subdirs_if_needed and os_file_create_directory). For example, to create the directory /a/b/c/, the code recurses from c back through b to a until it reaches an existing ancestor, then creates the missing directories with the mkdir function. Also note that creating the upper levels of the path calls os_file_create_simple_func rather than os_file_create_func.

InnoDB also needs temporary files. The logic for creating them is simple (os_file_create_tmpfile): after successfully creating a file in the tmp directory, unlink is called to remove its name, so that when the process ends (normally or abnormally) the file is automatically released. When InnoDB creates a temporary file, it first reuses the server-layer function mysql_tmpfile and then calls dup to duplicate the handle, because the server layer needs to be able to release its own resources.
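The create-then-unlink trick can be demonstrated directly (a sketch using Python's tempfile, not the actual os_file_create_tmpfile/mysql_tmpfile code path):

```python
import os
import tempfile

def create_tmpfile() -> int:
    """Create a file, then unlink its name: the fd stays valid, and the OS
    reclaims the storage automatically when the process ends, even on a crash."""
    fd, path = tempfile.mkstemp()
    os.unlink(path)  # the name is gone; the fd keeps the file alive
    return fd

fd = create_tmpfile()
os.pwrite(fd, b"scratch", 0)
assert os.pread(fd, 7, 0) == b"scratch"
os.close(fd)  # storage is reclaimed here
```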

When InnoDB needs the size of a file, it does not consult the file's metadata (the stat function) but instead uses lseek(file, 0, SEEK_END) to obtain it. This avoids reading a stale size caused by delayed metadata updates.
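A sketch of the lseek-based size query (os.lseek wraps the same call; file_size is a hypothetical helper name):

```python
import os
import tempfile

def file_size(fd: int) -> int:
    """File size via lseek(fd, 0, SEEK_END) rather than stat()."""
    old = os.lseek(fd, 0, os.SEEK_CUR)
    size = os.lseek(fd, 0, os.SEEK_END)
    os.lseek(fd, old, os.SEEK_SET)  # restore the position
    return size

path = os.path.join(tempfile.mkdtemp(), "f.ibd")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.pwrite(fd, b"x" * 1234, 0)
assert file_size(fd) == 1234
os.close(fd)
```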

InnoDB pre-allocates a size for every newly created file (both data and log files) and fills the pre-allocated space with zeros (os_file_set_size), extending the file when it fills up. In addition, when log files are created during the install_db phase, allocation progress is reported in the error log every 100MB.
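A minimal sketch of zero-fill preallocation in the spirit of os_file_set_size (the chunk size and names are invented; real implementations may use fallocate-style primitives instead):

```python
import os
import tempfile

def preallocate(fd: int, size: int, chunk: int = 1 << 20) -> None:
    """Extend a file to `size` bytes of explicit zeros, written in chunks."""
    zeros = b"\0" * chunk
    written = 0
    while written < size:
        n = min(chunk, size - written)
        os.pwrite(fd, zeros[:n], written)
        written += n

path = os.path.join(tempfile.mkdtemp(), "ibdata_demo")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
preallocate(fd, 3 * 1024 * 1024)
assert os.fstat(fd).st_size == 3 * 1024 * 1024
os.close(fd)
```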

In general, ordinary IO operations and synchronous IO are relatively simple; in InnoDB, however, data file writes are mostly done with asynchronous IO.

InnoDB Asynchronous IO

Because MySQL predates Linux native aio, MySQL's asynchronous IO code contains two implementations of asynchronous IO.

* The first is the original simulated aio. InnoDB built a simulated aio mechanism before Linux native aio was brought in, and it is still used on systems that do not support aio. When an asynchronous read/write request is submitted, it is simply placed in a queue and the call returns, so the program can go do other things. In the background, several asynchronous IO handler threads (controlled by the parameters innobase_read_io_threads and innobase_write_io_threads) continually take requests from this queue and complete them with synchronous IO, along with the post-read/write work.

* The other is native aio, currently implemented on Linux with functions such as io_submit and io_getevents (glibc's aio is not used, since it is also simulated). Requests are submitted with io_submit and waited on with io_getevents. Windows also has its own corresponding aio, which is not covered here. Platforms other than Linux and Windows can currently only use simulated aio.

Let's first introduce some common functions and structures, then look in detail at simulated aio and Linux native aio, respectively.

In os0file.cc, global arrays of type os_aio_array_t are defined. These arrays are the queues that simulated aio uses to cache read and write requests. Each array element is of type os_aio_slot_t, which records the type of each IO request, the file's fd, the offset, the amount of data to read or write, the time the request was initiated, whether the request has completed, and so on. The struct iocb used by Linux native aio also lives inside os_aio_slot_t. The array structure os_aio_array_t records some statistics, such as how many slots (os_aio_slot_t) are in use and whether the array is empty or full. There are five such global arrays, holding data file asynchronous read requests (os_aio_read_array), data file asynchronous write requests (os_aio_write_array), log file asynchronous write requests (os_aio_log_array), insert buffer asynchronous write requests (os_aio_ibuf_array), and data file synchronous read/write requests (os_aio_sync_array). Log block writes are synchronous IO, so why allocate an asynchronous request queue (os_aio_log_array) for log writes? The reason is that checkpoint information is recorded in the header of the InnoDB log files, and at present checkpoint reads and writes are done with asynchronous IO, since they are not very urgent. On Windows, if asynchronous IO is used on a file, synchronous IO cannot be used on the same file, which is why the data file synchronous read/write queue (os_aio_sync_array) was introduced. There is no asynchronous read queue for the log, because the log only needs to be read during crash recovery, and during crash recovery the database is not yet available, so there is no need for an asynchronous read mode at all.
One thing to note: whatever the values of innobase_read_io_threads and innobase_write_io_threads, there is always exactly one os_aio_read_array and one os_aio_write_array; only the number of os_aio_slot_t elements in them grows. On Linux, each increment of the variable adds 256 elements. For example, with innobase_read_io_threads=4, the os_aio_read_array is divided into four segments, each with its own independent lock, semaphore, and statistics variables, simulating four threads; innobase_write_io_threads works the same way. From this we can also see that each asynchronous read/write thread can cache at most 256 pending requests; beyond that, subsequent asynchronous requests must wait. The 256 can be understood as the InnoDB layer's cap on asynchronous IO concurrency; there are also queue length limits at the file system and disk levels, which can be queried with cat /sys/block/sda/queue/nr_requests and cat /sys/block/sdb/queue/nr_requests respectively.
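The slot-array layout described above might be sketched like this (a toy model with invented names, not the os0file.cc structures):

```python
from dataclasses import dataclass, field

SLOTS_PER_SEGMENT = 256  # slots added per IO thread, as described above

@dataclass
class Slot:
    """Rough analogue of os_aio_slot_t: one cached IO request."""
    io_type: str = ""
    fd: int = -1
    offset: int = 0
    length: int = 0
    reserved: bool = False

@dataclass
class AioArray:
    """One global array, split into independent per-thread segments."""
    n_segments: int
    slots: list = field(default_factory=list)

    def __post_init__(self) -> None:
        self.slots = [Slot() for _ in range(self.n_segments * SLOTS_PER_SEGMENT)]

    def segment_for(self, slot_index: int) -> int:
        return slot_index // SLOTS_PER_SEGMENT

read_array = AioArray(n_segments=4)  # e.g. innobase_read_io_threads = 4
assert len(read_array.slots) == 1024
assert read_array.segment_for(300) == 1  # second "thread" owns slots 256..511
```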

os_aio_init is called when InnoDB starts, to initialize these structures, including the global arrays above as well as the locks and mutexes used by simulated aio; os_aio_free releases them. The os_aio_print_XXX family of functions outputs the state of the aio subsystem, mainly for the show engine innodb status statement.

Simulated aio

Compared with native aio, simulated aio is relatively complex, because InnoDB implements an entire simulation mechanism itself.

The entry point is os_aio_func. In debug mode it validates the parameters, for example that the buffer address, the file offset, and the amount of data to read or write are all integer multiples of OS_FILE_LOG_BLOCK_SIZE, but it does not check whether the file was opened with O_DIRECT, because simulated aio ultimately uses synchronous IO, which does not require O_DIRECT.

After validation, os_aio_array_reserve_slot is called to assign the IO request to a background IO handler thread (logically determined by innobase_xxxx_io_threads, though all requests actually live in the same global array) and to record the request's details for the background thread's use. If two requests have the same IO type, target the same file, and have nearby offsets (by default, within 1MB of each other), InnoDB assigns them to the same IO thread to ease IO merging in later steps.

After submitting the IO request, the background IO handler thread must be woken up, because a background thread that sees no pending IO requests enters a wait state (os_event_wait).

At this point, the function returns, the program can do other things, and the subsequent IO processing is left to the background thread.

Now let's describe how the background IO threads handle requests.

When InnoDB starts, the background IO threads are started (io_handler_thread). Each calls os_aio_simulated_handle to extract an IO request from the global array, processes it with synchronous IO, and then does the follow-up work; for a write request, for example, the corresponding data page must be removed from the buffer pool's dirty page list.

os_aio_simulated_handle first has to select an IO request from the array to execute. The selection algorithm is not simple first-in-first-out: it picks the request with the smallest offset among all pending requests, which makes subsequent IO merging easier to compute. However, this can easily leave requests with isolated offsets unexecuted for a long time, i.e. starved. To solve this, InnoDB scans the pending requests before selecting: if any request was pushed more than 2 seconds ago (that is, has already waited 2s) without being executed, the oldest such request is executed first to prevent starvation; if two requests have waited equally long, the one with the smaller offset is chosen.
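The anti-starvation selection rule can be sketched as follows (a toy model; the 2-second threshold comes from the text, the data layout is invented):

```python
import time

STARVATION_LIMIT = 2.0  # seconds, as described above

def pick_request(pending):
    """pending: list of (submit_time, offset). Prefer the smallest offset,
    but any request waiting longer than 2s is served oldest-first."""
    now = time.time()
    starved = [r for r in pending if now - r[0] >= STARVATION_LIMIT]
    if starved:
        # oldest first; on equal waiting time, the smaller offset wins
        return min(starved, key=lambda r: (r[0], r[1]))
    return min(pending, key=lambda r: r[1])  # smallest offset

now = time.time()
reqs = [(now, 100), (now - 3.0, 5000), (now, 50)]
assert pick_request(reqs) == (now - 3.0, 5000)  # the starved request wins
assert pick_request([(now, 100), (now, 50)]) == (now, 50)
```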

The next thing os_aio_simulated_handle does is merge IO. For example, if read request 1 asks for 200 bytes of file1 starting at offset 100, and read request 2 asks for 100 bytes of file1 starting at offset 300, the two can be merged into one request for 300 bytes of file1 starting at offset 100; after the IO returns, the data is copied into the buffers of the original requests. A write request similarly copies the data to be written into a temporary buffer before the write, then writes it all at once. Note that IO is merged only when the offsets are contiguous; gaps, overlaps, and even identical requests are not merged, which can be seen as a possible point of optimization.
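The contiguity rule for merging can be sketched as follows (illustrative only; real InnoDB merges runs of slots, not just pairs):

```python
def try_merge(req1, req2):
    """Merge two read requests (fd, offset, length) if, and only if, they
    are strictly contiguous; a gap or an overlap means no merge."""
    a, b = sorted([req1, req2], key=lambda r: r[1])
    if a[0] == b[0] and a[1] + a[2] == b[1]:
        return (a[0], a[1], a[2] + b[2])
    return None

# The example from the text: 200 bytes at offset 100 + 100 bytes at 300.
assert try_merge((1, 100, 200), (1, 300, 100)) == (1, 100, 300)
assert try_merge((1, 100, 200), (1, 400, 100)) is None  # gap: no merge
assert try_merge((1, 100, 200), (1, 250, 100)) is None  # overlap: no merge
```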

If os_aio_simulated_handle finds that there is no IO request now, it will enter the waiting state, waiting to be awakened.

To sum up, IO requests are pushed one at a time, and each push wakes a background thread to process it. If the background threads run at high priority, the IO merging effect can be poor. To address this, simulated aio provides something like a group commit: a batch of IO requests is submitted before the background thread is woken to handle them together, which improves IO merging. But a small problem remains: if the background threads are busy, they never enter the wait state, meaning requests are processed as soon as they are queued, and the batching is lost. This problem is solved by native aio, below.

Generally speaking, this simulation mechanism implemented by InnoDB is safe and reliable; if the platform does not support native aio, it is used to read and write the data files.

Linux native aio

If the system has the libaio library installed and innodb_use_native_aio=on is set in the configuration file, Native aio will be used at startup.

The entry point is still os_aio_func. In debug mode it still checks the passed-in parameters, but again does not check whether the file was opened with O_DIRECT, which is a bit risky: if users do not know that Linux native aio requires files to be opened with O_DIRECT to take effect, performance will fall short of expectations. It would be better to check here and report any problem to the error log.

After the checks pass, os_aio_array_reserve_slot is called, as with simulated aio, to assign the IO request to a background thread; the assignment algorithm considers later IO merging, just as in simulated aio. The main difference is that the iocb structure must be initialized with the request's parameters. Besides initializing the iocb, the request's details are also recorded in the slot of the global array, mainly for the statistics in the os_aio_print_XXX functions.

Call io_submit to submit the request.

At this point, the function returns, the program can do other things, and the subsequent IO processing is left to the background thread.

Next comes the background IO thread.

Similar to simulated aio, the background IO threads are started when InnoDB starts. With Linux native aio, they then call the function os_aio_linux_handle. This function works like os_aio_simulated_handle, but its underlying implementation is much simpler: it just calls io_getevents to wait for IO requests to complete, with a timeout of 0.5s, meaning that if no IO request completes within 0.5s the call returns and io_getevents is called again to keep waiting; of course, before waiting again it checks whether the server is shutting down and exits if so.

When distributing IO requests to threads, adjacent IOs are put in the same thread as far as possible, just as in simulated aio; but the subsequent IO merging, which simulated aio implements itself, is left to the kernel under native aio, so the code is much simpler.

Another difference: when there are no pending IO requests, simulated aio enters a wait state, while native aio wakes up every half second, does some checks, and waits again. So when a new request arrives, simulated aio needs the user thread to wake it up, while native aio does not; and on server shutdown, simulated aio must be woken, while native aio need not be.

It can be seen that native aio, like simulated aio, submits and processes requests one by one, which makes IO merging ineffective. The Facebook team contributed a group-commit optimization for native aio: IO requests are cached first, then a single io_submit call submits all of them at once (io_submit can submit multiple requests in one call), which makes it easier for the kernel to optimize the IO. Under heavy IO thread load, simulated aio's group commit degrades, whereas native aio's does not. Note, too, that group commit must not batch too many requests at once: if the length of the aio wait queue would be exceeded, an early io_submit is forced.

That concludes "What are the knowledge points of MySQL's InnoDB IO subsystem". Thank you for reading!
