Understanding Data Consistency and IO Types in Linux

2025-04-03 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 05/31 Report --

This article explains data consistency and IO types in Linux. The content is straightforward and easy to follow; read along to learn how Linux gets data to disk and what guarantees each IO type provides.

In the Linux kernel, a read or write must traverse several layers before it actually reaches the hard disk. Along the IO path, an IO passes through the page cache, the IO scheduling queue, the dispatch queue, the NCQ queue, and the hard disk cache before it truly reaches the platters.

Page cache: the page cache is the caching layer provided by the Linux kernel; the name indicates that the kernel manages this cache in page-sized units (usually 4 KB). A read operation first looks in the page cache; if the data is found, it is copied out and returned, and the lower layers are invoked only on a miss. A buffered write writes to the page cache and returns immediately; writing to the actual disk is the responsibility of the kernel's pdflush threads.
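As a quick illustration from user space, the gap between "the write returned" and "the data is on disk" can be closed explicitly (a minimal Python sketch; os.fsync is used here to force writeback rather than waiting for the kernel's writeback threads):

```python
import os
import tempfile

# A buffered write returns as soon as the data is in the page cache.
# os.fsync() forces the dirty pages (and metadata) to the disk instead of
# waiting up to 30 seconds for the kernel's writeback threads.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello, page cache")  # lands in the page cache and returns
    os.fsync(fd)                        # now pushed down toward the disk
    with open(path, "rb") as f:
        data = f.read()
    print(data)                         # b'hello, page cache'
finally:
    os.close(fd)
    os.unlink(path)
```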

IO scheduling queue:

The Linux kernel provides four IO scheduling algorithms: anticipatory (as), deadline, cfq, and noop. Each scheduler implements its own scheduling queue. An IO is first sorted in the queue (noop is the simplest and does not sort), and then moves to the dispatch queue when certain conditions are met. Moving IO from the scheduling queue to the dispatch queue involves the concept of unplugging: the scheduling queue is normally in a plugged state, and when an unplug operation is performed, IO leaves the scheduling queue and is dispatched. Unplug is a looping action that keeps dispatching IO from the scheduling queue until no more can be dispatched.

To sum up, unplug is triggered under the following conditions:

The first IO in the queue starts a 3 ms timer; when the timer fires, unplug is executed and dispatch begins.

If the number of queued IO requests exceeds the configured limit (4 by default), unplug is executed and dispatch begins.

An IO carrying the Sync flag triggers an immediate unplug and dispatch.

An IO carrying the Barrier flag triggers an unplug after the scheduling queue has been drained.

After an IO completes, the queue is unplugged as well.
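The conditions above can be sketched as a toy model (this is an illustration, not kernel code; the names PluggedQueue and submit are invented for the sketch). It covers two of the triggers, the Sync flag and the queue-length limit; the real kernel also unplugs on the 3 ms timer and on IO completion:

```python
# Toy model of the plug/unplug policy described above.
UNPLUG_THRESHOLD = 4  # default request limit before an unplug

class PluggedQueue:
    def __init__(self):
        self.pending = []     # IOs held while the queue is plugged
        self.dispatched = []  # IOs sent onward by unplug

    def submit(self, io, sync=False):
        self.pending.append(io)
        # Unplug when a Sync IO arrives or the queue exceeds the limit.
        if sync or len(self.pending) > UNPLUG_THRESHOLD:
            self.unplug()

    def unplug(self):
        # Loop until nothing more can be dispatched (trivially: send all).
        self.dispatched.extend(self.pending)
        self.pending.clear()

q = PluggedQueue()
for i in range(5):
    q.submit(f"io{i}")        # the fifth IO exceeds the limit and unplugs
print(len(q.dispatched))      # 5
```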

Dispatch queue: the dispatch queue is not visible to applications, but the kernel uses it to implement barrier IO for the journal data of journaling file systems; that mechanism lives mainly in the dispatch queue.

NCQ queue:

NCQ is the queue implemented by a SATA drive itself (the equivalent for SAS drives is called TCQ). The NCQ queue is set up by the operating system in DMA-accessible kernel memory, but which of the queued IOs executes next is decided by the drive: the OS fills the queue and notifies the drive, and the drive chooses the order of execution on its own.

Hard disk cache:

The hard disk cache is the cache inside the drive itself. If the disk cache is enabled, a write IO first lands in the disk cache rather than going directly to the platters.

1. Writeback Logic of Pdflush

Pdflush exposes four parameters that control writeback, and the writeback policy in the kernel implementation is fairly involved.

Simply put, by default the kernel scans for dirty pages every 5 seconds, and any dirty page that has lived longer than 30 seconds (the default) is flushed to disk.
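Assuming the four parameters are the usual vm writeback sysctls (the names and default values below are common kernel defaults, not taken from this article; verify them under /proc/sys/vm on a live system), they map to the 5-second and 30-second behavior like this:

```python
# Common writeback knobs under /proc/sys/vm; values are typical defaults,
# in centiseconds where noted.
writeback_defaults = {
    "dirty_writeback_centisecs": 500,   # scan interval: 500 cs = 5 s
    "dirty_expire_centisecs":    3000,  # dirty-page age limit: 3000 cs = 30 s
    "dirty_background_ratio":    10,    # % of memory before background flush
    "dirty_ratio":               20,    # % of memory before writers block
}
scan_interval_s = writeback_defaults["dirty_writeback_centisecs"] / 100
expire_age_s = writeback_defaults["dirty_expire_centisecs"] / 100
print(scan_interval_s)  # 5.0
print(expire_age_s)     # 30.0
```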

For details, please refer to my article "linux kernel writeback mechanism and adjustment".

2. Getting Data to Disk and Consistency Analysis

From the analysis above, an ordinary write IO normally ends at the page cache layer and returns; it does not actually reach the disk. As a result, if the machine loses power or crashes, there is a risk of losing data. To get IO to the disk as soon as possible, the system provides several mechanisms to address this.

O_SYNC: the O_SYNC flag can be set when opening a file. After a write to the page cache completes, if the file has the O_SYNC flag, the IO is issued immediately and enters the scheduling queue. The file system's metadata is then issued as well, after which unplug operations are executed in a loop until all of the write IO has completed. Compared with the writeback mechanism, O_SYNC does not wait for dirty pages to age for 30 seconds; it tries to send them to the disk immediately.
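From user space this looks like a normal write that simply blocks longer (a minimal sketch using the POSIX O_SYNC flag via Python's os module):

```python
import os
import tempfile

# Opening with O_SYNC makes every write() block until the data and the
# required metadata have been pushed toward the disk, instead of returning
# as soon as the page cache has been updated.
fd, path = tempfile.mkstemp()
os.close(fd)
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
try:
    os.write(fd, b"synchronous write")  # returns only after the IO completes
finally:
    os.close(fd)
with open(path, "rb") as f:
    data = f.read()
os.unlink(path)
print(data)  # b'synchronous write'
```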

The essence of O_SYNC is to issue the IO and then perform unplug operations. O_SYNC has several problems:

When writing to the page cache, the IO is split into 4 KB units, and writeback likewise writes one 4 KB page at a time. For a large IO, the kernel's scheduling layer then has to recombine the 4 KB pieces, which is redundant work.

Every IO must be unplugged immediately, so IOs cannot be sorted and merged; the performance of O_SYNC is therefore quite low.

If multiple processes write at the same time, the order of the write operations cannot be guaranteed: the NCQ queue decides execution order based on the position of the disk head and the rotation of the platters. Metadata is usually written together with the data, so there is a risk of the two getting out of sync.

If the disk cache is enabled, a write returns as soon as it reaches the disk cache, so there is still a risk of data loss. Storage vendors usually require the disk cache to be turned off; nevertheless, all of Tencent's servers run with the disk cache enabled.

O_DIRECT: the O_DIRECT flag can be set when opening a file. O_DIRECT bypasses the page cache provided by the kernel: a read does not check the page cache for the requested data, and a write does not put the data into the page cache but sends it straight to the scheduling queue.

When O_DIRECT issues a write IO, it sets the WRITE_SYNC flag, which performs a single unplug operation after the IO enters the scheduling queue, rather than unplugging in a loop as O_SYNC does.
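One practical consequence of bypassing the page cache is that O_DIRECT imposes alignment requirements on the caller (a sketch of the alignment side only; the actual os.O_DIRECT open is omitted here because some file systems, such as tmpfs, reject the flag):

```python
import mmap

# O_DIRECT requires the user buffer, the file offset, and the transfer size
# to be aligned to the device's logical block size (assumed here to be
# 4096 bytes). mmap returns page-aligned memory, so it is a common way to
# allocate such a buffer.
BLOCK = 4096
buf = mmap.mmap(-1, BLOCK)          # anonymous, page-aligned buffer
payload = b"direct io payload"
buf[:len(payload)] = payload        # fill part of the aligned buffer
print(len(buf) % BLOCK)             # 0: the buffer length is block-aligned
```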

To avoid O_SYNC's blocking on every write IO, the system provides the fsync and fdatasync system calls, which let the application decide when to synchronize.

fsync: fsync sends all dirty pages within the file's range to the disk, then writes the dirty metadata to the disk as well. If the file's inode itself has changed, that is also written to the disk.

fdatasync: the difference between fdatasync and fsync is actually very slight. On the ext2 file system, if the file's inode has changed only trivially, fdatasync skips updating the inode; a typical trivial change is an update to the file's atime. On the ext3 file system, fsync and fdatasync are exactly the same: the inode is written back whether the change is trivial or not.

Both fsync and fdatasync operate on the entire file; if the application only wants to flush a specific range, these two calls do more work than necessary. Newer kernels therefore provide sync_file_range to flush a specified range. Note, however, that sync_file_range does not write back the file's metadata, so the application must ensure there is no metadata that needs updating.
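Both calls are available directly from Python's os module, which makes the division of labor easy to demonstrate (a minimal sketch):

```python
import os
import tempfile

# fdatasync flushes the file's data (plus only the metadata needed to read
# it back, such as the file size); fsync additionally flushes trivial inode
# changes such as timestamps.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"data to make durable")
    os.fdatasync(fd)  # data is on disk; atime/mtime may still be dirty
    os.fsync(fd)      # inode metadata is on disk as well
    with open(path, "rb") as f:
        data = f.read()
    print(data)       # b'data to make durable'
finally:
    os.close(fd)
    os.unlink(path)
```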


3. Barrier IO in the Kernel

From the analysis above, the kernel provides no system call that guarantees write ordering to the disk for user space. For kernel file systems, however, such an interface must exist: a journaling file system must have its journal data on the disk before it modifies the metadata the journal describes, otherwise a failure could corrupt the file system. For this purpose the kernel provides barrier IO, which gets journal writes to the disk in a precisely controlled order.

A file system barrier IO means that all write IO issued before the barrier must complete first, and no other write IO may execute until the barrier IO itself has completed (actually written to the disk, not merely acknowledged once it reaches the disk cache). The dispatch queue analyzed above implements this function.

When a write IO moves from the scheduling queue to the dispatch queue, the kernel checks whether it is a barrier IO. For a barrier IO, dispatch first inserts a SCSI SYNCHRONIZE_CACHE command into the queue, instructing the drive to flush every write in its cache to the platters. It then issues the barrier IO itself, followed by another SYNCHRONIZE_CACHE command so that the barrier IO is truly on the disk. (An alternative is to set the FUA flag on the command, which writes through to the platters without passing through the disk cache.)
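The flush-write-flush sequence can be sketched as a toy model (an illustration of the command ordering described above; the function name dispatch_barrier is invented, not a kernel symbol):

```python
# Toy model of how the dispatch queue handles a barrier IO: flush the disk
# cache, write the barrier IO, flush again.
def dispatch_barrier(barrier_io):
    commands = ["SYNCHRONIZE_CACHE"]      # flush writes already in the disk cache
    commands.append(barrier_io)           # the barrier write itself
    commands.append("SYNCHRONIZE_CACHE")  # ensure the barrier write hit the platters
    return commands

cmds = dispatch_barrier("journal-commit")
print(cmds)  # ['SYNCHRONIZE_CACHE', 'journal-commit', 'SYNCHRONIZE_CACHE']
```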

Thank you for reading; that concludes this discussion of data consistency and IO types in Linux. After studying this article you should have a deeper understanding of the topic, and the specifics should be verified in practice.
