SLTechnology News&Howtos > Servers | Shulou (shulou.com), updated 2025-02-25
This article introduces how the disk caching mechanism works under Linux and what write amplification means for SSDs. These are situations many people run into in practice, so the goal here is to walk through them carefully.
Some time ago, while developing a system that uses an SSD as a cache, I found that writing data at high speed produces a large amount of dirty disk cache. If too much cached data is not written back to disk in time, a machine failure can lose a great deal of data; but if every write is flushed to disk immediately, write efficiency drops sharply. To understand this write-back behavior of the Linux system, I recently took a closer look at it.
The VFS (Virtual File System) layer makes Linux compatible with different file systems, such as ext3, ext4, XFS, and NTFS. Besides implementing a common external interface for all file systems, it plays another role that matters for system performance: caching. VFS introduces a disk cache mechanism, a software mechanism that lets the kernel keep in RAM some information that normally lives on disk, so that further accesses to that data are fast and do not have to touch the slow disk itself. Disk caches fall roughly into three categories:
Dentry cache - mainly stores dentry objects, which describe file system pathnames
Inode cache - mainly stores inode objects, which describe on-disk inodes
Page cache - mainly stores complete pages of file data; the data in each page must belong to a file, and all file read and write operations go through it. It is the main disk cache used by the Linux kernel.
Precisely because of this cache, the VFS layer uses a delayed-write technique for file data: unless a synchronous write mode is requested when calling the write system interface, most data is first kept in the cache and only flushed to disk once certain conditions are met.
So when and how does the kernel flush data to disk? The following two points give the answer.
1. Write dirty pages to disk
As we know, the kernel keeps filling the page cache with pages containing block device data. Whenever a process modifies such data, the corresponding page is marked dirty, that is, its PG_dirty flag is set.
Unix systems allow the write-back of dirty buffers to block devices to be delayed, because this strategy significantly improves system performance. Several writes to a page in the cache may be satisfied by a single, slow physical update of the corresponding disk blocks. Moreover, writes are less urgent than reads: a process is usually not suspended because a write is delayed, whereas it usually is suspended when a read is delayed. Thanks to delayed writes, on average any physical block device serves more read requests than write requests.
A dirty page might stay in main memory until the last possible moment (that is, until the system shuts down). However, the delayed-write strategy has two main drawbacks:
First, if a hardware failure or power outage occurs, the contents of RAM can no longer be recovered, so many changes made to files since the system started are lost.
Second, the RAM needed to hold the page cache would have to be huge, at least as large as the block devices being accessed.
Therefore, dirty pages are flushed (written) to disk under the following conditions:
The page cache gets too full and more pages are needed, or the number of dirty pages becomes too large.
Too much time has passed since a page became dirty.
A process requests that any pending changes to a block device or to a particular file be flushed; it does this by invoking the sync(), fsync(), or fdatasync() system call.
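As a minimal illustration of that last point, the sketch below writes data through the page cache and then forces it to stable storage with os.fsync(). The temporary file path is created on the fly; fdatasync() is shown in a comment as the lighter-weight alternative that skips non-essential metadata.

```python
import os
import tempfile

# Write data; without an explicit flush it may sit in the page cache
# as dirty pages until the kernel decides to write it back.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"important data")
    os.fsync(fd)        # force file data AND metadata to the disk
    # os.fdatasync(fd)  # data only, skipping metadata when possible (Unix)
finally:
    os.close(fd)

with open(path, "rb") as f:
    print(f.read())     # b'important data'
os.remove(path)
```

Note that fsync() only guarantees the data has left the page cache; whether it has reached the platters or flash cells also depends on the device's own volatile cache.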
Buffer pages make the situation more complex. The buffer head associated with each block buffer lets the kernel track the status of each individual buffer. If the BH_Dirty flag of at least one buffer head is set, the PG_dirty flag of the corresponding buffer page should also be set. When the kernel selects a buffer page to flush, it scans its buffer heads and writes only the contents of the dirty blocks to disk. Once the kernel has flushed all of a buffer page's dirty blocks to disk, it clears the page's PG_dirty flag.
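The dirty-flag relationship can be modeled with a small toy (the BufferHead and BufferPage classes here are illustrative stand-ins, not kernel structures): the page counts as dirty if any of its block buffers is dirty, a flush writes only the dirty blocks, and the page-level flag clears once all blocks are clean.

```python
from dataclasses import dataclass, field

@dataclass
class BufferHead:
    dirty: bool = False          # models the buffer head's BH_Dirty flag

@dataclass
class BufferPage:
    heads: list = field(default_factory=list)

    @property
    def pg_dirty(self) -> bool:
        # PG_dirty should be set if at least one block buffer is dirty.
        return any(bh.dirty for bh in self.heads)

    def flush(self) -> int:
        # Write back only the dirty blocks; return how many were written.
        written = sum(1 for bh in self.heads if bh.dirty)
        for bh in self.heads:
            bh.dirty = False     # all blocks clean, so PG_dirty clears
        return written

page = BufferPage(heads=[BufferHead(), BufferHead(dirty=True), BufferHead()])
print(page.pg_dirty)   # True: one dirty block dirties the whole page
print(page.flush())    # 1: only the dirty block is written
print(page.pg_dirty)   # False
```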
2. The pdflush kernel threads
Earlier versions of Linux used the bdflush kernel thread to systematically scan the page cache for dirty pages to flush, and a second kernel thread, kupdate, to ensure that no page stayed dirty for too long. Linux 2.6 replaced both with a pool of general-purpose kernel threads called pdflush.
These kernel threads have a flexible structure and operate on two parameters: a pointer to the function the thread should execute and an argument for that function. The number of pdflush threads in the system is adjusted dynamically: new pdflush threads are created when there are too few and killed when there are too many. Because the functions these threads execute can block, creating several pdflush kernel threads rather than just one improves system performance.
pdflush threads are created and destroyed according to the following rules:
There must be at least two and at most eight pdflush kernel threads
If no pdflush thread has been idle during the last second, a new pdflush thread should be created
If the last pdflush thread to go idle has been idle for more than one second, a pdflush thread should be removed
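The sizing rules above can be sketched as a pure decision function. This is a simplified model of the policy, not kernel code; the function and parameter names are made up for illustration.

```python
MIN_THREADS, MAX_THREADS = 2, 8

def pdflush_pool_action(total: int, idle: int,
                        seconds_since_idle_seen: float,
                        oldest_idle_seconds: float) -> str:
    """Decide how the pdflush pool should change, per the rules above.

    total  -- current number of pdflush threads
    idle   -- how many of them are currently idle
    seconds_since_idle_seen -- time since an idle thread was last available
    oldest_idle_seconds     -- how long the longest-idle thread has slept
    """
    if idle == 0 and seconds_since_idle_seen >= 1.0 and total < MAX_THREADS:
        return "create"   # no idle worker for a whole second: add one
    if idle > 0 and oldest_idle_seconds > 1.0 and total > MIN_THREADS:
        return "kill"     # a worker idle for over a second: remove one
    return "keep"

print(pdflush_pool_action(total=2, idle=0,
                          seconds_since_idle_seen=1.5,
                          oldest_idle_seconds=0.0))   # create
print(pdflush_pool_action(total=4, idle=2,
                          seconds_since_idle_seen=0.0,
                          oldest_idle_seconds=3.0))   # kill
print(pdflush_pool_action(total=2, idle=1,
                          seconds_since_idle_seen=0.0,
                          oldest_idle_seconds=3.0))   # keep: at the minimum
```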
Every pdflush kernel thread has a pdflush_work descriptor, which records the worker thread itself, the callback function it should execute, the callback's argument, and the time at which the thread went to sleep.
When the system has no dirty pages to flush, a pdflush thread automatically goes to sleep until it is woken up by the pdflush_operation() function. What does a pdflush kernel thread do once woken? Part of its work is flushing dirty data; in particular, pdflush usually executes one of the following callback functions:
1. background_writeout(): systematically scans the page cache for dirty pages to flush.
To find the dirty pages to flush, the thread exhaustively searches all the address_space objects (each is a radix tree) associated with inodes that have an image on disk. Since the page cache may contain a great many pages, scanning the whole cache in a single execution stream would keep the CPU and disk busy for a long time, so Linux splits the page cache scan across several execution streams. The wakeup_bdflush() function is run when memory is low or when a flush is explicitly requested (for example, when a user-mode process issues the sync() system call). wakeup_bdflush() calls pdflush_operation() to wake a pdflush kernel thread and delegate the callback background_writeout() to it. The background_writeout() function effectively takes a given number of dirty pages from the page cache and writes them back to disk. In addition, a pdflush kernel thread executing background_writeout() can only be woken up once page contents in the page cache have been modified and the number of dirty pages has grown beyond a certain dirty background threshold. This background threshold is normally set to 10% of the pages in the system, but it can be adjusted through /proc/sys/vm/dirty_background_ratio.
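Since the background threshold is just a percentage of the system's pages, it is easy to compute concretely. The sketch below does the arithmetic for a hypothetical 4 GiB machine with 4 KiB pages; the function name is made up, and the 10% default mirrors /proc/sys/vm/dirty_background_ratio.

```python
PAGE_SIZE = 4096  # bytes; typical page size on x86

def dirty_background_threshold_pages(total_pages: int,
                                     ratio_percent: int = 10) -> int:
    """Dirty-page count above which background writeback kicks in.

    ratio_percent mirrors /proc/sys/vm/dirty_background_ratio (default 10).
    """
    return total_pages * ratio_percent // 100

# Example: a machine with 4 GiB of RAM has 1,048,576 pages of 4 KiB.
total = (4 * 1024**3) // PAGE_SIZE
print(dirty_background_threshold_pages(total))  # 104857 pages, about 410 MiB
```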
2. wb_kupdate(): checks the page cache for pages that have stayed dirty for a long time, so that no page starves by going unflushed indefinitely.
During initialization, the kernel sets up a wb_timer dynamic timer whose interval is the number of hundredths of a second given by dirty_writeback_centisecs (usually 500, i.e. 5 seconds); this value can be adjusted through the /proc/sys/vm/dirty_writeback_centisecs file. The timer function calls pdflush_operation() and passes it the address of the wb_kupdate() function. wb_kupdate() walks the page cache looking for stale dirty inodes, writes to disk the pages that have been dirty for more than 30 seconds, and then restarts the timer.
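The age check wb_kupdate() performs amounts to filtering pages by how long they have been dirty. A toy version of that filter, with an invented data shape (a list of page-id/timestamp pairs), looks like this:

```python
DIRTY_EXPIRE_SECONDS = 30  # pages dirty longer than this are written back

def expired_dirty_pages(dirty_pages, now):
    """Return ids of pages whose dirty age exceeds the expiry interval.

    dirty_pages is a list of (page_id, dirtied_at_timestamp) tuples.
    """
    return [pid for pid, dirtied_at in dirty_pages
            if now - dirtied_at > DIRTY_EXPIRE_SECONDS]

pages = [("a", 0.0), ("b", 60.0), ("c", 95.0)]
print(expired_dirty_pages(pages, now=100.0))  # ['a', 'b']
```

Page "c" was dirtied only 5 seconds ago, so it stays in the cache until a later timer tick.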
PS: on the problem of SSD write amplification
Solid state drives are increasingly used as server disks. While designing and implementing a cache system on an SSD (Solid State Drive) to store data blocks, I ran into some problems. For example, once the disk is full, if you age out some of the least-used data blocks and keep writing large amounts of new data, over time the write speed becomes much slower than at the beginning. To find out why, I looked up information about SSDs and learned that this behavior is determined by the SSD's own hardware design and is ultimately reflected in the application. The phenomenon is called write amplification (WA). WA is a very important attribute of flash memory and SSDs. The term was first used in public contributions in 2008 by Intel and SiliconSystems (acquired by Western Digital in 2009). Below is a brief explanation of why it happens and how it unfolds.
An SSD is designed completely differently from a traditional mechanical disk: it is an entirely electronic device with no read/write head. Because there is no head seeking across tracks when reading or writing data, an SSD can deliver much higher IOPS. The absence of head scheduling also reduces power consumption, which is very attractive for enterprise data center use.
Compared with a traditional disk, an SSD has great performance advantages, but everything has two sides, and SSDs have problems of their own. Data written to an SSD cannot be updated in place; it can only be rewritten by overwriting, and the target must be erased before it is overwritten. The erase operation can only be performed on a whole flash block. Before a block is erased, its remaining valid data must first be read out and then rewritten along with the new data. These repeated operations increase the amount of data physically written, shorten the life of the flash memory, eat into the available flash bandwidth, and indirectly hurt random write performance.
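The cost of this read-erase-rewrite cycle is easy to quantify with a toy calculation (the function and the 512 KiB erase-block size are illustrative assumptions, not a specific drive's geometry): updating a small span inside an erase block forces the whole block to be rewritten.

```python
def in_place_update_wa(update_bytes: int, erase_block_bytes: int) -> float:
    """Write amplification for rewriting one erase block to update a small span.

    The whole block must be read, erased, and rewritten, so the flash
    absorbs erase_block_bytes of writes for update_bytes of host data.
    """
    return erase_block_bytes / update_bytes

# Updating 4 KiB inside a 512 KiB erase block:
print(in_place_update_wa(4 * 1024, 512 * 1024))  # 128.0
```

In other words, a worst-case 4 KiB random overwrite can cost the flash 128 times the host's data volume, which is why sustained random writes slow down so dramatically.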
Mitigating write amplification
In practice it is hard to eliminate SSD write amplification completely; we can only reduce it with certain techniques. One very simple method is to use only part of a large SSD's capacity: with a 128 GB drive, if you use only 64 GB, worst-case write amplification can be reduced by roughly a factor of three. Of course, this method wastes some resources. Another approach is to write data sequentially: when an SSD is written sequentially, write amplification is generally 1, although some factors can still affect this value.
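Write amplification itself is just a ratio: the data physically written to flash divided by the data the host asked to write. A minimal sketch (the numbers are made-up examples, not measurements):

```python
def write_amplification(flash_bytes_written: int,
                        host_bytes_written: int) -> float:
    """WA = data physically written to flash / data the host asked to write."""
    return flash_bytes_written / host_bytes_written

# Purely sequential writes fill whole erase blocks, so WA stays near 1:
print(write_amplification(1_000_000, 1_000_000))  # 1.0
# Random small writes that trigger block rewrites push WA well above 1:
print(write_amplification(3_000_000, 1_000_000))  # 3.0
```

Over-provisioning helps precisely by giving the drive spare erased blocks, so fewer host writes trigger a full read-erase-rewrite cycle and the ratio stays lower.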
Besides the methods above, the approach generally considered best at this stage is TRIM. TRIM operates at the operating system layer: the OS uses the TRIM command to tell the SSD that the data in certain pages is no longer needed and can be reclaimed. The key difference from older operating systems lies in how deletion works. On a mechanical disk, deleting a page merely marks it as available in the file system's records; the data itself is not erased. An operating system that supports TRIM on an SSD additionally notifies the SSD that the page's data is no longer needed. The SSD runs a garbage collection process during idle time, gathering scattered free space together and erasing it in one pass, so that each subsequent write lands on pages that have already been erased.
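Why TRIM helps the garbage collector can be shown with a toy model (the function and data shapes are invented for illustration): without TRIM, the SSD cannot tell deleted data from live data, so it copies everything out of a block before erasing it; with TRIM, the invalidated pages are simply skipped.

```python
def gc_copy_cost(block_pages, trimmed):
    """Pages the garbage collector must copy out before erasing a block.

    block_pages -- set of page ids still holding data in the erase block
    trimmed     -- page ids the OS has TRIMmed (deleted, no longer needed)
    """
    # Without TRIM info, every page looks live and must be copied;
    # TRIMmed pages are known-dead and can be dropped for free.
    return len(block_pages - trimmed)

block = {"p0", "p1", "p2", "p3"}
deleted_by_fs = {"p1", "p3"}

print(gc_copy_cost(block, trimmed=set()))          # 4: all pages copied
print(gc_copy_cost(block, trimmed=deleted_by_fs))  # 2: TRIMmed pages skipped
```

Fewer copied pages means fewer extra flash writes per erase, which is exactly a reduction in write amplification.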
Although SSDs suffer from write amplification, that is no reason to avoid them. Many projects use them for cache acceleration, especially database caching projects, where the SSD's efficient read performance is fully exploited. With the release of Facebook's open-source Flash Cache project and its extensive use inside Facebook, Flash Cache has become a fairly mature technical solution, leading more companies to choose SSDs for storage or caching.