
How to understand Cache and Buffer of Linux


This article is about how to understand Cache and Buffer in Linux. The editor finds it quite practical, so it is shared here for your reference; hopefully you will take something away from it after reading. Without further ado, let's have a look.

First of all, the cache discussed in this article refers to Linux's page cache, and buffer refers to the buffer cache, i.e. the Cached and Buffers fields shown by cat /proc/meminfo.

We know that on Linux, when files are accessed frequently, or a single large file is accessed, physical memory is used up quickly. When the program ends, the memory is not released as one might expect but stays occupied as cache. Because of this, the system often hits OOM, especially under pressure-test scenarios, and when you inspect such a system the first thing you notice is that cache and buffer usage is very high. There is currently no single good answer to this class of problems; in the past most of them were simply worked around, so this article tries to give an analysis and a solution.

The key to solving the problem is to understand what cache and buffer are, when and where they are consumed, and how they can be controlled, so this article focuses on those three points. The discussion starts, as far as possible, from analysis of the kernel source code, then distills the relevant application-level interfaces, verifies them in practice, and finally summarizes programming suggestions for applications.

You can see the system's buffer and cache through free or cat /proc/meminfo.


1. Cache and Buffer analysis

Starting with cat /proc/meminfo, let's look at how this interface is implemented:

static int meminfo_proc_show(struct seq_file *m, void *v)
{
    ......
    cached = global_page_state(NR_FILE_PAGES) -
             total_swapcache_pages() - i.bufferram;
    if (cached < 0)
        cached = 0;
    ......
    seq_printf(m,
        "MemTotal:       %8lu kB\n"
        "MemFree:        %8lu kB\n"
        "Buffers:        %8lu kB\n"
        "Cached:         %8lu kB\n"
        ......
        K(i.totalram),
        K(i.freeram),
        K(i.bufferram),
        K(cached),
        ......
        );
    ......
}

In the kernel these counters are kept in units of page frames and converted to KB for output via the K macro. The values themselves are obtained through si_meminfo:

void si_meminfo(struct sysinfo *val)
{
    val->totalram = totalram_pages;
    val->sharedram = 0;
    val->freeram = global_page_state(NR_FREE_PAGES);
    val->bufferram = nr_blockdev_pages();
    val->totalhigh = totalhigh_pages;
    val->freehigh = nr_free_highpages();
    val->mem_unit = PAGE_SIZE;
}

Here bufferram comes from nr_blockdev_pages(). This function calculates the number of page frames used by block devices: it walks all block devices and sums up their page frames. It does not include the page frames used by ordinary files.

long nr_blockdev_pages(void)
{
    struct block_device *bdev;
    long ret = 0;

    spin_lock(&bdev_lock);
    list_for_each_entry(bdev, &all_bdevs, bd_list) {
        ret += bdev->bd_inode->i_mapping->nrpages;
    }
    spin_unlock(&bdev_lock);
    return ret;
}

From the above we can derive where cache and buffer in meminfo come from:

Buffers is the number of page frames occupied by block devices.

Cached is the kernel's total page cache minus the page frames occupied by swap cache and by block devices; in effect, Cached is the page cache occupied by ordinary files.

Kernel code analysis (the detailed walk-through is skipped here) shows that the two are not very different in implementation: both are managed through address_space objects. The difference is that the page cache caches file data, while the buffer cache caches block device data. Every block device is assigned the def_blk_ops set of file operations, which are the device's operation methods, and under its inode (an inode of the bdev pseudo file system) there is a radix tree in which the cached data pages are placed; the number of those pages is what shows up in the Buffers column of cat /proc/meminfo. In other words, when there is no file system and tools such as dd operate directly on the block device, the data is cached in the buffer cache.

If the block device carries a file system, then every file in that file system has its own inode, which is assigned operation methods such as ext3_ops; these are file system methods. This inode also has a radix tree under it where the file's pages are cached, and the number of those cached pages is counted in the Cached column of cat /proc/meminfo. So when you operate on files, most of the data is cached in the page cache, while a small amount of file system metadata is cached in the buffer cache.

Now let's use the cp command to copy a 50 MB file and see what happens to memory:

[root@test nfs_dir]# ll -h file_50MB.bin
-rw-rw-r-- 1 4104 4106 50.0M Feb 24 2016 file_50MB.bin
[root@test nfs_dir]# cat /proc/meminfo
MemTotal:    90532 kB
MemFree:     65696 kB
Buffers:         0 kB
Cached:       8148 kB
......
[root@test nfs_dir]# cp file_50MB.bin /
[root@test nfs_dir]# cat /proc/meminfo
MemTotal:    90532 kB
MemFree:     13012 kB
Buffers:         0 kB
Cached:      60488 kB

You can see that across the cp command MemFree dropped from 65696 kB to 13012 kB and Cached grew from 8148 kB to 60488 kB, while Buffers stayed the same. So will Linux automatically release this cache memory after a while? Checking /proc/meminfo an hour later showed that Cached still had not changed.

Next, let's look at the memory changes before and after writing to the block device using the dd command:

[0225_19:10:44:10s] [root@test nfs_dir]# cat /proc/meminfo
[0225_19:10:44:10s] MemTotal:    90532 kB
[0225_19:10:44:10s] MemFree:     58988 kB
[0225_19:10:44:10s] Buffers:         0 kB
[0225_19:10:44:10s] Cached:       4144 kB
......
[root@test nfs_dir]# cat /proc/meminfo
MemTotal:    90532 kB
MemFree:     11852 kB
Buffers:     36224 kB
Cached:       4148 kB
......
[0225_19:11:21:11s] [root@test nfs_dir]# cat /proc/meminfo
[0225_19:11:21:11s] MemTotal:    90532 kB
[0225_19:11:21:11s] MemFree:     11356 kB
[0225_19:11:21:11s] Buffers:     36732 kB
[0225_19:11:21:11s] Cached:       4148 kB
......
[root@test nfs_dir]# cat /proc/meminfo
MemTotal:    90532 kB
MemFree:     11864 kB
Buffers:     36264 kB
Cached:       4148 kB
......

Before the raw write to the block device, Buffers is 0. While the raw write to the disk is in progress, Buffers keeps growing and free memory keeps shrinking, while Cached stays essentially unchanged.

Summary:

Through the code analysis and the hands-on experiments we can see that both the buffer cache and the page cache consume memory, and we can also see the difference between them: the page cache caches file data, while the buffer cache caches block device data. Linux does not actively release either of them while there is still plenty of free memory.

2. Use posix_fadvise to control Cache

On Linux, files are normally read and written through buffered I/O in order to make full use of the page cache.

Buffered I/O works like this: on a read, the kernel first checks whether the required data is already in the page cache; if not, it reads the data from the device, returns it to the user, and keeps a copy in the cache. On a write, the data is written directly to the cache and flushed to disk later by background processes. This mechanism looks very good, and in practice it does improve the efficiency of file reads and writes.
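As a concrete illustration (the file name /tmp/cache_demo.bin and the 50 MB size are arbitrary choices for this sketch, not taken from the article), the following program writes 50 MB through ordinary buffered I/O and prints the Cached field of /proc/meminfo before and after; the growth it reports is the same effect as the cp experiment shown earlier.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the "Cached:" value from /proc/meminfo in kB, or -1 on error. */
static long cached_kb(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long val = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Cached: %ld kB", &val) == 1)
            break;
    }
    fclose(f);
    return val;
}

int main(void)
{
    char buf[4096];
    long before = cached_kb();
    int fd = open("/tmp/cache_demo.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int i;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(buf, 0xAA, sizeof(buf));
    for (i = 0; i < 12800; i++) {             /* 12800 * 4 kB = 50 MB */
        /* Buffered write: the data lands in the page cache first. */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            break;
    }
    close(fd);
    printf("Cached before: %ld kB, after: %ld kB\n", before, cached_kb());
    return 0;
}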

But when the system's I/O is intensive, problems arise. When the amount written exceeds a certain memory threshold, the background writeback threads come out to reclaim pages, but once the reclaim rate falls below the write rate, an OOM is triggered. Worst of all, the whole process happens inside the kernel and is hard for users to control.

So how can we effectively control cache?

At present there are two main ways to avoid the risk:

Use direct I/O.

Use buffered I/O, but periodically remove page cache that is no longer needed.

What is discussed here is, of course, the second way: how to effectively control the page cache in buffered I/O mode.

As long as the program knows the file's descriptor, it can call:

int posix_fadvise(int fd, off_t offset, off_t len, int advice);

with the advice POSIX_FADV_DONTNEED (the file will not be accessed in the near future). However, there has been feedback from developers doubting whether this interface really works. So is the interface actually effective? Let's first look at the mm/fadvise.c kernel code to see how posix_fadvise is implemented:

/*
 * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
 * deactivate the pages and clear PG_Referenced.
 */
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
    ......
    /* => flush the data in the specified range out of the page cache */
    case POSIX_FADV_DONTNEED:
        /* => if the backing device is not congested, first call
         *    __filemap_fdatawrite_range to write out the dirty pages */
        if (!bdi_write_congested(mapping->backing_dev_info))
            /* => WB_SYNC_NONE: do not wait synchronously for the writeback to
             *    finish, just submit it and return; fsync and fdatasync pass
             *    the WB_SYNC_ALL parameter instead */
            __filemap_fdatawrite_range(mapping, offset, endbyte,
                                       WB_SYNC_NONE);

        /* First and last FULL page! */
        start_index = (offset + (PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
        end_index = (endbyte >> PAGE_CACHE_SHIFT);

        /* => next, clear the page cache */
        if (end_index >= start_index) {
            unsigned long count = invalidate_mapping_pages(mapping,
                                       start_index, end_index);

            /*
             * If fewer pages were invalidated than expected then
             * it is possible that some of the pages were on
             * a per-cpu pagevec for a remote CPU. Drain all
             * pagevecs and try again.
             */
            if (count < (end_index - start_index + 1)) {
                lru_add_drain_all();
                invalidate_mapping_pages(mapping, start_index, end_index);
            }
        }
        break;
    ......
}

We can see that, if the backing device is not busy, __filemap_fdatawrite_range is called first to write out the dirty pages. The writeback is submitted with the WB_SYNC_NONE parameter, i.e. the call does not wait synchronously for the pages to be written back but returns as soon as the dirty-page write has been submitted. Then invalidate_mapping_pages is called to clear the pages and reclaim the memory.
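To make the use of this interface concrete, here is a minimal userspace sketch (the file path /tmp/big_file.bin is hypothetical): it flushes any dirty pages of the file with fdatasync and then asks the kernel to drop the file's clean page cache with POSIX_FADV_DONTNEED, mirroring the kernel behaviour analyzed above.

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/big_file.bin";   /* hypothetical file */
    int ret;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Flush dirty pages first: POSIX_FADV_DONTNEED skips dirty pages. */
    if (fdatasync(fd) < 0)
        perror("fdatasync");

    /* offset = 0, len = 0 means "the whole file". */
    ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (ret != 0)
        fprintf(stderr, "posix_fadvise failed: %d\n", ret);

    close(fd);
    return 0;
}

Checking the Cached field of /proc/meminfo before and after running this on a freshly copied file is a simple way to answer the "does the interface really work" question above.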

< 0) {
    return -1;
}
if (ioctl(fd, BLKFLSBUF, 0)) {
    printf("ioctl cmd BLKFLSBUF failed, errno:%d\n", errno);
}
close(fd);
printf("ioctl cmd BLKFLSBUF ok!\n");
return 0;
}

In summary, the block device command BLKFLSBUF effectively clears all buffers on a block device, and once cleared those buffers are immediately released as available memory. Based on this, and considering back-end business scenarios, the following programming suggestions can be given for applications:

Before closing a block device file descriptor, always issue the BLKFLSBUF command to make sure that dirty data in the buffer is flushed to the block device in time, avoiding data loss on an unexpected power failure, and at the same time releasing and reclaiming the buffer promptly.

When operating on a larger block device, call BLKFLSBUF when necessary. What counts as a larger block device? Generally one whose size exceeds the physical memory currently available to the Linux system.

5. Use drop_caches to control Cache and Buffer

/proc is a virtual file system; reading and writing it is one way of communicating with the kernel, which means the behavior of the running kernel can be adjusted by modifying files under /proc. For controlling Cache and Buffer, we can use echo 1 > /proc/sys/vm/drop_caches.
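Since the opening lines of the BLKFLSBUF test program above were lost, the following minimal sketch shows what such a program typically looks like; the device path /dev/sdb is a placeholder for whatever block device is being tested, and the program must be run as root.

#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>      /* BLKFLSBUF */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    const char *dev = "/dev/sdb";              /* placeholder block device */
    int fd = open(dev, O_RDWR);

    if (fd < 0) {
        printf("open %s failed, errno:%d\n", dev, errno);
        return -1;
    }

    /* Flush dirty buffers to the device and drop its buffer cache. */
    if (ioctl(fd, BLKFLSBUF, 0)) {
        printf("ioctl cmd BLKFLSBUF failed, errno:%d\n", errno);
    }

    close(fd);
    printf("ioctl cmd BLKFLSBUF ok!\n");
    return 0;
}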

First, let's take a look at the kernel source code implementation:

int drop_caches_sysctl_handler(ctl_table *table, int write,
    void __user *buffer, size_t *length, loff_t *ppos)
{
    int ret;

    ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
    if (ret)
        return ret;
    if (write) {
        /* => echo 1 > /proc/sys/vm/drop_caches cleans up the page cache */
        if (sysctl_drop_caches & 1)
            /* => traverse all superblocks and clean up their caches */
            iterate_supers(drop_pagecache_sb, NULL);
        if (sysctl_drop_caches & 2)
            drop_slab();
    }
    return 0;
}

/**
 * iterate_supers - call function for all active superblocks
 * @f: function to call
 * @arg: argument to pass to it
 *
 * Scans the superblock list and calls given function, passing it
 * locked superblock and given argument.
 */
void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
{
    struct super_block *sb, *p = NULL;

    spin_lock(&sb_lock);
    list_for_each_entry(sb, &super_blocks, s_list) {
        if (hlist_unhashed(&sb->s_instances))
            continue;
        sb->s_count++;
        spin_unlock(&sb_lock);

        down_read(&sb->s_umount);
        if (sb->s_root && (sb->s_flags & MS_BORN))
            f(sb, arg);
        up_read(&sb->s_umount);

        spin_lock(&sb_lock);
        if (p)
            __put_super(p);
        p = sb;
    }
    if (p)
        __put_super(p);
    spin_unlock(&sb_lock);
}

/* => clean the page cache of a file system (including the bdev pseudo file system) */
static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
    struct inode *inode, *toput_inode = NULL;

    spin_lock(&inode_sb_list_lock);
    /* => traverse all inodes of this superblock */
    list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
        spin_lock(&inode->i_lock);
        /*
         * => skip the inode if its state is (I_FREEING | I_WILL_FREE | I_NEW)
         * => or if it has no cached pages
         */
        if ((inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) ||
            (inode->i_mapping->nrpages == 0)) {
            spin_unlock(&inode->i_lock);
            continue;
        }
        __iget(inode);
        spin_unlock(&inode->i_lock);
        spin_unlock(&inode_sb_list_lock);

        /* => drop the cached pages (except dirty, locked, under-writeback
         *    or page-table-mapped pages) */
        invalidate_mapping_pages(inode->i_mapping, 0, -1);
        iput(toput_inode);
        toput_inode = inode;

        spin_lock(&inode_sb_list_lock);
    }
    spin_unlock(&inode_sb_list_lock);
    iput(toput_inode);
}

In summary, echo 1 > /proc/sys/vm/drop_caches clears the cache pages of all inodes, where "all inodes" means the VFS inodes of every file system, including the inodes of the bdev pseudo file system, i.e. the cache pages of block devices. So after the command runs, the whole system's page cache and buffer cache have been cleared, provided of course that the pages were clean and not in use.

Let's take a look at the actual effect:

[root@test usb]# cat /proc/meminfo
MemTotal:    90516 kB
MemFree:     12396 kB
Buffers:        96 kB
Cached:      60756 kB
[root@test usb]# busybox_v400 sync
[root@test usb]# echo 1 > /proc/sys/vm/drop_caches
[root@test usb]# cat /proc/meminfo
MemTotal:    90516 kB
MemFree:     68820 kB
Buffers:        12 kB
Cached:       4464 kB

You can see that both Buffers and Cached have dropped. It is recommended to run the sync command before dropping caches to ensure data integrity: sync writes all unwritten system buffers to disk, including modified inodes, delayed block I/O, and read-write mapped files.
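If the same thing needs to be done from a program rather than from the shell, the following minimal sketch (run as root) is the direct equivalent of sync followed by echo 1 > /proc/sys/vm/drop_caches:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd;

    /* Write dirty pages back first, as recommended above. */
    sync();

    fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd < 0) {
        perror("open /proc/sys/vm/drop_caches");
        return 1;
    }

    /* "1" drops page cache; "2" drops slab objects; "3" drops both. */
    if (write(fd, "1", 1) != 1)
        perror("write");

    close(fd);
    return 0;
}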

The setting above is simple but crude: it largely defeats the purpose of the cache, and when the system is under heavy load, dropping caches this way can itself cause trouble. Because drop_caches clears memory globally, page locks are taken during the cleanup, and other processes waiting on those page locks may time out, leading to failures. It is therefore necessary to tune the approach to the state of the system and find the best compromise.

6. Experience summary

This article has looked at where Cache and Buffer come from, when and where they are consumed, and how each of them can be controlled; finally, the use of the vmtouch tool was introduced.

An in-depth understanding of Linux's Cache and Buffer touches a large number of core kernel mechanisms (VFS, memory management, block device drivers, page cache, file access, page-frame writeback) that need to be understood and digested in follow-up work.

The above is how to understand Linux's Cache and Buffer. The editor believes some of these points may well come up in daily work; I hope you can learn more from this article. For more details, please follow the industry information channel.
