The Linux file readahead algorithm
Disk I/O performance has lagged far behind that of the CPU and memory, making it a major bottleneck in modern computer systems. Readahead can effectively reduce the number of disk seeks and the application's waiting time, and is one of the important optimizations for improving disk read I/O performance. The author is a doctoral student in the Department of Automation at the University of Science and Technology of China. He began using Linux in 1998; in order to optimize server performance, he started improving the Linux kernel and eventually rewrote the kernel's file readahead code. These improvements were merged into Linux kernel 2.6.23 and subsequent versions.
From registers, L1/L2 caches, and main memory through flash, down to disk, CD, tape, and networked storage, the levels of a computer's storage hardware form a pyramid: the lower the level, the larger the capacity, but also the slower the access, the smaller the bandwidth, and the greater the latency. This naturally becomes a pyramid of layer-upon-layer caches, which gives rise to three basic classes of cache management and optimization problems:
◆ prefetching algorithms, which load data from slow storage into the cache;
◆ replacement algorithms, which discard useless data from the cache;
◆ writeback algorithms, which save dirty data from the cache back to slow storage.
Among these, the prefetching algorithm is especially important at the disk level. The disk's access mechanism, an actuator arm positioning heads over rotating platters, determines its most prominent performance characteristic: it is good at sequential reads and writes but poor at random I/O, where latency is very high. This gives rise to two kinds of demand for readahead.
Demand from disk
In a nutshell, a typical disk I/O operation consists of two phases:
1. Data positioning
The average positioning time consists mainly of two parts: the average seek time and the average rotational latency. A typical seek time is 4.6ms. Rotational latency depends on the spindle speed: an ordinary 7200RPM desktop drive has a rotational latency of 4.2ms, while a high-end 10000RPM drive comes in at 3ms. These figures have hovered at the same level for many years and are unlikely to improve much in the future. In the sections that follow, we will use 8ms as a typical positioning time.
2. Data transfer
The sustained transfer rate depends mainly on the disk's rotational speed (linear velocity) and storage density; a recent typical value is 80MB/s. Although spindle speed is hard to raise, storage density improves year by year. A series of new technologies, such as giant magnetoresistance and perpendicular magnetic recording, have not only greatly increased disk capacity but also brought higher sustained transfer rates.
Obviously, the larger the I/O granularity, the larger the share of total time spent transferring data, and hence the higher the disk utilization and throughput. A simple estimate is shown in Table 1. With large numbers of random 4KB reads, the disk spends more than 99% of its time positioning, and single-disk throughput falls below 500KB/s. At an I/O size of 1MB, throughput approaches 50MB/s. A larger I/O granularity can therefore raise disk utilization and throughput by a factor of 100. It is thus worth doing everything possible to avoid small I/Os, and that is exactly what the readahead algorithm is for.
Table 1: Relationship between random read size and disk performance
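As a back-of-the-envelope check (a sketch assuming the 8ms positioning time and 80MB/s transfer rate quoted above, not a benchmark), the estimate can be reproduced like this:

    #include <stdio.h>

    /* Back-of-the-envelope model behind Table 1, assuming 8ms
     * positioning time and 80MB/s sustained transfer rate. */
    int main(void)
    {
        const double position_ms = 8.0;     /* avg seek + rotational latency */
        const double rate_mb_s = 80.0;      /* sustained transfer rate */
        const int sizes_kb[] = { 4, 16, 64, 256, 1024 };

        for (int i = 0; i < 5; i++) {
            double size_mb = sizes_kb[i] / 1024.0;
            double transfer_ms = size_mb / rate_mb_s * 1000.0;
            double total_ms = position_ms + transfer_ms;
            printf("%5dKB: utilization %5.1f%%, throughput %6.2f MB/s\n",
                   sizes_kb[i],
                   transfer_ms / total_ms * 100.0,   /* share of time transferring */
                   size_mb / (total_ms / 1000.0));   /* effective throughput */
        }
        return 0;
    }

At 4KB this prints roughly 0.6% utilization and about 0.5MB/s; at 1MB, about 61% and 49MB/s, in line with the estimates above.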
Demand from the program
A typical loop in which an application processes data looks like: while (!done) { read(); compute(); }. Suppose the loop runs five times, processing five batches of data in total; the program's execution timeline might then look like Figure 1.
Figure 1: A typical I/O timing diagram
It is easy to see that the disk and the CPU take turns being busy: while the disk is doing I/O, the CPU waits; while the CPU is computing on the data, the disk sits idle. Could the two be made to work in a pipeline, to speed up the program? Readahead can help achieve exactly that. The basic approach is that while the CPU begins processing the first batch of data, the kernel's readahead mechanism preloads the next batch. The readahead is then carried out asynchronously in the background, as shown in Figure 2.
Figure 2: Pipelining with readahead
Note that we have not changed the application's behavior: the program still issues its next read request after it finishes processing the current data. The difference is that the requested data may already be in the kernel's cache and can simply be copied out without waiting. Asynchronous readahead thus serves to "hide" the large latency of disk I/O from the upper-layer application. The latency still exists, but the application no longer sees it and runs more smoothly.
The concept of readahead
Prefetching algorithms have broad meaning and application: they exist at the level of the CPU, hard drives, the kernel, applications, and the network. There are two approaches to prefetching: heuristic prefetching and informed prefetching. The former makes readahead decisions automatically and spontaneously, transparently to the upper-layer application, but places high demands on the algorithm and suffers from hit-rate problems; the latter simply provides API interfaces through which upper-layer programs issue explicit prefetch instructions. At the disk level, Linux provides us with three such APIs: posix_fadvise(2), readahead(2), and madvise(2).
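As a minimal sketch of informed prefetching (the file path is hypothetical and error handling is trimmed), an application can hint its access pattern to the kernel like this:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);   /* hypothetical input file */
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Tell the kernel the first 1MB will be needed soon, so it can
         * start filling the page cache before the reads arrive. */
        posix_fadvise(fd, 0, 1024 * 1024, POSIX_FADV_WILLNEED);
        /* ... subsequent read()s over this range should hit the cache ... */
        close(fd);
        return 0;
    }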
In practice, however, applications that actually use these readahead APIs are rare: in general, the kernel's heuristic algorithm works well enough. The readahead algorithm predicts which pages are about to be accessed and reads them into the cache in batches ahead of time.
Its main functions and tasks can be summarized in three keywords:
◆ batching: aggregate small I/Os into large I/Os to improve disk utilization and system throughput.
◆ anticipation: hide the disk's I/O latency from the application to make programs run faster.
◆ prediction: the core task of the readahead algorithm. How well the first two goals are met depends entirely on the accuracy of prediction. Current mainstream operating systems, including Linux, FreeBSD, and Solaris, all follow a simple and effective principle: classify reads as either random or sequential, and apply readahead only to sequential reads. This principle is relatively conservative, but it guarantees a high readahead hit rate and good efficiency/coverage, because sequential reads are the simplest and most common case, while random reads are genuinely unpredictable inside the kernel.
The readahead architecture of Linux
A major feature of the Linux kernel is its support for a great many file systems through a virtual file system (VFS) layer. As early as 2002, during development of the 2.5 kernel, Andrew Morton introduced the basic framework for file readahead into the VFS layer, so that all file systems could be supported uniformly. The Linux kernel keeps recently accessed file pages in memory for a while; this file cache is called the pagecache, as shown in Figure 3. A normal read() operates between the application-supplied buffer and the pagecache, and the readahead algorithm is responsible for filling the pagecache. Applications' read buffers are generally small; the file copy command cp, for example, reads and writes at a granularity of 4KB. The kernel's readahead algorithm, by contrast, issues read I/O at whatever size it considers more suitable, such as 16-128KB.
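For instance, a cp-style loop like the sketch below (the input file name is illustrative) reads only 4KB at a time, yet the kernel's heuristic readahead keeps the pagecache filled ahead of it:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];           /* 4KB, like cp's read granularity */
        ssize_t n;
        int fd = open("input.dat", O_RDONLY);  /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }
        /* Each read() asks for only 4KB, but after a few sequential reads
         * the kernel issues its own larger readahead I/O behind the scenes,
         * so most iterations are served straight from the pagecache. */
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            /* ... process or copy the chunk ... */
        }
        close(fd);
        return 0;
    }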
Figure 3: Reading and readahead centered on the pagecache
About a year later, Linus Torvalds split off a separate prefetching algorithm for mmap page-fault I/O, producing two independent algorithms: read-around and read-ahead (Figure 4). The read-around algorithm is suited to program code and data accessed through mmap, which exhibit strong locality of reference: when a page fault occurs, it prefetches a total of 128KB of pages centered on the faulting page. The read-ahead algorithm mainly targets read() system calls, which generally show good sequential behavior; but there are also plenty of random and atypical read patterns, so the read-ahead algorithm must be intelligent and adaptive.
Figure 4: Read-around, read-ahead, and direct read in Linux
Another year later, through extensive work by Steven Pratt, Ram Pai, and others, the read-ahead algorithm was further improved. The most important change was proper handling of random reads, which play a prominent role in database applications. Before this, the readahead algorithm took discrete page read positions as its input, so a single random read spanning multiple pages could trigger "sequential readahead", inflating the number of readahead I/Os and lowering the hit rate. The improved algorithm monitors complete read() calls, obtaining both the page offset and the size of each read request, and can therefore distinguish sequential reads from random ones much more reliably.
Overview of the readahead algorithm
This section takes Linux 2.6.22 as an example and analyzes several key points of its readahead algorithm.
1. Sequential detection
To keep the readahead hit rate high, Linux applies readahead only to sequential reads. The kernel judges whether a read() is sequential by checking the following two conditions:
◆ it is the first read since the file was opened, and it reads the beginning of the file; or
◆ the current read request is contiguous in the file with the position of the previous (recorded) read request.
If neither condition is met, the read is judged to be random. Any random read ends the current sequential run and thereby terminates readahead (rather than merely shrinking the readahead size). Note that spatial contiguity here refers to offsets within the file, not to the continuity of physical disk sectors. Linux makes a simplification here, which works on the basic premise that files are stored more or less contiguously on disk, without serious fragmentation.
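In rough pseudo-C, the rule looks like the following sketch (illustrative names, not the kernel's actual code):

    #include <stdio.h>

    struct read_state {
        long prev_index;   /* page index of the previous read      */
        long prev_count;   /* pages in the previous read; 0 = none */
    };

    /* Return 1 if a read at page 'index' continues a sequential stream. */
    static int is_sequential(const struct read_state *s, long index)
    {
        if (s->prev_count == 0 && index == 0)
            return 1;      /* first read after open(), at file start */
        return index == s->prev_index + s->prev_count;   /* contiguous */
    }

    int main(void)
    {
        struct read_state s = { 0, 0 };
        printf("%d\n", is_sequential(&s, 0));   /* 1: first read at start */
        s.prev_index = 0; s.prev_count = 4;
        printf("%d\n", is_sequential(&s, 4));   /* 1: contiguous          */
        printf("%d\n", is_sequential(&s, 100)); /* 0: random jump         */
        return 0;
    }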
2. Pipelined readahead
While a program is processing one batch of data, we want the kernel to prepare the next batch in the background, so that the CPU and the disk can work in a pipeline. Linux uses two readahead windows to track the readahead state of a sequential stream: the current window and the ahead window. The ahead window exists for the sake of pipelining: while the application is working inside the current window, the kernel may be doing asynchronous readahead in the ahead window; as soon as the program enters the current ahead window, the kernel immediately advances both windows and starts readahead I/O in the new ahead window.
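A rough sketch of this window-advancing logic (illustrative user-space code, not the kernel's; units are 4KB pages, and the 32-page cap corresponds to the 128KB default discussed below):

    #include <stdio.h>

    enum { MAX_RA_PAGES = 32 };   /* 128KB / 4KB */

    struct window { long start, size; };
    static struct window current_win = { 0, 4 }, ahead_win = { 4, 8 };

    static void start_async_readahead(long start, long size)
    {
        printf("async readahead: pages %ld-%ld\n", start, start + size - 1);
    }

    static void on_sequential_read(long index)
    {
        /* Once the reader crosses into the ahead window, advance both
         * windows and begin asynchronous readahead for the new one. */
        if (index >= ahead_win.start) {
            current_win = ahead_win;
            ahead_win.start = current_win.start + current_win.size;
            ahead_win.size = current_win.size * 2;   /* expand the window */
            if (ahead_win.size > MAX_RA_PAGES)
                ahead_win.size = MAX_RA_PAGES;
            start_async_readahead(ahead_win.start, ahead_win.size);
        }
    }

    int main(void)
    {
        for (long i = 0; i < 64; i++)   /* a sequential page-by-page reader */
            on_sequential_read(i);
        return 0;
    }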
3. Readahead size
Once a sequential readahead has been decided on, an appropriate readahead size must be chosen. If the readahead granularity is too small, the hoped-for performance gain will not materialize; if it is too large, too many pages that the program does not need may be loaded, wasting resources. To this end, Linux uses a rapid window-expansion scheme:
◆ first readahead: readahead_size = read_size * 2; // or * 4
The initial readahead window is two to four times the read size, which means that using a larger read granularity in your programs (such as 32KB) can slightly improve I/O efficiency.
◆ subsequent readahead: readahead_size *= 2
Subsequent readahead windows double in size until the system's maximum readahead size is reached; the default is 128KB. This default has been in place for at least five years and is far too conservative for today's faster hard drives and abundant memory. For example, the 10000RPM WD Raptor SATA drive that Western Digital launched in recent years achieves only 16% disk utilization on random 128KB reads (Figure 5). So if you run a Linux server or desktop system, try raising the maximum readahead to 1MB with the following command; it may be a pleasant surprise:
# blockdev --setra 2048 /dev/sda
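(blockdev counts in 512-byte sectors, so 2048 corresponds to 1MB; the current value can be checked with blockdev --getra /dev/sda.)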
Of course, a larger readahead size is not always better: in many situations, I/O latency must be weighed at the same time.
Figure 5: Proportions of positioning time and transfer time for 128KB I/O
Rediscovering sequential reads
The previous section settled the basic questions of whether and when to readahead, and how much. Because reality is complex, the algorithms above do not always work, even for sequential reads. One example is the recently discovered problem of retried reads.
Retried reads are common with asynchronous and non-blocking I/O, which allow the kernel to interrupt a read request. As a result, a follow-up read request submitted by the program appears to overlap the previously interrupted one, as shown in Figure 6.
Figure 6: Retried reads
Linux 2.6.22 cannot make sense of this situation and misjudges it as random reads. The problem is that a "read request" does not mean a read actually takes place; readahead decisions should be based on the latter rather than the former. This is improved in the newly released 2.6.23. The new algorithm takes the state of the page currently being read as its main decision basis and adds a new page flag, PG_readahead, which serves as a hint meaning "please start asynchronous readahead". On each new readahead, the algorithm picks one of the new pages and marks it. The readahead rules change accordingly:
◆ on reading a missing page, perform synchronous readahead;
◆ on reading a PG_readahead-marked page, perform asynchronous readahead.
This makes the ahead window unnecessary: it had in fact created a needless coupling between the readahead size and the asynchronous lead distance. The new marking mechanism lets us control the readahead lead flexibly and precisely, which will also help in introducing support for laptop power-saving modes in the future.
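The two rules can be sketched as follows (a user-space illustration with stand-in state and stubs, not the kernel's implementation):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in pagecache state for one file; all of this is illustrative. */
    enum { NPAGES = 16 };
    static bool cached[NPAGES];         /* page present in the cache?   */
    static bool pg_readahead[NPAGES];   /* PG_readahead marker set?     */

    static void sync_readahead(long i)  { printf("sync readahead at %ld\n", i); }
    static void async_readahead(long i) { printf("async readahead at %ld\n", i); }

    static void on_read_page(long i)
    {
        if (!cached[i]) {
            /* Missing page: the application must block anyway, so do
             * a synchronous readahead starting here. */
            sync_readahead(i);
        } else if (pg_readahead[i]) {
            /* Hit the marked page: start the next readahead early and
             * asynchronously, before the application runs out of data. */
            pg_readahead[i] = false;
            async_readahead(i);
        }
    }

    int main(void)
    {
        cached[1] = true;
        pg_readahead[1] = true;   /* marked during the last readahead */
        on_read_page(0);          /* miss   -> sync readahead  */
        on_read_page(1);          /* marked -> async readahead */
        return 0;
    }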
Figure 7: Operation of the Linux 2.6.23 readahead algorithm
Another increasingly prominent problem comes from interleaved reads. This read pattern is common in multimedia and multithreaded applications: when several streams are read from one open file at the same time, their read requests interleave and look to the kernel like a mass of random reads. Worse still, the current kernel can track the readahead state of only one stream per open file descriptor, so even if the kernel performs readahead for both streams, they overwrite and destroy each other's readahead state. We will improve on this in the upcoming 2.6.24, using state information kept in the page and the pagecache to support interleaved reads from multiple streams.