This article explains how SWAP works on a Linux system. It goes into some depth, so readers who are interested in the topic should find it a useful reference.
1. What is SWAP and what does it do?

By swap we generally mean a swap partition or swap file. On Linux, the swapon -s command shows which swap spaces are currently in use, along with some related information:

[zorro@zorrozou-pc0 linux-4.4]$ swapon -s
Filename    Type        Size        Used    Priority
/dev/dm-4   partition   33554428    0       -1

Functionally, a swap partition is there so that, when memory runs short, some of the data in memory can be moved out to swap space instead of the system running out of memory and hitting an OOM kill or something worse.
Therefore, when memory usage comes under pressure and memory reclaim is triggered, swap space may be used.
The kernel's use of swap is closely tied to its memory reclaim behavior. So, regarding the relationship between memory reclaim and swap, we need to think about the following questions:
Why does the kernel need to reclaim memory?
What kinds of memory may be reclaimed?
At what point during reclaim does swapping happen?
How exactly is the swapping done?
Let's start from these questions and work through them one by one.
Why does the kernel need to reclaim memory? There are two main reasons:
The kernel needs to be able to satisfy sudden memory requests at any time, so in general it must make sure enough memory stays free.
On top of that, the Linux kernel's caching strategy is not free of cost: to speed up file reads and writes, it uses free memory as page cache for file data.
The kernel therefore needs a mechanism that reclaims memory periodically, so that cache and similar usage does not leave the system with very little free memory for long stretches of time.
The other reason: when an allocation request is larger than the memory currently free, a forced memory reclaim is triggered.
To serve these two kinds of reclaim, the kernel implements two different mechanisms. One uses the kswapd process to check memory periodically, to make sure that as far as possible enough memory stays free.
The other is direct memory reclaim (direct page reclaim): when an allocation finds no free memory to satisfy the request, direct reclaim is triggered on the spot.
The trigger paths for these two kinds of memory reclamation are different:
One is the kernel thread kswapd, which calls the memory reclaim logic directly to free memory.
See the main logic of kswapd() in mm/vmscan.c.
The other is the slow path of memory allocation, which reclaims memory while an allocation request is being served.
See the __alloc_pages_slowpath() method in mm/page_alloc.c in the kernel code.
The actual reclaim process is the same in both cases; in the end the shrink_zone() method is called to shrink the pages of each zone.
Inside it, shrink_lruvec() is called to go through the linked lists on which pages are organized. Following this thread, we can see clearly which pages the reclaim operation is aimed at.
These linked lists correspond to an enum in the kernel source (enum lru_list), which the scan code in mm/vmscan.c iterates over.
From this enum you can see that there are four linked lists to be scanned during memory reclaim:
Inactive of anon
Active of anon
Inactive of file
Active of file
That is, memory reclamation operations are mainly aimed at file pages (file cache) and anonymous pages in memory.
Whether a page counts as active or inactive is decided and marked by the kernel using an LRU algorithm, which we will not go into here.
The whole scanning process runs as several nested loops:
First, the cgroups on each zone are scanned.
Then the page lists are scanned, taking each cgroup's memory as the unit.
The kernel scans the anon active list and moves infrequently used pages to the inactive list; it then scans the inactive list and moves pages that turn out to be active back to the active list.
When swapping is done, inactive pages are swapped out first.
For a file-mapped page, the kernel checks whether it holds dirty data: dirty pages are written back, while clean pages can simply be freed.
In this way, memory reclaim frees two kinds of memory:
One is anonymous (anon) page memory, which is reclaimed mainly by swapping.
The other is file-backed (file-mapped) pages, which are released mainly by writeback and dropping.
File-backed memory does not need to be swapped, because its data already lives on disk: to reclaim it, the kernel only has to write back any dirty data and then drop the pages; if the data is needed again later, it is read back from the corresponding file.
Anonymous pages and file cache are organized in memory by the four linked lists above, and the reclaim process mainly scans and operates on these four lists.
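As a quick sanity check, the current sizes of these four lists can be read from the Active/Inactive (anon/file) counters in /proc/meminfo; the grep below is just one convenient way to pull them out:

grep -E '^(Active|Inactive)\((anon|file)\)' /proc/meminfo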
2. What exactly does swappiness regulate?

Many people know that /proc/sys/vm/swappiness is a file whose value adjusts swap-related behavior. Its default value is 60 and the valid range is 0-100.
That alone is a strong hint: it looks like a percentage!
So what does this file actually control? Let's start with the kernel documentation:
==============================
Swappiness
This control is used to define how aggressive the kernel will swap memory pages. Higher values will increase agressiveness, lower values decrease the amount of swap.
A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.
The default value is 60.
==============================
The value of this file is used to define how actively the kernel uses swap:
The higher the value, the more actively the kernel will use swap
The lower the value, the less actively it will use swap.
If the value is 0, no swapping happens until the total of free pages and file-backed pages in a zone drops below the zone's high watermark.
Here we can also pin down what file-backed means: it is simply the file-mapped pages (file cache) discussed above.
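For reference, the value can be inspected and changed on the fly like this (the value 10 is only an example; pick whatever suits your workload, and appending to /etc/sysctl.conf is just one common way to make it persistent):

cat /proc/sys/vm/swappiness                      # default is 60
sysctl -w vm.swappiness=10                       # takes effect immediately
echo 'vm.swappiness = 10' >> /etc/sysctl.conf    # survives a reboot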
So what role does swappiness actually play? Let's look at it from a design point of view: suppose we were designing a reclaim mechanism that can swap part of memory out to swap space and can write back and drop part of the file-backed memory; how would we design it?
The main question is this: given two ways of reclaiming memory (swapping anonymous pages and dropping file cache), when should we do more file writeback and when should we do more swapping? In plain terms, we need to balance the two methods so that the result is as good as possible.
And if a lot of memory is eligible for swapping, must we swap it all out? For example, if 100 MB of memory could be swapped but only 50 MB needs to be freed right now, swapping out just 50 MB is enough; there is no need to swap out everything that could be swapped.
Reading the code shows that the Linux kernel implements this logic in the get_scan_count() method, which is called from shrink_lruvec().
get_scan_count() handles exactly the questions above, and swappiness is one of the parameters it takes: it tells the kernel whether, when freeing memory, it should lean toward dropping file-backed memory or toward swapping out anonymous pages.
Of course, this is only a preference: when either type is plentiful, which would you rather use? When there is no choice, whatever has to be swapped will still be swapped.
Take a brief look at the part of get_scan_count() that handles swappiness. The comment there is very clear:
if swappiness is set to 100, anonymous pages and file pages are reclaimed with the same priority.
Clearly, preferring to drop file pages helps reduce the I/O pressure that memory reclaim might otherwise cause.
If file-backed data is not dirty, it does not need to be written back, so dropping it causes no I/O at all; swapping, on the other hand, always causes I/O.
That is why the system sets swappiness to 60 by default: during reclaim, more of the file-backed cache is dropped, and the kernel leans toward dropping cache rather than swapping.
Does a swappiness of 60 mean that, during reclaim, the kernel swaps and drops file-backed memory in a ratio of 60 to 140? No, it does not.
When working out this ratio, the kernel also takes other information about current memory usage into account. If you are interested in the details, read the implementation of get_scan_count(); this article will not go through it.
The concept to be clear about is that swappiness controls whether memory reclaim frees more anonymous pages or more file cache.
If swappiness is set to 0, does the kernel not do swap at all? The answer is also no.
First, when memory is genuinely short, whatever has to be swapped will still be swapped.
Second, there is one piece of logic in the kernel that leads straight to swapping, and it is handled in the kernel code as follows:
during global reclaim, if for a zone the value of zonefile + zonefree is less than or equal to the zone's high watermark (high_wmark_pages), scan_balance is set to SCAN_ANON.
When scan_balance is processed later, a value of SCAN_ANON means the anonymous pages must be swapped out.
To understand this behavior, we first need to figure out what a high water mark (high_wmark_pages) is.
3. When does kswapd perform swap operations?

Let's go back to the two reclaim mechanisms: kswapd's periodic check and direct memory reclaim.
Direct reclaim is easy to understand: when an allocation request is larger than the remaining free memory, direct reclaim is triggered.
So under what conditions does the kswapd process trigger reclaim during its periodic check?
By design, kswapd checks memory periodically and starts reclaiming once a certain threshold is reached.
That threshold can be understood as the current memory pressure: even though some memory is still free, once the amount left becomes relatively small, i.e. pressure is high, the kernel should start trying to reclaim some memory. Only then can the system keep as much memory as possible on hand for sudden requests.
4. What is the memory watermark (watermark)?

So how is memory pressure described? The Linux kernel uses the concept of a watermark to describe it.
Linux defines three watermarks for memory usage: high, low, and min. Their meanings are:
Remaining memory above high: plenty of memory is left, and memory pressure is low.
Between high and low: the remaining memory is under some pressure.
Between low and min: memory pressure is heavy and little memory is left.
min is the lowest watermark; when remaining memory falls to this level, memory is under severe pressure.
The memory below min is reserved by the kernel for special situations and is normally never handed out.
Memory reclaim behavior is driven by these watermarks on the remaining memory:
When the system's remaining memory drops below watermark[low], the kernel's kswapd starts reclaiming, and it keeps going until the remaining memory comes back up to watermark[high]. If memory consumption drives the remaining memory down to or below watermark[min], direct reclaim is triggered. With the watermark concept in place, we can go back to the condition zonefile + zonefree <= high_wmark_pages(zone) mentioned above.
Here zonefile is the total amount of file-mapped memory in the zone, and zonefree is the total amount of free memory in the zone.
In general, the kernel believes that as long as there is still file cache available, it should try to get memory back by dropping file cache rather than by swapping anonymous memory alone.
The idea behind the check is: during global reclaim (marked by global_reclaim(sc)), if file-mapped memory plus free memory in a zone is evaluated as less than or equal to that zone's watermark[high], swapping may be used directly.
This is done to avoid falling into the cache trap described in the code comments.
The practical effect on the system is that, even with swappiness set to 0, swapping may still happen while some memory is free.
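As a rough way to eyeball this condition on a running system, the per-zone counters in /proc/zoneinfo can be compared by hand. The awk sketch below is only an approximation of the kernel's check, written against the field layout of the 4.4-era /proc/zoneinfo used in this article (values are in pages; newer kernels move some counters to the node level, so treat it as illustrative):

awk '
/^Node/                  { zone = $2 " " $4 }   # e.g. "0, Normal"
$1 == "high"             { high[zone] = $2 }    # high watermark, in pages
$1 == "nr_free_pages"    { free[zone] = $2 }
$1 == "nr_inactive_file" { file[zone] += $2 }
$1 == "nr_active_file"   { file[zone] += $2 }
END {
    for (z in high)
        printf "zone %-12s file+free=%d high=%d%s\n", z,
               file[z] + free[z], high[z],
               (file[z] + free[z] <= high[z]) ? "  -> anon scan would be forced" : ""
}' /proc/zoneinfo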
So how are the watermark values calculated? All of the memory watermarks are derived from the current total amount of memory and one tunable parameter: /proc/sys/vm/min_free_kbytes.
This parameter itself determines watermark[min] for every zone in the system.
The kernel then derives each zone's low and high watermarks from its share of min_free_kbytes, in proportion to the zone's size.
For the exact logic, see this file in the source tree:
mm/page_alloc.c
On a running system, you can view each zone's watermarks and current usage in the /proc/zoneinfo file.
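For example (the output naturally varies from machine to machine; the grep pattern simply pulls out the per-zone watermark lines):

cat /proc/sys/vm/min_free_kbytes
grep -E '^Node|^ +(min|low|high) ' /proc/zoneinfo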
You will notice that the memory-management logic above all works in units of zones, where a zone is a region into which memory is partitioned for management.
Linux divides memory into several zones, mainly:
the direct memory access zone (DMA)
the normal zone (Normal)
the high memory zone (HighMem)
Because of hardware constraints, the kernel addresses and accesses these zones with different efficiency. On NUMA systems, different CPUs also manage different zones.
Related parameter: zone_reclaim_mode. zone_reclaim_mode was added to the kernel late in the 2.6 series. When a zone runs out of memory, it controls whether memory is reclaimed from within that zone or taken from other zones. The parameter is adjusted through the /proc/sys/vm/zone_reclaim_mode file.
When memory is being allocated (in the kernel's get_page_from_freelist() method) and the current zone does not have enough free memory, the kernel uses the zone_reclaim_mode setting to decide whether to look for free memory in the next zone or to reclaim within the current zone. A value of 0 means free memory may be taken from the next zone; a non-zero value means memory is reclaimed locally.
The values this file accepts, and their meanings, are as follows (the values are bit flags and can be combined):
echo 0 > /proc/sys/vm/zone_reclaim_mode: zone_reclaim mode is off; memory may be reclaimed from other zones or NUMA nodes.
echo 1 > /proc/sys/vm/zone_reclaim_mode: zone_reclaim mode is on; memory reclaim happens only within the local node.
echo 2 > /proc/sys/vm/zone_reclaim_mode: when reclaiming locally, dirty data in the cache may be written back to disk to free memory.
echo 4 > /proc/sys/vm/zone_reclaim_mode: when reclaiming locally, memory may be reclaimed by swapping.
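Since these are bit flags, combined settings are also valid; for example, to turn on local reclaim together with dirty-page writeback (1 + 2):

echo 3 > /proc/sys/vm/zone_reclaim_mode
cat /proc/sys/vm/zone_reclaim_mode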
Different parameter configurations will have different effects on the memory usage of other memory nodes in the NUMA environment. You can set them according to your own situation to optimize your application.
By default, zone_reclaim mode is off. This can improve efficiency in many application scenarios, such as file servers, or applications that rely on more cache in memory.
Such workloads depend more on the speed of the in-memory cache than on keeping the process's own memory local, so it is better to take memory from another zone than to drop the local cache.
If you determine that your workload needs memory more than it needs cache, and you want to avoid the performance penalty of memory access across NUMA nodes, you can turn zone_reclaim mode on.
At this point, the page allocator will first reclaim recyclable memory that is easy to recycle (mainly page cache pages that are not currently in use), and then reclaim other memory.
Turning on writeback in local reclaim mode keeps processes that write large amounts of data from dirtying pages on other memory nodes; but when a zone fills up, writing back that dirty data also slows the process down and effectively throttles it.
It degrades the performance of processes tied to a single memory node, because they can no longer use memory on other nodes; on the other hand, it increases the isolation between nodes, so processes on other nodes do not suffer a performance hit because of memory reclaim happening on this one.
Allowing swap here effectively restricts allocations to the memory managed by the local node, unless that is overridden by memory policy or cpuset configuration.
min_unmapped_ratio: this parameter exists only on NUMA kernels. Its value is a percentage of the total pages in each memory zone.
When zone_reclaim_mode is in effect, zone reclaim only happens when the amount of reclaimable memory in the zone exceeds this percentage.
When zone_reclaim_mode includes the value 4, the comparison counts all file-backed unmapped pages, including swap-cache pages and tmpfs files.
With the other settings, only unmapped pages backed by ordinary files are counted; tmpfs and similar pages are not.
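On a NUMA kernel the current value can be checked and tuned like any other vm sysctl; the value 5 below is only an illustration, and the stated default of 1 percent is taken from the kernel documentation:

cat /proc/sys/vm/min_unmapped_ratio     # default is 1 (percent)
echo 5 > /proc/sys/vm/min_unmapped_ratio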
page-cluster: page-cluster controls how many pages are read in one go when data is brought back in from swap space; it is essentially read-ahead for swap. "Consecutive" here means consecutive in the swap space, not consecutive memory addresses.
Because swap space usually lives on a hard disk, sequential reads reduce head seeks and improve read efficiency.
The value in this file is a power-of-two exponent: a setting of 0 means 2^0 = 1 page of read-ahead, and a setting of 3 means 2^3 = 8 pages.
A setting of 0 therefore also means read-ahead is effectively turned off. The default value is 3. Tune the number of read-ahead pages according to your system's load.
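For example (0 and 3 are simply the two interesting end points of the usual range):

cat /proc/sys/vm/page-cluster       # 3 -> 2^3 = 8 pages of swap read-ahead
echo 0 > /proc/sys/vm/page-cluster  # 2^0 = 1 page, i.e. read-ahead effectively off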
Swap-related commands: mkswap turns a partition or file into swap space, swapon shows the swap spaces currently in use and enables a swap partition or file, and swapoff disables a swap space.
Let's use a file-backed example to demonstrate the whole cycle: make a swap file, enable it, and finally close the swap space again, as sketched below.
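A minimal sketch of those three steps (the path /tmp/swapfile and the 2 GiB size are made up for illustration; run as root and pick values that fit your system):

dd if=/dev/zero of=/tmp/swapfile bs=1M count=2048   # create a 2 GiB file
chmod 600 /tmp/swapfile                             # a swap file should not be world-readable
mkswap /tmp/swapfile                                # format it as swap space

swapon /tmp/swapfile                                # enable the swap file
swapon -s                                           # confirm that it shows up

swapoff /tmp/swapfile                               # close the swap space again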
5. What is swap space priority (priority) for?

When several swap partitions or files are in use, there is also a notion of priority (Priority).
With swapon, the -p parameter specifies the priority of a swap space. The higher the value, the higher the priority, and the range that can be specified is -1 to 32767.
The kernel always uses higher-priority swap space first and only then moves on to lower-priority space.
If several swap spaces are given the same priority, they are used in parallel, round-robin.
If two swap spaces sit on two different disks, giving them the same priority produces a RAID0-like effect and improves swap read/write throughput.
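For example (the device names are purely illustrative; the pri= mount option in /etc/fstab is the persistent counterpart of swapon -p):

swapon -p 10 /dev/sdb2
swapon -p 10 /dev/sdc2
swapon -s                  # both spaces show priority 10 and are used round-robin

# or persistently in /etc/fstab:
# /dev/sdb2  none  swap  defaults,pri=10  0  0
# /dev/sdc2  none  swap  defaults,pri=10  0  0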
In addition, a program can call mlock() to mark a region of memory so that it will not be swapped out; see man 2 mlock for details.
Finally, recommendations on how to use swap differ with the system's workload. Sometimes we want a larger swap, so that when memory runs short the oom-killer is not triggered and key processes, such as database services, are not killed.
At other times we would rather not swap at all: when a burst of processes blows up memory usage, swapping can drive I/O into the ground, and the whole machine stalls, cannot be logged into, and cannot be dealt with.
In those cases we prefer no swapping: even if the oom-killer fires, the impact is limited, whereas we cannot afford servers toppling like dominoes because of I/O congestion and becoming unreachable. A stateless, CPU-bound Apache with a process-pool architecture is a typical example of such a program.
So:
How exactly do you use swap?
Use it or not?
Make it large or small?
How should the relevant parameters be configured?
It depends on our own production environment.
After reading this article, I hope you can understand some in-depth knowledge of swap.
Q: Is it possible for a system with plenty of free memory to use swap?
A: Yes. If at some point during operation the condition zonefile + zonefree <= high_wmark_pages(zone) holds for a zone, swap may be used even though plenty of memory is still free overall.
Q: Does setting swappiness to 0 mean turning swap off?
A: No. To turn swap off you use the swapoff command. swappiness is only a parameter that balances cache dropping against swapping when memory reclaim happens; setting it to 0 means reclaim frees memory by dropping cache as much as possible.
Q: Does setting swappiness to 100 mean the system will keep as little memory free as possible and use more swap?
A: No. A value of 100 means that during reclaim, reclaiming from cache and swapping have the same priority. That is, if 100 MB is needed right now, there is a higher chance that 50 MB will be dropped from cache and 50 MB of anonymous pages will be swapped out, with the reclaimed memory going to the application. But it also depends on whether the cache has that much to give and whether 50 MB can actually be swapped; the kernel merely tries to balance the two.
Q: When does the kswapd process start reclaiming memory?
A: kswapd decides based on the memory watermarks: when the remaining memory falls to the low watermark it starts reclaiming, and it keeps going until the remaining memory reaches the high watermark.
Q: How do I view the current system's memory watermarks?
A: cat /proc/zoneinfo.
That is all on understanding SWAP on Linux systems. I hope the content above has been helpful and that you have learned something from it. If you found the article worthwhile, feel free to share it so more people can see it.