
How is linux swap triggered?


Today I would like to share some knowledge about how Linux swap is triggered. The content is detailed and the logic is clear. Most people probably don't know much about this topic, so I am sharing this article for your reference. I hope you get something out of it. Let's take a look.

Linux swap refers to the Linux swap partition, an area on disk that can be a partition, a file, or a combination of the two. Swap is similar to virtual memory on Windows: when physical memory runs short, part of the hard disk is used as if it were memory, relieving the shortage of memory capacity.

The environment for this tutorial: Linux 5.9.8 on a Dell G3 computer.

Linux swap

Linux's swap partition (swap), also called swap space, is an area of disk that can be a partition, a file, or a combination of them.

The function of swap is similar to "virtual memory" on Windows: when physical memory is insufficient, part of the hard disk is used as the swap area (virtual memory), relieving the shortage of memory capacity.

Swap means exchanging. As the name implies, when a process requests memory from the OS and there is not enough, the OS swaps out data in memory that is temporarily unused and puts it in the swap area; this process is called swap-out. When the process needs that data again and the OS finds there is free physical memory available, it swaps the data from the swap area back into physical memory; this process is called swap-in.
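
To see this mechanism in action, you can watch swap traffic with standard tools. A minimal sketch (swapon and vmstat ship with util-linux and procps on most distributions; the exact column layout may vary slightly):

    # List configured swap areas (partition or file) and their current usage
    swapon --show
    # or: cat /proc/swaps

    # Sample once per second, five times; the "si" column is swap-in and
    # "so" is swap-out, both in KB/s -- sustained nonzero values mean active swapping
    vmstat 1 5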

Of course, swap is limited in size. Once swap itself is used up, the operating system triggers the OOM-Killer mechanism, which kills the process consuming the most memory in order to free memory.
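
When the OOM-Killer does fire, it leaves traces in the kernel log. A quick way to check, assuming root access (the PID below is purely a placeholder):

    # Look for OOM-Killer activity in the kernel ring buffer
    dmesg | grep -i "out of memory"

    # On systemd machines the same records live in the journal
    journalctl -k | grep -i "killed process"

    # The kernel's current OOM score for a process (higher = killed first);
    # 1234 is a hypothetical PID
    cat /proc/1234/oom_score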

Why do database systems dislike swap?

Obviously, the original intention of the swap mechanism is to soften the blow of simply OOM-killing a process when physical memory runs out. But frankly, almost no database likes swap, whether it is MySQL, Oracle, MongoDB, or HBase. Why? This mainly comes down to the following two points:

1. Database systems are generally sensitive to response latency. If swap is used in place of memory, database performance is bound to be unacceptable. For a system that is extremely latency-sensitive, excessive delay is no different from the service being unavailable. What is worse than an unavailable service is a process that refuses to die while swapping, which means the system stays unavailable indefinitely. By that logic, going straight to OOM without swap may actually be the better choice: a highly available system will simply fail over from master to slave, and users are barely aware of it.

2. For a distributed system such as HBase, the worry is not that a node dies but that a node stalls. If a node dies, at most a small number of requests fail temporarily and succeed on retry. But a stalled node stalls every distributed request that touches it: server-side thread resources are held, requests back up across the cluster, and the whole cluster can be dragged down.

From these two perspectives, it makes sense that databases don't like swap!

The working mechanism of swap

Since databases don't like swap, should we just run the swapoff command and disable swap entirely? No. Think about what disabling it completely would mean. No real production system is that radical: the world is never all-or-nothing, and everyone ends up walking somewhere in the middle, leaning toward one end or the other. On the question of swap, databases clearly choose to use it as little as possible. The recommendations in the HBase official documentation implement exactly this policy: minimize the impact of swap. Know yourself and know your enemy. To reduce swap's impact, you must first understand how Linux memory reclaim works, so that no detail is overlooked.

Let's take a look at how swap is triggered.

To put it simply, Linux triggers memory reclaim in two scenarios: one, reclaim is triggered immediately when there is not enough free memory during an allocation; two, a daemon process (kswapd) periodically checks system memory and actively triggers reclaim when available memory falls below a certain threshold. There is not much to say about the first scenario; let's focus on the second.
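
You can roughly tell the two scenarios apart on a live system. A small sketch, assuming the sysstat package is installed:

    # sar -B prints paging statistics every second:
    #   pgscank/s - pages scanned by the kswapd daemon (background reclaim)
    #   pgscand/s - pages scanned directly by allocating processes (direct reclaim)
    # Sustained nonzero pgscand/s means applications are paying for reclaim themselves
    sar -B 1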

This leads to the first parameter of interest: vm.min_free_kbytes. It sets the minimum watermark, watermark[min], of free memory the system keeps in reserve, and it determines watermark[low] and watermark[high] as well. You can roughly think of it as:

    watermark[min]  = min_free_kbytes
    watermark[low]  = watermark[min] * 5 / 4 = min_free_kbytes * 5 / 4
    watermark[high] = watermark[min] * 3 / 2 = min_free_kbytes * 3 / 2
    watermark[high] - watermark[low] = watermark[low] - watermark[min] = min_free_kbytes / 4

As you can see, all of Linux's watermarks are tied to the min_free_kbytes parameter. Its importance to the system is self-evident: it must be neither too large nor too small.

If min_free_kbytes is too small, the buffer between watermark[min] and watermark[low] is tiny. While kswapd is reclaiming, if the upper layers allocate memory quickly (a database is the typical case), free memory easily drops below watermark[min]. The kernel then performs direct reclaim: it reclaims pages in the context of the allocating process itself and only then satisfies the allocation with the freed pages. This blocks the application and adds response latency.

On the other hand, min_free_kbytes should not be too large either: that both shrinks the memory available to applications, wasting system memory, and makes the kswapd process spend a great deal of time on reclaim. Doesn't this process resemble the old-generation collection trigger of the CMS collector in Java garbage collection? Think of the parameter -XX:CMSInitiatingOccupancyFraction. The HBase official documentation recommends that min_free_kbytes be no less than 1GB (8GB on large-memory systems), i.e. do not let direct reclaim trigger easily.
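
To inspect and tune this parameter, something like the following should work on most systems (run as root; 1048576 KB = 1GB, the lower bound the HBase docs suggest; the exact /proc/zoneinfo layout varies by kernel version):

    # Current reserve, in KB
    cat /proc/sys/vm/min_free_kbytes

    # Per-zone watermarks (in pages) derived from it
    grep -B1 -A4 "pages free" /proc/zoneinfo

    # Raise the reserve to 1GB; add "vm.min_free_kbytes = 1048576" to
    # /etc/sysctl.conf to make the change survive reboots
    sysctl -w vm.min_free_kbytes=1048576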

At this point we have covered Linux's memory reclaim trigger mechanism and the first parameter of interest, vm.min_free_kbytes. Next, let's briefly look at what Linux actually reclaims. There are two main kinds of reclaimable memory:

1. File cache. This is easy to understand: to avoid reading file data from disk every time, the system caches hot data in memory to improve performance. If a file page was only read, reclaim simply releases that memory, and the next access reads the data from disk again (similar to HBase's file cache). If the cached file data was also modified (dirty data), reclaim must first write that data back to disk before releasing the memory (similar to MySQL's file cache).

2. Anonymous memory. This memory has no backing store, unlike the file cache with its on-disk file; typical examples are heap and stack data. It cannot simply be released or written back to a file when reclaimed, which is exactly what the swap mechanism is for: such memory is swapped out to disk and loaded back when needed. (A quick way to inspect both categories is sketched after this list.)
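
As promised, a quick look at both categories (a sketch for experiments only; dropping caches on a production database defeats its caching and is shown here purely for illustration):

    # Approximate sizes of the two reclaimable categories
    grep -E "^(Cached|SwapCached|Dirty|AnonPages)" /proc/meminfo

    # Flush dirty pages, then drop the clean page cache (root required);
    # anonymous memory is untouched -- only swap can reclaim that
    sync
    echo 1 > /proc/sys/vm/drop_caches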

We won't go into which algorithm Linux uses to decide which file cache pages or anonymous pages to reclaim. The question worth thinking about is: given two reclaimable kinds of memory, how does Linux decide which kind to reclaim when both are available? Or does it reclaim both? This brings in our second parameter of interest: swappiness, which defines how aggressively the kernel uses swap. The higher the value, the more aggressively the kernel swaps; the lower the value, the less. It ranges from 0 to 100 and defaults to 60. How is swappiness actually implemented? The details are complicated, but in short, swappiness controls whether reclaim takes more anonymous pages or more file cache. A swappiness of 100 means anonymous memory and file cache are reclaimed with equal priority; the default of 60 means file cache is reclaimed first. As to why file cache should go first, think it over: reclaiming clean file cache usually causes no I/O and has little impact on system performance. For a database, swap should be avoided as much as possible, so set it to 0. Note that a setting of 0 does not mean swap will never happen!
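
Checking and changing swappiness is straightforward; a minimal sketch (root required for the writes):

    # Current value; 60 is the usual default
    cat /proc/sys/vm/swappiness

    # Reclaim file cache as strongly as possible; remember that 0 still
    # does not disable swap outright
    sysctl -w vm.swappiness=0

    # Persist across reboots
    echo "vm.swappiness = 0" >> /etc/sysctl.conf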

So far we have gone from the Linux memory reclaim trigger mechanism through the reclaim targets to swap, covering the parameters min_free_kbytes and swappiness. Next, let's look at another swap-related parameter: zone_reclaim_mode. The documentation says setting this parameter to 0 turns off NUMA zone reclaim. What is that about? Mention NUMA and databases get unhappy again; many DBAs have been burned by it. So, three simple questions: What is NUMA? What does NUMA have to do with swap? What exactly does zone_reclaim_mode mean?

NUMA (Non-Uniform Memory Access) stands in contrast to UMA; both are CPU design architectures. Early CPUs were designed around the UMA structure.

To relieve the channel bottleneck when multiple CPU cores access the same memory, chip engineers designed the NUMA structure.

This architecture nicely solves UMA's problem: each CPU has its own local memory area. To achieve this "memory isolation" between CPUs, two things are needed at the software level:

1. Memory allocation should happen in the local memory area of the CPU on which the requesting thread is currently running. Allocating from another CPU's memory area weakens the isolation to some extent, and memory access across the bus inevitably costs performance.

2. If local memory is insufficient, pages in local memory are evicted first, rather than checking whether the remote memory area has free memory to borrow.

Isolation achieved this way is indeed good, but it brings a problem: NUMA can cause uneven memory usage across CPUs. Memory local to some CPUs runs short and is reclaimed frequently, producing heavy swapping and severe jitter in response latency, while memory local to other CPUs may sit idle. This leads to a strange phenomenon: the free command shows the system still has free physical memory, yet the system keeps swapping and the performance of some applications drops sharply. See Ye Jinrong's MySQL case study, "find the culprit of SWAP on MySQL server".
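
To see whether a machine suffers from this kind of imbalance, the numactl package offers some useful views (a sketch; node0 is simply the first node on the box):

    # NUMA nodes, their CPUs, and per-node memory totals
    numactl --hardware

    # Per-node allocation counters; growing numa_miss and numa_foreign
    # values suggest allocations are spilling across nodes
    numastat

    # Free/used memory as seen by one node
    cat /sys/devices/system/node/node0/meminfo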

For applications with small memory footprints, this NUMA problem is not prominent; on the contrary, the performance gain from local memory is considerable. But for memory-hungry applications such as databases, the stability risk of NUMA's default policy is unacceptable, so databases push hard to improve it. There are two areas of improvement:

1. Change the memory allocation policy from the default affinity mode to interleave mode, which scatters memory pages across the zones of different CPUs. This removes the potential unevenness of memory distribution and thereby alleviates the strange problem in the case above to some extent. MongoDB, for example, warns you at startup to use the interleave allocation policy:

WARNING: You are running on a NUMA machine. We suggest launching mongod like this to avoid performance problems: numactl --interleave=all mongod [other options]

2. Improve the memory reclaim policy: here, at last, is today's third protagonist, zone_reclaim_mode, which defines the memory reclaim policies available under NUMA. It can take the values 0, 1, 3, and 4: 0 means that when local memory is insufficient, memory can be allocated from other memory areas; 1 means that when local memory is insufficient, it is reclaimed locally before allocating; 3 means local reclaim frees file cache pages first, as much as possible; 4 means local reclaim prefers using swap to reclaim anonymous memory. Clearly, the zone_reclaim_mode=0 recommended by HBase reduces the probability of swapping to some extent.
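
Checking and applying the recommended setting is a one-liner each (root required for the write; persist it in /etc/sysctl.conf as with the other parameters):

    # 0 = fall back to other NUMA nodes instead of reclaiming locally
    cat /proc/sys/vm/zone_reclaim_mode

    # The value recommended by HBase and most database documentation
    sysctl -w vm.zone_reclaim_mode=0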

It's not all about swap.

So far we have discussed three swap-related system parameters and interpreted them in depth against the background of Linux memory allocation, swap, and NUMA. Beyond these, two more parameters deserve special attention from database systems:

1. I/O scheduling policy: there are plenty of explanations of this topic online, and I won't elaborate here; only the conclusion: for an OLTP database on SATA disks, the deadline scheduler is usually the best choice (a quick check is sketched below).
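
For reference, a sketch of checking and switching the scheduler (sda is a placeholder device name; on newer blk-mq kernels the equivalent scheduler is called mq-deadline):

    # The scheduler in use is shown in brackets
    cat /sys/block/sda/queue/scheduler

    # Switch to deadline for this device (root required, not persistent;
    # use a udev rule or a kernel boot parameter to persist)
    echo deadline > /sys/block/sda/queue/scheduler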

2. Turning the THP (transparent huge pages) feature off. I puzzled over THP for a long time, with two main doubts: first, are THP and HugePage the same thing? Second, why does HBase require THP to be turned off? After consulting the documentation repeatedly, I finally found some clues. Four short points explain the THP feature:

(1) What is HugePage?

There are many explanations of HugePage online that you can search for. In short, computer memory is addressed through table mapping (page tables). Currently the system manages memory in 4KB pages, the smallest unit of memory addressing. As memory grows, the page tables keep growing too. For a machine with 256GB of memory using 4KB pages, the mapping tables alone can reach roughly 4GB. These tables must themselves sit in memory, with their hot entries cached inside the CPU; if they are too large, many misses occur and memory addressing performance degrades.

HugePage solves exactly this problem: it manages memory with 2MB large pages instead of the traditional small pages, which keeps the memory index table small enough to fit entirely in the CPU's cache and prevents misses.

(2) What is THP (Transparent Huge Pages)?

HugePage is the theory of large pages; how do you actually use the feature? The system currently offers two ways: one is called Static Huge Pages, the other Transparent Huge Pages (THP). The former, as the name implies, is a static management policy: the user manually configures the number of huge pages according to the system's memory size, the corresponding number is created at boot, and it does not change afterward. THP is a dynamic management policy: it allocates huge pages to applications on demand at run time and manages them completely transparently, with no configuration required from the user. In addition, THP currently applies only to anonymous memory regions.
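
To see both mechanisms side by side on a running system, a small sketch (the 512 pages below are purely illustrative, not a recommendation):

    # Static huge pages currently reserved -- zero unless explicitly configured
    grep -E "HugePages_(Total|Free)|Hugepagesize" /proc/meminfo

    # Reserve 512 static 2MB huge pages (root required)
    sysctl -w vm.nr_hugepages=512

    # THP mode -- the active choice appears in brackets, e.g. [always] madvise never
    cat /sys/kernel/mm/transparent_hugepage/enabled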

(3) Why does HBase (like other databases) require THP to be turned off?

THP is a dynamic policy that allocates and manages huge pages at run time, so it introduces a degree of allocation latency, which is unacceptable for database systems that chase low response latency. THP has many other drawbacks as well; see the article "why-tokudb-hates-transparent-hugepages".

(4) How much does turning THP off or on affect HBase read/write performance?

To verify the impact of turning THP on and off on HBase performance, I ran a simple test in a test environment: the test cluster had only one RegionServer, and the workload was a 1:1 read/write mix. Some systems offer two THP options, always and never; others add a third option, madvise. THP is switched on or off by writing always or never to /sys/kernel/mm/transparent_hugepage/enabled.
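
For a test like this, the toggles typically look as follows (a sketch, run as root; many guides also switch the companion defrag setting, and the change does not survive a reboot unless placed in an init script or kernel boot parameter):

    # Disable THP for the "never" test run
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag

    # Re-enable it for the "always" comparison run
    echo always > /sys/kernel/mm/transparent_hugepage/enabled
    echo always > /sys/kernel/mm/transparent_hugepage/defrag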

With THP off (never), HBase performed best and most stably. With THP on (always), performance was about 30% lower than with it off, and the performance curve jittered badly. The takeaway: remember to disable THP on HBase production servers.

That's all for "how linux swap is triggered". Thank you for reading! I hope you gained a lot from this article.
