Configuration Recommendations and Shutdown Methods for THP

2025-02-24 Update | From: SLTechnology News&Howtos


In this issue we share configuration recommendations and shutdown methods for THP. The article is rich in content and analyzes the topic from a professional point of view; we hope you get something out of it.

Preface

In production environments we have encountered many cases of performance jitter caused by operating-system features, and THP has been a frequent culprit. Below we share the causes of THP-induced performance jitter, its typical symptoms, and the methods for analyzing it, and finally give configuration recommendations and shutdown methods for THP.

Introduction to THP (Transparent Huge Page)

The world is not black and white. THP is an important kernel feature and one that continues to evolve. Its purpose is to map page-table entries to larger pages, reducing page faults and improving the hit rate of the TLB (Translation Lookaside Buffer, used by the memory-management unit to speed up virtual-to-physical address translation). Following the design principle of the memory hierarchy, when a program's memory access locality is good, THP improves performance; when it is not, THP's advantage is lost and it can turn into a demon that destabilizes the system. Unfortunately, database workloads usually have discrete access patterns.

Review of Linux memory Management

Before addressing the negative phenomena caused by THP, let us recall how Linux manages physical memory. Different architectures have different kernel memory layouts. In general, user space is mapped through multi-level page tables to save the space needed for mapping management, while kernel space uses a linear mapping for simplicity and efficiency. When the kernel boots, physical pages are added to the buddy system, from which they are allocated on request and to which they are returned on free. To accommodate slow devices and a variety of workloads, Linux divides pages into anonymous pages (Anon Page) and file pages (Page Cache): anonymous pages can be swapped out to a swap device, while the page cache caches files on slow devices, and the vm.swappiness tunable lets users decide, according to workload characteristics, the reclaim ratio between the two when memory is short. To respond to allocation requests as quickly as possible while keeping the system running when memory is tight, Linux defines three watermarks per zone: high, low, and min. When free physical memory drops below low but remains above min, the kswapd kernel thread reclaims memory asynchronously while allocations proceed, until the watermark recovers above high. If asynchronous reclaim cannot keep up with the rate of allocation, synchronous direct reclaim is triggered: every thread requesting memory participates in reclaim and together they lift the watermark. At that point, if the pages to be reclaimed are clean, the blocking time is relatively short; if they are dirty, it can be tens or hundreds of milliseconds, or even seconds, depending on the speed of the backing device.
Besides the watermarks, when a large contiguous allocation is requested and free physical memory is sufficient but badly fragmented, the kernel may also trigger direct memory compaction (governed by the fragmentation index, described later). Direct reclaim and direct compaction are therefore the main sources of latency on a process's allocation path, and under workloads with poor access locality, THP is what lies behind both events.
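The three watermarks described above can be read from /proc/zoneinfo. A minimal sketch, with a fabricated sample for illustration (on a live system, feed it /proc/zoneinfo instead):

```shell
# show_watermarks: extract current free pages and the min/low/high
# watermarks per zone from input in /proc/zoneinfo format.
# On a live system: show_watermarks < /proc/zoneinfo
show_watermarks() {
    awk '
        $1 == "Node" { sub(",", "", $2); zone = "node " $2 " zone " $4 }
        $1 == "pages" && $2 == "free" { free = $3 }
        $1 == "min"  { min = $2 }
        $1 == "low"  { low = $2 }
        $1 == "high" { print zone, "free:", free, "min:", min, "low:", low, "high:", $2 }
    '
}

# Fabricated sample input (values are illustrative, not from a real machine):
show_watermarks <<'EOF'
Node 0, zone   Normal
  pages free     243221
        min      11284
        low      14105
        high     16926
EOF
```

If "free" sits between min and low, kswapd is reclaiming asynchronously; below min, allocating threads fall into direct reclaim themselves.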

The most typical symptom: soaring sys CPU usage

At multiple user sites we found that when THP allocation causes performance fluctuations, the most typical symptom is a surge in sys CPU utilization. Analyzing this symptom is relatively simple. By capturing an on-CPU flame graph with perf, we can see that all of the service's threads in the R state are doing memory compaction, and that the page-fault handler is do_huge_pmd_anonymous_page, indicating that the system has no contiguous 2 MB block of physical memory available. Direct memory compaction is therefore triggered, and its logic is very time-consuming, which is what drives sys utilization up.
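A hedged sketch of the check, assuming you have already captured a profile with something like `perf record -a -g -- sleep 30; perf report --stdio > report.txt` (the symbol list below is an assumption based on the analysis above, not an exhaustive set):

```shell
# check_thp_stall: count samples in a text perf report whose symbols point
# at THP fault handling or direct compaction; a large count supports the
# diagnosis that R-state threads are stuck compacting memory.
check_thp_stall() {
    grep -cE 'do_huge_pmd_anonymous_page|try_to_compact_pages|compact_zone' "$1"
}

# Illustration on a fabricated report snippet:
cat > /tmp/report_sample.txt <<'EOF'
    45.12%  mysqld  [kernel]  [k] compact_zone
    30.08%  mysqld  [kernel]  [k] do_huge_pmd_anonymous_page
     1.02%  mysqld  [kernel]  [k] copy_page
EOF
check_thp_stall /tmp/report_sample.txt
```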

Indirect symptom: soaring sys load

Real systems are often complex. When allocating THP or other high-order memory, the system may not do direct compaction, which leaves the typical fingerprint described above, but a mixture of other behaviors, such as direct reclaim. The involvement of direct reclaim makes things more complicated and confusing: at one customer site, we first saw that the free physical memory of the normal zone was above the high watermark, yet the system kept doing direct reclaim. Why? Digging into the slow allocation path, we find it has four main steps: 1. asynchronous memory compaction; 2. direct memory reclaim; 3. direct memory compaction; 4. OOM killing. After each step the kernel retries the allocation; if it succeeds, the page is returned immediately and the remaining steps are skipped. The kernel also maintains a fragmentation index for each order of the buddy system to indicate whether an allocation failure is due to insufficient memory or to fragmentation; the associated tunable is /proc/sys/vm/extfrag_threshold. When the index is close to 1000, the failure is mainly caused by fragmentation and the kernel tends to do compaction; when it is close to 0, the failure is more strongly correlated with insufficient memory and the kernel tends to do reclaim. This is how direct reclaim can occur frequently even above the high watermark. Enabling and using THP consumes high-order memory, which accelerates the performance jitter caused by memory fragmentation.

For this symptom, the diagnosis proceeds as follows:

Run sar -B and observe pgscand/s, the number of pages scanned per second by direct reclaim. If it stays above 0 over a period of time, continue with the next steps to investigate.
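sar's pgscand/s is derived from the kernel's pgscan_direct counters in /proc/vmstat. When sar is not installed, a minimal sketch along these lines can substitute: sample the sum twice and a growing delta means direct reclaim is happening right now (the sample below is fabricated; counter names vary slightly by kernel version):

```shell
# direct_scans: sum the pgscan_direct* counters from input in
# /proc/vmstat format. On a live system: direct_scans < /proc/vmstat
direct_scans() {
    awk '$1 ~ /^pgscan_direct/ { s += $2 } END { print s + 0 }'
}

# Fabricated sample:
direct_scans <<'EOF'
pgscan_kswapd_normal 123456
pgscan_direct_normal 789
pgscan_direct_dma 11
pgscan_direct_throttle 0
EOF
```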

Run cat /sys/kernel/debug/extfrag/extfrag_index to observe the memory fragmentation index, focusing on orders >= 3. An index approaching 1.000 indicates severe fragmentation; an index approaching 0 indicates insufficient memory.
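A small sketch that narrows the extfrag_index output down to the order >= 3 columns that matter for THP (2 MB pages are order 9 on x86_64); the sample line is fabricated for illustration:

```shell
# high_order_extfrag: print the order >= 3 fragmentation indices per zone
# from input in /sys/kernel/debug/extfrag/extfrag_index format.
# Values near 1.000 point at fragmentation, values near 0 at memory shortage.
# On a live system: high_order_extfrag < /sys/kernel/debug/extfrag/extfrag_index
high_order_extfrag() {
    awk '{
        sub(",", "", $2)
        printf "node %s zone %s order>=3:", $2, $4
        for (i = 8; i <= NF; i++) printf " %s", $i   # field 5 is order 0
        print ""
    }'
}

# Fabricated sample (orders 0..10):
high_order_extfrag <<'EOF'
Node 0, zone Normal -1.000 -1.000 -1.000 0.954 0.971 0.985 0.992 0.996 0.998 0.999 1.000
EOF
```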

Run cat /proc/buddyinfo and cat /proc/pagetypeinfo to inspect memory fragmentation (see https://man7.org/linux/man-pages/man5/proc.5.html for the meaning of each field), again paying attention to the number of free pages with order >= 3. pagetypeinfo shows more detail than buddyinfo: it is grouped by migration type (the buddy system implements anti-fragmentation through migration types). Note that when pages of the Unmovable migration type cluster at order < 3, it indicates severe kernel slab fragmentation; other tools are then needed to pinpoint the specific cause, which is beyond the scope of this article.

For kernels with BPF support, such as the CentOS 7.6 kernel, you can also run our drsnoop and compactsnoop tools for quantitative latency analysis; see their documentation for usage and interpretation.

(Optional) Use ftrace to capture mm_page_alloc_extfrag events and observe pages being "stolen" from fallback migration types because of fragmentation.

Atypical symptom: abnormal RES usage

On AARCH64 servers we have encountered services occupying tens of GB of physical memory right after startup. /proc/pid/smaps showed that most of the memory was used by THP, and since the AARCH64 CentOS 7 kernel is compiled with a 64 KB page size, memory usage was many times higher than on the X86_64 platform. While investigating, we also fixed a jemalloc bug where THP was not fully disabled: fix opt.thp:never still use THP with base_map.

Conclusion

For programs that have not been optimized for memory access locality, or workloads whose access patterns are inherently discrete, setting THP and THP defrag to always on does long-running services more harm than good. Moreover, only since kernel 4.6 has THP defrag offered the defer and defer+madvise optimizations. For the CentOS 7 3.10 kernel we commonly use, if a program needs THP, set the THP switch to madvise and allocate THP through the madvise system call; otherwise, disabling it with never is the best choice.

Check the current THP mode:

cat /sys/kernel/mm/transparent_hugepage/enabled

If the value is always, run:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

This completes the shutdown. Note that to keep the setting from being lost when the server restarts, these two commands should be written into a .service file and handed over to systemd for management.
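A minimal sketch of such a unit, assuming the unit name and install target are up to you (they are illustrative here, not prescribed by the article):

```ini
# /etc/systemd/system/disable-thp.service (illustrative name)
[Unit]
Description=Disable Transparent Huge Pages
After=sysinit.target local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled && echo never > /sys/kernel/mm/transparent_hugepage/defrag'

[Install]
WantedBy=basic.target
```

After placing the file, run systemctl daemon-reload and systemctl enable --now disable-thp.service so the setting is reapplied on every boot.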

These are the configuration suggestions and shutdown methods for THP. If you have run into similar problems, we hope the analysis above helps you understand and resolve them.
