This article introduces the first production deployment of non-volatile memory within Alibaba Group: how it runs online, the problems encountered when using NVM, and how they were optimized. It closes with a summary of the design points for building a cache service on NVM. We hope these practical lessons are useful in your own work.
Introduction
Tair MDB is a caching service widely used across the Alibaba ecosystem. It uses non-volatile memory (NVM) as a supplement to DRAM for its back-end storage. It has been running in grayscale in the production environment since the Tmall Shopping Festival and has been through two full-link stress tests. While adopting NVM, Tair MDB ran into problems such as write imbalance and lock overhead, and achieved significant improvements after optimization.
Through this optimization work and production practice, the Tair engineering team distilled a set of design guidelines for building caching services on non-volatile memory (NVM) / persistent memory (PMEM). These guidelines should be instructive for other products that want to optimize with non-volatile memory.
Background
Tair MDB mainly serves cache scenarios and is deployed and used extensively within Alibaba Group. With the introduction of a user-space network stack and lock-free data structures, peak single-node QPS has reached more than 10 million. All Tair MDB data is stored in memory, so as the single-node QPS ceiling keeps rising, memory capacity has gradually become the main factor limiting cluster size.
A single NVM DIMM offers much larger capacity than a DRAM DIMM at a more attractive price. Storing Tair MDB data on NVM is therefore one way to break through the single-node memory capacity limit.
Production environment
Results
End to end, the average read and write latency matches that of nodes using DRAM on the same software version, and service behavior is normal. Production traffic has not yet pushed Tair MDB nodes to their limit; the following sections describe the problems we encountered during stress testing and how we solved them.
Cost
As mentioned earlier, a single NVM DIMM has a higher maximum capacity than a DRAM DIMM and costs less for the same capacity. If NVM is used to make up for the shortage of memory capacity, the number of machines in capacity-bound Tair MDB clusters can be greatly reduced. Taking machine price, power, and rack space into account, the cost can be reduced by roughly 30% to 50%.
Principle
How NVM is used
Tair MDB uses NVM devices by exposing NVM through a PMEM-aware file system mounted with DAX. Allocating NVM space then amounts to creating and opening files on that file system and reserving space with posix_fallocate.
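As a minimal sketch of this usage model, the snippet below reserves space in a file on a DAX-mounted PMEM file system and maps it into the address space; the path and size are hypothetical examples, not Tair MDB's actual configuration.

```c
/* Sketch: reserve and map space on a DAX-mounted PMEM file system.
 * The path and size below are hypothetical examples. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *path = "/mnt/pmem0/tair.pool";   /* hypothetical file on a DAX mount */
    size_t pool_size = 64UL << 30;               /* e.g. 64 GiB of NVM space */

    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve the space up front, as the article describes with posix_fallocate. */
    if (posix_fallocate(fd, 0, pool_size) != 0) { perror("posix_fallocate"); return 1; }

    /* Map the file; with DAX the mapping goes straight to the NVM media. */
    void *base = mmap(NULL, pool_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    /* A built-in memory manager can now carve this range up transparently. */
    printf("NVM pool mapped at %p\n", base);
    munmap(base, pool_size);
    close(fd);
    return 0;
}
```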
Memory allocator
NVM itself is non-volatile, but the cache service Tair MDB treats it as a volatile device: it does not need operation atomicity or post-crash recovery, and it does not need to explicitly issue instructions such as clflush/clwb to force CPU cache contents back to the media.
When allocating DRAM, memory allocators such as tcmalloc/jemalloc are available. NVM space, however, is exposed as a file (or a character device), so how to allocate from it is the first question to answer. The open-source pmem project [1] maintains libmemkind, a volatile memory management library with an easy-to-use malloc/free-style API, which is worth considering for most applications.
Tair MDB does not use libmemkind [2]. The memory layout of Tair MDB is described below, along with the reasons for this choice.
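For reference, here is a minimal sketch of allocating NVM space through memkind's PMEM kind (memkind_create_pmem / memkind_malloc); the directory path is a hypothetical example, and as noted next, Tair MDB itself does not take this route.

```c
/* Sketch: allocating NVM space through libmemkind's PMEM kind.
 * Illustrative only; Tair MDB does not use libmemkind (see below). */
#include <memkind.h>
#include <stdio.h>

int main(void) {
    struct memkind *pmem_kind = NULL;

    /* Create a PMEM kind backed by a file under a DAX-mounted directory
     * (hypothetical path), limited here to 1 GiB. */
    int err = memkind_create_pmem("/mnt/pmem0", 1UL << 30, &pmem_kind);
    if (err) { fprintf(stderr, "memkind_create_pmem failed: %d\n", err); return 1; }

    /* malloc/free-style allocation on NVM. */
    void *buf = memkind_malloc(pmem_kind, 4096);
    if (!buf) { fprintf(stderr, "memkind_malloc failed\n"); return 1; }

    memkind_free(pmem_kind, buf);
    memkind_destroy_kind(pmem_kind);
    return 0;
}
```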
Memory layout
Tair MDB uses a slab mechanism for memory management. Instead of dynamically allocating anonymous memory on demand, it allocates one large chunk of memory at startup; the built-in memory management module then lays out metadata, data pages, and so on within this chunk, as shown in the following figure:
The memory used by Tair MDB is mainly divided into the following parts:
Cache Meta, which stores metadata such as the maximum number of shards, along with index information for the Slab Managers.
Slab Manager, each of which manages slabs of one fixed size.
Hashmap, the global hash table index, which resolves hash collisions with linear chaining; every key access goes through the Hashmap.
Page pool, the memory pool; at startup the memory is divided into 1 MB pages, and each Slab Manager requests pages from the Page pool and formats them to its slab size.
Tair MDB initializes all available memory at startup; subsequent data storage does not need to allocate memory from the operating system dynamically.
When using NVM, the corresponding file is mmapped into the address space to obtain virtual addresses, and the built-in memory management module uses this space transparently. Throughout this process there is no need for malloc/free to manage space on the NVM device.
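Below is a minimal sketch of how such a pre-carved layout might be expressed, assuming the mmapped region is split into the four areas listed above; the structures, fields, and sizes are illustrative assumptions rather than Tair MDB's actual definitions.

```c
/* Sketch: carving a mmapped NVM region into the areas described above.
 * All names and sizes are illustrative assumptions, not Tair MDB's code. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE (1UL << 20)   /* 1 MB pages handed out by the page pool */

struct cache_meta   { uint32_t max_shards;   /* ... slab manager index info ... */ };
struct slab_manager { uint32_t slab_size;    /* ... per-size free lists ... */ };
struct hashmap      { uint64_t bucket_count; /* ... linear collision chains ... */ };

struct layout {
    struct cache_meta   *meta;      /* fixed-size metadata at the start */
    struct slab_manager *managers;  /* one manager per slab size class */
    struct hashmap      *index;     /* global hash table */
    uint8_t             *page_pool; /* remainder, split into 1 MB pages */
    size_t               page_count;
};

/* Lay the areas out back to back inside the mapped region. */
static void carve_layout(void *base, size_t total, size_t nmanagers,
                         size_t index_bytes, struct layout *out) {
    uint8_t *p = (uint8_t *)base;
    out->meta       = (struct cache_meta *)p;   p += sizeof(struct cache_meta);
    out->managers   = (struct slab_manager *)p; p += nmanagers * sizeof(struct slab_manager);
    out->index      = (struct hashmap *)p;      p += index_bytes;
    out->page_pool  = p;
    out->page_count = (total - (size_t)(p - (uint8_t *)base)) / PAGE_SIZE;
}
```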
Stress testing
Tair MDB ran into some problems during stress testing after introducing NVM as a supplement to DRAM for back-end storage. This chapter describes how these problems showed up and how they were optimized away.
Problems
After switching to NVM, Tair MDB was stress tested with 100-byte entries, comparing in-engine latency and client-observed QPS against DRAM. The results:
NVM-based read QPS and latency were on par with DRAM, while write TPS was only about one third of DRAM's.
Analysis
perf showed that the write performance loss came from a lock whose critical section contains write operations to the Page described in the memory layout above. We suspected this was because writes to NVM have higher latency than writes to DRAM.
During stress testing we used pcm [3] to look at the per-DIMM bandwidth statistics of the NVM DIMMs and observed that writes were very uneven: in steady state one DIMM received roughly twice as many writes as the others.
The details are shown in the following figure:
Here is a brief introduction to how the NVM DIMMs are placed.
Placement strategy
Currently 4 NVM DIMMs are placed on a single socket, distributed roughly as in the following figure:
This placement strategy is called 2-2-1. Each socket has four DIMMs belonging to four different channels. When multiple channels are used, the CPU interleaves across them to make efficient use of memory bandwidth. Under the current placement policy and configuration, the CPU interleaves across the DIMMs in order at a granularity of 4 KB.
Cause of imbalance
Given this interleaving policy, we inferred that the same region was being written repeatedly and that this region sits on the hot DIMM, which is why that DIMM's write volume is significantly higher than the others'.
The next problem to solve was finding the processing logic that produced the write hotspot. The naive method is to list suspicious places and rule them out one by one. Below is the approach the Tair engineering team actually used.
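To make the mechanics concrete, here is a small sketch of the 4 KB round-robin mapping described above, together with a hypothetical fixed-offset write pattern (per-page metadata at the start of each 1 MB page) that would always land on the same DIMM; the offsets are illustrative assumptions, not Tair MDB's actual layout.

```c
/* Sketch: which DIMM an offset lands on under the 4 KB round-robin
 * interleaving across 4 DIMMs described above. Illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define INTERLEAVE_GRANULARITY 4096u
#define NUM_DIMMS              4u

static unsigned dimm_of(uint64_t offset) {
    return (unsigned)((offset / INTERLEAVE_GRANULARITY) % NUM_DIMMS);
}

int main(void) {
    /* Hypothetical hotspot: metadata written at the start of every 1 MB page.
     * Since 1 MB is a multiple of 4 KB * 4, every such write maps to DIMM 0,
     * illustrating how a fixed-offset write pattern concentrates on one DIMM. */
    for (uint64_t page = 0; page < 4; page++) {
        uint64_t meta_offset = page * (1UL << 20);
        printf("page %llu metadata write -> DIMM %u\n",
               (unsigned long long)page, dimm_of(meta_offset));
    }
    return 0;
}
```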
Optimization
As mentioned above, write hotspots caused uneven access to the NVM DIMMs, so the first step of optimization was to find the write hotspots and deal with them, for example by scattering the hot accesses or by moving the hot regions into DRAM.
Finding write hotspots
To find write hotspots, the Tair engineering team used Pin [4]. As mentioned above, Tair MDB mmaps a file to obtain the addresses it uses to operate on memory, so we used Pin to capture the return value of mmap and thereby the address range of NVM within the process's address space. We then used Pin to instrument every instruction that writes memory and counted the number of writes to each byte in the NVM-mapped range.
In the end we confirmed that the write hotspot does exist and that the hot region is the Page metadata. Several remedies were considered: adding padding so the hotspot is interleaved across all DIMMs, spreading similarly hot data evenly across DIMMs, or moving the hotspot back into DRAM. We finally chose to move slab_manager and page_info back into DRAM. The modified structure is as follows:
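As a rough illustration of the chosen fix, the sketch below keeps the 1 MB data pages on the mmapped NVM region while holding page_info and the slab managers in ordinary DRAM allocations; the structure and field names are assumptions based on the description above, not Tair MDB's actual code.

```c
/* Sketch: metadata (slab_manager, page_info) in DRAM, data pages on NVM.
 * Names and fields are assumptions based on the description above. */
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE (1UL << 20)

struct page_info    { uint32_t slab_size; uint32_t used; };        /* per-page metadata */
struct slab_manager { uint32_t slab_size; uint32_t free_pages; };

struct engine {
    /* DRAM side: frequently written metadata, allocated with plain malloc/calloc. */
    struct page_info    *page_infos;   /* one entry per NVM page */
    struct slab_manager *managers;

    /* NVM side: the mmapped pool, now used only for data pages. */
    uint8_t *nvm_base;
    size_t   page_count;
};

static int engine_init(struct engine *e, void *nvm_base, size_t nvm_size,
                       size_t nmanagers) {
    e->nvm_base   = (uint8_t *)nvm_base;
    e->page_count = nvm_size / PAGE_SIZE;
    e->page_infos = calloc(e->page_count, sizeof(*e->page_infos)); /* DRAM */
    e->managers   = calloc(nmanagers, sizeof(*e->managers));       /* DRAM */
    return (e->page_infos && e->managers) ? 0 : -1;
}

/* Data for page i still lives on NVM; its metadata updates now hit DRAM. */
static uint8_t *page_data(struct engine *e, size_t i) {
    return e->nvm_base + i * PAGE_SIZE;
}
```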
With this change the imbalance problem was solved: TPS rose from about 850,000 to 1.4 million, and in-engine write latency dropped from 40 μs to 12 μs.
Lock overhead is too high
At 1.4 million TPS, we noticed that the overhead of the pthread_spin_lock mentioned above was still very high. The perf record results showed the call stack in which pthread_spin_lock was consuming time:
Analysis of batch_alloc_item showed that initializing the items of a page inside the critical section produced a large number of writes to NVM. Because NVM writes are slower than DRAM writes, this became a major time sink.
In fact, according to Tair MDB's logic, the lock only needs to be held while linking the page into the slab_manager, so the item initialization was moved out of the critical section. After that, all writes to NVM inside critical sections in the Tair MDB code were reviewed and optimized in the same way.
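Here is a simplified before/after sketch of this change, assuming a batch_alloc_item-style routine that formats a page into items and then links it into its slab_manager; the types and the formatting step (memset) are illustrative assumptions.

```c
/* Sketch: moving item initialization (many NVM writes) out of the
 * pthread_spin_lock critical section. Types and names are illustrative.
 * Assume mgr->lock was initialized earlier with pthread_spin_init. */
#include <pthread.h>
#include <stdint.h>
#include <string.h>

struct slab_manager {
    pthread_spinlock_t lock;
    void              *page_list;   /* pages already formatted for this slab size */
    uint32_t           slab_size;
};

/* Before: the whole page is initialized while the lock is held, so every
 * slow NVM write lands inside the critical section. */
static void batch_alloc_item_before(struct slab_manager *mgr, uint8_t *page, size_t page_size) {
    pthread_spin_lock(&mgr->lock);
    memset(page, 0, page_size);            /* item init: many writes to NVM */
    /* ... link page into mgr->page_list ... */
    pthread_spin_unlock(&mgr->lock);
}

/* After: items are initialized outside the lock; only the cheap link
 * operation remains in the critical section. */
static void batch_alloc_item_after(struct slab_manager *mgr, uint8_t *page, size_t page_size) {
    memset(page, 0, page_size);            /* NVM writes happen lock-free */
    pthread_spin_lock(&mgr->lock);
    /* ... link page into mgr->page_list ... */
    pthread_spin_unlock(&mgr->lock);
}
```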
After this optimization, pthread_spin_lock overhead dropped back to a normal range, TPS rose to 1.7 million, and in-engine write latency fell to 9 μs.
Optimization results
Optimizations such as balancing the write load and reducing lock granularity effectively lowered latency and raised TPS to 1.7 million, 100% higher than before. Because of the difference in media, write performance is still about 30% lower than DRAM's, but for a cache service that reads far more than it writes, this gap has little impact on overall performance.
Design guidelines
Based on the above optimization work and production environment practice, the Tair engineering team summarized the following design guidelines for building caching services on NVM; they are closely tied to the hardware characteristics involved.
Hardware characteristics
Unique NVM hardware features that have an impact on caching service design:
Higher density and cheaper than DRAM
Higher latency and lower bandwidth than DRAM
Read/write asymmetry: write latency is higher than read latency
The hardware wears out, and frequently writing the same location increases the wear
Guidelines
Guideline A: avoid write hotspots
While using NVM, Tair MDB ran into write hotspots, which increase media wear and cause load imbalance (the write pressure concentrates on one DIMM, so the bandwidth of all DIMMs cannot be fully used). Besides memory layout issues (metadata stored together with data), business access patterns can also create write hotspots.
The Tair engineering team summarizes several ways to avoid write hotspots:
Separate metadata from data and move the metadata into DRAM. Metadata is accessed much more frequently than data; the page_info in Tair MDB mentioned earlier is metadata. This mitigates, from the upper layer, the fact that NVM write latency is higher than DRAM's.
Implement copy-on-write logic in the upper layer. In some scenarios this reduces wear on specific regions of the hardware. When Tair MDB updates a piece of data, it does not update the old entry in place; instead, it adds a new entry to the head of the hashmap collision chain and deletes the old entry asynchronously (see the sketch after this list).
Continuously detect hot writes, dynamically migrate them to DRAM, and merge the writes. For the hotspots caused by business access patterns mentioned above, Tair MDB detects hot writes at runtime and merges them, reducing accesses to the underlying media.
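The sketch below illustrates the copy-on-write update path described in the second point: instead of updating the old entry in place on NVM, a new entry is pushed onto the head of the hashmap collision chain and the old one is reclaimed asynchronously. The structures are illustrative assumptions, not Tair MDB's actual code.

```c
/* Sketch: update-as-insert on the hashmap collision chain instead of an
 * in-place update, as described above. Structures are illustrative. */
#include <stdlib.h>
#include <string.h>

struct entry {
    struct entry *next;       /* collision chain link */
    char          key[32];
    /* ... value stored on NVM ... */
};

struct bucket { struct entry *head; };

/* Instead of overwriting the old entry's value on NVM (in-place update),
 * allocate a fresh entry, fill it, and push it onto the head of the chain.
 * The stale entry further down the chain is reclaimed asynchronously. */
static int update(struct bucket *b, const char *key) {
    struct entry *e = malloc(sizeof(*e));   /* would come from the NVM allocator */
    if (!e) return -1;
    strncpy(e->key, key, sizeof(e->key) - 1);
    e->key[sizeof(e->key) - 1] = '\0';
    e->next = b->head;                      /* new entry shadows the old one */
    b->head = e;
    /* mark_old_entry_for_async_delete(...) would run in the background. */
    return 0;
}
```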
Guideline B: reduce work inside critical sections
Because NVM write latency is higher than DRAM's, any NVM operation inside a critical section magnifies the cost of that critical section and reduces the parallelism of the upper layer.
The lock overhead described earlier was never observed when Tair MDB ran on DRAM, because the implicit assumption that this critical section was cheap held on DRAM but no longer holds on NVM. This is a common issue when adopting new media: assumptions baked into existing software may no longer be valid, so the original flow needs adjusting.
For these reasons, the Tair engineering team recommends that cache services using NVM design lock-free data structures as far as possible, reducing critical-section accesses and avoiding the cascading effects of increased latency.
Tair MDB introduced user-space RCU and made most access paths lock-free, which greatly reduced the impact of NVM latency on the upper layers.
Guideline C: implement an appropriate allocator
The allocator is the basic component through which an application uses an NVM device. Its concurrency directly affects software efficiency, and its space management determines space utilization. Designing, implementing, or choosing an allocator that fits the software's characteristics is key to building a cache service on NVM.
Based on Tair MDB's experience, an allocator for NVM should have the following functions and characteristics (a sketch of such an interface follows the list):
Defragmentation: because NVM is denser and larger than DRAM, the same fragmentation rate wastes more space. With a defragmentation mechanism in place, the upper-level application needs to avoid in-place updates and try to keep the sizes handed out by the allocator fixed.
Thread-local quotas: similar to the reduction of critical-section accesses mentioned above; without a thread-local quota, the latency of allocating from the global resource pool lowers the concurrency of allocation operations.
Capacity awareness: the allocator needs to know how much space it manages. The cache service needs to grow or shrink the managed space, and the allocator must provide functions to support this.
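As promised above, here is a sketch of what such an allocator interface could look like; every name is hypothetical and only meant to show where defragmentation, thread-local quotas, and capacity awareness would surface in the API.

```c
/* Sketch: a possible interface for an NVM cache allocator with the three
 * properties listed above. All names are hypothetical, not Tair MDB's API. */
#include <stddef.h>

typedef struct nvm_allocator nvm_allocator_t;

/* Capacity awareness: the allocator is told exactly which region it manages
 * and can later be asked to grow or shrink that region. */
nvm_allocator_t *nvm_alloc_create(void *base, size_t size);
int              nvm_alloc_resize(nvm_allocator_t *a, size_t new_size);
size_t           nvm_alloc_capacity(const nvm_allocator_t *a);

/* Fixed-size allocation (slab-style) keeps fragmentation predictable and
 * makes defragmentation / compaction feasible. */
void *nvm_alloc_fixed(nvm_allocator_t *a, size_t size_class);
void  nvm_free_fixed(nvm_allocator_t *a, void *ptr);
int   nvm_alloc_defragment(nvm_allocator_t *a);

/* Thread-local quota: each thread reserves a batch from the global pool and
 * then allocates from its private cache without touching global state. */
int   nvm_alloc_reserve_local(nvm_allocator_t *a, size_t quota_bytes);
```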
The above guidelines have been tested in practice, are feasible, and benefit the application. We believe they will also be helpful to other products that want to use NVM.
Future work
As mentioned above, Tair MDB still treats NVM as a volatile device, exploiting its high density and low price to reduce the overall cost of the service. Going forward, the Tair engineering team will work to make better use of NVM's non-volatile characteristics, tap the benefits of the new hardware, and bring them to the business and other upper-layer services.
This article is the first in a series of NVM posts from the Tair team of Alibaba Group's Storage Technology Division; we will continue to share our thinking and results in the NVM field.
Tair is a distributed online storage system focused on accelerating online access under extremely large traffic. It is used at massive scale within Alibaba Group, providing ultra-low-latency responses behind hundreds of millions of requests per second. Its scenarios include various online caches, in-memory databases, and high-performance persistent NoSQL databases, all in pursuit of high concurrency, fast response, and high availability. The technical challenges here include peaks of hundreds of millions of accesses, complex requirements from different kinds of business, operating clusters of tens of thousands of servers, and globalization of the business.
About the author
Mo Bing (Fu Qiulei) is a technical expert in the Storage Technology Division of Alibaba Group, mainly focusing on distributed caching and NoSQL databases.