This article focuses on "Linux file system and persistent memory example analysis". The approach introduced here is simple, fast and practical, so let's take a look.
In Linux, everything is a file: besides regular files in the narrow sense, directories, devices, sockets and pipes are all files.
File systems have different meanings in different contexts:
A method of organizing files on a storage device, including the data structures and access methods used.
A piece of storage media formatted according to a file system type. When we say that a file system is mounted on or unmounted from a directory, this is the sense of "file system" we mean.
The module in the kernel responsible for managing and storing files, that is, the file system module.
The architecture of the Linux file system is shown in the following figure, which is divided into three levels: user space, kernel space and hardware:
Note on the figure above: most of the time we cannot tell the difference between "cache" and "buffer" in the kernel file system, since both can be translated as "cache". From the figure, however, it is clear that "cache" refers to the page cache, which serves files: apart from DAX (direct access) devices, which bypass it, flash and block devices all go through the page cache. "Buffer" refers to the block cache, which serves block devices.
1.1. Hardware level
External storage devices are divided into three categories: block devices, flash memory and NVDIMM devices. There are two main types of block devices:
Mechanical hard disk: the read and write unit of mechanical hard disk is sector. When accessing a mechanical hard disk, you need to first move the head along the radius to find the track, and then turn the disk to find the sector.
Flash block devices: flash memory is used as the storage medium, and an internal controller runs firmware whose functions include the Flash Translation Layer (FTL), which makes the flash memory appear as a block device. Common flash block devices are Solid State Drives (SSD) used in personal computers and laptops, as well as embedded Multi Media Card (eMMC) and Universal Flash Storage (UFS) devices used in mobile phones and tablets. Compared with mechanical hard drives, flash block devices offer fast access (there is no mechanical seek), high vibration resistance and portability.
The main features of flash memory (Flash Memory) are as follows:
It is necessary to erase an erase block before writing data, because writing to flash memory can only change a bit from 1 to 0, not from 0 to 1; the purpose of erasing is to set all bits of the erase block to 1 (a short conceptual sketch in C follows this list).
The number of times an erase block can be erased is limited: on the order of 10^4 to 10^5 for NOR flash and 10^3 to 10^6 for NAND flash.
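The erase-before-write behaviour can be modelled in a few lines of C. The sketch below is only a conceptual illustration of how a program operation can clear bits but never set them; it is not flash driver code.

#include <stdint.h>
#include <stdio.h>

/* Conceptual model only (not driver code): a flash program operation can
 * clear bits (1 -> 0) but never set them, so the stored byte is the AND of
 * the old contents and the new data; an erase resets the block to all 1s. */
int main(void)
{
    uint8_t cell = 0xFF;             /* freshly erased cell            */
    cell &= 0x3C;                    /* program 0x3C: works, cell == 0x3C */
    cell &= 0xF0;                    /* program 0xF0 without erasing first */
    printf("cell = 0x%02X\n", cell); /* prints 0x30, not 0xF0          */
    return 0;
}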
Flash memory is divided into NAND flash memory and NOR flash memory according to the storage structure, and the difference between the two is as follows:
The capacity of NOR flash memory is small, while the capacity of NAND flash memory is large.
NOR flash memory supports byte addressing and eXecute In Place (XIP): programs can be executed directly from flash without first being read into memory. The minimum read and write unit of NAND flash memory is a page or sub-page; an erase block is divided into multiple pages, and some NAND flash memory further divides a page into multiple sub-pages.
NOR flash memory reads faster than NAND flash memory, but writes and erases more slowly.
There are no bad blocks in NOR flash memory; NAND flash memory has bad blocks, mainly because the cost of eliminating them is too high. NOR flash memory is suitable for storing programs and is generally used for boot code such as u-boot; NAND flash memory is suitable for storing data.
Why is the file system specifically designed for flash memory? The main reasons are as follows:
There are bad blocks in NAND flash memory, and the software needs to identify and skip them.
It is necessary to implement wear leveling, which balances the erase counts across all erase blocks to prevent some blocks from wearing out before the others.
The main differences between mechanical hard drives and NAND flash memory are as follows:
The minimum read and write unit of a mechanical hard disk is the sector, generally 512 bytes; the minimum read and write unit of NAND flash memory is a page or sub-page.
A mechanical hard disk can write data directly; NAND flash memory must erase an erase block before writing data.
The service life of a mechanical hard disk is longer than that of NAND flash memory: there is no limit on the number of writes to a sector of a mechanical hard disk, while the erase count of a NAND flash erase block is limited.
A mechanical hard disk hides bad sectors, so software does not need to handle them; the bad blocks of NAND flash memory are visible to software, which must handle them.
NVDIMM (Non-Volatile DIMM; DIMM is the abbreviation of Dual In-line Memory Module, a memory form factor) devices integrate NAND flash memory, DRAM and a supercapacitor: access is as fast as memory, and data is not lost on power failure. At the moment of a power outage, the supercapacitor provides the power needed to transfer the data in DRAM to the NAND flash memory.
1.2. Kernel space level
You can see in the kernel's fs directory that the kernel supports many file system types. In order to provide a unified file operation interface to user programs and allow different file system implementations to coexist, the kernel implements an abstraction layer called the Virtual File System (VFS), also known as the Virtual Filesystem Switch. File systems fall into the following categories.
Block device file systems, whose storage devices are block devices such as mechanical hard disks and solid state drives; the commonly used block device file systems are EXT and btrfs. EXT is the native Linux file system family, with three versions currently in use: EXT2, EXT3 and EXT4.
Flash file systems, whose storage devices are NAND flash and NOR flash; commonly used flash file systems are JFFS2 (Journalling Flash File System version 2) and UBIFS (Unsorted Block Image File System).
Memory file systems, whose files live in memory and are lost on power-off; the commonly used memory file system is tmpfs, which is used for temporary files.
A pseudo file system is a fake file system that exists just to reuse the programming interface of the virtual file system. The commonly used pseudo file systems are listed below; a short C example of reading one of them through the ordinary file interface follows this list.
sockfs, which allows sockets to receive messages with the file-reading interface read and send messages with the file-writing interface write.
The proc file system, originally developed to export process information from the kernel to user space and later extended to export arbitrary kernel information; it is usually mounted on the directory "/proc".
sysfs, which exports kernel device information to user space; it is usually mounted on the directory "/sys".
hugetlbfs, which is used to implement standard huge pages.
The cgroup file system; a control group (cgroup) is used to control the resources of a set of processes, and the cgroup file system lets administrators configure cgroups by writing files.
The cgroup2 file system; cgroup2 is the second version of cgroup, and the cgroup2 file system lets administrators configure cgroup2 by writing files.
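Because pseudo file systems reuse the VFS interface, their contents can be read with ordinary file calls. The following minimal C sketch reads /proc/version purely as an illustration; any file under /proc or /sys could be used instead.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Pseudo file systems reuse the VFS interface, so kernel information can be
 * read with ordinary open/read calls; /proc/version is used as an example. */
int main(void)
{
    char buf[256];
    int fd = open("/proc/version", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) { buf[n] = '\0'; printf("%s", buf); }

    close(fd);
    return 0;
}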
Alongside the file systems, the kernel storage stack includes the following components:
Page cache: access to external storage devices is slow, so to avoid accessing the device on every file read or write, the file system module creates a cache in memory for each file; because the unit of caching is a page, it is called the page cache.
Block device layer: the access unit of a block device is a block, and the block size is an integral multiple of the sector size. The kernel implements a unified block device layer for all block devices.
Block cache: to avoid having to access block devices for every read and write, the kernel implements block caching, creating a block cache in memory for each block device. The units of caching are blocks, and block caching is based on page caching.
IO scheduler: when accessing a mechanical hard disk, moving the head to find tracks and sectors is time-consuming. If read and write requests are sorted by sector number, head movement can be reduced and throughput improved. The IO scheduler determines the order in which read and write requests are submitted and provides several scheduling algorithms for different scenarios: NOOP (No Operation), CFQ (Completely Fair Queuing) and deadline. The NOOP algorithm suits flash block devices, while CFQ and deadline suit mechanical hard disks.
Block device drivers: each block device needs to implement its own driver.
The kernel treats flash memory as a Memory Technology Device (MTD) and implements a unified MTD layer for all flash; each flash chip needs its own driver. For NVDIMM devices, the file system needs to implement DAX (Direct Access; the X is said to stand for eXciting and carries no meaning, it just makes the name look cool), which bypasses the page cache and the block device layer and maps the memory of the NVDIMM device directly into the virtual address space of a process or of the kernel.
The libnvdimm subsystem supports three types of NVDIMM devices: NVDIMM devices in persistent memory (PMEM) mode, NVDIMM devices in block device (BLK) mode, and NVDIMM devices that support both PMEM and BLK access modes. The PMEM access mode treats the NVDIMM device as memory, while the BLK access mode treats it as a block device. Each NVDIMM device needs its own driver.
2. NVDIMM, the next-generation storage technology
NVDIMM (Non-Volatile Dual In-line Memory Module) is random-access, non-volatile memory. Non-volatile means the data does not disappear without power, so it survives unexpected power loss, system crashes and normal shutdowns. The name also indicates that it uses the DIMM package, is compatible with standard DIMM slots, and communicates over the standard DDR bus. Because it is non-volatile and compatible with the traditional DRAM interface, it is also known as Persistent Memory.
2.1. Category
Currently, according to the definition of the JEDEC Standardization Organization, there are three NVDIMM implementations. They are:
NVDIMM-N
NVDIMM-N puts traditional DRAM and flash memory on the same module; the computer accesses the traditional DRAM directly, and both byte addressing and block addressing are supported. A small backup power supply provides enough energy to copy the data from DRAM to flash on power failure; the data is loaded back into DRAM when power is restored.
NVDIMM-N schematic diagram
NVDIMM-N basically works the same way as traditional DRAM, so its latency is also on the order of 10^1 nanoseconds (tens of nanoseconds). Its capacity is limited by the module size and is no larger than that of traditional DRAM.
At the same time, the way it works means that its flash portion is not addressable, and using two media on one module raises the cost sharply. Nevertheless, NVDIMM-N gave the industry the concept of persistent memory, and there are already many NVDIMM-N-based products on the market.
NVDIMM-F
NVDIMM-F is flash memory attached to the DDR3 or DDR4 bus. SSDs with NAND flash as the medium generally use the SATA, SAS or PCIe bus; using the DDR bus increases the maximum bandwidth and reduces protocol latency and overhead to some extent, but only block addressing is supported.
NVDIMM-F essentially works the same way as an SSD, so its latency is on the order of 10^1 microseconds (tens of microseconds), while its capacity can easily reach the terabyte level.
NVDIMM-P
This is a standard still under development, expected to be released together with the DDR5 standard. According to the plan, DDR5 will provide double the bandwidth of DDR4 and improve channel efficiency. These improvements, along with a server- and client-friendly interface, will support high performance and improved power management in a variety of applications.
NVDIMM-P is actually a mix of real DRAM and flash. It supports both block addressing and byte addressing like traditional DRAM. It can reach terabyte-level capacity like NAND flash while keeping latency on the order of 10^2 nanoseconds (hundreds of nanoseconds).
By connecting the storage medium directly to the memory bus, the CPU can access the data without any driver or PCIe overhead. And because memory is accessed in 64-byte cache lines, the CPU only needs to fetch the data it actually needs, rather than accessing a whole block each time as with an ordinary block device.
In May 2018 Intel released Intel® Optane™ DC Persistent Memory based on 3D XPoint™ technology, which can be regarded as an implementation of NVDIMM-P.
Intel® Optane™ DC Persistent Memory
2.2. Hardware support
Applications can access NVDIMM-P directly, just as with traditional DRAM, which eliminates the need to page data between traditional block devices and memory. However, writes to persistent memory share CPU resources with writes to normal DRAM, including processor store buffers, the L1/L2 caches, and so on.
It is important to note that to make the data persistent, make sure that the data is written to a persistent memory device or to buffer with power-down protection. If the software is to take full advantage of persistent memory, the instruction set architecture needs at least the following support:
The atomicity of writing
Writes of any size to persistent memory should be guaranteed to be atomic in case the system crashes or suddenly loses power. IA-32 and IA-64 processors guarantee write atomicity for cached data accesses (aligned or unaligned) of up to 64 bits, so software can safely update data in persistent memory. This also improves performance, because it removes the overhead of mechanisms such as copy-on-write or write-ahead logging that would otherwise be needed to provide write atomicity.
Efficient cache refresh (flushing)
For performance reasons, data from persistent memory is placed in the processor's caches before it is accessed. Optimized cache flush instructions reduce the performance impact of CLFLUSH.
A. CLFLUSHOPT provides a more efficient cache flush instruction.
B. The CLWB (Cache Line Write Back) instruction writes the modified data in a cache line back to memory (similar to CLFLUSHOPT), but instead of moving the cache line to the invalid state (Invalid in the MESI protocol), it moves it to the unmodified Exclusive state. CLWB thus tries to avoid the otherwise inevitable cache miss on the next access that a full cache-line flush would cause.
Submit to persistent memory (Committing to Persistence)
In modern computer architectures, completion of a cache flush only indicates that the modified data has reached the write buffers of the memory subsystem; the data is not yet persistent at that point. To guarantee that data reaches persistent memory, software must flush the volatile write buffers or other caches in the memory subsystem. The PCOMMIT instruction, a new commit instruction for persistent writes, commits data from the memory subsystem's write queues to persistent memory.
Optimization of non-temporary store operations (Non-temporal Store Optimization)
When software needs to copy large amounts of data from normal memory to persistent memory (or between regions of persistent memory), weakly ordered non-temporal store operations can be used (for example the MOVNTI instruction). Because non-temporal stores implicitly invalidate the cache line being written, the software does not need to flush the cache line explicitly (see Section 10.4.6.2 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1).
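As an illustration of these instruction-set features, the following C sketch uses compiler intrinsics to combine an aligned 8-byte store (atomic on x86-64), CLWB plus SFENCE to flush and order it, and non-temporal stores for a bulk copy. It assumes an x86-64 CPU that supports CLWB and a build flag such as gcc -mclwb; for demonstration it runs on ordinary heap memory, whereas in real use the destination would be a DAX mapping of persistent memory. It is a sketch of the mechanism, not production code.

#include <immintrin.h>   /* _mm_clwb, _mm_clflushopt, _mm_sfence, _mm_stream_si64 */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Store one 64-bit value and flush it. The aligned 8-byte store is atomic on
 * x86-64; CLWB writes the cache line back without invalidating it, and SFENCE
 * orders the flush with respect to later stores. */
static void persist_u64(uint64_t *dst, uint64_t value)
{
    *dst = value;
    _mm_clwb(dst);       /* or _mm_clflushopt(dst) on CPUs without CLWB */
    _mm_sfence();
}

/* Bulk copy with weakly ordered non-temporal stores (MOVNTI family); the data
 * bypasses the cache, so no explicit flush is needed, but SFENCE is still
 * required to order the stores. */
static void nt_copy64(long long *dst, const long long *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si64(&dst[i], src[i]);
    _mm_sfence();
}

int main(void)
{
    /* Demonstration uses ordinary heap memory; on real hardware dst would
     * point into a DAX mapping of persistent memory. */
    uint64_t *buf = aligned_alloc(64, 64);
    if (!buf) return 1;
    persist_u64(buf, 0x1234567890abcdefULL);

    long long src[4] = {1, 2, 3, 4}, dst[4];
    nt_copy64(dst, src, 4);

    free(buf);
    return 0;
}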
Summary
The above introduces several implementation methods of NVDIMM, as well as the hardware optimization and support to give full play to the performance of NVDIMM-P. The following will continue to introduce software support, including programming model, programming library, SPDK support, and so on.
In the previous introduction to NVDIMM, we covered several hardware implementations of NVDIMM and the hardware changes made to support them and optimize performance. Next, let's discuss what software support is needed to take full advantage of NVDIMM. Some people may wonder why it is so much trouble: since it is persistent memory, isn't it enough to just power off and power back on? At present that idea cannot become reality, because DRAM is not the only volatile component: caches and registers are volatile too, so simply making memory persistent does not achieve the goal. Another problem is memory leaks. With ordinary memory, a leak is cured by a reboot; what about a leak in persistent memory? That is a thorny problem. Pmem resembles memory in some ways and storage in others, but generally speaking we do not expect Pmem to replace either; rather, it is a supplement that fills the huge gap between memory and storage.
SPDK introduced support for Pmem in release 17.10. Pmem is exposed as a block device at SPDK's bdev layer and is accessed by the upper layers through that block device interface, as shown in the following figure.
We can see from the figure that libpmemblk converts block operations to byte operations. How does it do that? Before we introduce libpmemblk and the PMDK behind it, let's take a look at the basics.
Mmap and DAX
First, let's look at the traditional buffered I/O mode, that is, cached I/O. The default I/O mode of most operating systems is cached I/O: data is cached in the operating system's page cache, meaning it is first copied into a kernel-space buffer and then copied from that kernel buffer into the specified user address space.
In Linux, this way of accessing files is implemented through the read/write system calls, as shown in the figure above. Next, let's compare it with memory-mapped I/O, mmap().
mmap returns a pointer into the corresponding file, which can then be assigned to or copied with memcpy/strcpy just like ordinary memory. This is what we call load/store access (msync or fsync is usually still needed to persist the data to the storage device).
Because mmap establishes the mapping relationship between files and user space, it can be regarded as copying files directly to user space, reducing one data copy. But mmap still depends on page cache.
After mmap, what is DAX? DAX, or direct access, builds on mmap. The difference is that DAX does not use the page cache at all and accesses the storage device directly, so it is made for NVDIMM: an application's operations on an mmap-ed file are synchronized directly to the NVDIMM. DAX is currently supported by XFS and EXT4 on Linux and by NTFS on Windows. Note that to use this mode, the application or file system may need to be modified.
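To make the load/store model concrete, here is a minimal C sketch: it maps a file with mmap, writes to it with an ordinary store, and calls msync to persist the change. The path /mnt/pmem3/example.txt is only an assumption for illustration (a file on a DAX-mounted file system such as the one created in section 3.3 below); on a non-DAX file system the same code works, just through the page cache.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file on a DAX mount; adjust the path for your system. */
    const char *path = "/mnt/pmem3/example.txt";
    size_t len = 4096;

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    /* Map the file; on a DAX file system this maps the NVDIMM itself. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(p, "hello, persistent memory");   /* plain store, no write() call */

    /* Flush the modified range; on DAX this flushes the CPU caches. */
    if (msync(p, len, MS_SYNC) != 0) perror("msync");

    munmap(p, len);
    close(fd);
    return 0;
}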
2.3. NVM Programming Model
NVM Programming Model roughly defines three ways to use it.
2.3.1 Management (the leftmost path): manage the NVDIMM mainly through APIs provided by the driver, such as checking capacity, health status and firmware version, upgrading firmware, configuring modes, and so on.
2.3.2 Use as a fast storage device (the middle path): with a kernel and file system that support the NVDIMM driver, applications need no modification and access the NVDIMM through the standard file interface.
2.3.3 Use the file system's DAX feature (the third path): data is accessed with load/store operations and persisted synchronously, with no page cache, no system calls and no interrupts. This is the core of the NVM Programming Model and fully releases the performance advantage of NVDIMM; the downside is that applications may need to be changed.
PMDK
libpmemblk implements an array of equally sized blocks residing in pmem, where each block update remains atomic in the face of sudden power loss, program crashes, and so on. libpmemblk is built on the libpmem library, a lower-level library in PMDK that notably supports flushing: it tracks every store to pmem and makes sure the data is persisted.
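A minimal usage sketch of libpmemblk is shown below, assuming PMDK is installed and the program is linked with -lpmemblk; the pool path /mnt/pmem3/blk.pool and the block size are illustrative only.

#include <libpmemblk.h>
#include <stdio.h>
#include <string.h>

#define POOL_PATH  "/mnt/pmem3/blk.pool"   /* hypothetical file on a DAX mount */
#define BLOCK_SIZE 4096

int main(void)
{
    /* Create a pool of fixed-size blocks, or open it if it already exists. */
    PMEMblkpool *pbp = pmemblk_create(POOL_PATH, BLOCK_SIZE, PMEMBLK_MIN_POOL, 0666);
    if (pbp == NULL)
        pbp = pmemblk_open(POOL_PATH, BLOCK_SIZE);
    if (pbp == NULL) { perror("pmemblk_create/open"); return 1; }

    printf("pool holds %zu blocks\n", pmemblk_nblock(pbp));

    char buf[BLOCK_SIZE] = "hello pmem block";
    if (pmemblk_write(pbp, buf, 0) != 0)      /* each block write is atomic */
        perror("pmemblk_write");

    char out[BLOCK_SIZE];
    if (pmemblk_read(pbp, out, 0) == 0)
        printf("read back: %s\n", out);

    pmemblk_close(pbp);
    return 0;
}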
In addition, PMDK provides other programming libraries, such as libpmemobj, libpmemlog and libvmmalloc. If you are interested, visit the PMDK home page for more information.
Conclusion
At this point, we have a general understanding of the differences among NVDIMM hardware and software options. Intel's launch of Intel® Optane™ DC Persistent Memory based on 3D XPoint™ technology in May 2018 marked a tipping point for NVDIMM.
2.4. The above contents can be summarized as follows
NVDIMM classification
NVDIMM-N: memory-mapped DRAM, providing a byte-addressable (character device) interface; it has the best performance and the smallest capacity of the three products.
NVDIMM-F: memory-mapped flash, providing only a block device interface; the NAND flash is attached directly to a memory controller channel.
NVDIMM-P: under development; provides both block device and character device access.
Characteristics
NVDIMM-N: can be used both as a cache and as a block storage device; a typical representative is Intel's AEP.
NVDIMM-F: unlike NVDIMM-N, which is mainly used as cache, NVDIMM-F is mainly used for storage and can be used to quickly build high-density storage pools.
2.4.1 Building a file system on NVDIMM
The file system designed for PMEM is NOVA Filesystem. Interested readers can refer to NOVA's github.
ZUFS is a project from NetApp; its full name is Zero-copy User Filesystem. It claims to implement complete zero-copy, even for file system metadata. ZUFS is designed mainly for PMEM but can also support traditional disk devices; it is roughly a zero-copy version of FUSE and improves on FUSE's performance.
2.4.2 When used in DRAM-like mode:
2.4.2.1 It supports system-wide power-failure protection; in many scenarios the two-phase commit-and-flush schemes used to guard against data loss on unexpected power-off can be simplified or omitted.
2.4.2.2 It provides a new physical storage tier between DRAM and SSD.
2.4.2.3 Because its access speed can be 1-3 orders of magnitude higher than that of SSD when used like DRAM, some file systems can drop their dependence on the page cache, giving better control over the latency and service stability of upper-layer applications.
DAX: as the name implies, DAX is Direct Access, bypassing the page cache. Reads and writes operate directly on the data in PMEM, and the file system must be mounted with the "-o dax" option. DAX greatly improves file system performance on PMEM devices, but some issues remain unresolved, for example:
The metadata of the file system still requires the use of page cache or buffer cache.
The "- o dax" mount option controls the entire file system and does not have more fine-grained control.
There is no API to tell an application whether the file it is accessing can be accessed through DAX.
3. The implementation of NVDIMM under Linux.
Persistent memory is a new type of computer storage: its speed approaches that of dynamic RAM (DRAM), and it combines the byte addressability of RAM with the non-volatility of solid state drives (SSD). Like traditional RAM, persistent memory is installed directly in memory slots on the motherboard, so it has the same physical form factor as RAM and is supplied as DIMMs. Such modules are called NVDIMMs: non-volatile dual in-line memory modules.
However, unlike RAM, persistent memory resembles flash-based SSDs in many ways. Both are forms of solid-state memory circuitry, and both provide non-volatile storage: the contents are retained when the system is powered off or rebooted. For both media, writing data is slower than reading it, and both support only a limited number of rewrite cycles. Finally, as with SSDs, sector-level access to persistent memory is possible if that is more suitable for a particular application scenario.
Different models use different forms of electronic storage media, such as Intel 3D XPoint, or use NAND flash memory in combination with DRAM. In addition, the industry is developing new forms of non-volatile RAM. This means that different vendors and NVDIMM models provide different performance and persistence characteristics.
Because the storage technologies involved are still at an early stage of development, hardware from different vendors may impose different restrictions; the following statements are therefore generalizations.
Persistent memory is up to 10 times slower than DRAM, but about 1000 times faster than flash memory. Data can be rewritten in bytes, unlike in flash memory, where the entire sector needs to be erased and then rewritten. Although the number of rewrite cycles is limited, most forms of persistent memory can handle millions of rewrites, compared with flash memory that can handle only a few thousand cycles.
This has two important consequences. With current technology a system cannot run on persistent memory alone, so completely non-volatile main memory is not achievable; a mixture of traditional RAM and NVDIMMs must be used. The operating system and applications execute in traditional RAM, while the NVDIMMs provide extremely fast supplementary storage.
Due to the different performance characteristics of persistent memory from different vendors, programmers may need to consider the hardware specifications of NVDIMM on a particular server, including the number of NVDIMM, and which memory slots they can fit into. Obviously, this will have an impact on the use of hypervisors, software migration between different hosts, and so on.
This new storage subsystem is defined in version 6 of the ACPI standard, but libnvdimm also supports pre-standard NVDIMMs, which can be used in the same way.
3.1. Persistent memory (PMEM)
Like RAM, PMEM storage provides byte-level access. When PMEM is used, a single namespace can span multiple interleaved NVDIMMs, so that they appear as a single device. A PMEM namespace can be configured in two ways.
Use PMEM with DAX
When a PMEM namespace is configured for Direct Access (DAX), memory accesses bypass the kernel's page cache and go straight to the medium; software can read or write every byte of the namespace directly.
Use PMEM with BTT
A PMEM namespace configured to run in BTT mode is accessed sector by sector, like a traditional disk drive, rather than byte-addressed like RAM; a translation table mechanism batches access activity into sector-sized units.
The advantage of BTT is that the storage subsystem ensures that each sector is completely written to the underlying medium, and if a write fails for some reason it is rolled back, so a given sector can never be partially written. In addition, access to BTT namespaces goes through the kernel's cache. The disadvantage is that BTT namespaces do not support DAX.
3.2. Tools for managing persistent memory
To manage persistent memory, the ndctl package must be installed. Installing it also installs the libndctl package, which provides a set of user-space libraries for configuring NVDIMMs. These tools work on top of the libnvdimm kernel subsystem, which supports three types of NVDIMM:
PMEM
BLK
Devices that support both PMEM and BLK access modes.
The ndctl utility provides a series of useful man pages that can be accessed using the following commands:
ndctl help subcommand
To view a list of available subcommands, use:
ndctl --list-cmds
The available subcommands are:
version: displays the current version of the NVDIMM support tools.
enable-namespace: makes the specified namespace available.
disable-namespace: prevents use of the specified namespace.
create-namespace: creates a new namespace on the specified storage device.
destroy-namespace: removes the specified namespace.
enable-region: makes the specified region available.
disable-region: prevents use of the specified region.
zero-labels: erases the metadata from a device.
read-labels: retrieves the metadata of the specified device.
list: displays the available devices.
help: displays information about using the tool.
3.3. Set persistent memory
3.3.1 View available NVDIMM storage
You can use the ndctl list command to list all available NVDIMMs in the system. In the following example, the system contains three NVDIMMs arranged in a single triple-channel interleaved set.
ndctl list --dimms
[
  { "dev": "nmem2", "id": "8089-00-0000-12325476" },
  { "dev": "nmem1", "id": "8089-00-0000-11325476" },
  { "dev": "nmem0", "id": "8089-00-0000-10325476" }
]
With different parameters, ndctl list can also list the available regions.
Note: regions may not be displayed in numerical order.
Note that although there are only three NVDIMMs, they are displayed as four regions.
ndctl list --regions
[
  { "dev": "region1", "size": 68182605824, "available_size": 68182605824, "type": "blk" },
  { "dev": "region3", "size": 202937204736, "available_size": 202937204736, "type": "pmem", "iset_id": 5903239628671731251 },
  { "dev": "region0", "size": 68182605824, "available_size": 68182605824, "type": "blk" },
  { "dev": "region2", "size": 68182605824, "available_size": 68182605824, "type": "blk" }
]
The space is presented in two different forms: either as three separate 64 GB regions of type BLK, or as one combined 189 GB region of type PMEM that presents all the space of the three interleaved NVDIMMs as a single volume.
Notice that available_size equals size, which means that no space has been allocated yet.
3.3.2 Configure the storage as a single PMEM namespace with DAX
The first example configures the three NVDIMMs into a single PMEM namespace with Direct Access (DAX). The first step is to create a new namespace.
ndctl create-namespace --type=pmem --mode=fsdax --map=memory
{
  "dev": "namespace3.0",
  "mode": "memory",
  "size": 199764213760,
  "uuid": "dc8ebb84-c564-4248-9e8d-e18543c39b69",
  "blockdev": "pmem3"
}
This creates a block device /dev/pmem3 that supports DAX. The 3 in the device name is inherited from the parent region number (region3 in this case).
The --map=memory option sets aside part of the PMEM storage space on the NVDIMMs for allocating internal kernel data structures called struct pages. This allows the new PMEM namespace to be used with features such as O_DIRECT I/O and RDMA.
In the end, the capacity of the PMEM namespace is smaller than that of the parent PMEM region because some persistent memory is reserved for the kernel data structure.
Next, we verify that the new block device is available for the operating system:
fdisk -l /dev/pmem3
Disk /dev/pmem3: 186 GiB, 199764213760 bytes, 390164480 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Like any other drive, the device must be formatted before it can be used. In this example, we use XFS to format it:
mkfs.xfs /dev/pmem3
...
Next, you can mount the new drive to a directory:
mount -o dax /dev/pmem3 /mnt/pmem3
You can then verify that you have a device that supports DAX:
mount | grep dax
/dev/pmem3 on /mnt/pmem3 type xfs (rw,relatime,attr2,dax,inode64,noquota)
As a result, we have a PMEM namespace formatted with the XFS file system and mounted with the DAX option.
Any mmap() call on a file in this file system returns a virtual address that maps directly to the persistent memory on the NVDIMMs, completely bypassing the page cache. Any fsync or msync call on a file in this file system still ensures that modified data is fully written to the NVDIMMs: such calls flush the processor cache lines of any page modified in user space through the mmap mapping.
3.3.2.1 remove namespaces
Before creating any other type of volume that uses the same storage, we must unmount this PMEM volume and then remove it.
First unmount the volume:
umount /mnt/pmem3
Then disable the namespace:
ndctl disable-namespace namespace3.0
disabled 1 namespace
Then delete the volume:
ndctl destroy-namespace namespace3.0
destroyed 1 namespace
3.3.3 create PMEM namespaces that use BTT
In the next example, we will create a PMEM namespace that uses BTT.
ndctl create-namespace --type=pmem --mode=sector
{
  "dev": "namespace3.0",
  "mode": "sector",
  "uuid": "51ab652d-7f20-44ea-b51d-5670454f8b9b",
  "sector_size": 4096,
  "blockdev": "pmem3s"
}
Next, verify the existence of the new device:
fdisk -l /dev/pmem3s
Disk /dev/pmem3s: 188.8 GiB, 202738135040 bytes, 49496615 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Like the previously configured PMEM namespace that supports DAX, this PMEM namespace that supports BTT also takes up all available storage in NVDIMM.
Note: the trailing s in the device name (/dev/pmem3s) stands for sector and makes it easy to identify namespaces configured to use BTT.
Volumes can be formatted and mounted as described in the previous example.
The PMEM namespace shown here cannot use DAX; instead it uses BTT to provide sector write atomicity. On each sector write through the PMEM block driver, BTT allocates a new sector to receive the new data; only after the new data is fully written does BTT atomically update its internal mapping structure, making the newly written data visible to applications. If power fails at any point in this process, the write is lost completely and the application still sees its old data intact. This prevents the condition known as "sector tearing".
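The remapping idea can be illustrated with a deliberately simplified, single-writer model in C; this is only a conceptual sketch of the approach described above, not the kernel's BTT implementation: new data is written to a spare physical block first, and only then is the logical-to-physical map entry switched atomically.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 4096
#define NSECTORS    8

/* Simplified single-writer model of the BTT idea: data lands in a spare
 * physical block first, then the map entry is switched atomically, so a
 * reader sees either the whole old sector or the whole new one. */
static char blocks[NSECTORS + 1][SECTOR_SIZE];          /* one spare block    */
static _Atomic uint32_t map[NSECTORS];                  /* logical -> physical */
static uint32_t spare = NSECTORS;

static void btt_write(uint32_t lba, const char *buf)
{
    uint32_t target = spare;                            /* 1. pick the free block   */
    memcpy(blocks[target], buf, SECTOR_SIZE);           /* 2. write new data fully  */
    uint32_t old = atomic_exchange(&map[lba], target);  /* 3. atomic map update     */
    spare = old;                                        /* 4. recycle the old block */
}

static const char *btt_read(uint32_t lba)
{
    return blocks[atomic_load(&map[lba])];
}

int main(void)
{
    for (uint32_t i = 0; i < NSECTORS; i++)             /* identity mapping at start */
        atomic_init(&map[i], i);

    char sector[SECTOR_SIZE] = "new contents of sector 3";
    btt_write(3, sector);
    printf("%s\n", btt_read(3));
    return 0;
}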
Like any other standard block device, this BTT-enabled PMEM namespace can be formatted with a file system and used through it. It cannot be used with DAX; however, mmap mappings of files on this block device will use the page cache.
3.4. Using memory (DRAM) to emulate persistent memory (Persistent Memory)
3.4.1 Quick version: on a stock kernel, only two steps are needed to emulate persistent memory.
1) configure grub:
vim /etc/default/grub
Add the following statement; the first value is the size to emulate and the second is the offset in memory at which the emulated persistent memory starts. That is, starting at the 4G offset, 32G of memory is set aside to emulate persistent memory.
GRUB_CMDLINE_LINUX= "memmappings 32G4G"
2) Update grub
Update-grub & & reboot
3.4.2 Deep Analysis
At present real persistent memory is not readily available to ordinary users, so for experiments and tests it may be necessary to emulate it. Here we carve out a region of memory on a host to emulate persistent memory.
Environment: Ubuntu 18.04, an ordinary Dell desktop, running 8G of memory.
The Linux kernel has supported persistent memory devices and emulation since Linux 4.0, but for ease of configuration a kernel newer than 4.2 is recommended. In the kernel, a PMEM-capable environment is created using the DAX extensions to the file system. Some distributions, such as Fedora 24 and later, have built-in DAX/PMEM support.
To see if the kernel supports DAX and PMEM, you can use the following command:
# egrep '(DAX|PMEM)' /boot/config-`uname -r`
If there is built-in support, it will output something like the following:
CONFIG_X86_PMEM_LEGACY_DEVICE=y
CONFIG_X86_PMEM_LEGACY=y
CONFIG_BLK_DEV_RAM_DAX=y
CONFIG_BLK_DEV_PMEM=m
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_ARCH_HAS_PMEM_API=y
Unfortunately, our Ubuntu 18.04 does not have built-in support for DAX/PMEM, so entering the above command does not have any output. Next, simulate persistent memory on Ubuntu 18.04. Since DAX and PMEM are not supported by default on Ubuntu 18.04, we need to recompile the kernel and add the relevant settings to the configuration options for compiling the kernel.
Recompile the kernel here, and the chosen version is Linux-4.15.
First enter the command:
make nconfig
Go to the following configuration interface to configure PMEM and DAX
Device Drivers -> NVDIMM Support -> PMEM; BLK; BTT; NVDIMM DAX
Configure PMEM
First go to Device Drivers and find NVDIMM Support. You need to scroll down the menu: the content is more than the first page shown, and NVDIMM Support is not on the first page.
Enter the NVDIMM Support and select all the contents:
PMEM; BLK; BTT; NVDIMM DAX
Configure the file system DAX
Use esc to return to the initial page of make nconfig
File systems -> Direct Access (DAX) support
Configure processor properties
Use esc to return to the initial page of make nconfig
Processor type and features -> Support non-standard NVDIMMs and ADR protected memory
In fact, all of the above options are already enabled by default in Linux 4.15, so running make nconfig is all that is needed.
After all this is configured, the kernel is compiled and installed:
# make -j9
# make modules_install install
Then boot into the newly compiled Linux 4.15 kernel.
Print out the e820 table using the following command:
dmesg | grep e820
Get the following content:
[0.000000] e820: BIOS-provided physical RAM map:
[0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
...
The regions marked usable above are available to us, and we can carve part of one out as persistent memory. It is recommended to choose:
[0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000021f5fffff] usable
0x0000000100000000 is 4G, and you need to configure grub to set it:
vim /etc/default/grub
Here I configure 4G of space to emulate persistent memory. Add the following statement to grub, meaning that a 4G region starting at the 4G offset in memory is used to emulate persistent memory:
GRUB_CMDLINE_LINUX= "memmappings 4G4G"
Once configured, update grub: update-grub
Use the following command to see if it is successful:
dmesg | grep user
As you can see, this region is now treated as persistent memory; a pmem0 device then appears under the host's /dev directory, and the emulated persistent memory is ready to use.
How to use it: create a DAX file system
Take the ext4 file system as an example
mkdir /mnt/pmemdir
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmemdir
This mounts the persistent memory at the directory /mnt/pmemdir, which can then be used like any other directory.
Reference source: How to Emulate Persistent Memory on Intel® Architecture Servers
3.4.3 use the memmap kernel option
The pmem driver allows users to use EXT4 and XFS as direct-access (DAX) file systems. A memmap option was added that allows one or more ranges of unallocated memory to be reserved as emulated persistent memory; the memmap parameter is documented in the Linux kernel documentation. This feature went upstream in the v4.0 kernel; kernel v4.15 introduced performance improvements and is recommended for production environments.
The memmap option uses the format memmap=nn[KMG]!ss[KMG], where nn is the size of the region to reserve, ss is the starting offset, and [KMG] specifies the unit in kilobytes, megabytes or gigabytes. The option is passed to the kernel through GRUB, and the way GRUB menu entries and kernel parameters are changed varies from one Linux distribution to another. Instructions for some common distributions follow; for details, see the documentation for the distribution and version you are using.
The memory area will be marked as e820 type 12 (0xc), which is visible at boot time. Use the dmesg command to view these messages.
$ dmesg | grep e820
memmap=4G!12G in the GRUB configuration reserves 4GB of memory starting at the 12GB offset, that is, from 12GB to 16GB. For details, see "How to choose the correct memmap option for the system" below. Each Linux distribution has its own way of modifying the GRUB configuration; just follow the distribution's documentation. Some common distributions are listed below for quick reference.
1), Ubuntu
$ sudo vim /etc/default/grub
GRUB_CMDLINE_LINUX="memmap=4G!12G"
Restart the machine after updating the grub
$ sudo update-grub2
2), RHEL
$ sudo vi /etc/default/grub
GRUB_CMDLINE_LINUX="memmap=4G!12G"
Then update the grub configuration:
On BIOS-based machines:
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
On UEFI-based machines:
$ sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
Multiple regions can be configured. For example, memmap=2G!12G memmap=2G!14G creates two 2GB regions, one at the 12GB-14GB memory address offset and the other at 14GB-16GB.
After the host restarts, there should be a new /dev/pmem{N} device for each memmap region specified in the GRUB configuration. They can be listed with ls /dev/pmem*; naming starts at /dev/pmem0 and increments for each device. The /dev/pmem{N} devices can be used to create DAX file systems.
Use the / dev/pmem device to create and mount the file system, and then verify that the dax flag is set for the mount point to confirm that the dax feature is enabled. The following shows how to create and mount an EXT4 or XFS file system.
1), XFS
mkfs.xfs /dev/pmem0
mkdir /pmem && mount -o dax /dev/pmem0 /pmem
mount -v | grep /pmem
/dev/pmem0 on /pmem type xfs (rw,relatime,seclabel,attr2,dax,inode64,noquota)
2), EXT4
mkfs.ext4 /dev/pmem0
mkdir /pmem && mount -o dax /dev/pmem0 /pmem
mount -v | grep /pmem
/dev/pmem0 on /pmem type ext4 (rw,relatime,seclabel,dax,data=ordered)
How to choose the correct memmap option for the system
When selecting values for the memmap kernel parameter, make sure the start and end addresses fall within usable RAM: using or overlapping reserved memory can lead to corruption or undefined behavior. The memory map is easy to obtain from the e820 table via dmesg.
The following example server has 16GiB memory, and the available memory is between 4GiB (0x100000000) and ~ 16GiB (0x3ffffffff):
$ dmesg | grep BIOS-e820
[0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
[0.000000] BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000003ffffffff] usable
To reserve the 12GiB free space between 4GiB and 16GiB as simulated persistent memory, the syntax is as follows:
memmap=12G!4G
After a reboot, a new user-defined e820 entry shows the range as "persistent (type 12)":
$ dmesg | grep user:
[0.000000] user: [mem 0x0000000000000000-0x000000000009fbff] usable
[0.000000] user: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[0.000000] user: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[0.000000] user: [mem 0x0000000000100000-0x00000000bffdffff] usable
[0.000000] user: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
[0.000000] user: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[0.000000] user: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[0.000000] user: [mem 0x0000000100000000-0x00000003ffffffff] persistent (type 12)
The fdisk or lsblk programs can be used to display the capacity, for example:
# fdisk -l /dev/pmem0
Disk /dev/pmem0: 12 GiB, 12884901888 bytes, 25165824 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

# lsblk /dev/pmem0
NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0    0  12G  0 disk /pmem
Note: most Linux distributions enable kernel address space layout randomization (KASLR), which is defined by CONFIG_RANDOMIZE_BASE. When enabled, the kernel may use memory previously reserved for persistent memory without warning, resulting in corrupted or undefined behavior, so it is recommended to disable KASLR on 16GiB or lower systems. For more information, refer to the corresponding Linux distribution documentation, as each distribution is different.
At this point you should have a deeper understanding of Linux file systems and persistent memory; the best next step is to try the examples yourself.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.