
In-depth analysis of Linux I/O principles and the implementation of several zero-copy mechanisms


Preface

Zero-copy technology means that the CPU does not need to copy data from one memory area to another while the computer performs an operation, thereby reducing context switches and CPU copy time. Its purpose is to reduce the number of data copies and system calls while a datagram is transferred from a network device to user program space, achieving zero CPU involvement in the copy and completely eliminating the CPU's load in this respect. The most important technologies used to implement zero copy are DMA data transfer and memory area mapping.

The zero-copy mechanism can reduce repeated I/O copy operations between the kernel buffer and the user process buffer, and it can also reduce the CPU overhead caused by context switching between the user process address space and the kernel address space.

1. Physical memory and virtual memory

Because CPU and memory resources are shared between processes in the operating system, a complete memory management mechanism is needed to prevent memory from leaking between processes. In order to manage memory more effectively and with fewer errors, modern operating systems provide an abstraction of main memory, namely virtual memory (Virtual Memory). Virtual memory provides a consistent, private address space for each process, giving each process the illusion that it has main memory all to itself (each process has a contiguous and complete memory space).

1.1. Physical memory

Physical memory (Physical Memory) is the counterpart of virtual memory (Virtual Memory). Physical memory is the memory space provided by physical memory modules, while virtual memory sets aside an area of the hard disk to be used as memory. The main function of memory is to provide temporary storage for the operating system and various programs while the computer is running. Physically, it is simply the real capacity of the memory modules inserted into the memory slots on the motherboard.

1.2. Virtual memory

Virtual memory is a memory-management technique in computer systems. It makes the application believe that it has contiguous available memory (a contiguous and complete address space), while in fact its memory is usually split into multiple fragments of physical memory, with some parts temporarily stored on external disk storage and swapped into physical memory when needed. Most current operating systems use virtual memory, such as the virtual memory of Windows and the swap space of Linux.

The virtual memory address is closely tied to the user process. Generally speaking, the same virtual address corresponds to different physical addresses in different processes, so it is meaningless to talk about virtual memory without a process. The size of the virtual address space each process can use depends on the CPU word size: on a 32-bit system the virtual address space is 2^32 = 4G; on a 64-bit system it is theoretically 2^64 bytes (current x86-64 CPUs actually use 48-bit addresses, about 256 TB), and the actual physical memory may be far smaller than the virtual address space. Each user process maintains a separate page table (Page Table), through which virtual memory and physical memory are mapped. The following is a schematic diagram of the address mapping between the virtual memory spaces of two processes A and B and the corresponding physical memory:

[Figure: virtual-to-physical address mapping of processes A and B]

When a process executes a program, it first needs to fetch the process's instructions from memory and then execute them; a virtual address is used to fetch each instruction. This virtual address is determined when the program is linked (the kernel adjusts the address ranges of dynamic libraries when it loads and initializes the process). To obtain the actual data, the CPU needs to translate the virtual address into a physical address. For this translation the CPU uses the process's page table (Page Table), whose data is maintained by the operating system.

The page table (Page Table) can be simply understood as a linked list of individual memory mappings (Memory Mapping) (the actual structure is of course much more complex), where each memory mapping maps a block of virtual addresses to a specific address space (physical memory or disk storage). Each process has its own page table, which is unrelated to the page tables of other processes.

Through the above introduction, we can simply summarize the process that the user process requests and accesses physical memory (or disk storage space) as follows:

1. The user process sends a memory request to the operating system.
2. The system checks whether the process's virtual address space is exhausted. If there is room left, it assigns a virtual address to the process, creates a memory mapping (Memory Mapping) for that virtual address, and puts it into the process's page table (Page Table).
3. The system returns the virtual address to the user process, and the user process begins to access the virtual address.
4. The CPU looks up the corresponding memory mapping in the process's page table according to the virtual address, but this memory mapping is not yet associated with physical memory, so a page fault interrupt is raised.
5. The operating system receives the page fault interrupt, allocates real physical memory, and associates it with the corresponding memory mapping in the page table.
6. After the interrupt handling completes, the CPU can access the memory.

Of course, a page fault interrupt does not occur every time; it is only used when the system decides to defer memory allocation. In many cases the system allocates real physical memory and associates it with the memory mapping at the time the mapping is created (step 2 above). This demand-paging behaviour can be observed from user space, as in the small sketch below.
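A minimal C sketch (not from the original article) that makes the delayed allocation visible: an anonymous mmap() only reserves virtual address space, and physical pages are attached lazily, one page fault at a time, when each page is first touched.

/* Demand paging sketch: mmap reserves virtual memory only;
 * page faults attach physical frames on first write. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 64 * 1024 * 1024;                       /* 64 MB of virtual memory */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  /* no physical pages yet */
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touching the pages triggers page faults; only now does the kernel
     * associate physical frames with the corresponding memory mappings. */
    memset(p, 0x5a, len);

    munmap(p, len);
    return 0;
}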

Introducing virtual memory between user processes and physical memory (disk storage) has the following main advantages:

- Address space: provides a larger, contiguous address space, which makes programming and linking easier.
- Process isolation: the virtual addresses of different processes are unrelated, so the operations of one process cannot affect other processes.
- Data protection: each block of virtual memory has read/write attributes, so program code segments can be protected from modification and data blocks can be prevented from being executed, improving system security.
- Memory mapping: with virtual memory, files on disk (executable files or dynamic libraries) can be mapped directly into the virtual address space. This allows delayed allocation of physical memory: data is loaded from disk only when the corresponding file content is actually read, and this memory can be reclaimed when memory is tight, improving the utilization of physical memory, all of which is transparent to the application.
- Shared memory: for example, a dynamic library needs only one copy in physical memory and is then mapped into the virtual address spaces of different processes, so each process feels it owns the file exclusively. Processes can also share memory by mapping the same physical memory into different processes' virtual address spaces.
- Physical memory management: all of physical memory is managed by the operating system, and processes cannot allocate or reclaim it directly, so the system can make better use of memory and balance memory demands between processes.

2. Kernel space and user space

The core of the operating system is the kernel, which is independent of ordinary applications and can access protected memory space as well as the underlying hardware devices. In order to prevent user processes from directly operating the kernel and ensure kernel security, the operating system divides virtual memory into two parts, one is kernel space (Kernel-space) and the other is user space (User-space). In the Linux system, the kernel module runs in the kernel space and the corresponding process is in the kernel state, while the user program runs in the user space and the corresponding process is in the user state.

Kernel processes and user processes occupy virtual memory in a ratio of 1:3. The addressing space (virtual address space) of a 32-bit Linux x86 system is 4G (2 to the 32nd power): the highest 1G bytes (from virtual address 0xC0000000 to 0xFFFFFFFF) are used by the kernel and called kernel space, while the lower 3G bytes (from virtual address 0x00000000 to 0xBFFFFFFF) are used by individual user processes and called user space. The following figure shows the memory layout of the user space and kernel space of a process:

2.1. Kernel space

Kernel space always resides in memory, which is reserved for the kernel of the operating system. Applications are not allowed to read and write directly in this area or directly call functions defined by kernel code. The left area of the above figure is the virtual memory corresponding to the kernel process, which can be divided into two areas: private and shared by the process according to the access rights.

- Process-private virtual memory: each process has its own kernel stack, page tables, task structure, mem_map structure, and so on.
- Virtual memory shared by processes: a memory area shared by all processes, including physical memory, the kernel data area and the kernel code area.

2.2. User space

Each ordinary user process has a separate user space, and the process in the user mode cannot access the data in the kernel space and cannot directly call the kernel function, so when it is necessary to make a system call, you have to switch the process to the kernel state. User space includes the following areas of memory:

- Runtime stack: automatically allocated and released by the compiler; stores function parameter values, local variables, method return values, and so on. Whenever a function is called, its return information and some call details are pushed onto the top of the stack; after the call, that information is popped and the memory is freed. The stack grows from high addresses to low addresses and is a contiguous area whose maximum size is predefined by the system; requesting more stack space than this limit triggers an overflow, so the space a user can obtain from the stack is small.
- Runtime heap: stores the dynamically allocated memory segments of a process and lies between the BSS segment and the stack. It is allocated and released by the programmer (malloc/free). The heap grows from low addresses to high addresses and uses a chained storage structure. Frequent malloc/free calls make the memory space discontinuous and produce a large number of fragments, and when heap space is requested the library function has to search for a large enough free block according to some algorithm, so the heap is much less efficient than the stack.
- Code segment: stores the machine instructions executed by the CPU. This part of memory can only be read, not written. The code area is usually shared, that is, it can be used by other executing programs; if several processes on the machine run the same program, they can share the same code segment.
- Uninitialized data segment (BSS): stores uninitialized global variables; BSS data is initialized to 0 or NULL before the program starts executing.
- Initialized data segment: stores initialized global variables, including static global variables, static local variables and constants.
- Memory mapping area: memory such as dynamic libraries and shared memory whose virtual space is mapped onto physical space, usually the virtual memory allocated by the mmap function.

3. Internal hierarchical structure of Linux

Kernel mode can execute any instruction and use all of the system's resources, while user mode can only perform simple operations and cannot use system resources directly; user-mode code must go through the system call interface (System Call) to issue instructions to the kernel. For example, when a user runs bash, it issues a system call to the kernel's pid service through getpid() to get the PID of the current user process; when the user uses the cat command to view a host configuration file, it issues a system call to the kernel's file subsystem.
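As a tiny illustration (not from the original article), even a trivial call such as getpid() crosses from user mode into kernel mode and back:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* getpid() is a system call: execution traps into the kernel,
     * the kernel reads the PID from the process descriptor, and
     * control returns to user mode with the result. */
    pid_t pid = getpid();
    printf("current pid: %d\n", (int) pid);
    return 0;
}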

Kernel space can execute all CPU instructions and access all memory space, I/O space and hardware devices. User space can only access restricted resources; if special privileges are required, the corresponding resources must be obtained through system calls. Page faults are allowed in user space but not in kernel space. Kernel space and user space are defined over the linear address space: on 32-bit x86 CPUs, user space occupies the 0-3G address range and kernel space the 3G-4G range; on x86-64 CPUs, the user-space address range is 0x0000000000000000-0x00007fffffffffff and the kernel address space runs from 0xffff880000000000 to the maximum address. All kernel processes (threads) share one address space, while each user process has its own address space.

With the division of user space and kernel space, the internal hierarchy of Linux can be divided into three parts, including hardware, kernel space and user space from the bottom to the top, as shown in the following figure:

4. Linux I/O read and write modes

Linux provides three mechanisms for transferring data between disk and main memory: polling, I/O interrupt and DMA transfer. Polling is based on a busy loop that continuously checks the I/O port; the interrupt mode means that when data arrives, the disk raises an interrupt to the CPU, and the CPU itself handles the data transfer; DMA transfer adds a DMA disk controller on top of I/O interrupts, and the DMA controller takes over the data transfer, greatly reducing the CPU resources consumed by interrupt-driven I/O.

4.1. I/O interrupt principle

Before DMA technology appeared, I/O between an application and the disk was done through CPU interrupts. Every time a user process reads disk data, the CPU is interrupted, an I/O request is issued, and the process waits for the data to be read and copied; each I/O interrupt also causes a CPU context switch.

1. The user process initiates a read system call to the CPU to read data, switches from user mode to kernel mode, and then blocks waiting for the data to return.
2. After receiving the instruction, the CPU issues an I/O request to the disk, and the disk first puts the data into the disk controller buffer.
3. After the data is ready, the disk raises an I/O interrupt to the CPU.
4. After receiving the I/O interrupt, the CPU copies the data from the disk controller buffer to the kernel buffer, and then from the kernel buffer to the user buffer.
5. The user process switches from kernel mode back to user mode, leaves the blocked state, and then waits for its next CPU time slice.

4.2. DMA transmission principle

The full name of DMA is Direct Memory Access, a mechanism that allows peripherals (hardware subsystems) to access the system's main memory directly. In other words, with DMA, data transfers between main memory and devices such as hard disks or network cards can bypass full CPU scheduling. At present, most hardware devices, including disk controllers, network cards, graphics cards and sound cards, support DMA.

The whole data transfer is carried out under the control of the DMA controller. Apart from a little processing at the beginning and end of the transfer (interrupt handling at start and completion), the CPU can do other work during the transfer. In this way, for most of the time CPU computation and I/O proceed in parallel, which greatly improves the efficiency of the whole computer system.

After the DMA disk controller takes over the data read/write requests, the CPU is freed from the heavy I/O work. The data read process is as follows:

1. The user process initiates a read system call to the CPU to read data, switches from user mode to kernel mode, and then blocks waiting for the data to return.
2. After receiving the instruction, the CPU issues a scheduling instruction to the DMA disk controller.
3. The DMA disk controller initiates an I/O request to the disk, and the disk first puts the data into the disk controller buffer; the CPU does not participate in this process.
4. After the data has been read, the disk notifies the DMA disk controller, which copies the data from the disk controller buffer to the kernel buffer.
5. The DMA disk controller signals the CPU that the data has been read, and the CPU copies the data from the kernel buffer to the user buffer.
6. The user process switches from kernel mode back to user mode, leaves the blocked state, and then waits for its next CPU time slice.

5. Traditional I/O mode

To better understand the problem that zero copy solves, let's first look at the problems of the traditional I/O approach. In a Linux system, traditional access is performed through two system calls, read() and write(): the file is read into a buffer with read(), and the buffered data is then written to the network port with write(). The pseudo code is as follows:

read(file_fd, tmp_buf, len);
write(socket_fd, tmp_buf, len);
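Expanding the pseudocode into a minimal C sketch (the descriptors file_fd and socket_fd and the buffer tmp_buf are hypothetical and assumed to be set up already), the two system calls and the intermediate user-space buffer look like this:

#include <unistd.h>

/* Traditional copy path: read() pulls the data from the kernel read buffer
 * into tmp_buf (user space), write() pushes it back into the kernel socket
 * buffer; each call costs two context switches. */
ssize_t copy_traditional(int file_fd, int socket_fd, char *tmp_buf, size_t len) {
    ssize_t n = read(file_fd, tmp_buf, len);        /* DMA copy + CPU copy */
    if (n <= 0)
        return n;
    return write(socket_fd, tmp_buf, (size_t) n);   /* CPU copy + DMA copy */
}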

The following figure corresponds to the read/write flow of a traditional I/O operation, which involves 2 CPU copies and 2 DMA copies, 4 copies in total, as well as 4 context switches. The related concepts are briefly described below.

- Context switch: when the user program initiates a system call to the kernel, the CPU switches the user process from user mode to kernel mode; when the system call returns, the CPU switches the user process from kernel mode back to user mode.
- CPU copy: the CPU handles the data transfer directly, and the copy occupies CPU resources for its entire duration.
- DMA copy: the CPU issues an instruction to the DMA disk controller, which handles the data transfer and notifies the CPU when it is done, thereby reducing the CPU resources consumed.

5.1. Traditional read operation

When an application executes a read system call to read a piece of data, if the data already exists in the page memory of the user process, the data is read directly from the memory; if the data does not exist, the data is first loaded from disk into the kernel space read cache (read buffer), and then copied from the read cache to the page memory of the user process.

read(file_fd, tmp_buf, len);

With the traditional I/O read mode, a read system call triggers 2 context switches, 1 DMA copy and 1 CPU copy. The steps for reading data are as follows:

1. The user process initiates a system call to the kernel through the read() function, and the context switches from user mode (user space) to kernel mode (kernel space).
2. The CPU uses the DMA controller to copy data from main memory or the hard disk into the read buffer (read buffer) in kernel space.
3. The CPU copies the data in the read buffer to the user buffer (user buffer) in user space.
4. The context switches from kernel mode back to user mode, and the read call returns.

5.2. Traditional write operation

When the application has prepared the data and executes the write system call to send network data, it first copies the data from the user-space buffer into the network buffer (socket buffer) in kernel space, and then copies the data from the socket buffer to the network card device to complete the transmission.

write(socket_fd, tmp_buf, len);

With the traditional I/O write mode, a write() system call triggers 2 context switches, 1 CPU copy and 1 DMA copy. The steps for the user program to send network data are as follows:

1. The user process initiates a system call to the kernel through the write() function, and the context switches from user mode to kernel mode.
2. The CPU copies the data from the user buffer (user buffer) to the network buffer (socket buffer) in kernel space.
3. The CPU uses the DMA controller to copy the data from the socket buffer to the network card for transmission.
4. The context switches from kernel mode back to user mode, and the write system call returns.

6. Zero copy mode

In Linux, zero-copy technology is mainly realized in three ways: user-mode direct I/O, reducing the number of data copies, and copy-on-write.

- User-mode direct I/O: the application accesses the hardware storage directly, and the operating system kernel only assists in the data transfer. Context switches between user space and kernel space still occur, but the data on the hardware is copied directly into user space without passing through kernel space, so there is no copy between the kernel-space buffer and the user-space buffer.
- Reducing the number of data copies: during data transfer, avoid the CPU copies between the user-space buffer and the kernel-space buffer, as well as CPU copies within kernel space. This is the implementation idea behind today's mainstream zero-copy techniques.
- Copy-on-write: when multiple processes share the same piece of data, a process that needs to modify it first copies it into its own address space; a pure read requires no copy at all.

6.1. User-mode direct I/O

User-mode direct I/O lets the application process, or a library function running in user mode (user space), access the hardware device directly; the data is transferred bypassing the kernel, and the kernel does not participate in anything other than the necessary virtual memory configuration during the transfer. This approach bypasses the kernel and can greatly improve performance.

User-mode direct I/O applies only to applications that do not need kernel buffer processing. Such applications usually maintain their own data caches in the process address space and are called self-caching applications, e.g. database management systems. In addition, because this zero-copy mechanism operates on disk I/O directly, and there is a large gap between CPU speed and disk I/O speed, it can waste a lot of resources; the usual solution is to combine it with asynchronous I/O.

6.2. mmap + write

One zero-copy method is to use mmap + write instead of the original read + write, which removes one CPU copy operation. mmap is a memory-mapped file mechanism provided by Linux that maps a range of virtual addresses in the process address space to a disk file address. The pseudo code of mmap + write is as follows:

tmp_buf = mmap(file_fd, len);
write(socket_fd, tmp_buf, len);

The purpose of using mmap is to map the address of the kernel read buffer (read buffer) to the user space buffer (user buffer), thus realizing the sharing of the kernel buffer and the application memory, eliminating the process of copying data from the kernel read buffer (read buffer) to the user buffer (user buffer). However, the kernel read buffer (read buffer) still needs to transfer the data to the kernel write buffer (socket buffer). The general process is shown in the following figure:

Based on the zero-copy mode of mmap + write system call, there will be 4 context switches, 1 CPU copy and 2 DMA copies in the whole copying process. The process for user programs to read and write data is as follows:

1. The user process initiates a system call to the kernel through the mmap() function, and the context switches from user mode to kernel mode.
2. The read buffer (read buffer) in the kernel space of the user process is mapped to the user-space buffer (user buffer).
3. The CPU uses the DMA controller to copy data from main memory or the hard disk into the kernel-space read buffer.
4. The context switches from kernel mode back to user mode, and the mmap system call returns.
5. The user process initiates a system call to the kernel through the write() function, and the context switches from user mode to kernel mode.
6. The CPU copies the data in the read buffer to the network buffer (socket buffer).
7. The CPU uses the DMA controller to copy the data from the socket buffer to the network card for transmission.
8. The context switches from kernel mode back to user mode, and the write system call returns.

A minimal C sketch of this mmap + write sequence follows the list.
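The sketch below (not from the original article; the descriptors are assumed to be open already) shows how mmap() shares the kernel read buffer with user space, so the only CPU copy left is the kernel-internal copy into the socket buffer performed by write().

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* mmap + write path: map the file, then write the mapped region to the socket. */
ssize_t copy_mmap_write(int file_fd, int socket_fd) {
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    void *tmp_buf = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, file_fd, 0);
    if (tmp_buf == MAP_FAILED)
        return -1;

    /* The kernel copies page cache -> socket buffer; no copy into user space. */
    ssize_t n = write(socket_fd, tmp_buf, st.st_size);

    munmap(tmp_buf, st.st_size);
    return n;
}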

mmap is mainly used to improve I/O performance, especially for large files. For small files, memory-mapped files waste fragmented space, because a memory mapping is always aligned to page boundaries with a minimum unit of 4 KB: a 5 KB file will be mapped to 8 KB of memory, wasting 3 KB.

Although mmap saves one copy and improves efficiency, it has hidden problems. When a file is mmap-ed and then truncated by another process, the write system call is terminated by a SIGBUS signal because it accesses an invalid address. By default SIGBUS kills the process and generates a coredump, and a server may be brought down as a result.

6.3. Sendfile

The sendfile system call was introduced in Linux kernel 2.1 to simplify the transfer of data between two channels over the network. The introduction of sendfile not only reduces the number of CPU copies but also reduces the number of context switches. Its pseudo code is as follows:

sendfile(socket_fd, file_fd, len);

Through the sendfile system call, data can be transferred directly inside kernel space, eliminating the back-and-forth copying between user space and kernel space. Unlike the mmap memory-mapping approach, the I/O data in a sendfile call is completely invisible to user space; in other words, it is a data transfer in the full sense.

Based on the zero-copy mode of sendfile system call, there will be 2 context switches, 1 CPU copy and 2 DMA copy in the whole copying process. The process for user program to read and write data is as follows:

1. The user process initiates a system call to the kernel through the sendfile() function, and the context switches from user mode to kernel mode.
2. The CPU uses the DMA controller to copy data from main memory or the hard disk into the read buffer (read buffer) in kernel space.
3. The CPU copies the data in the read buffer to the network buffer (socket buffer).
4. The CPU uses the DMA controller to copy the data from the socket buffer to the network card for transmission.
5. The context switches from kernel mode back to user mode, and the sendfile system call returns.

A minimal C sketch of this call sequence follows the list.
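The sketch below (not from the original article; descriptors assumed open) issues a single sendfile() call, so the data moves disk -> kernel read buffer -> socket buffer -> NIC without ever being copied into user space.

#include <sys/sendfile.h>
#include <sys/stat.h>

/* sendfile path: one system call, no user-space copy. */
ssize_t copy_sendfile(int file_fd, int socket_fd) {
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    off_t offset = 0;
    /* On success, returns the number of bytes written to socket_fd. */
    return sendfile(socket_fd, file_fd, &offset, st.st_size);
}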

Compared to mmap memory mapping, sendfile has 2 fewer context switches, but still has 1 CPU copy operation. The problem with sendfile is that the user program cannot modify the data, but simply completes a data transmission process.

6.4. Sendfile + DMA gather copy

The Linux 2.4 kernel modified the sendfile system call and introduced a gather operation for the DMA copy. It records the description of the data (memory address and offset) in the kernel-space read buffer (read buffer) into the corresponding network buffer (socket buffer), and the DMA engine then copies the data in batches from the read buffer directly to the network card device according to that address and offset, thereby removing the last remaining CPU copy inside kernel space. The pseudo code of sendfile is unchanged:

sendfile(socket_fd, file_fd, len);

With hardware support, the sendfile copy mode no longer copies the data itself from the kernel buffer into the socket buffer; only the buffer's file descriptor and data length are copied, so that the DMA engine can use the gather operation to package the data in the page cache and send it directly to the network. In essence the idea is similar to virtual memory mapping.

Based on the zero copy mode of sendfile + DMA gather copy system call, there will be 2 context switches, 0 CPU copies and 2 DMA copies in the whole copy process. The process of reading and writing data by the user program is as follows:

1. The user process initiates a system call to the kernel through the sendfile() function, and the context switches from user mode to kernel mode.
2. The CPU uses the DMA controller to copy data from main memory or the hard disk into the read buffer (read buffer) in kernel space.
3. The CPU copies the file descriptor (file descriptor) and data length of the read buffer into the network buffer (socket buffer).
4. Based on the copied file descriptor and data length, the CPU uses the gather/scatter operation of the DMA controller to copy the data directly from the kernel read buffer to the network card for transmission.
5. The context switches from kernel mode back to user mode, and the sendfile system call returns.

The sendfile + DMA gather copy mode still has the problem that the user program cannot modify the data, and it requires hardware support; it is only suitable for copying data from a file to a socket.

6.5. Splice

sendfile is only suitable for copying data from a file to a socket and requires hardware support, which limits its scope of use. Linux introduced the splice system call in version 2.6.17; it requires no hardware support and implements zero copy between two file descriptors. The pseudo code of splice is as follows:

splice(fd_in, off_in, fd_out, off_out, len, flags);

The splice system call can establish a pipeline between the read buffer (read buffer) and the network buffer (socket buffer) in kernel space, thus avoiding the CPU copy operation between the two.

Based on the zero-copy mode of splice system call, there will be 2 context switches, 0 CPU copies and 2 DMA copies in the whole copying process. The process for user programs to read and write data is as follows:

1. The user process initiates a system call to the kernel through the splice() function, and the context switches from user mode to kernel mode.
2. The CPU uses the DMA controller to copy data from main memory or the hard disk into the read buffer (read buffer) in kernel space.
3. The CPU establishes a pipeline (pipe) between the read buffer and the network buffer (socket buffer) in kernel space.
4. The CPU uses the DMA controller to copy the data from the socket buffer to the network card for transmission.
5. The context switches from kernel mode back to user mode, and the splice system call returns.

The splice copy mode also has the problem that the user program cannot modify the data. In addition, it relies on Linux's pipe buffering mechanism: it can transfer data between any two file descriptors, but one of the two file descriptor arguments must be a pipe device. A minimal C sketch is shown below.
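The sketch below (not from the original article; descriptors assumed open) works around the pipe requirement by creating a pipe and issuing two splice() calls, moving the data file -> pipe -> socket entirely inside the kernel.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* splice path: one end of each splice() must be a pipe, so the data flows
 * file -> pipe -> socket with no CPU copy into user space. */
ssize_t copy_splice(int file_fd, int socket_fd, size_t len) {
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    /* file -> pipe (kernel space only) */
    ssize_t in = splice(file_fd, NULL, pipefd[1], NULL, len, SPLICE_F_MOVE);
    if (in <= 0) { close(pipefd[0]); close(pipefd[1]); return in; }

    /* pipe -> socket (kernel space only) */
    ssize_t out = splice(pipefd[0], NULL, socket_fd, NULL, (size_t) in, SPLICE_F_MOVE);

    close(pipefd[0]);
    close(pipefd[1]);
    return out;
}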

6.6. Copy-on-write

In some cases, a kernel buffer may be shared by multiple processes. If one process wants to write to this shared area, then because write provides no locking, the data in the shared area would be damaged. Copy-on-write is introduced to protect the data.

Copy-on-write means that when multiple processes share the same piece of data, a process that needs to modify the data first copies it into its own address space. This does not affect how other processes operate on the data, and each process copies it only when it actually modifies it, hence the name copy-on-write. This approach reduces system overhead to some extent: if a process never modifies the data it accesses, the copy never needs to happen.
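The classic place where the Linux kernel applies copy-on-write is fork(); the sketch below (not from the original article) makes the behaviour visible: after fork(), parent and child share the same physical pages, and a page is duplicated only when one side writes to it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t len = 16 * 1024 * 1024;
    char *buf = malloc(len);
    memset(buf, 1, len);            /* pages now backed by physical memory */

    pid_t pid = fork();             /* no page copies happen here */
    if (pid == 0) {
        buf[0] = 2;                 /* write triggers a fault: only this page is copied */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees %d\n", buf[0]);   /* prints 1: the parent's page is untouched */
    free(buf);
    return 0;
}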

6.7. Buffer sharing

The buffer-sharing approach completely rewrites the traditional I/O operations. Because the traditional I/O interfaces are all based on data copying, avoiding copies means discarding those interfaces and rewriting them, so this is a relatively comprehensive zero-copy technique. A relatively mature scheme at present is fbuf (Fast Buffer), implemented on Solaris.

The idea of fbuf is that every process maintains a buffer pool that can be mapped into both user space (user space) and kernel space (kernel space); the kernel and the user process share the buffer pool, thereby avoiding a series of copy operations.

The difficulty of buffer sharing is that managing shared buffer pools requires close cooperation among applications, network software and device drivers, and how to rewrite API is still in the experimental stage.

7. Linux zero copy comparison

Whether the traditional I/O copy mode or a zero-copy mode is used, 2 DMA copies are always required, because both of them are performed by hardware. The table below summarizes the differences between the above I/O copy methods in terms of CPU copies, DMA copies, system calls and context switches.

Copy mode                         CPU copies   DMA copies   System call    Context switches
Traditional mode (read + write)   2            2            read / write   4
Memory mapping (mmap + write)     1            2            mmap / write   4
sendfile                          1            2            sendfile       2
sendfile + DMA gather copy        0            2            sendfile       2
splice                            0            2            splice         2

8. Java NIO zero copy implementation

The Channel in Java NIO is equivalent to the buffer in the operating system's kernel space (kernel space), while the buffer (Buffer) corresponds to the user buffer (user buffer) in the operating system's user space (user space).

A channel (Channel) is full-duplex (bidirectional) and can play the role of either a read buffer (read buffer) or a network buffer (socket buffer). A Buffer is divided into heap memory (HeapBuffer) and off-heap memory (DirectBuffer); the latter is user-mode memory allocated with malloc().

Off-heap memory (DirectBuffer) must be reclaimed manually by the application after use, while data in heap memory (HeapBuffer) may be moved automatically during GC. Therefore, when HeapBuffer is used for I/O, to avoid losing buffer data because of GC, NIO first copies the contents of the HeapBuffer into a temporary DirectBuffer in native memory. This copy goes through sun.misc.Unsafe.copyMemory(), whose implementation is similar to memcpy(). Finally, the memory address of the temporary DirectBuffer is passed to the I/O calling function, which avoids accessing Java objects during I/O reads and writes.

8.1. MappedByteBuffer

MappedByteBuffer is the zero-copy implementation provided by NIO based on memory mapping (mmap); it inherits from ByteBuffer. FileChannel defines a map() method that maps a region of a file of length size, starting at position, into a memory-image file. The abstract map() method is defined in FileChannel as follows:

public abstract MappedByteBuffer map(MapMode mode, long position, long size) throws IOException;

- mode: the access mode of the memory-mapped area (MappedByteBuffer) to the memory-image file, including read-only (READ_ONLY), read-write (READ_WRITE) and copy-on-write (PRIVATE).
- position: the starting address of the file mapping, corresponding to the first address of the memory-mapped area (MappedByteBuffer).
- size: the length in bytes of the mapped portion of the file, counted from position, corresponding to the size of the memory-mapped area (MappedByteBuffer).

Compared with ByteBuffer, MappedByteBuffer adds three important methods: force(), load() and isLoaded():

- force(): for a buffer in READ_WRITE mode, forces changes made to the buffer contents to be flushed to the local file.
- load(): loads the buffer contents into physical memory and returns a reference to the buffer.
- isLoaded(): returns true if the buffer contents are resident in physical memory, otherwise false.

The following is an example of using MappedByteBuffer to read and write files:

private final static String CONTENT = "Zero copy implemented by MappedByteBuffer";
private final static String FILE_NAME = "/mmap.txt";
private final static String CHARSET = "UTF-8";

Write file data: open the file channel fileChannel with read, write and truncate-existing options, map it through fileChannel to a writable memory buffer mappedByteBuffer, write the target data into mappedByteBuffer, and flush the buffer changes to the local file through the force() method.

@Test
public void writeToFileByMappedByteBuffer() {
    Path path = Paths.get(getClass().getResource(FILE_NAME).getPath());
    byte[] bytes = CONTENT.getBytes(Charset.forName(CHARSET));
    try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ,
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        MappedByteBuffer mappedByteBuffer = fileChannel.map(READ_WRITE, 0, bytes.length);
        if (mappedByteBuffer != null) {
            mappedByteBuffer.put(bytes);
            mappedByteBuffer.force();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Read file data: open the file channel fileChannel with read-only permission, map it through fileChannel to a read-only memory buffer mappedByteBuffer, and read the byte array from mappedByteBuffer to get the file data.

@Test
public void readFromFileByMappedByteBuffer() {
    Path path = Paths.get(getClass().getResource(FILE_NAME).getPath());
    int length = CONTENT.getBytes(Charset.forName(CHARSET)).length;
    try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ)) {
        MappedByteBuffer mappedByteBuffer = fileChannel.map(READ_ONLY, 0, length);
        if (mappedByteBuffer != null) {
            byte[] bytes = new byte[length];
            mappedByteBuffer.get(bytes);
            String content = new String(bytes, StandardCharsets.UTF_8);
            assertEquals(content, "Zero copy implemented by MappedByteBuffer");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

The underlying implementation principle of the map () method is described below. The map () method is an abstract method of java.nio.channels.FileChannel and is implemented by the subclass sun.nio.ch.FileChannelImpl.java. The following is the core code related to memory mapping:

public MappedByteBuffer map(MapMode mode, long position, long size) throws IOException {
    int pagePosition = (int) (position % allocationGranularity);
    long mapPosition = position - pagePosition;
    long mapSize = size + pagePosition;
    try {
        addr = map0(imode, mapPosition, mapSize);
    } catch (OutOfMemoryError x) {
        System.gc();
        try {
            Thread.sleep(100);
        } catch (InterruptedException y) {
            Thread.currentThread().interrupt();
        }
        try {
            addr = map0(imode, mapPosition, mapSize);
        } catch (OutOfMemoryError y) {
            throw new IOException("Map failed", y);
        }
    }
    int isize = (int) size;
    Unmapper um = new Unmapper(addr, mapSize, isize, mfd);
    if ((!writable) || (imode == MAP_RO)) {
        return Util.newMappedByteBufferR(isize, addr + pagePosition, mfd, um);
    } else {
        return Util.newMappedByteBuffer(isize, addr + pagePosition, mfd, um);
    }
}

The map () method allocates a piece of virtual memory to the file as its memory-mapped area through the local method map0 (), and then returns the starting address of the memory-mapped area.

File mapping creates a MappedByteBuffer instance in the Java heap. If the first mapping attempt fails with OOM, garbage collection is triggered manually, the thread sleeps for 100 ms, and the mapping is attempted again; if it fails again, an exception is thrown. An instance of DirectByteBuffer is then created by reflection through Util's newMappedByteBuffer (read-write) or newMappedByteBufferR (read-only) method; DirectByteBuffer is a subclass of MappedByteBuffer.

The map () method returns the starting address of the memory-mapped area, and the data for the specified memory can be obtained by (starting address + offset). To some extent, this replaces the read () or write () methods, and the underlying getByte () and putByte () methods of the sun.misc.Unsafe class are used to read and write data.

private native long map0(int prot, long position, long mapSize) throws IOException;

The above is the definition of the local method (native method) map0, which calls the implementation of the underlying C through JNI (Java Native Interface). The implementation of this native function (Java_sun_nio_ch_FileChannelImpl_map0) is located in the source file native/sun/nio/ch/FileChannelImpl.c under the JDK source package.

JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_map0(JNIEnv *env, jobject this, jint prot, jlong off, jlong len)
{
    void *mapAddress = 0;
    jobject fdo = (*env)->GetObjectField(env, this, chan_fd);
    jint fd = fdval(env, fdo);
    int protections = 0;
    int flags = 0;

    if (prot == sun_nio_ch_FileChannelImpl_MAP_RO) {
        protections = PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_RW) {
        protections = PROT_WRITE | PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_PV) {
        protections = PROT_WRITE | PROT_READ;
        flags = MAP_PRIVATE;
    }

    mapAddress = mmap64(
        0,                    /* Let OS decide location */
        len,                  /* Number of bytes to map */
        protections,          /* File permissions */
        flags,                /* Changes are shared */
        fd,                   /* File descriptor of mapped file */
        off);                 /* Offset into file */

    if (mapAddress == MAP_FAILED) {
        if (errno == ENOMEM) {
            JNU_ThrowOutOfMemoryError(env, "Map failed");
            return IOS_THROWN;
        }
        return handle(env, -1, "Map failed");
    }

    return ((jlong) (unsigned long) mapAddress);
}

You can see that the map0 () function finally makes a memory-mapped call to the underlying Linux kernel through the mmap64 () function. The prototype of the mmap64 () function is as follows:

#include <sys/mman.h>

void *mmap64(void *addr, size_t len, int prot, int flags, int fd, off64_t offset);

Here is a detailed description of the meaning and optional values of the parameters of the mmap64 () function:

- addr: the starting address of the file in the memory-mapped region of the user process space. It is only a hint and can usually be set to 0 or NULL; the real starting address is then decided by the kernel. When flags contains MAP_FIXED, addr becomes mandatory and must refer to an existing address.
- len: the length in bytes of the file region to be memory-mapped.
- prot: controls the access rights of the user process to the memory-mapped region: PROT_READ (read), PROT_WRITE (write), PROT_EXEC (execute), PROT_NONE (no access).
- flags: controls whether modifications to the memory-mapped region are shared between processes. MAP_PRIVATE: modifications are not written back to the real file; a copy-on-write mechanism is used when data is modified. MAP_SHARED: modifications are synchronized to the real file and are visible to all processes sharing this mapping. MAP_FIXED: not recommended; in this mode the addr argument must specify an existing address.
- fd: the file descriptor. Each map operation increases the file's reference count by 1, and each unmap operation (or process exit) decreases it by 1.
- offset: the file offset, i.e. the position in the file where the mapping starts, measured from the beginning of the file.

Here is a summary of the characteristics and shortcomings of MappedByteBuffer:

- MappedByteBuffer uses off-heap virtual memory, so the amount of memory mapped (map) is not limited by the JVM's -Xmx parameter, but it does have a size limit: if the file exceeds Integer.MAX_VALUE bytes, the remainder of the file can be re-mapped through the position parameter.
- MappedByteBuffer does perform well for large files, but it also brings problems such as memory occupation and uncertainty about when the file is closed: a file opened by MappedByteBuffer is closed only during garbage collection, and that point in time is not deterministic.
- MappedByteBuffer offers the mmap() method for mapping file memory, but the unmap() method for releasing the mapping is a private method of FileChannelImpl and cannot be called explicitly. Therefore the user program has to release the mapped memory area manually by invoking the clean() method of the sun.misc.Cleaner class through Java reflection:

public static void clean(final Object buffer) throws Exception {
    AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
        try {
            Method getCleanerMethod = buffer.getClass().getMethod("cleaner", new Class[0]);
            getCleanerMethod.setAccessible(true);
            Cleaner cleaner = (Cleaner) getCleanerMethod.invoke(buffer, new Object[0]);
            cleaner.clean();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    });
}

DirectByteBuffer

The object reference of DirectByteBuffer is located in the heap of the Java memory model. JVM can allocate and reclaim the memory of DirectByteBuffer objects. Generally, the static method allocateDirect () of DirectByteBuffer is used to create DirectByteBuffer instances and allocate memory.

public static ByteBuffer allocateDirect(int capacity) {
    return new DirectByteBuffer(capacity);
}

The byte buffer inside DirectByteBuffer is located in the direct memory outside the heap (user mode). It allocates memory through the local method allocateMemory () of Unsafe, and the underlying call is the malloc () function of the operating system.

DirectByteBuffer(int cap) {
    super(-1, 0, cap, cap);
    boolean pa = VM.isDirectMemoryPageAligned();
    int ps = Bits.pageSize();
    long size = Math.max(1L, (long) cap + (pa ? ps : 0));
    Bits.reserveMemory(size, cap);

    long base = 0;
    try {
        base = unsafe.allocateMemory(size);
    } catch (OutOfMemoryError x) {
        Bits.unreserveMemory(size, cap);
        throw x;
    }
    unsafe.setMemory(base, size, (byte) 0);
    if (pa && (base % ps != 0)) {
        address = base + ps - (base & (ps - 1));
    } else {
        address = base;
    }
    cleaner = Cleaner.create(this, new Deallocator(base, size, cap));
    att = null;
}

In addition, when a DirectByteBuffer is initialized, a Deallocator task is created and registered with a Cleaner; the direct memory is reclaimed in the Deallocator through Unsafe's freeMemory() method, which ultimately calls the operating system's free() function.

private static class Deallocator implements Runnable {
    private static Unsafe unsafe = Unsafe.getUnsafe();

    private long address;
    private long size;
    private int capacity;

    private Deallocator(long address, long size, int capacity) {
        assert (address != 0);
        this.address = address;
        this.size = size;
        this.capacity = capacity;
    }

    public void run() {
        if (address == 0) {
            return;
        }
        unsafe.freeMemory(address);
        address = 0;
        Bits.unreserveMemory(size, capacity);
    }
}

Because DirectByteBuffer allocates native system memory, which is outside the JVM's control, direct memory is reclaimed differently from heap memory; if direct memory is used improperly, it is easy to cause an OutOfMemoryError.

After all that has been said, what does DirectByteBuffer have to do with zero copy? As mentioned earlier, when MappedByteBuffer does memory mapping, its map () method creates a buffer instance through Util.newMappedByteBuffer (). The initialization code is as follows:

static MappedByteBuffer newMappedByteBuffer(int size, long addr, FileDescriptor fd, Runnable unmapper) {
    MappedByteBuffer dbb;
    if (directByteBufferConstructor == null)
        initDBBConstructor();
    try {
        dbb = (MappedByteBuffer) directByteBufferConstructor.newInstance(
                new Object[] { new Integer(size), new Long(addr), fd, unmapper });
    } catch (InstantiationException | IllegalAccessException | InvocationTargetException e) {
        throw new InternalError(e);
    }
    return dbb;
}

private static void initDBBRConstructor() {
    AccessController.doPrivileged(new PrivilegedAction<Void>() {
        public Void run() {
            try {
                Class<?> cl = Class.forName("java.nio.DirectByteBufferR");
                Constructor<?> ctor = cl.getDeclaredConstructor(
                        new Class<?>[] { int.class, long.class, FileDescriptor.class, Runnable.class });
                ctor.setAccessible(true);
                directByteBufferRConstructor = ctor;
            } catch (ClassNotFoundException | NoSuchMethodException |
                     IllegalArgumentException | ClassCastException x) {
                throw new InternalError(x);
            }
            return null;
        }
    });
}

DirectByteBuffer is a concrete implementation class of MappedByteBuffer. In fact, the Util.newMappedByteBuffer () method takes the constructor of DirectByteBuffer through the reflection mechanism, and then creates an instance of DirectByteBuffer, corresponding to a separate constructor for memory mapping:

protected DirectByteBuffer(int cap, long addr, FileDescriptor fd, Runnable unmapper) {
    super(-1, 0, cap, cap, fd);
    address = addr;
    cleaner = Cleaner.create(this, unmapper);
    att = null;
}

Therefore, in addition to allowing direct memory to be allocated from the operating system, DirectByteBuffer itself also has file memory-mapping capability, which is not covered in detail here. What we need to note is that, on top of MappedByteBuffer, DirectByteBuffer provides random get() read and put() write operations on the memory-image file.

Random read of the memory-image file:

public byte get() {
    return ((unsafe.getByte(ix(nextGetIndex()))));
}

public byte get(int i) {
    return ((unsafe.getByte(ix(checkIndex(i)))));
}

Random write of the memory-image file:

public ByteBuffer put(byte x) {
    unsafe.putByte(ix(nextPutIndex()), ((x)));
    return this;
}

public ByteBuffer put(int i, byte x) {
    unsafe.putByte(ix(checkIndex(i)), ((x)));
    return this;
}

Random reads and writes of the memory-image file are positioned by the ix() method, which computes a pointer address from the first address (address) of the memory-mapped space plus a given offset i; the data at that pointer is then read or written through the getByte() and putByte() methods of the unsafe class.

private long ix(int i) {
    return address + ((long) i << 0);
}

FileChannel's transferTo() method, which underpins zero copy in Java NIO, first checks the requested position and count and then tries the available transfer strategies in turn:

public long transferTo(long position, long count, WritableByteChannel target) throws IOException {
    // ...
    long sz = size();
    if (position > sz)
        return 0;
    int icount = (int) Math.min(count, Integer.MAX_VALUE);
    // check offset
    if ((sz - position) < icount)
        icount = (int) (sz - position);

    long n;

    // direct transfer via sendfile, if the kernel supports it
    if ((n = transferToDirectly(position, icount, target)) >= 0)
        return n;

    // mapped transfer to trusted channel types
    if ((n = transferToTrustedChannel(position, icount, target)) >= 0)
        return n;

    // slow path for arbitrary targets
    return transferToArbitraryChannel(position, icount, target);
}

Next, we focus on the implementation of the transferToDirectly() method, which is the essence of how transferTo() achieves zero copy through sendfile. As you can see, transferToDirectly() first obtains the file descriptor targetFD of the destination channel WritableByteChannel, then acquires the position lock if necessary and calls the transferToDirectlyInternal() method.

private long transferToDirectly(long position, int icount, WritableByteChannel target) throws IOException {
    // the process of obtaining targetFD from target is omitted
    if (nd.transferToDirectlyNeedsPositionLock()) {
        synchronized (positionLock) {
            long pos = position();
            try {
                return transferToDirectlyInternal(position, icount, target, targetFD);
            } finally {
                position(pos);
            }
        }
    } else {
        return transferToDirectlyInternal(position, icount, target, targetFD);
    }
}

Finally, transferToDirectlyInternal() calls the native method transferTo0() to try to transfer the data via sendfile. If the system kernel does not support sendfile at all, as on the Windows operating system, it returns UNSUPPORTED and marks transferSupported as false. If the system kernel does not support some features of sendfile, for example an older Linux kernel that does not support the DMA gather copy operation, it returns UNSUPPORTED_CASE and marks pipeSupported or fileSupported as false.

private long transferToDirectlyInternal(long position, int icount, WritableByteChannel target, FileDescriptor targetFD) throws IOException {
    assert !nd.transferToDirectlyNeedsPositionLock() || Thread.holdsLock(positionLock);

    long n = -1;
    int ti = -1;
    try {
        begin();
        ti = threads.add();
        if (!isOpen())
            return -1;
        do {
            n = transferTo0(fd, position, icount, targetFD);
        } while ((n == IOStatus.INTERRUPTED) && isOpen());
        if (n == IOStatus.UNSUPPORTED_CASE) {
            if (target instanceof SinkChannelImpl)
                pipeSupported = false;
            if (target instanceof FileChannelImpl)
                fileSupported = false;
            return IOStatus.UNSUPPORTED_CASE;
        }
        if (n == IOStatus.UNSUPPORTED) {
            transferSupported = false;
            return IOStatus.UNSUPPORTED;
        }
        return IOStatus.normalize(n);
    } finally {
        threads.remove(ti);
        end(n > -1);
    }
}

The native method transferTo0() calls the underlying C function through JNI (Java Native Interface); this native function (Java_sun_nio_ch_FileChannelImpl_transferTo0) is also located in the native/sun/nio/ch/FileChannelImpl.c source file of the JDK source package. The JNI function uses conditional compilation to select different implementations for different systems. The following is the sendfile-based wrapper the JDK provides for Linux, with branches for Solaris and Apple systems:

#if defined(__linux__) || defined(__solaris__)
#include <sys/sendfile.h>
#elif defined(_AIX)
#include <sys/socket.h>
#elif defined(_ALLBSD_SOURCE)
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#define lseek64 lseek
#define mmap64 mmap
#endif

JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_transferTo0(JNIEnv *env, jobject this,
                                            jobject srcFDO, jlong position,
                                            jlong count, jobject dstFDO)
{
    jint srcFD = fdval(env, srcFDO);
    jint dstFD = fdval(env, dstFDO);

#if defined(__linux__)
    off64_t offset = (off64_t) position;
    jlong n = sendfile64(dstFD, srcFD, &offset, (size_t) count);
    return n;
#elif defined(__solaris__)
    result = sendfilev64(dstFD, &sfv, 1, &numBytes);
    return result;
#elif defined(__APPLE__)
    result = sendfile(srcFD, dstFD, position, &numBytes, NULL, 0);
    return result;
#endif
}

For Linux, Solaris and Apple systems, transferTo0() ultimately performs the zero-copy operation through a sendfile-style system call; on Linux it is the sendfile64 system call, whose prototype is as follows:

#include <sys/sendfile.h>

ssize_t sendfile64(int out_fd, int in_fd, off_t *offset, size_t count);

Here is a brief introduction to the meaning of the parameters of the sendfile64 () function:

- out_fd: the file descriptor to be written to.
- in_fd: the file descriptor to be read from.
- offset: the read position within the file referred to by in_fd; if NULL, reading starts from the current (default) position.
- count: the number of bytes to transfer between the file descriptors in_fd and out_fd.

Before Linux 2.6.33, out_fd had to be a socket; since Linux 2.6.33 it can be any file. In other words, the sendfile64() function can not only transfer files over the network, but also perform zero-copy operations on local files.

9. Other zero-copy implementations

9.1. Netty zero copy

Zero copy in Netty is quite different from the operating-system-level zero copy discussed above. Netty's zero copy is entirely based on the (Java-level) user mode and leans more towards the idea of optimizing data operations, which shows in the following aspects:

- Netty wraps the transferTo() method of java.nio.channels.FileChannel in the DefaultFileRegion class, so that during a file transfer the data in the file buffer can be sent directly to the destination channel (Channel).
- ByteBuf can wrap a byte array, a ByteBuf or a ByteBuffer into a ByteBuf object through the wrap operation, thereby avoiding a copy operation.
- ByteBuf supports the slice operation, so a ByteBuf can be split into multiple ByteBuf objects that share the same storage area, avoiding memory copies.
- Netty provides the CompositeByteBuf class, which can merge multiple ByteBuf objects into a single logical ByteBuf, avoiding copies between the individual ByteBuf objects.

Among them, the first item is zero copy at the operating-system level, while the latter three can only be regarded as data-operation optimizations at the user level.

9.2. Comparison between RocketMQ and Kafka

RocketMQ chooses the zero-copy method of mmap + write, which is suitable for data persistence and transmission of small block files such as business-level messages, while Kafka adopts the zero-copy method of sendfile, which is suitable for data persistence and transmission of high-throughput large-block files such as Syslog messages. However, it is worth noting that Kafka's index files use mmap + write mode, while data files use sendfile mode.

RocketMQ (mmap + write)
- Advantages: suitable for transferring small files; very efficient when called frequently.
- Disadvantages: cannot make good use of DMA and consumes more CPU than sendfile; memory-safety control is complex, and JVM crash problems have to be avoided.

Kafka (sendfile)
- Advantages: can make use of DMA and consumes little CPU; highly efficient for transferring large files; no memory-safety problems.
- Disadvantages: less efficient than mmap for small files; data can only be transferred in BIO mode, NIO cannot be used.

Summary

This article began with a detailed description of physical memory and virtual memory in the Linux operating system, the concepts of kernel space and user space, and the internal hierarchy of Linux. On this basis, it analyzed and compared the differences between the traditional Linux I/O mode and the zero-copy mode, and then introduced several zero-copy implementations provided by the Linux kernel, including memory mapping mmap, sendfile, sendfile + DMA gather copy and splice, comparing them in terms of system calls and copy counts. It then analyzed the implementation of zero copy in Java NIO from the source code, including MappedByteBuffer based on memory mapping (mmap) and FileChannel based on sendfile. Finally, it briefly described the zero-copy mechanism in Netty and the differences between the RocketMQ and Kafka message queues in their zero-copy implementations.
