This article describes how SPDK technology was applied to optimize RSSD cloud disks, in the hope that it will help you in practical applications. Cloud disks involve many moving parts and relatively little pure theory, and much has already been written online, so here we answer the question from hands-on experience accumulated in the industry.
1. Introduction
Users' demand for ultra-high concurrency and large-scale computing keeps pushing storage hardware forward: storage clusters deliver ever better performance and lower latency, which in turn raises the performance requirements on the whole IO path. In the cloud disk scenario, the path an IO request travels from the guest to the backend storage cluster and back is fairly complex, and the virtualized IO path in particular can become a bottleneck, because every IO issued inside a virtual machine must pass through it on the way to the backend storage system. We used SPDK to optimize the virtualized IO path, designed hot-upgrade and online-migration solutions that upstream SPDK does not provide, and successfully applied them in high-performance cloud disk scenarios, reaching up to 1.2 million IOPS on an RSSD cloud disk. This article shares some of our experience in this area.
2. Basic principles of SPDK vhost
SPDK (Storage Performance Development Kit) provides a set of tools and libraries for writing high-performance, scalable, user-mode storage applications. Its core is a user-mode, polled-mode, asynchronous, lock-free NVMe driver that gives user-space applications zero-copy, highly parallel, direct access to SSDs.
In the virtualized IO path, virtio is a commonly used paravirtualization solution, and under the hood virtio communicates through vrings. Let us first introduce the basic principles of a virtio vring. Each virtio vring mainly consists of the following parts:
(Figure: structure of a virtio vring)
The desc table: an array whose size equals the device's queue depth, typically 128. Each element describes an IO request and contains basic information such as the memory address where the IO data is stored and the length of the IO. Normally one IO request corresponds to one desc element; an IO that spans multiple memory pages, however, needs several desc entries chained into a linked list through their next pointers, and unused desc entries are likewise chained from free_head through their next pointers for later use.
The avail ring: a circular array in which each entry is an index into the desc table. When processing IO requests, the backend takes an index from this ring and uses it to find the corresponding request in the desc table.
The used ring: similar to the avail ring, but it records completed IO requests. When a request has been processed, its desc table index is placed in this ring. After being notified, the front-end virtio driver scans the ring to determine which requests have completed and reclaims their desc entries for subsequent IO requests.
The principle of SPDK vhost is relatively simple. During initialization, qemu's vhost driver sends the layout of the virtio vrings described above to SPDK. SPDK then keeps polling the avail ring for new requests; when it finds one it processes it, adds the index to the used ring, and notifies the virtio front end through the corresponding eventfd.
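To make this concrete, here is a simplified sketch (not SPDK source code) of a split virtio vring and a vhost-style poll loop. The field layout follows the virtio convention, while process_desc_chain and callfd are illustrative placeholders, and memory barriers are omitted for brevity:

```c
/* Simplified split virtio vring plus a vhost-style poll loop (sketch only). */
#include <stdint.h>
#include <unistd.h>

struct vring_desc      { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };
struct vring_avail     { uint16_t flags; uint16_t idx; uint16_t ring[]; };
struct vring_used_elem { uint32_t id; uint32_t len; };
struct vring_used      { uint16_t flags; uint16_t idx; struct vring_used_elem ring[]; };

struct vring {
    unsigned int        size;            /* queue depth, e.g. 128 */
    struct vring_desc  *desc;
    struct vring_avail *avail;
    struct vring_used  *used;
    uint16_t            last_avail_idx;  /* how far the backend has consumed */
    int                 callfd;          /* eventfd used to notify the guest */
};

/* Placeholder request handler: a real backend would walk the desc chain
 * starting at `head`, perform the IO, and return the bytes written back. */
static uint32_t process_desc_chain(struct vring *vq, uint16_t head)
{
    return vq->desc[head].len;
}

void poll_vring_once(struct vring *vq)
{
    /* New requests are published by the guest through avail->idx. */
    while (vq->last_avail_idx != vq->avail->idx) {
        uint16_t head = vq->avail->ring[vq->last_avail_idx % vq->size];
        uint32_t len  = process_desc_chain(vq, head);

        /* Return the completed descriptor index through the used ring. */
        uint16_t slot = vq->used->idx % vq->size;
        vq->used->ring[slot].id  = head;
        vq->used->ring[slot].len = len;
        vq->used->idx++;
        vq->last_avail_idx++;
    }

    /* Kick the front-end virtio driver through the eventfd. */
    uint64_t one = 1;
    write(vq->callfd, &one, sizeof(one));
}
```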
What SPDK receives for each IO request is only a pointer to the request data, and it must be able to access that memory directly while processing the request. The pointer, however, refers to an address in qemu's address space and obviously cannot be dereferenced as-is, so a translation is needed.
When SPDK is used, the virtual machine's memory is backed by hugepages, and qemu sends the hugepage layout to SPDK during initialization. SPDK parses this information and maps the same hugepages into its own address space with mmap, so that the two processes share the memory. When SPDK later receives a pointer in qemu's address space, it can convert it to its own address space simply by computing an offset.
From the above, SPDK vhost can pass IO requests between qemu and itself very quickly through shared hugepage memory: no data is copied in the process, only pointers are passed, which greatly improves the performance of the IO path.
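A minimal sketch of this address translation, assuming the shared hugepage regions have been recorded in an illustrative mem_region table during initialization (structure names are placeholders, not the SPDK API):

```c
/* Translate a guest physical address (GPA) into a local pointer after the
 * same hugepage regions have been mmap()ed into this process. */
#include <stdint.h>
#include <stddef.h>

struct mem_region {
    uint64_t guest_phys_addr;  /* start of the region in guest physical space */
    uint64_t size;
    void    *host_va;          /* where the same hugepages are mapped locally */
};

void *gpa_to_local(struct mem_region *regions, size_t nregions, uint64_t gpa)
{
    for (size_t i = 0; i < nregions; i++) {
        struct mem_region *r = &regions[i];
        if (gpa >= r->guest_phys_addr && gpa < r->guest_phys_addr + r->size)
            /* Same physical pages, so a simple offset suffices: zero copy. */
            return (uint8_t *)r->host_va + (gpa - r->guest_phys_addr);
    }
    return NULL;  /* address not covered by any shared region */
}
```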
We compared the latency of the original qemu cloud disk driver with that of SPDK vhost. To compare only the performance of the virtualized IO path, the backend simply returns each IO as soon as it is received:
1. Single queue (iodepth=1, numjobs=1)
Latency of the original qemu cloud disk driver:
Latency of SPDK vhost:
As can be seen, latency drops significantly in the single-queue case: the average latency falls from 130 us to 7.3 us.
2. Multiple queues (iodepth=128, numjobs=1)
Latency of the original qemu cloud disk driver:
Latency of SPDK vhost:
IO latency with multiple queues is generally higher than with a single queue. As can be seen, in the multi-queue scenario the average latency also drops from 3341 us to 1090 us, roughly one third of the original.
3. SPDK hot upgrade
When we first started using SPDK, we found that it lacked an important feature: hot upgrade. Developing custom bdev devices on top of SPDK inevitably involves version upgrades, and there is no guarantee that the SPDK process will never crash. Once the back-end SPDK process restarts or crashes, IO in the front-end qemu gets stuck and cannot recover even after SPDK comes back up.
We studied SPDK's initialization process carefully and found that, at the beginning of SPDK vhost startup, qemu sends down some configuration information, which is lost after SPDK restarts. Would it be enough to resend this configuration after a restart for SPDK to work normally again? We added an automatic reconnection mechanism to qemu so that, once reconnected, the configuration is resent in the original initialization order. Preliminary tests showed that recovery worked, but more rigorous stress testing revealed that it only worked when SPDK exited normally; after an SPDK crash, IO still got stuck and could not recover. From the symptoms, some IO had not been processed, so the virtual machine on the qemu side kept waiting for those IOs to return.
Studying the virtio vring mechanism further, we found that when SPDK exits normally it guarantees that all IO has been processed and returned before it exits, i.e. the virtio vring is left clean. No such guarantee exists for an unexpected crash: some IO in the virtio vring may still be unprocessed, so after SPDK recovers it has to scan the virtio vring and resubmit the outstanding requests. What complicates matters is that requests in the virtio vring are issued and processed in order, but they do not necessarily complete in that order.
Suppose there are six IOs in the avail ring of a virtio vring, with indexes 1 through 6. Requests 1 and 4 may already have completed and returned successfully, while 2, 3, 5 and 6 have not. If SPDK crashes at this point, the four IOs 2, 3, 5 and 6 must be resubmitted after restart, whereas 1 and 4 must not be processed again: they have already completed and returned, and the memory behind them may already have been freed. In other words, we cannot simply scan the avail ring to decide which IOs need to be resubmitted. We need a piece of memory that records the state of every request in the virtio vring, so that after a restart we can decide from it which IOs to resubmit; and this memory must not be lost when SPDK restarts, so memory owned by the qemu process is clearly the best fit. We therefore allocate a piece of shared memory for each virtio vring in qemu and pass it to SPDK during initialization. While processing IO, SPDK records the state of each virtio vring request in that memory, and after recovering from an unexpected crash it uses this information to find the requests that need to be resubmitted.
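As a rough illustration of this idea, the sketch below keeps one state byte per desc slot in a table that would live in memory shared with qemu. The names, layout and queue depth are assumptions for illustration only, not the actual implementation:

```c
/* Per-vring request state table kept in qemu-owned shared memory so that
 * it survives an unexpected SPDK crash (illustrative sketch). */
#include <stdint.h>

enum req_state {
    REQ_FREE      = 0,   /* slot not in use */
    REQ_INFLIGHT  = 1,   /* fetched from the avail ring, not yet completed */
    REQ_COMPLETED = 2,   /* result already placed in the used ring */
};

struct inflight_table {
    uint8_t state[128];              /* one entry per desc slot (depth 128) */
};

/* Called when a request is taken from the avail ring. */
static inline void req_start(struct inflight_table *t, uint16_t head)
{
    t->state[head] = REQ_INFLIGHT;
}

/* Called once the result has been written to the used ring. */
static inline void req_done(struct inflight_table *t, uint16_t head)
{
    t->state[head] = REQ_COMPLETED;
}

/* After a crash and restart, only REQ_INFLIGHT slots are resubmitted;
 * REQ_COMPLETED slots were already returned to the guest and their
 * buffers may have been freed. */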
4. SPDK online migration
The virtualized IO path provided by SPDK vhost performs very well, so could we use it to replace the original virtualized IO path entirely? Our investigation showed that SPDK is not yet as complete as the existing qemu IO path in some respects, the most important being online migration. The lack of this feature was the biggest obstacle to replacing the original IO path with SPDK vhost.
SPDK was designed for network storage, so it supports migrating device state, but not migrating the data on the device online. Qemu itself supports online migration, including both device state and device data, but not when vhost is used: with vhost, qemu only controls the device's control path, while the data path is delegated to the back-end SPDK process. In other words, qemu is no longer on the device's data IO path and therefore does not know which parts of the device have been written.
After reviewing qemu's existing online migration code, we concluded that the technical difficulty was not insurmountable, so we decided to develop online migration support for vhost storage devices in qemu.
The principle of online migration for block devices is relatively simple and can be split into two steps. In the first step, the entire disk is copied from beginning to end to the target virtual machine; because the copy takes a long time, some of the copied data is certain to be written again, and blocks that become dirty during this step are marked in a bitmap and left to the second step. In the second step, the bitmap is used to find the remaining dirty blocks and send them to the target side; finally all IO is blocked, the last dirty blocks are synchronized to the target, and the migration completes. This two-step loop is sketched below.
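The outline below illustrates the two-step scheme. The block count, the convergence threshold and the send/pause helpers are placeholders for illustration, not the real qemu implementation:

```c
/* Two-step block-device migration with a dirty bitmap (illustrative only). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 1024u                        /* number of tracked data blocks */
static uint8_t dirty[NBLOCKS / 8];           /* 1 bit per block, set on write */

static bool test_and_clear_dirty(uint32_t b)
{
    bool was_dirty = dirty[b >> 3] & (1u << (b & 7));
    dirty[b >> 3] &= (uint8_t)~(1u << (b & 7));
    return was_dirty;
}

static void send_block(uint32_t b)  { printf("send block %u\n", b); }
static void pause_guest_io(void)    { /* fence new writes on the source */ }
static void resume_guest_io(void)   { /* let the guest continue on the target */ }

void migrate_disk(void)
{
    /* Step 1: full copy; writes that land meanwhile set bits in `dirty`. */
    for (uint32_t b = 0; b < NBLOCKS; b++)
        send_block(b);

    /* Step 2: resend dirty blocks in passes; once few remain, block IO and
     * flush the last dirty blocks, then switch over. */
    for (;;) {
        uint32_t resent = 0;
        for (uint32_t b = 0; b < NBLOCKS; b++)
            if (test_and_clear_dirty(b)) { send_block(b); resent++; }
        if (resent < 16)                     /* arbitrary convergence threshold */
            break;
    }

    pause_guest_io();
    for (uint32_t b = 0; b < NBLOCKS; b++)
        if (test_and_clear_dirty(b))
            send_block(b);
    resume_guest_io();
}
```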
Online migration with SPDK follows the same principle; the complication is that qemu has no data IO path. We therefore developed a driver in qemu that provides a dedicated data path for migration, and created a bitmap shared between qemu and SPDK to record the block device's dirty blocks, using shared memory and an inter-process mutex. Since SPDK is a separate process that may crash unexpectedly, we enabled the PTHREAD_MUTEX_ROBUST attribute on the pthread mutex to prevent deadlock after an unexpected crash.
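A minimal sketch of such a robust, process-shared mutex, assuming the mutex and the dirty bitmap live in a memory region mapped by both qemu and SPDK (function names are illustrative). The robust attribute makes pthread_mutex_lock return EOWNERDEAD to the next locker if the previous owner died while holding the lock, so the survivor can repair the bitmap and continue instead of deadlocking:

```c
/* Process-shared, robust mutex protecting the shared dirty bitmap. */
#include <pthread.h>
#include <errno.h>

int init_shared_mutex(pthread_mutex_t *m)      /* m lives in shared memory */
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    return pthread_mutex_init(m, &attr);
}

int lock_bitmap(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    if (rc == EOWNERDEAD) {
        /* The previous owner crashed while holding the lock: the bitmap may
         * be mid-update, so re-validate it here, then mark the mutex usable. */
        pthread_mutex_consistent(m);
        rc = 0;
    }
    return rc;
}
```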
5. SPDK io_uring experience
io_uring is a relatively new kernel technology, merged upstream only in kernel 5.1. It mainly improves on the existing aio family of system calls by sharing memory between user space and the kernel, so that a system call is no longer needed for every IO submission; this reduces system call overhead and delivers higher performance.
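For context, a minimal liburing example (independent of SPDK) showing the shared-ring submit/complete flow. The file path is a placeholder and error handling is omitted for brevity:

```c
/* Minimal liburing read example: submit one 4K read and reap its completion. */
#include <liburing.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                    /* 8-entry SQ/CQ */

    int fd = open("/tmp/testfile", O_RDONLY);            /* placeholder path */
    static char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);            /* queue the read */
    io_uring_submit(&ring);                              /* one syscall for the batch */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                      /* wait for completion */
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);                       /* mark CQE consumed */

    io_uring_queue_exit(&ring);
    return 0;
}
```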
SPDK added a bdev backed by io_uring in its latest release, 19.04, but the feature only exists in the code and is not enabled or exposed by default. We can still try it out by modifying SPDK.
First of all, the io_uring code in the new SPDK release is not even compiled by default, so a few changes are needed:
1. Install the latest liburing library and modify SPDK's config file to enable io_uring compilation.
2. Referring to the other bdev implementations, add RPC calls for io_uring devices so that io_uring bdevs can be created like any other bdev.
3. The latest liburing replaced the io_uring_get_completion call with io_uring_peek_cqe, which must be paired with io_uring_cqe_seen, so the io_uring code in SPDK has to be adjusted accordingly to avoid an undefined io_uring_get_completion error at compile time (see the sketch after this list).
4. Modify the open call to open the file with O_SYNC, so that data is already durable when a write completes and returns, which is more efficient than calling fdatasync afterwards. We made the same change to the aio bdev and also added a read-write open mode.
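A rough sketch of what items 3 and 4 amount to, assuming a hypothetical completion callback; this is illustrative code, not the actual SPDK bdev module:

```c
/* Poll-mode completion reaping with io_uring_peek_cqe()/io_uring_cqe_seen(),
 * plus an O_SYNC open of the backing file (illustrative sketch). */
#include <liburing.h>
#include <fcntl.h>

void bdev_uring_io_done(void *io_ctx, int result);       /* hypothetical callback */

int reap_completions(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;
    int reaped = 0;

    /* Non-blocking: return immediately when the completion queue is empty. */
    while (io_uring_peek_cqe(ring, &cqe) == 0 && cqe != NULL) {
        bdev_uring_io_done(io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(ring, cqe);                     /* advance the CQ head */
        reaped++;
    }
    return reaped;
}

int open_backing_file(const char *path)
{
    /* O_SYNC makes each completed write durable without a separate fdatasync(). */
    return open(path, O_RDWR | O_SYNC);
}
```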
After the above changes, an SPDK io_uring bdev can be created successfully. A performance comparison follows:
When using aio bdev:
When using io_uring bdev:
As can be seen, io_uring has a clear advantage in both peak performance and latency: IOPS increases by about 20% and latency drops by about 10%. This result is in fact limited by the maximum performance of the underlying hardware and does not yet reach io_uring's ceiling.
6. Summary
With SPDK applied, the virtualized IO path is no longer a performance bottleneck, which helps UCloud's high-performance cloud disk products make better use of the back-end storage's capabilities. Of course, adopting a new technology is rarely smooth: we hit quite a few problems while using SPDK, and besides the experience shared above we have also submitted a number of bug fixes to the SPDK community. SPDK is a rapidly iterating project, each release brings new surprises, and many interesting features remain for us to explore and apply to further improve the performance of cloud disks and other products.