How to use RDMA


Many inexperienced developers do not know how to use RDMA, so this article summarizes the background and the practical ways to adopt it. I hope it helps you solve the problem after reading it.

Unstoppable RDMA

Nowadays, server network bandwidth keeps growing. Once the bandwidth exceeds the 10-gigabit level, the cost the operating system pays for handling network IO becomes increasingly hard to ignore. In some network-IO-intensive services, the operating system itself has become the bottleneck of network communication, which not only increases call latency (especially the long tail) but also hurts the overall throughput of the service.

Compared with the rapid growth of network bandwidth, the slow growth of CPU performance is the main cause of the problem above. To fundamentally eliminate the inefficiency of having the CPU drive network transmission, we have to rely more on dedicated hardware, which is why RDMA high-performance networking is unstoppable.

RDMA (Remote Direct Memory Access) can be understood simply as the network card bypassing the CPU entirely to exchange data directly between the memory of two servers. As a network transmission technology implemented in hardware, it greatly improves transmission efficiency and helps network-IO-intensive services (such as distributed storage and distributed databases) achieve lower latency and higher throughput.

Specifically, using RDMA requires a network card that supports RDMA and the corresponding driver. Once the application has allocated its resources, it can hand the memory address and length of the data to be sent directly to the network card. The network card pulls the data from memory, encapsulates the message in hardware, and sends it to the receiver. When an RDMA message arrives, the receiver's network card decapsulates it in hardware and places the payload directly into the memory location specified by the application.

Because the whole IO process involves no CPU, no operating-system kernel, no system calls, no interrupts, and no memory copies, RDMA network transmission can achieve extremely high performance. In extreme benchmarks, RDMA latency can reach the 1 µs level and throughput can reach 200 Gbps.

RDMA Technical Note

Note that using RDMA requires cooperation from the application code (RDMA programming). Unlike traditional TCP transport, RDMA provides no socket-style API wrapper; it is used through the verbs API (via libibverbs). To avoid the extra overhead of an intermediate layer, the verbs API adopts semantics close to the hardware implementation, which makes its usage very different from the socket API. As a result, it is not easy for most developers to port an existing application to RDMA or to write a new native RDMA application.

What makes RDMA programming difficult?

As shown in the figure below, the main interfaces used to send and receive data in the socket API are as follows:

Socket API

In the write and read operations, fd is a file descriptor that identifies a connection. Data the application wants to send is copied into the kernel buffer by write; read copies data out of the kernel buffer. In most applications fd is set to non-blocking: if the kernel send buffer is full, write returns immediately, and if the kernel receive buffer is empty, read returns immediately. To learn about kernel-buffer state changes promptly, the application uses a mechanism such as epoll to listen for EPOLLIN and EPOLLOUT events; when epoll_wait returns because of these events, the next write or read can be issued. This is the basic usage of the socket API (a minimal sketch of this pattern, together with a verbs counterpart, is shown after the verbs API description below). For comparison, the main interfaces used to send and receive data in the verbs API are as follows:

Verbs API

Here ibv_ is the prefix of functions and structures in the libibverbs library. ibv_post_send roughly corresponds to a send operation and ibv_post_recv to a receive operation. The qp (queue pair) in these calls plays a role similar to fd in the socket API, identifying a connection. The wr (work request) structure carries the memory address (a virtual address of the process) and the length of the data to be sent or received. ibv_poll_cq serves as the completion-detection mechanism, loosely analogous to epoll_wait.
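To make the comparison concrete, below is a minimal sketch of the non-blocking socket/epoll pattern described above. It assumes fd is an already-connected, non-blocking TCP socket and omits error handling.

```cpp
// Minimal sketch of the socket/epoll pattern described above.
// Assumes `fd` is an already-connected, non-blocking TCP socket.
#include <sys/epoll.h>
#include <unistd.h>

void echo_once(int fd) {
    int epfd = epoll_create1(0);
    epoll_event ev = {};
    ev.events = EPOLLIN;              // wait until the kernel receive buffer has data
    ev.data.fd = fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    epoll_event ready = {};
    char buf[4096];
    if (epoll_wait(epfd, &ready, 1, -1) > 0 && (ready.events & EPOLLIN)) {
        // read() copies data out of the kernel receive buffer; on a
        // non-blocking fd it returns -1/EAGAIN if the buffer is empty.
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            // write() copies data into the kernel send buffer; once it
            // returns successfully, `buf` can be reused immediately.
            write(fd, buf, n);
        }
    }
    close(epfd);
}
```

And here is a corresponding sketch using the verbs API. It assumes qp, cq, and mr (a registered memory region covering buf) were created beforehand and omits error handling; the comments mark where the asynchronous semantics discussed next come into play.

```cpp
// Minimal sketch of sending one message with the verbs API.
// Assumes `qp`, `cq`, and `mr` (a registered memory region covering `buf`)
// were set up beforehand; error handling is omitted.
#include <infiniband/verbs.h>
#include <cstdint>

bool send_one(ibv_qp* qp, ibv_cq* cq, ibv_mr* mr, void* buf, uint32_t len) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uint64_t>(buf);  // virtual address of the data
    sge.length = len;
    sge.lkey   = mr->lkey;                         // key obtained from memory registration

    ibv_send_wr wr = {};
    ibv_send_wr* bad_wr = nullptr;
    wr.wr_id      = 1;                  // echoed back in the completion entry
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;  // request a completion entry

    // A return value of 0 only means the request was handed to the NIC,
    // not that the data has actually been sent.
    if (ibv_post_send(qp, &wr, &bad_wr) != 0) return false;

    // Poll the completion queue until the request finishes; only after
    // that may `buf` be modified or reused.
    ibv_wc wc = {};
    int n = 0;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0) { /* busy poll */ }
    return n > 0 && wc.status == IBV_WC_SUCCESS;
}
```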

At first glance, RDMA programming seems simple: just swap in the functions above. In reality, the correspondences above are only approximate, not equivalent. The key difference is that the socket API is synchronous while the RDMA API is asynchronous (note that asynchronous and non-blocking are two different concepts).

Specifically, a successful return from ibv_post_send only means the send request was submitted to the network card; it does not guarantee the data has actually been sent. If you immediately overwrite the memory holding the data, what goes out on the wire may be wrong. The socket write, by contrast, is synchronous: a successful return means the data has been copied into the kernel buffer, so although it may not have been transmitted yet, the application is free to reuse or dispose of that memory.

On the other hand, the events obtained from ibv_poll_cq differ from those obtained from epoll_wait: the former indicate that a send or receive request previously submitted to the network card has completed, while the latter indicate that new data has arrived or that new data can be written. These semantic differences affect the memory usage patterns and API invocation patterns of upper-layer applications.
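As a sketch of what this means on the receive side (assuming qp, cq, and a registered mr/buffer already exist; error handling omitted): a receive buffer must be posted to the network card in advance, and ibv_poll_cq later reports that this previously posted request has completed, i.e. a message has landed in that buffer.

```cpp
// Minimal sketch of the receive side of the asynchronous model described above.
// Assumes `qp`, `cq`, `mr`, and `buf` already exist; error handling is omitted.
#include <infiniband/verbs.h>
#include <cstdint>

void post_one_recv(ibv_qp* qp, ibv_mr* mr, void* buf, uint32_t len) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uint64_t>(buf);
    sge.length = len;
    sge.lkey   = mr->lkey;

    ibv_recv_wr wr = {};
    ibv_recv_wr* bad_wr = nullptr;
    wr.wr_id   = reinterpret_cast<uint64_t>(buf); // lets us identify the buffer later
    wr.sg_list = &sge;
    wr.num_sge = 1;
    ibv_post_recv(qp, &wr, &bad_wr);              // must happen before the message arrives
}

void poll_completions(ibv_cq* cq) {
    ibv_wc wc = {};
    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status == IBV_WC_SUCCESS && wc.opcode == IBV_WC_RECV) {
            // wc.wr_id tells us which previously posted buffer now holds
            // wc.byte_len bytes of received data; process it, then repost it.
        }
    }
}
```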

Besides the difference between synchronous and asynchronous semantics, there is another key element of RDMA programming: all memory involved in sending and receiving must be registered.

Memory registration, simply put, pins the mapping between the virtual and physical addresses of a region of memory and registers that mapping with the network card hardware. This is necessary because the memory addresses in send and receive requests are virtual addresses; only after registration can the network card translate the virtual address in a request into a physical address and perform direct memory access without the CPU. Memory registration (and deregistration) is a very slow operation, so in practice applications usually build a memory pool, registering once and reusing the memory to avoid frequent calls to the registration functions.
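Below is a minimal sketch of such register-once-and-reuse handling; pd is an existing protection domain, and the page alignment and access flags are illustrative assumptions.

```cpp
// Minimal sketch of a register-once buffer pool, as described above.
// `pd` is an existing protection domain; error handling is kept minimal.
#include <infiniband/verbs.h>
#include <cstdlib>

struct RegisteredPool {
    void*   buf = nullptr;
    ibv_mr* mr  = nullptr;

    bool init(ibv_pd* pd, size_t bytes) {
        if (posix_memalign(&buf, 4096, bytes) != 0) return false;  // page-aligned backing memory
        // Register once: pins the pages and hands the virtual-to-physical
        // mapping to the NIC. This call is expensive.
        mr = ibv_reg_mr(pd, buf, bytes,
                        IBV_ACCESS_LOCAL_WRITE |
                        IBV_ACCESS_REMOTE_READ |
                        IBV_ACCESS_REMOTE_WRITE);
        return mr != nullptr;
    }

    // Send/receive buffers are carved out of `buf` and all reuse `mr->lkey`,
    // so ibv_reg_mr never appears on the fast path.
    void destroy() {
        if (mr)  ibv_dereg_mr(mr);
        if (buf) free(buf);
    }
};
```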

There are many other details of RDMA programming that ordinary network programming never has to care about (such as flow control, TCP fallback, non-interrupt mode, etc.), which will not be covered here. All in all, RDMA programming is not easy. So how can developers quickly adopt a high-performance network technology like RDMA?

Using RDMA in brpc

The comparisons above between the socket API and the verbs API mainly serve to highlight the complexity of RDMA programming itself. In practice, few production services call the socket API directly for network transmission; most use it indirectly through an RPC framework. A complete RPC framework has to provide a full network transmission solution, including data serialization, error handling, multithreading, and so on. brpc is an open-source C++ RPC framework from Baidu; compared with gRPC, it is better suited to scenarios with high performance requirements. Besides traditional TCP transport, brpc also provides RDMA transport to push past the performance limits of the operating system itself. Interested readers can refer to the source code on GitHub for implementation details (https://github.com/apache/incubator-brpc/tree/rdma).

Use RDMA on the brpc client side

Use RDMA on the brpc server side

The figures above show how to use RDMA on the brpc client and server sides: when the channel and the server are created, set the use_rdma option to true (the default is false, meaning TCP is used).
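As a minimal sketch of this two-line change (the option name follows the article and the rdma branch linked above; the address, port, and omitted service registration are placeholders):

```cpp
// Minimal sketch of enabling RDMA in brpc by setting use_rdma when the
// channel and the server are created. The address/port are placeholders
// and service registration is omitted.
#include <brpc/channel.h>
#include <brpc/server.h>

int init_client(brpc::Channel* channel) {
    brpc::ChannelOptions options;
    options.use_rdma = true;                 // default is false, i.e. TCP
    return channel->Init("127.0.0.1:8002", &options);
}

int start_server(brpc::Server* server) {
    brpc::ServerOptions options;
    options.use_rdma = true;                 // default is false, i.e. TCP
    return server->Start(8002, &options);
}
```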

Yes, just these two lines of code. If your application is already built on brpc, migrating from TCP to RDMA takes only a few minutes. Beyond this quick start, brpc also provides some runtime flags for more advanced tuning, such as the memory pool size, the qp/cq sizes, and polling instead of interrupts.

The performance benefit of using RDMA in brpc is illustrated with an echo benchmark (it can be found in the rdma_performance directory of the GitHub code). On a 25G network, for messages below 2 KB, the maximum server-side QPS increases by more than 50% with RDMA, and the average latency at 200k QPS drops by more than 50%.

Maximum server-side QPS in the echo benchmark (25G network)

Average latency of the echo benchmark at 200k QPS (25G network)

RDMA's gains on the echo benchmark are for reference only; real workloads differ greatly from echo. For some services the benefit of RDMA will be smaller than the numbers above, because the network accounts for only part of the total cost of the business. For others the benefit is even larger, because RDMA also removes the interference of kernel operations with the business logic. Here are two examples of applying brpc:

In Baidu's distributed block storage service, the average latency of a 4KB fio test is about 30% lower with RDMA than with TCP (RDMA only optimizes the network IO; the storage IO is not affected by RDMA).

In Baidu's distributed in-memory KV service, at 200k QPS a single query for 30 keys has 89% lower average latency and 96% lower 99th-percentile latency than with TCP.

RDMA needs infrastructure support

RDMA is a new high-performance network technology that matters greatly for IO-intensive data-center services where both ends of the communication are under your control, such as HPC, machine learning, storage, and databases. We encourage developers of such services to pay attention to RDMA and to try building their applications on brpc so that they can migrate to RDMA smoothly. However, it is important to point out that RDMA is currently not as universally applicable as TCP, and some infrastructure limitations should be kept in mind:

RDMA needs network card hardware support. Common 10 Gigabit network cards generally do not support this technology.

The normal use of RDMA depends on physical network support.

Baidu Smart Cloud has accumulated deep experience in RDMA infrastructure. Thanks to advanced hardware and strong engineering, users can fully reap the performance benefits of RDMA through physical machines or containers on a 25G or even 100G network, while leaving the complex physical-network configuration tasks (such as lossless-network PFC and explicit congestion notification, ECN) to Baidu Smart Cloud's technical support staff. Developers who need high-performance computing, high-performance storage, and similar services are welcome to consult Baidu Smart Cloud.

After reading the above, have you mastered how to use RDMA? If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!
