With traffic from e-commerce and other customers surging during peak periods such as Double 11 and flash sales, the demand for better virtual machine network performance has become increasingly urgent, and 25G networking has gradually become standard. To remove the performance bottleneck of the traditional pure-software virtual switch, we surveyed the mainstream SmartNIC solutions in the industry and ultimately chose an open-source scheme based on OpenvSwitch, which has been successfully deployed on the public cloud.
Compared with the traditional scheme, the new SmartNIC solution reaches 24 Mpps for overall switch forwarding and up to 15 Mpps receive performance on a single VF, improving overall NIC performance by more than 10 times. Applied to cloud virtual machines (CVMs), it increases network capability by at least 4 times and reduces latency to roughly one third, effectively addressing stability problems during e-commerce peak periods. This article describes in detail the pitfalls encountered and the solutions adopted during the selection and rollout of the new scheme, in the hope of providing a useful reference.
Comparison of mainstream SmartNIC solutions in the industry
The bottleneck of a traditional software virtual switch is that every packet received from the physical NIC is dispatched by the forwarding logic to a vhost thread, which then delivers it to the virtual machine. The processing capacity of the vhost thread therefore becomes the key factor limiting virtual machine network performance.
Offloading network traffic from the host onto a 25G SmartNIC has therefore become the direction widely recognized by the industry. SmartNIC implementations currently take many forms: AWS uses a multi-core scheme based on general-purpose ARM cores, Azure uses an FPGA-based scheme, Huawei Cloud uses a dedicated network processor (NP), and Aliyun uses a programmable ASIC. Each approach has its own strengths and weaknesses, and no single one dominates.
The multi-core schemes based on general-purpose ARM or MIPS simply move the vSwitch that used to run on the host onto the NIC, supporting either the Linux kernel datapath or DPDK, thereby freeing host computing resources. The FPGA, NP and programmable-ASIC schemes mostly maintain a fast forwarding path (Fast Path) on the NIC: when a packet arrives, the NIC first checks whether a processing rule for this kind of packet is already cached in the Fast Path; if so, the corresponding action is executed directly, otherwise the packet is handed to the Slow Path, which can be either DPDK or the Linux kernel.
What matters most for the Fast Path is therefore whether it supports a rich enough set of actions and how easily custom actions can be added. Besides vendors' private interfaces, the Slow Path and Fast Path can also communicate through the standard TC offload interface of the kernel and the rte_flow interface provided by DPDK.
However, FPGAs have high power consumption and cost, and their long development cycles prevent rapid rollout, requiring heavy investment from hardware through software. Schemes built on software customized by third-party NIC vendors depend heavily on those vendors for the stability of the NIC software, which makes it hard to locate and troubleshoot problems quickly when they arise.
Our choice
With no perfect implementation available in the industry, we turned our attention to open source. OpenvSwitch itself supports an offload interface based on Linux TC Flower, which has little impact on our existing control and management planes and lets us deliver the capability to users quickly. We therefore chose the open-source OpenvSwitch scheme based on TC Flower offload.
Packet processing can be viewed as taking a packet from reception to its final destination through a sequence of operations, most typically forwarding or dropping. That sequence usually consists of a match followed by an action. The TC Flower classifier of the Linux kernel TC subsystem can steer and forward packets on a per-flow basis; flows are classified on common packet fields, which together form a match term called the flow key. The flow key contains the common packet fields plus optional tunnel information, and TC actions then perform operations such as dropping, modifying or forwarding the packet.
This model is very close to OpenvSwitch's own classification. Offloading through the TC Flower classifier therefore gives flow-based systems a powerful way to increase throughput while reducing CPU utilization.
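To make the match-plus-action model above concrete, here is a minimal sketch of a TC Flower rule (the interface names and the destination IP are made up for illustration): it matches UDP packets to a given address on the uplink's ingress and redirects them to a VF representor, with skip_sw asking the driver to place the rule in hardware only.

# hypothetical uplink eth3 and VF representor mlx_0
tc qdisc add dev eth3 ingress
tc filter add dev eth3 parent ffff: protocol ip \
    flower skip_sw ip_proto udp dst_ip 192.168.1.10 \
    action mirred egress redirect dev mlx_0
# inspect the rule and its hit counters
tc -s filter show dev eth3 ingress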
Rollout practice of the SmartNIC based on OpenvSwitch offload
After selecting the scheme, we began rolling it out on the existing architecture. The process was not smooth, and we ran into problems in several areas:
1. Migration of virtual machines
The first issue in the rollout was virtual machine migration. Every vendor's SmartNIC relies on VF passthrough, and a passed-through VF cannot be moved, which makes virtual machine migration difficult. In the industry, Azure mainly solves this by bonding the VF with a virtio-net device inside the guest, but that approach requires some degree of user involvement and complicates virtual machine image management.
After investigating the upstream patch series "Enable virtio_net to act as a standby for a passthrough device" (https://patchwork.ozlabs.org/cover/920005/), we found that with this mechanism users neither have to configure bonding manually nor build special images, which completely removes the need for user intervention. We therefore adopted the VF + standby virtio-net approach for virtual machine migration. The migration flow is as follows:
Create the virtual machine with its own virtio-net NIC, then pick a VF on the host, expose it as a hostdev NIC with the same MAC address as the virtio-net NIC, and attach it to the virtual machine. The guest then automatically forms a bonding-like pairing of the virtio-net and VF NICs, so at this point the virtual machine has two network data planes on the host.
The tap device backing the virtio-net NIC is automatically added to the host's OpenvSwitch bridge when the virtual machine starts, and the datapath must be switched whenever the virtual machine's active NIC switches: after the VF is attached to the virtual machine, the tap device on the OpenvSwitch bridge is replaced with the VF representor (VF_repr).
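As a rough sketch of this flow, assuming the guest is managed with libvirt (the domain name, MAC address and VF PCI address below are placeholders), the VF can be hot-plugged as a hostdev interface that shares the virtio-net NIC's MAC, and detached again before live migration:

# vf-nic.xml: the VF exposed as an interface carrying the virtio-net NIC's MAC
cat > vf-nic.xml <<'EOF'
<interface type='hostdev' managed='yes'>
  <mac address='52:54:00:aa:bb:cc'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x81' slot='0x10' function='0x2'/>
  </source>
</interface>
EOF

# hot-plug the VF; the guest pairs it with the virtio-net device of the same MAC
virsh attach-device my-vm vf-nic.xml --live

# before live migration, detach the VF so traffic falls back to virtio-net
virsh detach-device my-vm vf-nic.xml --live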
2. VXLAN encap/decap cannot be offloaded
The next step was adaptation on the SmartNIC side. Taking the Mellanox CX5 NIC as an example, the software environment was OpenvSwitch-2.10.0, ukernel-4.14 and MLNX_OFED-4.4-1.0.0.0. Since the latest mlx5_core driver at the time did not support Ethernet over GRE tunnel offload, we first tested with VXLAN.
In this setup, eth3 is the PF and mlx_0 is the representor of VF0. Initialization proceeds as follows: first create a VF, unbind the VF from the mlx5_core driver, assign an IP address to the PF, switch the PF's embedded switch to switchdev mode, and enable the encap capability on the PF.
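A minimal sketch of this initialization using the standard sysfs, ip and devlink interfaces (the PCI addresses 0000:81:00.0 for the PF and 0000:81:00.2 for the VF are placeholders, and the exact devlink options may differ across driver versions):

# create one VF under the PF
echo 1 > /sys/class/net/eth3/device/sriov_numvfs
# unbind the VF from mlx5_core before changing the eswitch mode
echo 0000:81:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind
# assign the PF its underlay (VTEP) address
ip addr add 172.168.152.75/24 dev eth3
ip link set eth3 up
# switch the PF's embedded switch to switchdev mode and enable encap offload
devlink dev eswitch set pci/0000:81:00.0 mode switchdev
devlink dev eswitch set pci/0000:81:00.0 encap enable
# enable TC hardware offload on the PF and the VF representor
ethtool -K eth3 hw-tc-offload on
ethtool -K mlx_0 hw-tc-offload on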
The OpenvSwitch configuration is as follows: the virtual machine's VF is connected to br0 through its representor mlx_0, and traffic is sent to the remote host through vxlan0. The local address of the VXLAN tunnel is 172.168.152.75 and the remote address is 172.168.152.208.
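A hedged sketch of the corresponding ovs-vsctl commands (the VNI handling via key=flow and the service name used for the restart are assumptions):

# turn on TC-based hardware offload in OpenvSwitch (takes effect after restart)
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch    # service name varies by distribution
# bridge carrying the VF representor and a VXLAN tunnel port
ovs-vsctl add-br br0
ovs-vsctl add-port br0 mlx_0
ovs-vsctl add-port br0 vxlan0 -- set interface vxlan0 type=vxlan \
    options:local_ip=172.168.152.75 options:remote_ip=172.168.152.208 \
    options:key=flow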
With this configuration, encapsulated and decapsulated packets were sent and received correctly, but the flows were not offloaded to the NIC.
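One way to check whether flows actually reached the hardware (a sketch; output formats vary with the OVS and kernel versions) is to dump the flows OVS believes are offloaded and the TC rules installed on the devices:

# datapath flows that OVS reports as offloaded (empty in the failing case)
ovs-appctl dpctl/dump-flows type=offloaded
# TC flower rules on the VF representor and the PF; on newer kernels the
# in_hw / not_in_hw flag shows whether the hardware accepted a rule
tc -s filter show dev mlx_0 ingress
tc -s filter show dev eth3 ingress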
The first clue was an error reported in dmesg.
Investigation showed that when OpenvSwitch created the vxlan device it did not register the VXLAN destination port with the NIC. OpenvSwitch normally does this through the vxlan device's netdev_ops->ndo_add_vxlan_port interface, but on newer kernels such as ukernel-4.14 this has been replaced by netdev_ops->ndo_udp_tunnel_add.
We later submitted the patch "datapath: support upstream ndo_udp_tunnel_add in net_device_ops" (https://patchwork.ozlabs.org/patch/953417/) to OpenvSwitch to fix this.
3. Decap packets cannot be offloaded
With the above problem fixed, encap flows in the egress direction were offloaded correctly, but decap flows in the ingress direction still were not.
In case 2, the VXLAN decap flows showed up on the mlx_0 VF representor, so we suspected the decap rules were also being sent to the VF port. Since the TC rule is installed on the virtual vxlan_sys device, the problem most likely lay in how the physical NIC for that rule was being looked up.
Code analysis showed that the physical NIC behind the virtual device is located through the action device, i.e. the mlx_0 VF, and that the tc_flower rule OpenvSwitch sends to mlx_0 carries the flag egress_dev = true. From this, the TC rule should end up installed on the PF that the VF belongs to.
Following this inference, we examined the mlx5 driver backport backports/0060-BACKPORT-drivers-net-ethernet-mellanox-mlx5-core-en_.patch and found that ukernel-4.14 does support the cls_flower->egress_dev flag, but the compatibility check does not set HAVE_TC_TO_NETDEV_EGRESS_DEV. We therefore concluded that the mlx5_core driver was misjudging kernel compatibility, and submitted a corresponding patch to Mellanox to fix it.
4. Encap packets sent via the backend tap device are dropped
During live migration the backend tap device has to be used. When sending a packet, OpenvSwitch installs the TC rule on the tap device and relies on TC's in_sw path to perform tunnel_key set and then forward the packet to the gre_sys device for transmission. To our surprise, the gre_sys device simply dropped the packet.
Analysis showed that in TC offload's in_sw case the packet bypasses OpenvSwitch's forwarding logic and is transmitted directly through the gre_sys device. However, we were using the kernel module shipped with OpenvSwitch-2.10.0, whose compatibility checks at build time decide that ukernel-4.14 does not support USE_UPSTREAM_TUNNEL. As a result, gre_sys is not the kernel's native GRE device but a device created by OpenvSwitch without an ndo_start_xmit function; the OpenvSwitch kernel GRE tunnel does not actually transmit through the gre_sys device.
Although ukernel-4.14 does not support USE_UPSTREAM_TUNNEL in general, the kernel's native GRE device does support ndo_start_xmit transmission based on ip_tunnel_key, so for the native GRE device the USE_UPSTREAM_TUNNEL flag is actually valid.
OpenvSwitch could therefore make this distinction in its acinclude.m4 compatibility checks.
However, OpenvSwitch bases this check on both GRE and ERSPAN, and on ukernel-4.14 the USE_UPSTREAM_TUNNEL flag does not hold for ERSPAN.
We then backported the upstream patch series "ERSPAN version 2 (type III) support" (https://patchwork.ozlabs.org/cover/848329/), so that OpenvSwitch detects the kernel as supporting USE_UPSTREAM_TUNNEL, which solved the gre_sys packet-drop problem.
5. Ethernet over GRE tunnel cannot be offloaded
After applying the Ethernet over GRE patch provided by Mellanox, we found that decap flows in the ingress direction still could not be offloaded.
The reason is that no TC ingress qdisc was created on the gre_sys device: OpenvSwitch obtains the ifindex of the device on which to install TC rules through the vport's get_ifindex callback, but the GRE tunnel vport type did not implement get_ifindex.
We consulted upstream OpenvSwitch and solved this with the patch "netdev-vport: Make gre netdev type to use TC rules".
In addition, packets encapsulated by the offloaded egress path could not be received by the remote side. Packet captures showed that the GRE header carried a checksum field, even though the GRE tunnel in OpenvSwitch was configured without the csum option.
Reading the code showed that the TC tunnel_key set action enables the checksum by default and only turns it off when TCA_TUNNEL_KEY_NO_CSUM is set explicitly; OpenvSwitch-2.10.0 had not made this adaptation.
We again consulted upstream OpenvSwitch and finally solved this with the patch "netdev-tc-offloads: TC csum option is not matched with tunnel configuration".
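For reference, the checksum behaviour maps to the tunnel port's csum option in OpenvSwitch; a sketch of a GRE port with the checksum explicitly disabled (reusing the remote address from the VXLAN example above) would be:

# GRE tunnel port with the GRE checksum explicitly disabled, so the offloaded
# tunnel_key set action should likewise omit the csum flag
ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre \
    options:remote_ip=172.168.152.208 options:csum=false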
In summary, we have described in detail how the UCloud 25G SmartNIC solution was selected, along with the technical problems encountered in practice and their solutions. By completing missing functionality and fixing bugs across ukernel, OpenvSwitch and the mlx5_core driver, we successfully brought this open-source scheme into production.
Performance comparison
After the rollout, we ran a series of performance tests on the OpenvSwitch-offloaded 25G SmartNIC scheme, measuring both vSwitch forwarding performance and virtual NIC performance. The results show:
The receive performance of a single VF reaches 15 Mpps.
The forwarding performance of the entire vSwitch reaches 24 Mpps.
By comparison, in the traditional pure-software environment the vSwitch forwards about 2 Mpps and a virtual NIC receives only about 1.5 Mpps, so the overall NIC performance has improved by more than 10 times.
Applied to a CVM with the same 8-core configuration, and taking reception of 1-byte UDP packets as an example, the new solution reaches about 4.69 million PPS, versus about 1.08 million with the original scheme.
Follow-up plan
The solution has now been successfully deployed on the public cloud and will be launched as the Network Enhancement 2.0 CVM, giving the CVM more than 4 times the network capability of the current Network Enhancement 1.0 version. Going forward, we plan to bring the solution to the bare-metal physical CVM product, so that public cloud and physical CVMs are consistent in function and topology, and to investigate offloading of stateful firewall/NAT.