Shulou(Shulou.com)06/02 Report--
At the recently concluded "2020 Virtual Developer and Test Forum", Kiran KN, an engineer at Juniper Networks, and his colleagues presented a set of performance improvements to the Tungsten Fabric data plane, enabled by Intel DDP technology. The following are the highlights of the talk:
vRouter as a DPDK application
Before diving into DDP, let's first introduce vRouter: what it is and where it fits in the overall Tungsten Fabric architecture.
vRouter can be deployed on a regular x86 server, typically as an OpenStack or Kubernetes compute node. It is the main data plane component of Tungsten Fabric, and it comes in two deployment modes: vRouter as a kernel module and vRouter as a DPDK application.
This use case involves the DPDK vRouter. vRouter's responsibility is the data plane: forwarding packets. Forwarding is programmed by the vRouter agent on the compute node, while the configuration itself comes from the controller over XMPP; the agent communicates with the controller via XMPP and uses a dedicated interface to program the vRouter data plane to forward packets.
As a DPDK application, vRouter is a high-performance, multi-core, multi-threaded program, and it is essential to use those cores correctly. As the example shows, the NIC is configured with the same number of queues as vRouter forwarding cores, with one core assigned to each queue. Packets first need to be distributed evenly by the NIC across all forwarding cores. To do this, a 5-tuple hash is used to spread the traffic across the cores. Proper load balancing is per packet, and each packet must carry a full 5-tuple: source and destination IP addresses, source and destination ports, and the protocol. With this information the NIC can distribute flows evenly, so that all of vRouter's DPDK cores contribute to forwarding performance. Once the packets have been spread across the processing cores, they are delivered to the virtual machines through the TX queues of their interfaces.
If the traffic is not properly balanced across the cores by the NIC, vRouter has to rebalance it itself: one core pulls the packets, rehashes them, and redistributes them to the other cores. This is expensive, because it consumes CPU cycles and introduces additional latency. This used to be the problem we faced, but it is now largely solved: we can expect the NIC to do this work and balance the traffic across the vRouter cores properly.
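To make the idea of 5-tuple based distribution concrete, here is a minimal, self-contained C sketch that hashes the five fields and maps the result onto a forwarding core. It is only illustrative: real NICs use a Toeplitz-style RSS hash and vRouter has its own implementation; the struct and function names below are hypothetical.

```c
/* Illustrative sketch only: a simplified 5-tuple hash that maps a flow to
 * one of N forwarding cores. Real NICs use Toeplitz-style RSS and vRouter
 * has its own hashing; names here are hypothetical. */
#include <stdint.h>
#include <stdio.h>

struct five_tuple {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
};

/* FNV-1a style mixing of one field into the running hash. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 16777619u;
    return h;
}

/* Mix all five fields so that distinct flows spread evenly across cores. */
static uint32_t hash_five_tuple(const struct five_tuple *ft)
{
    uint32_t h = 2166136261u;
    h = mix(h, ft->src_ip);
    h = mix(h, ft->dst_ip);
    h = mix(h, ft->src_port);
    h = mix(h, ft->dst_port);
    h = mix(h, ft->proto);
    return h;
}

int main(void)
{
    const unsigned n_cores = 4;
    struct five_tuple ft = { 0x0a000001, 0x0a000002, 12345, 80, 17 /* UDP */ };
    printf("flow -> core %u\n", hash_five_tuple(&ft) % n_cores);
    return 0;
}
```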
The problem is that compute nodes using MPLSoGRE encapsulation do not provide enough entropy. Entropy here means the amount of flow-identifying information in the headers that can be used to balance the load. Ideally a packet carries the complete 5-tuple: source IP, destination IP, source port, destination port, and protocol. But the outer headers of an MPLSoGRE packet expose only three of these: source IP, destination IP, and protocol. As a result, the NIC cannot balance the packets properly: all packets between a given pair of compute nodes hash to the same value, land in the same queue, and are handled by the same CPU core, so that one NIC queue becomes the bottleneck for the whole compute node and performance suffers. For example, suppose there are thousands of flows between a pair of compute nodes. Ideally we want them spread evenly across all cores so that different CPUs can process the packets in parallel. But with MPLSoGRE the visible headers do not have enough entropy, so even under heavy traffic the NIC will not distribute the packets across its queues: instead of being spread over multiple cores, all the packets are delivered to a single core.
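To see why the entropy is missing, consider a simplified sketch of the outer headers a NIC parses on an MPLSoGRE packet. The layout below is illustrative only (not vRouter's actual structures): there is no outer UDP or TCP header, so the only fields available for hashing are the two IP addresses and the protocol number.

```c
/* Sketch of the outer headers seen on an MPLSoGRE packet between two
 * compute nodes. Layout is simplified for illustration; the real 5-tuple
 * lives in the inner headers, which a stock parser never reaches. */
#include <stdint.h>
#include <stdio.h>

struct outer_ipv4_hdr {
    uint8_t  ver_ihl;
    uint8_t  tos;
    uint16_t total_len;
    uint16_t id;
    uint16_t frag_off;
    uint8_t  ttl;
    uint8_t  proto;      /* 47 = GRE: the only entropy beyond the IPs */
    uint16_t checksum;
    uint32_t src_ip;     /* compute node A */
    uint32_t dst_ip;     /* compute node B */
};

struct gre_hdr {
    uint16_t flags_ver;  /* no ports anywhere in this header */
    uint16_t proto;      /* 0x8847 = MPLS unicast */
};

struct mpls_hdr {
    uint32_t label_tc_s_ttl;  /* label identifies the VRF, not the flow */
};
/* ... the inner IP/UDP headers carrying the real 5-tuple follow here ... */

int main(void)
{
    printf("outer bytes before the inner 5-tuple: %zu\n",
           sizeof(struct outer_ipv4_hdr) + sizeof(struct gre_hdr) +
           sizeof(struct mpls_hdr));
    return 0;
}
```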
So even though there are many CPU cores, the setup is essentially bottlenecked, because every packet has to pass through core C1. Packets cannot go directly to C2, C3, or C4, since the hardware never steers traffic to them; every other core has to receive its packets from C1, which is obviously overloaded.
Introducing DDP to eliminate the bottleneck
We have introduced a new capability in the Tungsten Fabric data plane that eliminates the bottleneck for MPLSoGRE packets and makes performance scale with the number of CPU cores. No single CPU core becomes a bottleneck, because the NIC hardware distributes packets evenly across all of them. The solution is powered by Intel DDP (Dynamic Device Personalization) technology, available on the Intel Ethernet 700 series. With Intel's move to a programmable pipeline model, which also brought features such as firmware upgradability, DDP allows the packet-processing pipeline in the NIC to be reconfigured dynamically at run time, without restarting the server. Software can apply custom profiles to the NIC; these profiles can be thought of as add-ons that end users can build themselves. Once a profile is flashed to the NIC, the hardware can start recognizing and classifying new packet types on the fly and steering those packets to different Rx queues.
Here is how this applies to MPLSoGRE. In the first illustration, an MPLSoGRE packet is processed without DDP: the NIC does not parse the inner headers of the packet, so it does not have enough information to distribute the packets correctly. In the second illustration, with DDP enabled, the applied profile lets the NIC recognize the inner packet headers, including the inner IP header and inner UDP header, so it can use that information when computing the hash.
How does an end user create a DDP profile for their own packet type? The workflow looks like this:
1. Use Intel's Profile Editor tool to create a new parser entry or modify an existing one. Intel also publishes standard profiles that can be downloaded directly from its website.
2. Create a new profile for the MPLSoGRE packet that defines the structure of the packet headers at each layer.
3. Compile it into a binary package that can be applied to the NIC.
4. Use the DPDK API to load the profile onto the NIC on each port (a sketch of this step is shown below).
5. The NIC can now recognize MPLSoGRE packets.
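For step 4, DPDK's i40e poll-mode driver exposes a PMD-specific call for pushing a compiled DDP package to an Intel 700-series NIC at run time. The sketch below shows roughly how a profile file could be loaded; the profile filename is hypothetical, the exact symbols may vary between DPDK versions, and in a real application this would run after rte_eal_init() and port setup.

```c
/* Minimal sketch, assuming the i40e PMD-specific DDP API in rte_pmd_i40e.h.
 * Exact symbol names may differ across DPDK versions; the profile path is
 * hypothetical. Must run inside an initialized DPDK application. */
#include <stdio.h>
#include <stdlib.h>
#include <rte_pmd_i40e.h>

static int load_ddp_profile(uint16_t port_id, const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    uint8_t *buf = malloc(size);
    if (!buf || fread(buf, 1, size, f) != (size_t)size) {
        fclose(f);
        free(buf);
        return -1;
    }
    fclose(f);

    /* Push the compiled profile to the NIC firmware at run time. */
    int ret = rte_pmd_i40e_process_ddp_package(port_id, buf, (uint32_t)size,
                                               RTE_PMD_I40E_PKG_OP_WR_ADD);
    free(buf);
    return ret;
}

/* Usage (hypothetical profile name):
 *   load_ddp_profile(0, "mplsogre.pkg");
 */
```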
Testing and confirming the performance improvement
Next, we needed to test and confirm that DDP actually delivers a performance improvement. Our testing framework is the one widely used to develop and test vRouter. We use a proxy-based approach so that we can quickly prove concepts and collect data. All vRouter performance tests, including encapsulation and decapsulation between compute nodes, always include the overlay network. As the illustration shows, we use a third entity, a jump VM, which carries only control traffic (no data-plane traffic); it sends instructions to the test VMs and collects the results. To determine throughput we use a binary search with a packet loss threshold of 0.001%, following a standard test framework and specification.
vRouter provides scripts that show per-core packet-processing statistics, which let us verify that the NIC really spreads the traffic across all cores. These statistics are taken from the VM0 interface, that is, the interface connected to the physical NIC. On the left, you can see that core 1 is not processing packets, which means this core does not receive any packets forwarded by vRouter; it is busy polling packets from the NIC and distributing them to the other available vRouter cores. This means vRouter itself becomes the bottleneck, because all traffic entering vRouter has to be pulled from the NIC queue by this one core and then redistributed across the other cores before being forwarded to the VMs. On the right, with DDP, the NIC has distributed the traffic correctly and the Rx traffic is almost equal across all cores, proving that the NIC has done its job and balanced the load.
The performance results show the difference with and without DDP. With three or fewer cores there is no benefit from DDP, because a single core is still fast enough to poll the NIC queue and redistribute the traffic to the other cores. But as the number of cores grows and overall performance should increase, that single core becomes the bottleneck: without DDP, performance stops improving no matter how many cores you add, because one core is always pulling all the traffic. In the results without DDP, roughly 6.5 Mpps is the maximum a single core can poll from the NIC queue.
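As a rough illustration of the binary-search method mentioned above, the sketch below halves the search interval around the offered rate until it finds the highest rate whose measured loss stays at or below 0.001%. The send_at_rate() function is a hypothetical stand-in for the traffic generator; here it simply pretends that loss appears above the 6.5 Mpps single-core limit just mentioned.

```c
/* Illustrative sketch of a binary search for the highest rate with at most
 * 0.001% packet loss. send_at_rate() is a hypothetical stand-in for the
 * traffic generator, not part of any real test framework. */
#include <stdio.h>

/* Pretend loss appears above the 6.5 Mpps single-core polling limit. */
static double send_at_rate(double rate_mpps)
{
    return rate_mpps <= 6.5 ? 0.0 : 0.01;   /* returns loss ratio */
}

static double find_max_rate(double lo, double hi, double loss_threshold)
{
    double best = lo;
    while (hi - lo > 0.01) {                /* 0.01 Mpps resolution */
        double mid = (lo + hi) / 2.0;
        if (send_at_rate(mid) <= loss_threshold) {
            best = mid;                     /* passed: try a higher rate */
            lo = mid;
        } else {
            hi = mid;                       /* failed: try a lower rate */
        }
    }
    return best;
}

int main(void)
{
    double max_rate = find_max_rate(0.0, 20.0, 0.00001);  /* 0.001% loss */
    printf("max rate within loss threshold: %.2f Mpps\n", max_rate);
    return 0;
}
```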
With DDP, as the number of cores increases, each core receives the same amount of traffic from the NIC, and once the number of cores exceeds six the gain becomes substantial. As we can see, the gain with six cores is about 73%, which is a really nice number. DDP not only improves throughput but also reduces latency, because vRouter no longer has to rebalance traffic across cores or compute a hash for every packet; the NIC does that work instead. Latency improves by about 40% on average and up to 80% at best, which is already a great result.
To sum up, for multi-core use cases we gain a great deal from DDP technology. It also matters for 5G use cases that DDP reduces latency. Wherever we want to use MPLSoGRE with vRouter, we are ready to deploy 5G applications on multiple cores.
[Download the PDF related to this article]
Https://tungstenfabric.org.cn/assets/uploads/files/tf-vrouter-performance-improvements.pdf
[video link]
Https://v.qq.com/x/page/j3108a4m1va.html