What is DPDK?

2025-02-14 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/03 Report

This article explains what DPDK is, why it exists, and how it achieves high-performance network IO. The methods below have been gathered and organized into a simple, practical form; I hope it helps answer your doubts about DPDK.

I. The situation and trend of network IO

As users we can feel that network speed keeps improving, and network technology has evolved through 1GE/10GE/25GE/40GE/100GE. From this we can conclude that the network IO capability of a single machine must keep up with the times.

1. Traditional telecom field

Devices at and below the IP layer, such as routers, switches, firewalls, and base stations, all use hardware solutions: some based on dedicated network processors (NP), some on FPGAs, and most on ASICs. But the disadvantages of hardware are obvious: bugs are hard to fix, and the devices are hard to debug and maintain. Meanwhile network technology keeps evolving, such as the 2G/3G/4G/5G mobile generations, so business logic implemented in hardware is too painful to iterate quickly. The challenge in the traditional field is the urgent need for a high-performance network IO development framework with a software architecture.

2. The development of the cloud

The rise of the private cloud, sharing hardware through network function virtualization (NFV), has become a trend. NFV means implementing a variety of traditional or new network functions with standard servers and standard switches. This urgently needs a high-performance network IO development framework based on commodity systems and standard servers.

3. The surge in single-machine performance

With NICs growing from 1G to 100G and CPUs from single-core to multi-core to multi-socket, the capacity of a single server has reached a new high through horizontal scaling of hardware. But software development has not kept pace, and single-machine processing power does not match the hardware. How do we develop services with throughput that keeps up with the times, with millions of concurrent connections on one machine? Even businesses without high QPS requirements, which are mainly CPU-intensive, are affected: big data analysis, artificial intelligence, and similar applications need to move large amounts of data between distributed servers to complete a job. This should be what we Internet backend developers care about, and relate to, the most.


II. The IO bottleneck of Linux + x86 networking

A few years ago I wrote "Working Principle of the Network Card and Tuning under High Concurrency", which described the packet send/receive flow in Linux. As a rule of thumb, on a C1 (8-core) machine, every 10K packets per second of processing costs about 1% of a CPU in soft interrupts, which puts the single-machine ceiling at roughly 1 million PPS (Packets Per Second). TGW (the Netfilter version) reaches 1 million PPS, and AliLVS, well optimized, reaches 1.5 million PPS, on fairly well-configured servers. Now suppose we want to saturate a 10GE NIC with 64-byte packets; naively that is about 20 million PPS. (Note: the actual ceiling for a 10 Gigabit Ethernet NIC is 14.88 million PPS, because the minimum frame occupies 84 bytes on the wire; see "Bandwidth, Packets Per Second, and Other Network Performance Metrics".) 100GE means 200 million PPS by the same math, leaving a per-packet budget of only about 5 nanoseconds on a single core. Yet a single cache miss, whether TLB, data cache, or instruction cache, costs about 65 nanoseconds to go back to memory, and cross-node communication on a NUMA system costs about 40 nanoseconds. So even with no business logic at all, pure packet send/receive is already this hard. We must control the cache hit rate, understand the computer architecture, and avoid cross-node communication.
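The 14.88 million figure follows from Ethernet framing overhead: a minimum 64-byte frame occupies 64 + 8 (preamble) + 12 (inter-frame gap) = 84 bytes on the wire. A small sketch to check the arithmetic (the function names are illustrative, not DPDK APIs):

```c
#include <stdint.h>

/* Packets per second at a given line rate for minimum-size Ethernet
 * frames: a 64-byte frame occupies 64 + 8 (preamble) + 12 (inter-frame
 * gap) = 84 bytes = 672 bits on the wire. */
static inline uint64_t line_rate_pps(uint64_t bits_per_second)
{
    return bits_per_second / (84 * 8);
}

/* Per-packet time budget in nanoseconds at that packet rate. */
static inline double pps_budget_ns(uint64_t pps)
{
    return 1e9 / (double)pps;
}
```

At 10 Gbit/s this gives 14,880,952 PPS and a budget of roughly 67 ns per packet, which is why a single 65 ns cache miss already blows the budget.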

I hope these numbers convey directly how big the challenge is; we have to balance the ideal against reality. The problems are:

1. In the traditional way of sending and receiving packets, a hard interrupt is required for notification, and each hard interrupt takes about 100 microseconds, not counting the cache misses caused by the context switch.

2. Data must be copied between kernel mode and user mode, which costs a lot of CPU and global lock contention.

3. Both sending and receiving packets incur system call overhead.

4. The kernel runs on multiple cores, and maintaining global consistency means that even with lock-free techniques, the performance cost of bus locking and memory barriers cannot be avoided.

5. The path from the NIC to the business process is too long, and some of it is actually unnecessary (the netfilter framework, for example); it adds overhead and invites cache misses.

III. The basic principles of DPDK

From the analysis above, we can see that the bottlenecks are all in the kernel: the way IO is implemented, the kernel itself, and the uncontrollable factors of data flowing through the kernel. To solve the problem we must bypass the kernel. So the mainstream solution is NIC IO bypass: skip the kernel and send and receive packets directly in user mode, removing the kernel bottleneck.

The Linux community also provides a bypass mechanism, Netmap; official figures give 14 million PPS on a 10G NIC. But Netmap is not widely used, for several reasons:

1. Netmap needs driver support, i.e. the approval of the NIC manufacturer.

2. Netmap still relies on the interrupt notification mechanism, so it does not completely remove the bottleneck.

3. Netmap is more like a few system calls for sending and receiving packets directly in user mode. It is too primitive to serve as a network development framework to build on, and its community is not mature.

So let's look at DPDK, which has been developed for more than ten years, from Intel-led beginnings to the participation of Huawei, Cisco, AWS and other major vendors; the core players are all in this circle, with a mature community and a closed-loop ecosystem. Early on, its users were the layer-3-and-below applications of the traditional telecom field, such as Huawei, China Telecom, and China Mobile, with switches, routers, and gateways as the main scenarios. But with the needs of upper-layer businesses and the improvement of DPDK itself, higher-level applications are gradually emerging.

DPDK Bypass principle:

The picture is quoted from Jingjing Wu's document "Flow Bifurcation on Intel® Ethernet Controller X710/XL710".

On the left is the original path: NIC -> driver -> protocol stack -> Socket interface -> business.

On the right is the DPDK path, bypassing the kernel with UIO (Userspace I/O): NIC -> DPDK polling mode -> DPDK base libraries -> business.

The advantages of user mode are ease of use, development, and maintenance, plus good flexibility. And a crash does not affect the kernel, so robustness is strong.

CPU architectures supported by DPDK: x86, ARM, PowerPC (PPC)

List of NICs supported by DPDK: https://core.dpdk.org/supported/; the mainstream ones we use are the Intel 82599 (optical ports) and Intel X540 (electrical ports).

IV. UIO, the cornerstone of DPDK

To let a driver run in user mode, Linux provides the UIO mechanism. With UIO you can sense interrupts through read() and communicate with the NIC through mmap().

UIO principle:

There are several steps to developing a user-mode driver:

1. Develop a small UIO module that runs in the kernel, because hard interrupts can only be handled in the kernel.

2. Read interrupts via /dev/uioX.

3. Share memory with the device through mmap().
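A minimal user-space sketch of steps 2 and 3, assuming a UIO device has already been bound by the kernel-side module (the device path /dev/uio0 and the mapping size are illustrative assumptions; real register layouts come from the device's sysfs entries):

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Step 2: a blocking read() on a UIO fd returns the cumulative
 * interrupt count as a 32-bit integer, waking when an IRQ fires. */
static int uio_wait_irq(int fd, uint32_t *irq_count)
{
    return read(fd, irq_count, sizeof(*irq_count)) == sizeof(*irq_count)
               ? 0
               : -1;
}

/* Step 3: map the device's register/memory region into user space so
 * the process can talk to the NIC without further system calls. */
static void *uio_map_region(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

In a real driver you would open("/dev/uio0", O_RDWR), loop on uio_wait_irq(), and poke device registers through the pointer returned by uio_map_region().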

V. DPDK's core optimization: PMD

DPDK's UIO driver shields the hardware's interrupts and instead polls actively in user mode; this is the PMD (Poll Mode Driver).

With UIO bypassing the kernel and active polling removing hard interrupts, DPDK can send and receive packets entirely in user mode. This brings the benefits of zero copy and no system calls, while synchronous processing removes the cache misses caused by context switching.

A core running PMD sits at 100% user-mode CPU.

When the network is idle, the CPU spins uselessly for long periods, which creates an energy consumption problem, so DPDK introduced the Interrupt DPDK mode.

Interrupt DPDK:

The picture is quoted from David Su/Yunhong Jiang/Wei Wang's document "Towards Low Latency Interrupt Mode DPDK".

Its principle is very similar to NAPI: when there are no packets to process it goes to sleep and switches to interrupt notification. It can also share a CPU core with other processes, with the DPDK process given a higher scheduling priority.

VI. How DPDK implements high-performance code

1. Using HugePage to reduce TLB Miss

By default Linux uses 4KB pages. The smaller the page, the more pages a large memory needs, the bigger the page tables, and the more memory the page tables themselves occupy. The CPU's TLB (Translation Lookaside Buffer) is expensive, so it can only hold hundreds to thousands of page entries. If a process uses 64GB of memory, that is 64GB / 4KB = 16,777,216 (about 16 million) pages; at 4 bytes per entry, the page-table entries alone occupy about 64MB. Using 2MB HugePages instead, only 64GB / 2MB = 32,768 pages are needed; the counts are not on the same order of magnitude.
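The page-count arithmetic above can be checked with a couple of trivial helpers (the 4-byte entry size follows the text's assumption; real x86-64 page-table entries are larger):

```c
#include <stdint.h>

#define KB 1024ULL
#define MB (1024ULL * KB)
#define GB (1024ULL * MB)

/* Number of pages needed to map `mem` bytes with `page`-byte pages. */
static inline uint64_t pages_needed(uint64_t mem, uint64_t page)
{
    return mem / page;
}

/* Bytes of page-table entries, assuming 4 bytes per entry as in the
 * text above (an illustrative simplification). */
static inline uint64_t pte_bytes(uint64_t mem, uint64_t page)
{
    return pages_needed(mem, page) * 4;
}
```

64GB with 4KB pages needs 16,777,216 entries (about 64MB of entries); with 2MB HugePages, only 32,768.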

DPDK uses HugePages, supporting page sizes of 2MB and 1GB on x86-64, which geometrically reduces the number of page-table entries and thus TLB misses. It also provides base libraries such as the memory pool (Mempool), MBuf, lock-free ring (Ring), and Bitmap. In our experience, for the frequent allocation and freeing on the data plane you must use the memory pool; do not call rte_malloc directly, as DPDK's memory allocator is quite crude, inferior to ptmalloc.

2. SNA (Shared-nothing Architecture)

Decentralize the software architecture and avoid global sharing as much as possible, since global sharing brings contention and loses the ability to scale horizontally. On NUMA systems, do not use memory remotely across nodes.

3. SIMD (Single Instruction Multiple Data)

SIMD capability has kept growing, from the earliest MMX/SSE to the latest AVX2. DPDK first batches packets, processing multiple packets at the same time, and then uses vector programming to handle a whole batch in one loop. memcpy, for example, uses SIMD for speed.

SIMD is common in game backends, but if another business has a similar batch-processing scenario, it is also worth checking whether it fits, to improve performance.
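As a tiny hedged illustration of the idea (using GCC/Clang vector extensions rather than DPDK's own routines), here is a copy loop in the spirit of a SIMD memcpy: 16 bytes per step in the vector body, with a scalar tail for the remainder. The function name is illustrative:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A 16-byte vector type; the compiler lowers loads/stores of it to
 * SSE (x86) or NEON (ARM) instructions where available. */
typedef uint8_t v16u8 __attribute__((vector_size(16)));

static void copy16_simd(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {   /* vector body: 16 bytes per step */
        v16u8 v;
        memcpy(&v, src + i, 16);     /* unaligned-safe vector load */
        memcpy(dst + i, &v, 16);     /* vector store */
    }
    for (; i < n; i++)               /* scalar tail */
        dst[i] = src[i];
}
```

The memcpy-through-a-vector-variable idiom is the standard way to express unaligned SIMD loads without undefined behavior; the compiler folds it into single vector instructions.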

4. Do not use slow APIs

Here we need to redefine what a slow API is. Take gettimeofday: on 64-bit systems, thanks to the vDSO, it no longer traps into the kernel; it is just a memory access and can sustain tens of millions of calls per second. But do not forget that at 10GE we need tens of millions of packets of processing per second, so even gettimeofday counts as a slow API. DPDK provides Cycles interfaces instead, such as rte_get_tsc_cycles, implemented on top of HPET or TSC.

On x86-64 this uses the RDTSC instruction to read the timestamp counter directly from registers; the result comes back in two 32-bit halves (EDX:EAX). A common implementation:

    static inline uint64_t
    rte_rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
    }

5. Use CPU instructions

Modern CPUs provide many instructions that perform common functions directly. Endianness conversion, for example, is supported directly on x86 by the bswap instruction.

    static inline uint64_t
    rte_arch_bswap64(uint64_t _x)
    {
        register uint64_t x = _x;
        asm volatile ("bswap %[x]" : [x] "+r" (x));
        return x;
    }

This is also how glibc implements it: first try constant optimization, then CPU-instruction optimization, and finally fall back to bare code. These are top programmers after all, each with their own pursuit of language, compiler, and implementation; understand the existing wheel well before building your own.
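For comparison, GCC and Clang expose the same operation portably as a builtin, which on x86-64 compiles down to a single bswap instruction (the wrapper name is illustrative; the sample values are arbitrary):

```c
#include <stdint.h>

/* Byte-swap a 64-bit value via the compiler builtin; on x86-64 this
 * lowers to one bswap instruction, matching the inline-asm version. */
static inline uint64_t bswap64_portable(uint64_t x)
{
    return __builtin_bswap64(x);
}
```

Swapping twice returns the original value, which makes the behavior easy to sanity-check.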

Google's open-source cpu_features can report which features the current CPU supports, so you can optimize for a specific CPU. High-performance programming is endless; understanding of hardware, kernel, compiler, and language must go deep and keep pace with the times.

VII. The DPDK ecosystem

For Internet backend development like ours, the capabilities provided by the DPDK framework itself are still fairly bare. For example, to use DPDK you must implement basic functions such as ARP and the IP layer yourself, which makes it hard to get started. If you want higher-level services, you also need user-mode transport protocol support. Using DPDK directly is not recommended.

At present, the application-layer development project with the soundest ecosystem and strongest community (backed by first-tier vendors) is FD.io (The Fast Data Project), with the Cisco open-sourced VPP and fairly complete protocol support: ARP, VLAN, Multipath, IPv4/v6, MPLS, and so on. For user-mode transport protocols (UDP/TCP) there is TLDK. From project positioning to community support, it is a relatively reliable framework.

Tencent Cloud's open-source F-Stack is also worth attention. It is easier to develop with and directly provides POSIX interfaces.

Seastar is also powerful and flexible: it can switch between the kernel stack and DPDK at will, and it has its own Seastar Native TCP/IP Stack. However, no large-scale project uses Seastar yet, so there may be many holes left to fill.

Our GBN gateway project needs to support L3/IP-layer access as a WAN gateway, at 20GE per machine, and is developed on DPDK.

At this point, our study of "what is DPDK" comes to an end. Theory works best when paired with practice, so go and try it!
