2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article describes how a DPDK problem in a load balancing product was located and fixed in a production environment. Load balancing involves many practical details that theory and books rarely cover, so the account below draws on hands-on industry experience.
ULB4 is a highly available Layer 4 load balancing product that UCloud developed in-house on top of DPDK, a high-performance open source data plane development kit. Its forwarding capacity is close to line rate. As the global entry point for user applications, ULB4 must keep user services stable under heavy and diverse traffic, which is also the technical mission of the UCloud network product team. In the production network, a single ULB cluster already carries 10G of bandwidth at 830,000 PPS in a complex running environment. Even when unexpected factors arise (such as an unknown bug being triggered), the product should keep running normally to avoid serious impact.
Recently, we found a DPDK packet transmission anomaly in the production environment of ULB4. Because the whole ULB product is built as a cluster, the anomaly did not make user services unavailable. Still, to guarantee service stability at all times, the team used GDB, a message export tool, and traffic mirroring in the production environment to capture the abnormal messages out of GB-level production traffic. Combined with analysis of the DPDK source code, the team traced the problem to a bug in DPDK itself and fixed it. Users' business was not affected during this period, which further safeguards the stable operation of the tens of thousands of ULB instances at UCloud.
Starting from the symptoms and working through them layer by layer, this article describes the whole process of locating, analyzing, and solving the problem, in the hope of providing a reference for ULB users and DPDK developers.
Problem background
In early December, failover was unexpectedly triggered in a ULB4 cluster that had been running stably: one ULB4 host behaved abnormally and was automatically removed from the cluster. The symptoms at the time were:
The forwarding service's monitoring showed normal traffic in the receive direction of the NIC but zero traffic in the transmit direction. After the forwarding service was restarted, sending and receiving returned to normal. Other machines in the cluster also hit the same anomaly at irregular intervals. For user business, this showed up as slight connection jitter on a small number of connections, followed by quick recovery.
The rest of this article covers the whole handling process. We tried several approaches along the way, finally completed the analysis and fix with the help of the DPDK source code, and plan to open source the message export tool we developed.
Problem location and analysis
The ULB4 cluster had been running stably, then the same problem suddenly appeared on different machines in the cluster, and it recurred some time after each machine rejoined the cluster. Based on our operational experience, our initial guess was that some kind of abnormal message was triggering a bug in the program. But with GB-level traffic, how could we capture the abnormal messages? And how could we find the problem without affecting the business?
First, debug with GDB to find the suspicious point
To find out why the program had stopped sending, the best approach is to get inside the program and watch what it actually executes. For a DPDK user-mode program, GDB is the obvious tool. We set a breakpoint in the packet transmit logic and examined the function's execution with the disassemble command. The disassembly is more than 700 lines long, because many of the functions called inside this function are declared inline, which produces a large number of instructions after compilation.
With the source code of the corresponding DPDK version at hand, we single-stepped through the instructions. After many attempts, we found that the function always returned directly at the place shown in the following figure.
The general flow is as follows: when i40e_xmit_pkts() is sending and finds the send queue full, it calls i40e_xmit_cleanup() to clean up the queue. After the NIC sends a packet, it writes back a specific field to indicate that the message has been sent, and the driver checks this field to learn whether the message has gone out. The problem here was that the driver believed the messages in the queue had never been sent by the NIC, so subsequent messages could not be added to the queue and were discarded directly.
At this point the direct cause was found: for some reason, the NIC either stopped sending packets or failed to write back the field correctly, so the driver considered the send queue permanently full and could not add subsequent messages to it.
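The interaction described above can be illustrated with a small model. This is only a sketch under simplifying assumptions, not the actual i40e driver code: the ring size, the `tx_desc`/`tx_queue` structures, and the function names are all hypothetical, and the real driver checks a descriptor-type field written back by the NIC rather than a boolean.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8              /* hypothetical ring size, for illustration */

/* One transmit descriptor: the NIC sets `done` once the packet has left. */
struct tx_desc {
    bool done;
};

struct tx_queue {
    struct tx_desc ring[RING_SIZE];
    uint16_t tail;               /* next slot the driver will fill */
    uint16_t next_to_clean;      /* oldest slot not yet reclaimed */
    uint16_t nb_free;            /* slots currently available */
};

void txq_init(struct tx_queue *q)
{
    for (size_t i = 0; i < RING_SIZE; i++)
        q->ring[i].done = false;
    q->tail = 0;
    q->next_to_clean = 0;
    q->nb_free = RING_SIZE;
}

/* Reclaim the oldest descriptor, but only if the NIC has marked it done.
 * Returns 0 on success, -1 if nothing can be reclaimed. */
int txq_cleanup(struct tx_queue *q)
{
    if (q->nb_free == RING_SIZE)
        return -1;               /* nothing outstanding */
    struct tx_desc *d = &q->ring[q->next_to_clean];
    if (!d->done)
        return -1;               /* no write-back from the NIC yet */
    d->done = false;
    q->next_to_clean = (uint16_t)((q->next_to_clean + 1) % RING_SIZE);
    q->nb_free++;
    return 0;
}

/* Try to enqueue nb_pkts packets; returns how many were accepted. */
uint16_t txq_xmit(struct tx_queue *q, uint16_t nb_pkts)
{
    uint16_t sent = 0;
    while (sent < nb_pkts) {
        if (q->nb_free == 0 && txq_cleanup(q) != 0)
            break;               /* ring full, nothing reclaimed: drop the rest */
        q->ring[q->tail].done = false;
        q->tail = (uint16_t)((q->tail + 1) % RING_SIZE);
        q->nb_free--;
        sent++;
    }
    return sent;
}

/* Simulates a healthy NIC completing every outstanding descriptor. */
void nic_complete_all(struct tx_queue *q)
{
    uint16_t outstanding = (uint16_t)(RING_SIZE - q->nb_free);
    for (uint16_t i = 0; i < outstanding; i++)
        q->ring[(q->next_to_clean + i) % RING_SIZE].done = true;
}
```

If `nic_complete_all()` is never called, which models a hung NIC, the first `RING_SIZE` packets are accepted and every later call to `txq_xmit()` returns 0. That matches the observed symptom: receive traffic is normal while transmit traffic drops to zero.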
So why was the queue full? Was an abnormal packet involved? With this question in mind, we made a second attempt.
Second, one-click export of the messages in the NIC queue
The queue was full and later messages could not be added, which means the messages in the queue were stuck there. Since we suspected an abnormal message, might it still be sitting in the queue? If we could export all the messages currently in the queue, we could test our guess further.
Based on our study of DPDK, we exported the messages in the following steps.
The first parameter of i40e_xmit_pkts() is the send queue, so the queue information can be obtained from it.
As shown in the following figure, when the breakpoint is first hit, the register values give the corresponding function parameters.
When we tried to print the queue's contents, we found that symbol information was missing. In this case, the i40e_rxtx.o generated at compile time can be loaded, as shown in the following figure, to obtain the corresponding symbol information.
With the queue information available, we used GDB's dump command to export all the messages in the queue in queue order, naming each message by its sequence number.
The exported messages are still in raw form, so they cannot be viewed conveniently in Wireshark. To solve this, as shown in the following figure, we wrote a small tool with the libpcap library that converts them into a pcap file that Wireshark can parse.
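The conversion step can be sketched as follows. The authors' tool used libpcap and is not reproduced in the text, so this sketch swaps in a different technique: it writes the classic pcap file layout by hand with the C standard library. The struct and function names are hypothetical, and using the queue sequence number as the timestamp is an assumption made here so that the viewer keeps the queue order.

```c
#include <stdint.h>
#include <stdio.h>

/* Classic pcap file header (24 bytes), written in host byte order. */
struct pcap_file_header_sketch {
    uint32_t magic;          /* 0xa1b2c3d4: microsecond timestamps */
    uint16_t version_major;  /* 2 */
    uint16_t version_minor;  /* 4 */
    int32_t  thiszone;       /* GMT offset, normally 0 */
    uint32_t sigfigs;        /* timestamp accuracy, normally 0 */
    uint32_t snaplen;        /* maximum captured length per packet */
    uint32_t linktype;       /* 1 = Ethernet */
};

/* Per-packet record header (16 bytes). */
struct pcap_record_header_sketch {
    uint32_t ts_sec;
    uint32_t ts_usec;
    uint32_t incl_len;       /* bytes stored in the file */
    uint32_t orig_len;       /* bytes on the wire */
};

_Static_assert(sizeof(struct pcap_file_header_sketch) == 24, "unexpected padding");
_Static_assert(sizeof(struct pcap_record_header_sketch) == 16, "unexpected padding");

/* Writes the file header once, before any packet. Returns 0 on success. */
int pcap_write_file_header(FILE *out)
{
    struct pcap_file_header_sketch hdr = {
        .magic = 0xa1b2c3d4u,
        .version_major = 2,
        .version_minor = 4,
        .thiszone = 0,
        .sigfigs = 0,
        .snaplen = 65535,
        .linktype = 1,
    };
    return fwrite(&hdr, sizeof(hdr), 1, out) == 1 ? 0 : -1;
}

/* Appends one raw frame. `seq` is the position of the message in the
 * exported queue and is stored as the timestamp (an assumption, since
 * the raw dump carries no capture time). Returns 0 on success. */
int pcap_write_packet(FILE *out, const uint8_t *data, uint32_t len, uint32_t seq)
{
    struct pcap_record_header_sketch rec = {
        .ts_sec = seq,
        .ts_usec = 0,
        .incl_len = len,
        .orig_len = len,
    };
    if (fwrite(&rec, sizeof(rec), 1, out) != 1)
        return -1;
    if (len > 0 && fwrite(data, 1, len, out) != len)
        return -1;
    return 0;
}
```

A caller would open the output file in binary mode, call `pcap_write_file_header()` once, and then call `pcap_write_packet()` for each dumped message in sequence order.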
Sure enough, as shown in the following figure, the exported messages included one with a length of 26 bytes whose content was all zeros. This message looked highly abnormal and seemed to provide initial confirmation of our guess:
To export messages faster during troubleshooting, we wrote a one-click export tool that, when an anomaly occurs, exports all messages and converts them to pcap format.
After exporting the messages many times, we found a pattern: every export contained a 26-byte all-zero message, each time it was preceded by a message of the same length, and the source IP addresses always belonged to a network segment from the same region.
Third, traffic mirroring to confirm the abnormal packet
The conclusion of the second step moved the investigation a big step forward, but the packets in the queue had already been processed by several stages of the program and were not the original business messages. To get to the bottom of it, we still had to capture the original packets at the critical moment. That night we urgently asked our network operations colleagues to configure port mirroring on the switch, so that traffic destined for the ULB4 cluster was mirrored to an idle server for packet capture. The mirror server needs the following configuration:
Set the NIC to promiscuous mode to collect the mirrored traffic (ifconfig net2 promisc).
Turn off GRO (ethtool -K net2 gro off) so that the most original messages are received and the Linux GRO feature does not merge them in advance.
Based on the regional characteristics of the abnormal IPs, we captured only the traffic from the relevant source IP segments.
Reference command: nohup tcpdump -i net2 -s0 -w %Y%m%d_%H-%M-%S.pcap -G 1800 "proto gre and (((ip[54:4] & 0x11223000) == 0x11223000) or ((ip[58:4] & 0x11223000) == 0x11223000))" &
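To make the capture filter in the reference command easier to read, here is a minimal sketch of the same test in C. The function names are hypothetical. One inference that the source does not state: with an 8-byte GRE header carrying an inner Ethernet frame, offsets 54 and 58 from the start of the outer IP header would land on the inner source and destination IPv4 addresses (20 + 8 + 14 + 12 = 54). The value 0x11223000 appears to be a placeholder for the real address bits.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reads 4 bytes as a big-endian integer, like tcpdump's ip[off:4]. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* `ip` points at the first byte of the outer IPv4 header.
 * Mirrors: proto gre and (((ip[54:4] & mask) == mask) or
 *                         ((ip[58:4] & mask) == mask)) */
bool filter_matches(const uint8_t *ip, size_t len, uint32_t mask)
{
    if (len < 62)
        return false;            /* offsets 54..61 must be inside the packet */
    if (ip[9] != 47)
        return false;            /* IPv4 protocol 47 is GRE */
    uint32_t a = load_be32(ip + 54);
    uint32_t b = load_be32(ip + 58);
    return (a & mask) == mask || (b & mask) == mask;
}
```

Note that `(x & mask) == mask` only checks that every bit set in `mask` is also set in `x`. A strict prefix match would compare against a separate network value, for example `(x & 0xfffff000) == 0x11223000`.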
After many attempts, the failure finally recurred. After layers of filtering, the following message was found:
This is a fragmented IP message, but the strange thing is that the second fragment contains only an IP header. Careful comparison showed that these two messages are exactly the two adjacent messages in the exported queue, and the last 26 bytes match the all-zero message.
In TCP/IP, when an IP message is longer than the MTU, IP fragmentation is triggered and the message is split into several smaller fragments for transmission. Normally every fragment carries data. This fragment, however, is highly abnormal: its IP total length is 20, which means it consists of an IP header alone with no payload after it. Such a message is meaningless. Because it is so short, it was also padded with 26 bytes of zeros after passing through the switch.
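One plausible way to account for the 26 bytes is the Ethernet minimum frame size. The arithmetic below is a sketch that assumes the padding was applied to an untagged Ethernet frame carrying only this 20-byte IP packet. The source does not state where or how the switch added the padding, and the function name is hypothetical.

```c
#include <stdint.h>

#define ETH_HDR_LEN        14u  /* dst MAC + src MAC + EtherType, no VLAN tag */
#define ETH_MIN_FRAME_LEN  60u  /* minimum frame size, excluding the 4-byte FCS */

/* Padding needed to bring an untagged frame that carries an IP packet of
 * ip_total_len bytes up to the Ethernet minimum. */
uint32_t eth_padding_len(uint16_t ip_total_len)
{
    uint32_t frame_len = ETH_HDR_LEN + ip_total_len;
    return frame_len >= ETH_MIN_FRAME_LEN ? 0 : ETH_MIN_FRAME_LEN - frame_len;
}
```

Under these assumptions, `eth_padding_len(20)` is 60 - 14 - 20 = 26, which matches the length of the all-zero message found in the exported queue.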
At this point we had finally found the abnormal message and largely confirmed our guess, but we still had to verify that this message was really the cause. (Judging from the whole message exchange, this was originally a TCP message marked as not to be fragmented, but a public network gateway forcibly allowed fragmentation, and the resulting fragments took this abnormal form.)
Fourth, solution
If this abnormal message really is the cause, it is enough to detect it on receive and discard it. So we modified the DPDK program to discard this kind of message. As verification, we first released the change to one production server, and no abnormal failover occurred after a day of operation. With the root cause confirmed as this kind of abnormal message disrupting DPDK, the fix could be rolled out across the whole network in a gray (staged) release.
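A receive-side check of this kind might look like the sketch below. It is not the authors' patch and not DPDK code: it works on a raw IPv4 header in a byte buffer, and the function and macro names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IPV4_MF_FLAG      0x2000u  /* "more fragments" bit */
#define IPV4_OFFSET_MASK  0x1fffu  /* fragment offset, in 8-byte units */

/* Returns true if `ip` holds an IPv4 fragment that carries no payload,
 * i.e. its total length does not exceed its header length. Buffers that
 * are too short or not IPv4 return false and are left to other checks. */
bool ipv4_fragment_is_empty(const uint8_t *ip, size_t len)
{
    if (len < 20 || (ip[0] >> 4) != 4)
        return false;
    size_t hdr_len = (size_t)(ip[0] & 0x0f) * 4;
    if (hdr_len < 20)
        return false;            /* malformed header, out of scope here */
    uint16_t total_len = (uint16_t)((ip[2] << 8) | ip[3]);
    uint16_t frag = (uint16_t)((ip[6] << 8) | ip[7]);
    bool is_fragment = (frag & (IPV4_MF_FLAG | IPV4_OFFSET_MASK)) != 0;
    return is_fragment && total_len <= hdr_len;
}
```

A forwarding loop would call this before handing a fragment to reassembly and drop the packet when it returns true.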
Fifth, DPDK community feedback
As a responsible member of the open source community, we prepared to report the bug to the DPDK community. When we compared against the latest commits, we found one submitted on November 6th that addresses exactly this situation:
ip_frag: check fragment length of incoming packet
The fix is included in the latest release, DPDK 18.11. Its handling is consistent with ours: the abnormal message is discarded.
Review and summary
After dealing with all the problems, we carried out an overall review.
1. Summary of why ULB4 could not send packets
The whole chain of events that left ULB4 unable to send packets is as follows:
DPDK receives the first fragment of the fragmented message and caches it to wait for the subsequent fragments.
The second fragment, the abnormal one containing only an IP header, arrives. DPDK handles it with the normal processing logic without checking or discarding it, so the rte_mbuf structures of the two messages are chained together into a chained message that is returned to ULB4.
When ULB4 receives such a message, its total length is below the threshold that requires fragmentation, so ULB4 directly calls the DPDK send interface.
DPDK does not check this abnormal message and directly calls the corresponding user-mode NIC driver to send it.
The user-mode NIC driver triggers a NIC tx hang when sending such an abnormal message.
Once the tx hang is triggered, the NIC stops working, and it no longer sets the send-completion flag on the transmit descriptors of the messages in the driver queue.
Subsequent messages keep arriving and pile up in the send queue until it is full, after which every arriving message is discarded directly.
2. Why does the abnormal message trigger the network card tx hang
First of all, let's take a look at the code related to sending messages on the network card in DPDK.
As the above figure shows, it is very important to set the relevant fields correctly according to the NIC datasheet. If they are set incorrectly for any reason, the consequences can be unpredictable (refer to the NIC datasheet for details).
As shown in the figure below, the NIC datasheet usually describes each of these fields, and the NIC driver usually has a corresponding data structure.
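The relationship between datasheet fields and a driver data structure can be sketched as follows. The figures are not reproduced in the text, so this is an illustration only: the two-quadword layout and the bit positions below are assumptions chosen for the example, and all names carry a `sketch_` prefix to mark them as hypothetical. The authoritative layout is the one in the NIC datasheet and the driver headers.

```c
#include <stdint.h>

/* Assumed field positions inside the second quadword (illustrative). */
#define SKETCH_TXD_DTYPE_DATA    0x0ull
#define SKETCH_TXD_CMD_SHIFT     4
#define SKETCH_TXD_CMD_MASK      0x3ffull
#define SKETCH_TXD_OFFSET_SHIFT  16
#define SKETCH_TXD_OFFSET_MASK   0x3ffffull
#define SKETCH_TXD_BUF_SZ_SHIFT  34
#define SKETCH_TXD_BUF_SZ_MASK   0x3fffull

/* A transmit data descriptor modelled as two 64-bit words. */
struct sketch_tx_desc {
    uint64_t buffer_addr;          /* DMA address of the segment data */
    uint64_t cmd_type_offset_bsz;  /* type, command, offsets, buffer size */
};

/* Packs command bits, header offsets and buffer size into one quadword. */
uint64_t sketch_build_qw1(uint32_t cmd, uint32_t offset, uint16_t buf_size)
{
    return SKETCH_TXD_DTYPE_DATA |
           (((uint64_t)cmd & SKETCH_TXD_CMD_MASK) << SKETCH_TXD_CMD_SHIFT) |
           (((uint64_t)offset & SKETCH_TXD_OFFSET_MASK) << SKETCH_TXD_OFFSET_SHIFT) |
           (((uint64_t)buf_size & SKETCH_TXD_BUF_SZ_MASK) << SKETCH_TXD_BUF_SZ_SHIFT);
}

/* Extracts the buffer size field again, e.g. for a sanity check. */
uint16_t sketch_qw1_buf_size(uint64_t qw1)
{
    return (uint16_t)((qw1 >> SKETCH_TXD_BUF_SZ_SHIFT) & SKETCH_TXD_BUF_SZ_MASK);
}

/* Fills one descriptor for one packet segment. */
void sketch_fill_desc(struct sketch_tx_desc *d, uint64_t dma_addr,
                      uint32_t cmd, uint32_t offset, uint16_t buf_size)
{
    d->buffer_addr = dma_addr;
    d->cmd_type_offset_bsz = sketch_build_qw1(cmd, offset, buf_size);
}
```

In this model, a segment with no data produces a descriptor whose buffer size field is 0. Whether a given NIC tolerates such a descriptor is exactly the kind of constraint the datasheet defines.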
With this basic understanding, we wondered: if we manually construct a similar abnormal message directly in a program, will it also cause the NIC to stop sending packets?
The answer is yes.
As shown in the following figure, we used a code snippet like this to construct an abnormal message and then called the DPDK API to send it directly. The NIC went into tx hang soon afterward.
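The authors' reproduction snippet is not included in the text. The sketch below only models the shape of such a message, a two-segment chain whose second segment carries no data, together with a guard that could run before transmission. `struct seg` is a simplified stand-in and not the real rte_mbuf, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one segment of a chained packet buffer. */
struct seg {
    uint16_t data_len;   /* bytes of packet data in this segment */
    struct seg *next;    /* next segment in the chain, or NULL */
};

/* Appends `tail` to the end of the chain starting at `head`. */
void seg_chain(struct seg *head, struct seg *tail)
{
    while (head->next != NULL)
        head = head->next;
    head->next = tail;
}

/* Total packet length: the sum of all segment lengths. */
uint32_t seg_pkt_len(const struct seg *s)
{
    uint32_t total = 0;
    for (; s != NULL; s = s->next)
        total += s->data_len;
    return total;
}

/* True if any segment in the chain carries no data. */
bool seg_chain_has_empty(const struct seg *s)
{
    for (; s != NULL; s = s->next)
        if (s->data_len == 0)
            return true;
    return false;
}
```

Chaining a segment with `data_len` 0 after a normal first segment leaves the total length unchanged, so a length check on the whole packet does not notice it. A per-segment check such as `seg_chain_has_empty()` does.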
3. Thoughts on operating hardware directly
Operating hardware directly requires great care. In a traditional Linux system, drivers generally run in kernel mode under kernel management, and the driver code may perform all kinds of exception handling, so it is rare for a user program to make the hardware stop working. DPDK, by virtue of its user-mode drivers, operates hardware directly from user mode, and it may also apply many optimizations to improve performance. If the user's own program then handles something incorrectly, abnormal situations such as a NIC tx hang can result.
4. The value of tools
We wrote a tool that exports the messages in the DPDK driver queue with one click, so that every time a problem occurs we can quickly export all messages in the NIC driver's send queue, which greatly improves troubleshooting efficiency. After further optimization, we plan to open source this tool on UCloud's GitHub, in the hope that it will help DPDK developers.
Final thoughts
As an open source kit, DPDK is generally stable and reliable, but real application scenarios vary endlessly, and some special cases can make DPDK work abnormally. Although the probability is very small, DPDK usually sits at a key gateway position, so even a rare problem has a serious impact once it occurs.
It is therefore of great significance for a technical team to understand how DPDK works, analyze its source code, and be able to locate its problems step by step from specific symptoms, in order to improve the service reliability of the whole DPDK program. It is also worth mentioning that ULB4's high-availability cluster architecture played an important role in handling this problem: when one machine is unavailable, the other machines in the cluster continue to provide reliable service, which effectively improves the reliability of users' business.
© 2024 shulou.com SLNews company. All rights reserved.