2025-10-27 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article walks through how the Linux network protocol stack receives packets.
I want to work through the complete receive path: the NIC receives data and triggers a soft interrupt, the packet is delivered to the IP layer, routed on to the TCP layer, and finally handed to the user process. Along the way I will try to introduce the configuration options and the monitoring data that matter at each stage. Knowing the full path, the available settings, and the counters makes it possible to tune the configuration in later work.
The process of receiving messages related to Ring Buffer is roughly as follows:
Note: the step labeled "raise softirq" corresponds to the function now named napi_schedule.
During startup the NIC (network interface card) registers itself with the system, and the system allocates Ring Buffer queues and a dedicated region of kernel memory to the NIC for storing packets in transit. struct sk_buff is the kernel data structure dedicated to holding network packets. Once received data has been stored in the NIC's dedicated kernel memory, a data pointer in the sk_buff points to that memory. Each Packet Descriptor in a Ring Buffer queue is in one of two states: ready or used. Initially a Descriptor points to an empty sk_buff and is in the ready state. When data arrives, DMA fetches it from the NIC, finds the next ready Descriptor on the Ring Buffer in order, stores the data in the sk_buff that the Descriptor points to, and marks the Descriptor as used. Because ready slots are taken in order, the Ring Buffer behaves as a FIFO queue.
When DMA has finished transferring the data, the NIC triggers an IRQ so that the CPU processes the received data. Every IRQ costs CPU time in the Interrupt Handler. If every Packet received by the NIC triggered an IRQ, the CPU would spend a lot of time in the Interrupt Handler, and each invocation would take only one Packet off the Ring Buffer. Even though the Interrupt Handler itself runs briefly, this is inefficient and puts a heavy load on the CPU. For this reason a mechanism called New API (NAPI) is now used to coalesce IRQs and reduce their number.
Next, let's look at how NAPI coalesces IRQs. The NIC driver registers a poll function, and the NAPI subsystem then uses that poll function to pull received data from the Ring Buffer in batches. The main events, in order, are:
1. When the NIC driver initializes, it registers its poll function with the kernel; this function is used later to pull received data from the Ring Buffer.
2. The driver enables NAPI. NAPI is off by default; only drivers that support NAPI enable it.
3. After receiving data, the NIC stores it in memory through DMA.
4. The NIC triggers an IRQ, and the CPU starts executing the Interrupt Handler registered by the driver.
5. The driver's Interrupt Handler calls napi_schedule to raise a softirq (NET_RX_SOFTIRQ) and wake the NAPI subsystem. The NET_RX_SOFTIRQ handler runs separately from the hard interrupt handler, in softirq context (possibly in a ksoftirqd kernel thread), and calls the poll function registered by the driver to fetch the received Packets.
6. The driver disables the current NIC's IRQ so that no new IRQ arrives until all pending data has been polled.
7. When everything has been processed, NAPI polling is disabled and the NIC's IRQ is re-enabled.
8. Go back to step 3.
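The cycle above can be sketched as a toy simulation. This is only an illustration of the IRQ coalescing idea, not kernel code: the class ToyNic and its methods are hypothetical names, and a Python deque stands in for the Ring Buffer.

```python
from collections import deque


class ToyNic:
    """Toy model of the NAPI receive cycle (hypothetical, not kernel code)."""

    def __init__(self):
        self.ring = deque()          # stands in for the Ring Buffer
        self.irq_enabled = True      # whether the NIC may raise an IRQ
        self.irq_count = 0           # hard IRQs raised so far
        self.poll_scheduled = False  # set by the interrupt handler

    def receive(self, packet):
        """DMA stores the packet; an IRQ is raised only if IRQs are enabled."""
        self.ring.append(packet)
        if self.irq_enabled:
            self.irq_count += 1
            self.interrupt_handler()

    def interrupt_handler(self):
        """Schedule polling (the role of napi_schedule) and mask further IRQs."""
        self.poll_scheduled = True
        self.irq_enabled = False

    def softirq(self):
        """Drain the ring in one batch, then re-enable the NIC's IRQ."""
        delivered = []
        if self.poll_scheduled:
            while self.ring:
                delivered.append(self.ring.popleft())
            self.poll_scheduled = False
            self.irq_enabled = True
        return delivered


nic = ToyNic()
for i in range(5):  # a burst of five packets arrives before the softirq runs
    nic.receive(f"pkt{i}")
burst = nic.softirq()
print(nic.irq_count, burst)  # 1 ['pkt0', 'pkt1', 'pkt2', 'pkt3', 'pkt4']
```

In this sketch five packets cost a single IRQ, because the IRQ stays masked until the batch has been polled.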
The description above still leaves two questions open. How is the data on the Ring Buffer delivered to the upper network stack after poll takes it off the ring? And how are consumed sk_buffs reallocated and put back into the Ring Buffer?
Both tasks are done in poll. As mentioned above, poll is implemented by the driver, so each driver's implementation may differ, but the work poll does is basically the same:
1. Read the received sk_buff from the Ring Buffer.
2. Do some basic checks on the sk_buff; this may involve merging several sk_buffs, because one Frame may be scattered across multiple sk_buffs.
3. Deliver the sk_buff to the upper network stack for processing.
4. Clean up the sk_buff, clean the Descriptor on the Ring Buffer, point it to a newly allocated sk_buff, and set its state to ready.
5. Update statistics, such as the number of packets received and the total number of bytes.
Taking the Intel igb NIC driver as an example, its poll function is here: linux/drivers/net/ethernet/intel/igb/igb_main.c - Elixir - Free Electrons
In that function we first see tx.ring and rx.ring, which means that both sending and receiving pass through here. Setting the send path aside and looking only at receiving, the function that does the receiving is igb_clean_rx_irq. Once receiving is finished, napi_complete_done is called to leave polling mode and re-enable the NIC's IRQ. So most of the work happens in igb_clean_rx_irq, whose implementation is fairly clear and follows the steps described above. It contains a while loop controlled by budget, so that when there are many Packets the CPU does not loop here indefinitely and other work gets a chance to run. Inside the loop it does the following:
1. Clean up already-read sk_buffs in batches and allocate new buffers, rather than cleaning one sk_buff per read, which would be very inefficient.
2. Find the next Descriptor to read on the Ring Buffer and check that its status is normal.
3. Use the Descriptor to find the sk_buff and read it.
4. Check whether this is the End of packet. If so, the sk_buff contains the whole Frame. If not, the Frame is larger than one sk_buff, and another sk_buff has to be read and the data of the two merged.
5. Use the Frame's header to check that the Frame data is complete and correct.
6. Record the length of the sk_buff and how much data has been read.
7. Set the Hash, checksum, timestamp, VLAN id, and other information provided by the hardware.
8. Deliver the sk_buff to the upper network stack through napi_gro_receive.
9. Update a number of statistics.
10. Go back to step 1; exit the loop if there is no more data or the budget is used up.
So budget affects how long the CPU spends executing poll. When there are many packets, a larger budget raises the CPU utilization spent in poll and reduces packet latency, but if the CPU spends all of its time here, other tasks suffer.
The budget available to one NET_RX softirq run, which is shared by the poll calls made during that run, defaults to 300 and can be adjusted:
sysctl -w net.core.netdev_budget=600
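To make the role of budget concrete, here is a minimal sketch of a budget-limited poll loop. The function poll and the packet values are hypothetical; the "done" flag mirrors the common driver convention of leaving polling mode only when fewer packets than the budget were found.

```python
from collections import deque


def poll(ring, budget):
    """Take at most `budget` packets off the ring, oldest first.

    Returns (packets, done). done is True when fewer than `budget`
    packets were available, i.e. the point at which a driver would
    leave polling mode and re-enable the IRQ.
    """
    packets = []
    while ring and len(packets) < budget:
        packets.append(ring.popleft())
    return packets, len(packets) < budget


ring = deque(range(10))  # ten pending packets (hypothetical)
first, done_first = poll(ring, budget=4)
second, done_second = poll(ring, budget=4)
third, done_third = poll(ring, budget=4)
print(first, done_first)  # [0, 1, 2, 3] False
print(third, done_third)  # [8, 9] True
```

With a budget of 4, ten packets need three poll calls; a larger budget would drain the ring in fewer calls at the cost of holding the CPU longer per call.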
napi_gro_receive involves the GRO mechanism, covered in the next section, which roughly speaking aggregates multiple packets. napi_gro_receive eventually hands the processed sk_buff to the upper network stack by calling netif_receive_skb. After GRO has run, we can basically consider the packet to have left the Ring Buffer and entered the next phase. Before covering the next phase, here are more details about the Ring Buffer in the receive phase.
Generic Receive Offloading (GRO)
GRO is a software implementation of Large Receive Offload (LRO). Most networks use an MTU of 1500 bytes; with Jumbo Frames enabled the MTU can reach 9000 bytes. Data larger than the MTU has to be split into multiple packets. LRO merges multiple received packets of the same Flow, according to certain rules, before handing them to the upper layer, which reduces the number of packets the upper layer has to process.
Many LRO mechanisms are implemented on the NIC, and a NIC that does not implement LRO lacks the packet-merging ability described above. GRO is the software implementation of LRO, which makes this feature available on all NICs.
napi_gro_receive merges packets as they are received. If a received packet needs to be merged, napi_gro_receive returns quickly. When merging is complete, napi_skb_finish is called to release data structures that are no longer needed because of the merge. Finally, netif_receive_skb is called to deliver the packet to the upper network stack for further processing. As mentioned above, netif_receive_skb is the entry point to the upper network stack once a packet has left the Ring Buffer.
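The merging idea can be illustrated with a small sketch. It is a simplification under stated assumptions, not the kernel's GRO logic: the Segment type, the flow key, and the single rule "same flow and contiguous sequence numbers" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    flow: tuple     # hypothetical flow key: (src, sport, dst, dport)
    seq: int        # sequence number of the first payload byte
    payload: bytes


def toy_gro(segments):
    """Merge a segment into the pending packet of its flow when it
    continues exactly where that packet's payload ended."""
    merged = []
    pending = {}  # flow -> index in `merged` of the packet still being built
    for seg in segments:
        idx = pending.get(seg.flow)
        if idx is not None and merged[idx].seq + len(merged[idx].payload) == seg.seq:
            held = merged[idx]
            merged[idx] = Segment(held.flow, held.seq, held.payload + seg.payload)
        else:
            pending[seg.flow] = len(merged)
            merged.append(seg)
    return merged


flow_a = ("10.0.0.1", 40000, "10.0.0.2", 80)
flow_b = ("10.0.0.3", 40001, "10.0.0.2", 80)
incoming = [
    Segment(flow_a, 0, b"aaaa"),
    Segment(flow_a, 4, b"bbbb"),
    Segment(flow_b, 0, b"cc"),
    Segment(flow_a, 8, b"dddd"),
]
result = toy_gro(incoming)
print(len(incoming), "->", len(result))  # 4 -> 2
```

Four received segments become two packets for the upper layer, which is the reduction in per-packet work that the text describes.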
You can view and set GRO through ethtool:
Check GRO:
ethtool -k eth0 | grep generic-receive-offload
generic-receive-offload: on
Enable GRO:
ethtool -K eth0 gro on
Ring Buffer processing on multiple CPUs (Receive Side Scaling)
The IRQ raised when the NIC receives data can be handled by only one CPU, so only one CPU executes napi_schedule to raise the softirq, and the softirq handler also runs on the CPU that raised the softirq. The driver's poll function therefore also runs on the CPU that first handled the NIC's IRQ. As a result, at any given moment only one CPU pulls data from a given Ring Buffer.
As described above, the space allocated to a Ring Buffer is limited. When packets arrive faster than a single CPU can process them, the Ring Buffer may fill up and new packets are dropped. Machines now have multiple CPUs, and having only one CPU process Ring Buffer data is inefficient, so a mechanism called Receive Side Scaling (RSS), or multiqueue, addresses this problem. Wikipedia's introduction to RSS is concise and worth a look: Network interface controller - Wikipedia
Simply put, a NIC that supports RSS has multiple Ring Buffers. When the NIC receives a Frame, it uses a hash function to decide which Ring Buffer the Frame is placed on. The resulting IRQs can be distributed across multiple CPUs, either by the operating system or by configuring IRQ affinity manually. IRQs are then handled by different CPUs, so the data on the Ring Buffers is also processed by different CPUs, which improves parallelism.
RSS changes nothing except which CPU the NIC sends the IRQ to. The receive path is otherwise the same as described before.
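The hash-based queue selection can be sketched as follows. This is an assumption-laden illustration: pick_queue is a hypothetical helper, and zlib.crc32 is only a stand-in for the keyed hash (commonly Toeplitz) and indirection table that real NICs use.

```python
import zlib


def pick_queue(src_ip, src_port, dst_ip, dst_port, num_queues):
    """Map a flow 4-tuple to a receive queue index.

    zlib.crc32 is a stand-in hash chosen for this sketch; it is not
    what NIC hardware computes.
    """
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % num_queues


# 1000 hypothetical flows that differ only in source port
flows = [("10.0.0.1", 40000 + i, "10.0.0.2", 80) for i in range(1000)]
queues = [pick_queue(*flow, num_queues=8) for flow in flows]
print(sorted(set(queues)))
```

The property that matters is that packets of one flow always land on the same queue, while different flows spread across queues.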
If RSS is supported, the NIC allocates one IRQ per queue, which can be seen in /proc/interrupts. IRQ affinity can be configured to specify which CPUs handle a given IRQ. Find the IRQ number in /proc/interrupts, then write the CPUs you want to bind to /proc/irq/IRQ_NUMBER/smp_affinity, which takes a hexadecimal bit mask. For example, if the interrupt number for queue rx_0 is 41, execute:
echo 6 > /proc/irq/41/smp_affinity
6 represents CPU2 and CPU1.
The mask of CPU 0 is 0x1 (0001), CPU 1 is 0x2 (0010), CPU 2 is 0x4 (0100), CPU 3 is 0x8 (1000), and so on.
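The mask arithmetic can be checked with a short helper. cpus_to_mask is a hypothetical name used only for this sketch; it sets one bit per CPU number and prints the result in hexadecimal, the form smp_affinity expects.

```python
def cpus_to_mask(cpus):
    """Build a hexadecimal affinity mask with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


print(cpus_to_mask([1, 2]))  # 6
print(cpus_to_mask([0, 3]))  # 9
```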
Note also that if you set smp_affinity, you must either not run irqbalance or give irqbalance a --banirq list that excludes the IRQs whose smp_affinity you set. Otherwise irqbalance will override the IRQ affinity you configured.
Receive Packet Steering (RPS) implements functionality similar to RSS in software, for NICs that do not support RSS. The advantage is that it places no requirements on the NIC, so RPS works with any NIC. The disadvantage is that after the NIC receives data, DMA still stores it in a single Ring Buffer, the NIC still sends its IRQ to a single CPU, and that one CPU still calls the driver's poll to take data off the Ring Buffer. RPS only comes into play after that single CPU has fetched the data from the Ring Buffer: it computes a Hash for each Packet, puts the Packet on the backlog of the corresponding CPU, and notifies the target CPU with an Inter-processor Interrupt (IPI) to process its backlog. The rest of the Packet processing is done by the target CPU, which spreads the load across multiple CPUs.
RPS is disabled by default. Consider enabling it when the machine has multiple CPUs and the softirq statistics in /proc/softirqs show that NET_RX is unevenly distributed across CPUs, or when the NIC does not support multiqueue. To enable RPS, set the value of /sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus. For example:
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
This sets the CPUs that process the rx-0 queue of the NIC eth0 to f. The value is a hexadecimal CPU bit mask, so f (binary 1111) means that CPU0 through CPU3 process the data of the rx-0 queue. If the machine does not have that many CPUs, all CPUs are used. Some people simply put echo fff > /sys/class/net/eth0/queues/rx-0/rps_cpus in a script for convenience, which covers most types of machines regardless of their CPU count, so the same script can be run on any machine.
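A quick way to see which CPUs a given rps_cpus value selects is to decode the mask. mask_to_cpus is a hypothetical helper for this sketch, and it assumes a plain hexadecimal string without the comma-separated groups used on machines with very many CPUs.

```python
def mask_to_cpus(mask_hex, online_cpus=None):
    """List the CPU numbers whose bits are set in a hexadecimal mask.

    If online_cpus is given, CPUs at or above that count are dropped,
    which models a machine with fewer CPUs than bits set in the mask.
    """
    mask = int(mask_hex, 16)
    cpus = [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]
    if online_cpus is not None:
        cpus = [cpu for cpu in cpus if cpu < online_cpus]
    return cpus


print(mask_to_cpus("f"))                   # [0, 1, 2, 3]
print(len(mask_to_cpus("fff")))            # 12
print(mask_to_cpus("fff", online_cpus=8))  # [0, 1, 2, 3, 4, 5, 6, 7]
```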
Note: even if the NIC does not support multiqueue, RPS should not be enabled blindly, because it increases the load on all CPUs and in some scenarios, such as CPU-intensive applications, may bring no benefit. Test before enabling it.
Receive Flow Steering (RFS) generally works together with RPS. RPS spreads received packets across CPUs to balance load, but a Flow that is being processed on CPU1 may have its next packet sent to CPU2, which lowers the CPU cache hit rate and forces data to move from CPU1 to CPU2. RFS ensures that packets of the same Flow are routed to the CPU that is currently processing that Flow's data, which raises the CPU cache hit rate. The article linked in the original post explains the RFS mechanism well. Basically, after data is received, a Hash is computed from some of its fields and used to look up, in a table, the CPU that is currently processing the Flow; the data is then sent to that CPU, which improves the cache hit rate and avoids copying data between CPUs. There are many more details; see that article.
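The table lookup described here can be modeled in a few lines. This is a toy under stated assumptions, not the kernel's data structure: ToyFlowTable, record_consumer, and steer are hypothetical names, and the flow hash is an arbitrary example value.

```python
class ToyFlowTable:
    """Remembers which CPU last consumed each flow (toy model of RFS)."""

    def __init__(self, entries):
        self.entries = entries
        self.table = [None] * entries  # slot -> CPU number, None if unknown

    def _slot(self, flow_hash):
        return flow_hash % self.entries

    def record_consumer(self, flow_hash, cpu):
        """Note that the application handling this flow ran on `cpu`."""
        self.table[self._slot(flow_hash)] = cpu

    def steer(self, flow_hash, fallback_cpu):
        """Choose the CPU for an incoming packet of this flow."""
        cpu = self.table[self._slot(flow_hash)]
        return fallback_cpu if cpu is None else cpu


table = ToyFlowTable(entries=32768)
flow_hash = 0x9E3779B1  # hypothetical hash of one flow
before = table.steer(flow_hash, fallback_cpu=2)
table.record_consumer(flow_hash, cpu=1)
after = table.steer(flow_hash, fallback_cpu=2)
print(before, after)  # 2 1
```

Before the flow has a recorded consumer, the packet goes wherever the fallback choice (RPS in the text) would send it; afterwards it follows the CPU that is processing the flow.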
RFS is off by default and has to be configured explicitly to take effect. Normally, when RPS is enabled, RFS should be enabled as well for better performance. The same article also explains how to enable RFS and recommends configuration values. The first setting is rps_sock_flow_entries:
sysctl -w net.core.rps_sock_flow_entries=32768
This value depends on the number of active connections the system is expected to have. Note that the number of simultaneously active connections is normally much smaller than the maximum number of connections the system can carry, because most connections are not active at the same time. The recommended value is 32768, which covers most cases; each active connection is assigned one entry. The second setting is rps_flow_cnt, the maximum number of flows for each queue. With a single queue, rps_flow_cnt is generally the same as rps_sock_flow_entries; with multiple queues, rps_flow_cnt is rps_sock_flow_entries / N, where N is the number of queues.
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
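The division rule can be written out as a small helper. rps_flow_cnt_settings is a hypothetical function for this sketch; the 16-queue example is an assumption chosen because 32768 / 16 gives the 2048 used in the command above.

```python
def rps_flow_cnt_settings(device, sock_flow_entries, num_rx_queues):
    """Return the rps_flow_cnt value for every RX queue of a device,
    using rps_sock_flow_entries / N as described in the text."""
    per_queue = sock_flow_entries // num_rx_queues
    return {
        f"/sys/class/net/{device}/queues/rx-{q}/rps_flow_cnt": per_queue
        for q in range(num_rx_queues)
    }


settings = rps_flow_cnt_settings("eth0", 32768, 16)
for path, value in list(settings.items())[:2]:
    print(f"echo {value} > {path}")
```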
Accelerated Receive Flow Steering (aRFS) does the same job as RFS, but with hardware assistance. aRFS is to RFS what RSS is to RPS: the work is moved from the CPU to the hardware, so no CPU time is spent on it. The NIC computes the Hash itself and sends the data directly to the target CPU, which makes it faster. The NIC driver must expose an ndo_rx_flow_steer function to implement aRFS.
Adaptive RX/TX IRQ coalescing
Some NICs support this feature, which coalesces IRQs dynamically to reduce packet latency when there are few packets and to improve throughput when there are many. To check it:
ethtool -c eth2
Coalesce parameters for eth2:
Adaptive RX: off  TX: off
stats-block-usecs: 0
...
To enable adaptive coalescing for the RX queue:
ethtool -C eth0 adaptive-rx on
There are also four values to set: rx-usecs, rx-frames, rx-usecs-irq, and rx-frames-irq. Look up their exact meaning when you need them.
Ring Buffer related monitoring and configuration
Received packet statistics
ethtool -S eth0
NIC statistics:
rx_packets: 792819304215
tx_packets: 778772164692
rx_bytes: 172322607593396
tx_bytes: 201132602650411
rx_broadcast: 15118616
tx_broadcast: 2755615
rx_multicast: 0
tx_multicast: 10
RX is receiving and TX is sending. The output also shows how each queue of the NIC is sending and receiving. The key items are the statistics containing the word drop and the fifo_errors statistics:
tx_dropped: 0
rx_queue_0_drops: 93
rx_queue_1_drops: 874
....
rx_fifo_errors: 2142
tx_fifo_errors: 0
These show the number of packets dropped on the send and receive queues. All the queue_drops values add up to rx_fifo_errors, so overall rx_fifo_errors tells you whether packets are being lost on the Ring Buffer. If they are, consider whether to adjust how data is distributed across the queues or to increase the size of the Ring Buffer.
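The relationship between the per-queue drops and rx_fifo_errors can be checked with a small parser. This is a sketch: parse_stats and rx_queue_drops are hypothetical helpers, and SAMPLE is made-up two-queue output (its rx_packets and rx_fifo_errors values are invented so that the sum works out), not the elided output shown above.

```python
import re

# Hypothetical two-queue sample in the style of `ethtool -S` output.
SAMPLE = """\
NIC statistics:
     rx_packets: 1000
     tx_dropped: 0
     rx_queue_0_drops: 93
     rx_queue_1_drops: 874
     rx_fifo_errors: 967
     tx_fifo_errors: 0
"""


def parse_stats(text):
    """Parse 'name: value' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        if sep and value.strip().isdigit():
            stats[name] = int(value)
    return stats


def rx_queue_drops(stats):
    """Select the per-queue receive drop counters."""
    return {k: v for k, v in stats.items() if re.fullmatch(r"rx_queue_\d+_drops", k)}


stats = parse_stats(SAMPLE)
drops = rx_queue_drops(stats)
print(drops, sum(drops.values()) == stats["rx_fifo_errors"])
```

Comparing the per-queue values also shows whether the drops are concentrated on a few queues, which points toward rebalancing rather than enlarging the ring.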
/proc/net/dev is another source of packet statistics, but its output is harder to read:
cat /proc/net/dev
Inter-| Receive | Transmit
face | bytes packets errs drop fifo frame compressed multicast | bytes packets errs drop fifo colls carrier compressed
lo: 14472296365706 10519818839 000 014472296365706 10519818839000000
eth2: 164650683906345 785024598362 02142 2000 183711288087530 704887351967 0000000
Adjust the number of Ring Buffer queues
ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 0
TX: 0
Other: 1
Combined: 8
Current hardware settings:
RX: 0
TX: 0
Other: 1
Combined: 8
The Combined row gives the number of queues. According to the documentation, Combined means multi-purpose queues; my guess is that each can serve as an RX or a TX queue, with 8 in total.
If multiqueue is not supported, the command above prints:
Channel parameters for eth0:
Cannot get device channel parameters: Operation not supported
The output above shows both maximums and current settings for the number of Ring Buffers, so you can set the number yourself, as long as it does not exceed the maximums:
sudo ethtool -L eth0 combined 8
If the NIC supports setting the number of queues for RX or TX specifically, you can do this:
sudo ethtool -L eth0 rx 8
Note that settings made with ethtool may require a restart before they take effect.
Resize the Ring Buffer queue
Check the current Ring Buffer size first:
ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 512
RX Mini: 0
RX Jumbo: 0
TX: 512
The maximum for both RX and TX is 4096, and the current value is 512. The larger the queue, the less likely packets are dropped, but the data latency increases.
Set the RX queue size:
ethtool -G eth0 rx 4096
Adjust the weight of Ring Buffer queues
If the NIC supports multiqueue, it distributes received packets across the queues according to a Hash function. The weights of the queues can be adjusted to change how data is distributed.
ethtool -x eth0
RX flow hash indirection table for eth0 with 8 RX ring(s):
0: 0 0 0 0 0 0 0 0
8: 0 0 0 0 0 0 0 0
16: 1 1 1 1 1 1 1 1
...
64: 4 4 4 4 4 4 4 4
72: 4 4 4 4 4 4 4 4
80: 5 5 5 5 5 5 5 5
...
120: 7 7 7 7 7 7 7 7
My NIC has 8 queues and 128 distinct Hash values in total; the table lists the queue that corresponds to each Hash value. The leftmost numbers 0, 8, 16, and so on let you locate a specific Hash value quickly. For example, for Hash value 76, find the row that starts with 72: "72: 4 4 4 4 4 4 4 4". Counting from left to right, the first entry is 72 and the fifth is 76, so the queue for that Hash value is 4.
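The row-and-offset lookup described above can be automated. The sketch below assumes each table row is "start: q q q ..." as in the ethtool -x output; parse_indirection_table is a hypothetical helper, and SAMPLE contains only three rows written out in full for illustration.

```python
def parse_indirection_table(text):
    """Map each hash value to its queue from 'start: q q q ...' rows."""
    table = {}
    for line in text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if not sep or not label.isdigit():
            continue  # skip the title line and anything that is not a row
        for offset, queue in enumerate(rest.split()):
            table[int(label) + offset] = int(queue)
    return table


SAMPLE = """\
RX flow hash indirection table for eth0 with 8 RX ring(s):
   64:      4     4     4     4     4     4     4     4
   72:      4     4     4     4     4     4     4     4
   80:      5     5     5     5     5     5     5     5
"""

table = parse_indirection_table(SAMPLE)
print(table[76])  # 4
print(table[80])  # 5
```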
ethtool -X eth0 weight 6 2 8 5 10 7 1 5
This sets the weights of the 8 queues. The weights cannot add up to more than 128, which is the indirection table size; the size may differ from NIC to NIC.
Change Ring Buffer Hash Field
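Packets are assigned to queues by hashing certain fields of the packet, and the fields used can be changed.
ethtool -n eth0 rx-flow-hash tcp4
TCP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]
This shows the Hash fields for tcp4.
You can also set the Hash fields:
ethtool -N eth0 rx-flow-hash udp4 sdfn
In sdfn, each letter selects one field: s is the source IP address, d is the destination IP address, f is bytes 0 and 1 of the Layer 4 header (the source port), and n is bytes 2 and 3 of the Layer 4 header (the destination port). See the ethtool documentation for the many other possible values.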
Softirq statistics
You can see the number of softirqs handled on each CPU in /proc/softirqs:
cat /proc/softirqs
              CPU0        CPU1
HI:           1           0
TIMER:        1650579324  3521734270
NET_TX:       10282064    10655064
NET_RX:       3618725935  2446
BLOCK:        0           0
BLOCK_IOPOLL: 0           0
TASKLET:      47013       41496
SCHED:        1706483540  1003457088
HRTIMER:      1698047     11604871
RCU:          4218377992  3049934909
NET_RX is the softirq raised when packets are received. This statistic is generally used to check whether softirqs are evenly distributed across CPUs; if they are not, some adjustment may be needed. In the output above there is a large gap between CPU0 and CPU1, because the NIC of this machine does not support RSS and has only one Ring Buffer. After RPS is enabled, the distribution becomes much more even.
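The imbalance can be quantified with a short script. net_rx_per_cpu is a hypothetical helper; SAMPLE reuses a few rows from the output above, and on a live machine the text would come from reading /proc/softirqs instead.

```python
def net_rx_per_cpu(text):
    """Return {cpu_name: NET_RX count} from /proc/softirqs-style text."""
    lines = text.splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        if name.strip() == "NET_RX":
            return dict(zip(cpus, map(int, rest.split())))
    return {}


SAMPLE = """\
                CPU0       CPU1
      TIMER: 1650579324 3521734270
     NET_TX:   10282064   10655064
     NET_RX: 3618725935       2446
"""

counts = net_rx_per_cpu(SAMPLE)
busiest_share = max(counts.values()) / sum(counts.values())
print(counts)
print(f"busiest CPU handles {busiest_share:.2%} of NET_RX softirqs")
```

A share close to 100% on one CPU, as here, is the pattern that suggests enabling RSS or RPS.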
IRQ statistics
/proc/interrupts shows the IRQ statistics for each CPU. It is generally used to check whether the NIC supports multiqueue and whether NAPI's IRQ coalescing is working, by watching how fast the IRQ counts grow.
That is how the Linux network protocol stack receives packets.