This article introduces the message receiving process of the Linux network protocol stack: how a packet travels from softirq handling, through RPS and the per-CPU backlog, up into the IP protocol layer.
If RPS is not enabled, __netif_receive_skb is called, which in turn calls __netif_receive_skb_core, and the packet essentially enters the Protocol Layer.
If RPS is enabled, there is a longer road. enqueue_to_backlog is called first to put the packet into the target CPU's backlog. Before the packet is queued, the queue length is checked: if it is greater than net.core.netdev_max_backlog, the packet is dropped. The flow limit is checked as well, and a packet that exceeds it is also dropped. Drops are recorded in /proc/net/softnet_stat. When queuing the packet, the kernel also checks whether the NAPI backlog-processing logic of the target CPU is already running; if not, it is scheduled via ____napi_schedule, and an Inter-processor Interrupt (IPI) is sent to wake the target CPU to process the data in its backlog.
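The shape of this check-then-queue logic is easiest to see in the source. Below is enqueue_to_backlog, abridged from net/core/dev.c in v4.4 with locking and a few error paths removed, so read it as a sketch of the flow rather than the complete function:

    static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
                                  unsigned int *qtail)
    {
            struct softnet_data *sd = &per_cpu(softnet_data, cpu);
            unsigned int qlen = skb_queue_len(&sd->input_pkt_queue);

            /* Drop if the backlog exceeds netdev_max_backlog or the
             * flow limit says this flow is hogging the queue. */
            if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
                    if (!qlen &&
                        !__test_and_set_bit(NAPI_STATE_SCHED,
                                            &sd->backlog.state) &&
                        !rps_ipi_queued(sd))
                            /* Backlog NAPI was idle: schedule it; the IPI
                             * that wakes the target CPU follows. */
                            ____napi_schedule(sd, &sd->backlog);
                    __skb_queue_tail(&sd->input_pkt_queue, skb);
                    return NET_RX_SUCCESS;
            }

            sd->dropped++;    /* the "dropped" column of softnet_stat */
            kfree_skb(skb);
            return NET_RX_DROP;
    }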
The CPU handles its backlog much the way a CPU calls a driver's poll function to pull data out of the Ring Buffer: a poll function is registered, except that here the "poll" function is process_backlog, registered at startup by the kernel's networking subsystem. process_backlog contains a loop that, like a driver's poll, keeps pulling data out of the backlog and processing it. It calls __netif_receive_skb, which in turn calls __netif_receive_skb_core, the same logic as when RPS is off. It also exits the loop based on a budget that bounds how long processing may run. Like the budget configuration that controls the execution time of net_rx_action described earlier, this budget is controlled by net.core.netdev_budget.
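Its counterpart on the consuming side is process_backlog, condensed here from the same file (the splicing between input_pkt_queue and process_queue, and the IPI bookkeeping, are elided):

    static int process_backlog(struct napi_struct *napi, int quota)
    {
            struct softnet_data *sd =
                    container_of(napi, struct softnet_data, backlog);
            struct sk_buff *skb;
            int work = 0;

            /* Like a driver's poll: drain packets until the queue is
             * empty or the quota (net.core.netdev_budget) is used up. */
            while ((skb = __skb_dequeue(&sd->process_queue))) {
                    __netif_receive_skb(skb);  /* same path as with RPS off */
                    if (++work >= quota)
                            break;
            }
            return work;
    }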
net.core.netdev_max_backlog
As mentioned above, when a packet is put into a CPU's backlog, the current queue length is checked, and the packet is discarded if the queue exceeds net.core.netdev_max_backlog. So you can adjust this value as needed:
sysctl -w net.core.netdev_max_backlog=2000
Note that many places recommend increasing this value for stress testing, but from the analysis above this value only takes effect when RPS is enabled; setting it is meaningless if RPS is off.
Flow Limit
If one Flow (connection) carries a lot of data and sends it fast, its packets may fill up a CPU's entire backlog, so that packets carrying little data but with tight latency requirements cannot be processed promptly. The Flow Limit mechanism therefore kicks in when queuing becomes severe: it restricts large flows in favor of small ones, so small-flow data gets processed as soon as possible instead of being starved by large flows.
This mechanism is per-CPU, and CPUs do not affect each other; as shown later, it can even be enabled for individual CPUs. The principle: with RPS and Flow Limit both enabled, once (by default) more than half of a CPU's backlog is occupied, the Flow Limit mechanism starts working. The CPU keeps statistics over the last 256 packets; if one Flow accounts for more than half of them, that Flow is restricted and all its new packets are dropped, while other Flows are queued into the backlog and processed normally. A restricted Flow's connection stays up, but its packet loss increases.
When Flow Limit is enabled, each CPU is assigned a hash table. To compute each Flow's share, some fields of every received packet are hashed and mapped into this table; the hash function is the same one the RPS mechanism uses to pick a CPU for a packet. Each table entry is a counter recording how many packets of that Flow are currently queued in the backlog. The table's size is limited and configurable: if it is configured too small while the machine carries many Flows, different Flows may hash to the same counter, producing false positives. This is usually acceptable, because a single machine rarely handles that many Flows at once, and multiple CPUs can handle more Flows in parallel.
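The counting scheme can be read directly in skb_flow_limit, abridged below from net/core/dev.c in v4.4 (RCU locking trimmed); FLOW_LIMIT_HISTORY is the size of the recent-packet window:

    static bool skb_flow_limit(struct sk_buff *skb, unsigned int qlen)
    {
            struct sd_flow_limit *fl = this_cpu_ptr(&softnet_data)->flow_limit;
            unsigned int old_flow, new_flow;

            /* Only engage once the backlog is more than half full. */
            if (qlen < (netdev_max_backlog >> 1) || !fl)
                    return false;

            /* Hash the packet into a bucket (same hash RPS uses), then
             * slide the window: decrement the oldest packet's bucket,
             * increment the new packet's bucket. */
            new_flow = skb_get_hash(skb) & (fl->num_buckets - 1);
            old_flow = fl->history[fl->history_head];
            fl->history[fl->history_head] = new_flow;
            fl->history_head = (fl->history_head + 1) & (FLOW_LIMIT_HISTORY - 1);

            if (fl->buckets[old_flow])
                    fl->buckets[old_flow]--;

            /* A flow owning more than half the window gets dropped. */
            if (fl->buckets[new_flow]++ > (FLOW_LIMIT_HISTORY >> 1)) {
                    fl->count++;    /* flow_limit_count in softnet_stat */
                    return true;
            }
            return false;
    }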
The first step in enabling Flow Limit is to set the size of the hash table it uses:
sysctl -w net.core.flow_limit_table_len=8192
The default value is 4096.
After that, enable Flow Limit for individual CPUs. The order of these two configurations matters: the table length must be set before the bitmap, because each CPU's table is allocated with the current length when its bit is turned on:
echo f > /proc/sys/net/core/flow_limit_cpu_bitmap
This is similar to the configuration that enables RPS: it is also a bitmap identifying which CPUs have Flow Limit enabled. To enable it on all CPUs, write a value large enough to cover them all, no matter how many CPUs there are (for example, ffffffff covers 32 CPUs).
Dropped packet statistics
If packets are dropped because the backlog is full or the flow limit is hit, the drops are counted in /proc/net/softnet_stat. We can check there for packet loss:
    cat /proc/net/softnet_stat
    930c8a79 00000000 0000270b 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    280178c6 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 0cbbd3d4 00000000
There is one row per CPU. The trouble is that there is no authoritative document on what each column means, and different kernel versions may print different data; you have to read how the softnet_seq_show function prints it. Generally speaking, the second column is the number of dropped packets.
    seq_printf(seq,
               "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
               sd->processed, sd->dropped, sd->time_squeeze, 0,
               0, 0, 0, 0, /* was fastroute */
               sd->cpu_collision, sd->received_rps, flow_limit_count);
time_squeeze is the number of times net_rx_action stopped because the budget ran out while packets remained. It indicates there are many packets but a small budget; increasing the budget helps process packets faster.
cpu_collision is the number of times the CPU collided trying to grab the driver's lock when sending packets.
received_rps is the number of times the CPU was woken by an Inter-processor Interrupt to process backlog data. In the example above, only CPU1 has been woken: this NIC has only one Ring Buffer and its IRQ is handled by CPU0, so with RPS on, CPU0 puts the data into CPU1's backlog and then wakes CPU1.
flow_limit_count is the number of times the flow limit has been hit.
Internet Protocol Layer
As mentioned earlier, whether RPS is enabled or not, packets are ultimately handed to the upper layer through __netif_receive_skb_core. Before doing so, it delivers packets to pcap: tcpdump is implemented on top of libpcap, and the reason libpcap can capture every packet lies in __netif_receive_skb_core. The specific location is: http://elixir.free-electrons.com/linux/v4.4/source/net/core/dev.c#L3850
You can see that execution is still inside the softirq handler at this point, so running tcpdump necessarily lengthens softirq processing time to some extent.
After that, __netif_receive_skb_core walks the ptype_base list to find a packet_type in the Protocol Layer that can handle the current packet, then hands the data over. All protocols that can handle link-layer packets are registered in ptype_base. Taking IPv4 as an example: inet_init runs during initialization, and there you can see ip_packet_type being constructed and dev_add_pack being executed; ip_packet_type stands for IPv4. Stepping into dev_add_pack, you can see ip_packet_type being added to the list returned by ptype_head, and ptype_head resolves to ptype_base.
Going back to ip_packet_type, it is defined as:
    static struct packet_type ip_packet_type __read_mostly = {
            .type = cpu_to_be16(ETH_P_IP),
            .func = ip_rcv,
    };
If __netif_receive_skb_core finds that the sk_buff's protocol is ETH_P_IP, it executes the func of ip_packet_type, namely ip_rcv, handing the packet over to the Protocol Layer for processing.
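A heavily condensed sketch of that dispatch in __netif_receive_skb_core (v4.4; the pt_prev chaining, VLAN handling, and exact-device matching are all elided) shows how ptype_base is consulted:

    /* __netif_receive_skb_core, heavily condensed */
    struct packet_type *ptype;
    __be16 type = skb->protocol;

    /* ptype_base is a small hash table of packet_type lists,
     * keyed by the link-layer protocol number. */
    list_for_each_entry_rcu(ptype,
                    &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
            if (ptype->type == type)
                    ptype->func(skb, skb->dev, ptype, orig_dev);
                    /* for ETH_P_IP, ptype->func is ip_rcv() */
    }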
ip_rcv
The logic of ip_rcv is relatively simple: it performs various checks and prepares some data for the transport layer. If all the checks pass, NF_HOOK is executed at the end; if a check decides the packet must be dropped, NET_RX_DROP is returned and the drop is counted.
NF_HOOK is interesting: it actually hooks into something called Netfilter, where packets can be filtered or modified according to various rules. If the hook returns 1, Netfilter allows the packet to continue and ip_rcv_finish is executed; if it does not return 1, the result from Netfilter is returned and the packet is not processed further.
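For reference, the tail of ip_rcv in net/ipv4/ip_input.c (v4.4) is exactly this call, with ip_rcv_finish passed as the continuation that runs if the NF_INET_PRE_ROUTING chain accepts the packet:

    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
                   net, NULL, skb, dev, NULL,
                   ip_rcv_finish);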
ip_rcv_finish is responsible for asking the IP Route System where the sk_buff should go. If it is routed to the local machine, the next protocol that handles the sk_buff (say TCP or UDP above) must additionally find the corresponding Socket for it. That means every received packet goes through two demux (demultiplexing) lookups: once to find where the packet should be routed, and, if it is routed to the local machine, once more to find its Socket. But for a protocol like TCP, once a Socket is in the ESTABLISHED state the protocol stack does not change, and later packets take exactly the same route as the handshake did. Hence the Early Demux mechanism: when a packet is received, use the protocol field in the IP Header to find the upper-layer protocol, and let that protocol resolve the packet's route, saving one lookup. Take TCP as an example: after receiving the packet, go to the TCP layer first and check whether the packet has a corresponding Socket in the ESTABLISHED state; if so, directly use the routing destination already cached in that Socket as the routing destination of the current packet. The IP Route System then need not be consulted at all, while the Socket lookup could never have been avoided in the first place.
In detail: when the IP layer is initialized, TCP registers its handlers in the IP layer's inet_protos table, and among these registered handlers is the early_demux function tcp_v4_early_demux. In tcp_v4_early_demux you can see that it mainly uses the sk_buff's source and destination address information to look up the Socket the current packet belongs to in the ESTABLISHED connection table, takes that Socket's sk_rx_dst, a struct dst_entry, which is the route cached by the Socket, and sets it on the sk_buff. The sk_buff is then routed to wherever sk_rx_dst points. Besides the routing information, the struct sock pointer of the found Socket is also stored in the sk_buff, so that when the packet reaches the TCP layer the connection table does not have to be searched again.
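An abridged tcp_v4_early_demux from net/ipv4/tcp_ipv4.c in v4.4 (header sanity checks elided) shows both steps, the ESTABLISHED-table lookup and the cached-route reuse:

    void tcp_v4_early_demux(struct sk_buff *skb)
    {
            const struct iphdr *iph = ip_hdr(skb);
            const struct tcphdr *th = tcp_hdr(skb);
            struct sock *sk;

            /* Look up the 4-tuple in the ESTABLISHED hash table only. */
            sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                                           iph->saddr, th->source,
                                           iph->daddr, ntohs(th->dest),
                                           skb->skb_iif);
            if (!sk)
                    return;    /* fall back to the normal route lookup */

            /* Remember the socket so TCP can skip its own lookup later. */
            skb->sk = sk;
            skb->destructor = sock_edemux;

            if (sk->sk_state != TCP_TIME_WAIT) {
                    /* Reuse the route (dst_entry) the socket cached. */
                    struct dst_entry *dst = READ_ONCE(sk->sk_rx_dst);

                    if (dst)
                            dst = dst_check(dst, 0);    /* still valid? */
                    if (dst && inet_sk(sk)->rx_dst_ifindex == skb->skb_iif)
                            skb_dst_set_noref(skb, dst);
            }
    }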
If no Socket in the ESTABLISHED state is found, the packet follows the same path as when IP Early Demux is off. You will see later that a newly created TCP Socket reads the dst_entry from the sk_buff and stores it in struct sock's sk_rx_dst. sk_rx_dst in struct sock is here: linux/include/net/sock.h - Elixir - Free Electrons.
If IP Early Demux cannot do its job, for example because the current sk_buff is the first packet of a Flow and its Socket is not yet in the ESTABLISHED state, then ip_route_input_noref must be called to ask the IP Route System who should handle the sk_buff: the current machine, or should it be forwarded elsewhere. This routing machinery looks complicated, so it is no wonder an Early Demux mechanism exists to skip the step. If the IP Route System concludes that the sk_buff really is to be handled by the current machine, it ultimately sets the function pointed to by dst_entry to ip_local_deliver.
One more thing: Early Demux does not help TCP connections whose Socket is not in the ESTABLISHED state. Such a packet is looked up once in the TCP ESTABLISHED table, then sent through the IP Route System anyway, and then looked up again in the Socket table when it reaches the TCP layer; the overall cost is greater than a single IP Route System lookup. Early Demux is therefore not free. It improves performance in most scenarios, so Linux enables it by default, but in workloads with a large number of short connections, where connections are continuously established and torn down, many packets miss the TCP ESTABLISHED table and the mechanism degrades performance. Linux therefore provides a way to disable it:
sysctl -w net.ipv4.ip_early_demux=0
Someone has measured a loss of up to 5% from this mechanism in a particular scenario: https://patchwork.ozlabs.org/patch/166441/
Both Early Demux and the IP Route System lookup end by setting dst_entry in the sk_buff, and the jump to the next function responsible for the sk_buff goes through that dst_entry. The jump is performed by dst_input, the last call in ip_rcv_finish. Its implementation is simple:
    return skb_dst(skb)->input(skb);
It reads the struct dst_entry constructed earlier from the sk_buff, executes the function its input pointer refers to, and hands in the sk_buff.
If the sk_buff is destined for the current machine, both Early Demux and the IP Route System lookup eventually lead to ip_local_deliver.
ip_local_deliver
ip_local_deliver does three things:
1. Check whether the packet is an IP Fragment. If so, save the sk_buff and return immediately; reassembly happens once the remaining fragments arrive.
2. Hand the packet to Netfilter for filtering through the same NF_HOOK mechanism as in ip_rcv.
3. If the packet is filtered out, drop it and return; otherwise ip_local_deliver_finish eventually executes.
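All three are visible in the function itself, abridged from net/ipv4/ip_input.c (v4.4):

    int ip_local_deliver(struct sk_buff *skb)
    {
            struct net *net = dev_net(skb->dev);

            /* 1. A fragment? Stash it for reassembly and return. */
            if (ip_is_fragment(ip_hdr(skb))) {
                    if (ip_defrag(net, skb, IP_DEFRAG_LOCAL_DELIVER))
                            return 0;
            }

            /* 2./3. Run the NF_INET_LOCAL_IN Netfilter chain; if the
             * packet survives, ip_local_deliver_finish is invoked. */
            return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
                           net, NULL, skb, skb->dev, NULL,
                           ip_local_deliver_finish);
    }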
ip_local_deliver_finish takes the protocol field out of the IP Header and looks up the upper-layer protocol handler registered during IP-layer initialization in the inet_protos table mentioned above. For TCP, the registration is here: linux/net/ipv4/af_inet.c - Elixir - Free Electrons. ip_local_deliver_finish then calls the registered handler, which for TCP is tcp_v4_rcv.
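The core of that dispatch, heavily condensed from ip_local_deliver_finish in v4.4 (raw-socket delivery, xfrm policy checks, and the resubmit loop are elided):

    /* ip_local_deliver_finish, heavily condensed */
    int protocol = ip_hdr(skb)->protocol;
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot) {
            ipprot->handler(skb);    /* tcp_v4_rcv() for IPPROTO_TCP */
    } else {
            /* nobody registered for this protocol number: answer with
             * ICMP "protocol unreachable" and drop the packet */
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PROT_UNREACH, 0);
            kfree_skb(skb);
    }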
The IP layer updates many counters while processing data, which you can see in the snmp.h file. Basically, the statistics shown in /proc/net/netstat whose names contain "IP" are defined in this file.