
Illustrating the packet receiving process in Linux network


This article comes from the WeChat official account "Developing Internal Skills Practice" (ID: kfngxl). Author: Zhang Yanfei (allen).

Because back-end developers at front-line Internet companies have to provide all kinds of network services to millions, tens of millions, or even more than a hundred million users, one of the key requirements in interviews and promotions is the ability to support high concurrency and to understand where the performance overhead comes from, so that it can be optimized. In many cases, if you do not have a deep understanding of what Linux does underneath, you will run into online performance bottlenecks and have no idea where to start.

Today we will use diagrams to take a deep look at how network packets are received under Linux. As usual, let's borrow the simplest piece of code and start thinking from there. For simplicity, we use udp as an example:

int main() {
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);
    bind(serverSocketFd, ...);

    char buff[BUFFSIZE];
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE, 0, ...);
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);
}

The code above is the receive logic of a udp server. Seen from the application's point of view, as long as the client sends the corresponding data, the server will get it after recvfrom returns and print it out. What we want to know now is: what exactly happens from the moment a network packet arrives at the network card until our recvfrom receives the data?
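For reference, here is a minimal compilable expansion of that sketch. The port number, buffer size and error handling are arbitrary choices of our own added for illustration, not part of the original program.

/* Hypothetical, compilable expansion of the sketch above; port 8888 and BUFFSIZE are arbitrary. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define BUFFSIZE 1024

int main(void)
{
    int serverSocketFd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8888);
    bind(serverSocketFd, (struct sockaddr *)&addr, sizeof(addr));

    char buff[BUFFSIZE];
    /* Blocks here until a datagram shows up in the socket receive queue. */
    int readCount = recvfrom(serverSocketFd, buff, BUFFSIZE - 1, 0, NULL, NULL);
    if (readCount < 0) {
        perror("recvfrom");
        return 1;
    }
    buff[readCount] = '\0';
    printf("Receive from client:%s\n", buff);
    return 0;
}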

Through this article you will gain an in-depth understanding of how the Linux network system is implemented and how its parts interact with each other. I believe this will be of great help in your work. This article is based on Linux 3.10 (the source code can be found at https://mirrors.edge.kernel.org/pub/linux/kernel/v3.x/), and uses Intel's igb network card driver as the example.

Friendly hint: this article is a little long, so feel free to bookmark it first and read it later!

1. Overview of Linux network packet receiving

In the TCP/IP layering model, the protocol stack is divided into the physical layer, link layer, network layer, transport layer and application layer. The physical layer corresponds to the network card and cabling; the application layer corresponds to common applications such as Nginx and FTP. Linux implements three of these layers: the link layer, the network layer and the transport layer.

In the Linux kernel implementation, the link layer protocol is implemented by the network card driver, while the kernel protocol stack implements the network layer and the transport layer. The kernel exposes a socket interface to the application layer above, for user processes to access. The TCP/IP layering model, viewed from the Linux perspective, looks like this.

Figure 1: the network protocol stack from the perspective of Linux

In the Linux source code, the logic of the network device drivers is located under drivers/net/ethernet, and the drivers for the Intel series network cards are in the drivers/net/ethernet/intel directory. The protocol stack module code is located in the kernel and net directories.

Kernel and network device drivers interact through interrupts. When data arrives at a device, a voltage change is triggered on the relevant CPU pin to notify the CPU to process the data. For the network module, because the processing is complex and time-consuming, doing all of it inside the interrupt handler would let the handler (which runs at too high a priority) monopolize the CPU, and the CPU would be unable to respond to messages from other devices such as the mouse and keyboard. Therefore Linux splits interrupt handling into a top half and a bottom half. The top half does only the simplest work, finishes quickly and releases the CPU, so that other interrupts can come in; most of the remaining work is deferred to the bottom half, which can be handled calmly later. Since kernel version 2.4 the bottom half has been implemented with soft interrupts (softirqs), which are handled by the ksoftirqd kernel threads. Unlike a hard interrupt, which notifies the CPU by applying a voltage change to a physical pin, a soft interrupt notifies the soft-interrupt handler by setting a bit of a variable in memory.
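As a side note, the per-CPU soft-interrupt counters that ksoftirqd works through are visible from user space in /proc/softirqs (one column per CPU, one row per softirq type). Here is a minimal sketch that prints only the network-related rows; the file path is standard procfs, the filtering is just illustrative.

/* Print the NET_RX / NET_TX rows of /proc/softirqs, the per-CPU softirq counters. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/softirqs", "r");
    if (!fp) { perror("fopen"); return 1; }

    char line[1024];
    while (fgets(line, sizeof(line), fp)) {
        /* keep the CPU header line plus the two network softirq rows */
        if (strstr(line, "CPU") || strstr(line, "NET_RX") || strstr(line, "NET_TX"))
            fputs(line, stdout);
    }
    fclose(fp);
    return 0;
}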

Now that we have a rough idea of NIC drivers, hard interrupts, soft interrupts and the ksoftirqd threads, we can draw a path diagram of kernel packet reception based on these concepts:

Figure 2: overview of Linux kernel network packet receiving

When data arrives at the network card, the first module to work in Linux is the network driver. The driver writes the frames received by the NIC into memory via DMA, then raises a hard interrupt to notify the CPU that data has arrived. When the CPU receives the interrupt request, it calls the interrupt handler registered by the network driver. The NIC's interrupt handler does very little work: it issues a soft interrupt request and releases the CPU as soon as possible. ksoftirqd notices the soft interrupt request and calls poll to start polling for packets; after reception, the packets are handed to the protocol stack layer by layer. UDP packets end up in the receive queue of the corresponding user socket.

The figure above gives us a grasp of the overall way Linux processes a packet. But to learn more about the details of how the network module works, we have to keep going.

2. Linux startup

The Linux driver, the kernel protocol stack and other modules have to do a lot of preparatory work before they are able to receive packets from the network card: for example, creating the ksoftirqd kernel threads in advance, registering the handler of each protocol, initializing the network device subsystem, and starting the network card. Only when all of this is ready can packet reception really begin. So let's first look at how these preparations are done.

2.1 Creating the ksoftirqd kernel threads

Linux soft interrupts are handled in dedicated kernel threads (ksoftirqd), so it is important to see how these threads are initialized; that will let us understand the packet receiving process more accurately later. The number of such threads is not 1 but N, where N equals the number of cores on your machine.

During system initialization, smpboot_register_percpu_thread is called in kernel/smpboot.c, which in turn executes spawn_ksoftirqd (located in kernel/softirq.c) to create the ksoftirqd threads.

Figure 3: creating the ksoftirqd kernel threads. The related code is as follows:

//file: kernel/softirq.c
static struct smp_hotplug_thread softirq_threads = {
    .store             = &ksoftirqd,
    .thread_should_run = ksoftirqd_should_run,
    .thread_fn         = run_ksoftirqd,
    .thread_comm       = "ksoftirqd/%u",
};

static __init int spawn_ksoftirqd(void)
{
    register_cpu_notifier(&cpu_nfb);
    BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
    return 0;
}
early_initcall(spawn_ksoftirqd);

Once ksoftirqd is created, it enters its own thread loop, built around ksoftirqd_should_run and run_ksoftirqd, and keeps checking whether there are soft interrupts that need handling. Note that soft interrupts are not only the network ones; there are other types as well:

//file: include/linux/interrupt.h
enum {
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    BLOCK_IOPOLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,
};

2.2 Network subsystem initialization

Figure 4: network subsystem initialization. The Linux kernel initializes each subsystem by calling subsys_initcall; you can grep many invocations of this function in the source tree. Here we are interested in the initialization of the network subsystem, which runs the net_dev_init function.

//file: net/core/dev.c
static int __init net_dev_init(void)
{
    ......
    for_each_possible_cpu(i) {
        struct softnet_data *sd = &per_cpu(softnet_data, i);

        memset(sd, 0, sizeof(*sd));
        skb_queue_head_init(&sd->input_pkt_queue);
        skb_queue_head_init(&sd->process_queue);
        sd->completion_queue = NULL;
        INIT_LIST_HEAD(&sd->poll_list);
        ......
    }
    ......
    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);
}
subsys_initcall(net_dev_init);

In this function a softnet_data structure is allocated for every CPU; the poll_list inside it is where the driver will later register its poll function. We will see that happen when the NIC driver initializes.
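Incidentally, these per-CPU softnet_data structures are what /proc/net/softnet_stat reports on: one row per CPU, with hexadecimal columns for processed packets, drops, time_squeeze and so on. A small sketch to dump it, purely for observation:

/* Dump /proc/net/softnet_stat: one row per CPU of softnet_data statistics. */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/proc/net/softnet_stat", "r");
    if (!fp) { perror("fopen"); return 1; }

    char line[512];
    int cpu = 0;
    while (fgets(line, sizeof(line), fp))
        printf("cpu%d: %s", cpu++, line);
    fclose(fp);
    return 0;
}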

In addition, open_softirq registers a handler for each soft interrupt: net_tx_action for NET_TX_SOFTIRQ and net_rx_action for NET_RX_SOFTIRQ. Tracing into open_softirq shows that the registration is recorded in the softirq_vec array. Later, when a ksoftirqd thread receives a soft interrupt, it uses this array to find the handler corresponding to each soft interrupt.

//file: kernel/softirq.c
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}

2.3 Protocol stack registration

The kernel implements the ip protocol at the network layer, and the tcp and udp protocols at the transport layer. The corresponding implementation functions are ip_rcv(), tcp_v4_rcv() and udp_rcv(). Unlike the way we usually write application code, the kernel wires these up through registration. fs_initcall in the Linux kernel is, like subsys_initcall, an entry point for module initialization; the network protocol stack registration starts when fs_initcall calls inet_init. Through inet_init, these functions are registered into the inet_protos and ptype_base data structures, as shown below:

Figure 5: AF_INET protocol stack registration. The related code is as follows:

//file: net/ipv4/af_inet.c
static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
};

static const struct net_protocol udp_protocol = {
    .handler     = udp_rcv,
    .err_handler = udp_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

static const struct net_protocol tcp_protocol = {
    .early_demux = tcp_v4_early_demux,
    .handler     = tcp_v4_rcv,
    .err_handler = tcp_v4_err,
    .no_policy   = 1,
    .netns_ok    = 1,
};

static int __init inet_init(void)
{
    ......
    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    ......
    dev_add_pack(&ip_packet_type);
}

In the code above we can see that the handler in the udp_protocol structure is udp_rcv and the handler in the tcp_protocol structure is tcp_v4_rcv; they are registered through inet_add_protocol.

int inet_add_protocol(const struct net_protocol *prot, unsigned char protocol)
{
    if (!prot->netns_ok) {
        pr_err("Protocol %u is not namespace aware, cannot register.\n",
               protocol);
        return -EINVAL;
    }

    return !cmpxchg((const struct net_protocol **)&inet_protos[protocol],
                    NULL, prot) ? 0 : -1;
}

The inet_add_protocol function registers the tcp and udp handlers into the inet_protos array. Now look at the line dev_add_pack(&ip_packet_type): the type in the ip_packet_type structure is the protocol name and func is the ip_rcv function; in dev_add_pack it is registered into the ptype_base hash table.

//file: net/core/dev.c
void dev_add_pack(struct packet_type *pt)
{
    struct list_head *head = ptype_head(pt);
    ......
}

static inline struct list_head *ptype_head(const struct packet_type *pt)
{
    if (pt->type == htons(ETH_P_ALL))
        return &ptype_all;
    else
        return &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}

What we need to remember here is that inet_protos records the addresses of the udp and tcp handler functions, and ptype_base stores the address of the ip_rcv() function. Later we will see that the soft interrupt finds the ip_rcv address through ptype_base and so delivers ip packets into ip_rcv() for processing; ip_rcv in turn finds the tcp or udp handler through inet_protos and forwards the packet to udp_rcv() or tcp_v4_rcv().

To extend this a bit: reading the code of functions such as ip_rcv and udp_rcv reveals a lot of protocol processing detail. For example, ip_rcv handles netfilter and iptables filtering; if you have many or very complex netfilter or iptables rules, they are all executed in soft-interrupt context and will increase network latency. As another example, udp_rcv checks whether the socket receive queue is full; the related kernel parameters are net.core.rmem_max and net.core.rmem_default. If you are interested, it is well worth reading the code of the inet_init function carefully.

2.4 NIC driver initialization

Every driver (not only NIC drivers) registers an initialization function with the kernel using module_init, and the kernel calls this function when the driver is loaded. For example, the code of the igb NIC driver is in drivers/net/ethernet/intel/igb/igb_main.c:

//file: drivers/net/ethernet/intel/igb/igb_main.c
static struct pci_driver igb_driver = {
    .name     = igb_driver_name,
    .id_table = igb_pci_tbl,
    .probe    = igb_probe,
    .remove   = igb_remove,
    ......
};

static int __init igb_init_module(void)
{
    ......
    ret = pci_register_driver(&igb_driver);
    return ret;
}

Once the driver's pci_register_driver call completes, the Linux kernel knows the driver's relevant information, such as igb_driver_name and the address of the igb_probe function. When the NIC device is recognized, the kernel calls the driver's probe method (for igb_driver this is igb_probe). The purpose of the probe method is to make the device ready; for the igb NIC, igb_probe is in drivers/net/ethernet/intel/igb/igb_main.c. The main operations it performs are as follows:

Figure 6: NIC driver initialization

In step 5 we can see that the NIC driver implements the interfaces ethtool needs, and the function addresses are registered here as well. When ethtool issues a system call, the kernel finds the callback for the requested operation; for the igb NIC, all of these implementations are in drivers/net/ethernet/intel/igb/igb_ethtool.c. Now you can fully understand how ethtool works: the reason this command can show the NIC's send/receive statistics, change its adaptive mode, and adjust the number and size of the RX queues is that ethtool ultimately calls into the corresponding methods of the NIC driver, not that ethtool itself has some superpower.

The igb_netdev_ops registered in step 6 contains functions such as igb_open, which is called when the network card is brought up.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static const struct net_device_ops igb_netdev_ops = {
    .ndo_open            = igb_open,
    .ndo_stop            = igb_close,
    .ndo_start_xmit      = igb_xmit_frame,
    .ndo_get_stats64     = igb_get_stats64,
    .ndo_set_rx_mode     = igb_set_rx_mode,
    .ndo_set_mac_address = igb_set_mac,
    .ndo_change_mtu      = igb_change_mtu,
    .ndo_do_ioctl        = igb_ioctl,
    ......
};

In step 7, during the igb_probe initialization, igb_alloc_q_vector is also called. It registers the poll function required by the NAPI mechanism; for the igb driver this function is igb_poll, as the following code shows:

static int igb_alloc_q_vector(struct igb_adapter *adapter,
                              int v_count, int v_idx,
                              int txr_count, int txr_idx,
                              int rxr_count, int rxr_idx)
{
    ......
    netif_napi_add(adapter->netdev, &q_vector->napi, igb_poll, 64);
}

2.5 Starting the network card

When all the initialization above is done, the network card can be started. Recall that during NIC driver initialization the driver registered a struct net_device_ops variable with the kernel, which contains the callbacks (function pointers) for enabling the card, transmitting packets, setting the MAC address and so on. When a network card is enabled (for example with ifconfig eth0 up), the igb_open method in net_device_ops is called. It usually does the following:

Figure 7: starting the network card

//file: drivers/net/ethernet/intel/igb/igb_main.c
static int __igb_open(struct net_device *netdev, bool resuming)
{
    ......
    err = igb_setup_all_tx_resources(adapter);
    err = igb_setup_all_rx_resources(adapter);
    err = igb_request_irq(adapter);
    if (err)
        goto err_req_irq;

    for (i = 0; i < adapter->num_q_vectors; i++)
        napi_enable(&(adapter->q_vector[i]->napi));
    ......
}

In the code above, __igb_open calls igb_setup_all_tx_resources and igb_setup_all_rx_resources. In the igb_setup_all_rx_resources step the RingBuffer is allocated and the mapping between memory and the Rx queues is established. (The number and size of the Rx/Tx queues can be configured through ethtool.) Let's continue with the interrupt registration, igb_request_irq:

static int igb_request_irq(struct igb_adapter *adapter)
{
    ......
    if (adapter->msix_entries) {
        err = igb_request_msix(adapter);
        if (!err)
            goto request_done;
        ......
    }
    ......
}

static int igb_request_msix(struct igb_adapter *adapter)
{
    ......
    for (i = 0; i < adapter->num_q_vectors; i++) {
        ......
        err = request_irq(adapter->msix_entries[vector].vector,
                          igb_msix_ring, 0, q_vector->name, ......);
        ......
    }
    ......
}

Tracing the function calls in the code above, __igb_open => igb_request_irq => igb_request_msix, we can see in igb_request_msix that for a multi-queue NIC an interrupt is registered for every queue, with igb_msix_ring as its handler (this function is also in drivers/net/ethernet/intel/igb/igb_main.c). We can also see that in MSI-X mode each RX queue has its own independent interrupt, so already at the hardware interrupt level the NIC can be configured so that packets received on different queues are processed by different CPUs. (The binding to CPUs can be changed with irqbalance, or by modifying /proc/irq/IRQ_NUMBER/smp_affinity.)
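If you want to adjust that binding by hand, the smp_affinity file takes a hexadecimal CPU bitmask. A minimal sketch, assuming an example IRQ number (30) that you would replace with the one shown for your NIC queue in /proc/interrupts; writing it requires root:

/* Pin one hardware interrupt to CPU 0 by writing a hex CPU mask to its smp_affinity file.
 * The IRQ number 30 is only a placeholder for this sketch. */
#include <stdio.h>

int main(void)
{
    const char *path = "/proc/irq/30/smp_affinity";  /* example IRQ number */
    FILE *fp = fopen(path, "w");
    if (!fp) { perror("fopen"); return 1; }

    fprintf(fp, "1\n");   /* bitmask 0x1 = CPU 0 */
    fclose(fp);
    return 0;
}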

When the above preparations are done, you can open the door to welcome guests (data packets)!

3. Welcoming the arrival of data

3.1 Hard interrupt processing

First, when a data frame arrives at the network card from the cable, its first stop is the NIC's receive queue. The NIC looks for an available memory location in the RingBuffer allocated to it, and the DMA engine then DMAs the data into that NIC-associated memory; the CPU is not even aware of this. When the DMA operation is complete, the NIC raises a hard interrupt to the CPU to notify it that data has arrived.

Figure 8: hard interrupt processing of NIC data

Note: when the RingBuffer is full, new packets are dropped. When inspecting the NIC with ifconfig you may see an overruns counter, which indicates packets discarded because the ring queue was full. If such packet loss is observed, you may need to use the ethtool command to increase the length of the ring queue.
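Those drops can also be watched through sysfs, where ifconfig's "overruns" corresponds to the rx_fifo_errors counter. A small sketch; "eth0" is just an example interface name:

/* Check ring-overflow drop counters for one interface via sysfs. */
#include <stdio.h>

static long read_counter(const char *path)
{
    long v = -1;
    FILE *fp = fopen(path, "r");
    if (fp) {
        fscanf(fp, "%ld", &v);
        fclose(fp);
    }
    return v;
}

int main(void)
{
    printf("rx_fifo_errors (overruns): %ld\n",
           read_counter("/sys/class/net/eth0/statistics/rx_fifo_errors"));
    printf("rx_dropped: %ld\n",
           read_counter("/sys/class/net/eth0/statistics/rx_dropped"));
    return 0;
}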

In the section on starting the network card, we mentioned that the handler registered for the NIC's hard interrupt is igb_msix_ring.

//file: drivers/net/ethernet/intel/igb/igb_main.c
static irqreturn_t igb_msix_ring(int irq, void *data)
{
    struct igb_q_vector *q_vector = data;

    igb_write_itr(q_vector);

    napi_schedule(&q_vector->napi);

    return IRQ_HANDLED;
}

igb_write_itr merely records the hardware interrupt rate (reportedly used when reducing the interrupt rate on the CPU). Following the napi_schedule call all the way down, it goes __napi_schedule => ____napi_schedule:

static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

Here we see list_add_tail modifying the poll_list in the per-CPU variable softnet_data, adding the napi_struct passed in by the driver. poll_list in softnet_data is a doubly-linked list of all the devices that have input frames waiting to be processed. Then __raise_softirq_irqoff triggers the NET_RX_SOFTIRQ soft interrupt; this so-called triggering is nothing more than an OR operation on a variable.

void __raise_softirq_irqoff(unsigned int nr)
{
    trace_softirq_raise(nr);
    or_softirq_pending(1UL << nr);
}

//file: include/linux/irq_cpustat.h
#define or_softirq_pending(x)  (local_softirq_pending() |= (x))

As we said, Linux does only the simplest necessary work in the hard interrupt and hands most of the remaining processing to the soft interrupt. As the code above shows, the hard interrupt handling is really very short: record a register, modify the CPU's poll_list, and issue a soft interrupt. That's it; the hard interrupt's job is done.

3.2 ksoftirqd kernel threads handle soft interrupts

Figure 9: the ksoftirqd kernel thread

When the kernel threads were initialized, we introduced the two thread functions in ksoftirqd: ksoftirqd_should_run and run_ksoftirqd. The ksoftirqd_should_run code is as follows:

static int ksoftirqd_should_run(unsigned int cpu)
{
    return local_softirq_pending();
}

#define local_softirq_pending() \
    __IRQ_STAT(smp_processor_id(), __softirq_pending)

Here we see the same local_softirq_pending that is used in the hard interrupt, but the usage is different: in the hard interrupt it is written to set the flag, while here it is only read. If NET_RX_SOFTIRQ was set by the hard interrupt, it will naturally be seen here. Next, the thread function actually enters run_ksoftirqd for processing:

static void run_ksoftirqd(unsigned int cpu)
{
    local_irq_disable();
    if (local_softirq_pending()) {
        __do_softirq();
        rcu_note_context_switch(cpu);
        local_irq_enable();
        cond_resched();
        return;
    }
    local_irq_enable();
}

In __do_softirq, the action registered for the current CPU's pending soft interrupt type is looked up and called:

asmlinkage void __do_softirq(void)
{
    do {
        if (pending & 1) {
            unsigned int vec_nr = h - softirq_vec;
            int prev_count = preempt_count();

            ......
            trace_softirq_entry(vec_nr);
            h->action(h);
            trace_softirq_exit(vec_nr);
            ......
        }
        h++;
        pending >>= 1;
    } while (pending);
}

In the network subsystem initialization section we saw that net_rx_action was registered as the handler for NET_RX_SOFTIRQ, so the net_rx_action function is executed here.
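To make the register / raise / dispatch relationship concrete, here is a tiny user-space analogy, not kernel code: handlers sit in an array indexed by soft-interrupt number (like softirq_vec), raising a soft interrupt just ORs a bit into a pending word, and the dispatch loop walks the set bits and calls the registered actions.

/* User-space analogy of open_softirq / __raise_softirq_irqoff / __do_softirq.
 * Names mirror the kernel ones, but this is purely illustrative. */
#include <stdio.h>

#define NR_SOFTIRQS 10
enum { HI_SOFTIRQ = 0, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ };

typedef void (*softirq_action_fn)(void);

static softirq_action_fn softirq_vec[NR_SOFTIRQS];  /* like softirq_vec[] */
static unsigned long pending;                        /* like the per-CPU pending word */

static void open_softirq_demo(int nr, softirq_action_fn action) { softirq_vec[nr] = action; }
static void raise_softirq_demo(int nr) { pending |= 1UL << nr; }  /* "trigger" = one OR */

static void do_softirq_demo(void)
{
    for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
        if (pending & (1UL << nr)) {
            pending &= ~(1UL << nr);
            softirq_vec[nr]();        /* call the registered action */
        }
    }
}

static void net_rx_action_demo(void) { printf("polling the NIC driver...\n"); }

int main(void)
{
    open_softirq_demo(NET_RX_SOFTIRQ, net_rx_action_demo);  /* registration at init */
    raise_softirq_demo(NET_RX_SOFTIRQ);                      /* what the hard irq does */
    do_softirq_demo();                                       /* what ksoftirqd does */
    return 0;
}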

One detail to note here is that both setting the soft interrupt flag in the hard interrupt and checking whether a soft interrupt is pending are based on smp_processor_id(). This means that whichever CPU responds to the hard interrupt is also the CPU that handles the resulting soft interrupt. So if you find that your Linux soft interrupt CPU consumption is concentrated on one core, the fix is to adjust the CPU affinity of the hard interrupts so that they are spread across different CPU cores.

Let's focus on the core function net_rx_action.

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();
    while (!list_empty(&sd->poll_list)) {
        ......
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        work = 0;
        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }
        budget -= work;
    }
}

The time_limit and budget at the beginning of the function are used to make net_rx_action exit on its own, ensuring that receiving network packets does not monopolize the CPU; the remaining packets are handled the next time the NIC raises a hard interrupt. budget can be adjusted through a kernel parameter. The rest of the core logic of this function is: get the current CPU's softnet_data, traverse its poll_list, and execute the poll function registered by the NIC driver. For the igb network card, that is the igb driver's igb_poll function.
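The budget mentioned above corresponds to the net.core.netdev_budget sysctl. As a quick check, it can be read through the standard /proc/sys interface; a minimal sketch:

/* Read net.core.netdev_budget, the cap on work done per net_rx_action run. */
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("/proc/sys/net/core/netdev_budget", "r");
    if (!fp) { perror("fopen"); return 1; }

    int budget;
    fscanf(fp, "%d", &budget);
    printf("netdev_budget = %d\n", budget);
    fclose(fp);
    return 0;
}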

static int igb_poll(struct napi_struct *napi, int budget)
{
    ......
    if (q_vector->tx.ring)
        clean_complete = igb_clean_tx_irq(q_vector);

    if (q_vector->rx.ring)
        clean_complete &= igb_clean_rx_irq(q_vector, budget);
    ......
}

For the receive path, the key work in igb_poll is the call to igb_clean_rx_irq.

static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
{
    ......
    do {
        ......
        skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);

        if (igb_is_non_eop(rx_ring, rx_desc))
            continue;

        if (igb_cleanup_headers(rx_ring, rx_desc, skb)) {
            skb = NULL;
            continue;
        }

        igb_process_skb_fields(rx_ring, rx_desc, skb);

        napi_gro_receive(&q_vector->napi, skb);
        ......
    } while (......);
}

igb_fetch_rx_buffer and igb_is_non_eop together take a data frame off the RingBuffer. Why two functions? Because a frame may occupy more than one RingBuffer entry, so entries are fetched in a loop until the end of the frame. The retrieved frame is represented by an sk_buff. After the data is received it is validated, and then fields of the skb such as the timestamp, VLAN id and protocol are filled in. Next comes napi_gro_receive:

//file: net/core/dev.c
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
    skb_gro_reset_offset(skb);
    return napi_skb_finish(dev_gro_receive(napi, skb), skb);
}

The dev_gro_receive function implements the NIC's GRO feature, which can be roughly understood as merging related small packets into one large packet, so as to reduce the number of packets handed to the protocol stack and thereby reduce CPU usage. Let's set GRO aside for now and look at napi_skb_finish, which mainly calls netif_receive_skb.

//file: net/core/dev.c
static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
{
    switch (ret) {
    case GRO_NORMAL:
        if (netif_receive_skb(skb))
            ret = GRO_DROP;
        break;
    ......
    }
}

In netif_receive_skb, the packet is delivered to the protocol stack. A note: the following sections 3.3, 3.4 and 3.5 are also part of the soft-interrupt processing; they are only split into separate sections because this one would otherwise be too long.

3.3 Network protocol stack processing

The netif_receive_skb function processes the packet according to its protocol; for a udp packet, it is delivered to the ip_rcv() and then the udp_rcv() protocol handler for processing.

Figure 10 Network protocol stack processing

//file: net/core/dev.c
int netif_receive_skb(struct sk_buff *skb)
{
    // RPS handling logic, ignored here
    ......
    return __netif_receive_skb(skb);
}

static int __netif_receive_skb(struct sk_buff *skb)
{
    ......
    ret = __netif_receive_skb_core(skb, false);
}

static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
{
    ......

    // pcap logic: here the data is delivered to the capture points; tcpdump taps in from this entry
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    ......

    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
}

In __netif_receive_skb_core I finally saw the packet capture point used by tcpdump, which I use so often, and got quite excited; it seems the time spent reading the source code was not wasted. __netif_receive_skb_core then takes the protocol information out of the packet and iterates over the list of callback functions registered for that protocol. ptype_base is the hash table we mentioned in the protocol registration section; the address of the ip_rcv function is stored there.
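As an aside on that capture point: a packet socket opened with ETH_P_ALL is registered (via dev_add_pack) on the ptype_all list walked above, which is exactly how tcpdump/libpcap gets a copy of every frame. A minimal sketch, which needs root to run:

/* An AF_PACKET socket with ETH_P_ALL hooks onto the ptype_all delivery point. */
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    unsigned char frame[2048];
    ssize_t n = recvfrom(fd, frame, sizeof(frame), 0, NULL, NULL);
    if (n < 0) { perror("recvfrom"); return 1; }

    printf("captured a %zd-byte frame (link-layer header included)\n", n);
    return 0;
}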

//file: net/core/dev.c
static inline int deliver_skb(struct sk_buff *skb,
                              struct packet_type *pt_prev,
                              struct net_device *orig_dev)
{
    ......
    return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
}

pt_prev->func calls the handler registered by the protocol layer. For an ip packet this goes into ip_rcv (and for an arp packet into arp_rcv).

3.4 IP protocol layer processing

Let's look at what Linux does at the ip protocol layer, and how the packet is passed further up to the udp or tcp protocol handlers.

//file: net/ipv4/ip_input.c
int ip_rcv(struct sk_buff *skb, struct net_device *dev,
           struct packet_type *pt, struct net_device *orig_dev)
{
    ......
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);
}

Here NF_HOOK is a hook function; after the registered hooks have been executed, it continues into the function given as the last argument, ip_rcv_finish.

static int ip_rcv_finish(struct sk_buff *skb)
{
    ......
    if (!skb_dst(skb)) {
        int err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
                                       iph->tos, skb->dev);
        ......
    }
    ......
    return dst_input(skb);
}

Tracing ip_route_input_noref, we see that it in turn calls ip_route_input_mc, in which the function ip_local_deliver is assigned to dst.input, as follows:

//file: net/ipv4/route.c
static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                             u8 tos, struct net_device *dev, int our)
{
    if (our) {
        rth->dst.input = ip_local_deliver;
        rth->rt_flags |= RTCF_LOCAL;
    }
}

So we come back to return dst_input(skb) in ip_rcv_finish.

static inline int dst_input(struct sk_buff *skb)
{
    return skb_dst(skb)->input(skb);
}

The input method called by skb_dst(skb)->input is the ip_local_deliver that was assigned by the routing subsystem.

//file: net/ipv4/ip_input.c
int ip_local_deliver(struct sk_buff *skb)
{
    if (ip_is_fragment(ip_hdr(skb))) {
        if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
            return 0;
    }

    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb)
{
    ......
    int protocol = ip_hdr(skb)->protocol;
    const struct net_protocol *ipprot;

    ipprot = rcu_dereference(inet_protos[protocol]);
    if (ipprot != NULL) {
        ret = ipprot->handler(skb);
    }
}

As we saw in the protocol registration section, the addresses of tcp_v4_rcv() and udp_rcv() are stored in inet_protos. Here the dispatch is done according to the protocol type in the packet, and the skb is handed further up to the higher-layer protocols, udp and tcp.

3.5 UDP protocol layer processing

As we said in the protocol registration section, the handler of the udp protocol is udp_rcv.

//file: net/ipv4/udp.c
int udp_rcv(struct sk_buff *skb)
{
    return __udp4_lib_rcv(skb, &udp_table, IPPROTO_UDP);
}

int __udp4_lib_rcv(struct sk_buff *skb, struct udp_table *udptable, int proto)
{
    ......
    sk = __udp4_lib_lookup_skb(skb, uh->source, uh->dest, udptable);

    if (sk != NULL) {
        int ret = udp_queue_rcv_skb(sk, skb);
        ......
    }
    ......
    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
}

__udp4_lib_lookup_skb looks up the socket corresponding to the skb; if one is found, the packet is placed into that socket's receive queue. If none is found, an ICMP destination-unreachable (port unreachable) packet is sent.

//file: net/ipv4/udp.c
int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
    ......
    if (sk_rcvqueues_full(sk, skb, sk->sk_rcvbuf))
        goto drop;

    rc = 0;

    ipv4_pktinfo_prepare(skb);
    bh_lock_sock(sk);
    if (!sock_owned_by_user(sk))
        rc = __udp_queue_rcv_skb(sk, skb);
    else if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) {
        bh_unlock_sock(sk);
        goto drop;
    }
    bh_unlock_sock(sk);

    return rc;
}

sock_owned_by_user determines whether a user process is currently making a system call on this socket (that is, whether the socket is occupied). If not, the packet can be placed directly onto the socket's receive queue. If it is, the packet is added to the backlog queue through sk_add_backlog; when the user process releases the socket, the kernel checks the backlog queue and, if there is data, moves it to the receive queue.

If sk_rcvqueues_full reports that the receive queue is full, the packet is dropped directly. The receive queue size is controlled by the kernel parameters net.core.rmem_max and net.core.rmem_default.
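An application can also ask for a larger receive buffer on its own socket with SO_RCVBUF; the request is capped by net.core.rmem_max (and the kernel doubles the stored value for bookkeeping). A minimal sketch of doing that for a udp socket:

/* Enlarge a UDP socket's receive buffer and print the effective size. */
#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    int want = 4 * 1024 * 1024;   /* ask for 4 MB */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));

    int got;
    socklen_t len = sizeof(got);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
    printf("effective receive buffer: %d bytes\n", got);
    return 0;
}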

4. The recvfrom system call

Our story has two threads; let's now pick up the other one. Above we covered the whole process in which the Linux kernel receives and processes a packet and finally places it into a socket's receive queue. Now let's look back at what happens after the user process calls recvfrom. The recvfrom we call in code is a glibc library function; once executed, it traps the process into kernel mode and enters the system call sys_recvfrom implemented by Linux. Before looking at sys_recvfrom itself, let's briefly look at socket, the core data structure. It is large, so we only draw the parts relevant to today's topic, as follows:

Figure 11: the socket kernel data structure

The const struct proto_ops in the socket data structure corresponds to a protocol's method set. Each protocol implements its own proto_ops; for the IPv4 Internet protocol family, each protocol has a corresponding set of processing methods, as follows. For udp it is defined by inet_dgram_ops, in which the inet_recvmsg method is registered.

//file: net/ipv4/af_inet.c
const struct proto_ops inet_stream_ops = {
    ......
    .recvmsg = inet_recvmsg,
    .mmap    = sock_no_mmap,
    ......
};

const struct proto_ops inet_dgram_ops = {
    ......
    .sendmsg = inet_sendmsg,
    .recvmsg = inet_recvmsg,
    ......
};

Another member of the socket data structure, struct sock *sk, is a very large and very important substructure. Its sk_prot defines a second level of handler functions; for the UDP protocol it is set to the method set implemented by UDP, udp_prot.

//file: net/ipv4/udp.c
struct proto udp_prot = {
    .name     = "UDP",
    .owner    = THIS_MODULE,
    .close    = udp_lib_close,
    .connect  = ip4_datagram_connect,
    ......
    .sendmsg  = udp_sendmsg,
    .recvmsg  = udp_recvmsg,
    .sendpage = udp_sendpage,
    ......
};

Having looked at the socket variable, let's look at the implementation of sys_recvfrom.

Figure 12: internal implementation of the recvfrom function

In inet_recvmsg, sk->sk_prot->recvmsg is called.

//file: net/ipv4/af_inet.c
int inet_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
                 size_t size, int flags)
{
    ......
    err = sk->sk_prot->recvmsg(iocb, sk, msg, size, flags & MSG_DONTWAIT,
                               flags & ~MSG_DONTWAIT, &addr_len);
    if (err >= 0)
        msg->msg_namelen = addr_len;
    return err;
}

As mentioned above, for a socket of the udp protocol this sk_prot is struct proto udp_prot in net/ipv4/udp.c, which leads us to the udp_recvmsg method.

//file: net/core/datagram.c
struct sk_buff *__skb_recv_datagram(struct sock *sk, unsigned int flags,
                                    int *peeked, int *off, int *err)
{
    ......
    do {
        struct sk_buff_head *queue = &sk->sk_receive_queue;

        skb_queue_walk(queue, skb) {
            ......
        }
        ......

        error = -EAGAIN;
        if (!timeo)
            goto no_packet;
    } while (!wait_for_more_packets(sk, err, &timeo, last));
}
EXPORT_SYMBOL(__skb_recv_datagram);

Finally we find the point we were looking for: the so-called read operation is simply an access to sk->sk_receive_queue. If there is no data and the caller is allowed to wait, wait_for_more_packets() is called to wait, which puts the user process to sleep.
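That sleep only happens for a blocking socket. With MSG_DONTWAIT the same call returns immediately with EAGAIN when sk_receive_queue is empty (the !timeo branch above); a small sketch, with port 8888 as an arbitrary example:

/* A non-blocking recvfrom returns -1 / EAGAIN instead of sleeping when the
 * socket receive queue is empty. */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8888);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    char buff[1024];
    ssize_t n = recvfrom(fd, buff, sizeof(buff), MSG_DONTWAIT, NULL, NULL);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("receive queue empty, no sleep: %s\n", strerror(errno));
    return 0;
}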

5. Summary

The network module is the most complex module in the Linux kernel. A seemingly simple packet receiving process involves the interaction of many kernel components: the NIC driver, the protocol stack, the kernel ksoftirqd threads, and so on. It looks very complicated, and this article has tried, with illustrations, to explain the kernel receive process clearly in an easy-to-understand way. Now let's string the whole receive process together once more.

When the user process calls recvfrom, it enters kernel mode through the system call; if the receive queue has no data, the process goes to sleep and is suspended by the operating system. This part is relatively simple; most of the rest of the show is performed by other modules of the Linux kernel.

First of all, Linux needs to do a lot of preparatory work before it can start receiving packets:

1. Create the ksoftirqd threads, set their thread functions, and count on them to handle soft interrupts later.

2. Register the protocol stack. Linux implements many protocols, such as arp, icmp, ip, udp and tcp; each protocol registers its own handler function so that a packet can quickly find its handler.

3. Initialize the network card driver. Each driver has an initialization function that the kernel calls to let the driver set itself up. During this initialization, the DMA is prepared and the address of the NAPI poll function is told to the kernel.

4. Start the network card, allocate the RX and TX queues, and register the corresponding interrupt handlers.

This is the important work done before the kernel is ready to receive packets. Only after all of the above is ready can it turn on hard interrupts and wait for packets to arrive.

When the data arrives, the first component to greet it is the network card (well, that almost goes without saying):

1. The network card DMAs the data frame into the RingBuffer in memory, and then raises a hard interrupt to notify the CPU.

2. The CPU responds to the interrupt request and calls the interrupt handler registered when the network card was started.

3. The interrupt handler does almost nothing; it just issues a soft interrupt request.

4. The kernel thread ksoftirqd notices that a soft interrupt request has arrived and first disables hard interrupts.

5. The ksoftirqd thread starts calling the driver's poll function to receive packets.

6. The poll function sends the received packets to the ip_rcv function registered on the protocol stack.

7. The ip_rcv function then passes the packet to udp_rcv (or, for a tcp packet, to tcp_v4_rcv).

Now let's come back to the opening question: behind the single line of recvfrom we saw at the user level, the Linux kernel has to do all this work before we can smoothly get our data. And this is still simple UDP; for TCP the kernel has even more to do. One cannot help but marvel at how much thought the kernel developers have put in.

After understanding the whole packet receiving process, we can clearly identify the CPU costs of receiving a packet on Linux. The first part is the overhead of the user process's system call trapping into kernel mode. The second is the CPU cost of the hard interrupt with which the CPU responds to the packet. The third is the time spent in the soft-interrupt context of the ksoftirqd kernel thread. We will publish a dedicated article later to actually measure these costs.

In addition, there are many details of network sending and receiving that we have not expanded on here, such as NAPI, GRO, RPS and so on. Saying too much would get in the way of your grasp of the overall flow, so we have tried to keep only the main framework. Less is more!
