25 pictures, 10, 000 words, disassembling the Linux network packet sending process

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)11/24 Report--

This article comes from the WeChat official account "Developing Internal Skills Practice" (ID: kfngxl). Author: Zhang Yanfei (allen).

Hello, everyone. I'm Brother Fei!

Before I begin today's article, I would like to ask you to think about a few small questions.

Q1: When we monitor the CPU the kernel spends on sending data, should we look at sy or si?

Q2: Why is NET_RX much larger than NET_TX in /proc/softirqs on your server?

Q3: Which memory copy operations are involved when sending network data?

Although these questions are often seen online, we seldom look into them. If we can really understand these issues thoroughly, our ability to control performance will become stronger.

With these three questions in mind, we begin today's in-depth analysis of the Linux kernel's network send process. Following our usual tradition, let's start from a simple piece of code. The following is a minimal excerpt from a typical server program:

int main() {
    fd = socket(AF_INET, SOCK_STREAM, 0);
    bind(fd, ...);
    listen(fd, ...);

    cfd = accept(fd, ...);

    // receive the user request
    read(cfd, ...);
    // process the request
    dosometing();
    // return the result to the user
    send(cfd, buf, sizeof(buf), 0);
}

Today we will discuss how the kernel sends the packet out after send is called in the code above. This article is based on Linux 3.10, with Intel's igb network card driver as the example.

Fair warning: this article runs to more than 10,000 words and 25 pictures.

1. Overview of the Linux network send process

I think the most important thing when reading Linux source code is to get an overall grasp first, rather than getting bogged down in details from the start.

I first prepared a general flow chart for you to briefly explain how the data sent by send is sent to the network card step by step.

In this figure we see that the user data is copied into the kernel, processed by the protocol stack, and then placed into the RingBuffer. The NIC driver then actually sends the data out. When transmission completes, the CPU is notified by a hard interrupt, and the RingBuffer is then cleaned up.

Since we will walk into the source code later in the article, here is the same flow again from the source code's point of view.

Although the data has been sent by this point, one important thing remains undone: releasing memory such as the cache queues.

How does the kernel know when it can release that memory? Only after the data has actually gone out, of course. When the network card finishes sending, it raises a hard interrupt to notify the CPU. For a more complete picture, see the figure:

Note that although today's topic is sending data, the soft interrupt triggered by that hard interrupt is NET_RX_SOFTIRQ, not NET_TX_SOFTIRQ! (TX is short for transmit, RX for receive.)

Surprised?

So this is part of the answer to opening question 2 (note: only part of the reason).

Q2: Why is NET_RX much larger than NET_TX in /proc/softirqs on your server?

Transmit completion ultimately triggers NET_RX, not NET_TX. So naturally you see far more NET_RX when you observe /proc/softirqs.

Well, now you have a global grasp of how the kernel sends network packets. Don't be complacent, the details we need to know are more valuable, let's go on!

2. NIC startup preparation

NICs on today's servers generally support multiple queues. Each queue is represented by a RingBuffer, and with multiple queues enabled the NIC has a corresponding RingBuffer for each of them.

One of the most important jobs the NIC does at startup is allocating and initializing the RingBuffers; understanding the RingBuffer will be very helpful for mastering sending later. Since today's topic is sending, let's take the transmit queue as an example and look at the actual process of allocating the RingBuffer when the NIC starts.

When the NIC starts, the __igb_open function is called; this is where the RingBuffer is allocated.

// file: drivers/net/ethernet/intel/igb/igb_main.c
static int __igb_open(struct net_device *netdev, bool resuming)
{
    struct igb_adapter *adapter = netdev_priv(netdev);

    // allocate the transmit descriptor arrays
    err = igb_setup_all_tx_resources(adapter);
    // allocate the receive descriptor arrays
    err = igb_setup_all_rx_resources(adapter);

    // open all queues
    netif_tx_start_all_queues(netdev);
}

Above, __igb_open calls igb_setup_all_tx_resources to allocate all transmit RingBuffers, and igb_setup_all_rx_resources to create all receive RingBuffers.

// file: drivers/net/ethernet/intel/igb/igb_main.c
static int igb_setup_all_tx_resources(struct igb_adapter *adapter)
{
    // construct one RingBuffer per queue
    for (i = 0; i < adapter->num_tx_queues; i++) {
        igb_setup_tx_resources(adapter->tx_ring[i]);
    }
}

The real RingBuffer construction is done in igb_setup_tx_resources.

// file: drivers/net/ethernet/intel/igb/igb_main.c
int igb_setup_tx_resources(struct igb_ring *tx_ring)
{
    // 1. allocate the igb_tx_buffer array
    size = sizeof(struct igb_tx_buffer) * tx_ring->count;
    tx_ring->tx_buffer_info = vzalloc(size);

    // 2. allocate the e1000_adv_tx_desc DMA array
    tx_ring->size = tx_ring->count * sizeof(union e1000_adv_tx_desc);
    tx_ring->size = ALIGN(tx_ring->size, 4096);
    tx_ring->desc = dma_alloc_coherent(dev, tx_ring->size,
                                       &tx_ring->dma, GFP_KERNEL);

    // 3. initialize queue members
    tx_ring->next_to_use = 0;
    tx_ring->next_to_clean = 0;
}

As you can see from the source above, a RingBuffer does not contain just one circular-queue array inside, but two.

1) The igb_tx_buffer array: used by the kernel, allocated via vzalloc.

2) The e1000_adv_tx_desc array: used by the NIC hardware, which can access this memory directly via DMA; allocated via dma_alloc_coherent.

There is no relationship between the two arrays at this point. Later, when data is sent, the pointers at the same position in both circular arrays will point to the same skb. This way the kernel and the hardware can access the same data together: the kernel writes data into the skb, and the NIC hardware is responsible for sending it.

Finally, netif_tx_start_all_queues is called to open the queues. In addition, the hard-interrupt handler igb_msix_ring is also registered in __igb_open.

3. accept creates a new socket

Before sending data, we often need a socket that has already established a connection.

Let's take the accept in the mini server source from the opening as an example. After accept, the process creates a new socket and puts it into the current process's list of open files; it is dedicated to communicating with the corresponding client.

Assuming that the server process establishes two connections with the client through accept, let's take a brief look at the relationship between the two connections and the process.

A more specific structure diagram of the socket kernel object that represents a connection is as follows.

To keep things on track, the detailed source flow of accept will not be covered here; today we focus on the data send process.

4. Actually starting to send data

4.1 The send system call

The implementation of the send system call is in net/socket.c. Inside this system call, the sendto system call is what is actually used. The whole call chain is not short, but it only does two simple things.

The first is to find the real socket in the kernel, where the function addresses of various protocol stacks are recorded.

The second is to construct a struct msghdr object to load all the data passed in by the user, such as buffer address, data length and so on.

The rest is left to the next layer down, inet_sendmsg in the protocol stack; the address of the inet_sendmsg function is found through the ops member of the socket kernel object. The general process is shown in the figure.

With the above in mind, the source code is much easier to read. It is as follows:

// file: net/socket.c
SYSCALL_DEFINE4(send, int, fd, void __user *, buff, size_t, len,
                unsigned int, flags)
{
    return sys_sendto(fd, buff, len, flags, NULL, 0);
}

SYSCALL_DEFINE6(sendto, ...)
{
    // 1. find the socket by fd
    sock = sockfd_lookup_light(fd, &err, &fput_needed);

    // 2. construct the msghdr
    struct msghdr msg;
    struct iovec iov;
    iov.iov_base = buff;
    iov.iov_len = len;
    msg.msg_iovlen = 1;
    msg.msg_iov = &iov;
    msg.msg_flags = flags;

    // 3. send the data
    sock_sendmsg(sock, &msg, len);
}

From the source you can see that the send function we use in user mode is really implemented by the sendto system call; send is just a thin wrapper that is more convenient to call.

In the sendto system call, the real socket kernel object is first found from the socket handle passed in by the user. Then the buff, len, flags and other parameters of the user's request are all packaged into a struct msghdr object.

Then sock_sendmsg => __sock_sendmsg => __sock_sendmsg_nosec is called. In __sock_sendmsg_nosec, the call passes from the system-call layer into the protocol stack. Let's look at its source.

// file: net/socket.c
static inline int __sock_sendmsg_nosec(...)
{
    return sock->ops->sendmsg(iocb, sock, msg, size);
}

Through the socket kernel object structure diagram in section 3, we can see that the sock->ops->sendmsg called here actually executes inet_sendmsg. This function is the general send function provided by the AF_INET protocol family.

4.2 Transport layer processing

1) Transport layer copy

After entering the protocol stack via inet_sendmsg, the kernel then finds the specific protocol's send function on the socket. For the TCP protocol, that is tcp_sendmsg (again found through the socket kernel object).

In this function, the kernel allocates a kernel-mode skb and copies in the data the user wants to send. Note that the data may not actually be sent at this point: if the send conditions are not met, this call is likely to return right here. The approximate process is shown in the figure:

Let's look at the source code of the inet_sendmsg function.

// file: net/ipv4/af_inet.c
int inet_sendmsg(...)
{
    return sk->sk_prot->sendmsg(iocb, sk, msg, size);
}

The specific protocol's send function is called here. Again referring to the socket kernel object diagram in section 3, we can see that for a socket under the TCP protocol, sk->sk_prot->sendmsg points to tcp_sendmsg (udp_sendmsg for UDP).

The function tcp_sendmsg is rather long, so let's look at it in several passes. First this fragment:

// file: net/ipv4/tcp.c
int tcp_sendmsg(...)
{
    while (...) {
        while (...) {
            // get the send queue
            skb = tcp_write_queue_tail(sk);

            // allocate an skb and copy
            ...
        }
    }
}

// file: include/net/tcp.h
static inline struct sk_buff *tcp_write_queue_tail(const struct sock *sk)
{
    return skb_peek_tail(&sk->sk_write_queue);
}

Understanding what tcp_write_queue_tail does on the socket is a prerequisite for understanding sending. As shown above, this function fetches the last skb in the socket's send queue. skb is short for struct sk_buff; the socket's send queue is a linked list of these objects.

Let's move on to the rest of tcp_sendmsg.

// file: net/ipv4/tcp.c
int tcp_sendmsg(struct kiocb *iocb, struct sock *sk,
                struct msghdr *msg, size_t size)
{
    // get the data and flags passed in by the user
    iov = msg->msg_iov;        // address of the user data
    iovlen = msg->msg_iovlen;  // number of data blocks, 1 here
    flags = msg->msg_flags;    // various flags

    // iterate over the user-space data blocks
    while (--iovlen >= 0) {
        // address of the block to be sent
        unsigned char __user *from = iov->iov_base;

        while (seglen > 0) {
            // a new skb needs to be allocated
            if (copy <= 0) {
                skb = sk_stream_alloc_skb(sk, select_size(sk, sg),
                                          sk->sk_allocation);
                // attach the skb to the socket's send queue
                skb_entail(sk, skb);
            }

            // there is still room in the skb
            if (skb_availroom(skb) > 0) {
                // copy user-space data into kernel space and
                // compute the checksum along the way;
                // from is the user-space data address
                skb_add_data_nocache(sk, skb, from, copy);
            }
            ...

This function is long, but the logic is not complicated. msg->msg_iov holds the buffer of the to-be-sent data in user-mode memory. Next, the kernel allocates kernel memory (the skb) and copies the data from user memory into it. This involves the overhead of one or more memory copies.

As for when the kernel will actually send the skb out: tcp_sendmsg makes some judgments about that.

// file: net/ipv4/tcp.c
int tcp_sendmsg(...)
{
    while (...) {
        while (...) {
            // allocate kernel memory and copy
            ...

            // decide whether to send
            if (forced_push(tp)) {
                tcp_mark_push(tp, skb);
                __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
            } else if (skb == tcp_send_head(sk))
                tcp_push_one(sk, mss_now);

            continue;
        }
    }
}

The kernel actually starts sending packets only when forced_push(tp) or skb == tcp_send_head(sk) holds. Among them, forced_push(tp) checks whether the unsent data has already exceeded half of the maximum window.

If the conditions are not met, the data to be sent by the user this time will only be copied to the kernel.

2) Sending at the transport layer

Assuming the kernel send conditions are now met, let's follow the actual send process. Whichever of __tcp_push_pending_frames or tcp_push_one from the previous section is called, execution eventually reaches tcp_write_xmit.

So let's look directly at tcp_write_xmit. This function handles the transport layer's congestion control and sliding-window work. When the window allows, it sets the TCP header and hands the skb down to the network layer for processing.

Let's take a look at the tcp_write_xmit source code.

// file: net/ipv4/tcp_output.c
static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now,
                           int nonagle, int push_one, gfp_t gfp)
{
    // loop: get the next skb to be sent
    while ((skb = tcp_send_head(sk))) {
        // sliding-window related work
        cwnd_quota = tcp_cwnd_test(tp, skb);
        tcp_snd_wnd_test(tp, skb, mss_now);
        tcp_mss_split_point(...);
        tso_fragment(sk, skb, ...);
        ...

        // actually start sending
        tcp_transmit_skb(sk, skb, 1, gfp);
    }
}

You can see that the sliding window and congestion control we learned about in networking courses are completed in this function. We won't expand on that part; interested readers can dig into this source themselves. Today we only follow the main send path, which leads us to tcp_transmit_skb.

// file: net/ipv4/tcp_output.c
static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
                            int clone_it, gfp_t gfp_mask)
{
    // 1. clone a new skb
    if (likely(clone_it)) {
        skb = skb_clone(skb, gfp_mask);
    }

    // 2. fill in the TCP header
    th = tcp_hdr(skb);
    th->source = inet->inet_sport;
    th->dest   = inet->inet_dport;
    th->window = ...;
    th->urg    = ...;
    ...

    // 3. call the network layer's send interface
    err = icsk->icsk_af_ops->queue_xmit(skb, &inet->cork.fl);
}

The first thing it does is clone a new skb. Why copy the skb here?

Because the skb will be released later, after it passes down through the network layer and the NIC finishes sending it. And TCP supports retransmission on loss: an skb cannot be deleted until the peer's ACK is received. So the kernel actually hands a copy of the skb to the NIC each time it sends, and only deletes the original once the ACK arrives.

The second thing it does is fill in the skb's TCP header according to the actual situation. Here's a trick: the skb actually contains room for all the headers in the network protocol stack. When setting the TCP header, you just point a pointer at the right position inside the skb; when the IP header is set later, the pointer is simply moved again. This avoids frequent memory allocations and copies and is very efficient.

tcp_transmit_skb is the last step of sending at the transport layer; after it, the next layer down, the network layer, takes over. The send interface provided by the network layer, icsk->icsk_af_ops->queue_xmit(), is called.

From the source below we can see that queue_xmit actually points to the ip_queue_xmit function.

// file: net/ipv4/tcp_ipv4.c
const struct inet_connection_sock_af_ops ipv4_specific = {
    .queue_xmit = ip_queue_xmit,
    .send_check = tcp_v4_send_check,
    ...
};

With this, the transport layer's work is done. The data leaves the transport layer and enters the kernel's network-layer implementation.

4.3 Network layer send processing

The Linux kernel's network-layer send code is in net/ipv4/ip_output.c. The ip_queue_xmit called by the transport layer lives here too. (You can tell from the file name that we have entered the IP layer: the source file name has changed from tcp_xxx to ip_xxx.)

At the network layer, the main tasks are routing entry lookup, IP header setup, netfilter filtering, and skb fragmentation (if larger than the MTU), after which the work is handed to the neighbor subsystem in the layer below.

Let's take a look at the source code of the network layer entry function ip_queue_xmit:

// file: net/ipv4/ip_output.c
int ip_queue_xmit(struct sk_buff *skb, struct flowi *fl)
{
    // check whether a routing entry is cached in the socket
    rt = (struct rtable *)__sk_dst_check(sk, 0);
    if (rt == NULL) {
        // expanded search:
        // look up the routing entry and cache it in the socket
        rt = ip_route_output_ports(...);
        sk_setup_caps(sk, &rt->dst);
    }

    // set the route for the skb
    skb_dst_set_noref(skb, &rt->dst);

    // set up the IP header
    iph = ip_hdr(skb);
    iph->protocol = sk->sk_protocol;
    iph->ttl      = ip_select_ttl(inet, &rt->dst);
    iph->frag_off = ...;

    // send
    ip_local_out(skb);
}

ip_queue_xmit is already the network layer. In this function we can see the network layer's routing entry lookup; if a route is found, it is set on the skb (if there is no route, an error is returned directly).

You can view your machine's routing configuration with the route command on Linux.

From the routing table you can find out through which Iface (interface) and which Gateway a given destination network should be reached. Once found, the result is cached on the socket, so the next send doesn't have to look it up again.

Then the routing entry's address is placed in the skb.

// file: include/linux/skbuff.h
struct sk_buff {
    // stores some routing-related information
    unsigned long _skb_refdst;
};

Next, the position of the IP header inside the skb is located, and the IP header is set according to the protocol specification.

Then proceed to the next step of processing through ip_local_out.

// file: net/ipv4/ip_output.c
int ip_local_out(struct sk_buff *skb)
{
    // run netfilter filtering
    err = __ip_local_out(skb);

    // start sending the data
    if (likely(err == 1))
        err = dst_output(skb);
    ...

Netfilter filtering is performed in ip_local_out => __ip_local_out => nf_hook. If you have configured rules with iptables, this is where they are checked against the packet. If you set up very complex netfilter rules, this function will greatly increase your process's CPU overhead.

We won't expand on that either; let's continue along the send path with dst_output.

// file: include/net/dst.h
static inline int dst_output(struct sk_buff *skb)
{
    return skb_dst(skb)->output(skb);
}

This function finds the skb's routing entry (dst entry) and calls its output method. That is another function pointer, and it points to the ip_output method.

// file: net/ipv4/ip_output.c
int ip_output(struct sk_buff *skb)
{
    // statistics
    ...

    // hand it to netfilter again, with ip_finish_output as the callback
    return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, skb, NULL, dev,
                        ip_finish_output,
                        !(IPCB(skb)->flags & IPSKB_REROUTED));
}

ip_output does some simple statistics work and performs netfilter filtering once more. When the filter passes, ip_finish_output is called back.

// file: net/ipv4/ip_output.c
static int ip_finish_output(struct sk_buff *skb)
{
    // if larger than the MTU, fragmentation is performed
    if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
        return ip_fragment(skb, ip_finish_output2);
    else
        return ip_finish_output2(skb);
}

In ip_finish_output we see that the packet is fragmented if the data is larger than the MTU.

The actual MTU size is determined via MTU discovery; on Ethernet it is 1500 bytes. In the early days, the QQ team optimized network performance by keeping their packet sizes below the MTU, because fragmentation brings two problems: 1. extra fragmentation processing and therefore extra performance overhead; 2. if any one fragment is lost, the whole packet must be retransmitted. So avoiding fragmentation both eliminates the fragmentation overhead and greatly reduces the retransmission rate.

In ip_finish_output2, the sending process finally moves on to the next layer, the neighbor subsystem.

// file: net/ipv4/ip_output.c
static inline int ip_finish_output2(struct sk_buff *skb)
{
    // look up the neighbor entry for the next-hop IP;
    // create one if it doesn't exist
    nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
    neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
    if (unlikely(!neigh))
        neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

    // keep passing it down
    int res = dst_neigh_output(dst, neigh, skb);
}

4.4 The neighbor subsystem

The neighbor subsystem sits between the network layer and the data link layer. Its job is to provide an encapsulation for the network layer, so the network layer does not have to care about lower-layer address information and the lower layer can decide which MAC address to send to.

Also, the neighbor subsystem is not under the protocol stack's net/ipv4/ directory but in net/core/neighbour.c, because both IPv4 and IPv6 need this module.

In the neighbor subsystem, the main task is to look up or create neighbor entries; creating a neighbor entry may send out an actual ARP request. Then the MAC header is filled in, and the send process is passed down to the network device subsystem. The general flow is shown in the figure.

With the general flow understood, let's go back to the source. In the ip_finish_output2 source in the previous section, __ipv4_neigh_lookup_noref is called. It looks up the ARP cache; its second parameter is the route's next-hop IP.

// file: include/net/arp.h
extern struct neigh_table arp_tbl;

static inline struct neighbour *__ipv4_neigh_lookup_noref(
        struct net_device *dev, u32 key)
{
    struct neigh_hash_table *nht = rcu_dereference_bh(arp_tbl.nht);

    // compute the hash value to speed up the lookup
    hash_val = arp_hashfn(...);
    for (n = rcu_dereference_bh(nht->hash_buckets[hash_val]);
         n != NULL;
         n = rcu_dereference_bh(n->next)) {
        if (n->dev == dev && *(u32 *)n->primary_key == key)
            return n;
    }
}

If nothing is found, __neigh_create is called to create a neighbor entry.

// file: net/core/neighbour.c
struct neighbour *__neigh_create(...)
{
    // allocate the neighbor table entry
    struct neighbour *n1, *rc, *n = neigh_alloc(tbl, dev);

    // initialize it
    memcpy(n->primary_key, pkey, key_len);
    n->dev = dev;
    n->parms->neigh_setup(n);

    // finally, add it to the neighbor hashtable
    rcu_assign_pointer(nht->hash_buckets[hash_val], n);
    ...

With a neighbor entry in hand, we still cannot send the IP packet, because the destination MAC address is not known yet. dst_neigh_output is called to keep passing the skb down.

// file: include/net/dst.h
static inline int dst_neigh_output(struct dst_entry *dst,
                                   struct neighbour *n, struct sk_buff *skb)
{
    return n->output(n, skb);
}

This calls output, which actually points to neigh_resolve_output. Inside this function, an ARP request may be sent out on the network.

// file: net/core/neighbour.c
int neigh_resolve_output(...)
{
    // note: this may trigger an ARP request
    if (!neigh_event_send(neigh, skb)) {
        // neigh->ha is the MAC address
        dev_hard_header(skb, dev, ntohs(skb->protocol),
                        neigh->ha, NULL, skb->len);
        // send it on
        dev_queue_xmit(skb);
    }
}

Once the hardware MAC address is obtained, the skb's MAC header can be filled in. Finally, dev_queue_xmit is called to pass the skb to the Linux network device subsystem.

4.5 The network device subsystem

The neighbor subsystem enters the network device subsystem through dev_queue_xmit.

// file: net/core/dev.c
int dev_queue_xmit(struct sk_buff *skb)
{
    // choose a send queue
    txq = netdev_pick_tx(dev, skb);

    // get the qdisc (queuing discipline) attached to this queue
    q = rcu_dereference_bh(txq->qdisc);

    // if there is a queue, call __dev_xmit_skb to continue processing
    if (q->enqueue) {
        rc = __dev_xmit_skb(skb, q, dev, txq);
        goto out;
    }

    // devices without a queue: loopback and tunnel devices
    ...
}

In section 2 at the start we said that the NIC has multiple send queues (especially today's NICs). The call to netdev_pick_tx here is what selects a queue to send on.

The queue choice made by netdev_pick_tx is affected by configuration such as XPS, and there is caching involved too; it is its own small maze of logic. Here we only care about two rules: first the user's XPS configuration is used if present, otherwise a queue is computed automatically. See netdev_pick_tx => __netdev_pick_tx.

// file: net/core/flow_dissector.c
u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
{
    // get the XPS configuration
    int new_index = get_xps_queue(dev, skb);

    // otherwise compute the queue automatically
    if (new_index < 0)
        new_index = skb_tx_hash(dev, skb);
}

Then the qdisc associated with this queue is obtained. On Linux you can see the qdisc type with the tc command; for example, on one of my multi-queue NIC machines it is mq:

# tc qdisc
qdisc mq 0: dev eth0 root

Most devices have a queue (loopback and tunnel devices are the exceptions), so we now enter __dev_xmit_skb.

// file: net/core/dev.c
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                 struct net_device *dev,
                                 struct netdev_queue *txq)
{
    // 1. the queuing system can be bypassed
    if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
        qdisc_run_begin(q)) {
        ...
    }
    // 2. normal queuing
    else {
        // enqueue
        q->enqueue(skb, q);
        // start sending
        __qdisc_run(q);
    }
}

There are two cases in the code above: one can bypass the queuing system, the other is normal queuing. We only look at the second case.

First q->enqueue is called to add the skb to the queue, then __qdisc_run is called to start sending.

4.6 Soft interrupt scheduling

// file: net/sched/sch_generic.c
void __qdisc_run(struct Qdisc *q)
{
    int quota = weight_p;

    // loop: take one skb from the queue and send it
    while (qdisc_restart(q)) {
        // postpone processing if either of the following happens:
        // 1. the quota is used up
        // 2. another process needs the CPU
        if (--quota <= 0 || need_resched()) {
            // defer the rest: queue this qdisc on softnet_data's
            // output_queue and raise the NET_TX softirq
            __netif_schedule(q);
            break;
        }
    }
}

When the quota runs out or the process needs to yield the CPU, the remaining work is deferred to the NET_TX soft interrupt, whose handler net_tx_action drains the queue:

// file: net/core/dev.c
static void net_tx_action(struct softirq_action *h)
{
    // the softirq gets the per-CPU softnet_data
    struct softnet_data *sd = &__get_cpu_var(softnet_data);

    if (sd->output_queue) {
        // point head at the first qdisc
        head = sd->output_queue;

        // walk the qdisc list
        while (head) {
            struct Qdisc *q = head;
            head = head->next_sched;

            // send the data
            qdisc_run(q);
        }
    }
}

The soft interrupt gets the softnet_data here. As we saw, the process's kernel context writes the send queue into softnet_data's output_queue when __netif_reschedule is called. The soft interrupt then loops over sd->output_queue and sends the data frames.

Let's look at qdisc_run: just like the process-context path, it too calls __qdisc_run.

// file: include/net/pkt_sched.h
static inline void qdisc_run(struct Qdisc *q)
{
    if (qdisc_run_begin(q))
        __qdisc_run(q);
}

Then it again enters qdisc_restart => sch_direct_xmit, down to the driver entry function dev_hard_start_xmit.

4.7 igb NIC driver send

As we saw earlier, whether in the kernel context of the user process or in soft-interrupt context, dev_hard_start_xmit in the network device subsystem is eventually called, and in that function the igb driver's send function igb_xmit_frame is invoked.

In the driver function, the skb will be linked to the RingBuffer, and after the driver call is completed, the packet will actually be sent from the network card.

Let's look at the actual source code:

// file: net/core/dev.c
int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
                        struct netdev_queue *txq)
{
    // get the device's set of callback functions, ops
    const struct net_device_ops *ops = dev->netdev_ops;

    // get the list of features the device supports
    features = netif_skb_features(skb);

    // call ndo_start_xmit in the driver's ops
    // to hand the packet to the NIC device
    skb_len = skb->len;
    rc = ops->ndo_start_xmit(skb, dev);
    ...
}

Here ndo_start_xmit is the function the NIC driver must implement; it is declared in net_device_ops.

// file: include/linux/netdevice.h
struct net_device_ops {
    netdev_tx_t (*ndo_start_xmit)(struct sk_buff *skb,
                                  struct net_device *dev);
};

In the igb NIC driver source, we find it:

// file: drivers/net/ethernet/intel/igb/igb_main.c
static const struct net_device_ops igb_netdev_ops = {
    .ndo_open       = igb_open,
    .ndo_stop       = igb_close,
    .ndo_start_xmit = igb_xmit_frame,
};

That is, for igb the implementation of the ndo_start_xmit declared at the network device layer is igb_xmit_frame. This function pointer is assigned when the NIC driver initializes. For the specific initialization process, see section 2.4, NIC driver initialization, of the article "Illustrating the Linux network packet receiving process".

So when ops->ndo_start_xmit is called at the network device layer above, execution actually enters the igb_xmit_frame function. Let's step into it to see how the driver works.

// file: drivers/net/ethernet/intel/igb/igb_main.c
static netdev_tx_t igb_xmit_frame(struct sk_buff *skb,
                                  struct net_device *netdev)
{
    ...
    return igb_xmit_frame_ring(skb, igb_tx_queue_mapping(adapter, skb));
}

netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
                                struct igb_ring *tx_ring)
{
    // get the next available buffer info in the TX queue
    first = &tx_ring->tx_buffer_info[tx_ring->next_to_use];
    first->skb = skb;
    first->bytecount = skb->len;
    first->gso_segs = 1;

    // igb_tx_map prepares the data to be sent to the device
    igb_tx_map(tx_ring, first, hdr_len);
}

Here an element is taken from the NIC send queue's RingBuffer and the skb is attached to it.

The igb_tx_map function handles mapping skb data to memory DMA areas that can be accessed by the network card.

// file: drivers/net/ethernet/intel/igb/igb_main.c
static void igb_tx_map(struct igb_ring *tx_ring,
                       struct igb_tx_buffer *first, const u8 hdr_len)
{
    // get the next available descriptor pointer
    tx_desc = IGB_TX_DESC(tx_ring, i);

    // build a memory mapping for skb->data so the device
    // can read the data from RAM via DMA
    dma = dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE);

    // walk all fragments of the packet and create
    // a valid mapping for each fragment of the skb
    for (frag = &skb_shinfo(skb)->frags[0]; ...; frag++) {
        tx_desc->read.buffer_addr = cpu_to_le64(dma);
        tx_desc->read.cmd_type_len = ...;
        tx_desc->read.olinfo_status = 0;
    }

    // set the last descriptor
    cmd_type |= size | IGB_TXD_DCMD;
    tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);

    /* Force memory writes to complete before letting h/w know there
     * are new descriptors to fetch */
    wmb();
}

When all the required descriptors have been built and all of the skb's data has been mapped to DMA addresses, the driver performs its last step: triggering the actual send.

4.8 Send-complete hard interrupt

When the data has been sent out, the work is still not finished, because the memory has not been cleaned up yet. When transmission completes, the network card device triggers a hard interrupt so that the memory can be freed.

In sections 3.1 and 3.2 of the article "Illustrating the receiving process of Linux network packets", we described the handling of hard interrupts and soft interrupts in detail.

The RingBuffer memory cleanup performed after the send-complete hard interrupt is shown in the figure.

Look back at the source code of the hard interrupt that triggers the soft interrupt.

```c
//file: net/core/dev.c
static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}
```

There is an interesting detail here: whether the hard interrupt fired because data arrived or because a send completed, the soft interrupt raised from it is NET_RX_SOFTIRQ. We mentioned this in the first section; it is one of the reasons RX is higher than TX in the soft interrupt statistics.

Okay, let's move on to igb_poll, the callback function of the soft interrupt. In this function we notice a call to igb_clean_tx_irq; see the source code:

```c
//file: drivers/net/ethernet/intel/igb/igb_main.c
static int igb_poll(struct napi_struct *napi, int budget)
{
    ......
    // perform the transmit-completion work
    if (q_vector->tx.ring)
        clean_complete = igb_clean_tx_irq(q_vector);
    ......
}
```

Let's see what igb_clean_tx_irq does when the transmission is complete.

```c
//file: drivers/net/ethernet/intel/igb/igb_main.c
static bool igb_clean_tx_irq(struct igb_q_vector *q_vector)
{
    ......
    // free the skb
    dev_kfree_skb_any(tx_buffer->skb);

    // clear the tx_buffer data
    tx_buffer->skb = NULL;
    dma_unmap_len_set(tx_buffer, len, 0);

    // clear last DMA location and unmap remaining buffers
    while (tx_desc != eop_desc) {
        ......
    }
}
```

This is nothing more than cleaning up the skb, unmapping the DMA, and so on. At this point, the transmission is basically complete.

Why do I say basically complete rather than fully complete? Because the transport layer has to guarantee reliability, the skb has not actually been deleted yet. It will only be deleted once the peer's ACK arrives; only then is the send truly finished.

Finally, a picture is used to summarize the whole sending process.

After understanding the whole sending process, let's go back to a few of the issues mentioned at the beginning.

1. When monitoring the CPU consumed by the kernel sending data, should we look at sy or si?

In the process of sending network packets, the user process (in kernel mode) completes most of the work; it even does the work of calling into the driver. A soft interrupt is raised only when the process's kernel-mode time slice is used up. During sending, the vast majority (90%) of the overhead is consumed in the kernel mode of the user process.

Only in a few cases is a soft interrupt (of type NET_TX) triggered, in which case sending is done by the ksoftirqd kernel thread.

Therefore, when monitoring the CPU overhead that network IO imposes on a server, you should not look only at si; sy needs to be taken into account as well.

2. Looking at /proc/softirqs on the server, why is NET_RX much bigger than NET_TX?

Intuitively, NET_RX is receive and NET_TX is transmit. For a server that both receives user requests and returns responses to users, the two counters should be roughly equal, or at least not differ by an order of magnitude. But in fact, one of my servers looks like this:

After today's source code analysis, we find there are two reasons for this phenomenon.

The first reason is that when data transmission completes, the device notifies the driver with a hard interrupt. But whether the hard interrupt signals data reception or send completion, the soft interrupt it triggers is NET_RX_SOFTIRQ, not NET_TX_SOFTIRQ.

The second reason is that for reads, all processing goes through the NET_RX soft interrupt, in the ksoftirqd kernel thread. For sends, most of the work is handled in the kernel mode of the user process; only when its kernel-mode quota is exhausted is a NET_TX soft interrupt raised.

With these two reasons combined, it is not hard to understand why NET_RX is much larger than NET_TX on such machines.

3. What memory copy operations are involved when sending network data?

By memory copy here, we mean only copies of the data being sent.

The first copy happens after the kernel allocates the skb: the data in the buffer passed by the user is copied into the skb. If the amount of data to be sent is large, this copy operation is fairly expensive.

The second copy happens when the packet passes from the transport layer into the network layer: each skb is cloned, producing a new copy. The network layer and the components below it (driver, soft interrupt, and so on) delete this copy when transmission completes. The transport layer keeps the original skb so that, if the peer has not sent an ACK, it can retransmit, achieving the reliable transmission TCP requires.

The third copy is not always required; it happens only when the IP layer finds that the skb is larger than the MTU. In that case additional skbs are allocated and the original skb is copied into multiple smaller skbs.

As an aside, the "zero copy" we often hear about in network performance optimization is, I think, somewhat exaggerated. To guarantee TCP's reliability, the second copy cannot be avoided at all. And if the packet is larger than the MTU, the copy during fragmentation is also unavoidable.

Having read this far, I believe the kernel's packet sending is no longer an impenetrable black box to you. Even if you only absorbed one tenth of this article, you have already learned how to open that black box, and you will know where to start the next time you optimize network performance.

Github: https://github.com/yanfeizhang/coder-kung-fu
