This article explains in detail how a customer requirement triggered an exploration of K8s networking. The content is practical and detailed, and the editor shares it here for reference; I hope you will have a better understanding of the relevant topics after reading it.
Part 1: a rather "individual" requirement
The customer uses a managed K8s cluster product to deploy a test cluster in the cloud environment. For business reasons, their R&D colleagues need to access the ClusterIP-type services of the K8s cluster, and the backend pods, directly from the office network. Normally, a K8s pod can only be reached from within the cluster, by other pods or by cluster nodes, not directly from outside the cluster. When a pod needs to provide services inside or outside the cluster, it exposes an access address and port through a service. A service not only acts as the access entry for the pod application, but also probes the corresponding pod ports to perform health checks. When there are multiple pods at the backend, the service also forwards client requests to different pods according to a scheduling algorithm, achieving load balancing. The common service types are as follows:
Introduction to service types
1. ClusterIP type. If no type is specified when creating a service, a ClusterIP-type service is created by default. A service of this type can only be accessed via its clusterIP by pods and nodes within the cluster, and cannot be reached from outside the cluster. It is typically used for services that only need to be consumed inside the cluster, such as the cluster's own kubernetes system service.
2. NodePort type. To meet the need to access a service from outside the cluster, the NodePort type maps the service port to a port on every node in the cluster. When the service is accessed from outside the cluster, requests sent to a node IP and the designated port are forwarded to the backend pods.
3. LoadBalancer type. This type usually calls the cloud vendor's API to create a load balancer instance on the cloud platform and configures listeners according to the settings. Inside K8s, a LoadBalancer-type service maps the service port to a fixed port on each node, just like the NodePort type; the nodes are then set as backends of the load balancer, and the listener forwards client requests to the mapped service port on the backend nodes, from which the request is forwarded to the backend pod. A LoadBalancer-type service makes up for a shortcoming of the NodePort type: with multiple nodes, clients would otherwise need to access the IP addresses of several nodes, whereas here they only need the single IP of the LB. Moreover, with an LB-type service the K8s nodes themselves do not need public IPs; only the LB does, which improves node security and saves public IP resources. Service availability is also improved by the LB's health checks on the backend nodes, so the failure of a single K8s node does not make the service unreachable.
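The three types differ mainly in the spec.type field of the Service manifest. Below is a minimal sketch; the name my-app-svc, the label my-app and the ports are hypothetical and only illustrate the structure, they are not taken from the customer's cluster:

# Apply a ClusterIP service (the default when type is omitted);
# change type to NodePort or LoadBalancer to expose it outside the cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc          # hypothetical name
spec:
  type: ClusterIP           # or NodePort / LoadBalancer
  selector:
    app: my-app             # hypothetical pod label
  ports:
  - port: 80                # service (clusterIP) port
    targetPort: 80          # container port on the backend pods
EOF
# Check the assigned clusterIP (and node port / LB address for the other types):
kubectl get svc my-app-svc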
Summary of Part 1
From this overview of K8s service types, we can see that when a customer wants to access services from outside the cluster, an LB-type service is the first recommendation. Nodes in the managed K8s cluster product currently do not support binding public IPs, so NodePort-type services cannot be reached over the public network; a customer can only access NodePort services if the office network is connected to the cloud network via Direct Connect or IPsec. Pods themselves can only be accessed from within the cluster, by other pods or by cluster nodes. Keeping clusterIPs and pods unreachable from outside the cluster is also a deliberate design choice for security; breaking these access restrictions may introduce security problems. We therefore recommended that the customer expose services with LB-type services, or alternatively connect from the office network to the NAT host of the K8s cluster, jump from the NAT host to a K8s node, and from there access ClusterIP-type services or the backend pods.
The customer replied that there are hundreds of ClusterIP-type services in the test cluster. Converting all of them to LB-type services would require creating hundreds of LB instances and binding hundreds of public IPs, which is clearly unrealistic, and converting them all to NodePort-type services would also be an enormous amount of work. Meanwhile, logging in to cluster nodes by jumping through the NAT host would require giving R&D colleagues the system passwords of the NAT host and the cluster nodes, which is bad for operations and security management, and far less convenient than letting R&D access services and pods directly over the network.
Part 2: there are always more solutions than difficulties
Although the customer's access pattern violates the design logic of a K8s cluster and looks somewhat "non-mainstream", it is a hard requirement for the customer's scenario. As the technical engineers supporting the product, we do our best to help customers solve technical problems, so we planned an implementation scheme based on the customer's needs and architecture.
Since this is about network connectivity, we should first analyze the network architecture on both sides: the customer's office network and the K8s cluster in the cloud. The customer's office network has a unified public network egress device. The network structure of the K8s cluster in the cloud is as follows. The master nodes of the K8s cluster are not visible to the user. After the user creates a K8s cluster, three subnets are created under the VPC the user selected: the node subnet for K8s node communication, the NAT/LB subnet for the NAT host and the load balancer instances created by LB-type services, and the pod subnet for pod communication. The nodes of the K8s cluster are built on CVM instances, and the next hop of the node subnet's route to public addresses points to the NAT host, which means cluster nodes cannot be bound to public IPs. The NAT host serves as the unified public network egress and performs SNAT to provide outbound public access. Because the NAT host only does SNAT and not DNAT, it is not possible to reach the nodes from outside the cluster through the NAT host.
As for the purpose of the pod subnet, we first need to introduce the network architecture of pods on a node, as shown in the following figure:
On a node, the container in a pod is connected to the docker0 device through a veth pair, and docker0 is connected to the node's network card through a self-developed CNI network plugin. To separate cluster control traffic from data traffic and improve network performance, the cluster binds an additional elastic network interface (ENI) to each node exclusively for pod communication. When a pod is created, it is assigned an IP address on the ENI; each ENI can hold up to 21 IPs, and when the IPs on one ENI are exhausted, a new ENI is bound for subsequent pods. The subnet the ENIs belong to is the pod subnet. This architecture reduces the load on the node's eth0 primary network card and separates control traffic from data traffic. At the same time, each pod IP corresponds to a real network interface and IP in the VPC, which makes pod addresses routable within the VPC network.
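This mapping can be seen on a node by listing the interfaces and the per-pod host routes. A quick sketch, assuming the interface name eth2 for the pod ENI and the pod IP 10.0.0.13 used later in this article; on a different cluster the names and addresses will differ:

# List the node's interfaces; the pod ENI (eth2 here) carries the secondary IPs assigned to pods.
ip -brief addr show
# Show the host route the CNI plugin created for a specific pod IP;
# it should point at a cni* virtual interface rather than at eth0.
ip route get 10.0.0.13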
Choosing a way to connect the two networks
Having understood the network architecture at both ends, we can choose how to connect them. Connecting an on-premises network with a cloud network is usually done either with a Direct Connect product or with a user-built VPN. With Direct Connect, a dedicated line is set up from the customer's office network to the cloud data center, and routes to each other are configured on the network egress device on the office side and on the BGW border gateway on the cloud side, as shown in the following figure:
Due to functional limitations of the existing Direct Connect BGW, the route on the cloud side can only point to the VPC where the K8s cluster resides; it cannot point to a specific K8s node. But to reach ClusterIP-type services and pods, traffic must enter through a cluster node or a pod, so the next hop of the route toward services and pods has to be a cluster node. The Direct Connect product therefore clearly cannot meet this requirement.
Now consider a self-built VPN. A self-built VPN places an endpoint device with a public IP in the customer's office network and another in the cloud network, and establishes an encrypted tunnel between the two; the underlying transport is still the public network. With this solution we can choose a CVM with a public IP in a different subnet of the same VPC as the cluster nodes. After packets from the office network destined for services and pods arrive at this CVM through the VPN tunnel, we configure the route table of the CVM's subnet so that those packets are routed on to a cluster node, and configure the node subnet so that the next hop of the route back to the client points to the endpoint CVM. The same routing configuration is needed in the pod subnet. As for the VPN implementation, after discussing it with the customer we chose IPsec tunnel mode.
Having settled on the scheme, we needed to verify its feasibility in a test environment. Since we do not have the customer's environment, we used a CVM in a different region from the K8s cluster to stand in for the customer's office-side endpoint device. We created a CVM office-ipsec-sh in East China (Shanghai) to simulate the client in the office network, and created a CVM K8s-ipsec-bj with a public IP in the NAT/LB subnet of the VPC where the K8s cluster K8s-BJTEST01 resides in North China (Beijing) to simulate the IPsec endpoint in the customer scenario, then established an IPsec tunnel between it and office-ipsec-sh. In the route table of the NAT/LB subnet we added a route to the service segment whose next hop points to the K8s cluster node K8s-node-vmlppp-bs9jq8pua, hereinafter node A. Since the pod subnet and the NAT/LB subnet belong to the same VPC, no route to the pod segment is needed; traffic to pods matches the VPC local route and is forwarded directly to the corresponding ENI. For the return path, routes to the Shanghai CVM office-ipsec-sh were configured in the node subnet and the pod subnet respectively, with the next hop pointing to K8s-ipsec-bj. The complete architecture is shown in the following figure:
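To keep the routing plan straight, the VPC route-table entries added for the test can be summarized as below. This is a summary of console configuration rather than commands to run; the service CIDR is left as a placeholder because the article does not state it, and 172.16.0.50 is the Shanghai client address used later (the entry could equally cover its whole subnet):

# NAT/LB subnet route table:  <service CIDR>     next hop -> K8s-node-vmlppp-bs9jq8pua (node A)
# Node subnet route table:    172.16.0.50/32     next hop -> K8s-ipsec-bj
# Pod subnet route table:     172.16.0.50/32     next hop -> K8s-ipsec-bj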
Part 3: "problems" in practice
Having decided on the plan, we started building the environment. First we created the K8s-ipsec-bj CVM in the NAT/LB subnet of the K8s cluster and bound a public IP to it, then established the IPsec tunnel with the Shanghai CVM office-ipsec-sh. There are plenty of documents online about configuring IPsec, so we will not go into detail here; interested readers can follow the documentation and practice on their own. After the tunnel is established, if each end can ping the private IP of the other end, the IPsec tunnel is working properly. We then configured the routes of the NAT/LB subnet, the node subnet and the pod subnet as planned. Among the services of the K8s cluster we picked a service named nginx with clusterIP 10.0.58.158, as shown in the figure:
The backend pod of this service is 10.0.0.13, which serves the nginx default page and listens on port 80. From the Shanghai CVM, pinging the service IP 10.0.58.158 succeeds, and port 80 of the service can also be reached with the paping tool!
A curl request to http://10.0.58.158 also succeeds!
Accessing the backend pod directly also works without problems :)
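For reference, the checks run from the Shanghai client can be reproduced with commands along these lines (paping is a third-party TCP "ping" tool; the addresses are those of this test environment):

# ICMP reachability of the service clusterIP
ping -c 3 10.0.58.158
# TCP reachability of the service port (paping sends TCP probes to a port)
paping 10.0.58.158 -p 80 -c 3
# HTTP request to the service, then directly to the backend pod
curl http://10.0.58.158
curl http://10.0.0.13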
Just when the engineer was happily thinking that everything was done, the result of testing another service came like a bucket of cold water. We next picked a mysql service and tested access to port 3306. The clusterIP of this service is 10.0.60.80, and the IP of its backend pod is 10.0.0.14.
Pinging the service clusterIP from the Shanghai CVM works fine, but paping against port 3306 unexpectedly fails!
We then tested accessing the service's backend pod directly. Oddly enough, the backend pod is reachable both by ping and by paping against port 3306!
What's going on?
What is going on here? After a comparative analysis, the engineer found that the only difference between the two services is this: the backend pod 10.0.0.13 of the reachable nginx service is deployed on node A, the node to which client requests are forwarded, while the backend pod of the unreachable mysql service is not on node A but on another node. To verify that this was the cause, we modified the NAT/LB subnet route so that the next hop for the mysql service points to the node where its backend pod is located, and tested again. Sure enough, port 3306 of the mysql service is now reachable!
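For reference, the pod placement that turned out to matter can be checked with kubectl; a quick sketch, assuming the service names nginx and mysql used in this test:

# Which node is each backend pod scheduled on?
kubectl get pod -o wide | grep -E 'nginx|mysql'
# Which pod IPs does each service forward to?
kubectl get endpoints nginx mysql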
Part 4: three whys
At this point, the engineer had three questions in mind:
(1) Why does access succeed when the request is forwarded to the node where the service's backend pod is located?
(2) Why does access fail when the request is forwarded to a node where the service's backend pod is not located?
(3) Why can the service IP be pinged no matter which node the request is forwarded to?
In-depth analysis to eliminate question marks
To eliminate the question marks in our minds, we need an in-depth analysis: understand the cause of the problem, then apply the right remedy. Since we are troubleshooting a network problem, we naturally reach for the classic tool: tcpdump. To keep things focused, we adjusted the test environment. The existing Shanghai-to-Beijing IPsec setup remained unchanged; we scaled out the K8s cluster and created a new, empty node K8s-node-vmcrm9-bst9jq8pua without any pods, hereinafter node B, which only forwards requests. We modified the NAT/LB subnet route so that the next hop of the route to the service address points to that node. For the service under test we kept the previously used nginx service 10.0.58.158 with backend pod 10.0.0.13, as shown in the following figure:
When we need to test the scenario where the request is forwarded to the node where the pod resides, we simply change the next hop of the service route back to K8s-node-A.
Everything is ready, let's start the journey of solving doubts! Go Go Go!
First, to explore the scenario of question 1, we run a capture on K8s-node-A for all packets exchanged with the Shanghai CVM 172.16.0.50, as follows:
tcpdump -i any host 172.16.0.50 -w /tmp/dst-node-client.cap
Remember that we mentioned earlier that in the managed K8s cluster, all pod traffic is sent and received through the node's elastic network interface? On K8s-node-A the ENI used by pods is eth2. We first run curl against http://10.0.58.158 from the Shanghai CVM, and on K8s-node-A check whether packets for pod 10.0.0.13 pass through eth2, with the following command:
tcpdump -i eth2 host 10.0.0.13
The result is as follows:
No packets for 10.0.0.13 are seen on eth2, yet the curl request from the Shanghai CVM succeeds, which means 10.0.0.13 must have replied to the client, just not through eth2. We then widen the capture to all interfaces with the following command:
tcpdump -i any host 10.0.0.13
The result is as follows:
This time we can indeed see the packets exchanged between 10.0.0.13 and 172.16.0.50. For ease of analysis, we use the command tcpdump -i any host 10.0.0.13 -w /tmp/dst-node-pod.cap to write the capture to a cap file.
At the same time, we run tcpdump -i any host 10.0.58.158 to capture packets for the service IP.
We can see that packets are captured whenever 172.16.0.50 issues the curl request, that the only packets are between 10.0.58.158 and 172.16.0.50, and that nothing is captured when no request is running. Since these packets are also contained in the capture for 172.16.0.50, we will not analyze them separately.
We take the capture files for 172.16.0.50 and 10.0.0.13 and analyze them with the Wireshark tool. First, the capture for client 172.16.0.50, as shown in the following figure:
We can see that the client 172.16.0.50 first appears to send a packet to service IP 10.0.58.158 and then a packet to pod IP 10.0.0.13, and the IDs of the two packets are exactly the same. On the return path, pod 10.0.0.13 returns a packet to the client, and then service IP 10.0.58.158 returns a packet to the client with exactly the same ID and content. What causes this?
From the earlier introduction we know that a service forwards client requests to the backend pods. In this process the client addresses the service IP, and the service performs DNAT (NAT based on the destination IP) to forward the request to the backend pod IP. Although the capture appears to show the client sending the packet twice, once to the service and once to the pod, the client did not actually resend anything; the service simply rewrote the destination address. On the way back, the pod replies toward the service, and the service forwards the reply to the client. Because the whole exchange happens within a single node, this translation takes place inside the node's internal virtual network, which is why we did not capture any client-facing packets on eth2, the ENI used by the pod. The pod-side capture confirms this: the HTTP GET request captured for the client also appears in the capture for the pod, which validates our analysis.
So through which network interface does the pod actually send and receive packets? Running netstat -rn on node A to inspect the routes, we find the following:
Within the node, the route to 10.0.0.13 points to the network interface cni34f0b149874, clearly a virtual network device created by the CNI plugin. To verify that all pod traffic is sent and received through this interface, we request the service address from the client again and capture on node A in both the client dimension and the pod dimension, but this time, instead of -i any, we capture the pod dimension with -i cni34f0b149874. Comparing the results, we find, as expected, that every client request packet destined for the pod appears in the capture on cni34f0b149874, while on every other interface in the system we capture no packets exchanged with the client. This proves our inference correct.
To sum up, when the client request is forwarded to the node where the pod is located, the data path is shown in the following figure:
Next we explore the scenario of question 2, the one we care about most: we modify the NAT/LB subnet route so that the next hop toward the service points to the new node, node B, as shown in the figure.
This time we need to capture on both node B and node A, again requesting the service address with curl from the client. On the forwarding node, node B, we first run tcpdump -i eth0 host 10.0.58.158 to capture in the service dimension, and find that the request packets from the client to the service are captured, but the service never sends any response packets, as shown in the figure:
Readers may wonder why, with a capture filter of 10.0.58.158, the destination shown in the on-screen output is the node itself. This is related to how service is implemented. After a service is created in the cluster, the cluster network component listens on a random port on each node and configures forwarding rules in the node's iptables: requests for the service IP arriving at the node are redirected to that port and then handled by the network component. So when you access a service from within a node, you are effectively accessing a port on the node. If you export the capture as a cap file, you can see that the destination IP of the request is still 10.0.58.158, as shown in the figure:
This also explains why a clusterIP can only be accessed from nodes or pods within the cluster: devices outside the cluster do not have the iptables rules created by the K8s network components and cannot translate the service address into a node port; even if such a packet is delivered into the cluster, the service clusterIP does not actually exist in the node's network, so the packet is discarded. (One more piece of arcane knowledge acquired.)
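The service-related rules on a node can be inspected directly. A quick sketch, run on any cluster node, using the nginx clusterIP from this test:

# iptables-based rules that reference the service address (NAT table)
iptables-save -t nat | grep 10.0.58.158
# If kube-proxy runs in IPVS mode (as in this managed cluster, see Part 5),
# the virtual server and its backend pods are listed by ipvsadm instead
ipvsadm -Ln | grep -A 2 10.0.58.158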
Back to the problem itself. Having captured the service-related packets on the forwarding node, we found that the service does not return packets to the client the way it does when the request is forwarded to the node where the pod resides. We then run tcpdump -i any host 172.16.0.50 -w /tmp/fwd-node-client.cap to capture in the client dimension. The contents are as follows:
We found that after the client request reaches the service on the forwarding node, node B, the service does perform DNAT and forwards the request to 10.0.0.13 on node A. However, the forwarding node never receives any packet returned to the client by 10.0.0.13, and the client retransmits the request several times without a response.
So did node A receive the client's request packet, and did the pod reply to the client? We moved over to node A to capture. From the capture on node B we know that on node A there should only be traffic between the client IP and the pod IP, so we capture in those two dimensions. Based on the earlier analysis, once a packet enters the node it should reach the pod through the virtual device cni34f0b149874, and traffic from node B toward the pod should enter node A through the elastic network interface eth2 rather than eth0. To verify this, we first run tcpdump -i eth0 host 172.16.0.50 and tcpdump -i eth0 host 10.0.0.13, and capture nothing.
This shows that the packets did not pass through eth0. We then run tcpdump -i eth2 host 172.16.0.50 -w /tmp/dst-node-client-eth2.cap and tcpdump -i cni34f0b149874 host 172.16.0.50 -w /tmp/dst-node-client-cni.cap to capture the client dimension, and find that the contents of the two captures are identical, which means that after entering node A through eth2 the packets are forwarded to cni34f0b149874 by the routes inside the system. The contents of the packets are as follows:
We can see that after the client sends packets to the pod, the pod does reply to the client. Running tcpdump -i eth2 host 10.0.0.13 -w /tmp/dst-node-pod-eth2.cap and tcpdump -i cni34f0b149874 host 10.0.0.13 -w /tmp/dst-node-pod-cni.cap to capture the pod dimension, we again find identical contents, which means the reply packets from the pod to the client go out through cni34f0b149874 and then leave node A through the eth2 network card. The packet contents also show that the pod replies to the client but never receives the client's acknowledgment of the reply, which triggers retransmissions.
Since the pod's reply packets have been sent, why are they not received on node B or by the client? Looking at the route table of the pod subnet that the eth2 network card belongs to, we suddenly understood!
The pod's reply to the client leaves from node A's eth2 ENI, and although under normal DNAT behavior the packet should be sent back to the service port on node B, the route table of the eth2 (pod) subnet "hijacks" it straight to the host K8s-ipsec-bj. When the packet arrives at that host it has not been translated back by the service, so the reply carries source address 10.0.0.13 (the pod) and destination 172.16.0.50, while the request it answers carried source 172.16.0.50 and destination 10.0.58.158; in other words, the destination of the request and the source of the reply do not match. From the point of view of K8s-ipsec-bj, it sees a reply from 10.0.0.13 to 172.16.0.50 but has never seen a request from 172.16.0.50 to 10.0.0.13. The cloud platform's virtual network is designed to drop a reply packet for which no matching request has been seen, to prevent network attacks based on address spoofing. As a result the client never receives the reply from 10.0.0.13 and the request to the service cannot complete. In this scenario the path of the packets is shown in the following figure:
At this point, the reason the client can successfully request the pod directly is also clear. The data path when requesting the pod is as follows:
The request packets and the reply packets take the same path, both passing through the K8s-ipsec-bj node, with the source and destination IPs unchanged, so the pod is reachable.
Having read this far, sharp-eyed readers may already be thinking: what if we modify the route of the pod subnet that eth2 belongs to, so that packets destined for 172.16.0.50 are not sent to K8s-ipsec-bj but returned to K8s-node-B instead? Would letting the reply retrace the request's path stop it from being discarded? Yes. Our tests verified that this does allow the client to request the service successfully. But don't forget the user's other requirement: the client must also be able to access the backend pods directly. If the pod's reply packets go back through node B, what does the data path look like when the client requests a pod?
As shown in the figure, the client's request to the pod arrives at K8s-ipsec-bj and, since the destination is an address in the same VPC, is forwarded directly to node A's eth2 ENI according to the local routes. When the pod replies to the client, however, the eth2 subnet route now sends the reply to node B. Node B has never seen the client's request to the pod, so it again hits the "reply without a matching request" rule; the reply is discarded and the client cannot reach the pod.
At this point we have figured out why a client request fails when it is forwarded to a node where the service's backend pod is not located. But why does pinging the service address work even though requesting the service's port fails? The engineer inferred that since the service performs DNAT and load balancing for the backend pods, ICMP packets sent to the service address should be answered by the service itself, that is, the service rather than the backend pod replies to the client's ping. To verify this inference, we created an empty service in the cluster that is not associated with any backend, as shown in the figure:
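For reference, a service with no backends can be created with a selector that matches no pods, so that it is assigned a clusterIP but has no endpoints. A minimal sketch; the name ping-test and the selector label are hypothetical, and in our cluster the assigned clusterIP happened to be 10.0.62.200:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: ping-test            # hypothetical name
spec:
  selector:
    app: does-not-exist      # matches no pods, so the service has no endpoints
  ports:
  - port: 80
EOF
# Confirm the assigned clusterIP and that the endpoints list is empty
kubectl get svc ping-test
kubectl get endpoints ping-test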
Then on the client side ping 10.0.62.200, the result is as follows:
Sure enough, even though the service has no backend pods at all, it still responds to ping. This proves that ICMP packets are answered by the service itself and never involve the backend pod, which is why ping always succeeds.
Part 5: is there really no other way?
Having gone to such lengths to find the cause of the access failure, we had to find a way to solve it. In fact, as long as the pod hides its own IP when replying to a client across nodes and presents the service IP to the outside, the packets will not be discarded. The principle is similar to SNAT (address translation based on the source IP). It can be compared to a LAN device without a public IP: it has only a private IP, and when it accesses the public network it goes through a unified public egress, so what the outside sees as the client IP is the IP of that egress, not the device's private IP. To implement SNAT, the first thing that comes to mind is the iptables rules in the node operating system. We run the iptables-save command on node A, the node where the pod resides, to see what iptables rules the system already has.
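Since the full dump is long, the rules of interest can be pulled out with a filter; a quick sketch, using the standard chain names kube-proxy creates:

# Keep a full copy for reference, then focus on the NAT chains created by kube-proxy
iptables-save > /tmp/iptables-all.txt
iptables-save -t nat | grep -E 'KUBE-SERVICES|KUBE-MARK-MASQ|KUBE-POSTROUTING'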
Knock on the blackboard. Pay attention.
We can see that the system has created nearly a thousand iptables rules, most of them related to K8s. We focus on the nat-table rules in the figure above, and the following entries catch our attention:
First of all, let's look at some of the rules in the red box.
-A KUBE-SERVICES -m comment --comment "Kubernetes service cluster ip + port for masquerade purpose" -m set --match-set KUBE-CLUSTER-IP src,dst -j KUBE-MARK-MASQ
This rule means that packets whose source or destination matches an entry in the KUBE-CLUSTER-IP set (a cluster IP plus port) jump to the KUBE-MARK-MASQ chain for masquerade purposes. Masquerade means address disguise, and address disguise is what NAT translation uses.
Next, let's look at some of the rules in the blue box.
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
This rule marks packets entering the KUBE-MARK-MASQ chain with 0x4000/0x4000, flagging them as requiring address masquerade.
Finally, look at the yellow box rules.
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
This rule means that packets marked 0x4000/0x4000, i.e. those requiring SNAT, are handed to the MASQUERADE target for address masquerade.
What these three rules do seems to be exactly what we need iptables to do for us, yet the earlier tests make it clear that they are not taking effect. Why? Is there a parameter in the K8s network component that controls whether packets accessing a clusterIP get SNATed?
To answer this we have to look at the working modes and parameters of kube-proxy, the component responsible for network proxying and forwarding between services and pods. We already know that a service load-balances and forwards traffic to the backend pods; this function relies on the kube-proxy component, which, as the name suggests, is a proxying network component. It runs on every K8s node as a pod. When a service is accessed via its clusterIP and port, iptables rules forward the request to the corresponding port on the node, where kube-proxy handles it and forwards it to the appropriate backend pod according to its internal routing and scheduling algorithm. Initially kube-proxy worked in userspace mode, in which the kube-proxy process was a real TCP/UDP proxy, similar to HAProxy. This mode has been superseded by the iptables mode since K8s version 1.2, so interested readers can study it on their own.
The iptables mode introduced in version 1.2 became the default mode of kube-proxy. In this mode kube-proxy itself no longer acts as a proxy; instead it creates and maintains the iptables rules that forward traffic from services to pods. Implementing the proxy purely with iptables rules has unavoidable drawbacks, however: once a cluster has a large number of services and pods, the number of iptables rules grows sharply, forwarding performance drops significantly, and in extreme cases rules can even be lost.
To address the shortcomings of the iptables mode, K8s introduced the IPVS (IP Virtual Server) mode in version 1.8. IPVS is designed for high-performance load balancing; it uses more efficient hash-table data structures, offers better scalability and performance for large clusters, and supports more sophisticated load-balancing scheduling algorithms than the iptables mode. The kube-proxy of the managed cluster uses the IPVS mode.
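The mode a cluster's kube-proxy runs in can be confirmed from its configuration or its logs; a quick sketch, where the ConfigMap name kube-proxy-config-khc289cbhd is the one from this cluster and the pod label selector is an assumption that may differ in a managed cluster:

# The mode field of the KubeProxyConfiguration ("", "iptables" or "ipvs"), and the masqueradeAll flag
kubectl -n kube-system get cm kube-proxy-config-khc289cbhd -o yaml | grep -E 'mode|masqueradeAll'
# kube-proxy also logs which proxier it uses at startup (label selector is an assumption)
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -iE 'proxier|ipvs'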
However, the IPVS mode does not itself provide packet filtering, address masquerade or SNAT, so in scenarios that need these functions IPVS is used together with iptables rules. Wait: address masquerade and SNAT are exactly what we saw in the iptables rules earlier. In other words, those iptables rules are not followed when masquerade and SNAT are disabled, but once the parameter that enables them is set, the rules we saw will take effect! So we went to the Kubernetes website to look up the kube-proxy parameters, and made an exciting discovery:
What a moment of sudden clarity! The engineer's sixth sense told us that the --masquerade-all parameter is the key to solving our problem!
Part 6: there really are more solutions than difficulties
We decided to test the --masquerade-all parameter. Kube-proxy runs as a pod on every node in the cluster, and its configuration is mounted into the pod as a ConfigMap. We run kubectl get cm -n kube-system to view the kube-proxy ConfigMaps, as shown in the figure:
The one in the red box is kube-proxy's configuration ConfigMap. We run kubectl edit cm kube-proxy-config-khc289cbhd -n kube-system to edit it, as shown in the figure:
We found the masqueradeAll parameter, which defaults to false; we changed it to true and saved the change.
For the configuration to take effect, the current kube-proxy pods managed by the DaemonSet need to be deleted one by one. The pods are rebuilt automatically, and the rebuilt pods mount the modified ConfigMap, so masqueradeAll is in effect. As shown in the figure:
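Putting the change into effect looks roughly like this; the ConfigMap name is the one from this cluster, and the pod label used for the restart is an assumption that may differ in a managed cluster:

# 1. Edit the kube-proxy configuration and set masqueradeAll (under the iptables section) to true
kubectl -n kube-system edit cm kube-proxy-config-khc289cbhd
# 2. Delete the kube-proxy pods one by one; the DaemonSet recreates them with the new ConfigMap mounted
kubectl -n kube-system get pod -l k8s-app=kube-proxy       # label is an assumption
kubectl -n kube-system delete pod <kube-proxy-pod-name>    # repeat per node, waiting for each to come back Ready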
Then comes the exciting moment: we point the route toward the service at node B, and run paping 10.0.58.158 -p 80 on the Shanghai client to observe the result (rubbing our hands expectantly):
At this sight, the engineer could not help but shed tears of joy.
Another test with curl http://10.0.58.158 also succeeds!
Directly accessing the backend pod, and forwarding the request to the node where the pod resides, both continue to work without problems. At this point the customer's requirement is finally satisfied, and we can breathe a sigh of relief!
Grand finale: knowing the why behind the what
Although the problem is solved, our exploration is not over. With the masqueradeAll parameter enabled, how exactly does the service SNAT the packets so as to avoid the earlier packet loss? Once again, we analyze by capturing packets.
First, the scenario where the request is forwarded to a node where the pod is not located. When the client requests the service, capturing on the client IP at the node where the pod resides now yields no packets.
This shows that with the parameter enabled, the request to the backend pod is no longer initiated with the client IP as the source.
Capturing on the pod IP at the forwarding node, we can see the packets exchanged between the service port on the forwarding node and the pod.
This indicates that the pod does not return packets directly to the client 172.16.0.50. In other words, the client and the pod are unaware of each other, and all interaction is relayed through the service.
Next we capture packets for the client on the forwarding node. The contents are as follows:
At the same time, we capture packets for the pod on the node where the pod resides. The contents are as follows:
We can see that after the forwarding node receives the curl request packet with sequence number 708, the node where the pod resides receives a request packet with the same sequence number, but the source/destination IPs have been rewritten from 172.16.0.50 / 10.0.58.158 to 10.0.32.23 / 10.0.0.13. Here 10.0.32.23 is the private IP of the forwarding node, in effect the random port corresponding to the service on that node, so the rewrite can be understood as translating the source/destination IPs to 10.0.58.158 / 10.0.0.13. The return path works the same way: the pod sends out the reply with sequence number 17178, and the forwarding node sends a packet with the same sequence number to the client, with the source/destination IPs rewritten back from 10.0.0.13 / 10.0.32.23 to 10.0.58.158 / 172.16.0.50.
From this behavior, the service performs SNAT toward the backend as well as DNAT, which can be understood as a load balancer with client source-IP transparency turned off: neither the client nor the backend is aware of the other, and each only sees the address of the service. The data path in this scenario is shown below:
Requests sent to the pod directly do not involve SNAT and behave the same as when masqueradeAll is not enabled, so we will not analyze them again.
When the client request is forwarded to the node where the pod resides, the service still performs the SNAT translation, but the whole process happens inside that node. As the earlier analysis showed, when the request is forwarded to the node where the pod resides, whether or not SNAT is performed has no effect on the access result.
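The double translation can also be observed in the connection-tracking table on the forwarding node; a quick sketch, assuming the conntrack tool (from the conntrack-tools package) is installed there:

# Entries for the nginx service show the client-side tuple and the SNAT/DNAT-rewritten pod-side tuple
conntrack -L | grep 10.0.58.158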
That is the whole story of this exploration of K8s networking triggered by a customer requirement. I hope the content above is of some help to you and helps you learn something new. If you found the article worthwhile, feel free to share it so more people can see it.