How to understand the Kubernetes network model in depth


This article walks through the Kubernetes network model in some depth. The content is concise and easy to follow, and I hope the detailed introduction gives you something to take away.

The origins of the Kubernetes network model

Container networking originated with Docker's network. Docker uses a relatively simple model: an internal bridge plus internally reserved IPs. The advantage of this design is that the container network is decoupled from the outside world and does not consume the host's IPs or other resources; it is completely virtual. The original design intent was: when a container needs to access the outside world, it borrows the Node's IP via SNAT; when a container needs to provide a service, it uses DNAT, that is, a port is opened on the Node and the traffic arriving there is directed into the container's process via iptables or some other mechanism.

The problem with this model is that the outside world cannot distinguish container traffic from host traffic. For example, suppose you want high availability: 172.16.1.1 and 172.16.1.2 are two containers with the same function, and we want to group them together to provide a service. From the outside, though, they appear to have nothing in common: their addresses are ports borrowed from the hosts' IPs, so it is hard to bring the two together.

On this basis, Kubernetes proposed the following mechanism: every Pod, that is, every small group of containers aggregated by function, should have its own "ID card", its own identity. On the TCP/IP protocol stack, that identity is an IP address.

This IP truly belongs to the Pod, and the outside world, however it reaches the Pod, must deliver to it unmodified. Accessing this Pod IP is genuinely accessing its service; no modification in the middle is allowed. For example, if a client at 10.1.1.1 accesses the Pod at 10.1.2.1, and what the Pod sees is a borrowed host IP instead of the real source IP, that is not allowed. Within a Pod, this IP must be shared by all of its containers, which solves the problem of how a set of functionally cohesive containers becomes a single deployable atom.

The remaining question is how to implement this. Kubernetes places no restrictions on the implementation: you can use an underlay network and steer traffic with the external routers, or, if you want decoupling, lay an overlay network on top of the underlying network. In short, anything goes as long as the goals required by the model are met.

How exactly does a Pod get online?

How exactly is a packet on the container network transmitted?

We can look at it from the following two dimensions:

Protocol level

Network topology

The first dimension is the protocol level.

This is the same idea as the TCP/IP protocol stack: layers 2, 3 and 4 are stacked on top of each other. Sending a packet means going down the stack: the application data is handed to the layer-4 protocol, TCP or UDP, and then the IP header and the MAC header are added before it is sent out. Receiving works in the reverse order: first the MAC header is stripped off, then the IP header, and finally the protocol number and port identify the process that should receive the packet.

The second dimension is the network topology.

The problem a container packet must solve splits into two steps: first, how to jump from the container's network space (C1) to the host's network space (infra); second, how to get from the host to the remote end.

My personal understanding is that a container network solution can be considered at three levels: access, flow control, and channel.

The first is access: which mechanism connects the container to the host. There are classic methods such as a veth pair plus a bridge, as well as other approaches that rely on newer kernel mechanisms (such as macvlan/IPvlan) to deliver packets into the host's network space. A sketch of the classic pattern follows.
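As an illustration, here is a minimal sketch of the veth + bridge access pattern using iproute2. The namespace, device names (ns1, ceth0, veth0, cni0) and addresses are assumptions chosen for the example, not anything Kubernetes mandates:

```bash
# create a "container" network namespace and a veth pair
ip netns add ns1
ip link add veth0 type veth peer name ceth0
ip link set ceth0 netns ns1              # container end moves into the namespace

# host side: a bridge holding the gateway IP, with the veth plugged in
ip link add cni0 type bridge
ip addr add 10.244.0.1/24 dev cni0
ip link set veth0 master cni0
ip link set cni0 up
ip link set veth0 up

# container side: address, link up, default route via the bridge IP
ip netns exec ns1 ip addr add 10.244.0.2/24 dev ceth0
ip netns exec ns1 ip link set ceth0 up
ip netns exec ns1 ip link set lo up
ip netns exec ns1 ip route add default via 10.244.0.1
```

After this, packets leaving the namespace surface on the host at veth0 and are switched by the cni0 bridge, which is exactly the first hop described above.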

The second is flow control: should the scheme support Network Policy, and if so, how is it implemented? The key point here is that the implementation must sit at a spot the data path is guaranteed to pass through; if the data path does not pass the hook point, the flow control does not take effect.

The third is the channel: how packets are transmitted between two hosts. There are many options: routing, which divides into BGP-distributed routes and direct routes, all kinds of tunneling technologies, and so on. In the end, the goal is that a packet from a container leaves the container, crosses the access layer to the host, traverses the host's flow-control module (if there is one), reaches the channel, and is sent to the far end.

One of the simplest routing schemes: Flannel host-gw

This scheme gives each Node an exclusive network segment: each Subnet is bound to one Node, and the gateway is also configured locally, directly on the internal port of the cni0 bridge. The advantage of this scheme is ease of management; the downside is that a Pod cannot migrate across Nodes. Once an IP and a segment belong to a Node, they cannot move to another Node.

The essence of this scheme lies in the setup of the route table, as shown in the figure above. Next, I will interpret the entries one by one.

The first entry is very simple: it is added when the network card is configured. It specifies which gateway IP the default route goes through and which device is the default device.

The second entry is the rule for the local Subnet. It says: my segment is 10.244.0.0 with a 24-bit mask, and its gateway address sits on the bridge, namely 10.244.0.1. Every packet destined for this segment is sent to this bridge IP.

The third entry is the rule for the peer. It says: if the destination segment is 10.244.1.0 (the Subnet on the right of the figure above), use the IP on that Node's host network card, 10.168.0.3, as the gateway. In other words, packets destined for the 10.244.1.0/24 segment go via 10.168.0.3.
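Putting the three entries together, the route table on the left-hand Node might look like this in `ip route` output. The host's own default gateway (10.168.0.1 here) is an assumption; the other values come straight from the text:

```bash
$ ip route show
default via 10.168.0.1 dev eth0                                 # entry 1: the host's default route
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1   # entry 2: local Pod subnet lives on the bridge
10.244.1.0/24 via 10.168.0.3 dev eth0                           # entry 3: peer Node's subnet via its host IP
```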

Let's take a look at how this packet actually runs.

Suppose the container 10.244.0.2 wants to send a packet to 10.244.1.3. After assembling a TCP or UDP packet locally, it fills in the peer IP as the destination address, the local Ethernet MAC as the source MAC, and a destination MAC. Generally a default route is configured locally that uses the IP on cni0 as its default gateway, so the destination MAC is that gateway's MAC address. The packet can then be sent onto the bridge. If the destination segment were on this bridge, a MAC-layer exchange would be enough to deliver it.

In this example, however, the destination IP does not belong to the local segment, so the bridge hands the packet up to the host's protocol stack for processing. The host's protocol stack then looks for the next hop's MAC address: with 10.168.0.3 as the gateway, it obtains 10.168.0.3's MAC address through a local ARP probe. That is, by reassembling the headers layer by layer and filling in the destination MAC as the MAC of the right-hand host's network card, the packet is sent from the host's eth0 to the peer's eth0.

So you can see an implicit limitation here: after the MAC address above is filled in, the frame must be able to reach the peer directly. If the two hosts are not layer-2 connected and there are gateways or complex routes in between, the peer's MAC is not directly reachable and this scheme cannot be used. When the packet arrives at the host owning that MAC address, the host finds the frame really is addressed to it but the IP is not its own, so it starts the forwarding process: the packet goes up to the protocol stack and through routing again, which finds that 10.244.1.0/24 should be sent to the gateway 10.244.1.1, and thus it reaches the cni0 bridge. The bridge finds the MAC address corresponding to 10.244.1.3 and, through the bridging mechanism, the packet finally reaches the peer container.

As you can see, the whole process keeps alternating between layer 3 and layer 2: route, switch, route again: a big loop enclosing small loops. This is a relatively simple scheme. If there were a tunnel in the middle, there might be a vxlan tunnel device, in which case the route would point not at a directly reachable gateway but at the peer's tunnel endpoint.

How exactly does a Service work?

A Service is essentially a load-balancing (Load Balance) mechanism.

We think of it as client-side (Client Side) load balancing: the conversion from VIP (virtual IP) to RIP (real IP) is completed on the client side, with no need for traffic to converge on some central component such as an NGINX or an ELB to make the decision.

It is implemented as follows: first, a group of Pods forms a functional backend; then a virtual IP is defined at the front as the access entry. Generally, because an IP is hard to remember, a DNS domain name is attached as well: the Client first resolves the domain name to the virtual IP, which is then converted to a real IP. Kube-proxy is the core of the whole mechanism and hides a great deal of complexity: it watches Pod/Service changes through the apiserver (for example, a Service or Pod being added) and reflects them in local rules or user-space processes.

An LVS version of Service

Let's actually build an LVS version of Service. LVS is a kernel mechanism dedicated to load balancing. It works at layer 4 and performs better than an iptables implementation.

Suppose we are a Kube-proxy and we receive a Service configuration, as shown in the figure below: it has a Cluster IP, port 9376 on that IP, a container port of 80 that traffic should be directed to, and three working Pods with IPs 10.1.2.3, 10.1.14.5, and 10.1.3.8.
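In YAML terms, that configuration might look roughly like the sketch below. The Service name, selector, and the concrete Cluster IP 10.96.0.100 are assumptions for illustration (the text does not give them); the ports and the three backing Pods come from the example:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: my-service          # hypothetical name
spec:
  clusterIP: 10.96.0.100    # hypothetical Cluster IP, used in the steps below
  selector:
    app: my-app             # hypothetical label matching the three Pods
  ports:
  - port: 9376              # the Service port on the Cluster IP
    targetPort: 80          # the container port traffic is directed to
EOF
```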

What it has to do is:

Step 1: bind the VIP locally (tricking the kernel)

First, you need to convince the kernel that it owns this virtual IP. This is dictated by how LVS works: it operates at layer 4 and does not participate in IP forwarding, so it only hands a packet up to the TCP or UDP layer if it believes the destination IP is its own. So in step one, we set the IP into the kernel and tell the kernel it really does have this IP. There are many ways to do this; here we add a local route directly with ip route, and adding the IP to a dummy device also works.
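Both methods the text mentions might look like this; the VIP 10.96.0.100 is the hypothetical Cluster IP from the sketch above:

```bash
# Method 1: add a local route directly, as described in the text
ip route add local 10.96.0.100/32 dev lo

# Method 2: attach the VIP to a dummy device instead
ip link add kube-ipvs0 type dummy
ip addr add 10.96.0.100/32 dev kube-ipvs0
```

Either way, the kernel now considers 10.96.0.100 a local address and will hand packets for it up to layer 4.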

Step 2: create an IPVS virtual server for this virtual IP

This tells the kernel: I need load-balanced distribution for this IP, with such-and-such a distribution strategy, and so on. The virtual server's IP is exactly our Cluster IP.
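With the ipvsadm tool this is one command; round-robin (rr) here is an assumed choice of scheduling strategy:

```bash
# create a TCP virtual server on the Cluster IP and Service port, round-robin scheduling
ipvsadm -A -t 10.96.0.100:9376 -s rr
```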

Step 3: create the corresponding real servers for the IPVS service

We need to configure real servers for the virtual server, and these are exactly the backends of the Service. For example, we just saw three Pods, so we map those three IPs onto the virtual server in exact correspondence. Kube-proxy's work is similar to this, except that it also has to watch for Pod changes: if the number of Pods becomes 5, the rules must become 5; if one of the Pods dies or is killed, the corresponding entry must be removed; and if the whole Service is deleted, all of these rules must be deleted. In other words, it also does this management-level work.
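Continuing the sketch, the three Pods from the example become three real servers behind the virtual server; masquerading (-m, i.e. NAT) is an assumed choice of forwarding mode:

```bash
# add the three Pods (container port 80) as real servers behind the virtual server
ipvsadm -a -t 10.96.0.100:9376 -r 10.1.2.3:80  -m -w 1
ipvsadm -a -t 10.96.0.100:9376 -r 10.1.14.5:80 -m -w 1
ipvsadm -a -t 10.96.0.100:9376 -r 10.1.3.8:80  -m -w 1
```

If one of the Pods disappeared, its line would be reverted with `ipvsadm -d`; this add/remove bookkeeping is exactly the management work kube-proxy automates.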

What? Load balancing is also divided into internal and external.

Finally, we introduce the types of Service, which can be divided into the following four categories.

ClusterIP

A virtual IP inside the cluster, bound to the Pods in a Service's Group; this is also the default Service type. Its limitation is that this IP can only be used from the Nodes, that is, inside the cluster.

NodePort

For callers outside the cluster. The Service is hosted on a static port of each Node, with a one-to-one correspondence between port number and Service, so that users outside the cluster can call the Service via NodeIP:NodePort.
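A minimal NodePort sketch, reusing the hypothetical names from the earlier Service; the nodePort value 30080 is likewise an assumption (it must fall in the cluster's NodePort range, 30000-32767 by default):

```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: my-service-nodeport   # hypothetical name
spec:
  type: NodePort
  selector:
    app: my-app               # hypothetical label
  ports:
  - port: 9376
    targetPort: 80
    nodePort: 30080           # the static port opened on every Node
EOF
```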

LoadBalancer

An extension interface for cloud vendors. Vendors such as Alibaba Cloud (Aliyun) and Amazon have mature LB mechanisms, possibly implemented by a large cluster of their own; so as not to waste that capability, they can extend Kubernetes through this interface. A LoadBalancer Service first automatically creates the NodePort and ClusterIP mechanisms, and the cloud vendor can choose to attach its LB directly to either of them, or to neither and instead attach the Pods' RIPs directly to the backend of its ELB.

ExternalName

Abandons the internal mechanisms and relies on external facilities. For example, a particularly capable user might decide that what we provide is not useful and prefer to implement everything on their own. In that case, each Service corresponds one-to-one to a domain name, and the whole load-balancing job is implemented externally.

The following figure is an example. It flexibly combines several kinds of Service, such as ClusterIP and NodePort, with a cloud vendor's ELB, yielding a very flexible, highly scalable system that is genuinely usable in production.

First, we use ClusterIPs as the service entries for the functional Pods. As you can see, with three kinds of Pods there are three Service Cluster IPs serving as their entry points. These mechanisms are all on the Client side; so how do we add some control on the Server side?

First, there are some Ingress Pods (Ingress is a newer Kubernetes service; it is essentially a group of homogeneous Pods), and these Pods are organized and exposed through a NodePort.

Any user who accesses port 23456 on a Node reaches the Ingress service. Behind it sits a Controller, which manages the Service IPs and the Ingress backends, forwarding finally to a ClusterIP and from there to our functional Pods. As mentioned earlier, when we attach the cloud vendor's ELB, we can let the ELB listen on port 23456 of all cluster nodes; as long as something is serving on port 23456, we take it that an instance of Ingress is running there.
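For concreteness, here is a sketch of an Ingress resource that routes an external host to the ClusterIP Service defined earlier. The apiVersion is the current networking.k8s.io/v1 form, and the host name and resource names are assumptions; exposing the Ingress controller itself through a NodePort (like port 23456 above) is a separate piece of configuration belonging to the controller's deployment:

```bash
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress            # hypothetical name
spec:
  rules:
  - host: app.example.com     # hypothetical external domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service  # the hypothetical ClusterIP Service from earlier
            port:
              number: 9376
EOF
```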

The traffic path, then, is: an external domain name is resolved and steered to the cloud vendor's ELB; the ELB forwards through the NodePort to the Ingress; and the Ingress calls through the ClusterIP to the real backend Pods. This system looks rich and robust: no link has a single-point problem, and every link has management and feedback.

Summary

That is the end of the main content; here is a brief summary:

Understand from first principles how the Kubernetes network model evolved, and where the intent of PerPodPerIP (one IP per Pod) lies.

The fundamentals of networking do not change. Following the model, going down from layer 4 is the process of sending a packet, and stripping headers off layer by layer is the process of receiving one; the container network is no different.

Mechanisms such as Ingress make it easier to deploy services external to the cluster at a higher level (the service port). Through a genuinely production-usable deployment example, we hope you can combine concepts such as Ingress + ClusterIP + Pod IP to understand how the community thinks when it publishes new mechanisms and new resource objects.
