
How to implement the Kubernetes Container Network Model


This article explains how the Kubernetes container network model is implemented. The approach described here is straightforward and practical, so let's walk through it step by step.

Container-to-Container networking within a Pod

A Linux network namespace provides a logical network stack for the processes inside it, including network devices, routes, and firewall rules. In other words, a network namespace (NS) gives all of its processes an independent logical network stack.

By default, Linux places every process in the root NS, and these processes reach the outside world through eth0.

Inside a Pod, all containers share one NS: they have the same IP and port space and can reach one another over localhost. Shared storage is also accessible, mounted into the containers as shared volumes. The figure below illustrates the one-NS-per-Pod setup:

Pod-to-Pod networking on the same Node

First, how is networking between Pods on the same Node implemented? The answer is the virtual Ethernet device, or veth pair: its two virtual interfaces bridge two namespaces, with one end attached to the root NS (the host) and the other to the Pod NS, like a network cable connecting two separate network spaces, as shown in the figure:

With the veth pair as the "network cable", each Pod's network is connected to the root NS, but how do packets travel between different Pods within the root NS? The answer is the Linux Ethernet bridge, a virtual Layer 2 device that switches Ethernet frames between the interfaces attached to it. Here the old-school ARP protocol comes into play: it discovers the MAC address that corresponds to an IP address. The bridge broadcasts the ARP request frame to all connected devices (except the sender) and, once the ARP reply comes back, forwards packets to the corresponding veth device, as shown in the figure below.
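To make this concrete, here is a minimal Go sketch of the same plumbing using the third-party vishvananda/netlink library: it creates a veth pair, attaches the host end to a Linux bridge, and moves the peer end into a Pod's network namespace. The device names and the netns file descriptor are hypothetical, and this is an illustrative sketch rather than how any particular CNI plug-in actually does it.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

// setupPodLink wires a Pod netns (given as an open fd on /var/run/netns/<name>
// or /proc/<pid>/ns/net) to a host bridge with a veth pair.
// Names and the netns fd are placeholders; error handling is minimal.
func setupPodLink(podNetnsFd int) error {
	// Create (or reuse) the Layer 2 bridge in the root namespace.
	br := &netlink.Bridge{LinkAttrs: netlink.LinkAttrs{Name: "cni0"}}
	if err := netlink.LinkAdd(br); err != nil {
		log.Printf("bridge may already exist: %v", err)
	}

	// Create the veth pair: veth-host stays in the root NS, veth-pod moves to the Pod NS.
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: "veth-host"},
		PeerName:  "veth-pod",
	}
	if err := netlink.LinkAdd(veth); err != nil {
		return err
	}

	// Attach the host end to the bridge and bring it up.
	if err := netlink.LinkSetMaster(veth, br); err != nil {
		return err
	}
	if err := netlink.LinkSetUp(veth); err != nil {
		return err
	}

	// Move the peer end into the Pod's network namespace.
	peer, err := netlink.LinkByName("veth-pod")
	if err != nil {
		return err
	}
	return netlink.LinkSetNsFd(peer, podNetnsFd)
}

func main() {
	// podNetnsFd would normally come from opening /var/run/netns/<pod-ns>.
	_ = setupPodLink
}
```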

Cross-Node Pod-to-Pod networking

Before going further, it is worth stating the three fundamental requirements that K8s imposes on its (Pod) networking design; any networking implementation must satisfy all three:

Without using NAT, all Pods can communicate with any other Pods

Without using NAT, all Nodes can communicate with all Pods

The IP a Pod sees for itself must be the same IP that other Pods see for it.

In a nutshell, the K8s network model requires every Pod IP to be reachable throughout the cluster network. Concrete implementations fall into three categories:

Layer2 (Switching) Solution

Layer3 (Routing) Solution, such as Calico, Terway

Overlay Solution, such as Flannel

These solutions are described later in this article; for now, assume that Pod IP reachability across the network is guaranteed.

Before a Pod obtains its IP, each Node is assigned a CIDR (Classless Inter-Domain Routing) block, from which every Pod on that Node gets a unique IP; the size of the block corresponds to the maximum number of Pods per Node (110 by default). Once Pod IPs and the cross-Node network layer are in place, communication from source Pod1 to destination Pod4 proceeds as shown in the figure below.
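On the API side, these per-Node blocks are visible as spec.podCIDR on the Node objects. The following Go sketch uses the official client-go library to list them; the kubeconfig path is an assumption (inside a cluster you would use rest.InClusterConfig() instead), and this is only an illustration of where the block lives in the API.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at this (hypothetical) path.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes.Items {
		// spec.podCIDR is the per-Node block that Pod IPs are carved from.
		fmt.Printf("%s\t%s\n", n.Name, n.Spec.PodCIDR)
	}
}
```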

Pod-to-Service networking

A K8s Service tracks the state of the Pods backing it, follows their IPs as Pods come and go, and exposes a stable virtual IP that routes to those Pod IPs, so that clients access the service through the virtual IP and are shielded from the back-end details. When a Service is created, it is given a virtual IP (the Cluster IP), and any traffic to that virtual IP is distributed and routed to the Pods belonging to the Service.

How does a K8s Service load-balance access to its virtual IP? The answer is netfilter and iptables. Netfilter is a framework built into the Linux kernel that supports custom handlers for packet filtering, NAT, and port translation. Iptables is a user-space rule management tool that manages packet-forwarding rules for the netfilter framework.

In the K8s implementation, kube-proxy (a per-node daemon) watches the apiserver for changes in Service configuration, such as virtual IP changes and Pod IP changes (i.e., Pods going up or down). It updates the iptables rules accordingly so that requests to the virtual IP are routed to one of the Service's Pods, chosen at random; in this way iptables provides Pod load balancing. On the return path, iptables rewrites the source address in the IP header from the Pod IP back to the Service's virtual IP, so from the client's point of view the conversation happens only with the virtual IP.
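The random selection relies on iptables' statistic match. The Go sketch below only prints rules in the style kube-proxy's iptables mode generates for a Service with several endpoints; it is an illustration of the probability scheme (rule i out of n matches with probability 1/(n-i), so the endpoints end up uniformly random), not kube-proxy's actual code, and the chain name is made up.

```go
package main

import "fmt"

// printServiceRules prints iptables-style DNAT rules for a virtual IP and a
// set of Pod endpoints, mimicking the statistic-based random selection used
// by kube-proxy's iptables mode. The chain name is hypothetical.
func printServiceRules(virtualIP string, port int, endpoints []string) {
	n := len(endpoints)
	for i, ep := range endpoints {
		rule := fmt.Sprintf("-A KUBE-SVC-EXAMPLE -d %s/32 -p tcp --dport %d", virtualIP, port)
		if i < n-1 {
			// Each rule matches with probability 1/(n-i); the last rule
			// catches whatever is left, so selection is uniform overall.
			rule += fmt.Sprintf(" -m statistic --mode random --probability %.5f", 1.0/float64(n-i))
		}
		rule += fmt.Sprintf(" -j DNAT --to-destination %s", ep)
		fmt.Println(rule)
	}
}

func main() {
	printServiceRules("10.96.0.10", 80, []string{
		"10.244.1.5:8080",
		"10.244.2.7:8080",
		"10.244.3.2:8080",
	})
}
```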

Since K8s v1.11, IPVS (IP Virtual Server) has been available as a second in-cluster load-balancing mode. IPVS is also built on netfilter, and kube-proxy can be configured to use either iptables or IPVS. IPVS is purpose-built for load balancing and offers a rich set of balancing algorithms for different scenarios.

Use DNS

Each Service gets a DNS name, and the kubelet's --cluster-dns= flag configures the DNS server for every container, which is then used to resolve a Service's DNS name to the corresponding Cluster IP or Pod IP. Since v1.12, CoreDNS has been the default DNS provider. Services support three types of DNS records (A, CNAME, and SRV), with A records being the most common. For example, under the cluster domain cluster.local, a Pod A record has the form pod-ip-address.my-namespace.pod.cluster.local, and a Pod's hostname and subdomain fields can be set to form a standard FQDN such as custom-host.custom-subdomain.my-namespace.svc.cluster.local.
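From inside a Pod, resolving a Service name is just an ordinary DNS lookup against the cluster DNS server, since the Pod's /etc/resolv.conf points at it. A minimal Go sketch (the service and namespace names are placeholders):

```go
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// "my-svc" and "my-namespace" are hypothetical names; the standard
	// resolver uses the cluster DNS configured in /etc/resolv.conf.
	addrs, err := net.LookupHost("my-svc.my-namespace.svc.cluster.local")
	if err != nil {
		log.Fatal(err)
	}
	for _, a := range addrs {
		fmt.Println(a) // for a ClusterIP Service this prints the Cluster IP
	}
}
```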

CNI

The container network model is implemented through cooperation between the K8s component that manages Pod resources on each node (the kubelet) and plug-ins conforming to the Container Network Interface (CNI) standard. A CNI plug-in acts as "glue": it lets the kubelet drive any container network implementation through one consistent operational interface. Multiple container networks can even coexist in a cluster to serve the networking needs of different Pods, all under the kubelet's unified control.
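For reference, a bare-bones CNI plug-in in Go is essentially a binary that exposes the ADD/DEL/CHECK commands through the containernetworking skel package. The sketch below only shows that skeleton: it does not allocate IPs or print a proper CNI Result, so it is not a working network plug-in, just an illustration of the calling convention; the exact entry-point signature also varies across CNI library versions.

```go
package main

import (
	"log"

	"github.com/containernetworking/cni/pkg/skel"
	"github.com/containernetworking/cni/pkg/version"
)

// cmdAdd is called by the runtime (via the kubelet) when a Pod sandbox is
// created. A real plug-in would allocate an IP, wire up the interfaces, and
// print a CNI Result on stdout; this sketch only logs the arguments to stderr.
func cmdAdd(args *skel.CmdArgs) error {
	log.Printf("ADD container=%s netns=%s ifname=%s", args.ContainerID, args.Netns, args.IfName)
	return nil
}

// cmdDel releases whatever cmdAdd set up.
func cmdDel(args *skel.CmdArgs) error {
	return nil
}

// cmdCheck verifies that the previously configured network is still in place.
func cmdCheck(args *skel.CmdArgs) error {
	return nil
}

func main() {
	// Classic skel.PluginMain form; newer CNI releases offer other entry points.
	skel.PluginMain(cmdAdd, cmdCheck, cmdDel, version.All, "example CNI plug-in sketch")
}
```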

Overlay networking: Flannel

Flannel is a K8s networking solution developed by CoreOS and one of the container network options supported by Aliyun ACK. Its design principle is simple: it builds another flat network (the overlay) on top of the host network, assigns each Pod an IP address from that overlay address space, and uses it for routing and communication.

Container-to-container traffic on a single host goes over the docker bridge docker0, which needs no further explanation. Communication between hosts is achieved through the kernel routing table and IP-over-UDP encapsulation. A container's IP packet flows through the docker bridge and is forwarded to the flannel0 (TUN) device, from which it flows into the flanneld process. Flanneld looks up the next-hop host IP for the subnet that contains the packet's destination IP; the mapping (key-value) between container subnet CIDRs and host IPs is stored in etcd. Having found the destination host, flanneld encapsulates the IP packet as the payload of a UDP packet addressed to that host and sends it over the host network. On the destination host, the UDP packet flows through flanneld, which unwraps the inner IP packet and passes it on to flannel0, then docker0, and finally to the destination container IP. The figure below illustrates the process.
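The following Go sketch shows the essence of this IP-over-UDP idea: read a raw IP packet from a TUN device (using the third-party songgao/water library) and forward it as the payload of a UDP datagram to the next-hop host. It is a toy illustration of the mechanism, not flanneld's code; the next-hop address and port are illustrative, and the reverse (decapsulation) path is omitted.

```go
package main

import (
	"log"
	"net"

	"github.com/songgao/water"
)

func main() {
	// Open a TUN device; packets routed to it arrive here as raw IP packets.
	ifce, err := water.New(water.Config{DeviceType: water.TUN})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("reading from TUN device %s", ifce.Name())

	// Hypothetical next-hop host; flanneld would look this up in etcd
	// based on the destination Pod subnet.
	conn, err := net.Dial("udp", "192.168.0.101:8285")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 65536)
	for {
		n, err := ifce.Read(buf)
		if err != nil {
			log.Fatal(err)
		}
		// Encapsulate: the whole IP packet becomes the UDP payload.
		if _, err := conn.Write(buf[:n]); err != nil {
			log.Fatal(err)
		}
	}
}
```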

It is worth mentioning that Flannel itself places no particular limit on the number of container-CIDR-to-next-hop-host mapping entries. On Aliyun ACK, however, these entries must be distributed into the VPC/vSwitch control plane, and for overall performance reasons there is a quantity limit (48 by default). In a self-built host network this limit is not noticeable, because the next-hop hosts all sit on one large Layer 2 plane.

Newer versions of Flannel no longer recommend the UDP backend, because each packet is copied three times between user space and kernel space, which costs a lot of performance (as shown in the figure below). Newer deployments should use the VXLAN backend or a cloud-provider-specific backend instead.
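With VXLAN, encapsulation moves into the kernel: the daemon only has to create and program a VXLAN device. Below is a minimal Go/netlink sketch of creating such a device; the device name, VNI, and port are illustrative, and real backends also program FDB and ARP entries, which is omitted here.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Use the host's primary NIC as the VXLAN underlay device.
	parent, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatal(err)
	}

	// Create a VXLAN device; the kernel then encapsulates and decapsulates
	// traffic without any user-space copies.
	vxlan := &netlink.Vxlan{
		LinkAttrs:    netlink.LinkAttrs{Name: "flannel.1"},
		VxlanId:      1,                    // VNI, illustrative
		VtepDevIndex: parent.Attrs().Index, // underlay interface
		Port:         8472,                 // Linux VXLAN default UDP port
		Learning:     false,
	}
	if err := netlink.LinkAdd(vxlan); err != nil {
		log.Fatal(err)
	}
	if err := netlink.LinkSetUp(vxlan); err != nil {
		log.Fatal(err)
	}
}
```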

L3 networking: Calico

Calico is a very popular container network architecture based on L3 routing. Its main components are Felix, BIRD, and the BGP Route Reflector; Felix and BIRD are both daemons running on each Node. A brief architecture overview:

Felix manages and configures the network interfaces, including programming routes and ACLs: it writes routing information into the Linux kernel FIB and manages the ACLs. Because Felix is functionally complete and operationally self-contained, it is reused off-the-shelf inside the Aliyun Terway network plug-in to provide Terway's network policy feature.

BIRD (a BGP client) distributes the kernel FIB routing entries to the rest of the cluster network, making each node's routes visible to all other nodes; this is the BGP routing protocol at work. In a full mesh, every BGP client peers with every other BGP client, which becomes an obvious bottleneck for larger deployments (the number of connections grows as N^2). To address this limitation, the BGP Route Reflector component is introduced, and the BGP clients' routing information is then propagated through this aggregation layer. Multiple Reflectors can be deployed in a cluster, depending on its size. A Reflector only distributes routing signaling and entries; it carries no data-plane traffic. Route distribution through the aggregation layer:

L3 networking: Terway

Terway is Aliyun's self-developed CNI plug-in. It builds on the VPC infrastructure, interconnects easily with other Aliyun products, avoids the performance loss of an overlay network, and offers easy-to-use backend features.

Terway's functionality is divided into three parts: 1. the CNI plug-in, a standalone binary; 2. the backend server (also called the daemon), which runs on every Node as a DaemonSet; 3. Network Policy, which fully reuses the Calico Felix implementation.

The CNI plug-in binary is installed on all nodes by an initContainer of the DaemonSet deployment and implements the three operations ADD, DEL, and CHECK for the kubelet to call. The network setup steps during Pod creation illustrate how this works:

When a Pod is scheduled to a node, the kubelet notices that a Pod has been created on its node and uses the container runtime (docker, etc.) to create the sandbox container, bringing up the required namespaces.

The kubelet calls the plug-in binary's cmdAdd interface. After validating the parameters, the plug-in issues an AllocIP call to the backend server.

The backend server's networkService allocates a Pod IP according to the Pod's network type; three network types are supported: ENIMultiIP, VPCENI, and VPCIP:

ENIMultiIP: an ENI (elastic network interface) carrying multiple secondary IPs; the ResourceManager in networkService assigns an IP from its own IP address pool.

VPCENI: an ENI is created and attached for the Pod; allocateENI in networkService calls the Aliyun OpenAPI to create the ENI and attach it to the ECS instance, and obtains the ENI's IP address.

VPCIP: the Pod is assigned an IP address on the VPC network plane, obtained from the VPC control plane by calling the plug-in's ipam interface.

Once the backend server returns the result of the AllocIP call (an IP address), the plug-in calls the Setup implementation of the NetnsDriver appropriate to the network type to wire the container NIC to the host NIC, where the drivers are:

VethDriver

RawNicDriver (mainly implements VPC flat-network routing setup, including default route and gateway configuration)

Both ENIMultiIP and VPCIP use the vethDriver link mode, whose steps include:

Create the veth pair

Add the IP address to the container interface

Add routes

Configure the host-side namespace

Add host routes and rules

VPCENI is slightly different: it binds an ENI on the VPC plane to each Pod, which involves calls to both NetnsDriver implementations:

To sum up, the figure below shows the overall flow:

Why support these three network types? Fundamentally, it is dictated by Aliyun's VPC network infrastructure, while also covering how mainstream Aliyun applications consume VPC network resources. For comparison, the standard Amazon AWS container network solution supports the same functionality on its VPC- and ENI-based network facilities.

The main difference among ENI multi-IP, VPC ENI, and VPC IP is that with the first two the Pod network segment is the same as the VPC network segment, whereas with VPC IP the Pod segment differs from the node's host network segment. Consequently, in the ENI modes IP routing happens entirely on the VPC's L2 network plane, while the VPC IP mode must configure a next-hop host for the Pod segment in the VPC route table, similar to Flannel's routing pattern. ENI networking therefore offers more flexible routing choices and better routing performance. The following two screenshots show their different routing characteristics:

VPC ENI Network:

VPC IP Network:

ENI multi-IP networks (one primary IP plus multiple secondary IPs) support two routing patterns: veth policy routing and ipvlan. The essential difference is the routing mechanism: the former uses policy routing over veth pairs, the latter uses ipvlan. Policy routing requires configuring policy-routing entries on the node to ensure that a secondary IP's traffic leaves through the ENI it belongs to. Ipvlan lets one NIC be virtualized into multiple sub-interfaces with different IP addresses; the ENI binds its secondary IPs to these virtual sub-interfaces, forming an L3 network attached to the VPC plane. This keeps the ENI multi-IP network structure relatively simple, and its performance is better than veth policy routing. Switching between the two modes is a configuration change (the default is vethpair).
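For a sense of what the ipvlan mode does at the Linux level, here is a small Go/netlink sketch that creates an ipvlan sub-interface on top of an ENI and moves it into a Pod's network namespace. The interface names and netns file descriptor are hypothetical, and this is not Terway's implementation, just the underlying kernel mechanism.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

// addIPVlanSubInterface creates an ipvlan sub-interface on top of the given
// parent NIC (e.g. an ENI) and moves it into the target network namespace.
func addIPVlanSubInterface(parentName, subName string, podNetnsFd int) error {
	parent, err := netlink.LinkByName(parentName)
	if err != nil {
		return err
	}

	ipvl := &netlink.IPVlan{
		LinkAttrs: netlink.LinkAttrs{
			Name:        subName,
			ParentIndex: parent.Attrs().Index,
		},
		// L2 mode: the sub-interface shares the parent's MAC and is switched
		// on the VPC plane; L3 mode is also available.
		Mode: netlink.IPVLAN_MODE_L2,
	}
	if err := netlink.LinkAdd(ipvl); err != nil {
		return err
	}
	// Hand the sub-interface to the Pod; its secondary IP is configured inside.
	return netlink.LinkSetNsFd(ipvl, podNetnsFd)
}

func main() {
	_ = addIPVlanSubInterface // illustrative only
	log.Println("ipvlan sketch")
}
```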

It is also worth mentioning that Terway implements a pooling and allocation mechanism for ENI multi-IP address resources. The eniIPFactory in networkService creates one goroutine per ENI, and allocation and release of that ENI's secondary IPs (eniIPs) are handled inside that goroutine. When an eniIP is requested, the existing ENIs are scanned; if an ENI still has a free eniIP, its goroutine returns an allocated IP to the eniIPFactory over ipResultChan. If the eniIPs of all ENIs are exhausted, a new ENI and its goroutine are created first; when an ENI is created for the first time, no extra IP allocation is needed, and its primary IP can be returned directly. Release works in reverse: when the last eniIP of an ENI is released, the whole ENI is released.
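The pattern described above can be sketched generically in Go: one goroutine per ENI serving allocation requests and sending results back over a channel. This is a deliberately simplified, hypothetical sketch of the concurrency pattern, not Terway's actual eniIPFactory code.

```go
package main

import "fmt"

// allocRequest asks an ENI goroutine for one secondary IP; the result (or an
// empty string when the ENI is full) comes back on the reply channel.
type allocRequest struct {
	reply chan string
}

// runENI owns one ENI's pool of secondary IPs and serves allocations
// sequentially, so no locking is needed for the per-ENI state.
func runENI(name string, freeIPs []string, requests <-chan allocRequest) {
	for req := range requests {
		if len(freeIPs) == 0 {
			req.reply <- "" // signal "no free IP on this ENI"
			continue
		}
		ip := freeIPs[0]
		freeIPs = freeIPs[1:]
		req.reply <- ip
	}
}

func main() {
	requests := make(chan allocRequest)
	go runENI("eni-1", []string{"192.168.0.10", "192.168.0.11"}, requests)

	// The "factory" side: ask the ENI goroutine for IPs until it runs dry,
	// at which point a real implementation would create a new ENI/goroutine.
	for i := 0; i < 3; i++ {
		req := allocRequest{reply: make(chan string)}
		requests <- req
		if ip := <-req.reply; ip != "" {
			fmt.Println("allocated", ip)
		} else {
			fmt.Println("ENI exhausted; would create a new ENI here")
		}
	}
}
```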

In addition, a startCheckIdleTicker goroutine periodically checks the address pool against its MaxPoolSize and MinPoolSize watermarks and creates or releases eniIP resources when the pool falls below or rises above the thresholds, keeping the pool within a controlled range. To keep resource state consistent, a startGarbageCollectionLoop goroutine periodically scans IP addresses for active or expired status and garbage-collects stale resources. Finally, Pod resource state is persisted in a local boltDB file, /var/lib/cni/terway/pod.db; even if a Pod has already been deleted from the apiserver, GetPod can read the copy from the local boltDB. When a Pod has been deleted but a copy still exists in the DB, the GC goroutine detects it and cleans it up. Screenshot:

At this point, you should have a deeper understanding of how the Kubernetes container network model is implemented; the best way to consolidate it is to try things out in practice.
