
Kubernetes source code probing: Pod IP leak troubleshooting and resolution

2025-01-16 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02

UK8S is a Kubernetes container cloud product launched by UCloud. It is fully compatible with the native Kubernetes API and provides users with one-stop Kubernetes services on the cloud. Our team developed its CNI (Container Network Interface) network plug-in, which integrates deeply with VPC so that UK8S container applications get the same network performance as CVM (currently up to 10 Gb/s and 1 million pps) and so that container networks are connected with the physical cloud / managed cloud. Along the way, we tracked down a problem in which the open source kubelet created redundant Sandbox containers and caused Pod IPs to disappear inexplicably. We fixed it so that the CNI plug-in runs correctly, and we are preparing to submit the patched kubelet source code to the community.

A network solution deeply integrated with VPC

Our goal is that developers can deploy, manage, and scale containerized applications on UK8S without having to handle operations work such as building and maintaining Kubernetes clusters. UK8S is fully compatible with the native Kubernetes API, is built on UCloud public cloud resources, and integrates public cloud network and storage products such as ULB, UDisk, and EIP through self-developed plug-ins.

VPC both guarantees network isolation and allows flexible IP address planning, which makes it one of the essential network requirements for users. After investigation, the UK8S R&D team concluded that UCloud's basic network platform has strong native control over the underlying network. This lets us drop the Overlay approach and lift VPC capabilities up to the container layer, so that control and forwarding are handled by VPC itself. Each time UK8S creates a Pod, it requests a VPC IP for it, configures that IP on the Pod through a veth pair, and then sets up policy routing. The principle is shown in the following figure.

[Figure: how UK8S assigns a VPC IP to a Pod through a veth pair and policy routing]

This scenario has the following advantages:

No Overlay, high network performance. Test data from a 50-Node cluster show that container-to-container network performance differs only slightly from CVM-to-CVM performance (in the packet scenario, pps shows a 33.5% loss), and Pod network metrics (throughput, packet rate, latency, and so on) do not degrade as the number of nodes grows. By contrast, Flannel in UDP or VXLAN mode and Calico in IPIP mode incur an obvious performance cost.

Pods can directly reach both the public cloud and the physical cloud. For users who run on both, this removes one obstacle to adopting K8S and makes business integration more convenient. In Flannel's host-gw mode, by contrast, containers cannot access public cloud and physical cloud hosts.

The workflow of CNI is as follows.

[Figure: the process of creating a Pod network]

[Figure: the process of deleting a Pod network]

Investigating and resolving the disappearance of Pod IPs

To test the stability of the CNI plug-in, our testers deployed a CronJob on UK8S that runs one Job per minute, or 1440 Jobs a day. The CronJob is defined as follows:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure

Each Job run creates a Pod. Each time a Pod is created, the CNI plug-in requests a VPC IP, and when the Pod is destroyed, the CNI plug-in must release that VPC IP. In theory, therefore, this CronJob should cause exactly 1440 VPC IP requests and 1440 VPC IP releases per day.

However, statistics collected over several days of testing showed that the CronJob caused the cluster to request IPs more than 2500 times a day, while the number of releases reached 1800. Both figures exceed 1440, and requests outnumber releases, which means that some of the VPC IPs allocated to Pods were never released: they had leaked.
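The arithmetic behind these figures can be checked with a short Go sketch. The function names are hypothetical, and 2500 and 1800 are the approximate daily counts quoted above rather than exact measurements.

```go
package main

import "fmt"

// expectedDailyRuns returns how many times a CronJob that fires every
// intervalMinutes minutes runs in one day.
func expectedDailyRuns(intervalMinutes int) int {
	return 24 * 60 / intervalMinutes
}

// leakedPerDay returns how many allocated IPs were never released.
func leakedPerDay(allocated, released int) int {
	return allocated - released
}

func main() {
	expected := expectedDailyRuns(1)
	// Approximate daily counts taken from the test statistics in the article.
	allocated, released := 2500, 1800

	fmt.Println("expected allocations per day:", expected)
	fmt.Println("unexpected extra allocations:", allocated-expected)
	fmt.Println("leaked IPs per day:", leakedPerDay(allocated, released))
}
```

With these numbers, roughly 1060 allocations per day have no matching Job run, and about 700 of them are never released.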

CNI: where is the IP to be deleted?

A careful analysis of the CNI plug-in's logs quickly showed many cases in which CNI could not find the Pod IP while tearing down the Sandbox network (CNI_COMMAND=DEL). The CNI plug-in developed by UK8S looks up the Pod IP through the Pod's network namespace path (format: /proc/10001/ns/net), but the CNI_NETNS environment variable that kubelet passed to CNI was an empty string. CNI therefore could not determine which VPC IP to release, and this is the direct cause of the IP leak, as shown in the following figure.
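The failure mode can be illustrated with a minimal Go sketch. It is not the UK8S plug-in code: the delRequest type, the helper names, and the ipByNetNS map (a stand-in for inspecting the namespace itself) are all assumptions made for illustration. Only the environment variable names CNI_COMMAND, CNI_CONTAINERID, and CNI_NETNS come from the CNI specification.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// delRequest holds the CNI environment variables that matter for this sketch.
type delRequest struct {
	Command     string // CNI_COMMAND, e.g. "DEL"
	ContainerID string // CNI_CONTAINERID, the Sandbox container ID
	NetNS       string // CNI_NETNS, e.g. "/proc/10001/ns/net"
}

// errEmptyNetNS marks the failure mode described in the article.
var errEmptyNetNS = errors.New("CNI_NETNS is empty, cannot look up the Pod IP")

// requestFromEnv builds a delRequest from a getenv-style function so the
// logic can be exercised without touching the real process environment.
func requestFromEnv(getenv func(string) string) delRequest {
	return delRequest{
		Command:     getenv("CNI_COMMAND"),
		ContainerID: getenv("CNI_CONTAINERID"),
		NetNS:       getenv("CNI_NETNS"),
	}
}

// ipToRelease returns the VPC IP found via the network namespace path.
// ipByNetNS is a hypothetical stand-in for reading the namespace itself.
func ipToRelease(req delRequest, ipByNetNS map[string]string) (string, error) {
	if req.NetNS == "" {
		// Without a namespace path there is nothing to inspect, so the
		// IP stays allocated: this is the leak.
		return "", errEmptyNetNS
	}
	ip, ok := ipByNetNS[req.NetNS]
	if !ok {
		return "", fmt.Errorf("no IP recorded for netns %q", req.NetNS)
	}
	return ip, nil
}

func main() {
	req := requestFromEnv(os.Getenv)
	_, err := ipToRelease(req, map[string]string{})
	fmt.Println("command:", req.Command, "error:", err)
}
```

Any lookup that depends only on the namespace path fails in the same way once kubelet passes an empty value.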

The question then turns to kubelet: why does kubelet pass an empty CNI_NETNS environment variable to the CNI plug-in?

We then traced kubelet's logs and found that for many Job Pods an extra Sandbox container was created during creation and destruction. The Sandbox container is the Infra container of a K8s Pod. It is the first container created in a Pod; it holds the Pod's network namespace and is where the Pod network is initialized, for example by calling CNI to assign the Pod IP and install policy routes. It runs a process called pause, which spends most of its time sleeping and consumes very few system resources. Oddly, after the busybox task container finished running, kubelet created a new Sandbox container for the Pod, which naturally triggered another CNI ADD call and another VPC IP request.

Returning to the UK8S CNI plug-in, we analyzed the logs of the reproduced cases again. This time we found that every case in which kubelet passed an empty string as NETNS occurred while kubelet was destroying the second Sandbox in a Pod. Conversely, whenever kubelet destroyed a second Sandbox, the NETNS parameter passed to CNI was an empty string.

At this point it seems clear that every leaked VPC IP belongs to a second Sandbox container. We therefore need to answer two questions:

Why is there a second Sandbox container?

Why did kubelet pass an incorrect NETNS parameter to CNI when destroying the second Sandbox container?

The second Sandbox: why was I born?

Before tracing the origin of the second Sandbox, we need to explain the basic principles and workflow of kubelet.

Kubelet is the agent process on each Node of a Kubernetes cluster. When kube-scheduler successfully schedules a Pod onto a Node, kubelet is responsible for creating the Pod and starting the containers it defines. Kubelet also works in the controller pattern: its core is a control loop, called syncLoop in the source code, which watches for and handles the following events:

Pod update events, which come from the API Server

Pod lifecycle (PLEG) changes, which come from changes in the state of the Pod's own containers, such as a container being created, started, or terminated

Periodic synchronization (Sync) tasks set by kubelet itself

Pod liveness probe (LivenessProbe) failure events

Scheduled cleanup events (HouseKeeping)
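The five event sources listed above can be pictured as a single select statement over channels. The following Go sketch is a simplified model rather than the kubelet source: the eventSources type, its field names, and the returned strings are assumptions chosen for illustration.

```go
package main

import "fmt"

// eventSources groups the five inputs of the control loop described above.
type eventSources struct {
	configCh       chan string   // Pod updates from the API Server
	plegCh         chan string   // Pod lifecycle events from PLEG
	syncCh         chan struct{} // periodic sync tick
	livenessCh     chan string   // liveness probe failures
	housekeepingCh chan struct{} // scheduled cleanup tick
}

// newEventSources creates buffered channels so the demo can send without blocking.
func newEventSources() eventSources {
	return eventSources{
		configCh:       make(chan string, 1),
		plegCh:         make(chan string, 1),
		syncCh:         make(chan struct{}, 1),
		livenessCh:     make(chan string, 1),
		housekeepingCh: make(chan struct{}, 1),
	}
}

// syncLoopIteration blocks until one source fires and reports which
// handler would run for it.
func syncLoopIteration(s eventSources) string {
	select {
	case u := <-s.configCh:
		return "pod update: " + u
	case e := <-s.plegCh:
		return "PLEG event: " + e
	case <-s.syncCh:
		return "periodic sync"
	case p := <-s.livenessCh:
		return "liveness failure: " + p
	case <-s.housekeepingCh:
		return "housekeeping"
	}
}

func main() {
	s := newEventSources()

	s.plegCh <- "ContainerDied"
	fmt.Println(syncLoopIteration(s))

	s.housekeepingCh <- struct{}{}
	fmt.Println(syncLoopIteration(s))
}
```

Each iteration handles exactly one event, which is why the order in which PLEG and housekeeping events arrive matters later in this investigation.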

In the CronJob described above, a Pod is created each time a Job runs. Ideally, the Pod goes through the following important events during its life cycle:

1. The Pod is successfully scheduled onto a worker node. The kubelet on that node learns of the Pod creation event by watching the API Server and begins creating the Pod.

2. Kubelet creates a Sandbox container for the Pod, which sets up the Pod's network namespace and calls the CNI plug-in to initialize the Pod network. Once the Sandbox container has started, the first kubelet PLEG (Pod Lifecycle Event Generator) event is triggered.

3. The main container is created and started, triggering the second PLEG event.

4. The date command in the main container finishes and the container terminates, triggering the third PLEG event.

5. Kubelet kills the remaining Sandbox container in the Pod.

6. The Sandbox container is killed, triggering the fourth PLEG event.

Among these, 3 and 4 may be merged into a single PLEG event because the interval between them is short (kubelet updates PLEG events once every 1 s).

However, in every VPC IP leak we observed, a second Sandbox container was created "unexpectedly" for the Pod after step 6, as shown in the lower-right corner of the following figure. According to our understanding of Kubernetes, this should not happen.

Unraveling the kubelet source code (1.13.1)

As mentioned earlier, syncLoop listens for PLEG events and handles them. PLEG events themselves come from relist, a periodic task inside kubelet. Kubelet runs relist every second so that it learns of container creation, start, termination, and deletion events promptly.

The main responsibility of relist is to obtain the real-time status of all containers in each Pod through CRI. Containers fall into two categories here: Sandbox containers and non-Sandbox containers, which kubelet distinguishes by attaching different labels to them. CRI is a unified gRPC interface for container operations. Kubelet performs all container operations through CRI requests, and container runtimes such as Docker and rkt each provide their own CRI implementation. Docker's implementation is dockershim, which parses the CRI requests it receives, translates them into Docker API calls, and sends them to the Docker daemon.

Relist fetches the latest status of the Sandbox and non-Sandbox containers in a Pod through CRI requests and writes it to kubelet's cache, podCache. If a container's state has changed, relist notifies syncLoop through the PLEG channel. For each Pod, podCache keeps two arrays, one holding the latest state of the Sandbox containers and the other the latest state of the non-Sandbox containers.
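The core idea of relist, comparing the previous snapshot of container states with the current one and emitting an event for each difference, can be sketched in Go as follows. This is a simplified model under stated assumptions: container states are plain strings keyed by container ID, and the event type is hypothetical, although the event names mirror those used by PLEG.

```go
package main

import (
	"fmt"
	"sort"
)

// event is a simplified lifecycle event for one container.
type event struct {
	ContainerID string
	Type        string
}

// relist compares the previous and current container states (keyed by
// container ID) and returns one event per change, sorted by container ID.
func relist(old, cur map[string]string) []event {
	var events []event
	for id, state := range cur {
		if old[id] == state {
			continue // nothing changed for this container
		}
		switch state {
		case "running":
			events = append(events, event{id, "ContainerStarted"})
		case "exited":
			events = append(events, event{id, "ContainerDied"})
		}
	}
	for id := range old {
		if _, ok := cur[id]; !ok {
			events = append(events, event{id, "ContainerRemoved"})
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Slice(events, func(i, j int) bool {
		return events[i].ContainerID < events[j].ContainerID
	})
	return events
}

func main() {
	before := map[string]string{"sandbox": "running", "busybox": "running"}
	after := map[string]string{"sandbox": "running", "busybox": "exited"}
	for _, e := range relist(before, after) {
		fmt.Println(e.ContainerID, e.Type)
	}
}
```

Because the comparison happens once per relist period, two state changes that occur within the same period show up together in the next snapshot.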

After receiving an event from the PLEG channel, syncLoop enters the corresponding sync flow. For PLEG events the handler is HandlePodSyncs. This function starts a new pod worker goroutine, fetches the Pod's latest information from podCache, and then enters the actual synchronization operation: the syncPod function.

syncPod converts the Pod's latest status information (podStatus) in podCache into a Kubernetes API PodStatus structure. Notably, syncPod computes the Pod's phase (the getPhase function), such as Running, Failed, or Completed, from the state of each container in podCache. It then enters the container runtime's Pod synchronization operation: the SyncPod function, which reconciles the current container state with the expected state defined by the Pod API Spec. The following source code flow chart summarizes this process.

SyncPod: what did I do wrong?

SyncPod first computes the current state of all containers in the Pod so that it can compare it with the expected state in the Pod API and reconcile the two. This comparison has two parts:

Check whether the Sandbox container state in podCache satisfies this condition: the Pod has exactly one Sandbox container, and that container is running and has an IP. If the condition is not met, the Pod is considered to need a new Sandbox container. When the Sandbox container has to be rebuilt, all containers in the Pod are destroyed and rebuilt as well.

Check the running state of the non-Sandbox containers in podCache to make sure they match the desired state in the Pod API Spec. For example, if a container's main process has exited with a non-zero return code, the RestartPolicy in the Pod API Spec determines whether the container is rebuilt.

Recall the key clue from earlier: every VPC IP leak originates from an unexpected Sandbox container, and the leaked IP is that container's IP. As just described, the SyncPod function decides whether a Pod needs a new Sandbox container. Is the unexpected second Sandbox container related to this decision? Kubelet's existing logs could not confirm the conjecture, so we modified the source code to add more log output. After recompiling kubelet, we found that the second Sandbox container does result from the decision made in SyncPod. We further confirmed that this SyncPod call is triggered by the PLEG event generated when kubelet kills the first Sandbox container.

So why does SyncPod conclude that the Pod needs a new Sandbox container after the first one has been destroyed? To find out, we looked closely at the decision function podSandboxChanged.

podSandboxChanged reads the Sandbox container status from podCache and finds that the first Sandbox has been terminated and is in the NOT READY state. It therefore concludes that the Pod has no usable Sandbox container and that one must be rebuilt. The source code is shown below.
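The decision described above can be modeled in a few lines of Go. This is a self-contained paraphrase of the logic as the article describes it, not the kubelet source: the sandboxStatus type, its fields, and the return values are simplified assumptions, and the statuses slice is assumed to be ordered newest first.

```go
package main

import "fmt"

type sandboxState int

const (
	sandboxReady sandboxState = iota
	sandboxNotReady
)

// sandboxStatus is a simplified view of one Sandbox container in podCache.
type sandboxStatus struct {
	ID      string
	State   sandboxState
	Attempt uint32
	IP      string
}

// podSandboxChanged reports whether a new Sandbox is needed, the attempt
// number to use for it, and the ID of the latest existing Sandbox.
// statuses is assumed to be ordered newest first.
func podSandboxChanged(statuses []sandboxStatus) (changed bool, attempt uint32, id string) {
	if len(statuses) == 0 {
		return true, 0, "" // no Sandbox at all: create the first one
	}
	ready := 0
	for _, s := range statuses {
		if s.State == sandboxReady {
			ready++
		}
	}
	latest := statuses[0]
	if ready > 1 {
		return true, latest.Attempt + 1, latest.ID // more than one running Sandbox
	}
	if latest.State != sandboxReady {
		return true, latest.Attempt + 1, latest.ID // latest Sandbox is NOT READY
	}
	if latest.IP == "" {
		return true, latest.Attempt + 1, latest.ID // running but without an IP
	}
	return false, latest.Attempt, latest.ID
}

func main() {
	// The first Sandbox has just been killed after the Job finished.
	killed := []sandboxStatus{{ID: "sandbox-1", State: sandboxNotReady}}
	changed, attempt, id := podSandboxChanged(killed)
	fmt.Println(changed, attempt, id)
}
```

Note that nothing in this check looks at whether the Pod's work is already done, which is exactly the gap the investigation uncovers.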

Recall the CronJob YAML shown earlier in this article, in which the restartPolicy of the Job template is set to OnFailure. After SyncPod finishes checking the Sandbox container state, it concludes that the Pod needs a new Sandbox container. It then checks the Pod's restartPolicy, finds OnFailure, and goes ahead with rebuilding the Sandbox container. The corresponding source code is as follows.

As you can see, in the SyncPod run triggered by the death of the first Sandbox container, kubelet simply noticed that the only Sandbox container was NOT READY and decided that the Pod needed a new Sandbox, ignoring the fact that the Job's main container had already finished successfully.

In fact, when the earlier syncPod function computed the API PodStatus phase from podCache, kubelet already knew that the Pod was in the Completed state. That result is stored in the apiPodStatus variable and passed to SyncPod as an argument, as shown in the following figure.

Once the Job has entered the Completed state, the Sandbox container should not be rebuilt. Yet when SyncPod decides whether the Sandbox needs to be rebuilt, it does not consult the apiPodStatus argument passed in by its caller syncPod; the argument is ignored altogether.

With the origin of the second Sandbox container understood, the fix is simple: kubelet should not create a Sandbox for a Pod that is already Completed, as shown below.
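The effect of the fix can be sketched by comparing the decision before and after adding a phase check. This Go sketch is a model, not the actual patch: the podState type and both function names are hypothetical, and the treatment of restartPolicy "Never" is an assumption added only to make the model complete. The phase that the article calls Completed is named Succeeded in the Kubernetes API and displayed as Completed by kubectl.

```go
package main

import "fmt"

type podPhase string

const (
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded" // displayed as "Completed" by kubectl
)

// podState collects the inputs of the decision in this simplified model.
type podState struct {
	Phase          podPhase
	RestartPolicy  string // "Always", "OnFailure", or "Never"
	SandboxChanged bool   // result of the Sandbox state check
	Attempt        uint32 // attempt number the new Sandbox would get
}

// createSandboxBefore models the behaviour described in the article:
// the Pod phase is never consulted.
func createSandboxBefore(p podState) bool {
	if !p.SandboxChanged {
		return false
	}
	// Assumption for illustration: a Pod that must never restart does
	// not get a second Sandbox.
	if p.RestartPolicy == "Never" && p.Attempt != 0 {
		return false
	}
	return true
}

// createSandboxAfter adds the guard from the fix: a Pod that has already
// completed never gets a new Sandbox.
func createSandboxAfter(p podState) bool {
	if p.Phase == phaseSucceeded {
		return false
	}
	return createSandboxBefore(p)
}

func main() {
	// The Job's main container finished and the first Sandbox was killed.
	done := podState{Phase: phaseSucceeded, RestartPolicy: "OnFailure", SandboxChanged: true, Attempt: 1}
	fmt.Println("before fix:", createSandboxBefore(done))
	fmt.Println("after fix: ", createSandboxAfter(done))
}
```

For a Pod that is still running, both functions agree, so the guard changes behaviour only for Pods whose work is already finished.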

After the kubelet was recompiled and updated, the VPC IP leak was resolved.

The following figure summarizes how the second Sandbox container described above comes into being.

We are not quite at the full truth yet. One more question remains:

Why did kubelet pass an incorrect NETNS environment variable when it deleted the second Sandbox container and called CNI to tear down the container network?

The lost NETNS

Recall the scheduled cleanup event (HouseKeeping) mentioned earlier in the introduction to syncLoop, kubelet's core loop. HouseKeeping is a scheduled task that runs every 2 s. It scans for and cleans up orphaned Pods, deletes their leftover Volume directories, and stops the pod worker goroutines that belong to them. When HouseKeeping finds that the Job Pod has entered the Completed state, it checks whether the Pod still has leftover running containers and, if so, cleans them up. Because the second Sandbox container is still running, HouseKeeping cleans it up. One of the steps is to clean up the Pod's cgroup and kill every process in it, so the pause process in the second Sandbox container is killed and the container exits.

The dead second Sandbox container is then picked up by kubelet's garbage collection loop, which stops it completely and destroys it. However, because the earlier HouseKeeping operation already destroyed the container's cgroup, the network namespace no longer exists. When kubelet calls the CNI plug-in to tear down the Sandbox network, it cannot obtain a valid NETNS value and can only pass an empty string.

At this point, the cause of the problem has been identified.

Solving the problem

With everything clear, we set about fixing the problem. To make sure that CNI can always find the VPC IP of a deleted Pod, CNI needs to store the associated information (Pod name, Sandbox container ID, Namespace, and VPC IP) after each successful ADD operation. Then, during a DEL operation, it only needs to look up the VPC IP by the Pod name, Sandbox container ID, and Namespace passed in by kubelet and release it through the UCloud public cloud API, with no dependence on NETNS.
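A minimal Go sketch of this bookkeeping follows. It keeps the records in memory for brevity, whereas a real plug-in would need storage that survives between invocations; the podKey and ipStore types and their method names are hypothetical, and the call to the cloud API that actually releases the IP is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// podKey identifies one Sandbox of one Pod using values kubelet passes
// to CNI on both ADD and DEL.
type podKey struct {
	Namespace string
	PodName   string
	SandboxID string
}

// ipStore remembers which VPC IP was assigned on ADD so that DEL does
// not need the network namespace to find it.
type ipStore struct {
	mu  sync.Mutex
	ips map[podKey]string
}

func newIPStore() *ipStore {
	return &ipStore{ips: make(map[podKey]string)}
}

// RecordAdd stores the IP after a successful ADD.
func (s *ipStore) RecordAdd(k podKey, ip string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ips[k] = ip
}

// TakeForDel returns the IP recorded for the key and removes the record.
// The boolean is false when nothing was recorded.
func (s *ipStore) TakeForDel(k podKey) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	ip, ok := s.ips[k]
	if ok {
		delete(s.ips, k)
	}
	return ip, ok
}

func main() {
	store := newIPStore()
	key := podKey{Namespace: "default", PodName: "hello-123", SandboxID: "sandbox-2"}

	store.RecordAdd(key, "10.0.0.8")

	// DEL arrives with an empty CNI_NETNS, but the lookup still works.
	ip, ok := store.TakeForDel(key)
	fmt.Println(ip, ok)
}
```

Including the Sandbox container ID in the key means that two Sandboxes of the same Pod are tracked separately, so the IP of a redundant Sandbox can still be found and released.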

Because the root cause lies in the SyncPod function in the kubelet source code, the UK8S team has also fixed the kubelet source and is preparing to submit a patch to the Kubernetes community.

In closing

Kubernetes is still an open source project under rapid iteration, and some anomalies in production environments are unavoidable. While learning how each Kubernetes component works, the UK8S R&D team digs into the source code whenever anomalies appear in the live network, works step by step toward the root cause, and thereby further improves the stability and reliability of UK8S services and the product experience.

In 2019, UK8S will also support a series of features such as node auto scaling (Cluster AutoScaler), physical machine resources, GPU resources, hybrid cloud, and ServiceMesh.

