This article looks at how a logic flaw in the Kubernetes attach/detach controller can prevent pods from starting. It analyzes the problem from a source-code point of view; hopefully you will take something away from it.
Preface
By studying the k8s attach/detach controller source code in depth, we explain the cause of an attach/detach controller bug observed in a production case and give a solution.
Production case symptoms
We first look at the problem and the symptoms of the production case; then we dig into the data structures maintained by the ad controller; finally, based on those data structures and the ad controller's code logic, we analyze the cause of the production case in detail and give a solution, thereby gaining a deeper understanding of the whole ad controller.
Problem description
A StatefulSet (sts) references multiple cbs-backed PVCs. When the sts is updated, the old pod is deleted and a new pod is created. If the cbs detach fails while the old pod is being deleted, and the new pod is scheduled to the same node as the old pod, the new pod may stay in ContainerCreating forever.
Symptoms
Kubectl describe pod
Kubelet log
volumesAttached and volumesInUse from kubectl get node xxx -o yaml
volumesAttached:
- devicePath: /dev/disk/by-id/virtio-disk-6w87j3wv
  name: kubernetes.io/qcloud-cbs/disk-6w87j3wv
volumesInUse:
- kubernetes.io/qcloud-cbs/disk-6w87j3wv
- kubernetes.io/qcloud-cbs/disk-7bfqsft5
K8s storage brief
The attach/detach controller in K8s is responsible for attaching and detaching the volumes provided by storage plugins. This article analyzes the controller's source-code logic through a production case in which a K8s ad controller bug caused pod creation to fail.
The main storage-related components in K8s are: attach/detach controller, pv controller, volume manager, volume plugins, and the scheduler. Each has a clear division of labor:
Attach/detach controller: responsible for attaching/detaching volumes
Pv controller: responsible for handling pv/pvc objects, including pv provision/delete (the in-tree cbs provisioner is designed as an external provisioner; an independent cbs-provisioner handles cbs pv provision/delete)
Volume manager: mainly responsible for mounting/unmounting volumes
Volume plugins: include the K8s-native storage plugins and plugins from various vendors
Native ones include: emptydir, hostpath, flexvolume, csi, etc.
Vendor-provided ones include: aws-ebs, azure, our cbs, etc.
Scheduler: involved in volume scheduling, for example the predicate that limits the maximum number of attached disks per node for ebs, csi, etc.
The controller pattern is a very important concept in K8s. In general, a controller manages one or more API objects and drives them from their actual/current state toward the desired state.
So the job of the attach/detach controller is to attach the volumes that are expected to be attached and detach the volumes that are expected to be detached.
In the rest of this article the attach/detach controller is referred to as the ad controller.
Ad controller data structure
For the ad controller, understanding its internal data structures first makes the logic much easier to follow. The ad controller maintains two data structures in memory:
ActualStateOfWorld - represents the actual state (hereinafter asw)
DesiredStateOfWorld - represents the desired state (hereinafter dsw)
Obviously, for a declarative API it is necessary to constantly compare the actual state with the desired state, so the ad controller uses these two data structures to represent them respectively.
ActualStateOfWorld
ActualStateOfWorld contains two maps (sketched below):
AttachedVolumes: the volumes that the ad controller believes have been successfully attached to nodes
NodesToUpdateStatusFor: the nodes whose node.Status.VolumesAttached needs to be updated
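As a rough mental model, the asw can be pictured as two maps. The sketch below is hypothetical and heavily simplified (the real types live in kubernetes/pkg/controller/volume/attachdetach/cache and carry locks, unique volume names and richer per-node data); concrete names such as node-1 are illustrative only.

```go
package main

import "fmt"

// Hypothetical, simplified sketch of the ad controller's ActualStateOfWorld (asw).

// attachedVolume records one volume the controller believes is attached to a node.
type attachedVolume struct {
	volumeName        string
	nodeName          string
	devicePath        string
	attachedConfirmed bool // set once the attach has been confirmed (see MarkVolumeAsAttached)
}

// actualStateOfWorld holds what the ad controller currently believes to be true.
type actualStateOfWorld struct {
	// attachedVolumes: volumes the controller believes are attached, keyed by "volume/node".
	attachedVolumes map[string]attachedVolume
	// nodesToUpdateStatusFor: per node, the volumes to be written back into
	// node.Status.VolumesAttached on the next node status update.
	nodesToUpdateStatusFor map[string][]attachedVolume
}

func main() {
	asw := actualStateOfWorld{
		attachedVolumes:        map[string]attachedVolume{},
		nodesToUpdateStatusFor: map[string][]attachedVolume{},
	}
	v := attachedVolume{
		volumeName:        "kubernetes.io/qcloud-cbs/disk-6w87j3wv",
		nodeName:          "node-1",
		devicePath:        "/dev/disk/by-id/virtio-disk-6w87j3wv",
		attachedConfirmed: true,
	}
	asw.attachedVolumes[v.volumeName+"/"+v.nodeName] = v
	asw.nodesToUpdateStatusFor[v.nodeName] = append(asw.nodesToUpdateStatusFor[v.nodeName], v)
	fmt.Printf("asw: %+v\n", asw)
}
```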
AttachedVolumes
How is the data populated?
1. When the ad controller starts, it populates asw: it lists all node objects in the cluster and uses their node.Status.VolumesAttached to fill attachedVolumes.
2. After that, whenever a volume that needs to be attached is attached successfully, MarkVolumeAsAttached (in GenerateAttachVolumeFunc) is called to add it to attachedVolumes.
How is data removed?
1. A volume is removed from attachedVolumes only after it has been successfully detached (MarkVolumeDetached is called in GenerateDetachVolumeFunc).
NodesToUpdateStatusFor
How is the data populated?
1. If detaching a volume fails, the volume is added back to nodesToUpdateStatusFor
- AddVolumeToReportAsAttached is called in GenerateDetachVolumeFunc
How is data removed?
1. Before a volume is detached, RemoveVolumeFromReportAsAttached is called to delete the volume's information from nodesToUpdateStatusFor.
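To make that lifecycle concrete, here is a hedged sketch of how nodesToUpdateStatusFor might be mutated. The method names mirror the ones mentioned above, but the bodies and types are simplified stand-ins, not the actual Kubernetes implementation.

```go
package main

import "fmt"

// Hypothetical, simplified view of how nodesToUpdateStatusFor is maintained.
// The real methods live on the ActualStateOfWorld in the attachdetach cache package.
type asw struct {
	// nodeName -> set of volume names to report in node.Status.VolumesAttached
	nodesToUpdateStatusFor map[string]map[string]bool
}

// removeVolumeFromReportAsAttached is called just before a detach is attempted:
// the volume stops being reported in node.Status.VolumesAttached.
func (a *asw) removeVolumeFromReportAsAttached(node, volume string) {
	if vols, ok := a.nodesToUpdateStatusFor[node]; ok {
		delete(vols, volume)
	}
}

// addVolumeToReportAsAttached is called when the detach operation fails, so the
// volume is reported as attached again on the next node status patch.
func (a *asw) addVolumeToReportAsAttached(node, volume string) {
	if a.nodesToUpdateStatusFor[node] == nil {
		a.nodesToUpdateStatusFor[node] = map[string]bool{}
	}
	a.nodesToUpdateStatusFor[node][volume] = true
}

func main() {
	a := &asw{nodesToUpdateStatusFor: map[string]map[string]bool{}}
	a.addVolumeToReportAsAttached("node-1", "kubernetes.io/qcloud-cbs/disk-7bfqsft5")
	a.removeVolumeFromReportAsAttached("node-1", "kubernetes.io/qcloud-cbs/disk-7bfqsft5") // before DetachVolume
	fmt.Println(a.nodesToUpdateStatusFor) // map[node-1:map[]]
}
```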
DesiredStateOfWorld
DesiredStateOfWorld maintains a single map:
NodesManaged: the nodes managed by the ad controller and the volumes that are expected to be attached to those nodes (sketched below).
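A hedged, simplified sketch of what nodesManaged can be pictured as (again, not the real Kubernetes types; the node and pod names are illustrative):

```go
package main

import "fmt"

// Hypothetical, simplified sketch of the ad controller's DesiredStateOfWorld (dsw):
// a single map of managed nodes, each carrying the volumes (and the pods using
// them) that are expected to be attached to that node.
type desiredStateOfWorld struct {
	nodesManaged map[string]nodeManaged
}

type nodeManaged struct {
	nodeName string
	// volumesToAttach: volumeName -> pods on this node that reference the volume.
	volumesToAttach map[string][]string
}

func main() {
	dsw := desiredStateOfWorld{nodesManaged: map[string]nodeManaged{}}
	dsw.nodesManaged["node-1"] = nodeManaged{
		nodeName: "node-1",
		volumesToAttach: map[string][]string{
			"kubernetes.io/qcloud-cbs/disk-7bfqsft5": {"default/web-0"},
		},
	}
	fmt.Printf("dsw: %+v\n", dsw)
}
```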
NodesManaged
How is the data populated?
1. When the ad controller starts, it lists all node objects in the cluster and adds the nodes managed by the ad controller to nodesManaged.
2. When the ad controller's nodeInformer sees an update to a node, the node is also added to nodesManaged.
3. When dsw is populated and when the podInformer sees a pod change (add, update), the corresponding volume and pod information is filled into nodesManaged.
4. DesiredStateOfWorldPopulator periodically finds pods that need to be added and populates the corresponding volumes and pods into nodesManaged.
How is data removed?
1. When a node is deleted, the ad controller's nodeInformer sees the change and removes the corresponding node from dsw's nodesManaged.
2. When the ad controller's podInformer sees that a pod has been deleted, the corresponding volume and pod are removed from nodesManaged.
3. DesiredStateOfWorldPopulator also periodically finds pods that have been deleted and removes the corresponding volumes and pods from nodesManaged (a sketch of this periodic pass follows).
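For illustration, a single pass of the populator's "find and remove deleted pods" step could be sketched as follows; this is an assumption-level simplification of what DesiredStateOfWorldPopulator does, not its actual code.

```go
package main

import "fmt"

// Hypothetical single pass of the populator's cleanup step: any pod recorded in
// dsw that no longer exists is removed, together with the volumes it referenced.
func findAndRemoveDeletedPods(dswPods map[string][]string, podExists func(string) bool) {
	for pod, volumes := range dswPods {
		if !podExists(pod) {
			fmt.Printf("pod %s is gone, removing volumes %v from dsw\n", pod, volumes)
			delete(dswPods, pod)
		}
	}
}

func main() {
	dsw := map[string][]string{
		"default/web-0": {"kubernetes.io/qcloud-cbs/disk-7bfqsft5"},
	}
	// Pretend the pod no longer exists, so its entry is dropped from dsw.
	findAndRemoveDeletedPods(dsw, func(string) bool { return false })
	fmt.Println(dsw) // map[]
}
```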
Overview of the ad controller process
The logic of the ad controller is relatively simple:
1. First, all nodes and pods in the cluster are listed to populate actualStateOfWorld (attachedVolumes) and desiredStateOfWorld (nodesManaged).
2. Then a separate goroutine runs the reconciler, which periodically reconciles asw (actual state) and dsw (desired state) by triggering attach and detach operations: volumes that should be detached are detached, and volumes that should be attached are attached.
3. After that, another goroutine runs DesiredStateOfWorldPopulator, which periodically verifies whether the pods in dsw still exist and removes them from dsw if they do not. (A skeleton of the reconciler loop is sketched below.)
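Putting the three steps together, the reconciler loop can be sketched roughly as below. This skeleton is only illustrative: the real reconciler in the attachdetach package does much more (multi-attach checks, the operation executor, error handling), and the 100ms period is the only detail taken directly from the text.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative skeleton of the ad controller reconciler: every loopPeriod it
// first detaches volumes that are attached (asw) but no longer desired (dsw),
// then attaches volumes that are desired but not yet confirmed attached, and
// finally writes the result back to node.Status.VolumesAttached.
const loopPeriod = 100 * time.Millisecond

func reconcile() {
	fmt.Println("1. detach volumes present in asw but absent from dsw")
	fmt.Println("2. attach volumes present in dsw but not confirmed in asw")
	fmt.Println("3. UpdateNodeStatuses: patch node.Status.VolumesAttached")
}

func main() {
	stop := time.After(350 * time.Millisecond) // stand-in for the controller's stop channel
	ticker := time.NewTicker(loopPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			reconcile()
		}
	}
}
```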
Production case
Next, with the production case described above in mind, let's take a detailed look at the reconciler logic.
Preliminary analysis of the case
From the pod events you can see that the ad controller considered the cbs attach successful, but kubelet then failed to mount it.
However, the kubelet log shows Volume not attached according to node status, i.e. kubelet believes the cbs volume has not been attached according to the node status. The node info confirms this: the CBS disk (disk-7bfqsft5) is missing from volumesAttached.
The node info shows another symptom as well: the same cbs disk still appears in volumesInUse, indicating that it has not been unmounted successfully.
Clearly, for a cbs volume to be used by a pod, the ad controller and the volume manager have to cooperate. So to locate this issue, we first need to answer:
Why does the volume manager think the volume is not attached according to the node status, while the ad controller thinks the attach succeeded?
What roles do volumesAttached and volumesInUse play in the interaction between the ad controller and kubelet?
Here we only briefly analyze the volume manager side.
Searching the code for Volume not attached according to node status leads to GenerateVerifyControllerAttachedVolumeFunc. Looking at the code logic carefully, you will find that:
The volumes to be mounted (volumesToMount and their podsToMount) are first obtained from the volume manager's dsw cache.
Each volumeToMount is then traversed to verify whether it has already been attached.
In the verification logic, GenerateVerifyControllerAttachedVolumeFunc traverses the node's node.Status.VolumesAttached; if the volume is not found there, an error is reported (Volume not attached according to node status).
The volumeToMount entries are added to the podManager's in-memory state by the podInformer, and then periodically synchronized into dsw by the desiredStateOfWorldPopulator.
The volume manager's reconciler first makes sure that volumes which should be unmounted are unmounted, and then makes sure that volumes which should be mounted are mounted.
So the volume manager judges whether a volume has been attached successfully solely by whether it exists in node.Status.VolumesAttached. A sketch of this check follows.
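The check can be pictured with the following hedged sketch; it is a simplified stand-in for GenerateVerifyControllerAttachedVolumeFunc, using the disk names from this case, and the attachedVolumeStatus type only mirrors the shape of node.Status.VolumesAttached.

```go
package main

import (
	"errors"
	"fmt"
)

// attachedVolumeStatus mirrors the shape of one entry in node.Status.VolumesAttached.
type attachedVolumeStatus struct {
	Name       string
	DevicePath string
}

// verifyControllerAttachedVolume is a simplified stand-in for the kubelet volume
// manager check: the volume counts as attached only if it appears in
// node.Status.VolumesAttached, which is written by the ad controller.
func verifyControllerAttachedVolume(volumeName string, volumesAttached []attachedVolumeStatus) error {
	for _, v := range volumesAttached {
		if v.Name == volumeName {
			return nil // found: the ad controller reported this volume as attached
		}
	}
	return errors.New("Volume not attached according to node status")
}

func main() {
	// Node status as seen in this case: only one of the two CBS disks is reported.
	nodeStatus := []attachedVolumeStatus{
		{Name: "kubernetes.io/qcloud-cbs/disk-6w87j3wv", DevicePath: "/dev/disk/by-id/virtio-disk-6w87j3wv"},
	}
	err := verifyControllerAttachedVolume("kubernetes.io/qcloud-cbs/disk-7bfqsft5", nodeStatus)
	fmt.Println(err) // the disk referenced by the new pod is missing, so the check fails
}
```

With the disk from this case (disk-7bfqsft5) absent from the node status, the check fails, which is exactly the error seen in the kubelet log.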
So who fills in node.Status.VolumesAttached? The ad controller's data structure nodesToUpdateStatusFor holds exactly the data to be written into node.Status.VolumesAttached.
Therefore, if the ad controller never updates node.Status.VolumesAttached, then when the desiredStateOfWorldPopulator synchronizes the volume referenced by a newly created pod from the podManager's memory into volumesToMount, verifying whether that volume is attached will report an error.
Later on, because WaitForAttachAndMount is called in kubelet's syncLoop to wait for the volume attach and mount to succeed, and they never succeed, the wait times out and the timeout expired error appears.
So next we mainly need to figure out why the ad controller did not update node.Status.VolumesAttached.
Detailed explanation of the ad controller's reconciler
Next we analyze the ad controller logic in detail to see why node.Status.VolumesAttached was not updated even though, judging from the events, the ad controller considered the volume successfully attached.
From the process overview above, it can be seen that the main logic of the ad controller lives in the reconciler.
The reconciler runs reconciliationLoopFunc periodically, with a period of 100ms.
The main logic of reconciliationLoopFunc is in reconcile():
reconcile() does three things in order: first it makes sure that volumes which should be detached are detached, then it makes sure that volumes which should be attached are attached, and finally UpdateNodeStatuses updates the node status.
Step 1, the detach logic: reconcile() traverses the attachedVolumes in asw and, for each volume, determines whether it still exists in dsw.
The node is looked up in dsw.nodesManaged by nodeName; if the node exists, the volume is then looked up by volumeName.
If a volume exists in asw but not in dsw, it needs to be detached.
After that, node.Status.VolumesInUse is checked to determine whether the volume has finished unmounting; the detach logic only continues once the unmount has completed or a 6min timeout has expired.
Before DetachVolume is executed, RemoveVolumeFromReportAsAttached is called to delete the volume to be detached from asw's nodesToUpdateStatusFor.
Then the node is patched, which effectively removes the volume from node.status.VolumesAttached.
After that the detach is performed; detach failures fall mainly into two kinds:
If the operation executor determines that the backoff period has not yet elapsed, it returns backoffError and skips DetachVolume entirely. The backoff period starts at 500ms and grows exponentially up to 2min2s; for a volume whose detach has already failed, entering the detach logic within each backoff period directly returns backoffError.
If the DetachVolume implemented by the volumePlugin actually executes and fails, the volume is added back to nodesToUpdateStatusFor (and the node is patched again after the attach logic finishes).
Step 2, the attach logic: reconcile() traverses dsw's nodesManaged to determine whether each volume has already been attached to its node; if it has, the attach operation is skipped.
To decide this, the volume is looked up in asw.attachedVolumes; if it is not there, it is considered not attached to the node.
If it is there, the node is also checked for a match and attachedConfirmed is returned. attachedConfirmed is set by AddVolumeNode in asw; MarkVolumeAsAttached sets it to true (true means the volume has been attached to the node).
After that, it is decided whether multi-attach is disallowed, and then the operation executor performs the attach.
Step 3: finally, UpdateNodeStatuses updates the node status. (The detach ordering that matters for this case is sketched below.)
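The ordering that matters for this case can be summarized in the hedged sketch below: the volume is removed from the reported status before DetachVolume is attempted, and a backoffError returns early without the add-back that a real DetachVolume failure would trigger. Function and variable names are illustrative, not the actual reconciler code.

```go
package main

import (
	"errors"
	"fmt"
)

var errBackoff = errors.New("backoffError: still within the exponential backoff period")

// detachVolume is a simplified stand-in for the reconciler's detach path.
// Note the ordering: the volume is removed from the reported status *before*
// the detach is attempted, and a backoff error returns early without the
// add-back that a real DetachVolume failure would trigger.
func detachVolume(reported map[string]bool, volume string, inBackoff bool, detach func() error) error {
	delete(reported, volume) // RemoveVolumeFromReportAsAttached + patch node

	if inBackoff {
		return errBackoff // skips DetachVolume AND skips adding the volume back
	}
	if err := detach(); err != nil {
		reported[volume] = true // AddVolumeToReportAsAttached on a real detach failure
		return err
	}
	return nil
}

func main() {
	reported := map[string]bool{"disk-7bfqsft5": true}
	// First attempt: the cloud API fails, so the volume is added back.
	_ = detachVolume(reported, "disk-7bfqsft5", false, func() error { return errors.New("cbs detach failed") })
	fmt.Println("after failed detach:", reported) // map[disk-7bfqsft5:true]
	// Retry inside the backoff window: removed from status, never added back.
	_ = detachVolume(reported, "disk-7bfqsft5", true, func() error { return nil })
	fmt.Println("after backoff retry:", reported) // map[]
}
```

Running this prints the volume as still reported after the genuinely failed detach, but gone for good after the retry inside the backoff window, which is exactly the footprint seen on the node in this case.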
Detailed case analysis
Premise
Volume detach failed
Sts + cbs (pvc); the pod is scheduled to the same node before and after it is recreated
K8s components involved
Ad controller
Kubelet (volume manager)
The ad controller and kubelet (volume manager) interact through the field node.status.VolumesAttached:
The ad controller adds volumes to and deletes volumes from node.status.VolumesAttached; presence means the volume has been attached, absence means it has not.
Kubelet (volume manager) needs to verify whether the (pvc) volumes of a new pod have been attached successfully: if a volume exists in node.status.VolumesAttached, the attach is considered successful; if not, it is considered not attached.
The following is the whole process:
First, when the old pod is deleted, the cbs detach fails for some reason; the failed detach is then retried with backoff.
Because the detach failed, the volume is not removed from asw's attachedVolumes.
Also, because of how detach works:
the volume is removed from node.status.VolumesAttached before DetachVolume is executed, and
when the detach returns backoffError, the volume is not added back to node.status.VolumesAttached.
After that, the sts is recreated and the new pod is scheduled to the same node as before, while the volume is still inside the backoff period (for example, within the 500ms of the first backoff cycle).
Once the new pod is created, it is added to dsw's nodesManaged (nodeName and volumeName are unchanged).
The attach logic (step 2) of reconcile() then checks whether the volume is attached. At this point the volume is found in both asw and dsw, and since the detach failed the volume really is still attached when verified, so attachedConfirmed is set to true.
The ad controller therefore believes the volume has been attached successfully.
When the detach logic (step 1) of reconcile() runs, it finds that the volume to be detached already exists in dsw.nodesManaged (because neither nodeName nor volumeName changed). Since the volume exists in both asw and dsw, the actual state matches the desired state, and no detach is considered necessary.
In this way, the volume is never added back to node.status.VolumesAttached. Hence the symptom: the volume is missing from the node info, yet the ad controller believes it has been attached successfully.
Because kubelet (volume manager) and the controller manager run asynchronously and interact only through node.status.VolumesAttached, when the volume manager verifies whether the volume has been attached it cannot find it in node.status.VolumesAttached and concludes the attach did not succeed, which produces the Volume not attached according to node status error seen in the symptoms.
Kubelet's syncPod then times out while waiting for all of the pod's volumes to be attached and mounted (the other error in the symptoms: timeout expired waiting ...).
So the pod stays in ContainerCreating.
Summary
Therefore, the causes of this case are:
Sts + cbs; when the pod is recreated, it is scheduled to the same node as before.
Because the detach failed, the sts/pod is recreated during the backoff period, which makes the dsw and asw data in the ad controller consistent (the volume really is still in the attached state at this point, since the detach did not succeed), so the ad controller concludes that the volume no longer needs to be detached.
And because during detach the volume is first deleted from node.status.VolumesAttached and only then is the real DetachVolume executed, during the backoff period a backoffError is returned directly, DetachVolume is skipped, and the volume is not added back.
After that, because the volume is already in the attached state, the ad controller believes it no longer needs to be attached, so the volume is never added to node.status.VolumesAttached again.
Finally, since kubelet interacts with the ad controller only through node.status.VolumesAttached, kubelet believes the attach did not succeed, and the newly created pod stays in ContainerCreating.
Based on this, we can see that the key lies in node.status.VolumesAttached and the following two pieces of logic:
On backoffError during detach, the volume is not added back;
During detach, the volume is deleted first and added back only if the detach actually fails.
So as long as the volume is added back (or never removed) in all circumstances, the problem disappears. Given the two pieces of logic above, there are two possible solutions, of which solution 2 is recommended:
Solution 1 (pr #72914): on backoffError, also add the volume back. Drawback: the number of patch node requests increases by 10+ per second per volume.
Solution 2 (pr #88572): as soon as the detach logic is entered, check whether the volume is still in its backoff period (backoffError); if so, skip all of the subsequent detach logic, so the volume is never deleted from node.status.VolumesAttached and nothing needs to be added back. This scheme avoids the problem of solution 1 and further reduces the number of apiserver requests, with few code changes (sketched below).
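For illustration, the idea behind solution 2 can be sketched as follows; this is an assumption-level sketch of the approach, not the actual change in pr #88572. The backoff check happens before anything is removed from the reported status.

```go
package main

import (
	"errors"
	"fmt"
)

var errBackoff = errors.New("backoffError: still within the exponential backoff period")

// detachVolumeFixed sketches solution 2: when the volume is still in its backoff
// window, bail out before touching the reported status at all, so nothing ever
// needs to be added back.
func detachVolumeFixed(reported map[string]bool, volume string, inBackoff bool, detach func() error) error {
	if inBackoff {
		return errBackoff // nothing was removed, so nothing needs adding back
	}
	delete(reported, volume) // RemoveVolumeFromReportAsAttached + patch node
	if err := detach(); err != nil {
		reported[volume] = true // add back only on a real detach failure
		return err
	}
	return nil
}

func main() {
	reported := map[string]bool{"disk-7bfqsft5": true}
	_ = detachVolumeFixed(reported, "disk-7bfqsft5", true, func() error { return nil })
	fmt.Println(reported) // map[disk-7bfqsft5:true] — the volume stays reported during backoff
}
```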
Summary
The ad controller is responsible for attaching and detaching storage volumes. It decides whether an attach or detach is needed by comparing asw with dsw, and the final attach/detach results are reflected in node.status.VolumesAttached.
The symptoms in the production case above are caused by a bug in the K8s ad controller which, at the time of writing, had not been fixed in the community.
The main cause of the symptoms:
When the detach fails while the old pod is being deleted, and a new pod is created within the failed detach's backoff period, the ad controller logic bug leaves the volume deleted from node.status.VolumesAttached. As a result, when the new pod is created, kubelet's check concludes that the volume has not been attached successfully, so the pod stays in ContainerCreating.
As a fix, pr #88572 is recommended. TKE already runs a stable version of this fix, currently in grayscale rollout.
The above is our analysis of how a logic flaw in the Kubernetes attach/detach controller can cause pod startup failures, and what to do about it. If you have run into similar problems, the analysis above may help you understand and resolve them.