Understanding the K8s persistent storage process in one article

Updated 2025-10-26 | Source: SLTechnology News&Howtos (Shulou.com), Servers section | 06/02 report

Author | Sun Zhiheng (Hui Zhi), development engineer at Alibaba

Introduction: as we all know, persistent storage (Persistent Storage) in K8s ensures that application data exists independently of the application life cycle, but its internal implementation is rarely discussed. What exactly does the storage process in K8s look like? How do PV, PVC, StorageClass, Kubelet, CSI plug-ins, and the other components call one another? This article answers these questions one by one.

K8s persistent storage basics

Before explaining the K8s storage process, let's review the basic concepts of persistent storage in K8s.

1. Terminology

In-tree: code logic that lives in the official K8s repository.

Out-of-tree: code logic that lives outside the official K8s repository and is decoupled from the K8s code.

PV: PersistentVolume, a cluster-level resource created by the cluster administrator or by the External Provisioner. The life cycle of a PV is independent of the Pods that use it, and the details of the storage device are kept in the PV's .Spec.

PVC: PersistentVolumeClaim, a namespace-level resource created by the user or by the StatefulSet controller (based on VolumeClaimTemplate). A PVC is similar to a Pod: a Pod consumes Node resources, while a PVC consumes PV resources. A Pod can request specific levels of resources (CPU and memory), while a PVC can request a specific storage volume size and access mode (Access Mode).

StorageClass: a cluster-level resource created by the cluster administrator. An SC gives administrators a "class" template for dynamically provisioning storage volumes. The .Spec of an SC defines in detail the quality-of-service levels, backup policies, and so on of the PVs it provisions.

CSI: Container Storage Interface. Its purpose is to define an industry-standard "container storage interface" so that plug-ins developed by storage providers (SP) against the CSI standard can work in different container orchestration (CO) systems, including Kubernetes, Mesos, Swarm, and so on.

2. Component introduction

PV Controller: responsible for PV/PVC binding and life cycle management, and performs Provision/Delete operations on data volumes as required.

AD Controller: responsible for the Attach/Detach operations of data volumes, attaching the device to the target node.

Kubelet: the main "node agent" running on each Node. Its functions include Pod life cycle management, container health checks, container monitoring, and so on.

Volume Manager: a component in Kubelet that manages the Mount/Unmount operations of data volumes (it can also handle Attach/Detach, which requires configuring the relevant Kubelet parameters to enable this feature), the formatting of volume devices, and so on.

Volume Plugins: storage plug-ins developed by storage vendors. They extend the volume management capabilities for various storage types and implement the operations of third-party storage, that is, the operations highlighted in blue above. There are two kinds of Volume Plugins: in-tree and out-of-tree.

External Provisioner: a sidecar container that calls the CreateVolume and DeleteVolume functions in Volume Plugins to perform Provision/Delete operations. Because the K8s PV controller cannot call the functions of Volume Plugins directly, External Provisioner calls them through gRPC.

External Attacher: a sidecar container that calls the ControllerPublishVolume and ControllerUnpublishVolume functions in Volume Plugins to perform Attach/Detach operations. Because the K8s AD controller cannot call the functions of Volume Plugins directly, External Attacher calls them through gRPC.

3. Persistent volume usage

Kubernetes introduces PV and PVC so that applications and their developers can request storage resources in a standard way without dealing with the details of the storage infrastructure. There are two ways to create a PV:

One is for the cluster administrator to manually and statically create the PVs that applications need.

The other is for the user to create a PVC manually and let the Provisioner component dynamically create the corresponding PV.

Let's take NFS shared storage as an example to see the difference between the two.

Create storage volumes statically

The process of statically creating a storage volume is shown in the following figure:

Step 1: the cluster administrator creates an NFS PV. NFS is an in-tree storage type natively supported by K8s. The yaml file is as follows:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.168.4.1
    path: /nfs_storage

Step 2: the user creates a PVC. The yaml file is as follows:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

The kubectl get pvc command shows that the PV and PVC are bound:

[root@huizhi ~]# kubectl get pvc
NAME      STATUS   VOLUME               CAPACITY   ACCESS MODES   STORAGECLASS   AGE
nfs-pvc   Bound    nfs-pv-no-affinity   10Gi       RWO                           4s

Step 3: the user creates the application and uses the PVC created in step 2.

apiVersion: v1
kind: Pod
metadata:
  name: test-nfs
spec:
  containers:
  - image: nginx:alpine
    imagePullPolicy: IfNotPresent
    name: nginx
    volumeMounts:
    - mountPath: /data
      name: nfs-volume
  volumes:
  - name: nfs-volume
    persistentVolumeClaim:
      claimName: nfs-pvc

At this point, the remote NFS storage is mounted to the /data directory of the nginx container in the Pod.

Create storage volumes dynamically

Dynamic creation of storage volumes requires the deployment of nfs-client-provisioner and corresponding storageclass in the cluster.

Dynamic creation of storage volumes reduces the intervention of cluster administrators compared with static creation of storage volumes. The process is shown below:

The cluster administrator only needs to ensure that there is an NFS-related storageclass in the environment:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nfs-sc
provisioner: example.com/nfs
mountOptions:
- vers=4.1

Step 1: the user creates a PVC, where the storageClassName of PVC is specified as the storageclass name of the above NFS:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs
  annotations:
    volume.beta.kubernetes.io/storage-class: "example-nfs"
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Mi
  storageClassName: nfs-sc

Step 2: the nfs-client-provisioner in the cluster will dynamically create the corresponding PV. At this point, you can see that the PV in the environment has been created and bound to PVC.

[root@huizhi ~]# kubectl get pv
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS   CLAIM         REASON   AGE
pvc-dce84888-7a9d-11e6-b1ee-5254001e0c1b   10Mi       RWX           Delete          Bound    default/nfs            4s

Step 3: the user creates the application and uses the PVC created in step 2, which is the same as step 3 in statically creating a storage volume.

K8s persistent storage process

1. Process overview

The flow chart here is borrowed from @Junbao's cloud native storage course.

The process is as follows:

The user creates a Pod that contains a PVC, and the PVC requires a dynamically provisioned storage volume.

Based on the Pod configuration, node status, PV configuration, and other information, the Scheduler schedules the Pod to a suitable Worker node.

The PV controller watches the PVC used by the Pod and finds that it is in the Pending state, so it calls the Volume Plugin (in-tree) to create the storage volume and create the PV object (for out-of-tree, this is handled by External Provisioner).

The AD controller finds that the Pod and PVC are waiting to be attached, so it calls the Volume Plugin to attach the storage device to the target Worker node.

On the Worker node, the Volume Manager in Kubelet waits for the storage device to be attached and then mounts the device to the global directory through the Volume Plugin: /var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PV Name] (taking iscsi as an example).

Kubelet starts the containers of the Pod through Docker and maps the volume that has been mounted to the local global directory into the containers by bind mount.

A more detailed process is as follows:

2. Detailed explanation of the process

The persistent storage process is slightly different in different K8s versions. This article is based on Kubernetes version 1.14.8.

As can be seen from the above flowchart, the storage volume is divided into three stages from creation to application use: Provision/Delete, Attach/Detach, and Mount/Unmount.

Provisioning volumes

There are two workers in the PV controller:

ClaimWorker: handles the add / update / delete events of PVCs and the state transitions of PVCs.

VolumeWorker: responsible for the state transitions of PVs.

PV state transitions (UpdatePVStatus):

The initial state of a PV is Available. When the PV is bound to a PVC, the state becomes Bound.
When the PVC bound to the PV is deleted, the state becomes Released.
When the PV reclaim policy is Recycle, or the PV's .Spec.ClaimRef is manually deleted, the state returns to Available.
When the PV reclaim policy is unknown, or Recycle fails, or deletion of the storage volume fails, the state becomes Failed.
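The PV phase transitions above can be read as a small state machine. The following is a minimal sketch, not the controller's actual code: the event names (bound_to_pvc, pvc_deleted, and so on) are hypothetical labels introduced here for illustration, and only the phase names come from the text.

```python
# Hypothetical event names; only the phase names come from the article.
PV_TRANSITIONS = {
    ("Available", "bound_to_pvc"): "Bound",
    ("Bound", "pvc_deleted"): "Released",
    ("Released", "recycle_succeeded"): "Available",
    ("Released", "claim_ref_removed"): "Available",
    ("Released", "reclaim_failed"): "Failed",
}


def next_pv_phase(phase: str, event: str) -> str:
    """Return the next phase, or the current phase if the event does not apply."""
    return PV_TRANSITIONS.get((phase, event), phase)


if __name__ == "__main__":
    phase = "Available"
    for event in ("bound_to_pvc", "pvc_deleted", "claim_ref_removed"):
        phase = next_pv_phase(phase, event)
        print(event, "->", phase)
```

Keeping the transitions in a lookup table makes it easy to see that an event which does not apply to the current phase leaves the PV where it is.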

PVC state transitions (UpdatePVCStatus):

When no PV in the cluster satisfies the PVC, the PVC status is Pending. After a PV is bound to the PVC, the PVC state changes from Pending to Bound. If the PV bound to the PVC is deleted from the environment, the PVC state changes to Lost. If the PVC is then bound again to a PV of the same name, the PVC state returns to Bound.

The Provisioning process is as follows (this simulates a user creating a new PVC):

Static storage volume flow (FindBestMatch): the PV controller first looks for a PV with status Available in the environment to match the new PVC.

DelayBinding: the PV controller determines whether the PVC needs delayed binding:
1. Check whether the annotations of the PVC contain volume.kubernetes.io/selected-node. If so, the scheduler has already assigned a node to the PVC (this belongs to ProvisionVolume), so delayed binding is not needed.
2. If the annotations of the PVC do not contain volume.kubernetes.io/selected-node and the PVC has no StorageClass, delayed binding is not needed by default.
3. If the PVC has a StorageClass, check its VolumeBindingMode field. If it is WaitForFirstConsumer, delayed binding is needed. If it is Immediate, delayed binding is not needed.
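The DelayBinding decision above can be sketched as a single function. This is an illustration under stated assumptions, not the controller's code: the PVC annotations are modelled as a plain dict, and the StorageClass is modelled as a dict with a hypothetical volumeBindingMode key (or None when the PVC has no StorageClass).

```python
from typing import Optional

SELECTED_NODE = "volume.kubernetes.io/selected-node"


def needs_delayed_binding(pvc_annotations: dict, storage_class: Optional[dict]) -> bool:
    """Decide whether binding should wait for the first consumer Pod."""
    # The scheduler has already picked a node, so there is nothing to wait for.
    if SELECTED_NODE in pvc_annotations:
        return False
    # No StorageClass: bind immediately by default.
    if storage_class is None:
        return False
    # Otherwise the StorageClass decides.
    return storage_class.get("volumeBindingMode") == "WaitForFirstConsumer"


if __name__ == "__main__":
    print(needs_delayed_binding({}, {"volumeBindingMode": "WaitForFirstConsumer"}))
    print(needs_delayed_binding({SELECTED_NODE: "node-1"}, {"volumeBindingMode": "WaitForFirstConsumer"}))
```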

FindBestMatchPVForClaim: the PV controller tries to find an existing PV in the environment that meets the PVC requirements. It scans all PVs once and selects the best match among those that pass the filters. Filter rules:
1. Whether VolumeMode matches;
2. Whether the PV is already bound to a PVC;
3. Whether the .Status.Phase of the PV is Available;
4. LabelSelector check: the labels of the PV must match the selector of the PVC;
5. Whether the StorageClass of the PV is the same as that of the PVC;
6. During the scan, keep track of the smallest PV that satisfies the requested size of the PVC and return it as the final result.
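The filter rules above amount to a single pass that keeps the smallest PV that fits. The sketch below is a simplification with assumed data shapes: PVs and the claim are plain dicts, capacities are plain integers (for example GiB) rather than Kubernetes quantities, and the key names are hypothetical.

```python
from typing import Optional


def find_best_match(pvs: list, claim: dict) -> Optional[dict]:
    """Return the smallest PV that satisfies the claim, or None."""
    best = None
    selector = claim.get("selector", {})
    for pv in pvs:
        if pv["volumeMode"] != claim["volumeMode"]:
            continue  # rule 1: VolumeMode must match
        if pv.get("claimRef") is not None:
            continue  # rule 2: already bound to a PVC
        if pv["phase"] != "Available":
            continue  # rule 3: only Available PVs
        labels = pv.get("labels", {})
        if any(labels.get(k) != v for k, v in selector.items()):
            continue  # rule 4: labels must satisfy the selector
        if pv.get("storageClassName", "") != claim.get("storageClassName", ""):
            continue  # rule 5: same StorageClass
        if pv["capacity"] < claim["request"]:
            continue  # too small for the request
        if best is None or pv["capacity"] < best["capacity"]:
            best = pv  # rule 6: keep the smallest PV that fits
    return best


if __name__ == "__main__":
    pvs = [
        {"name": "big", "volumeMode": "Filesystem", "phase": "Available", "capacity": 20},
        {"name": "small", "volumeMode": "Filesystem", "phase": "Available", "capacity": 10},
    ]
    claim = {"volumeMode": "Filesystem", "request": 8}
    print(find_best_match(pvs, claim)["name"])
```

Picking the smallest PV that still fits avoids wasting a large volume on a small request.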

Bind: the PV controller binds the selected PV and PVC:
1. Update the .Spec.ClaimRef of the PV to point to the current PVC;
2. Update the .Status.Phase of the PV to Bound;
3. Add the annotation pv.kubernetes.io/bound-by-controller: "yes" to the PV;
4. Update the .Spec.VolumeName of the PVC to the PV name;
5. Update the .Status.Phase of the PVC to Bound;
6. Add the annotations pv.kubernetes.io/bound-by-controller: "yes" and pv.kubernetes.io/bind-completed: "yes" to the PVC.
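The six Bind steps above touch both objects. As a rough sketch, assuming the PV and PVC are represented as nested dicts shaped like their yaml (an assumption made here for illustration; the real controller updates API objects through the API server), the updates look like this:

```python
BOUND_BY_CONTROLLER = "pv.kubernetes.io/bound-by-controller"
BIND_COMPLETED = "pv.kubernetes.io/bind-completed"


def bind(pv: dict, pvc: dict) -> None:
    """Apply the six Bind updates in place on dict-shaped PV and PVC objects."""
    # Steps 1-3: the PV side.
    pv["spec"]["claimRef"] = {
        "namespace": pvc["metadata"]["namespace"],
        "name": pvc["metadata"]["name"],
    }
    pv["status"]["phase"] = "Bound"
    pv["metadata"].setdefault("annotations", {})[BOUND_BY_CONTROLLER] = "yes"
    # Steps 4-6: the PVC side.
    pvc["spec"]["volumeName"] = pv["metadata"]["name"]
    pvc["status"]["phase"] = "Bound"
    annotations = pvc["metadata"].setdefault("annotations", {})
    annotations[BOUND_BY_CONTROLLER] = "yes"
    annotations[BIND_COMPLETED] = "yes"


if __name__ == "__main__":
    pv = {"metadata": {"name": "nfs-pv"}, "spec": {}, "status": {"phase": "Available"}}
    pvc = {"metadata": {"name": "nfs-pvc", "namespace": "default"}, "spec": {}, "status": {"phase": "Pending"}}
    bind(pv, pvc)
    print(pv["status"]["phase"], pvc["spec"]["volumeName"])
```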

Dynamic storage volume process (ProvisionVolume): if there is no suitable PV in the environment, enter the dynamic Provisioning scenario:

Before Provisioning:
1. The PV controller first determines whether the StorageClass used by the PVC is in-tree or out-of-tree by checking whether the Provisioner field of the StorageClass contains the prefix "kubernetes.io/";
2. The PV controller updates the annotation of the PVC: claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] = storageClass.Provisioner
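The two pre-provisioning steps above are simple enough to sketch directly. The function names below are hypothetical; the prefix and annotation key are the ones quoted in the text, and kubernetes.io/aws-ebs is used only as an example of an in-tree provisioner name.

```python
IN_TREE_PREFIX = "kubernetes.io/"
STORAGE_PROVISIONER = "volume.beta.kubernetes.io/storage-provisioner"


def is_in_tree(provisioner: str) -> bool:
    """In-tree provisioners are identified by the kubernetes.io/ prefix."""
    return provisioner.startswith(IN_TREE_PREFIX)


def annotate_claim(claim_annotations: dict, provisioner: str) -> dict:
    """Return a copy of the PVC annotations with the provisioner recorded."""
    updated = dict(claim_annotations)
    updated[STORAGE_PROVISIONER] = provisioner
    return updated


if __name__ == "__main__":
    for name in ("kubernetes.io/aws-ebs", "example.com/nfs"):
        print(name, "in-tree" if is_in_tree(name) else "out-of-tree")
```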

In-tree Provisioning (internal provisioning):
1. The in-tree Provisioner implements the NewProvisioner method of the ProvisionableVolumePlugin interface, which returns a new Provisioner;
2. The PV controller calls the Provision function of the Provisioner, which returns a PV object;
3. The PV controller creates the PV object returned in the previous step and binds it to the PVC: it sets .Spec.ClaimRef to the PVC, .Status.Phase to Bound, and .Spec.StorageClassName to the same StorageClassName as the PVC. It also adds the annotations "pv.kubernetes.io/bound-by-controller" = "yes" and "pv.kubernetes.io/provisioned-by" = plugin.GetPluginName().

Out-of-tree Provisioning (external provisioning):
1. External Provisioner checks whether claim.Spec.VolumeName in the PVC is empty. If it is not empty, the PVC is skipped;
2. External Provisioner checks whether claim.Annotations["volume.beta.kubernetes.io/storage-provisioner"] in the PVC equals its own Provisioner Name (External Provisioner is given the --provisioner parameter at startup to determine its own Provisioner Name);
3. If the PVC has VolumeMode=Block, check whether External Provisioner supports block devices;
4. External Provisioner calls the Provision function, which calls the CreateVolume interface of the CSI storage plug-in through gRPC;
5. External Provisioner creates a PV to represent the volume and binds the PV to the PVC.
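Steps 1 to 3 of out-of-tree provisioning are checks that decide whether a given External Provisioner should handle a PVC at all. A minimal sketch, assuming the claim is a flat dict with hypothetical keys volumeName, annotations, and volumeMode:

```python
STORAGE_PROVISIONER = "volume.beta.kubernetes.io/storage-provisioner"


def should_provision(claim: dict, provisioner_name: str, supports_block: bool) -> bool:
    """Return True if this provisioner should go on to call CreateVolume."""
    # Step 1: a claim that already names a volume is skipped.
    if claim.get("volumeName"):
        return False
    # Step 2: the claim must be addressed to this provisioner.
    if claim.get("annotations", {}).get(STORAGE_PROVISIONER) != provisioner_name:
        return False
    # Step 3: block-mode claims need block device support.
    if claim.get("volumeMode") == "Block" and not supports_block:
        return False
    return True


if __name__ == "__main__":
    claim = {"annotations": {STORAGE_PROVISIONER: "example.com/nfs"}}
    print(should_provision(claim, "example.com/nfs", supports_block=False))
```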

Deleting volumes

The Deleting process is the reverse operation of Provisioning:

The user deletes the PVC, and the PV controller changes PV.Status.Phase to Released.

When PV.Status.Phase == Released, the PV controller first checks the value of Spec.PersistentVolumeReclaimPolicy. If it is Retain, the PV is skipped. If it is Delete:

In-tree Deleting:
1. The in-tree Provisioner implements the NewDeleter method of the DeletableVolumePlugin interface, which returns a new Deleter;
2. The controller calls the Delete function of the Deleter to delete the corresponding volume;
3. After the volume is deleted, the PV controller deletes the PV object.

Out-of-tree Deleting:
1. External Provisioner calls the Delete function, which calls the DeleteVolume interface of the CSI plug-in through gRPC;
2. After the volume is deleted, External Provisioner deletes the PV object.

Attaching Volumes

Both the Kubelet component and the AD controller can perform attach/detach operations. If --enable-controller-attach-detach=false is set in the Kubelet startup parameters, the Kubelet performs them; otherwise (the flag defaults to true) the AD controller performs them. The following uses the AD controller as an example to explain the attach/detach operations.

There are two core variables in the AD controller:

DesiredStateOfWorld (DSW): the expected attachment state of data volumes in the cluster, containing the information nodes -> volumes -> pods.

ActualStateOfWorld (ASW): the actual attachment state of data volumes in the cluster, containing the information volumes -> nodes.

The Attaching process is as follows:

The AD controller initializes DSW and ASW based on the resource information in the cluster.

There are three components inside the AD controller that periodically update DSW and ASW:

Reconciler: runs periodically in a goroutine to make sure volumes are attached / detached. During this period, ASW is constantly updated:

In-tree attaching: 1. The in-tree Attacher implements the NewAttacher method of the AttachableVolumePlugin interface, which returns a new Attacher; 2. The AD controller calls the Attach function of the Attacher to attach the device; 3. ASW is updated.

Out-of-tree attaching: 1. The in-tree CSIAttacher is called to create a VolumeAttachment (VA) object, which contains the Attacher information, the node name, and the PV to be attached; 2. External Attacher watches the VolumeAttachment resources in the cluster. When it finds a data volume that needs to be attached, it calls the Attach function, which calls the ControllerPublishVolume interface of the CSI plug-in through gRPC.

DesiredStateOfWorldPopulator: runs periodically in a goroutine. Its main function is to update DSW:

FindAndRemoveDeletedPods: traverses all Pods in DSW and removes a Pod from DSW if it has been deleted from the cluster.

FindAndAddActivePods: traverses all Pods in PodLister and adds a Pod to DSW if it does not exist there.

PVC Worker: watches the add/update events of PVCs, processes the Pods related to the PVC, and updates DSW in real time.

Detaching Volumes

The Detaching process is as follows:

When a Pod is deleted, the AD controller sees the event through its watch. The AD controller first checks whether the Node where the Pod runs carries the "volumes.kubernetes.io/keep-terminated-pod-volumes" label. If it does, no action is taken; if it does not, the volume is removed from DSW.

Through the Reconciler, the AD controller drives the ActualStateOfWorld toward the DesiredStateOfWorld. When it finds a volume in ASW that does not exist in DSW, it performs the Detach operation:

In-tree detaching: 1. The in-tree Attacher implements the NewDetacher method of the AttachableVolumePlugin interface, which returns a new Detacher; 2. The AD controller calls the Detach function of the Detacher to detach the corresponding volume; 3. The AD controller updates ASW.

Out-of-tree detaching: 1. The AD controller calls the in-tree CSIAttacher to delete the related VolumeAttachment object; 2. External Attacher watches the VolumeAttachment (VA) resources in the cluster. When it finds a data volume that needs to be detached, it calls the Detach function, which calls the ControllerUnpublishVolume interface of the CSI plug-in through gRPC; 3. The AD controller updates ASW.
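The Reconciler's job in the Attaching and Detaching sections is to compare DSW with ASW and act on the difference. The sketch below shows only that comparison, under simplifying assumptions: DSW is a nested dict (node -> volume -> set of pods), ASW is a dict (volume -> set of nodes), and the function names are hypothetical. It does not call any plug-in.

```python
def desired_attachments(dsw: dict) -> set:
    """Flatten DSW (node -> volume -> pods) into (volume, node) pairs."""
    return {
        (volume, node)
        for node, volumes in dsw.items()
        for volume, pods in volumes.items()
        if pods  # a volume with no remaining pods is no longer desired
    }


def actual_attachments(asw: dict) -> set:
    """Flatten ASW (volume -> nodes) into (volume, node) pairs."""
    return {(volume, node) for volume, nodes in asw.items() for node in nodes}


def reconcile(dsw: dict, asw: dict) -> tuple:
    """Return (to_attach, to_detach) as sorted lists of (volume, node) pairs."""
    desired = desired_attachments(dsw)
    actual = actual_attachments(asw)
    return sorted(desired - actual), sorted(actual - desired)


if __name__ == "__main__":
    dsw = {"node-1": {"vol-a": {"pod-1"}, "vol-b": {"pod-2"}}}
    asw = {"vol-b": {"node-1"}, "vol-c": {"node-1"}}
    to_attach, to_detach = reconcile(dsw, asw)
    print("attach:", to_attach)
    print("detach:", to_detach)
```

Anything desired but not actual needs Attach; anything actual but no longer desired needs Detach. After each operation the real controller updates ASW, so the next periodic run sees a smaller difference.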

Mounting/Unmounting Volumes

There are also two core variables in Volume Manager:

DesiredStateOfWorld (DSW): the expected mount state of data volumes in the cluster, containing the information volumes -> pods.

ActualStateOfWorld (ASW): the actual mount state of data volumes in the cluster, containing the information volumes -> pods.

The Mounting/Unmounting process is as follows:

The purpose of the global directory (global mount path): a block device can only be mounted once on Linux, while in K8s a PV may be mounted to multiple Pod instances on the same Node. If the block device is formatted and mounted to a temporary global directory on the Node, and the global directory is then mounted into the corresponding directory of each Pod using the Linux bind mount technique, this requirement can be met. In the flowchart above, the global directory is /var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PV Name]

VolumeManager initializes DSW and ASW based on the resource information in the cluster.

There are two components inside VolumeManager that periodically update DSW and ASW:

DesiredStateOfWorldPopulator: runs periodically in a goroutine. Its main function is to update DSW.

Reconciler: runs periodically in a goroutine to make sure volumes are mounted / unmounted. During this period, ASW is constantly updated:

UnmountVolumes: makes sure volumes are unmounted after a Pod is deleted. It iterates through all Pods in ASW. If a Pod is not in DSW (meaning the Pod has been deleted), then, taking VolumeMode=FileSystem as an example, it does the following:

1. Remove all bind mounts: call the TearDown interface of the Unmounter (for out-of-tree, call the NodeUnpublishVolume interface of the CSI plug-in);
2. Unmount volume: call the UnmountDevice function of the DeviceUnmounter (for out-of-tree, call the NodeUnstageVolume interface of the CSI plug-in);
3. Update ASW.

MountAttachVolumes: makes sure the volumes a Pod needs are mounted successfully. It iterates through all Pods in DSW. If a Pod is not in ASW (meaning the directory is still waiting to be mounted and mapped into the Pod), then, taking VolumeMode=FileSystem as an example, it does the following:

1. Wait for the volume to be attached to the node (by External Attacher or by Kubelet itself);
2. Mount the volume to the global directory: call the MountDevice function of the DeviceMounter (for out-of-tree, call the NodeStageVolume interface of the CSI plug-in);
3. Update ASW: the volume has been mounted to the global directory;
4. Bind-mount the volume into the Pod: call the SetUp interface of the Mounter (for out-of-tree, call the NodePublishVolume interface of the CSI plug-in);
5. Update ASW.
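For an out-of-tree plug-in, the UnmountVolumes and MountAttachVolumes steps above imply a fixed order of CSI node calls: stage before publish on the way in, unpublish before unstage on the way out. The sketch below only records that order. RecordingCSINode is a hypothetical stand-in, not a real CSI client, and the still_used_by_other_pods flag is an assumption standing in for the "volume is no longer in use" check.

```python
class RecordingCSINode:
    """Hypothetical stand-in for a CSI node plug-in; it only records RPC names."""

    def __init__(self):
        self.calls = []

    def call(self, rpc: str, volume: str) -> None:
        self.calls.append((rpc, volume))


def mount_for_pod(csi: RecordingCSINode, volume: str) -> None:
    """Mount to the global directory first, then bind-mount into the Pod."""
    csi.call("NodeStageVolume", volume)
    csi.call("NodePublishVolume", volume)


def unmount_for_pod(csi: RecordingCSINode, volume: str, still_used_by_other_pods: bool) -> None:
    """Remove the Pod's bind mount, then unstage only if nobody else uses the volume."""
    csi.call("NodeUnpublishVolume", volume)
    if not still_used_by_other_pods:
        csi.call("NodeUnstageVolume", volume)


if __name__ == "__main__":
    csi = RecordingCSINode()
    mount_for_pod(csi, "vol-a")
    unmount_for_pod(csi, "vol-a", still_used_by_other_pods=False)
    for rpc, volume in csi.calls:
        print(rpc, volume)
```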

UnmountDetachDevices: makes sure volumes that need to be unmounted are unmounted. It iterates through all UnmountedVolumes in ASW. If a volume is not in DSW (meaning the volume is no longer in use), it does the following:

Unmount volume: call the UnmountDevice function of the DeviceUnmounter (for out-of-tree, call the NodeUnstageVolume interface of the CSI plug-in); update ASW.

Summary

This article first introduces the basic concepts and usage of K8s persistent storage, and then analyzes the internal storage process of K8s in depth. On K8s, the use of any kind of storage goes through the process above (attach/detach is not used in some scenarios), and a storage problem in an environment must come from a failure in one of these stages.

Container storage has many pitfalls, especially in proprietary cloud environments. But the more challenges there are, the more opportunities there are! At present, the domestic proprietary cloud market is also highly competitive in the storage field. Our Agile PaaS container team welcomes talented people to join us and build the future together!

Reference links

Kubernetes community source code
[cloud native open course] Kubernetes storage architecture and plug-in use (Junbao)
[cloud native open course] application storage and persistence of data volumes-core knowledge (to heaven)
[kubernetes-design-proposals] volume-provisioning
[kubernetes-design-proposals] CSI Volume Plugins in Kubernetes Design Doc

Cloud native application team is hiring!

The Alibaba Cloud native application platform team is currently looking for talent. We would like to hear from you if you meet the following:

Enthusiasm for cloud native technologies in container and infrastructure related fields, with rich experience and outstanding results (such as product delivery, implementation of innovative technology, open source contributions, or leading academic work) in one of the cloud native infrastructure areas such as Kubernetes, Serverless platforms, container networking and storage, or operation and maintenance platforms

Excellent presentation, communication, and teamwork skills; forward-looking thinking on technology and business; strong ownership, results-oriented, good at decision-making

Familiar with at least one of Java and Golang

Bachelor's degree or above, with more than 3 years of working experience.

Resumes can be sent to: huizhi.szh@alibaba-inc.com. If you have any questions, please add WeChat: TheBeatles1994.
