Author | Li Junbao, Senior Technical Expert at Alibaba
This article is compiled from Lesson 21 of the "CNCF x Alibaba Cloud Native Technology Open Course". Follow the "Alibaba Cloud Native" official account and reply with the keyword "getting started" to download the slides for the "Getting Started with K8s from Scratch" series.
Introduction: Container storage is the foundation of data persistence in a Kubernetes cluster and an important guarantee for running stateful services. By default, Kubernetes provides mainstream storage volume access schemes (In-Tree), as well as a plug-in mechanism (Out-Of-Tree) that allows other types of storage services to be integrated into Kubernetes. This article explains the Kubernetes storage architecture and the principles and implementation of storage plug-ins; I hope you find it useful.
I. Kubernetes storage architecture
Example: mounting a Volume in Kubernetes
Let's start with an example of mounting a Volume.
As shown in the figure below, the YAML template on the left defines a StatefulSet application. It declares a volume named disk-pvc that is mounted at /data inside the Pod. disk-pvc is a PVC-type data volume, and a storageClassName is defined in it.
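The figure itself is not reproduced here; as a rough sketch, such a StatefulSet might look like the following (the application name, image, StorageClass name csi-disk and requested size are illustrative assumptions, not values from the original figure):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web                          # illustrative name
spec:
  serviceName: web
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx                 # illustrative image
        volumeMounts:
        - name: disk-pvc             # the volume named in the text
          mountPath: /data           # mounted at /data inside the Pod
  volumeClaimTemplates:              # each replica gets a PVC derived from this template
  - metadata:
      name: disk-pvc
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: csi-disk     # assumed StorageClass name; triggers dynamic provisioning
      resources:
        requests:
          storage: 20Gi
```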
This is a typical dynamic provisioning template. The figure on the right shows the process of mounting the data volume, which is divided into six steps:
Step 1: the user creates a Pod that references a PVC;
Step 2: PV Controller keeps watching the ApiServer. If it finds a PVC that has been created but is not yet bound, it tries to bind a PV to it.
PV Controller first looks for a suitable existing PV in the cluster to bind. If no matching PV is found, it calls Volume Plugin to do Provision, that is, to create a volume on the remote storage backend, create a PV object in the cluster, and then bind the PV to the PVC;
Step 3: Scheduler places the Pod on a node.
We know that a Pod must be assigned to a node before it can run, and this selection is done by Scheduler. When scheduling, Scheduler takes several inputs into account, such as the nodeSelector and nodeAffinity defined in the Pod, as well as labels defined on the Volume.
We can add labels to the data volume so that the Pod using this PV is scheduled, because of those label constraints, onto the node we want;
Step 4: if the Pod is scheduled to a node and the PV it uses has not yet been attached, AD Controller calls Volume Plugin to attach the remote volume to a device on the target node (for example /dev/vdb);
Step 5: when Volume Manager finds that a Pod has been scheduled to its own node and the volume has been attached, it performs the mount operation, mounting the local device (the /dev/vdb obtained above) into a Pod-specific subdirectory on the node. It may also perform additional operations such as formatting and mounting to a GlobalPath;
Step 6: the bind operation maps the volume that has been mounted locally into the container.
Storage architecture of Kubernetes
Next, let's take a look at Kubernetes's storage architecture.
PV Controller: responsible for PV/PVC binding and lifecycle management, and performs Provision/Delete operations on data volumes as needed;
AD Controller: responsible for Attach/Detach operations on storage devices, attaching a device to the target node;
Volume Manager: manages volume Mount/Unmount operations, volume formatting, and mounting to shared directories;
Volume Plugins: implement the volume operations listed above.
PV Controller, AD Controller and Volume Manager mainly issue the calls, while the concrete operations are implemented by Volume Plugins;
Scheduler: implements Pod scheduling, including storage-related scheduling based on storage-related definitions.
Next, we will introduce the functions of these parts respectively.
PV Controller
First, let's review a few basic concepts:
Persistent Volume (PV): persistent storage volume; it defines in detail the parameters of a pre-provisioned piece of storage.
For example, when mounting a remote NAS, the concrete parameters of that NAS are defined in the PV. A PV is not namespaced and is usually created and maintained by an administrator;
Persistent Volume Claim (PVC): persistent storage claim.
It is the storage interface that users work with, without needing to know the storage details. It mainly defines basic parameters such as storage Size and AccessMode, and it belongs to a NameSpace;
StorageClass: storage class.
A dynamically provisioned volume creates a PV according to the template defined by the StorageClass, which contains the parameters needed to create a PV and the Provisioner that creates it (that is, who creates the PV).
The main task of PV Controller is to manage the lifecycle of PVs and PVCs, such as creating and deleting PV objects and handling the state transitions of PVs and PVCs. Its other task is to bind PVC and PV objects: a PVC must be bound to a PV before it can be used by an application. The binding is one-to-one: a PV can only be bound by one PVC, and vice versa. Next, let's look at the state transition diagram of a PV.
After a PV is created, it is in the Available state. When a PVC is bound to the PV, the PV enters the Bound state. If the PVC is then deleted, the Bound PV enters the Released state.
A PV in the Released state decides, according to its ReclaimPolicy field, whether to return to the Available state or to be deleted. If ReclaimPolicy is Recycle, it returns to Available; if the transition fails, it enters the Failed state.
The state transition diagram of PVC is relatively simple.
A newly created PVC is in the Pending state. When the PVC is bound to a PV, it enters the Bound state; when the PV bound to a Bound PVC is deleted, the PVC enters the Lost state. A PVC in the Lost state returns to Bound if its PV is re-created and re-bound to it.
The following figure is a flowchart of how PVs are filtered when a PVC is bound to a PV, that is, which PV should be chosen for binding.
First it checks VolumeMode: the VolumeMode of the PV and the PVC must match. VolumeMode defines whether the data volume is a file system (FileSystem) type or a Block type;
Second, LabelSelector: when a LabelSelector is defined in the PVC, only PVs whose Labels match that LabelSelector are candidates for binding;
Third, StorageClassName: if a StorageClassName is defined in the PVC, only PVs with the same class name pass the filter.
A note on StorageClassName: when a PVC cannot find a matching PV, the StorageClass named by this field is used to dynamically create a PV. At the same time it is also a binding condition: when an existing PV satisfies it, that PV is used directly instead of creating a new one dynamically;
Fourth, AccessMode.
AccessMode is what we usually define in a PVC, such as "ReadWriteOnce" or "ReadWriteMany". The binding condition is that the PV must provide the AccessMode that the PVC requires;
Last, Size.
The Size of the PVC must be less than or equal to the Size of the PV: the PVC declares the required capacity, and the actual volume must be at least as large as the declared one before binding.
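As a rough illustration of these five checks, here is a sketch of a PV/PVC pair that would pass all of them; the driver name, volume ID, label and sizes are assumptions for the example:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: disk-pv
  labels:
    disktype: ssd                    # matched by the PVC's LabelSelector (check 2)
spec:
  volumeMode: Filesystem             # check 1: VolumeMode matches the PVC
  storageClassName: csi-disk         # check 3: same StorageClassName
  accessModes: ["ReadWriteOnce"]     # check 4: PV offers the mode the PVC requires
  capacity:
    storage: 30Gi                    # check 5: PV size >= PVC size
  csi:
    driver: diskplugin.csi.alibabacloud.com   # assumed driver name
    volumeHandle: d-xxxxxxxx                  # placeholder volume ID
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-pvc
spec:
  volumeMode: Filesystem
  storageClassName: csi-disk
  accessModes: ["ReadWriteOnce"]
  selector:
    matchLabels:
      disktype: ssd
  resources:
    requests:
      storage: 20Gi
```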
Next, let's look at the implementation of PV Controller.
PV Controller has two main pieces of logic: one is ClaimWorker, and the other is VolumeWorker.
ClaimWorker implements the state transitions of PVCs.
Whether a PVC is bound is marked by the system annotation "pv.kubernetes.io/bind-completed".
If the annotation is set, the PVC is already bound, and only some internal state needs to be synchronized; if it is not, the PVC is unbound.
In that case the PVs in the cluster are filtered through findBestMatch according to the five binding criteria above. If a PV is found, a Bound operation is performed; otherwise a Provision operation creates a new PV.
Now look at VolumeWorker, which implements the state transitions of PVs.
It judges by the ClaimRef field in the PV. If ClaimRef is empty, the PV is in the Available state and only needs to be synchronized. If ClaimRef is not empty and points to a PVC, the corresponding PVC is looked up in the cluster: if that PVC exists, the PV is in the Bound state and the corresponding state is synchronized; if the PVC cannot be found, the PV was bound but its PVC has been deleted, so the PV is in the Released state. At that point, depending on whether ReclaimPolicy is Delete, the PV is either deleted or only has its state synchronized. That is the brief implementation logic of PV Controller.
AD Controller
AD Controller is an abbreviation for Attach/Detach Controller.
It has two core objects: DesiredStateOfWorld and ActualStateOfWorld.
DesiredStateOfWorld is the attach state of data volumes that the cluster is expected to reach; ActualStateOfWorld is the actual attach state of data volumes in the cluster.
It also has two core pieces of logic: desiredStateOfWorldPopulator and Reconciler.
desiredStateOfWorldPopulator mainly synchronizes cluster data and updates the DSW and ASW data; for example, when a new PVC or a new Pod is created in the cluster, it synchronizes their state into the DSW.
Reconciler reconciles the states of the DSW and ASW objects: it drives the ASW state toward the DSW state, and while doing so performs Attach, Detach and similar operations.
The following table shows concrete examples of the desiredStateOfWorld and actualStateOfWorld objects.
desiredStateOfWorld records each worker node, including the volumes on that node and the information needed to attach them; actualStateOfWorld records every volume, including the node the volume is attached to, its attach state, and so on.
The following figure is a logical block diagram of the AD Controller implementation.
We can see that AD Controller contains many Informers, which synchronize the status of Pods, PVs, Nodes and PVCs in the cluster to the local cache.
During initialization, populateDesiredStateOfWorld and populateActualStateOfWorld are called to initialize desiredStateOfWorld and actualStateOfWorld.
While running, desiredStateOfWorldPopulator synchronizes data, that is, it syncs the state in the cluster into desiredStateOfWorld. Reconciler reconciles actualStateOfWorld and desiredStateOfWorld by polling; during reconciliation it calls Volume Plugin to perform attach and detach operations, and it also calls nodeStatusUpdater to update the Node status.
The above is the brief implementation logic of AD Controller.
Volume Manager
Volume Manager is actually part of Kubelet, one of the many managers inside Kubelet. It performs the Attach/Detach/Mount/Unmount operations for the volumes on its own node.
Like AD Controller, it contains desiredStateOfWorld and actualStateOfWorld, plus a volumePluginMgr object that manages the plugins on the node. Its core logic is similar to AD Controller's: it synchronizes data through desiredStateOfWorldPopulator and invokes the interfaces through Reconciler.
A note on who performs the Attach/Detach operations:
We mentioned earlier that AD Controller also does Attach/Detach, so who actually does it? This is decided by the kubelet flag "--enable-controller-attach-detach": if it is True, AD Controller is in charge; if it is False, Volume Manager does it.
Since it is a kubelet flag, it only controls the behavior of a single node. So if a cluster has 10 nodes and 5 of them set the flag to False, attach operations on those 5 nodes are performed by the Kubelet on the node, while on the other 5 nodes they are performed by AD Controller.
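As a sketch, the same switch can be set per node through the kubelet configuration file (the field below corresponds to the --enable-controller-attach-detach flag; whether you use the flag or the config file depends on how the kubelet is deployed):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# false: the kubelet (Volume Manager) on this node performs attach/detach itself
# true (the default): AD Controller in kube-controller-manager performs attach/detach
enableControllerAttachDetach: false
```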
The following figure is the logic diagram of the Volume Manager implementation.
We can see that the outermost layer is a loop, and inside it different objects are polled, including those in desiredStateOfWorld and actualStateOfWorld.
For example, it polls the MountedVolumes in actualStateOfWorld. If a volume there also exists in desiredStateOfWorld, both the actual and desired states are mounted, so nothing needs to be done. If it does not exist in desiredStateOfWorld, the desired state of the volume is Unmounted, so UnmountVolume is executed to bring its state in line with desiredStateOfWorld.
So we can see that the process is essentially a comparison of desiredStateOfWorld and actualStateOfWorld, followed by calls to the underlying interfaces to perform the corresponding operations; the handling of desiredStateOfWorld.UnmountVolumes and actualStateOfWorld.AttachedVolumes below follows the same pattern.
Volume Plugins
The PV Controller, AD Controller and Volume Manager we mentioned earlier actually do some PV and PVC management by calling the interfaces provided by Volume Plugin, such as Provision, Delete, Attach, Detach, etc. The specific implementation logic of these interfaces is placed in VolumePlugin.
According to the location of the source code, Volume Plugins can be divided into In-Tree and Out-of-Tree:
In-Tree means the source code lives inside the Kubernetes repository and is released, managed and iterated together with Kubernetes; its drawbacks are slow iteration and poor flexibility. Out-of-Tree Volume Plugins keep their code outside Kubernetes and are implemented by storage providers; at present there are two main mechanisms, Flexvolume and CSI, which allow different storage plugins to be implemented for different storage types. Therefore the Out-of-Tree approach is preferred.
From its position we can see that Volume Plugins is in effect a library called by PV Controller, AD Controller and Volume Manager, divided into In-Tree and Out-of-Tree plugins. Through these implementations it invokes remote storage; for example, mounting a NAS with "mount -t nfs ***" is implemented inside Volume Plugins, which mounts the remote storage locally.
In terms of types, Volume Plugins can be divided into many types. In-Tree contains dozens of common storage implementations, but some companies define their own private types and have their own API and parameters, which are not supported by public storage plug-ins, so storage implementations of Out-of-Tree classes, such as CSI and FlexVolume, are needed.
The specific implementation of Volume Plugins will be discussed later. Here's a look at the plug-in management of Volume Plugins.
Kubernetes does plugin management in PV Controller, AD Controller and Volume Manager through the VolumePluginMgr object, which mainly contains two data structures: Plugins and Prober.
Plugins is the object that holds the plugin list, while Prober is a probe for discovering new plugins. For example, FlexVolume and CSI are extension plugins that are created and registered dynamically, so they cannot be known in advance, which is why a probe is needed to discover new plugins.
The following figure shows the whole process of plug-in management.
At startup, PV Controller, AD Controller and Volume Manager execute an InitPlugins method to initialize VolumePluginMgr.
It first adds all In-Tree plugins to the plugin list. At the same time it calls the Prober's init method, which first calls InitWatcher to keep watching a directory (for example /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ in the figure). Whenever a new file appears in this directory, that is, a new plugin is installed, an FsNotify.Create event is generated and added to EventsMap; likewise, when a file is deleted, an FsNotify.Remove event is generated and added to EventsMap.
When the upper layer calls refreshProbedPlugins, Prober processes these events: a Create event adds the plugin to the plugin list, and a Remove event removes it from the list.
The above is the plug-in management mechanism of Volume Plugins.
Kubernetes storage volume scheduling
We said earlier that a Pod must be scheduled onto a worker node before it can run. When scheduling a Pod, various predicates are used for filtering, including several volume-related ones such as VolumeZonePredicate, VolumeBindingPredicate and CSIMaxVolumeLimitPredicate.
VolumeZonePredicate checks the Labels on the PV, such as the failure-domain.beta.kubernetes.io/zone label. If the label carries zone information, VolumeZonePredicate requires the node to be in the matching zone before the Pod can be scheduled there.
For example, the left side of the figure defines a label whose zone is cn-shenzhen-a; the PV on the right defines a nodeAffinity that describes the node Labels the PV expects, which is evaluated by VolumeBindingPredicate.
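A sketch of a PV carrying such zone information might look as follows; the driver name, volume ID and zone value are illustrative assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: disk-pv
  labels:
    failure-domain.beta.kubernetes.io/zone: cn-shenzhen-a   # checked by VolumeZonePredicate
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: diskplugin.csi.alibabacloud.com
    volumeHandle: d-xxxxxxxx
  nodeAffinity:                      # checked by VolumeBindingPredicate
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values: ["cn-shenzhen-a"]
```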
For details on storage volume scheduling, please refer to "Getting Started with K8s from Scratch | Application Storage and Persistence of Data Volumes: Storage Snapshots and Topology Scheduling", which covers it in more detail.
II. Introduction and use of Flexvolume
Flexvolume is an extension of Volume Plugins that mainly implements the Attach/Detach/Mount/Unmount interfaces. These functions were originally implemented inside Volume Plugins, but for some storage types we need to extend beyond what Volume Plugins provide, so the concrete implementation of the interfaces is moved outside.
In the figure below we can see that Volume Plugins contains part of the Flexvolume implementation code, but that part only acts as a "proxy".
For example, when AD Controller calls the Attach of a plugin, it first calls the Attach interface of Flexvolume inside Volume Plugins, and that interface simply forwards the call to the corresponding Out-Of-Tree Flexvolume implementation.
Flexvolume is an executable file invoked by Kubelet. Each call is like running a shell command such as ls: a command-line invocation of an executable, so it is not a resident daemon.
The stdout of Flexvolume is returned to Kubelet as the result of the call, and it needs to be in JSON format.
The default path of Flexvolume is "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/alicloud~disk/disk".
The following is an example of a command format and invocation.
Introduction to the interface of Flexvolume
Flexvolume includes the following interfaces:
Init: performs initialization, for example called when the plugin is deployed or upgraded; it returns a DriverCapabilities structure describing what the Flexvolume plugin supports;
GetVolumeName: returns the plugin name;
Attach: implements the attach function; the --enable-controller-attach-detach flag decides whether the attach is initiated by AD Controller or by Kubelet;
WaitForAttach: Attach is often asynchronous, so WaitForAttach waits until the attach is complete before the following operations proceed;
MountDevice: part of the mount. Here mount is split into two parts, MountDevice and SetUp; MountDevice does simple preprocessing, such as formatting the device and mounting it to the GlobalMount directory;
GetPath: returns the local mount directory corresponding to each Pod;
SetUp: bind-mounts the device under the GlobalPath into the Pod's local directory;
TearDown, UnmountDevice and Detach implement the reverse of the corresponding interfaces above;
ExpandVolumeDevice: expands the storage volume, called by Expand Controller;
NodeExpand: expands the file system, called by Kubelet.
All of the above interfaces do not necessarily need to be implemented. If an interface is not implemented, the returned result can be defined as:
{"status": "Not supported", "message": "error message"}
Tell the caller that the interface is not implemented. In addition, the Flexvolume interface in Volume Plugins provides some default implementations, such as Mount operations, in addition to acting as a Proxy. So if the interface is not defined in your Flexvolume, the default implementation will be called.
When defining a PV, you can pass secrets through the secretRef field; for example, the user name and password required for mounting can be passed through secretRef.
Mounting Analysis of Flexvolume
The Flexvolume workflow is analyzed from two sides: the mount process and the unmount process.
Let's first look at the Attach operation, which calls a remote API to attach the storage to a device on the target node. MountDevice then mounts that local device to the GlobalPath, possibly formatting it first. The Mount operation (SetUp) then mounts the GlobalPath onto the PodPath, which is the directory the Pod maps in when it starts.
The following figure shows an example. Suppose we have a cloud disk whose volume ID is d-8vb4fflsonz21h41cmss. After Attach and WaitForAttach, it is attached to the /dev/vdc device on the target node. After MountDevice, the device is formatted and mounted to a local GlobalPath. After Mount, the GlobalPath is mapped to a Pod-specific subdirectory. Finally, a Bind operation maps that local directory into the container. This completes the mount process.
The unmount process is the reverse. The above describes the mount of a block device; for file storage types there is no Attach or MountDevice, only Mount, so a Flexvolume implementation for a file system is simpler, requiring only the Mount and Unmount procedures.
Code example for Flexvolume
The main methods to implement are init(), doMount() and doUnmount(). When the script is executed, it inspects the arguments passed in to decide which command to run. There are many Flexvolume examples on GitHub that you can refer to; Aliyun also provides a Flexvolume implementation that you can look at if you are interested.
The use of Flexvolume
The following figure shows a PV template of type Flexvolume. It is no different from other templates except that the type is flexVolume, in which driver, fsType and options are defined.
driver identifies the driver we implemented, such as alicloud/disk or alicloud/nas in the figure; fsType defines the file system type, such as "ext4"; options carries the specific parameters, such as the ID of the cloud disk.
Like other types, we can also define filters through matchLabels in a selector, as well as scheduling information, for example a zone of cn-shenzhen-a.
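A minimal sketch of such a flexVolume PV is shown below; the option key volumeId and the label are assumptions, and the disk ID is taken from the example earlier in this section:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: flexvolume-disk-pv
  labels:
    failure-domain.beta.kubernetes.io/zone: cn-shenzhen-a   # optional scheduling label
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  flexVolume:
    driver: alicloud/disk                # which Flexvolume driver performs the mount
    fsType: ext4                         # file system type
    options:
      volumeId: d-8vb4fflsonz21h41cmss   # cloud disk ID; the option key is an assumption
```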
The following is a concrete running result. We mount a cloud disk inside a Pod; its local device is /dev/vdb. With mount | grep disk we can see the corresponding mounts: /dev/vdb is first mounted to the GlobalPath; then the GlobalPath is bind-mounted to a local Pod-specific subdirectory via the mount command; finally, that local subdirectory is mapped to /data in the container.
III. Introduction and use of CSI
Like Flexvolume, CSI is an abstract interface that provides a volume implementation for third-party storage.
Since we already have Flexvolume, why do we still need CSI? Flexvolume only serves Kubernetes as an orchestration system, whereas CSI can serve different orchestration systems, such as Mesos and Swarm.
Second, CSI is deployed in containers, which reduces environment dependencies, improves security, and enables richer plugin features. Flexvolume is a binary in the host's space, and executing Flexvolume is like executing a local shell command, so installing Flexvolume requires installing dependencies on the host, and those dependencies may affect the customer's applications. So in terms of security and environment dependencies it has drawbacks.
As for richer plugin features: when implementing an operator in the Kubernetes ecosystem, we often use RBAC to call Kubernetes APIs to realize certain functions, and those functions must run inside a container. A Flexvolume, being a binary in the host's space, cannot do this, whereas CSI, deployed as containers, can use RBAC to implement such functions.
CSI mainly consists of two parts: CSI Controller Server and CSI Node Server.
Controller Server is the control-plane part; it mainly implements creating, deleting, attaching and detaching volumes, while Node Server implements the Mount and Unmount operations on the node.
The following figure describes CSI interface communication: CSI Controller Server and the External CSI SideCars communicate through a Unix socket, and CSI Node Server and Kubelet also communicate through a Unix socket. We will talk about the External CSI SideCars later.
The following figure shows the interface of CSI. It is mainly divided into three categories: general management and control interface, node management and control interface and central management and control interface.
The general management and control interface mainly returns some general information of CSI, such as the name of the plug-in, the identity information of Driver, the capabilities provided by the plug-in, etc.
The NodeStageVolume and NodeUnstageVolume of the node controller interface are equivalent to MountDevice and UnmountDevice in Flexvolume; NodePublishVolume and NodeUnpublishVolume are equivalent to the SetUp and TearDown interfaces.
CreateVolume and DeleteVolume of the central controller interface are the interfaces for Provisioning and Deleting storage volumes, while ControllerPublishVolume and ControllerUnpublishVolume are the interfaces for Attach and Detach respectively.
The system structure of CSI
CSI is implemented through CRD-style objects, so CSI introduces several object types: VolumeAttachment, CSINode and CSIDriver, plus the implementations of CSI Controller Server and CSI Node Server.
On the CSI Controller Server side there are the traditional AD Controller and Volume Plugins inside Kubernetes; the VolumeAttachment objects are created by them.
In addition, there are several External Plugin components, each of which implements one function when combined with CSI Plugin. For example:
External Provisioner combined with Controller Server creates and deletes data volumes; External Attacher combined with Controller Server attaches and detaches data volumes; External Resizer combined with Controller Server expands data volumes; External Snapshotter combined with Controller Server creates and deletes snapshots.
CSI Node Server mainly works with the Kubelet components, including Volume Manager and Volume Plugin, which call CSI Plugin to perform mount and unmount operations; the other component, Driver Registrar, implements the registration of the CSI Plugin.
This is the entire topology of CSI, and we will introduce different objects and components next.
CSI object
We will introduce three kinds of objects: VolumeAttachment, CSIDriver and CSINode.
VolumeAttachment describes the attach and detach information of a volume in use by a Pod. For an attachment of a volume on a node, we track the attach/detach through a VolumeAttachment object. AD Controller creates the VolumeAttachment, and External-attacher watches it and performs attach and detach operations according to its status.
The following figure is an example of a VolumeAttachment. Its kind is VolumeAttachment; in spec, attacher is ossplugin.csi.alibabacloud.com, that is, who performs the attach; nodeName is cn-zhangjiakou.192.168.1.53, the node on which the attach happens; and source.persistentVolumeName is oss-csi-pv, specifying which data volume is attached or detached.
attached in status indicates the attach state; if it is False, External-attacher will perform the attach operation.
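Reconstructed as YAML, the VolumeAttachment described above would look roughly like this (the object name is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: csi-oss-attachment-example           # illustrative name
spec:
  attacher: ossplugin.csi.alibabacloud.com   # who performs the attach
  nodeName: cn-zhangjiakou.192.168.1.53      # the node on which the attach happens
  source:
    persistentVolumeName: oss-csi-pv         # which data volume is attached/detached
status:
  attached: false                            # false: External-attacher will perform the attach
```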
The second object is CSIDriver, which describes the CSI Plugins deployed in the cluster; it needs to be created by the administrator according to the plugin type.
For example, some CSI Drivers have been created in the figure below. With kubectl get csidriver we can see three CSI Drivers in the cluster: one for cloud disk, one for NAS, and one for OSS.
In a CSIDriver we define its name, and in spec we define two fields: attachRequired and podInfoOnMount.
attachRequired defines whether a plugin supports the Attach function; it mainly distinguishes block storage from file storage. For example, file storage does not need an Attach operation, so the field is set to False. podInfoOnMount defines whether Kubernetes passes Pod information to the plugin when calling the Mount interface.
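A sketch of a CSIDriver object for a block-storage plugin is shown below; the driver name matches the cloud-disk plugin used elsewhere in this article, and the apiVersion may differ depending on the cluster version:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: diskplugin.csi.alibabacloud.com
spec:
  attachRequired: true     # block storage needs Attach; a NAS/file-storage driver would set false
  podInfoOnMount: true     # pass Pod information to the plugin on Mount calls
```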
The third object is CSINode, which records node information in the cluster and is created by node-driver-registrar at startup. Its purpose is that, after each new CSI Plugin is registered, a new entry is added to the node's CSINode information.
For example, the following figure shows a CSINode list, and each CSINode carries specific information (the YAML on the left). Take cn-zhangjiakou.192.168.1.49 as an example: it contains a CSI Driver for cloud disk and a CSI Driver for NAS, each with its own nodeID and its topology information topologyKeys. If there is no topology information, topologyKeys can be set to "null". In other words, in a cluster of 10 nodes, CSINode entries may exist only for the nodes where a CSI plugin has been registered.
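Reconstructed as a sketch, the CSINode object for that node might look as follows; the nodeID values and topology key names are illustrative assumptions:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: cn-zhangjiakou.192.168.1.49
spec:
  drivers:
  - name: diskplugin.csi.alibabacloud.com     # cloud-disk CSI driver
    nodeID: i-xxxxxxxxxxxx                    # illustrative node ID
    topologyKeys:
    - topology.diskplugin.csi.alibabacloud.com/zone
  - name: nasplugin.csi.alibabacloud.com      # NAS CSI driver
    nodeID: cn-zhangjiakou.192.168.1.49
    topologyKeys: null                        # no topology information
```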
Node-Driver-Registrar of CSI components
Node-Driver-Registrar mainly implements a mechanism for CSI Plugin registration. Let's take a look at the flow chart in the following figure.
Step 1: at startup there is a convention, for example, every new file added under the /var/lib/kubelet/plugins_registry directory represents a newly added plugin.
When Node-Driver-Registrar starts, it first calls the GetPluginInfo interface on CSI-Plugin, which returns the address CSI listens on and the Driver name of the CSI-Plugin;
Step 2: Node-Driver-Registrar listens on the GetInfo and NotifyRegistrationStatus interfaces;
Step 3: a socket is started under the /var/lib/kubelet/plugins_registry directory, generating a socket file such as "diskplugin.csi.alibabacloud.com-reg.sock". When Kubelet discovers this socket through its Watcher, it calls Node-Driver-Registrar's GetInfo interface through the socket. GetInfo returns the CSI-Plugin information obtained above to Kubelet, including the CSI-Plugin's listening address and its Driver name;
Step 4: Kubelet calls CSI-Plugin's NodeGetInfo interface through the obtained listening address;
Step 5: after the call succeeds, Kubelet updates some state, such as the node's Annotations, Labels and status.allocatable, and creates a CSINode object;
Step 6: Kubelet calls Node-Driver-Registrar's NotifyRegistrationStatus interface to tell it that the CSI-Plugin has been registered successfully.
The CSI Plugin registration mechanism is implemented through the above six steps.
External-Attacher of CSI components
External-Attacher mainly implements attaching and detaching data volumes by calling the CSI Plugin's interfaces. It decides what to do by watching the VolumeAttachment object, which is created by AD Controller calling the CSI Attacher in the In-Tree Volume Plugins; in other words, that part is done by Kubernetes itself.
When the attached status of a VolumeAttachment is False, External-Attacher performs the attach through the underlying ControllerPublishVolume interface; when detaching is expected, it performs the detach through the underlying ControllerUnpublishVolume interface. External-Attacher also synchronizes some PV information.
CSI deployment
Let's now take a look at the deployment of block storage.
The CSI plugin mentioned earlier is divided into two parts: the Controller Server Pod and the Node Server Pod.
Only one Controller Server needs to be deployed; for high availability two can be deployed. The Controller Server is realized mainly through external plugins: a Pod can contain several External Containers plus one container running CSI Controller Server, and different External components combined with Controller Server provide the different functions.
The Node Server Pod, on the other hand, is a DaemonSet registered on every node. Kubelet communicates with CSI Node Server directly through a socket and calls Attach/Detach/Mount/Unmount and so on.
Driver Registrar only provides the registration function and is deployed on every node.
The deployment of file storage is similar to that of block storage; it simply omits the Attacher, and there are no VolumeAttachment objects.
Example of using CSI
As with Flexvolume, let's look at its definition template.
As you can see, it is not very different from other definitions; the main difference is that the type is csi, in which driver, volumeHandle, volumeAttributes, nodeAffinity and so on are defined.
driver defines which plugin is used to mount; volumeHandle uniquely identifies the PV; volumeAttributes carries additional parameters, for example, for an OSS-backed PV you can put the bucket, endpoint address and other information in volumeAttributes; nodeAffinity can define scheduling information. As with Flexvolume, binding conditions can also be defined through selector and Labels.
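A minimal sketch of such a csi-type PV follows; the volume ID, file system type and topology key are illustrative assumptions:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: csi-disk-pv
spec:
  capacity:
    storage: 20Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: diskplugin.csi.alibabacloud.com   # which plugin performs the mount
    volumeHandle: d-xxxxxxxx                  # uniquely identifies the volume behind this PV
    fsType: ext4
    volumeAttributes:                         # extra parameters, e.g. bucket/endpoint for OSS
      zone: cn-shenzhen-a
  nodeAffinity:                               # optional scheduling information
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.diskplugin.csi.alibabacloud.com/zone
          operator: In
          values: ["cn-shenzhen-a"]
```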
The middle diagram shows an example of dynamic provisioning, which works the same way as for other volume types, except that the provisioner specified in the StorageClass is the CSI driver.
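As a sketch, such a StorageClass simply names the CSI driver as its provisioner; the class name and parameter are assumptions:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk
provisioner: diskplugin.csi.alibabacloud.com   # the CSI driver that will provision PVs
parameters:
  type: cloud_ssd                              # illustrative driver-specific parameter
reclaimPolicy: Delete
```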
A specific mount example is given below.
After the Pod starts, we can see that /dev/vdb has been mounted to /data inside the Pod. As before, there is a GlobalPath and a PodPath on the node: /dev/vdb is first mounted to the GlobalPath, the directory uniquely determined on this node for this CSI PV; the PodPath is the node-local directory determined per Pod, and that directory is mapped into the container.
Other functions of CSI
In addition to mounting and unmounting, CSI provides some extra features. For example, user names and passwords are often needed when defining a template; these can be passed through a Secret. Flexvolume, mentioned earlier, also supports this, but CSI can define different Secrets for different phases, such as a Secret for the Attach phase, a Secret for the Mount phase, and a Secret for the Provision phase.
Topology is topology awareness. When we define a data volume, not every node in the cluster can satisfy its requirements; for example, the volume can only be mounted in a particular zone. That is what the topology-aware feature addresses. This part was covered in detail in Lecture 10, which you can refer to.
Block Volume concerns the volumeMode definition, which can be either Block or file system. CSI supports Block-type volumes, meaning the volume appears inside the Pod as a block device rather than a directory.
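A sketch of requesting a raw block device: both the PVC and the PV use volumeMode: Block, and the Pod consumes it through volumeDevices instead of volumeMounts (names and device path are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-block-pvc
spec:
  volumeMode: Block                  # expose the volume as a block device, not a file system
  accessModes: ["ReadWriteOnce"]
  storageClassName: csi-disk
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: block-pod
spec:
  containers:
  - name: app
    image: nginx
    volumeDevices:                   # a block volume appears as a device inside the container
    - name: data
      devicePath: /dev/xvda
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: disk-block-pvc
```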
Skip Attach and PodInfo On Mount are the two functions of CSI Driver that we just talked about.
Recent Features of CSI
CSI is still a relatively new implementation and has seen many recent updates. For example, ExpandCSIVolumes enables file system expansion; VolumeSnapshotDataSource enables snapshots of data volumes; VolumePVCDataSource allows a PVC to be defined as a data source; and whereas CSI could previously only be used through PV and PVC rather than directly as a Volume in a Pod, CSIInlineVolume allows a CSI driver to be defined directly in a Volume.
Aliyun has open-sourced its CSI implementation on GitHub; if you are interested, you can take a look and use it as a reference.
IV. Summary of this article
This article mainly introduces the knowledge about storage volumes in Kubernetes cluster, including the following three points:
The first part describes the Kubernetes storage architecture, including the concept of storage volume, mount process, system components and other related knowledge; the second part describes the implementation principle, deployment architecture and use examples of the Flexvolume plug-in; the third part describes the implementation principle, resource objects, functional components and use examples of the CSI plug-in.
I hope these points are helpful, especially for the design, development and troubleshooting of storage volumes.