Author | Senior R&D Engineer at Alibaba
First, basic knowledge

Background of storage snapshots
When using storage, we usually need the ability to snapshot online data and to restore it quickly, in order to improve fault tolerance for data operations. Snapshots are also useful when online data needs to be copied or migrated quickly, for example for environment replication or data development. In Kubernetes, the storage snapshot feature is implemented by the CSI Snapshotter controller.
Storage snapshot user interface: Snapshot
We know that Kubernetes simplifies the use of storage for users through the PVC and PV system, and the design of storage snapshots is modeled on the same idea. When a user needs a storage snapshot, they declare a VolumeSnapshot object and specify the corresponding VolumeSnapshotClass; the relevant components in the cluster then dynamically create the storage snapshot and the VolumeSnapshotContent object that represents it. The process of dynamically generating a VolumeSnapshotContent is very similar to the process of dynamically generating a PV.
Storage snapshot user interface: Restore
Once you have a storage snapshot, how do you restore the snapshot data quickly? This is done through the PVC object: its dataSource field can be set to a VolumeSnapshot object. When such a PVC is submitted, the relevant components in the cluster find the snapshot data that dataSource points to, create new storage and a corresponding PV, and restore the snapshot data into that new PV. In this way the data is recovered; this is the restore usage of storage snapshots.
Topology: meaning
First, let's look at what topology means here: it is a "location" relationship defined over the nodes managed by a Kubernetes cluster, and a node declares which topology domain it belongs to by filling in its labels. Three kinds of topology are commonly encountered in practice:
First, when using cloud storage services we often encounter the concept of region. In Kubernetes it is usually identified by the label failure-domain.beta.kubernetes.io/region, which marks which region a node belongs to when a single cluster manages nodes across regions.

Second, and more commonly used, is the availability zone. In Kubernetes it is usually identified by the label failure-domain.beta.kubernetes.io/zone, which marks which availability zone a node belongs to when a single cluster spans multiple zones.

The third is the hostname, i.e. the single-machine dimension, where the topology domain is the scope of a single node. It is identified by the label kubernetes.io/hostname and will be discussed in detail when we talk about Local PV at the end of the article.
The three topologies above are the most commonly used, but a topology domain can also be user-defined. You can define a string as the key of a topology domain, and the values under that key then represent the different topology locations within that domain.

For example, you could use the rack, i.e. the rack in the machine room, as a topology domain, so that machines on different racks are marked as being at different topology locations. For the machines on rack 1, add a rack label to the node with the value rack1; the machines on another rack can be labeled rack=rack2. In this way the location of nodes in Kubernetes can be distinguished along the rack dimension, as sketched below.
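A minimal sketch of what such labeling could look like; the label key rack and the node names are hypothetical, and in practice the label would normally be applied with kubectl label rather than by editing the Node objects directly:

apiVersion: v1
kind: Node
metadata:
  name: node-1
  labels:
    rack: rack1    # this machine sits on rack 1
---
apiVersion: v1
kind: Node
metadata:
  name: node-2
  labels:
    rack: rack2    # this machine sits on rack 2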
Next, let's take a look at the use of topology in K8s storage.
Background of storage topology scheduling
As mentioned in the previous lesson, Kubernetes manages storage and compute resources separately through the PV and PVC system. A PV, however, can carry an "access location" restriction, that is, it uses nodeAffinity to specify which nodes can access it. Why is such a restriction on the access location needed?
Because in Kubernetes the creation of a pod and the creation of a PV can essentially proceed in parallel, there is no guarantee that the node the pod eventually runs on can access the storage behind a PV that carries a location restriction, and the pod may then fail to run. Here are two classic examples:
First, the Local PV example. A Local PV encapsulates local storage on a node as a PV, and the local storage is then accessed through that PV. Why is Local PV needed? Simply put, the PV/PVC system was initially used mainly for distributed storage, which depends on the network. Some workloads have very high performance requirements that access over the network may not satisfy; in that case local storage, which avoids network overhead, often delivers higher performance. Local storage also has drawbacks: distributed storage can guarantee high availability through multiple replicas, while with local storage the application itself must use a protocol like Raft to achieve multi-replica high availability.
Next, let's take a look at the Local PV scenario. What might be the problem if there is no "access location" restriction on PV?
After the user submits the PVC, the Kubernetes PV controller might bind it to the PV on node2, but the pod that actually uses this PV might be scheduled to node1. The pod then cannot use the storage once it starts, because the data it needs actually lives on node2.
The second scenario (which will cause problems if you do not restrict the "access location" of PV):
Suppose the nodes managed by a Kubernetes cluster are distributed across multiple availability zones in a single region. When storage is created dynamically, it may be created in availability zone 2, but the pod that uses it may later be scheduled to availability zone 1, and then it cannot use that storage. For example, Alibaba Cloud disks (block storage) cannot be used across availability zones: if the storage is created in zone 2 but the pod runs in zone 1, the pod cannot use it. This is the second common problem scenario.
Next, let's take a look at how to solve the above problem through storage topology scheduling in K8s.
Storage topology scheduling
Let's first summarize the two problems above: in both cases, at the time a PV is bound to a PVC or dynamically created, we do not yet know which node the pod that will use it will be scheduled to, yet the PV itself can only be used from nodes in a particular topology location. In the Local PV scenario, the pod must be scheduled to the node specified by the PV before it can use that PV; in the cross-availability-zone scenario, the pod must be scheduled to a node in the same availability zone as the created storage before it can use an Alibaba Cloud disk. How does Kubernetes solve this?
Simply put, Kubernetes delays the binding of PV and PVC and the dynamic creation of PV until the pod scheduling result is known, and only then performs these two operations. What is the advantage of that?
First, for pre-provisioned PVs such as Local PV: if the PVC of the pod that will use the PV has not been bound yet, the scheduler can consider both the pod's compute resource requirements (such as CPU and memory) and its PVC requirements during scheduling, so that the selected node not only satisfies the compute resource needs but also allows the pod's PVC to be bound to a PV whose nodeAffinity the node satisfies. Second, for dynamically provisioned PVs: once the node the pod will run on is known, the PV can be created according to the topology information recorded on that node, which guarantees that the new PV is in the same topology location as the node. In the Alibaba Cloud disk example above, once we know the pod will run in availability zone 1, the storage can also be created in availability zone 1.
To implement the delayed binding and delayed creation of PV described above, three components in Kubernetes need to change:

The first is the PV controller, also known as the persistent volume controller, which must support the delayed binding operation. The second is the component that dynamically provisions PVs: once the pod scheduling result is known, it creates the PV according to the topology information of the pod's node. The third, and the one with the most important changes, is kube-scheduler. When selecting a node for a pod it must consider not only the pod's CPU and memory requirements but also its storage requirements: based on the pod's PVCs, it checks whether the candidate node satisfies the nodeAffinity of the PVs that can match those PVCs; and for dynamic provisioning, it checks whether the node satisfies the topology restrictions declared in the StorageClass. This ensures that the node finally chosen by the scheduler satisfies the topology restrictions of the storage itself.
This is the knowledge of storage topology scheduling.
Second, use case walkthrough

Let's walk through the basics from the first part using YAML examples.
Volume Snapshot/Restore example
Let's take a look at how to use storage snapshots. First, a cluster administrator creates a VolumeSnapshotClass object in the cluster. The important field in a VolumeSnapshotClass is the one that names the volume plugin, i.e. the CSI driver, used to actually create storage snapshots (snapshotter in older API versions, driver in newer ones); this plugin must be deployed in advance and will be discussed later. A sketch of such an object is given below.
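This is a minimal sketch written against the snapshot.storage.k8s.io/v1 API, which postdates the original article, so field names may differ from the original figure; the driver value is a placeholder for whichever CSI driver is deployed:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: disk-snapshotclass
# names the CSI plugin that actually takes the snapshot (placeholder value)
driver: diskplugin.csi.alibabacloud.com
deletionPolicy: Delete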
Next, when a user wants to take an actual storage snapshot, they declare a VolumeSnapshot object. The VolumeSnapshot first specifies the VolumeSnapshotClassName, and then a very important field, source, which specifies the data source of the snapshot. Here the source name is disk-pvc, meaning the snapshot is taken through that PVC object. After the VolumeSnapshot object is submitted, the relevant components in the cluster find the PV bound to that PVC and take a snapshot of it.
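A sketch of such a VolumeSnapshot, again against the v1 snapshot API, reusing the names from the text (disk-snapshot, disk-snapshotclass, disk-pvc):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: disk-snapshot
spec:
  volumeSnapshotClassName: disk-snapshotclass
  source:
    # the PVC whose bound PV will be snapshotted
    persistentVolumeClaimName: disk-pvc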
Once you have a storage snapshot, how do you use it to restore data? This is actually quite simple: declare a new PVC object and, in the dataSource under its spec, declare which VolumeSnapshot the data comes from; here the disk-snapshot object is specified. When this PVC is submitted, the relevant components in the cluster dynamically create a new PV, and the data in that new PV comes from the storage snapshot.
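A sketch of such a restore PVC; the size and the StorageClass name are assumptions for illustration:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc
spec:
  # restore from the snapshot taken above
  dataSource:
    name: disk-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi            # assumed size, at least as large as the snapshot's source volume
  storageClassName: csi-disk   # assumed dynamic-provisioning StorageClass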
Example of Local PV
Let's look at a YAML example of a Local PV.
Most Local PVs are created statically, that is, a PV object is declared first. Because a Local PV can only be accessed locally, the PV object must declare, via nodeAffinity, that access to this PV is restricted to a single node; in other words, a topology restriction is added to the PV. In the sketch below the topology key is kubernetes.io/hostname, meaning the PV can only be accessed on node1, so a pod that wants to use this PV must be scheduled to node1.
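A minimal Local PV sketch; the capacity and the local path are assumptions:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 10Gi              # assumed size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1      # assumed local path on node1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1        # the PV can only be accessed from node1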
Since the PV is created statically, why is a storageClassName needed here? As mentioned earlier, for Local PV to work properly it needs the delayed binding feature. With delayed binding, after the user submits the PVC, even if a matching PV already exists in the cluster, it must not be bound immediately; that is, the PV controller cannot bind right away, and we need a way to tell it so. The StorageClass here serves exactly that purpose. Its provisioner is set to no-provisioner, which tells Kubernetes that no PV will be created dynamically; what really matters is the StorageClass's volumeBindingMode field, set to WaitForFirstConsumer, which can be simply understood as delayed binding, as in the sketch below.
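A sketch of such a StorageClass; the name local-storage is an assumption matching the PV sketch above:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
# no-provisioner: no PV will be created dynamically for this class
provisioner: kubernetes.io/no-provisioner
# delay binding until a pod using the PVC has been scheduled
volumeBindingMode: WaitForFirstConsumer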
When the user submits the PVC and the PV controller sees it, the controller finds the corresponding StorageClass, notices that the binding mode is delayed binding, and does nothing for the moment.

Later, when a pod that uses this PVC is scheduled to a node that satisfies the PV's nodeAffinity, the pod's PVC is actually bound to the PV. This guarantees that the PVC is bound to the PV only after the pod has been scheduled to the node, and ultimately that the created pod can access the Local PV. This is how the topology restriction of a PV is satisfied in the static provisioning scenario.
Example: topology restrictions for dynamically provisioned PVs
Now let's look at dynamically provisioned PVs: how are topology restrictions specified when the PV itself is created dynamically?

First, the StorageClass still needs to specify the volumeBindingMode WaitForFirstConsumer, i.e. delayed binding.
The second particularly important field is allowedTopologies, which is where the restriction is expressed. In this example the topology restriction is at the availability-zone level, and it actually carries two meanings:

The first is that the dynamically created PV must be accessible within this availability zone; the second is that, because delayed binding is declared, the scheduler, once it sees that the pod's PVC uses this StorageClass, will only pick nodes located in this availability zone.

In short, two things must be guaranteed: the dynamically created storage must be accessible from this availability zone, and the scheduler must place the pod on a node in this availability zone, so that the storage and the node of the pod that uses it end up in the same topology domain. The user writes the PVC the same way as before; the topology restriction is mainly expressed in the StorageClass, for example:
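A sketch of such a StorageClass with an availability-zone restriction; the provisioner name and the parameters are assumptions modeled on the Alibaba Cloud disk case in the text:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk
provisioner: diskplugin.csi.alibabacloud.com   # assumed CSI driver name
parameters:
  type: cloud_ssd                              # assumed driver-specific parameter
volumeBindingMode: WaitForFirstConsumer        # delayed binding
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - cn-hangzhou-d                      # PVs must be created in, and accessible from, this zone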
Third, operation demonstration
This section will demonstrate what was explained earlier in an online environment.
First, take a look at the Kubernetes cluster built on my Alibaba Cloud servers. There are three nodes in total: one master node and two worker nodes, and pods cannot be scheduled to the master.

I have also deployed the plugins I need in advance: one is the snapshot plugin (csi-external-snapshot), the other is the dynamic cloud disk plugin (csi-disk).
Now let's start the snapshot demonstration. First we dynamically create a cloud disk, and then we can snapshot it. Dynamically creating a cloud disk means first creating a StorageClass, then dynamically creating a PV through a PVC, and then creating a pod that uses it, roughly as sketched below.
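A sketch of the PVC and pod used in this step; the object names, size and image are hypothetical, and the StorageClass is assumed to be a dynamic cloud-disk class like the one sketched earlier:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi             # assumed size
  storageClassName: csi-disk    # assumed dynamic cloud-disk StorageClass
---
apiVersion: v1
kind: Pod
metadata:
  name: disk-pod
spec:
  containers:
    - name: app
      image: nginx              # placeholder image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: disk-pvc     # use the dynamically provisioned cloud disk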
With those objects in place, we can take a snapshot. First look at the first configuration file needed for the snapshot, snapshotclass.yaml.

It specifies the plugin to be used when taking storage snapshots, which is the plugin we just saw deployed, i.e. csi-external-snapshot-0.

Next, create the VolumeSnapshotClass from that file; after that we can take the snapshot.

Then look at snapshot.yaml: the VolumeSnapshot declaration creates a storage snapshot, and it specifies the PVC we just created as the data source. Let's create it.

Let's check whether the snapshot has been created: it was created 11 seconds ago.

You can look at its contents, mainly the information recorded in the VolumeSnapshotContent: after the snapshot was taken, it recorded the snapshot ID returned by the cloud storage vendor, as well as the snapshot's data source, i.e. the PVC specified just now, through which the corresponding PV can be found.

That is roughly the snapshot demo. Now delete the VolumeSnapshot we just created, and we can see that the dynamically created VolumeSnapshotContent is deleted as well.
Next, let's look at dynamically creating a PV with topology restrictions. First create the StorageClass and look at the restrictions in it: the StorageClass specifies volumeBindingMode WaitForFirstConsumer, i.e. delayed binding, and also carries topology restrictions; an availability-zone-level restriction is configured in the allowedTopologies field.

The PVC created next should theoretically be in the Pending state. Checking it, it is indeed Pending: because it uses delayed binding and no pod uses it yet, it cannot be bound and no new PV can be dynamically created.

Next, create a pod that uses this PVC and see what happens: the pod is also Pending.

Let's see why the pod is Pending by looking at the scheduling failure messages: one node cannot be scheduled because of a taint (that is the master), and the other two nodes report that they have no PV to bind.

Why do the two nodes report that there is no PV to bind? Isn't the PV supposed to be created dynamically?
Looking more closely at the topology restriction in the StorageClass, we know from the explanation above that a PV created with this StorageClass must be accessible in the availability zone cn-hangzhou-d, and a pod using that storage must also be scheduled to a node in cn-hangzhou-d.

Let's check whether the nodes carry this topology information; it certainly cannot work without it.

Looking at the full information of the first node, mainly its labels, we can see that the key is indeed there, that is, the node does carry this topology, but its value is cn-hangzhou-b, while the StorageClass specified cn-hangzhou-d.

The topology information on the other node also says cn-hangzhou-b, but the restriction in our StorageClass is cn-hangzhou-d.

As a result, the pod cannot be scheduled to either of the two nodes. Now let's modify the topology restriction in the StorageClass and change cn-hangzhou-d to cn-hangzhou-b.

After the modification, the meaning becomes: the PV I create dynamically must be accessible from availability zone cn-hangzhou-b, and the pod using this storage must be scheduled to a node in that zone. Delete the previous pod and let it be rescheduled to see the result: it is now scheduled successfully and is in the container-starting phase.

This shows that after the availability-zone restriction in the StorageClass was changed from cn-hangzhou-d to cn-hangzhou-b, the two nodes in the cluster now match the topology required by the StorageClass, so the pod has schedulable nodes. As the pod finally reaches Running, the modified topology restriction works.
Fourth, processing flow

Kubernetes Volume Snapshot/Restore processing flow
Next, let's look at the specific processing flow of storage snapshots and topology scheduling inside Kubernetes.
First, the storage snapshot processing flow. Let's explain the CSI part first: in Kubernetes, all storage extension functionality is recommended to be implemented out-of-tree via CSI.

A CSI-based storage extension mainly consists of two parts:

The first part is the CSI controller part driven by the Kubernetes community, i.e. the csi-snapshotter controller and the csi-provisioner controller here, which are the generic controller parts; the other part is the csi-plugin implemented by each cloud storage vendor against its own OpenAPI, also called the driver part of the storage.

The two parts communicate over a Unix domain socket; only together do they form a complete storage extension.
When the user submits a VolumeSnapshot object, the csi-snapshotter controller watches it and then calls the csi-plugin over gRPC; the csi-plugin calls the vendor's OpenAPI to actually take the storage snapshot. Once the snapshot has been created, the result is returned to the csi-snapshotter controller, which writes the snapshot information into a VolumeSnapshotContent object and binds it to the VolumeSnapshot submitted by the user. This binding is a bit like the binding of PV and PVC.
With a storage snapshot in hand, how do you use it to restore the earlier data? As mentioned before, you declare a new PVC object and specify its dataSource as the VolumeSnapshot object. When that PVC is submitted, the csi-provisioner watches it and then creates the storage over gRPC. The difference from the csi-provisioner flow explained earlier is that the Snapshot ID is also specified, so when the cloud vendor creates the storage it takes one more step: restoring the earlier snapshot data into the newly created storage. The flow then returns to csi-provisioner, which writes the information about the new storage into a new PV object; the PV controller watches the new PV and binds it to the PVC submitted by the user, and the pod can then use the restored data through the PVC. This is how storage snapshots are processed in Kubernetes.
Kubernetes Volume Topology-aware Scheduling processing flow
Next, take a look at the processing flow of storage topology scheduling:
The first step is to declare delayed binding, which is done through StorageClass, which has been described above, so I won't elaborate on it here.
Next, let's look at the scheduler and the newly added storage topology scheduling logic. First, the general flow of the scheduler selecting a node for a pod, without the storage-related additions:

After the user submits a pod, the scheduler watches it and runs the predicate (filtering) phase, matching every node in the cluster against the resources the pod needs; every node that matches is usable. Usually more than one node qualifies, so a batch of nodes is selected. The second phase, priority (scoring), scores these nodes and picks the best match. The scheduler then writes the scheduling result into the pod's spec.nodeName field, the kubelet on that node watches the pod, and the process of actually creating the pod begins.
Now let's look at how nodes are filtered (the second step above) once volume-related scheduling is added.

First, find all the PVCs used by the pod and split them into PVCs that are already bound and PVCs that need delayed binding. For a PVC that is already bound, check whether the nodeAffinity of its PV matches the topology of the current node; if not, the pod cannot be scheduled to this node; if it matches, continue. Then look at the PVCs that need delayed binding: first fetch the existing PVs in the cluster that satisfy the PVC's requirements, and match them one by one against the topology in the current node's labels. If none of the existing PVs match, check further whether a dynamically created PV could satisfy the topology restrictions, i.e. check the topology restrictions declared in the StorageClass against the topology in the current node's labels. If they match, the node is usable; if not, the node cannot be scheduled.
After these steps we have found all the nodes that satisfy both the pod's compute resource requirements and its storage requirements. Once the node is chosen, the third step is an optimization inside the scheduler: after the predicate and priority phases, it updates the pod's node information as well as some cached information about PVs and PVCs kept in the scheduler.

The fourth step is also important: once a node has been chosen for the pod, whether its PVC should be bound to an existing PV or a PV should be created dynamically, the operation can start. The scheduler triggers it by updating the relevant information in the PVC and PV objects, which in turn triggers the PV controller to perform the binding, or csi-provisioner to perform the dynamic creation.
To summarize: this article explained the Kubernetes resource objects and usage of storage snapshots by analogy with the PVC and PV system; then, through problems encountered in two real scenarios, it showed why the storage topology scheduling feature is needed and how Kubernetes solves these problems through topology scheduling; finally, by analyzing the internal mechanisms of storage snapshots and storage topology scheduling in Kubernetes, we gained a deeper understanding of how these features work.
The Alibaba Cloud Native WeChat official account (ID: Alicloudnative) focuses on technology areas such as microservices, Serverless, containers and Service Mesh, on popular cloud native technology trends, and on large-scale cloud native adoption practices, aiming to be the official account that best understands cloud native developers.