How do Kubernetes scheduling and resource management work? Many newcomers are not clear about this. To help solve the problem, this article explains both in detail; readers with this need can follow along, and hopefully you will gain something from it.
Kubernetes scheduling process
First, let's look at the first part: the Kubernetes scheduling process. As shown in the figure below, a simple Kubernetes cluster architecture consists of a kube-ApiServer, a set of webhook Controllers, the default scheduler kube-Scheduler, and two physical machine nodes, Node1 and Node2, each running a kubelet.
Let's walk through what the scheduling process looks like when a pod is submitted to this Kubernetes cluster.
Suppose we have written a yaml file describing the pod, the orange circle pod1 in the figure below, and we submit this yaml file to kube-ApiServer.
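Since the figure itself is not reproduced here, the following is a minimal sketch of what such a pod1 yaml might look like (the container name and image are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1            # the "orange circle" pod in the original figure
spec:
  containers:
  - name: app
    image: nginx:1.25   # hypothetical image; any workload image works
```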
At this point, ApiServer first routes the creation request to the webhook Controllers for validation.
After validation passes, ApiServer creates the pod in the cluster. At this point the pod's nodeName is empty and its phase is Pending. Once the pod is created, both kube-Scheduler and kubelet watch for pod creation events. When kube-Scheduler sees that the pod's nodeName is empty, it treats the pod as unscheduled.
It then pulls the pod into its own scheduling queue. After running a series of scheduling algorithms, including filtering and scoring, the scheduler selects the most suitable node and writes that node's name into the pod's spec, completing one round of scheduling.
At this point, the nodeName in the pod's spec has been updated to Node1. The kubelet on Node1 then observes through its watch that the pod now belongs to its own node.
It then brings the pod up on that node, creating the containers along with their storage and network. Once all resources are ready, kubelet updates the pod status to Running, and a complete scheduling process is finished.
Having walked through one scheduling process, we can summarize it in a sentence: scheduling is the work of placing a pod on an appropriate node.
The keyword here is "appropriate". What counts as appropriate? There are a few well-defined criteria:
First, the pod's resource requirements should be met.
Second, the pod's special relationship requirements should be satisfied.
Third, the node's restrictions must be respected.
Finally, the resources of the whole cluster should be used rationally.
After achieving the above requirements, we can consider that we have put the pod on a suitable node.
Next I'll show you how Kubernetes meets these pod and node requirements.
Basic scheduling capabilities of Kubernetes
The basic scheduling capabilities of Kubernetes are introduced in two parts:
The first part is resource scheduling: the basic Resources configuration methods of Kubernetes, the concept of QoS, and the concept and use of ResourceQuota.
The second part is relational scheduling, which covers two relationship scenarios:
The relationship between pods: how to make a pod affine to another pod, and how to make pods mutually exclusive.
The relationship between pods and nodes: how to make a pod affine to a node, and how to restrict certain pods from being scheduled onto certain nodes.
How to meet Pod resource requirements: Pod resource configuration
The figure above shows a demo pod spec. Resource requirements are filled in the pod spec, specifically in the resources field of the containers.
Resources consists of two parts:
The first part is requests.
The second part is limits.
The contents of the two parts are exactly the same, but their meanings differ: request expresses the resources the pod is guaranteed as a baseline, while limit caps the amount of resources the pod may use. Both request and limit are map structures that can be filled with key/value pairs for different resources.
The basic resources can be roughly divided into four categories:
The first category is CPU resources.
The second category is memory.
The third category is ephemeral-storage, which is temporary storage.
The fourth category is general extended resources, such as GPU.
Take CPU as an example: filling in 2, as in the example above, requests two CPUs. This can also be written as 2000m, a milli notation used when the CPU requirement is fractional; for example, 0.2 CPU can be written as 200m. Memory and storage, on the other hand, use a binary notation: as shown on the right side of the figure above, a request for 1GB of memory can also be written as 1024Mi, which expresses the memory requirement more precisely.
For extended resources, Kubernetes requires the amount to be an integer, so we cannot request 0.5 GPU; we can only request 1 GPU, 2 GPU, and so on.
This covers how to request basic resources.
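As a hedged sketch of the configuration just described (the image and the nvidia.com/gpu resource name are assumptions; the GPU line only applies on nodes that expose that extended resource):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx:1.25            # hypothetical image
    resources:
      requests:
        cpu: "2"                 # 2 CPUs, equivalent to 2000m
        memory: 1Gi              # 1 GB of memory in binary notation (1024Mi)
        ephemeral-storage: 2Gi   # temporary storage
        nvidia.com/gpu: "1"      # extended resources must be integers
      limits:
        cpu: "2"
        memory: 1Gi
        ephemeral-storage: 2Gi
        nvidia.com/gpu: "1"
```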
Next, I'll tell you in detail what the difference is between request and limit, and how to represent QoS through request/limit.
Pod QoS types
K8s provides two fields for filling in pod resources: the first is request and the second is limit.
Together they let users define the elasticity of a Pod. For example, we can set request to 2 CPUs and limit to 4 CPUs, which means the pod is guaranteed 2 CPUs, but when the node is idle it may use up to 4 CPUs.
Speaking of elasticity, we have to mention the concept of QoS. What is QoS? QoS stands for Quality of Service; it is the standard Kubernetes uses to express a pod's quality of service in terms of resources. Kubernetes provides three QoS classes:
The first is Guaranteed, the highest QoS class, generally assigned to pods that need guaranteed resources.
The second is Burstable, a medium QoS class, generally configured for pods that want elasticity.
The third is BestEffort, the lowest QoS class. As the name suggests, it is best-effort quality of service: K8s does not promise to guarantee the service quality of such Pods.
One inconvenient aspect of K8s is that users cannot directly specify which QoS class their pod belongs to; instead, the class is mapped automatically from the combination of request and limit.
In the example in the figure above, if the spec shown is submitted successfully, Kubernetes automatically adds qosClass: Guaranteed to the status. The QoS level cannot be declared at submission time, which is why this is called implicit QoS class assignment.
Pod QoS configuration
Next, let's talk about how we can determine the QoS level we want through the combination of request and limit.
Guaranteed Pod
How do we create a Guaranteed Pod first?
Kubernetes has a requirement here: to create a Guaranteed Pod, the basic resources (CPU and memory) must have request == limit; other resources may differ. Only under this condition is the resulting pod a Guaranteed Pod; otherwise it is a Burstable or BestEffort Pod.
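A minimal sketch of such a Guaranteed Pod, assuming a hypothetical image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo
spec:
  containers:
  - name: app
    image: nginx:1.25    # hypothetical image
    resources:
      requests:
        cpu: "2"
        memory: 1Gi
      limits:
        cpu: "2"         # request == limit for CPU and memory => Guaranteed
        memory: 1Gi
```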
Burstable Pod
Next, let's see how to create a Burstable Pod. The scope of Burstable Pods is broad: as long as the request of CPU/Memory is not equal to the limit, the pod is Burstable.
For example, as in the example above, you can fill in only the CPU resource and leave memory empty; such a pod is already a Burstable Pod.
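For instance, a sketch of a Burstable Pod that fills in only CPU (image name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-demo
spec:
  containers:
  - name: app
    image: nginx:1.25    # hypothetical image
    resources:
      requests:
        cpu: "1"         # request set, limit not equal (or memory omitted) => Burstable
      limits:
        cpu: "2"
```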
BestEffort Pod
The third kind is the BestEffort Pod. Its condition is rigid: the request/limit of every resource must be left empty for the pod to be a BestEffort Pod.
So you can see that different Pod QoS classes are obtained through different combinations of request and limit.
How different QoS classes behave
Next, let me introduce the differences between QoS classes in scheduling and in low-level behavior. In scheduling, the scheduler only uses request: no matter how large the limit is, it is not used for scheduling.
At the lower level, the QoS classes also behave differently. Take CPU as an example: CPU time is divided by request-based weights, and the requests of different QoS classes differ greatly. Burstable and BestEffort pods may fill in a very small request or none at all, so their time-slice weights are very low; a BestEffort pod's weight may be only 2, while for Burstable or Guaranteed pods the weight can reach several thousand.
In addition, when the kubelet feature cpu-manager-policy=static is enabled, a Guaranteed pod whose CPU request is an integer, for example 2, gets dedicated CPUs bound to it. In the example below, CPU0 and CPU1 are assigned to the Guaranteed Pod.
For non-integer Guaranteed pods and for Burstable/BestEffort pods, the CPUs are pooled into a shared CPU pool. For example, if the node has 8 cores and 2 of them have been dedicated to the integer-request Guaranteed pod, the remaining 6 cores, CPU2~CPU7, are shared by the non-integer Guaranteed, Burstable, and BestEffort pods, which divide the time slices of those 6 cores according to their different weights.
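As a hedged sketch, the static CPU manager policy mentioned above is typically enabled through the kubelet configuration file (the file path and the CPU reservation shown here are assumptions; the flag form --cpu-manager-policy=static is equivalent):

```yaml
# /var/lib/kubelet/config.yaml (path is an assumption)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static   # Guaranteed pods with integer CPU requests get dedicated cores
kubeReserved:
  cpu: "1"                 # illustrative: some CPU reservation is typically set alongside the static policy
```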
Memory is also handled differently: the OOMScore is assigned according to QoS. Guaranteed Pods get a fixed default OOMScore of -998; Burstable Pods get an OOMScore between 2 and 999, calculated from the ratio of the Pod's memory request to the node's memory; BestEffort Pods get a fixed OOMScore of 1000. The higher the OOMScore, the more likely the pod is to be killed first when the physical machine hits OOM.
Eviction on the node also treats QoS classes differently: when eviction happens, BestEffort pods are evicted first. So the low-level behavior of the different QoS classes varies considerably. This in turn requires us to configure the Requests and Limits of resources according to the requirements and attributes of each business in production, so that QoS classes are planned reasonably.
Resource Quota
We also encounter this scenario in production: when a cluster is shared by multiple people or multiple businesses at the same time, we need to limit the total amount that one business or one person can submit, to prevent a single business from consuming all of the cluster's resources and leaving none for the others.
Kubernetes provides a capability for this called ResourceQuota. It can limit resource usage per namespace.
The specific usage is shown in the yaml on the right side of the figure: its spec contains a hard section and a scopeSelector. The hard content is very similar to a Resource list and can contain the basic resources; it is a little richer than a Resource list, though, and can also contain pods, which limits the number of Pods. In addition, scopeSelector provides richer indexing capabilities for the ResourceQuota.
For example, in the example above, non-BestEffort pods are indexed, with a cpu limit of 1000, a memory limit of 200G, and a pod limit of 10.
Besides NotBestEffort, scopeName offers a richer range of scopes, including Terminating/NotTerminating, BestEffort/NotBestEffort, and PriorityClass.
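Since the yaml in the figure is not reproduced, the following sketch matches the example described above; the object and namespace names are hypothetical:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-quota
  namespace: demo-ns              # the namespace being limited (hypothetical)
spec:
  hard:
    cpu: "1000"                   # total CPU limit for the indexed pods
    memory: 200Gi                 # the 200G memory limit from the description
    pods: "10"                    # at most 10 pods
  scopeSelector:
    matchExpressions:
    - operator: Exists
      scopeName: NotBestEffort    # only index pods that are not BestEffort
```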
When such a ResourceQuota is applied to the cluster and a user then tries to use more resources than allowed, the behavior is as follows: on submitting the Pod spec, the user receives a 403 Forbidden error with the message exceeded quota, and can no longer submit resources beyond the quota.
Submitting resources that are not covered by the ResourceQuota still succeeds.
This is the basic use of ResourceQuota in Kubernetes. We can use ResourceQuota to limit the amount of resources used in each namespace, thus ensuring that resources remain available to other users.
Summary: how to meet the Pod resource requirements?
The above covered the use of basic resources, that is, how to meet Pod resource requirements. Let's summarize:
Configure reasonable resource requirements for Pods.
CPU / Memory / EphemeralStorage / GPU
Choose different QoS classes for different business characteristics through Request and Limit:
Guaranteed: sensitive businesses that need resource guarantees
Burstable: sub-sensitive businesses that need elasticity
BestEffort: tolerant, best-effort businesses
Configure a ResourceQuota for each namespace to prevent overuse and ensure resources remain available to others
How to meet the requirements of Pod-Pod relationship?
Next, let's introduce the relational scheduling of Pods, starting with the relationship between Pods. In daily use we may encounter scenarios such as: a Pod must be placed together with another Pod, or must not be placed together with another Pod.
Under this requirement, Kubernetes provides two types of capabilities:
The first type of capability is Pod affinity scheduling: PodAffinity.
The second type is Pod anti-affinity scheduling: PodAntiAffinity.
Pod affinity scheduling
First, let's look at Pod affinity scheduling. If I want to put one Pod together with another Pod, I can, as in the example in the figure above, fill in podAffinity and then specify a required requirement.
In this example, the pod must be scheduled onto a node that already runs a Pod with the label key k1, and the topology granularity is the node: placement is decided node by node. If such a node can be found, scheduling succeeds; if no node in the cluster runs such a Pod, or the matching nodes do not have enough resources, scheduling fails. This is strict affinity scheduling, which we call required (mandatory) affinity scheduling.
Sometimes we do not need such a strict scheduling policy. In that case, required can be changed to preferred, turning it into preferred affinity scheduling: the scheduler preferentially places the pod on a node that runs a Pod with key k2. preferred is a list, so multiple conditions can be filled in, for example weight 100 for key k2 and weight 10 for key k1. When scheduling, the scheduler preferentially assigns the Pod to the node that satisfies the condition with the higher weight.
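Since the figure is not shown here, the following sketch combines the required and preferred forms just described; the label keys/values (k1/v1, k2/v2) and image follow the example and are otherwise hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-demo
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k1                  # must land on a node already running a pod labeled k1
            operator: In
            values: ["v1"]
        topologyKey: kubernetes.io/hostname   # node-level granularity
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: k2                # prefer nodes running pods labeled k2
              operator: In
              values: ["v2"]
          topologyKey: kubernetes.io/hostname
  containers:
  - name: app
    image: nginx:1.25                # hypothetical image
```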
Pod anti-affinity scheduling
Having introduced affinity scheduling, anti-affinity scheduling is similar: the function is reversed but the syntax is basically the same. podAffinity is simply replaced with podAntiAffinity, which likewise includes required mandatory anti-affinity and preferred priority anti-affinity.
Here are two examples: one forbids scheduling onto nodes that run Pods with the key k1 label, and the other preferentially avoids nodes that run Pods with the key k2 label.
Besides the In operator, Kubernetes also provides richer operator combinations: In/NotIn/Exists/DoesNotExist. The examples above use In; in the first, forced anti-affinity example it means scheduling is forbidden onto nodes that run Pods with the key k1 label.
The same function can also be expressed with Exists, whose matching range may be wider than In. When the operator is Exists, no values need to be entered: the pod is kept away from any node that runs a Pod with the label key k1, regardless of the label's value.
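A sketch of this forced anti-affinity with the Exists operator (same hypothetical labels as above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: anti-affinity-demo
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k1
            operator: Exists         # any pod carrying the k1 label, regardless of value
        topologyKey: kubernetes.io/hostname
  containers:
  - name: app
    image: nginx:1.25                # hypothetical image
```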
This is the scheduling of the relationship between Pod and Pod.
How to satisfy the scheduling relationship between Pod and Node
The scheduling relationship between Pods and Nodes is also called Node affinity scheduling. Two usage methods are introduced here.
NodeSelector
The first is NodeSelector, a relatively simple usage. For example, suppose a Pod must be scheduled onto a Node carrying the label k1: v1; a nodeSelector requirement can then be filled into the Pod's spec. nodeSelector is essentially a map structure into which the label requirements for the node, such as k1: v1, are written directly. The Pod is then forcibly scheduled onto Nodes with the k1: v1 label.
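A minimal sketch of this nodeSelector usage (image name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nodeselector-demo
spec:
  nodeSelector:
    k1: v1                # pod is forced onto nodes labeled k1=v1
  containers:
  - name: app
    image: nginx:1.25     # hypothetical image
```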
NodeAffinity
NodeSelector is a very simple usage, but it has a problem: it can only express mandatory affinity. If I merely want to prefer certain nodes, nodeSelector cannot do it. So the Kubernetes community added a new mechanism called NodeAffinity.
Similar to PodAffinity, it also provides two types of scheduling strategies:
The first category is required, which must be dispatched to a certain type of Node.
The second category is preferred, which gives priority to dispatching to a certain type of Node.
Its basic syntax is similar to PodAffinity and PodAntiAffinity above, but NodeAffinity provides richer operators than PodAffinity: Gt and Lt are added for numerical comparison. When using Gt, the values can only contain numbers.
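A sketch of NodeAffinity using both required and preferred terms, including the Gt operator; the node label keys (cpu-count, disktype) are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nodeaffinity-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cpu-count           # hypothetical node label
            operator: Gt
            values: ["8"]            # only nodes whose cpu-count label is greater than 8
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: disktype            # hypothetical node label
            operator: In
            values: ["ssd"]
  containers:
  - name: app
    image: nginx:1.25                # hypothetical image
```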
Node taints and tolerations
There is a third way to control scheduling: by marking a Node, we can restrict which Pods may be scheduled onto it. Kubernetes calls these marks Taints, which literally means stains.
So how do we restrict Pods from being scheduled onto certain Nodes? For example, suppose there is a node called demo-node that has a problem and we want to restrict Pods from being scheduled onto it. We can apply a taint to the node; a taint consists of a key, a value, and an effect:
Key is the key of the taint.
Value is its content.
Effect describes the taint's behavior.
There are currently three taints behaviors in Kubernetes:
NoSchedule forbids new Pods from being scheduled onto the node.
PreferNoSchedule tries to avoid scheduling new Pods onto the node.
NoExecute evicts Pods that do not have a matching toleration and also refuses to schedule new ones. This policy is very strict, so it should be used carefully.
As shown in the green part of the figure above, the taint k1=v1 with effect NoSchedule is applied to demo-node. The effect is that newly created Pods that do not specifically tolerate this taint cannot be scheduled onto this node.
What if some Pods should still be able to be scheduled onto this node? Then a toleration is added to the Pod. As shown in the blue part of the figure above, a tolerations entry is added to the Pod's spec; it also contains key, value, and effect, and these three values correspond exactly to those of the taint: whatever key, value, and effect appear in the taint should also appear in the toleration.
Tolerations also has an Operator field with two values: Exists and Equal. Equal means a value must be provided and must match, while Exists, like the NodeAffinity operator mentioned above, does not require a value: as long as the key matches, the toleration is considered to match the taint.
In the example in the figure above, a toleration is added to the Pod. Only Pods carrying this toleration can be scheduled onto the Node with the taint shown in the green part. The advantage is that a Node can selectively accept some Pods rather than all Pods, thereby restricting which Pods can land on which Nodes.
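A sketch of the taint and the matching toleration described above; the node object is shown in manifest form, although the same taint is usually applied with kubectl taint nodes demo-node k1=v1:NoSchedule:

```yaml
# Taint on the node (equivalent to: kubectl taint nodes demo-node k1=v1:NoSchedule)
apiVersion: v1
kind: Node
metadata:
  name: demo-node
spec:
  taints:
  - key: k1
    value: v1
    effect: NoSchedule
---
# Pod that tolerates the taint and can therefore still be scheduled onto demo-node
apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo
spec:
  tolerations:
  - key: k1
    operator: Equal      # Equal requires the value to match; Exists would match any value of k1
    value: v1
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.25    # hypothetical image
```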
Summary
We have finished introducing the special relationship and conditional scheduling of Pod/Node. Let's make a summary.
First, if there are requirements between Pods, for example a Pod has an affinity or mutual-exclusion relationship with another Pod, you can configure them with the following parameters:
PodAffinity
PodAntiAffinity
If there is an affinity between Pod and Node, you can configure the following parameters:
NodeSelector
NodeAffinity
If certain Nodes need to restrict which Pods can be scheduled onto them, such as faulty Nodes or Nodes dedicated to special businesses, you can configure the following parameters:
Node-Taints
Pod-Tolerations
Advanced scheduling capabilities of Kubernetes
After introducing the basic scheduling capabilities, let's take a look at the advanced scheduling capabilities.
Priority scheduling
Priority scheduling and preemption, the main concepts are:
Priority
Preemption
First, recall the four criteria mentioned in the scheduling process. How can the cluster be used rationally? When cluster resources are sufficient, a reasonable usage pattern can be achieved with the basic scheduling capabilities alone. But what if resources are not sufficient? There are two common strategies:
First-come-first-served (FIFO) strategy-simple, relatively fair, quick to use
Priority strategy (Priority)-more in line with the daily business characteristics of the company
In actual production, first-come-first-served turns out to be an unfair strategy, because a company's businesses inevitably include both high-priority and low-priority ones; the priority strategy therefore fits daily business characteristics better than first-come-first-served.
Next, let's introduce priority scheduling under the priority strategy. For example, suppose a Node is already occupied by a Pod and the Node has only two CPUs. When another, higher-priority Pod arrives, the low-priority Pod should hand those two CPUs over to the higher-priority Pod, and either go back to the waiting queue or be resubmitted by the business. Such a process is priority preemptive scheduling.
In Kubernetes, PodPriority and Preemption, the priority and preemption features, became stable in v1.14. And the PodPriority and Preemption functions are enabled by default.
How to use priority scheduling configuration?
How to use priority scheduling? You need to create a priorityClass, and then configure a different priorityClassName for each Pod, thus completing the configuration of priority and priority scheduling.
Let's first take a look at how to create a priorityClass. Two demos are defined on the right side of the figure above:
One creates a priorityClass called high, a high priority with a value of 10000.
The other creates a priorityClass called low, with a value of 100.
Then, in the third part, Pod1 is configured with priorityClassName high and Pod2 with priorityClassName low. The blue part shows where this goes in the pod's spec: fill in priorityClassName: high. Once the Pods and priorityClasses are configured, priority scheduling is enabled for the cluster.
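A sketch of the two priorityClass demos and the Pod configuration described above (container image is hypothetical):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 10000            # high priority
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 100              # low priority
globalDefault: false
---
apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  priorityClassName: high   # Pod1 uses the high priority class; Pod2 would use "low"
  containers:
  - name: app
    image: nginx:1.25        # hypothetical image
```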
Built-in priority configuration
Of course, Kubernetes also has default priorities built in. For example, if no default PriorityClass is configured in the cluster (DefaultPriorityWhenNoDefaultClassExists), all Pods without an explicit priority get a value of 0.
The maximum priority a user can configure is limited to HighestUserDefinablePriority = 1000000000 (1 billion), which is less than the system-level priority SystemCriticalPriority = 2000000000 (2 billion).
There are two system-level priorities built in:
system-cluster-critical
system-node-critical
This is the priority configuration built into K8S priority scheduling.
Priority scheduling process
Here is a simple priority scheduling process:
First, we introduce the process that only triggers priority scheduling without triggering preemption.
Suppose there are a Pod1 and a Pod2: Pod1 is configured with high priority and Pod2 with low priority. Both are submitted to the scheduling queue.
When the scheduler processes the queue, it first selects the high-priority Pod1 and, through the scheduling process, binds Pod1 to Node1.
It then selects the low-priority Pod2, runs the same process, and binds it to Node1.
This completes a simple process of priority scheduling.
Priority preemption process
What kind of process would it be if a high-priority Pod had no resources when scheduling?
The scenario is the same as above, except that a Pod0 has been placed on Node1 in advance and occupies some resources. Pod1 and Pod2 are still to be scheduled, and Pod1 has a higher priority than Pod2.
If Pod2 is scheduled first, it is bound to Node1 through the usual scheduling process.
Then Pod1 is scheduled. Because there are already two Pods on Node1 and the resources are insufficient, scheduling fails.
When scheduling fails, Pod1 enters the preemption process, and the nodes of the whole cluster are filtered. The Pod finally selected to be preempted is Pod2, so the scheduler removes Pod2 from Node1.
Pod1 is then scheduled onto Node1. This completes the preemptive scheduling process.
Priority preemption strategy
Next, we introduce the specific preemption strategy and preemption process.
The right side of the figure shows the whole priority preemption flow of kube-scheduler. First, when a Pod enters preemption, the scheduler determines whether the Pod is eligible to preempt, since it may already have preempted once recently. If it is eligible, the scheduler first filters all nodes, keeping the nodes that meet the preemption requirements and discarding the rest.
Next, from the nodes that passed the filter, it selects suitable nodes to preempt. This preemption process simulates a scheduling pass: it first removes the lower-priority Pods on a node and then tries to place the preempting Pod there. A batch of candidate nodes is selected this way, and the flow moves on to ProcessPreemptionWithExtenders. This is an extension hook where users can add their own node preemption policies; if there is no extension hook, nothing happens here.
The next step, PickOneNodeForPreemption, selects the most suitable node from the selectNodeForPreemption candidate list according to a certain strategy, briefly described on the left side of the figure above:
First, prefer the node that violates the fewest PDBs.
Second, prefer the node where the highest priority among the Pods to be preempted is the lowest.
Third, prefer the node where the sum of the priorities of the Pods to be preempted is the lowest.
Next, prefer the node with the fewest Pods to be preempted.
Finally, prefer the node whose victim Pods have the latest start time.
After these five policies are applied in sequence, the most suitable node is selected. The Pods to be preempted on that node are then deleted, completing one round of preemption.