These are learning notes on the Kubernetes scheduler. They first describe what the scheduler does through a few scenarios, then analyze the scheduling and preemption process from the source code, hopefully giving readers a simple and practical way to understand it.
Brief introduction
Kubernetes is a powerful orchestration tool that makes it easy to manage a large number of machines. Improving the machines' resource utilization while spreading load across them as evenly as possible is the scheduler's job.
The Kubernetes scheduler is policy-rich and topology-aware, takes workload-specific requirements into account, and has a significant impact on availability, performance, and capacity.
To make better use of it, these notes analyze and study it comprehensively from the source code's point of view.
The scheduler does not have many functions, but its logic is complex and it has to weigh many factors. Summed up, it does the following:
Leader election: ensures that only one scheduler instance in the cluster is actually working while the rest stand by as highly available backups. The endpoint kube-scheduler is used as the arbitration resource.
Node filtering: matches all the Nodes that satisfy the configured conditions, resource requests, and so on.
Optimal Node selection: among all the Nodes that meet the criteria, they are scored according to the defined rules and the one with the highest score is chosen. If several Nodes tie, round-robin is used.
Preemption for high-priority workloads: the scheduler is allowed to delete some low-priority Pods to free resources for a higher-priority Pod.
Function description
The code can look intimidating, so here are several scenarios that describe how the scheduler works:
1. Environment: assume 3 machines, each with 8C16G.
Scenario 1: resource allocation, the most basic function
2. First schedule a Pod A that requests 2C4G.
Scenario 2: load balancing across machines, the scoring mechanism
3. Schedule another Pod B that requests 2C4G. Although node1 still has enough free resources for B, node2 and node3 have more free resources and therefore score higher, so B is placed on node2.
4. By the same logic, if another Pod C is scheduled, it goes to node3 first.
Scenario 3: resource preemption, the priority mechanism
5. Each of the three nodes now has 2C4G allocated, leaving 6C12G free. If I now schedule a Pod D that requests 8C12G at the same priority, D cannot be placed anywhere and stays Pending, because none of the three machines has enough free resources.
6. If I instead give D a high priority, the scheduler deletes a Pod on one of the machines, say A, assigns D to node1, and then reschedules A onto node2 or node3 if they have enough resources. (Which machine is chosen makes little difference here, because all three are identical.)
7. The following is a hands-on test of the scheduler's preemption process:
I have a Deployment with three replicas, spread over two machines. (This example is chosen to show that preemption must happen on 10-10-40-89, because that node needs the fewest Pods deleted.)
Then I create a high-priority Deployment:
Querying quickly, you can observe the following stages:
In the first step, the pending testpc-745cc7867-fqbp2 is marked as the "nominated Pod" (this name will come up again), and the testpod on the original 10-10-40-89 is deleted. Because the capture was a little slow, the new testpod in the figure below has already been created on 10-10-88-99.
In the second step, the nominated Pod is assigned to the corresponding node (it waits for the Pod in Terminating state to release its resources).
In the third step, once resources are sufficient, the Pod is Running normally.
Finally, here are the events captured with watch:
The two yaml files used for the test are as follows:
testpod.yaml:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    k8s-app: testpod
  name: testpod
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: testpod
  template:
    metadata:
      labels:
        k8s-app: testpod
    spec:
      containers:
      - image: nginx:1.17
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
          name: nginx
          protocol: TCP
        resources:
          requests:
            cpu: 1
            memory: 2Gi
testpc.yaml:
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000000
globalDefault: false
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  labels:
    k8s-app: testpc
  name: testpc
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: testpc
  template:
    metadata:
      labels:
        k8s-app: testpc
    spec:
      containers:
      - image: nginx:1.17
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
          name: nginx
          protocol: TCP
        resources:
          requests:
            cpu: 6
            memory: 2Gi
      priorityClassName: high-priority
Scenario 4: affinity and anti-affinity
When placing a Pod, the scheduler weighs many factors; affinity and anti-affinity are among the most commonly used, so here is a typical example.
For example, in the figure above, the new Pod D must not run on the same machine as A, has an anti-affinity weight of 100 against B, and an anti-affinity weight of 10 against C. That means D and A can never share a machine, D should avoid sharing a machine with B or C as much as possible, and if there really is no other option (not enough resources), D would rather share a machine with C.
Example:
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: security
          operator: In
          values:
          - S2
      topologyKey: kubernetes.io/hostname
These four scenarios give a preliminary picture of what the scheduler does. For a more comprehensive and deeper understanding, we need to start from its source code, which the rest of this article analyzes.
Code analysis: the overall structure of the scheduler
The scheduler mostly runs with its default configuration; the figure lists the configuration loading process, which essentially loads those defaults.
Server.Run contains the main logic and is explained in more detail later.
Important configuration explanation
Two config settings from the figure deserve separate mention:
1 、 disablePreemption:
The scheduler has a preemption feature: when a Pod cannot be scheduled because no resources are available, the scheduler deletes Pods with a lower priority than that Pod to free resources for it. disablePreemption defaults to false, meaning preemption is enabled; set it to true if preemption needs to be disabled.
2. Since priority has come up, here is how to set it.
Kubernetes has a dedicated priority resource called PriorityClass; the following yaml creates one.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
You can then associate the PriorityClass with the Pod:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
That completes the Pod's priority setting. If nothing is set, all Pods default to the same priority (0).
One special note:
Static Pods are special: you need to set the priority field on them directly, because the kubelet judges them by priority.
Scheduler startup process
An in-depth look at server.Run reveals the following process:
Server.Run itself still does part of the configuration work.
In schedulerConfig, two large blocks are loaded from the default parameters: the predicate functions and the priority functions.
A predicate function checks whether a Pod can be placed on a Node at all.
A priority function picks the best candidate: when more than one Node can host the Pod, the Nodes are scored by the priority functions and the Pod is finally dispatched to the Node with the highest score.
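To make this two-phase idea concrete, here is a minimal Go sketch of how filtering and scoring fit together. The types and function names are invented for illustration and are not the real kube-scheduler API:

// Illustrative types and names only, not the real kube-scheduler code.
package sketch

type Node struct{ Name string }
type Pod struct{ Name string }

// A predicate answers "can this pod run on this node at all?";
// a priority function answers "how good is this node for this pod?".
type Predicate func(p Pod, n Node) bool
type Priority func(p Pod, n Node) int

// schedule filters the nodes with every predicate, scores the survivors
// with every priority function, and returns the best node (ok=false if
// no node fits).
func schedule(p Pod, nodes []Node, preds []Predicate, prios []Priority) (Node, bool) {
    var feasible []Node
    for _, n := range nodes {
        fits := true
        for _, pred := range preds {
            if !pred(p, n) {
                fits = false
                break
            }
        }
        if fits {
            feasible = append(feasible, n)
        }
    }
    if len(feasible) == 0 {
        return Node{}, false
    }
    best, bestScore := feasible[0], -1
    for _, n := range feasible {
        score := 0
        for _, prio := range prios {
            score += prio(p, n)
        }
        if score > bestScore {
            best, bestScore = n, score
        }
    }
    return best, true
}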
Kubernetes provides the following functions by default:
Predicate:
1 、 CheckNodeConditionPredicate
We really don't want to check predicates against unschedulable nodes.
Checks the Node's status: whether it is in a schedulable state, and so on.
-> iterate over all the conditions of the Node in nodeInfo:
if the condition type is Ready and its status is not True, the node is considered NotReady;
if the condition type is OutOfDisk and its status is not False, the node is considered OutOfDisk;
if the condition type is NetworkUnavailable and its status is not False, the node is considered NetworkUnavailable.
Then check the Node's spec: if it is marked Unschedulable, the node is considered unschedulable.
If all of the above checks pass, a match is returned.
2 、 PodFitsHost
We check the pod.spec.nodeName.
Checks whether pod.spec.nodeName matches.
-> if the Pod does not specify a NodeName, a match is returned.
Otherwise compare the Node's name with the one the Pod specifies: if they are the same, the match succeeds, otherwise it returns: nodeName does not match.
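A minimal sketch of this check, using simplified stand-in types rather than the real Kubernetes structures:

package sketch

type pod struct {
    NodeName string // corresponds to pod.spec.nodeName; empty means "any node"
}

type node struct {
    Name string
}

// podFitsHost returns true when the pod either names no node at all
// or names exactly this node.
func podFitsHost(p pod, n node) bool {
    if p.NodeName == "" {
        return true
    }
    return p.NodeName == n.Name
}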
3 、 PodFitsHostPorts
We check ports asked on the spec.
Checks whether the requested host ports are already occupied.
-> if the required podPorts are already defined in the predicate metadata, they are taken from there directly; otherwise the required ports are collected from all of the Pod's containers.
If no ports are required, a match is returned.
The ports currently in use are taken from nodeInfo; if there is a conflict, it returns that the ports do not match, otherwise the match succeeds.
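A simplified sketch of the port-conflict idea (the real predicate compares protocol, host IP, and host port together; here only the port number is used for brevity):

package sketch

func podFitsHostPorts(wantedPorts []int, usedPorts map[int]bool) bool {
    // A pod that requests no host ports always fits.
    if len(wantedPorts) == 0 {
        return true
    }
    for _, p := range wantedPorts {
        if usedPorts[p] {
            return false // port already occupied on this node
        }
    }
    return true
}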
4 、 PodMatchNodeSelector
Check node label after narrowing search.
Checks whether the labels match.
-> if the Pod defines a NodeSelector, the Node's labels are matched against it; if they do not match, NodeSelectorNotMatch is returned.
If NodeAffinity is defined in the Pod's Affinity, the node affinity is checked:
if no requiredDuringSchedulingIgnoredDuringExecution is defined, a match is returned directly;
if requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms is defined and any term matches, it is a match; otherwise it is a mismatch.
Special cases: a nil nodeSelectorTerms does not match; a non-nil but empty slice does not match either; likewise, if MatchExpressions inside a term is nil or an empty slice, that term does not match.
5 、 PodFitsResources
This one comes here since it's not restrictive enough as we do not try to match values but ranges.
-> first check whether the Node's allowedPodNumber would be exceeded; if so, record an over-limit error (errors are not returned immediately here; they are all returned together once the checks are done).
Check whether podRequest and ignoredExtendedResources are defined in the predicate metadata, and if so, take them from there. Otherwise compute them from the Pod's containers: first sum the resources requested by all regular containers, then look at each initContainer, and if it requests more of a resource than that sum, use the larger value as the required amount.
If all required resources are 0, the check result is returned.
Get the Node's allocatable resources and check whether the newly requested resources plus the already requested resources exceed them; if so, record that resources are insufficient.
Check all of the Pod's extended resources, determine whether each one needs to be checked (ignoredExtendedResources), and if it does, check whether there is enough of it; if not, record the failure.
Return the check result (if nothing failed, the check succeeds).
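The request calculation and the fit check can be sketched as follows; the struct and field names are simplifications for illustration, not the real scheduler types:

package sketch

type resources struct {
    MilliCPU int64
    Memory   int64
}

func max64(a, b int64) int64 {
    if a > b {
        return a
    }
    return b
}

// effectiveRequest implements "sum over regular containers, then raise to
// the largest single initContainer request".
func effectiveRequest(containers, initContainers []resources) resources {
    var req resources
    for _, c := range containers {
        req.MilliCPU += c.MilliCPU
        req.Memory += c.Memory
    }
    for _, ic := range initContainers {
        req.MilliCPU = max64(req.MilliCPU, ic.MilliCPU)
        req.Memory = max64(req.Memory, ic.Memory)
    }
    return req
}

// podFitsResources checks the new request against what is still allocatable
// after already-scheduled pods are accounted for.
func podFitsResources(req, allocatable, alreadyRequested resources) bool {
    return alreadyRequested.MilliCPU+req.MilliCPU <= allocatable.MilliCPU &&
        alreadyRequested.Memory+req.Memory <= allocatable.Memory
}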
6 、 NoDiskConflict
Following the resource predicate, we check disk.
-> iterate over all volumes in the Pod and over all Pods already on the Node to check for volume conflicts:
if the Pod uses none of the checked volume types (GCE PD, AWS EBS, RBD, ISCSI), the check passes.
7 、 PodToleratesNodeTaints
Check toleration here, as node might have toleration.
-> check whether the Pod tolerates the Node's taint environment:
Inputs: the tolerations defined in the Pod, the taints on the Node, and a filter rule: only taints whose effect is NoSchedule or NoExecute are considered.
If the Node has no taints, a match is returned.
Iterate over all taints and skip any taint that does not satisfy the filter rule.
For each remaining taint, iterate over all tolerations to see whether any of them tolerates it. The check works as follows:
if the toleration's effect is empty, this part passes; otherwise the effects must be equal;
if the toleration's key is empty, this part passes; otherwise the keys must be equal;
if the operator is Exists, the check passes; if it is empty or Equal, the values must be equal; anything else fails.
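A sketch of the toleration matching described above, with simplified structs; the operator semantics follow the three rules just listed:

package sketch

type taint struct {
    Key, Value, Effect string
}

type toleration struct {
    Key, Operator, Value, Effect string
}

// tolerates reports whether a single toleration covers a single taint.
func tolerates(t taint, tol toleration) bool {
    // Empty effect in the toleration matches every effect; otherwise they must be equal.
    if tol.Effect != "" && tol.Effect != t.Effect {
        return false
    }
    // Empty key matches every key; otherwise keys must be equal.
    if tol.Key != "" && tol.Key != t.Key {
        return false
    }
    switch tol.Operator {
    case "Exists":
        return true
    case "", "Equal":
        return tol.Value == t.Value
    default:
        return false
    }
}

// podToleratesNodeTaints requires every NoSchedule/NoExecute taint to be
// tolerated by at least one toleration.
func podToleratesNodeTaints(taints []taint, tolerations []toleration) bool {
    for _, t := range taints {
        if t.Effect != "NoSchedule" && t.Effect != "NoExecute" {
            continue // other effects are ignored by this predicate
        }
        tolerated := false
        for _, tol := range tolerations {
            if tolerates(t, tol) {
                tolerated = true
                break
            }
        }
        if !tolerated {
            return false
        }
    }
    return true
}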
8 、 PodToleratesNodeNoExecuteTaints
Check toleration here, as node might have toleration.
-> the check is similar to the above, except that the filter rule changes: only taints whose effect is NoExecute are considered.
9 、 CheckNodeLabelPresence
Labels are easy to check, so this one goes before.
-> checks whether certain labels exist, regardless of their values; a label can be required to be either present or absent.
This check is only initialized in scheduler.CreateFromConfig(policy) and registered via RegisterCustomFitPredicate; it is not enabled by default.
10 、 checkServiceAffinity
-> checks service affinity.
If a Pod of some service has been scheduled to a Node with the label region=foo, then later Pods of the same service will also be dispatched to Nodes with that label.
11 、 MaxPDVolumeCountPredicate
-> checks whether the number of attached volumes exceeds the limit. Only EBS (39), GCE PD (16), and AzureDisk (16) are supported.
12 、 VolumeNodePredicate
-> none
13 、 VolumeZonePredicate
-> checks the storage zone:
Check whether the Node has the label failure-domain.beta.kubernetes.io/zone or failure-domain.beta.kubernetes.io/region; if so, check the Pod's storage.
Iterate over the storage the Pod requires:
Get the PVC by name and take the PV name it is bound to. If there is no PV name (meaning the PVC is not bound yet), get the PVC's StorageClassName; if binding is still in progress, skip the check, otherwise return a match failure (because the PVC failed to bind).
If the PVC is bound, get the PV by pvName and check the PV's labels. If the PV carries either of the two labels above (zone or region), check whether the PV's values (there may be several, separated by "__") contain the value of the corresponding Node label; if not, return a match failure.
14 、 CheckNodeMemoryPressurePredicate
Doesn't happen often.
-> check Node memory pressure.
15 、 CheckNodeDiskPressurePredicate
Doesn't happen often.
16 、 InterPodAffinityMatches
Most expensive predicate to compute.
These scoring (priority) functions are available by default:
SelectorSpreadPriority: spreads Pods that belong to the same service or replication controller, so that each Node carries as few Pods of the same service or RC as possible.
InterPodAffinityPriority: places Pods into the same topology domain (same node, same rack, same zone, same power domain, and so on) according to the Pods' affinity settings.
LeastRequestedPriority: prefers the more idle nodes, prioritizing nodes by the lowest requested utilization.
BalancedResourceAllocation: prioritizes nodes that keep resource usage balanced across resource types.
NodePreferAvoidPodsPriority: used for user-specified avoidance via the annotation scheduler.alpha.kubernetes.io/preferAvoidPods. Its weight is 10000, large enough to override all other priority functions; a value of 0 has no effect.
NodeAffinityPriority: prioritizes nodes whose labels match the Pod's NodeAffinity preferences.
TaintTolerationPriority: prioritizes nodes whose taints the Pod can tolerate, based on the tolerations the Pod defines.
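As an example of how such a scoring function works, here is a sketch of the LeastRequestedPriority formula: each resource scores 0-10, the freer the node the higher the score, and CPU and memory are averaged. The surrounding types are simplified for illustration:

package sketch

// leastRequestedScore returns 0-10: 10 for a completely free resource,
// 0 for a fully requested (or over-requested) one.
func leastRequestedScore(requested, capacity int64) int64 {
    if capacity == 0 || requested > capacity {
        return 0
    }
    return (capacity - requested) * 10 / capacity
}

// leastRequestedPriority averages the CPU and memory scores for one node.
func leastRequestedPriority(reqCPU, capCPU, reqMem, capMem int64) int64 {
    return (leastRequestedScore(reqCPU, capCPU) + leastRequestedScore(reqMem, capMem)) / 2
}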
Finally, the code enters an endless loop over scheduleOne, where the real scheduling work of the scheduler begins.
The scheduling process
Let's start with the main process:
The main process consists of the following 8 steps:
Take a Pod that needs to be scheduled from the Pod queue.
Try to schedule the Pod.
If scheduling fails, attempt preemption for the Pod.
After scheduling succeeds, try to bind volumes.
Because the reserve plug-in is not enabled for the time being, it is not analyzed yet.
Try to assign Pod to Node.
Really perform the binding. Steps 4 and 6 only operate on the scheduler's cache: the cache operations are completed first, and then, in step 7, the changes in the cache are applied to the apiserver asynchronously. If applying them fails, the Pod's assignment information is cleared from the cache and the Pod is scheduled again.
The most complex and most important parts are steps 2 and 3, which are analyzed below.
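Before diving into steps 2 and 3, here is a compressed sketch of the overall loop; the function signatures are placeholders standing in for the real scheduler components (queue, cache, binder), not the actual kube-scheduler code:

package sketch

import "log"

type pod struct{ Name string }

type queue interface {
    Pop() pod
}

// scheduleOne pops a pod, tries to schedule it, falls back to preemption on
// failure, assumes the decision in the cache, and binds asynchronously.
func scheduleOne(
    q queue,
    schedule func(pod) (string, error), // returns the chosen node name
    preempt func(pod) error,
    assume func(pod, string) error, // record the decision in the scheduler cache
    bind func(pod, string) error,   // write the binding to the apiserver
    forget func(pod),               // roll the cache back on failure
) {
    p := q.Pop()

    nodeName, err := schedule(p)
    if err != nil {
        // Scheduling failed: try to make room by preempting lower-priority pods.
        if perr := preempt(p); perr != nil {
            log.Printf("preemption for %s failed: %v", p.Name, perr)
        }
        return
    }

    // Optimistically assume the pod onto the node so the next pod in the
    // queue already sees the updated resource usage.
    if err := assume(p, nodeName); err != nil {
        return
    }

    // The actual API call happens asynchronously; on failure the assumed
    // pod is removed from the cache so the pod can be rescheduled later.
    go func() {
        if err := bind(p, nodeName); err != nil {
            forget(p)
        }
    }()
}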
Scheduling Pod process
Scheduling Pod is an attempt to assign Pod to Node. The process is as follows:
There are seven points, which will be analyzed step by step:
Basic checks on the Pod: check whether the Pod's PVCs exist. Only the existence of the PVCs is checked here; the binding process is not considered.
Take out the list of all Nodes.
Synchronize the cache with nodeInfo. The global nodeInfo holds the real data of each Node, while the cache additionally holds the hypothetical state built up during scheduling.
Check whether Pod can be dispatched to Node and return a list of schedulable Node.
a) The checks here run the predicate functions registered during initialization; if any of them fails, the Node is considered unschedulable.
b) The checks are run twice because of nominated Pods. Never mind for now where nominated Pods come from; that is covered later. A nominated Pod is a Pod that has already been assigned to a Node but has not yet been applied to the Kubernetes environment: it merely reserves its spot, and the resources it occupies still have to be taken into account while scheduling the current Pod.
c) The first pass adds the nominated Pods with a higher priority than the current Pod to a temporary nodeInfo, as if they were already on the Node, and then runs all the predicate functions.
d) The second pass runs all the predicate functions again without the nominated Pods. This second pass is needed because the nominated Pods do not actually exist yet, and some Pod affinity rules could otherwise be evaluated incorrectly.
e) Of course, if there are no nominated Pods, the second pass is unnecessary.
If no Node can be found, a failure is returned. If exactly one is found, that Node is returned.
When more than one Node is found, the Nodes are scored. The rules are as follows:
a) If no scoring rules are defined, every Node gets a score of 1. The scheduler has default scoring functions, as mentioned in the initialization section above.
b) Run the old-style scoring functions: in the early versions these were simple functions that produce a score directly.
c) In the new style, a scoring function is split into two steps, map and reduce: map runs with a concurrency of 16, and reduce then aggregates the results.
d) Extender support is also reserved here.
e) Finally the scoring results are returned.
Based on the scores, select a Node:
a) Take the list of Nodes with the highest score.
b) Then select a Node round-robin.
Since several Nodes may share the highest score, genericScheduler uses a round-robin method: it keeps a global lastNodeIndex; with num being the number of nodes sharing the highest score, it picks the node at index lastNodeIndex % num and then increments lastNodeIndex by 1, which yields round-robin scheduling.
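A sketch of this tie-breaking logic (without the real scheduler's locking); the types are simplified:

package sketch

type nodeScore struct {
    Name  string
    Score int
}

type picker struct {
    lastNodeIndex uint64
}

// pickHighest returns one of the top-scoring node names, rotating between
// equally good candidates on successive calls (lastNodeIndex % num).
func (p *picker) pickHighest(scores []nodeScore) string {
    if len(scores) == 0 {
        return ""
    }
    best := scores[0].Score
    candidates := []string{scores[0].Name}
    for _, ns := range scores[1:] {
        switch {
        case ns.Score > best:
            best = ns.Score
            candidates = []string{ns.Name}
        case ns.Score == best:
            candidates = append(candidates, ns.Name)
        }
    }
    idx := p.lastNodeIndex % uint64(len(candidates))
    p.lastNodeIndex++
    return candidates[idx]
}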
That completes the analysis of the Pod scheduling process. One special element remains: the nominated Pod (NominatedPod), which is tied to the preemption process described next.
Pod preemption process
The preemption process is more complex than scheduling and consists of two main steps: the preemption check and the preemption execution. The first step determines whether preemption can succeed at all; the second step actually performs it (deletes Pods).
Preemption check
Check whether the Pod is allowed to initiate preemption: preemption is not allowed if the Pod is already a nominated Pod (pre-assigned to a Node) and there is a terminating Pod p on that Node whose priority is lower than the current Pod's.
Get the list of all Nodes.
Work out the candidate Nodes. Look at the reason each Node failed scheduling: if the failure is one that preemption cannot fix (for example the Node being not ready), the Node does not take part in preemption. Nodes that failed for any of the following reasons are excluded: predicates.ErrNodeSelectorNotMatch, predicates.ErrPodAffinityRulesNotMatch, predicates.ErrPodNotMatchHostName, predicates.ErrTaintsTolerationsNotMatch, predicates.ErrNodeLabelPresenceViolated, predicates.ErrNodeNotReady, predicates.ErrNodeNetworkUnavailable, predicates.ErrNodeUnderDiskPressure, predicates.ErrNodeUnderPIDPressure, predicates.ErrNodeUnderMemoryPressure, predicates.ErrNodeUnschedulable, predicates.ErrNodeUnknownCondition, predicates.ErrVolumeZoneConflict, predicates.ErrVolumeNodeConflict, predicates.ErrVolumeBindConflict.
If there are no candidate Nodes, the process ends.
Get the pdb list. pdb is PodDisruptionBudget, the definition of a disruption budget: for example, a statefulset has three replicas and we allow at most one of its Pods to be unavailable.
For each candidate Node, work out the Pods that would have to be removed for the current Pod to be scheduled there (Nodes where this is impossible are dropped):
a) Remove from the nodeInfoCopy all Pods with a lower priority than the current Pod, then try to schedule the Pod.
b) If scheduling still fails, that Node cannot be preempted (because the higher-priority Pods cannot be deleted).
c) Split the Pods to be deleted according to the pdb into violatingVictims and nonViolatingVictims, as shown in the figure.
d) Try to add the Pods in violatingVictims back one by one and check whether the Pod can still be scheduled; the number of failures is recorded in numViolatingVictim.
e) Then try to add the Pods in nonViolatingVictims back one by one and check whether the Pod can still be scheduled; the Pods that fail are recorded in victims.
f) Return victims and numViolatingVictim.
The extenders extension is reserved.
From the list of Nodes where preemption is possible, select the most suitable one according to the following rules (a sketch of this elimination cascade follows the preemption-check steps below):
a) Fewest pdb violations, i.e. the numViolatingVictim returned above.
b) If only one Node remains, return it.
c) Compare the highest victim priority on each Node and keep the lowest: "highest" is the highest priority value among the victims on a single Node, "lowest" is the minimum of those values across all Nodes.
d) If only one Node remains, return it.
e) Keep the Nodes with the lowest sum of victim priorities.
f) If only one Node remains, return it.
g) Keep the Nodes with the fewest victim Pods.
h) Return the first one.
If no suitable Node exists, the process ends.
Get the nominated Pods whose priority is lower than the current Pod's.
Return the Node information, the list of Pods to be deleted, and the lower-priority nominated Pods.
At this point the preemption check is done: we have the Node we want to schedule to, the list of Pods that must be deleted so that we can schedule there, and the nominated Pods whose priority is lower than the current Pod's.
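A sketch of the elimination cascade used to pick the preemption Node, using a simplified per-node candidate structure instead of the real scheduler's data:

package sketch

type candidate struct {
    NodeName         string
    NumPDBViolations int
    VictimPriorities []int32
}

// keepMin keeps only the candidates for which key() is minimal.
func keepMin(cands []candidate, key func(candidate) int64) []candidate {
    min := int64(1 << 62)
    var out []candidate
    for _, c := range cands {
        k := key(c)
        switch {
        case k < min:
            min = k
            out = []candidate{c}
        case k == min:
            out = append(out, c)
        }
    }
    return out
}

// pickOneNodeForPreemption narrows the candidates step by step and returns
// the first survivor: fewest pdb violations, lowest highest victim priority,
// lowest priority sum, fewest victims.
func pickOneNodeForPreemption(cands []candidate) string {
    if len(cands) == 0 {
        return ""
    }
    steps := []func(candidate) int64{
        func(c candidate) int64 { return int64(c.NumPDBViolations) }, // fewest pdb violations
        func(c candidate) int64 { // lowest "highest victim priority"
            var max int64 = -1 << 31
            for _, p := range c.VictimPriorities {
                if int64(p) > max {
                    max = int64(p)
                }
            }
            return max
        },
        func(c candidate) int64 { // lowest sum of victim priorities
            var sum int64
            for _, p := range c.VictimPriorities {
                sum += int64(p)
            }
            return sum
        },
        func(c candidate) int64 { return int64(len(c.VictimPriorities)) }, // fewest victims
    }
    for _, step := range steps {
        cands = keepMin(cands, step)
        if len(cands) == 1 {
            break
        }
    }
    return cands[0].NodeName
}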
Preemption execution (entered only once a suitable Node has been found)
Turn the current Pod into a nominated Pod whose nominated Node is the chosen Node; this is where nominated Pods come from. Update the nominated Pod information in the apiserver. Iterate over the victims (the list of Pods to delete returned by the preemption check), delete each Pod, and record an event. Iterate over nominatedPodsToClear (the lower-priority nominated Pods returned by the check), clear their nomination configuration, and update the apiserver.
At this point, the scheduling process analysis is completed.