This article walks through the scheduling process and algorithms of the K8s scheduler (kube-scheduler). It aims to be concise and easy to follow; I hope the detailed introduction gives you something useful to take away.
Overview of scheduling process
Kubernetes is the most popular container orchestration and operations platform, and kube-scheduler, the core scheduling component of K8s, is the protagonist of this article. Everything described below is based on **release-1.16**. The major components of kube-scheduler are:
Policy
The scheduler's scheduling policy can currently be configured at startup in three ways: a configuration file, command-line parameters, or a ConfigMap. The policy specifies which filters (Predicates), scoring plugins (Priorities), external extender schedulers (Extenders), and custom extension points (Plugins) of the newly supported Scheduler Framework are used in the main scheduling flow.
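To make the shape of such a policy file concrete, here is a minimal Go sketch that decodes a hypothetical policy.json; the struct and the file name are illustrative assumptions, not the actual kube-scheduler configuration API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// minimalPolicy is an illustrative stand-in for a scheduler policy:
// which Predicates, Priorities and Extenders to enable.
type minimalPolicy struct {
	Predicates []struct {
		Name string `json:"name"`
	} `json:"predicates"`
	Priorities []struct {
		Name   string `json:"name"`
		Weight int    `json:"weight"`
	} `json:"priorities"`
	Extenders []struct {
		URLPrefix  string `json:"urlPrefix"`
		FilterVerb string `json:"filterVerb"`
	} `json:"extenders"`
}

func main() {
	// policy.json is a hypothetical file passed to the scheduler at startup.
	data, err := os.ReadFile("policy.json")
	if err != nil {
		panic(err)
	}
	var p minimalPolicy
	if err := json.Unmarshal(data, &p); err != nil {
		panic(err)
	}
	fmt.Printf("enabled predicates: %d, priorities: %d, extenders: %d\n",
		len(p.Predicates), len(p.Priorities), len(p.Extenders))
}
```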
Informer
When it starts, the Scheduler uses the K8s informer mechanism (List+Watch) to obtain the data needed for scheduling from kube-apiserver, such as Pods, Nodes, Persistent Volumes (PV), Persistent Volume Claims (PVC), and so on, and preprocesses this data into the scheduler's Cache.
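The following Go sketch shows the general List+Watch pattern with client-go shared informers. It is not the scheduler's actual cache-building code; error handling is minimal and the kubeconfig path is an assumption.

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a kubeconfig file (the path here is an assumption).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Shared informer factory: List once, then Watch for changes.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	nodeLister := factory.Core().V1().Nodes().Lister()

	// React to Pod events; a real scheduler would enqueue unscheduled Pods here.
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*v1.Pod)
			if pod.Spec.NodeName == "" {
				fmt.Println("pod waiting to be scheduled:", pod.Name)
			}
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	nodes, _ := nodeLister.List(labels.Everything())
	fmt.Println("nodes in cache:", len(nodes))
	select {} // keep watching
}
```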
Scheduling pipeline
Pods that need to be scheduled are inserted into the Queue via the Informer, and the Pipeline loops: it pops a Pod waiting to be scheduled from the Queue and runs it through the Pipeline.
The scheduling pipeline (Schedule Pipeline) has three main phases: Scheduler Thread, Wait Thread, and Bind Thread.
**Scheduler Thread phase:** as the architecture diagram above shows, the Scheduler Thread goes through Pre Filter -> Filter -> Post Filter -> Score -> Reserve, which can be simplified to Filter -> Score -> Reserve.
The Filter phase selects the Nodes that satisfy the Pod Spec; the Score phase scores and sorts the Nodes that passed the Filter; the Reserve phase records the Pod against the NodeCache entry of the top-ranked Node, marking the Pod as assigned to that Node, so that the next Pod waiting to be scheduled already sees the just-assigned Pod when it filters and scores that Node.
**Wait Thread phase:** this phase can be used to wait for resources associated with the Pod to become Ready, for example waiting for the PV of a PVC to be created successfully, or, in Gang scheduling, waiting for the associated Pods to be scheduled successfully.
**Bind Thread phase:** persists the association between the Pod and the Node to kube-apiserver.
The whole scheduling pipeline schedules one Pod at a time, serially, in the Scheduler Thread phase; the Wait and Bind phases are executed asynchronously per Pod.
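A schematic Go sketch of this control flow is shown below. All types and function names here are invented stand-ins for illustration, not the scheduler's real ones; the point is that Filter/Score/Reserve run serially for one Pod at a time, while Wait/Bind are handed off to a goroutine.

```go
package main

import (
	"fmt"
	"time"
)

// Invented placeholder types for illustration.
type Pod struct{ Name string }
type Node struct{ Name string }

type queue struct{ pods chan *Pod }

func (q *queue) Pop() *Pod { return <-q.pods }

// These functions stand in for the real Filter -> Score -> Reserve -> Wait -> Bind phases.
func filter(p *Pod, nodes []Node) []Node { return nodes }
func score(p *Pod, nodes []Node) Node    { return nodes[0] }
func reserve(p *Pod, n Node)             { fmt.Println("reserved", p.Name, "on", n.Name) }
func waitOnPermits(p *Pod) error         { return nil } // e.g. wait for PV provisioning
func bind(p *Pod, n Node) error          { fmt.Println("bound", p.Name, "to", n.Name); return nil }

func schedulePipeline(q *queue, nodes []Node) {
	for {
		pod := q.Pop() // serial: one Pod at a time in the Scheduler Thread

		feasible := filter(pod, nodes)
		if len(feasible) == 0 {
			continue // would go back to unschedulableQ/backoffQ
		}
		best := score(pod, feasible)
		reserve(pod, best) // mark the Pod as Assumed in the node cache

		// Wait and Bind run asynchronously so the next Pod can be scheduled.
		go func(p *Pod, n Node) {
			if err := waitOnPermits(p); err != nil {
				return // would roll back the reservation
			}
			_ = bind(p, n) // persist the Pod/Node binding to kube-apiserver
		}(pod, best)
	}
}

func main() {
	q := &queue{pods: make(chan *Pod, 1)}
	q.pods <- &Pod{Name: "nginx-0"}
	go schedulePipeline(q, []Node{{Name: "node-1"}})
	time.Sleep(100 * time.Millisecond) // let the async bind finish in this demo
}
```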
Detailed scheduling process
Having explained the functions and relationships of kube-scheduler's major components, let's look more closely at how the Scheduler Pipeline actually works. Below is the detailed flow of kube-scheduler, starting with the scheduling queue:
**SchedulingQueue** has three sub-queues: activeQ, backoffQ, and unschedulableQ.
When the Scheduler starts, all Pods waiting to be scheduled enter activeQ, and activeQ sorts them by Pod priority. The Scheduler Pipeline takes a Pod from activeQ and runs it through the scheduling process. When scheduling fails, the Pod goes directly into either unschedulableQ or backoffQ depending on the situation: if the Scheduler Cache (Node Cache, Pod Cache, etc.) changed during this Pod's scheduling cycle, it enters backoffQ; otherwise it enters unschedulableQ.
unschedulableQ is periodically flushed into activeQ or backoffQ after a relatively long stay (for example, 60 seconds), or its associated Pods are flushed into activeQ or backoffQ when the Scheduler Cache changes. backoffQ, with its backoff mechanism, lets a failed Pod re-enter activeQ for rescheduling faster than unschedulableQ does.
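A simplified Go sketch of this routing logic, with invented types, might look like the following; the 60-second flush interval is just the example value mentioned above.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins, not the real scheduler types.
type Pod struct {
	Name      string
	Timestamp time.Time
}

type schedulingQueue struct {
	activeQ        []*Pod // sorted by priority in the real implementation
	backoffQ       []*Pod
	unschedulableQ []*Pod
}

// onScheduleFailure mirrors the routing rule described above: if the
// scheduler cache (nodes, pods, ...) changed during this Pod's scheduling
// cycle, retry sooner via backoffQ, otherwise park it in unschedulableQ.
func (q *schedulingQueue) onScheduleFailure(p *Pod, cacheChangedDuringCycle bool) {
	if cacheChangedDuringCycle {
		q.backoffQ = append(q.backoffQ, p)
	} else {
		q.unschedulableQ = append(q.unschedulableQ, p)
	}
}

// flushUnschedulable moves Pods that have waited long enough (60s here)
// back to activeQ; a cache change would trigger the same move earlier.
func (q *schedulingQueue) flushUnschedulable(maxStay time.Duration) {
	var remaining []*Pod
	for _, p := range q.unschedulableQ {
		if time.Since(p.Timestamp) > maxStay {
			q.activeQ = append(q.activeQ, p)
		} else {
			remaining = append(remaining, p)
		}
	}
	q.unschedulableQ = remaining
}

func main() {
	q := &schedulingQueue{}
	q.onScheduleFailure(&Pod{Name: "a", Timestamp: time.Now()}, true)
	q.onScheduleFailure(&Pod{Name: "b", Timestamp: time.Now().Add(-2 * time.Minute)}, false)
	q.flushUnschedulable(60 * time.Second)
	fmt.Println(len(q.activeQ), len(q.backoffQ), len(q.unschedulableQ)) // 1 1 0
}
```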
Now for the Scheduler Thread phase in detail. When the Scheduler Pipeline gets a Pod waiting to be scheduled, it fetches the relevant Nodes from the NodeCache and runs the Filter logic against them. The way the NodeCache iterates over Nodes includes an algorithmic optimization, which can be summarized as: spread candidates across zones (for disaster tolerance) while sampling, so that the scheduler avoids filtering every node.
The specific algorithm (interested readers can look at the Next method in node_tree.go): in the NodeCache, Nodes are stacked by zone. During the filter phase, the NodeCache maintains a zoneIndex; each time a Node is popped for filtering, the zoneIndex moves forward one position and a node is taken from that zone's node list.
On the vertical axis of the figure above there is a per-zone nodeIndex that increments on each take; if the current zone has no more nodes, the next zone is used. Roughly, zoneIndex moves left to right and nodeIndex top to bottom, which ensures that the candidate Nodes are spread across zones, avoiding filtering all nodes while keeping deployment balanced across zones. (The latest release-v1.17 has removed this algorithm; the reason is presumably that it did not take the Pod's prefer and the node's prefer into account and so could fail to satisfy the Pod's Spec requirements.)
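The zoneIndex/nodeIndex round-robin described above can be sketched roughly as follows; this is a simplification for illustration, not the actual node_tree.go code.

```go
package main

import "fmt"

// nodeTree is a simplified stand-in for the scheduler's node tree:
// node names stacked per zone, iterated round-robin across zones.
type nodeTree struct {
	zones     []string            // zone order, e.g. zone-0, zone-1, zone-2
	nodes     map[string][]string // zone -> node names
	zoneIndex int                 // which zone to take from next (left to right)
	nodeIndex map[string]int      // per-zone position (top to bottom)
}

// next returns one node per call, moving to the next zone each time, so that
// consecutive candidates are spread across zones.
func (t *nodeTree) next() (string, bool) {
	for tries := 0; tries < len(t.zones); tries++ {
		zone := t.zones[t.zoneIndex]
		t.zoneIndex = (t.zoneIndex + 1) % len(t.zones)

		i := t.nodeIndex[zone]
		if i < len(t.nodes[zone]) {
			t.nodeIndex[zone] = i + 1
			return t.nodes[zone][i], true
		}
		// this zone is exhausted; try the next one
	}
	return "", false
}

func main() {
	t := &nodeTree{
		zones: []string{"zone-0", "zone-1", "zone-2"},
		nodes: map[string][]string{
			"zone-0": {"n0-a", "n0-b"},
			"zone-1": {"n1-a"},
			"zone-2": {"n2-a", "n2-b"},
		},
		nodeIndex: map[string]int{},
	}
	for name, ok := t.next(); ok; name, ok = t.next() {
		fmt.Println(name) // n0-a n1-a n2-a n0-b n2-b
	}
}
```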
A brief note on the sample size used in sampled scheduling. The default sampling ratio formula is max(5, 50 - number of cluster nodes / 125), and the sample size is max(100, number of cluster nodes * sampling ratio).
An example: with a cluster of 3000 nodes, the sampling ratio = max(5, 50 - 3000/125) = 26%, so the sample size = max(100, 3000 * 0.26) = 780. In the scheduling pipeline, as soon as the Filter has matched 780 candidate nodes, the Filter process can stop and move on to the Score phase.
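The sampling formula and the 3000-node example can be checked with a few lines of Go; this is a simplified sketch of the release-1.16 behavior, not the exact source.

```go
package main

import "fmt"

// numFeasibleNodesToFind reproduces the sampling formula described above:
// ratio = max(5, 50 - numAllNodes/125), sample size = max(100, numAllNodes*ratio/100).
func numFeasibleNodesToFind(numAllNodes int) int {
	if numAllNodes < 100 {
		return numAllNodes // small clusters are never sampled
	}
	ratio := 50 - numAllNodes/125
	if ratio < 5 {
		ratio = 5
	}
	n := numAllNodes * ratio / 100
	if n < 100 {
		return 100
	}
	return n
}

func main() {
	fmt.Println(numFeasibleNodesToFind(3000)) // 26% of 3000 = 780
}
```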
The Score phase sorts the nodes using the scoring plugins configured by the Policy, and the highest-scoring node becomes the SelectHost. The Pod is then assigned to that Node; this step is the Reserve phase, which can be thought of as pre-booking in the scheduler's ledger: it changes the Pod's state in the PodCache to Assumed (an in-memory state).
Scheduling touches the Pod's state machine lifecycle. Briefly, the main Pod states are: Initial (virtual state) -> Assumed (Reserve) -> Added -> Deleted (virtual state). When the Informer watch sees that the Pod data already has a node assigned, the Pod's state is changed to Added. The selected node may still fail at Bind time; when Bind fails, the reservation is rolled back: the Assumed booking reverts to Initial, that is, the Assumed status is erased and the Pod's entry is removed from the Node's ledger.
If Bind fails, the Pod is thrown back into the unschedulableQ. So under what circumstances does a Pod go to backoffQ instead? This is a fine-grained point: if the Cache changes during that scheduling cycle, the Pod is put into backoffQ. Waiting in backoffQ is shorter than in unschedulableQ: backoffQ uses an exponential backoff with factor 2; if the first retry waits 1s, the second waits 2s, the third 4s, the fourth 8s, up to a maximum of 10s.
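The backoff growth described above (starting at 1s, doubling per attempt, capped at 10s) can be written down directly; this is only a sketch of the stated behavior, not the scheduler's backoff implementation.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration doubles the delay per failed attempt, starting at 1s and
// capped at 10s, matching the values described above.
func backoffDuration(attempts int) time.Duration {
	d := 1 * time.Second
	for i := 1; i < attempts; i++ {
		d *= 2
		if d > 10*time.Second {
			return 10 * time.Second
		}
	}
	return d
}

func main() {
	for attempt := 1; attempt <= 6; attempt++ {
		fmt.Println(attempt, backoffDuration(attempt)) // 1s, 2s, 4s, 8s, 10s, 10s
	}
}
```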
Scheduling algorithm implementation: Predicates (filters)
Filters can be grouped into four categories by function:
Storage-related
Pod and Node matching
Pod and Pod matching
Pod spreading
Storage-related
The capabilities of several storage-related filters:
NoVolumeZoneConflict: the zone/az set in the labels of the PV associated with the PVC restricts which nodes the PV can be matched to.
MaxCSIVolumeCountPred: verifies, for PVCs provisioned through a CSI plugin, the per-node limit on the maximum number of PVs.
CheckVolumeBindingPred: verifies the binding logic between PVC and PV; this logic is fairly complex, mainly around how to reuse PVs.
NoDiskConflict: for SCSI storage, ensures the same volume is not used more than once.
Pod and Node matching
**CheckNodeCondition:** verifies whether the node can be scheduled onto: the node condition of type Ready must be true, NetworkUnavailable must be false, and Node.Spec.Unschedulable must be false.
**CheckNodeUnschedulable:** a node can carry the NodeUnschedulable marker; we can mark a node as unschedulable directly (for example with kubectl cordon), so that the node is no longer scheduled onto. In version 1.16 this Unschedulable marker has become a Taint, which means verifying whether the Tolerations set on the Pod can tolerate that Taint.
**PodToleratesNodeTaints:** verifies whether the Node's Taints are tolerated by the Pod's Tolerations.
**PodFitsHostPorts:** verifies whether the host Ports declared by the Pod's Containers are already in use by Pods that have been assigned to the Node.
**MatchNodeSelector:** verifies that Pod.Spec.Affinity.NodeAffinity and Pod.Spec.NodeSelector match the Node's Labels (a minimal sketch of the plain-label match follows below).
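For the plain nodeSelector case, MatchNodeSelector boils down to a label-subset check; here is a minimal sketch that ignores the richer NodeAffinity expressions (In/NotIn/Exists and so on).

```go
package main

import "fmt"

// matchesNodeSelector reports whether every key/value pair required by the
// Pod's nodeSelector is present on the Node's labels. This covers only the
// simple-label half of MatchNodeSelector; NodeAffinity expressions are omitted.
func matchesNodeSelector(nodeSelector, nodeLabels map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"disktype": "ssd"}
	node := map[string]string{"disktype": "ssd", "topology.kubernetes.io/zone": "zone-1"}
	fmt.Println(matchesNodeSelector(selector, node)) // true
}
```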
Pod and Pod matching
MatchInterPodAffinity: this is mainly the checking logic for PodAffinity and PodAntiAffinity. The biggest source of complexity is the TopologyKey supported in the PodAffinityTerm described by the Affinity (it can express topologies such as node/zone/az), which in practice is a performance killer.
Pod spreading
EvenPodsSpread
CheckServiceAffinity
EvenPodsSpread
This is a new feature. First, let's look at how EvenPodsSpread is described in the Spec: it describes the spreading requirements of a group of matching Pods over a specified TopologyKey.
Let's see how a group of Pods is described, as shown below:
```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    whenUnsatisfiable: DoNotSchedule
    topologyKey: k8s.io/hostname
    selector:
      matchLabels:
        app: foo
      matchExpressions:
      - key: app
        operator: In
        values: ['foo', 'foo2']
```
topologySpreadConstraints: describes the topology on which Pods should be spread evenly; multiple topologySpreadConstraints are ANDed together.
selector: selects the group of Pods that the constraint applies to.
topologyKey: the topology on which the spreading is measured.
maxSkew: the maximum allowed imbalance.
whenUnsatisfiable: the strategy when the topologySpreadConstraint cannot be satisfied; DoNotSchedule acts in the filter phase, ScheduleAnyway acts in the score phase.
The example below walks through this:
The selector selects all Pods with matching labels; they must be spread at the zone level, with a maximum allowed imbalance of 1. There are three zones in the cluster; in the figure above, one Pod with label app=foo has already been assigned in zone1 and one in zone2. The formula for the imbalance is: ActualSkew = count[topo] - min(count[topo]). First, the list of matching Pods is obtained according to the selector; then they are grouped by topologyKey to get count[topo].
As shown in the above figure:
Suppose maxSkew is 1. If the Pod were assigned to zone1 or zone2, the skew would be 2, which is greater than the maxSkew set above, so neither matches. If it is assigned to zone3, min(count[topo]) is 1 and count[zone3] is 1, so the skew equals 0; therefore the Pod can only be assigned to zone3.
If maxSkew were 2 instead, assigning to zone1 (or zone2) would give per-zone skews of 2/1/0 (or 1/2/0), with a maximum of 2, which still satisfies maxSkew, so any of the zones could be chosen.
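The skew arithmetic in these examples can be reproduced with a small Go sketch; skewIfPlaced is an illustrative helper, not part of the scheduler.

```go
package main

import "fmt"

// skewIfPlaced computes what the skew for a candidate topology value would be
// if one more matching Pod were placed there:
//   skew = (count[candidate] + 1) - min over all topology values of the new counts.
// counts holds the current number of matching Pods per topology value.
func skewIfPlaced(counts map[string]int, candidate string) int {
	newCounts := map[string]int{}
	for topo, c := range counts {
		newCounts[topo] = c
	}
	newCounts[candidate]++

	min := -1
	for _, c := range newCounts {
		if min < 0 || c < min {
			min = c
		}
	}
	return newCounts[candidate] - min
}

func main() {
	// zone1 and zone2 each already have one matching Pod; zone3 has none.
	counts := map[string]int{"zone1": 1, "zone2": 1, "zone3": 0}
	fmt.Println(skewIfPlaced(counts, "zone1")) // 2 -> violates maxSkew=1
	fmt.Println(skewIfPlaced(counts, "zone3")) // 0 -> allowed
}
```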