
Analyzing the Principles of the Kubernetes Scheduler



This post interprets the Kubernetes Scheduler's algorithm, focusing on the principles of the pre-selection (Predicates) and prioritization (Priorities) steps, and introduces the default Policies configuration defined by DefaultProvider. In a follow-up post I will analyze the Kubernetes Scheduler source code to examine the implementation details and how to develop a custom Policy.

Introduction to the Scheduler and Its Algorithm

Kubernetes Scheduler is a component of the Kubernetes Master. It is usually deployed on the same node as the API Server and Controller Manager; together, the three make up the Master's "three musketeers".

The Scheduler's job in one sentence: take each Pod whose PodSpec.NodeName is empty and, through the two steps of pre-selection (Predicates) and prioritization (Priorities), select the most suitable Node as the Pod's destination.

Expanding these two steps gives the Scheduler's algorithm:

Pre-selection: based on the configured Predicates Policies (by default, the predicates policies set defined in DefaultProvider), filter out the Nodes that do not satisfy those Policies; the remaining Nodes become the input to prioritization.

Prioritization: score and rank the pre-selected Nodes according to the configured Priorities Policies (by default, the priorities policies set defined in DefaultProvider). The Node with the highest score is taken as the most suitable Node, and the Pod is bound to it.

If several Nodes tie for the highest score after prioritization, the scheduler picks one of them at random as the target Node.

The scheduling flow itself is therefore logically very simple; the substance lies in the individual Policies (the overall flow is sketched below). Let's take a look at the Predicates and Priorities Policies of Kubernetes.
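
To make the flow concrete, here is a minimal Go sketch of the two-phase algorithm. This is not the actual kube-scheduler source; the types and names (Node, Pod, Predicate, Priority, schedule) are illustrative stand-ins:

package scheduler

import (
	"errors"
	"math/rand"
)

type Node struct{ Name string }
type Pod struct{ Name string }

// Predicate reports whether the node can run the pod.
type Predicate func(pod Pod, node Node) bool

// Priority scores a node from 0 (worst) to 10 (best) for the pod.
type Priority struct {
	Weight int
	Score  func(pod Pod, node Node) int
}

// schedule filters with predicates, ranks with weighted priorities,
// and picks a highest-scoring node, breaking ties at random.
func schedule(pod Pod, nodes []Node, preds []Predicate, prios []Priority) (Node, error) {
	var fit []Node
	for _, n := range nodes {
		ok := true
		for _, p := range preds {
			if !p(pod, n) { // one failing policy eliminates the node
				ok = false
				break
			}
		}
		if ok {
			fit = append(fit, n)
		}
	}
	if len(fit) == 0 {
		return Node{}, errors.New("FailedPredicates: no node passed pre-selection")
	}
	var best []Node
	bestScore := -1
	for _, n := range fit {
		score := 0
		for _, pr := range prios {
			score += pr.Weight * pr.Score(pod, n) // weighted sum over policies
		}
		if score > bestScore {
			bestScore, best = score, []Node{n}
		} else if score == bestScore {
			best = append(best, n)
		}
	}
	return best[rand.Intn(len(best))], nil
}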

Predicates and Priorities Policies

Predicates Policies

Predicates Policies are what the Scheduler uses to filter Nodes down to those that satisfy the defined conditions. For each Node, all Predicates Policies are evaluated concurrently (in up to 16 goroutines); a Node must satisfy every configured Predicates Policy, and as soon as one Policy fails, the Node is eliminated.

Note: the number of concurrent goroutines here is the number of Nodes, capped at 16 and controlled by a queue.
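
A rough sketch of this bounded fan-out, reusing the toy types from the sketch above. The real implementation drives the checks through an internal work queue; this approximation uses a counting semaphore:

package scheduler

import "sync"

// filterNodesParallel runs the predicate checks for all nodes with at most
// 16 goroutines in flight, mirroring the cap described above.
func filterNodesParallel(pod Pod, nodes []Node, preds []Predicate) []Node {
	sem := make(chan struct{}, 16) // counting semaphore: one slot per in-flight check
	keep := make([]bool, len(nodes))
	var wg sync.WaitGroup
	for i, n := range nodes {
		wg.Add(1)
		sem <- struct{}{} // blocks while 16 checks are already running
		go func(i int, n Node) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			keep[i] = true
			for _, p := range preds {
				if !p(pod, n) { // one failing policy eliminates the node
					keep[i] = false
					break
				}
			}
		}(i, n)
	}
	wg.Wait()
	var fit []Node
	for i, ok := range keep {
		if ok {
			fit = append(fit, nodes[i])
		}
	}
	return fit
}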

Kubernetes provides the following Predicates Policies. You can add --policy-config-file to the kube-scheduler startup parameters to specify the Policies collection to use, for example:

{"kind": "Policy", "apiVersion": "v1", "predicates": [{"name": "PodFitsPorts"}, {"name": "PodFitsResources"}, {"name": "NoDiskConflict"}, {"name": "NoVolumeZoneConflict"}, {"name": "MatchNodeSelector"}, {"name": "HostName"}] "priorities": [...]}

NoDiskConflict: evaluates whether a Pod fits given the volumes it requests and the volumes already mounted on the node. The currently supported volume types are AWS EBS, GCE PD, iSCSI, and Ceph RBD; only PersistentVolumeClaims of these types are checked. Persistent volumes added directly to the Pod are not evaluated and are not constrained by this policy.

NoVolumeZoneConflict: evaluates whether the volumes a Pod requests are available on the node, given the zone restrictions.

PodFitsResources: checks whether the free resources (CPU and memory) meet the Pod's requirements. Free resources are measured as capacity minus the sum of the requests of all Pods on the node. To learn more about resource QoS in Kubernetes, see the QoS proposal.

PodFitsHostPorts: checks whether any HostPort required by the Pod is already occupied on the node (a toy sketch of this check appears after this list).

HostName: filters out all nodes except the one specified in the NodeName field of the PodSpec.

MatchNodeSelector: checks whether the node's labels match the labels specified in the Pod's nodeSelector field and, starting with Kubernetes v1.2, also the scheduler.alpha.kubernetes.io/affinity Pod annotation, if present.

MaxEBSVolumeCount: ensures that the number of attached ElasticBlockStore volumes does not exceed the maximum (39 by default, because Amazon recommends a maximum of 40, one of which is reserved for the root volume; see Amazon's documentation). The maximum can be overridden by setting the KUBE_MAX_PD_VOLS environment variable.

MaxGCEPDVolumeCount: ensures that the number of attached GCE PersistentDisk volumes does not exceed the maximum (16 by default, the maximum GCE allows). The maximum can be overridden by setting the KUBE_MAX_PD_VOLS environment variable.

CheckNodeMemoryPressure: checks whether a Pod can be scheduled on a node reporting memory pressure. Currently, a BestEffort Pod should not be placed on a node under memory pressure, because it would be automatically evicted by the kubelet.

CheckNodeDiskPressure: checks whether a Pod can be scheduled on a node reporting disk pressure. Currently, no Pod should be placed on a node under disk pressure, because it would be automatically evicted by the kubelet.
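
To make the predicate contract concrete, here is a toy check in the spirit of PodFitsHostPorts from the list above, reusing the toy types from the earlier sketches; the fields are simplified, hypothetical stand-ins for the real API objects:

package scheduler

// PodWithPorts and NodeWithPorts are hypothetical wrappers; the real check
// reads ContainerPort.HostPort from the PodSpec and tracks ports in use.
type PodWithPorts struct {
	Pod
	HostPorts []int
}
type NodeWithPorts struct {
	Node
	UsedHostPorts map[int]bool // host ports already taken by pods on this node
}

// podFitsHostPorts fails as soon as any requested host port is already taken.
func podFitsHostPorts(pod PodWithPorts, node NodeWithPorts) bool {
	for _, p := range pod.HostPorts {
		if node.UsedHostPorts[p] {
			return false
		}
	}
	return true
}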

The following Predicates Policies are selected in the DefaultProvider:

NoVolumeZoneConflict

MaxEBSVolumeCount

MaxGCEPDVolumeCount

MatchInterPodAffinity

Note: fit is determined by inter-pod affinity. AffinityAnnotationKey is the key of the affinity data (JSON-serialized) in the Pod's annotations:

AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"

NoDiskConflict

GeneralPredicates, which bundles:

PodFitsResources, checking the node's allocatable resources against the Pod's requests:

pods, in number

cpu, in cores

memory, in bytes

alpha.kubernetes.io/nvidia-gpu, in devices (as of v1.4, at most 1 GPU per node)

PodFitsHost

PodFitsHostPorts

PodSelectorMatches

PodToleratesNodeTaints

CheckNodeMemoryPressure

CheckNodeDiskPressure

Priorities Policies

The Nodes that survive pre-selection proceed to the prioritization step. Here the Scheduler launches one goroutine per Priorities Policy; in each goroutine, all pre-selected Nodes are traversed and scored according to that policy's implementation. Each Node receives a score from 0 to 10 for each Policy, 0 being the lowest and 10 the highest. After all the policy goroutines have finished, each Node's per-policy scores are weighted by the configured weight of the corresponding Priorities Policy and summed, giving the Node's final score.

FinalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)

Note: the number of concurrent goroutines here is the number of Priorities Policies; there is no queue control and no cap on the count. In practice this is fine, since normally no more than a dozen or two Policies are configured. The fan-out and the weighted aggregation are sketched below.
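
A minimal sketch of this per-policy fan-out and weighted aggregation, again with the toy types from the earlier sketches (illustrative, not the real source):

package scheduler

import "sync"

// prioritizeNodes launches one goroutine per priority policy (uncapped, as
// noted above), gathers per-policy 0-10 scores for every node, then combines
// them as FinalScore = sum(weight_i * score_i).
func prioritizeNodes(pod Pod, nodes []Node, prios []Priority) []int {
	perPolicy := make([][]int, len(prios))
	var wg sync.WaitGroup
	for i, pr := range prios {
		wg.Add(1)
		go func(i int, pr Priority) { // one goroutine per policy
			defer wg.Done()
			scores := make([]int, len(nodes))
			for j, n := range nodes {
				scores[j] = pr.Score(pod, n) // each policy scores 0..10
			}
			perPolicy[i] = scores
		}(i, pr)
	}
	wg.Wait()
	final := make([]int, len(nodes))
	for i, pr := range prios {
		for j := range nodes {
			final[j] += pr.Weight * perPolicy[i][j] // weighted sum per node
		}
	}
	return final
}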

A thought: if no Node passes pre-selection, the Scheduler returns a FailedPredicates error directly and never enters the prioritization phase, which is reasonable. However, if exactly one Node passes pre-selection, prioritization is still triggered and follows the same flow as with multiple Nodes. In fact, when only one Node qualifies, the prioritization phase could simply return that Node as the final scheduling result without running the whole scoring process.

As noted earlier, if several Nodes tie for the highest score after prioritization, the scheduler picks one of them at random as the target Node.

Kubernetes provides the following Priorities Policies. Again, you can add --policy-config-file to the kube-scheduler startup parameters to specify the Policies collection to use, for example:

{"kind": "Policy", "apiVersion": "v1", "predicates": [...], "priorities": [{"name": "LeastRequestedPriority", "weight": 1}, {"name": "BalancedResourceAllocation", "weight": 1}, {"name": "ServiceSpreadingPriority", "weight": 1} {"name": "EqualPriority", "weight": 1}]}

LeastRequestedPriority: the node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto it; in other words, (capacity - sum of requests of all Pods already on the node - request of the Pod being scheduled) / capacity. CPU and memory are equally weighted, and the node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across nodes with respect to resource consumption (a worked sketch of this formula follows this list).

BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.

SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.

CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.

ImageLocalityPriority: nodes are prioritized based on the locality of the images requested by a Pod. Nodes with a larger total size of already-pulled images required by the Pod are preferred over nodes with none, or a smaller total size, of those images.

NodeAffinityPriority: (Kubernetes v1.2) implements preferredDuringSchedulingIgnoredDuringExecution node affinity.
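
As referenced above, here is a small worked sketch of the LeastRequestedPriority formula; the capacities and requests are made-up numbers, and the integer math mirrors the 0-10 score range:

package scheduler

// leastRequestedScore follows the formula quoted above:
// score = 10 * (capacity - requested) / capacity, averaged over CPU and memory.
func leastRequestedScore(cpuCap, cpuReq, memCap, memReq int64) int64 {
	cpuScore := (cpuCap - cpuReq) * 10 / cpuCap
	memScore := (memCap - memReq) * 10 / memCap
	return (cpuScore + memScore) / 2
}

// Example: 4000m CPU capacity with 1000m requested, and 8 GiB memory with
// 2 GiB requested:
//   cpuScore = 3000 * 10 / 4000 = 7
//   memScore = 6 * 10 / 8 = 7
//   final    = (7 + 7) / 2 = 7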

The following Priorities Policies are selected in the DefaultProvider:

SelectorSpreadPriority, default weight is 1

InterPodAffinityPriority, default weight is 1

Pods should be placed in the same topology domain (e.g. same node, same rack, same zone, same power domain) as some other Pods, or, conversely, should not be placed in the same topology domain as some other Pods.

AffinityAnnotationKey represents the key of affinity data (json serialized) in the Annotations of a Pod.

scheduler.alpha.kubernetes.io/affinity="..."

LeastRequestedPriority, default weight is 1

BalancedResourceAllocation, default weight is 1

NodePreferAvoidPodsPriority, default weight is 10000

Note: the weight here is set large enough (10000) that a non-zero score makes the final score very high, while a score of 0 dooms the node to elimination no matter how well it does elsewhere. The analysis is as follows (a toy calculation follows this list):

If the Node's annotations do not set the key-value

scheduler.alpha.kubernetes.io/preferAvoidPods="..."

then the node scores 10 points for this policy; with the weight of 10000, the node gets at least 10 * 10000 = 100,000 points from this policy alone.

If the Node's annotations do set

scheduler.alpha.kubernetes.io/preferAvoidPods="..."

and the Pod's controller is a ReplicationController or ReplicaSet, then the node scores 0 for this policy, so its final score is vanishingly low compared with Nodes that lack the annotation. In other words, such a Node is destined to be eliminated!

NodeAffinityPriority, default weight is 1

TaintTolerationPriority, default weight is 1
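
To quantify the note above about the 10000 weight, here is a toy calculation; the assumption that the six other default priorities each keep weight 1 and, generously, score a full 10 is mine:

package scheduler

import "fmt"

// demoPreferAvoid shows why a 0 score from NodePreferAvoidPodsPriority buries
// a node: the other default policies cannot close a 100,000-point gap.
func demoPreferAvoid() {
	const preferAvoidWeight = 10000
	others := 6 * 1 * 10                        // six other policies, weight 1, max score 10
	cleanNode := preferAvoidWeight*10 + others  // 100060
	avoidedNode := preferAvoidWeight*0 + others // 60
	fmt.Println(cleanNode, avoidedNode)
}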

Flow Chart of the Scheduler Algorithm

Summary

The task of the Kubernetes Scheduler is to schedule each Pod to the most suitable Node.

The whole scheduling process is divided into two steps: pre-selection (Predicates) and prioritization (Priorities).

The default scheduling policy set is DefaultProvider; see above for its contents.

You can point the kube-scheduler startup parameter --policy-config-file at a custom JSON file and assemble your own Predicates and Priorities Policies in the format shown above.
