This article takes a detailed look at the principles behind the Kubernetes cluster scheduler.
A cluster management system in a cloud environment, or at warehouse scale (treating an entire data center as a single pool of compute), usually defines a specification for each workload and uses a scheduler to place workloads in suitable locations in the cluster. A good scheduler makes the cluster work more efficiently, improves resource utilization, and saves energy costs.
General-purpose schedulers, such as the native Kubernetes scheduler, place pods onto compute nodes (Nodes) according to specific scheduling algorithms and policies. In practice, however, designing a scheduler for a large shared cluster is not easy. The scheduler must understand how cluster resources are used and distributed, while also taking allocation speed and execution efficiency into account. An over-engineered scheduler that hides too many implementation details may fail to schedule as expected or behave abnormally, and choosing an ill-suited scheduler reduces efficiency or leaves scheduling tasks unfinished.
This article introduces the Kubernetes scheduler and the community's supplements and enhancements to it from two angles, design principles and code implementation, and compares the design and implementation of schedulers commonly used in the industry. It aims to give readers enough context to select, or even design and implement, a scheduler suited to their own scenarios.
Note: the code analyzed in this article is based on Kubernetes v1.11; corrections are welcome.
Scheduler basics
1.1 Definition of a scheduler
In general terms, scheduling means assigning a task to a specific resource so that related work gets done, following some method. A task can be a virtual computing element such as a thread, a process, or a data flow; the resources are typically processors, networks, disks, and so on; and the scheduler is the concrete implementation of this assignment. The purpose of a scheduler is to share system resources while reducing waiting time and improving throughput and resource utilization.
The schedulers discussed in this article are those that schedule tasks in large clusters, such as Mesos/YARN (Apache), Borg/Omega (Google), and Quincy (Microsoft). Building a cluster at data-center scale is extremely expensive, which makes careful scheduler design all the more important.
A comparative analysis of common types of schedulers is shown in Table 1 below:
1.2 Criteria for evaluating schedulers
First, consider what information a scheduler uses to make decisions and which metrics can be used to measure the quality of its work.
The scheduler's main job is to find a globally optimal match between resource demands and resource providers. Its design therefore needs to understand the different kinds of resource topology on the one hand, and to have a thorough understanding of the workloads on the other.
Understanding resource topologies and fully mastering the environment topology lets the scheduler use resources better (for example, tasks that frequently access some data run significantly faster when placed close to that data) and allows more complex policies to be defined on top of the topology information. However, maintaining global resource information is costly: it limits the overall cluster size and the scheduling latency, and makes the scheduler hard to scale out, which in turn caps the size of the cluster.
On the other hand, different types of workloads have different, even opposite, characteristics, so the scheduler also needs to understand the workloads well. Service tasks typically request few resources, run for a long time, and are insensitive to scheduling latency, whereas batch tasks request large amounts of resources, run for a short time, may depend on one another, and demand fast scheduling. At the same time, the scheduler must honor special user requirements, such as packing tasks as tightly or spreading them as widely as possible, or guaranteeing that a set of tasks runs simultaneously.
In general, a good scheduler must balance the time and quality of each individual scheduling decision, take into account how environmental changes affect earlier results and keep them close to optimal (rescheduling when necessary), and preserve cluster scale, while also supporting upgrades and extensions that are transparent to users. The scheduling results need to satisfy, though are not limited to, the following conditions, meeting the higher-priority ones whenever possible:
Maximize resource utilization
Meet the scheduling requirements specified by the user
Meet custom priority requirements
Schedule efficiently and make decisions quickly based on resource conditions
Adjust the scheduling strategy as the load changes
Take fairness at all levels fully into account
1.3 The influence of locking on scheduler design
Resource scheduling inevitably involves locking, and the type of lock chosen directly determines the scenarios a scheduler is suited to. Two-level schedulers such as Mesos generally adopt a pessimistic-locking design: a task is started only once all the resources it needs are satisfied; otherwise more resources are requested incrementally until the scheduling condition is met. Shared-state schedulers tend to use optimistic locking instead, and the default Kubernetes scheduler is designed around optimistic concurrency.
Let's first compare the processing logic of pessimistic and optimistic locking with a simple example. Assume the following sequence of operations:
Job A reads object O
Job B reads object O
Job A updates object O in memory
Job B updates object O in memory
Job A writes object O to persistent storage
Job B writes object O to persistent storage
With a pessimistic lock, an exclusive lock is placed on object O, blocking other requests until job A has finished updating O and persisting it. With an optimistic lock, a shared lock is used instead: all work is assumed to proceed normally until a conflict actually occurs, at which point the conflict is recorded and the conflicting request is rejected.
Optimistic locks are usually combined with a resource version, as in the example above. Suppose the current version of object O is v1. Job A persists its write to O first and marks the object as version v2; when job B later tries to update O, it finds that the version has changed and abandons its change.
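To make the difference concrete, here is a minimal, self-contained Go sketch of the optimistic-locking pattern just described: a toy in-memory store with an integer version (an assumption standing in for a real resource version field, not Kubernetes code). The second writer is rejected because the version it read has become stale.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// object models a stored resource with a version counter, standing in for the
// resourceVersion carried by real API objects (illustration only).
type object struct {
	Version int
	Data    string
}

// store accepts a write only if the writer still holds the version it read.
type store struct {
	mu  sync.Mutex
	obj object
}

var errConflict = errors.New("conflict: object was modified by another writer")

// Update applies newData only when readVersion matches the stored version,
// then bumps the version; otherwise it reports a conflict.
func (s *store) Update(readVersion int, newData string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.obj.Version != readVersion {
		return errConflict // caller must re-read object O and retry or give up
	}
	s.obj = object{Version: readVersion + 1, Data: newData}
	return nil
}

func main() {
	s := &store{obj: object{Version: 1, Data: "v1-data"}}

	// Job A and job B both read object O at version 1.
	readA, readB := s.obj.Version, s.obj.Version

	// Job A persists first, bumping the version to 2.
	fmt.Println("job A:", s.Update(readA, "written by A")) // <nil>

	// Job B's write is rejected: the version it read is now stale.
	fmt.Println("job B:", s.Update(readB, "written by B")) // conflict error
}
```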
Analysis of Kubernetes Scheduler
Most computational work in Kubernetes runs as pods. A pod is a user-defined group of one or more containers that share storage, network, and namespaces, and it is the smallest unit the scheduler can place. The Kubernetes scheduler is part of the control plane: it watches the pod list provided by the API server, picks up pods that have not yet been scheduled, and assigns running nodes to them according to pre-selection (predicate) and optimization (priority) policies. In essence, the scheduler produces a placement decision based mainly on the declared resource consumption.
2.1 Design of the Kubernetes scheduler
The scheduling design of Kubernetes draws on the implementation of Omega. It adopts a two-level scheduling architecture, schedules against global state, controls resource ownership with optimistic locking, and supports multiple schedulers by design.
The two-level architecture lets the scheduler hide many low-level implementation details, implement policies and constraints separately, and filter available resources, so it can adapt to resource changes more flexibly and satisfy users' individual scheduling needs. Compared with a monolithic architecture, it is easier to add custom rules and support dynamic cluster scaling, and it handles large clusters better (including support for multiple schedulers).
Compared with architectures that use pessimistic locking and partial views of the environment (such as Mesos), the advantage of the global-state, optimistic-locking approach is that the scheduler can see all the resources available in the cluster and can preempt the resources of lower-priority tasks to reach the state the policy requires. Resource allocation therefore matches the policy more closely and avoids the cluster deadlock that resource hoarding can cause. There is, of course, overhead from preempted tasks and retries caused by conflicts, but overall resource utilization is higher.
Kubernetes ships with a single scheduler by default, but Omega's design allows a resource-allocation manager to share resource and environment information with multiple schedulers, so by design Kubernetes can support multiple schedulers as well.
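As a small illustration of the multiple-scheduler design, the sketch below builds a pod object in Go whose spec.schedulerName names a custom scheduler instead of the default one. The scheduler name is a placeholder and must match whatever name a deployed custom scheduler registers under; the sketch assumes the k8s.io/api, k8s.io/apimachinery, and sigs.k8s.io/yaml modules are available.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// A pod that opts out of the default scheduler by naming a custom one.
	// "my-custom-scheduler" is a placeholder name for this illustration.
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "demo"},
		Spec: corev1.PodSpec{
			SchedulerName: "my-custom-scheduler",
			Containers: []corev1.Container{
				{Name: "app", Image: "nginx:1.25"},
			},
		},
	}

	// Print the manifest; applying it leaves the pod Pending until a scheduler
	// registered under that name binds it to a node.
	out, err := yaml.Marshal(pod)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```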
2.2 Implementation of the Kubernetes scheduler
The workflow of the Kubernetes scheduler is shown in the figure below. In essence, the scheduler runs a loop that schedules one pod at a time, driven by listening for pod creation, update, deletion, and other events. If scheduling goes smoothly, the pod is bound to a host node based on the pre-selection and optimization policies, and the kubelet is then notified to start the pod. If scheduling fails, the pod may, through priority preemption, gain the ability to be scheduled first and then re-enter the scheduling loop to wait for a successful placement.
2.2.1 Complete logic of the scheduling loop
The overall process by which the Kubernetes scheduler completes a scheduling run is shown in figure 1. The implementation of each step is described below.
(1) Start the loop, driven by events
The Kubernetes scheduler maintains a sharedIndexInformer to initialize its informer objects; that is, it listens for pod create, update, and delete events, updates its event cache, persists entries to an in-memory queue, and kicks off a scheduling cycle.
The function entry for this process is https://github.com/kubernetes/kubernetes/blob/9cbccd38598e5e2750d39e183aef21a749275087/pkg/scheduler/factory/factory.go#L631
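The factory code behind that entry point is fairly involved; as a minimal sketch of the same idea, the following client-go program registers pod event handlers on a shared informer and reacts to add, update, and delete events. The kubeconfig path and the handler bodies are placeholders, and this is not the scheduler's own code.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Example kubeconfig path; adjust for the actual environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Shared informer factory with a 30s resync period.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			// The real scheduler would enqueue an unscheduled pod here;
			// this sketch only logs the event.
			fmt.Println("pod added:", pod.Namespace+"/"+pod.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			// Refresh the cached entry for the pod.
		},
		DeleteFunc: func(obj interface{}) {
			// Drop the pod from the cache and the scheduling queue.
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // a real component would run its scheduling loop here
}
```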
(2) Add unscheduled pods to the scheduler cache and update the scheduling queue
The informer object is responsible for listening to pod events. The main handlers are addPodToCache, updatePodInCache, and deletePodFromCache for pods that have already been scheduled, and addPodToSchedulingQueue, updatePodInSchedulingQueue, and deletePodSchedulingQueue for unscheduled pods. The function entry for this process is:
https://github.com/kubernetes/kubernetes/blob/9cbccd38598e5e2750d39e183aef21a749275087/pkg/scheduler/eventhandlers.go
The meaning of the various events is shown in Table 2 below:
(3) Schedule each pod in the scheduling queue
It should be noted that when a single pod is scheduled, the node-selection algorithms are executed sequentially. In other words, during scheduling the pod goes strictly through the policies and priorities built into Kubernetes before the most suitable node is selected.
Scheduling a single pod is divided into two stages: pre-selection and optimization. In the pre-selection stage, the scheduler filters out hosts that do not meet the requirements according to a set of rules, leaving suitable candidate nodes; in the optimization stage, the candidate nodes are scored by priority functions (according to the overall optimization strategy and so on), and the node with the highest score is chosen for scheduling.
The scheduling process of a single pod is entered by the following function: https://github.com/kubernetes/kubernetes/blob/9cbccd38598e5e2750d39e183aef21a749275087/pkg/scheduler/scheduler.go#L457
Of course, scheduling may fail because no node satisfies the pod's running conditions; when the pod has a priority set, the preemption mechanism is triggered, and the high-priority pod attempts to preempt the resources of lower-priority pods. A simplified sketch of the idea is shown below.
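This sketch is illustrative only and uses hypothetical, simplified types: it picks lower-priority pods on a node whose eviction would free enough CPU for the pending pod. The real preemption code weighs many more factors, such as other resource types, affinity rules, PodDisruptionBudgets, and choosing the least costly set of victims.

```go
package main

import "fmt"

type podInfo struct {
	Name     string
	Priority int32
	CPUMilli int64
}

type nodeInfo struct {
	Name      string
	FreeMilli int64
	Pods      []podInfo
}

// findPreemptionVictims returns the lower-priority pods that would have to be
// evicted from node to fit pending, or nil if preemption on this node cannot help.
func findPreemptionVictims(pending podInfo, node nodeInfo) []podInfo {
	free := node.FreeMilli
	var victims []podInfo
	for _, p := range node.Pods {
		if p.Priority >= pending.Priority {
			continue // never preempt pods of equal or higher priority
		}
		victims = append(victims, p)
		free += p.CPUMilli
		if free >= pending.CPUMilli {
			return victims
		}
	}
	return nil
}

func main() {
	pending := podInfo{Name: "critical-pod", Priority: 1000, CPUMilli: 1500}
	node := nodeInfo{
		Name:      "node-1",
		FreeMilli: 500,
		Pods: []podInfo{
			{Name: "batch-a", Priority: 10, CPUMilli: 800},
			{Name: "batch-b", Priority: 10, CPUMilli: 700},
			{Name: "web", Priority: 2000, CPUMilli: 1000},
		},
	}
	fmt.Println("victims:", findPreemptionVictims(pending, node))
}
```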
If preemption succeeds, the pod is marked as schedulable in the next scheduling cycle. If preemption fails, the scheduler exits; because the scheduling result is not saved, the pod still appears in the list of unassigned pods.
(4) Next, check whether the conditions of user-provided plug-ins are met
The reserve plug-in is an interface that Kubernetes leaves for users to extend. With a reserve plug-in, users can add custom conditions at this stage to obtain the scheduling behavior they want. The plug-in entry function is: https://github.com/kubernetes/kubernetes/blob/9cbccd38598e5e2750d39e183aef21a749275087/pkg/scheduler/plugins/registrar.go
An example of custom scheduling that extends the reserve plug-in interface can be found at https://github.com/kubernetes/kubernetes/tree/9cbccd38598e5e2750d39e183aef21a749275087/pkg/scheduler/plugins/examples.
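The exact plug-in interface lives in the Kubernetes tree and its signature has changed across versions, so rather than reproduce it, the sketch below defines a hypothetical ReservePlugin-style interface of its own to show the idea: a custom check that runs after a node has been chosen and may veto the placement.

```go
package main

import (
	"errors"
	"fmt"
)

// reservePlugin is a hypothetical stand-in for the scheduler's reserve
// extension point: it is called after a node has been selected but before
// binding, and may reject the placement.
type reservePlugin interface {
	Name() string
	Reserve(podName, nodeName string) error
}

// denylistPlugin rejects placements onto nodes the operator has blocked.
type denylistPlugin struct {
	blockedNodes map[string]bool
}

func (d *denylistPlugin) Name() string { return "denylist-reserve" }

func (d *denylistPlugin) Reserve(podName, nodeName string) error {
	if d.blockedNodes[nodeName] {
		return errors.New("node is temporarily blocked for new workloads")
	}
	return nil
}

func main() {
	var p reservePlugin = &denylistPlugin{blockedNodes: map[string]bool{"node-2": true}}
	for _, node := range []string{"node-1", "node-2"} {
		if err := p.Reserve("demo-pod", node); err != nil {
			fmt.Printf("%s: reserve on %s rejected: %v\n", p.Name(), node, err)
			continue
		}
		fmt.Printf("%s: reserve on %s accepted\n", p.Name(), node)
	}
}
```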
(5) After a node that satisfies the requirements is found, the Pod object is updated to record the scheduling result (the selected node).
The function entry for this process is in https://github.com/kubernetes/kubernetes/blob/9cbccd38598e5e2750d39e183aef21a749275087/pkg/scheduler/scheduler.go#L517.
(6) Complete the binding of the pod to the node
Binding a pod to a node first requires mounting the storage volumes, and the binding is finally completed by updating the pod object. The specific code can be found at: https://github.com/kubernetes/kubernetes/blob/9cbccd38598e5e2750d39e183aef21a749275087/pkg/scheduler/scheduler.go#L524.
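For illustration, here is a minimal client-go sketch of issuing the Bind sub-resource call once a target node has been chosen. The pod name, namespace, node name, and kubeconfig path are placeholders, the Bind signature shown follows recent client-go releases (older releases take slightly different arguments), and this is not the scheduler's own binding code.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Example kubeconfig path; adjust for the actual environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Bind pod "demo" in namespace "default" to node "node-1" (example names).
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: "demo", Namespace: "default"},
		Target:     corev1.ObjectReference{Kind: "Node", Name: "node-1"},
	}
	if err := clientset.CoreV1().Pods("default").Bind(context.TODO(), binding, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("binding created")
}
```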
(7) After scheduling completes, the main goroutine returns and starts the next scheduling round.
This completes the full scheduling process. The following focuses on how Kubernetes selects a node during the scheduling of a single pod, covering the pre-selection and optimization strategies.
2.2.2 Scheduling process for a single pod
The scheduling process for a single pod is shown in figure 2 below. It consists mainly of the pre-selection steps (pre-filter, filter, post-filter) and the optimization (scoring) step.
Figure 2: scheduling process for a single Pod
(1) The pod enters the scheduling phase and first goes through pre-selection.
Nodes that satisfy the pod's scheduling conditions are found by rule-based filtering.
Kubernetes has a number of built-in filtering rules, which the scheduler applies in a predefined order. The built-in rules mainly include: checking whether the node has sufficient resources (such as CPU, memory, and GPU) to meet the pod's running requirements; checking whether a HostPort required by the pod's containers is already occupied by another container or service on the node; checking whether the node's labels match the pod's nodeSelector; judging, based on taints and tolerations, whether the pod can be placed on the node, that is, whether the pod tolerates the node's taints; and checking whether the CSI limit on the maximum number of mountable volumes is respected. A minimal sketch of two of these checks follows.
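The sketch below covers only resource fit and nodeSelector matching, over hypothetical, simplified types; the real predicates operate on full Pod and Node API objects and cover many more cases.

```go
package main

import "fmt"

type podSpec struct {
	CPUMilli     int64
	MemoryMiB    int64
	NodeSelector map[string]string
}

type node struct {
	Name          string
	FreeCPUMilli  int64
	FreeMemoryMiB int64
	Labels        map[string]string
}

// fitsResources reports whether the node has enough free CPU and memory.
func fitsResources(p podSpec, n node) bool {
	return n.FreeCPUMilli >= p.CPUMilli && n.FreeMemoryMiB >= p.MemoryMiB
}

// matchesNodeSelector reports whether every selector key/value is present
// in the node's labels.
func matchesNodeSelector(p podSpec, n node) bool {
	for k, v := range p.NodeSelector {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	p := podSpec{CPUMilli: 500, MemoryMiB: 256, NodeSelector: map[string]string{"disktype": "ssd"}}
	nodes := []node{
		{Name: "node-1", FreeCPUMilli: 2000, FreeMemoryMiB: 4096, Labels: map[string]string{"disktype": "ssd"}},
		{Name: "node-2", FreeCPUMilli: 100, FreeMemoryMiB: 4096, Labels: map[string]string{"disktype": "ssd"}},
	}
	for _, n := range nodes {
		if fitsResources(p, n) && matchesNodeSelector(p, n) {
			fmt.Println(n.Name, "passes the filters")
		} else {
			fmt.Println(n.Name, "is filtered out")
		}
	}
}
```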
(2) After the nodes are filtered by the pre-selection policies, scheduling enters the optimization stage.
The scheduler scores each node according to the preset default rules (the sum of each priority function's score multiplied by its weight) and then selects the node with the highest score to bind the pod to.
The built-in priority functions of Kubernetes are listed below; they mainly include spreading priority (SelectorSpreadPriority), least-requested priority (LeastRequestedPriority), balanced resource allocation (BalancedResourceAllocation), and so on.
SelectorSpreadPriority: for better availability, spread multiple Pod replicas belonging to the same Service, ReplicationController, or ReplicaSet across as many different nodes as possible.
InterPodAffinityPriority: iterates over the elements of weightedPodAffinityTerm and computes a sum; if the corresponding PodAffinityTerm is satisfied for a node, the term's weight is added to that node's sum, and the node with the highest sum is preferred.
LeastRequestedPriority: a node's priority is determined by the ratio of its idle resources to its total capacity, that is, (total capacity - sum of requests of the Pods on the node - request of the new Pod) / total capacity. CPU and memory have equal weight, and the higher the ratio, the higher the score (see the sketch after this list).
BalancedResourceAllocation: the closer the CPU and memory utilization are to each other, the higher the priority. This policy cannot be used on its own; it must be used together with LeastRequestedPriority, i.e. try to choose nodes whose resource usage stays balanced after the Pod is deployed.
NodePreferAvoidPodsPriority (weight 10000): if the node's annotation does not set the key scheduler.alpha.kubernetes.io/preferAvoidPods, the node scores 10 for this policy; multiplied by the weight of 10000, its score for this policy is at least 100000. If the annotation scheduler.alpha.kubernetes.io/preferAvoidPods is set and the controller of the pod is a ReplicationController or ReplicaSet, the node scores 0 for this policy.
NodeAffinityPriority: implements the node-affinity mechanism in Kubernetes scheduling.
TaintTolerationPriority: matches the pod's toleration list against the node's taints; the more of the node's taints the pod cannot tolerate, the lower the node's score.
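As a worked illustration of the LeastRequestedPriority formula above, the following sketch computes the score for a lightly loaded and a heavily loaded node, using hypothetical simplified resource types and the classic 0-10 score range; it is not the scheduler's own code.

```go
package main

import "fmt"

type resources struct {
	CPUMilli  int64
	MemoryMiB int64
}

// leastRequestedScore averages the free-capacity ratios of CPU and memory
// after the new pod is added, scaled to 0-10. CPU and memory carry equal weight.
func leastRequestedScore(capacity, requestedOnNode, newPod resources) float64 {
	cpuFree := float64(capacity.CPUMilli-requestedOnNode.CPUMilli-newPod.CPUMilli) / float64(capacity.CPUMilli)
	memFree := float64(capacity.MemoryMiB-requestedOnNode.MemoryMiB-newPod.MemoryMiB) / float64(capacity.MemoryMiB)
	if cpuFree < 0 {
		cpuFree = 0
	}
	if memFree < 0 {
		memFree = 0
	}
	return (cpuFree + memFree) / 2 * 10
}

func main() {
	capacity := resources{CPUMilli: 4000, MemoryMiB: 8192}
	newPod := resources{CPUMilli: 500, MemoryMiB: 512}

	lightlyLoaded := resources{CPUMilli: 500, MemoryMiB: 1024}
	heavilyLoaded := resources{CPUMilli: 3000, MemoryMiB: 6144}

	fmt.Printf("lightly loaded node score: %.2f\n", leastRequestedScore(capacity, lightlyLoaded, newPod))
	fmt.Printf("heavily loaded node score: %.2f\n", leastRequestedScore(capacity, heavilyLoaded, newPod))
}
```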
Deficiencies of the Kubernetes scheduler and their solutions
3.1 Several typical problems and their solutions
(1) The scheduler schedules only once, based on the resources and environment at that moment; once scheduling is complete there is no mechanism to adjust the placement.
A pod's placement changes only when the pod exits on its own, the user deletes it, or cluster resources are insufficient, yet the resource topology can change at any time: batch tasks finish, nodes are added or crash. As a result, a placement that was optimal at scheduling time can degrade in quality after the topology changes.
After community discussion, it was agreed that pods that no longer satisfy the scheduling policy need to be found, deleted, and recreated so that they are rescheduled; the descheduler project was designed and launched on this basis.
(2) Scheduling is performed one pod at a time, which makes it difficult to schedule interrelated workloads.
Computation such as big-data analysis and machine learning mostly relies on batch tasks, and these workloads are highly interrelated and interdependent. To solve this problem, the community discussed and proposed coscheduling, a project that schedules a group of pods at a time, in order to better handle this kind of scheduling task.
(3) The current scheduler implementation is only concerned with whether a pod can be bound to a node; resource-usage data is not fully exploited.
At present, cluster utilization can only be inferred indirectly from monitoring data. If the remaining resources of a Kubernetes cluster are insufficient, there is no direct figure that can be used to trigger scale-out or raise an alert.
For this reason the community launched the cluster-capacity framework project to provide cluster-capacity data, making it convenient for cluster-maintenance programs or administrators to expand the cluster based on that data. There are also projects, such as poseidon, that collect monitoring data and compute the overall load of the cluster as input to the scheduling algorithm.
Customized extensions to the Kubernetes scheduler
As mentioned in the previous section, a general-purpose scheduler cannot satisfy users' specific needs in every scenario, and a cluster scheduler running in a real environment often needs to be customized and further developed for the actual requirements.
The Kubernetes scheduler is implemented in a plug-in style, which makes it convenient for users to customize and extend scheduling. There are several ways to customize the scheduler:
Change the built-in Kubernetes policies, either by changing the default policy file or by recompiling the scheduler.
Extend the scheduler's interfaces at the pre-filter, filter, post-filter, reserve, prebind, bind, and post-bind stages, changing the concrete logic of the scheduler's filtering, scoring, preemption, and reservation.
Replace the scheduling algorithm and implement the scheduler logic from scratch.
Case studies of enterprise application scenarios
4.1 General computing scenarios
The Kubernetes default scheduler meets the needs of general computing and mainly serves several kinds of scenario: continuous integration and continuous deployment platforms (DevOps platforms), container application operation-and-maintenance platforms for standard three-tier applications (container platforms), PaaS platforms, and core infrastructure platforms for cloud-native applications (aPaaS platforms).
In general, the standard Kubernetes scheduler can meet the requirements of most computing scenarios. It mainly solves the scheduling of applications across heterogeneous cloud resources as they move to the cloud, the dynamic scheduling responses needed for elastic scaling and fault self-healing, the scheduling of standard middleware and database services according to routine operation-and-maintenance specifications, and the end-to-end scheduling of cloud-native applications covering service governance, configuration management, state feedback, and event-link tracing.
4.2 Batch scenarios
Big-data analysis and machine-learning tasks require large amounts of resources. When many such tasks run at the same time, resources are quickly exhausted and some tasks have to wait for resources to be released. The steps of this type of task are often interrelated, and running them separately can affect the final result. With the default scheduler, when cluster resources are tight, it can happen that a resource-consuming pod is waiting for a dependent pod to finish while the cluster has no free resources left to run that dependency, resulting in deadlock. Therefore, scheduling such tasks calls for group (gang) scheduling, i.e. scheduling only once all the resources the job needs have been gathered, which also reduces the number of pending pods, lowers the scheduler's load, and avoids many problems caused by resource shortage.
Unlike the default scheduler, which dispatches one pod at a time, kube-batch defines a PodGroup that describes a set of related pod resources and implements a completely new scheduler. The flow of this scheduler is basically the same as the default one, and the PodGroup ensures that a group of pods can be scheduled together. It is the Kubernetes community's implementation for big-data analysis scenarios. The sketch below illustrates the gang-scheduling idea.
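The following is a hypothetical, greatly simplified illustration of that gang-scheduling idea (not kube-batch code): a group is admitted only when every pod in it can fit at once, so no partial placement occurs.

```go
package main

import "fmt"

type podRequest struct {
	Name     string
	CPUMilli int64
}

type podGroup struct {
	Name string
	Pods []podRequest
}

// admitGroup returns true only if the free CPU can hold every pod in the
// group at once; a partial placement is never attempted.
func admitGroup(g podGroup, freeCPUMilli int64) bool {
	var total int64
	for _, p := range g.Pods {
		total += p.CPUMilli
	}
	return total <= freeCPUMilli
}

func main() {
	group := podGroup{
		Name: "training-job",
		Pods: []podRequest{
			{Name: "worker-0", CPUMilli: 2000},
			{Name: "worker-1", CPUMilli: 2000},
			{Name: "ps-0", CPUMilli: 1000},
		},
	}
	for _, free := range []int64{4000, 6000} {
		fmt.Printf("free=%dm admit=%v\n", free, admitGroup(group, free))
	}
}
```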
4.3 Domain-specific business scenarios
Specific business scenarios require the scheduler to produce scheduling decisions quickly and to avoid scheduling timeouts as much as possible. Poseidon is a scheduler that exploits data locality on a graph model to reduce task execution time and mixes several scheduling algorithms to improve scheduling speed in large clusters.
Poseidon is built on the Firmament algorithm: it constructs resource-usage information from heapster data and calls the Firmament implementation to schedule. Inspired by Quincy [11], the Firmament algorithm builds a graph from tasks to nodes, but to reduce scheduling time its authors merge two shortest-path algorithms and replace full synchronization of environment information with incremental synchronization. This lets Firmament handle short batch tasks faster than Quincy, and it avoids the timeouts the default Kubernetes scheduler suffers when resources are scarce.
This article has introduced the Kubernetes scheduler and the community's supplements and enhancements from the perspectives of design principles and code implementation, summarized the design principles of the Kubernetes scheduler and the scenarios in which Kubernetes needs to be enhanced to meet business requirements, and provided a basis and evaluation criteria for technology selection.