
The working mechanism Kubernetes uses to improve Scheduler throughput


This article explains the working mechanism Kubernetes uses to improve Scheduler throughput: the Equivalence Class and its Equivalence Cache. The walkthrough below follows the Kubernetes 1.10 source code.

The concept and significance of Equivalence Class

In 2015, Google published its Borg paper, "Large-scale cluster management at Google with Borg", which describes Equivalence Classes as follows:

Equivalence classes: Tasks in a Borg job usually have identical requirements and constraints, so rather than determining feasibility for every pending task on every machine, and scoring all the feasible machines, Borg only does feasibility and scoring for one task per equivalence class-a group of tasks with identical requirements.

In Kubernetes Scheduler, the Equivalence Class is currently used to accelerate the Predicate phase and thereby improve Scheduler throughput. The scheduler keeps the Equivalence Cache up to date: when relevant events occur (a node is deleted, a pod is bound, etc.), it must immediately invalidate the affected cached data in the Equivalence Cache.

An Equivalence Class groups Pods that have the same Requirements and Constraints, together with related bookkeeping information. During the Scheduler's Predicate phase, only one Pod in the Equivalence Class needs to run the Predicates; the result is stored in the Equivalence Cache so that the other Pods in the class (called Equivalent Pods) can reuse it. A normal Predicate run happens only when there is no reusable Predicate Result in the Equivalence Cache.
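To make the saving concrete, here is a minimal, self-contained sketch of the memoization idea. This is an illustration only, not the scheduler's code: the cache layout and the names (predicateOnce, classHash) are simplified stand-ins for the real Equivalence Cache.

package main

import "fmt"

// cache: node name -> equivalence hash -> cached predicate result (simplified to a bool)
var cache = map[string]map[uint64]bool{}

// predicateOnce runs the predicate only when no Equivalent Pod has been checked
// against this node yet; otherwise it reuses the cached result.
func predicateOnce(node string, classHash uint64, predicate func() bool) bool {
    if r, hit := cache[node][classHash]; hit {
        return r // Equivalent Pod: reuse the cached Predicate Result
    }
    r := predicate() // normal Predicate path
    if cache[node] == nil {
        cache[node] = map[uint64]bool{}
    }
    cache[node][classHash] = r
    return r
}

func main() {
    calls := 0
    pred := func() bool { calls++; return true }
    for i := 0; i < 3; i++ { // three replicas of one ReplicaSet: one Equivalence Class
        predicateOnce("node-1", 0xabcd, pred)
    }
    fmt.Println(calls) // 1: the predicate ran once; the other two replicas reused it
}

With N equivalent replicas and M nodes, the predicate work drops from roughly N×M evaluations toward M, which is where the throughput gain comes from.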

Which Pods fall into the same Equivalence Class? By definition, Pods that share certain fields, such as resource requirements, labels, and affinity, are Equivalent Pods and belong to the same Equivalence Class. However, users may modify a Pod's fields at any time; the Scheduler would then have to move the Pod to a different Equivalence Class, and any in-flight Predicate would need to notice the change and react, which makes the problem extremely complicated. Therefore, at present the Scheduler only puts Pods with the same OwnerReference (RC, RS, Job, StatefulSet) into the same Equivalence Class; for example, if an RS defines N replicas, those N Pods correspond to one Equivalence Class. The Scheduler computes a uint64 EquivalenceHash value for the Equivalent Pods of each Equivalence Class.

Note that as of Kubernetes 1.10, two RS with identical Pod Templates still yield two separate Equivalence Classes.

How Equivalence Class works

To use Equivalence Class, you need to enable the EnableEquivalenceClassCache Feature Gate (for example, by starting kube-scheduler with --feature-gates=EnableEquivalenceClassCache=true), which is still in Alpha as of Kubernetes 1.10.

In my earlier blog posts analyzing the scheduler's Predicate phase, I mentioned that scheduler.findNodesThatFit(pod, nodes, predicateFuncs, ...) calls scheduler.podFitsOnNode(pod, node, predicateFuncs, ...) for each node, with a certain degree of parallelism, to run every successfully registered Predicate Policy against that node.

The input to podFitsOnNode is a pod, a node, and the set of successfully registered predicateFuncs, and it checks whether the node satisfies the pod's preselection criteria. With Equivalence Class in place, the preselection phase changes as follows:

Before pre-selection, check whether the pod has a corresponding Equivalence Class.

If there is a corresponding Equivalence Class, check whether the Equivalence Cache holds a usable Predicate Result; if there is no Equivalence Class, trigger a full normal preselection.

If a Predicate Result is available, complete the preselection directly from the cached result; otherwise trigger a full normal preselection.

Equivalence Cache stores the Predicate Results for each node in a 3-tier Map:

The first-layer key is the node name.

The second-layer key is predicateKey, representing the preselection policy; the number of entries in a node's algorithmCache therefore never exceeds the number of Predicate Policies registered with the Scheduler, which bounds the cache size and keeps Equivalence Cache lookups fast.

The third-layer key is the Equivalence Hash mentioned earlier.

For example, algorithmCache[$nodeName].predicatesCache.Get($predicateKey)[$equivalenceHash] indicates whether the Pods corresponding to $equivalenceHash passed the $predicateKey preselection on node $nodeName.
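A minimal sketch of this 3-tier layout follows. It is simplified on purpose: plain maps and string reasons throughout, whereas in the real cache the second tier is an LRU cache and the types are the scheduler's own.

package main

import "fmt"

// HostPredicate is the cached result of one predicate for one equivalence class on one node.
type HostPredicate struct {
    Fit         bool
    FailReasons []string
}

// Tier 3: equivalenceHash -> result
type PredicateMap map[uint64]HostPredicate

// Tier 2: predicateKey -> PredicateMap (an LRU cache in the real scheduler)
type AlgorithmCache map[string]PredicateMap

// Tier 1: nodeName -> AlgorithmCache
var algorithmCache = map[string]AlgorithmCache{}

func main() {
    algorithmCache["node-1"] = AlgorithmCache{
        "PodFitsResources": PredicateMap{
            0xabcd: {Fit: true},
        },
    }
    // The lookup path mirrors algorithmCache[$nodeName].predicatesCache.Get($predicateKey)[$equivalenceHash]:
    result := algorithmCache["node-1"]["PodFitsResources"][0xabcd]
    fmt.Println(result.Fit) // true: this equivalence class passed PodFitsResources on node-1
}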

As of Kubernetes 1.10, the following 20 Predicate Policies are supported:

MatchInterPodAffinity

CheckVolumeBinding

CheckNodeCondition

GeneralPredicates

HostName

PodFitsHostPorts

MatchNodeSelector

PodFitsResources

NoDiskConflict

PodToleratesNodeTaints

CheckNodeUnschedulable

PodToleratesNodeNoExecuteTaints

CheckNodeLabelPresence

CheckServiceAffinity

MaxEBSVolumeCount

MaxGCEPDVolumeCount

MaxAzureDiskVolumeCount

NoVolumeZoneConflict

CheckNodeMemoryPressure

CheckNodeDiskPressure

Note that even if a Pod has a corresponding Equivalence Class, the Equivalence Cache may hold no usable Predicate Result, or the cached Result may have expired; in that case the normal Predicate runs and its Result is written back to the Equivalence Cache.

How is the Equivalence Cache maintained and updated? If a node's entire Equivalence Cache were invalidated frequently, it would defeat the purpose of the design and fail to speed up the Predicate phase.

As mentioned earlier, the Equivalence Cache is a three-tier Map whose second-layer key is the predicateKey, so the Scheduler can invalidate a single Predicate Result instead of blindly invalidating the node's entire algorithmCache.

The Scheduler watches Add/Update/Delete events on the relevant API Objects and invalidates the Equivalence Cache data of the corresponding policies. The source code analysis below covers this in detail.

Source code analysis of Equivalence Class

Equivalence Cache data structure

The Equivalence Cache structure is defined as follows:

// EquivalenceCache holds:
// 1. a map of AlgorithmCache with node name as key
// 2. function to get equivalence pod
type EquivalenceCache struct {
    sync.RWMutex
    getEquivalencePod algorithm.GetEquivalencePodFunc
    algorithmCache    map[string]AlgorithmCache
}

// The AlgorithmCache stores PredicateMap with predicate name as key
type AlgorithmCache struct {
    // Only consider predicates for now
    predicatesCache *lru.Cache
}

The actual cached data of the Equivalence Cache lives in the algorithmCache Map, keyed by nodeName.

Each node's Predicate Result Cache is stored in AlgorithmCache.predicatesCache, an LRU (Least Recently Used) Cache that holds only a bounded number of entries; Kubernetes caps it at 100 (Kubernetes 1.10 ships 20 Predicate Funcs by default, so the cap is comfortably large).

LRU is a cache replacement policy: when the cache is full (no free slots), the entry that was least recently used is evicted, so the entries that remain are the recently accessed ones. By the locality principle, such entries are the most likely to be accessed again, which keeps performance high.
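As a small runnable demonstration of that eviction behavior, here is an example using the github.com/golang/groupcache/lru package, the same kind of bounded LRU the scheduler relies on (treat the exact dependency choice as an assumption of this sketch):

package main

import (
    "fmt"

    "github.com/golang/groupcache/lru"
)

func main() {
    // Cap the cache at 2 entries so eviction is easy to see
    // (the scheduler's predicatesCache uses a cap of 100).
    c := lru.New(2)
    c.Add("GeneralPredicates", "cached result A")
    c.Add("PodFitsHostPorts", "cached result B")
    c.Get("GeneralPredicates")                    // touch it: PodFitsHostPorts is now least recently used
    c.Add("MatchNodeSelector", "cached result C") // evicts PodFitsHostPorts
    _, ok := c.Get("PodFitsHostPorts")
    fmt.Println(ok) // false: the least recently used entry was evicted
}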

predicatesCache is itself key-value storage: the key is the predicateKey and the value is a PredicateMap. The key of PredicateMap is the uint64 Equivalence Hash described earlier, and the value is a HostPredicate.

HostPredicate represents the result of matching a Pod against a node with a Predicate Policy. The structure is as follows:

// HostPredicate is the cached predicate result
type HostPredicate struct {
    Fit         bool
    FailReasons []algorithm.PredicateFailureReason
}

Core operations of Equivalence Cache

InvalidateCachedPredicateItem: deletes, for a given predicate policy on a given node, the cached Predicate Results of all EquivalenceHashes (i.e. of the corresponding Equivalent Pods).

func (ec *EquivalenceCache) InvalidateCachedPredicateItem(nodeName string, predicateKeys sets.String) {
    ...
    if algorithmCache, exist := ec.algorithmCache[nodeName]; exist {
        for predicateKey := range predicateKeys {
            algorithmCache.predicatesCache.Remove(predicateKey)
        }
    }
    ...
}

InvalidateCachedPredicateItemOfAllNodes: deletes, on all nodes, the cached Predicate Results of all EquivalenceHashes (corresponding Equivalent Pods) for the specified set of predicate policies.

func (ec *EquivalenceCache) InvalidateCachedPredicateItemOfAllNodes(predicateKeys sets.String) {
    ...
    // algorithmCache uses nodeName as key, so we just iterate it and invalid given predicates
    for _, algorithmCache := range ec.algorithmCache {
        for predicateKey := range predicateKeys {
            // just use keys is enough
            algorithmCache.predicatesCache.Remove(predicateKey)
        }
    }
    ...
}

PredicateWithECache: checks whether the Equivalence Cache holds usable Predicate Result data. On a cache hit, the cached Predicate Result (fit, or not fit plus the failure reasons) is returned directly as the preselection result of that predicate policy for the pod on the node; on a miss, it reports the cache entry as invalid.

// PredicateWithECache returns:
// 1. if fit
// 2. reasons if not fit
// 3. if this cache is invalid
// based on cached predicate results
func (ec *EquivalenceCache) PredicateWithECache(
    podName, nodeName, predicateKey string,
    equivalenceHash uint64, needLock bool,
) (bool, []algorithm.PredicateFailureReason, bool) {
    ...
    if algorithmCache, exist := ec.algorithmCache[nodeName]; exist {
        if cachePredicate, exist := algorithmCache.predicatesCache.Get(predicateKey); exist {
            predicateMap := cachePredicate.(PredicateMap)
            // TODO(resouer) Is it possible a race that cache failed to update immediately?
            if hostPredicate, ok := predicateMap[equivalenceHash]; ok {
                if hostPredicate.Fit {
                    return true, []algorithm.PredicateFailureReason{}, false
                }
                return false, hostPredicate.FailReasons, false
            }
            // is invalid
            return false, []algorithm.PredicateFailureReason{}, true
        }
    }
    return false, []algorithm.PredicateFailureReason{}, true
}

UpdateCachedPredicateItem: when PredicateWithECache fails to hit cached Predicate Result data, the scheduler invokes the corresponding Predicate Funcs to run the real preselection logic, and afterwards writes the fresh result into the Equivalence Cache through UpdateCachedPredicateItem. Each node's predicatesCache is also initialized here.

// UpdateCachedPredicateItem updates pod predicate for equivalence class
func (ec *EquivalenceCache) UpdateCachedPredicateItem(
    podName, nodeName, predicateKey string,
    fit bool,
    reasons []algorithm.PredicateFailureReason,
    equivalenceHash uint64,
    needLock bool,
) {
    ...
    if _, exist := ec.algorithmCache[nodeName]; !exist {
        ec.algorithmCache[nodeName] = newAlgorithmCache()
    }
    predicateItem := HostPredicate{
        Fit:         fit,
        FailReasons: reasons,
    }
    // if cached predicate map already exists, just update the predicate by key
    if v, ok := ec.algorithmCache[nodeName].predicatesCache.Get(predicateKey); ok {
        predicateMap := v.(PredicateMap)
        // maps in golang are references, no need to add them back
        predicateMap[equivalenceHash] = predicateItem
    } else {
        ec.algorithmCache[nodeName].predicatesCache.Add(predicateKey,
            PredicateMap{equivalenceHash: predicateItem})
    }
    ...
}

Initialization of Equivalence Cache

When Kubernetes registers predicates, priorities, and scheduler extenders, it also initializes the Equivalence Cache and passes it into the scheduler Config.

// Creates a scheduler from a set of registered fit predicate keys and priority keys.
func (c *configFactory) CreateFromKeys(predicateKeys, priorityKeys sets.String, extenders []algorithm.SchedulerExtender) (*scheduler.Config, error) {
    ...
    // Init equivalence class cache
    if c.enableEquivalenceClassCache && getEquivalencePodFuncFactory != nil {
        pluginArgs, err := c.getPluginArgs()
        if err != nil {
            return nil, err
        }
        c.equivalencePodCache = core.NewEquivalenceCache(
            getEquivalencePodFuncFactory(*pluginArgs),
        )
        glog.Info("Created equivalence class cache")
    }
    ...
}

// NewEquivalenceCache creates an EquivalenceCache object.
func NewEquivalenceCache(getEquivalencePodFunc algorithm.GetEquivalencePodFunc) *EquivalenceCache {
    return &EquivalenceCache{
        getEquivalencePod: getEquivalencePodFunc,
        algorithmCache:    make(map[string]AlgorithmCache),
    }
}

NewEquivalenceCache initializes the EquivalenceCache, but where is getEquivalencePod registered? GetEquivalencePodFunc is registered during the default algorithm provider's initialization (only the default provider? what about a scheduler config file?). Note that only PVCInfo is passed in factory.PluginFactoryArgs.

GetEquivalencePodFunc is a function that gets an EquivalencePod from a pod.

pkg/scheduler/algorithmprovider/defaults/defaults.go:38

func init() {
    ...
    // Use equivalence class to speed up heavy predicates phase.
    factory.RegisterGetEquivalencePodFunction(
        func(args factory.PluginFactoryArgs) algorithm.GetEquivalencePodFunc {
            return predicates.NewEquivalencePodGenerator(args.PVCInfo)
        },
    )
    ...
}

Why only pass in PVCInfo? Or rather, why is PVCInfo needed at all? To answer this, first look at the definitions of EquivalencePod and getEquivalencePod.

// EquivalencePod is a group of pod attributes which can be reused as equivalence to schedule other pods.
type EquivalencePod struct {
    ControllerRef metav1.OwnerReference
    PVCSet        sets.String
}

EquivalencePod defines the set of attributes that make Pods Equivalent Pods. The Equivalence Hash is computed from the two attributes in a Pod's EquivalencePod:

ControllerRef: the Pod's meta.OwnerReference, i.e. the Controller Object the Pod belongs to, which can be an RC, RS, Job, or StatefulSet.

PVCSet: the set of IDs of all PVCs referenced by the Pod.

Therefore, only Pods that belong to the same Controller and reference the same PVC objects are considered Equivalent Pods and map to the same Equivalence Hash.

getEquivalencePod builds the EquivalencePod object from the OwnerReference and PVC information in the Pod Object.

func (e *EquivalencePodGenerator) getEquivalencePod(pod *v1.Pod) interface{} {
    for _, ref := range pod.OwnerReferences {
        if ref.Controller != nil && *ref.Controller {
            pvcSet, err := e.getPVCSet(pod)
            if err == nil {
                // A pod can only belongs to one controller, so let's return.
                return &EquivalencePod{
                    ControllerRef: ref,
                    PVCSet:        pvcSet,
                }
            }
            return nil
        }
    }
    return nil
}

When the Equivalence Hash of a Pod is generated

The preselection entry point is findNodesThatFit: it calls getEquivalenceClassInfo to compute the Pod's EquivalenceHash and passes the hash into podFitsOnNode for the subsequent Equivalence Class logic.

func findNodesThatFit(
    pod *v1.Pod,
    nodeNameToInfo map[string]*schedulercache.NodeInfo,
    nodes []*v1.Node,
    predicateFuncs map[string]algorithm.FitPredicate,
    extenders []algorithm.SchedulerExtender,
    metadataProducer algorithm.PredicateMetadataProducer,
    ecache *EquivalenceCache,
    schedulingQueue SchedulingQueue,
    alwaysCheckAllPredicates bool,
) ([]*v1.Node, FailedPredicateMap, error) {
    ...
    var equivCacheInfo *equivalenceClassInfo
    if ecache != nil {
        // getEquivalenceClassInfo will return immediately if no equivalence pod found
        equivCacheInfo = ecache.getEquivalenceClassInfo(pod)
    }
    checkNode := func(i int) {
        nodeName := nodes[i].Name
        fits, failedPredicates, err := podFitsOnNode(
            pod, meta, nodeNameToInfo[nodeName], predicateFuncs,
            ecache, schedulingQueue, alwaysCheckAllPredicates, equivCacheInfo,
        )
        ...
    }
    ...
}

getEquivalenceClassInfo computes the Pod's EquivalenceHash as follows:

// getEquivalenceClassInfo returns the equivalence class of given pod.
func (ec *EquivalenceCache) getEquivalenceClassInfo(pod *v1.Pod) *equivalenceClassInfo {
    equivalencePod := ec.getEquivalencePod(pod)
    if equivalencePod != nil {
        hash := fnv.New32a()
        hashutil.DeepHashObject(hash, equivalencePod)
        return &equivalenceClassInfo{
            hash: uint64(hash.Sum32()),
        }
    }
    return nil
}

As you can see, the EquivalenceHash is the FNV hash of the EquivalencePod returned by getEquivalencePod.
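As a standalone illustration of that hashing step: hash/fnv is the real package, but the struct fields below and the %#v formatting are simplified stand-ins for the scheduler's types and for hashutil.DeepHashObject (which pretty-prints the object into the hasher).

package main

import (
    "fmt"
    "hash/fnv"
)

// Simplified EquivalencePod: just the two ingredients of the hash.
type EquivalencePod struct {
    ControllerUID string   // stands in for the OwnerReference
    PVCSet        []string // stands in for the referenced PVC set
}

func equivalenceHash(ep EquivalencePod) uint64 {
    h := fnv.New32a()
    fmt.Fprintf(h, "%#v", ep) // rough stand-in for hashutil.DeepHashObject
    return uint64(h.Sum32())
}

func main() {
    a := EquivalencePod{ControllerUID: "rs-1234", PVCSet: []string{"pvc-a"}}
    b := EquivalencePod{ControllerUID: "rs-1234", PVCSet: []string{"pvc-a"}}
    c := EquivalencePod{ControllerUID: "rs-5678", PVCSet: []string{"pvc-a"}}
    fmt.Println(equivalenceHash(a) == equivalenceHash(b)) // true: same controller and PVCs
    fmt.Println(equivalenceHash(a) == equivalenceHash(c)) // false: different controller
}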

When the Predicate Result of an Equivalent Pod is added to the PredicateCache

Let's first take a look at the relevant implementation of podFitsOnNode:

func podFitsOnNode(
    pod *v1.Pod,
    meta algorithm.PredicateMetadata,
    info *schedulercache.NodeInfo,
    predicateFuncs map[string]algorithm.FitPredicate,
    ecache *EquivalenceCache,
    queue SchedulingQueue,
    alwaysCheckAllPredicates bool,
    equivCacheInfo *equivalenceClassInfo,
) (bool, []algorithm.PredicateFailureReason, error) {
    ...
    if predicate, exist := predicateFuncs[predicateKey]; exist {
        // Use an in-line function to guarantee invocation of ecache.Unlock()
        // when the in-line function returns.
        func() {
            var invalid bool
            if eCacheAvailable {
                // Lock ecache here to avoid a race condition against cache invalidation invoked
                // in event handlers. This race has existed despite locks in equivClassCache implementation.
                ecache.Lock()
                defer ecache.Unlock()
                // PredicateWithECache will return its cached predicate results.
                fit, reasons, invalid = ecache.PredicateWithECache(
                    pod.GetName(), info.Node().GetName(),
                    predicateKey, equivCacheInfo.hash, false)
            }
            if !eCacheAvailable || invalid {
                // we need to execute predicate functions since equivalence cache does not work
                fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
                if err != nil {
                    return
                }
                if eCacheAvailable {
                    // Store data to update equivClassCache after this loop.
                    if res, exists := predicateResults[predicateKey]; exists {
                        res.Fit = res.Fit && fit
                        res.FailReasons = append(res.FailReasons, reasons...)
                        predicateResults[predicateKey] = res
                    } else {
                        predicateResults[predicateKey] = HostPredicate{Fit: fit, FailReasons: reasons}
                    }
                    result := predicateResults[predicateKey]
                    ecache.UpdateCachedPredicateItem(
                        pod.GetName(), info.Node().GetName(),
                        predicateKey, result.Fit, result.FailReasons,
                        equivCacheInfo.hash, false)
                }
            }
        }()
    }
    ...
}

In podFitsOnNode, the scheduler first checks via PredicateWithECache whether the Equivalence Cache is hit:

If usable cached data is hit, it is taken directly as the result of the corresponding Predicate Policy.

If not, the predicate call is triggered, and the predicate's result is then added to or updated in the cache through UpdateCachedPredicateItem.

Maintain Equivalence Cache

Back in the Scheduler Config Factory, let's look at the Equivalence Cache operations inside the EventHandlers that the Scheduler registers on podInformer, nodeInformer, serviceInformer, pvcInformer, and so on.

Assume Pod

When scheduling of a pod completes, Pod Assume runs before Bind Node, and the Assume step also touches the Equivalence Cache.

// assume signals to the cache that a pod is already in the cache, so that binding can be asynchronous.
// assume modifies `assumed`.
func (sched *Scheduler) assume(assumed *v1.Pod, host string) error {
    ...
    // Optimistically assume that the binding will succeed, so we need to invalidate affected
    // predicates in equivalence cache.
    // If the binding fails, these invalidated item will not break anything.
    if sched.config.Ecache != nil {
        sched.config.Ecache.InvalidateCachedPredicateItemForPodAdd(assumed, host)
    }
    return nil
}

Assume Pod calls InvalidateCachedPredicateItemForPodAdd to operate on the Equivalence Cache.

func (ec *EquivalenceCache) InvalidateCachedPredicateItemForPodAdd(pod *v1.Pod, nodeName string) {
    // GeneralPredicates: will always be affected by adding a new pod
    invalidPredicates := sets.NewString("GeneralPredicates")
    // MaxPDVolumeCountPredicate: we check the volumes of pod to make decision.
    for _, vol := range pod.Spec.Volumes {
        if vol.PersistentVolumeClaim != nil {
            invalidPredicates.Insert("MaxEBSVolumeCount", "MaxGCEPDVolumeCount", "MaxAzureDiskVolumeCount")
        } else {
            if vol.AWSElasticBlockStore != nil {
                invalidPredicates.Insert("MaxEBSVolumeCount")
            }
            if vol.GCEPersistentDisk != nil {
                invalidPredicates.Insert("MaxGCEPDVolumeCount")
            }
            if vol.AzureDisk != nil {
                invalidPredicates.Insert("MaxAzureDiskVolumeCount")
            }
        }
    }
    ec.InvalidateCachedPredicateItem(nodeName, invalidPredicates)
}

As InvalidateCachedPredicateItemForPodAdd shows, Assume Pod deletes the PredicateCaches of the following predicateKeys on that node:

GeneralPredicates

If the pod references PVCs, the "MaxEBSVolumeCount", "MaxGCEPDVolumeCount", and "MaxAzureDiskVolumeCount" PredicateCaches are deleted.

If AWSElasticBlockStore is used in pod volume, MaxEBSVolumeCount PredicateCache is deleted

If GCEPersistentDisk is used in pod volume, MaxGCEPDVolumeCount PredicateCache is deleted

If AzureDisk is used in pod volume, MaxAzureDiskVolumeCount PredicateCache is deleted

Update Pod in Scheduled Pod Cache

In NewConfigFactory, the scheduler registers updatePodInCache as the Update event handler for assignedNonTerminatedPod.

func (c *configFactory) updatePodInCache(oldObj, newObj interface{}) {
    ...
    c.invalidateCachedPredicatesOnUpdatePod(newPod, oldPod)
    c.podQueue.AssignedPodUpdated(newPod)
}

func (c *configFactory) invalidateCachedPredicatesOnUpdatePod(newPod *v1.Pod, oldPod *v1.Pod) {
    if c.enableEquivalenceClassCache {
        // if the pod does not have bound node, updating equivalence cache is meaningless;
        // if pod's bound node has been changed, that case should be handled by pod add & delete.
        if len(newPod.Spec.NodeName) != 0 && newPod.Spec.NodeName == oldPod.Spec.NodeName {
            if !reflect.DeepEqual(oldPod.GetLabels(), newPod.GetLabels()) {
                // MatchInterPodAffinity need to be reconsidered for this node,
                // as well as all nodes in its same failure domain.
                c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(matchInterPodAffinitySet)
            }
            // if requested container resource changed, invalidate GeneralPredicates of this node
            if !reflect.DeepEqual(predicates.GetResourceRequest(newPod),
                predicates.GetResourceRequest(oldPod)) {
                c.equivalencePodCache.InvalidateCachedPredicateItem(newPod.Spec.NodeName, generalPredicatesSets)
            }
        }
    }
}

updatePodInCache calls invalidateCachedPredicatesOnUpdatePod, which affects the Equivalence Cache as follows:

If the pod's Labels are updated, the MatchInterPodAffinity PredicateCache is deleted from the Equivalence Cache on all nodes.

If the pod's resource request is updated, the GeneralPredicates PredicateCache is deleted from the Equivalence Cache on that node.

Delete Pod in Scheduled Pod Cache

Similarly, when an assignedNonTerminatedPod is deleted, invalidateCachedPredicatesOnDeletePod is called to update the Equivalence Cache.

func (c *configFactory) invalidateCachedPredicatesOnDeletePod(pod *v1.Pod) {
    if c.enableEquivalenceClassCache {
        // part of this case is the same as pod add.
        c.equivalencePodCache.InvalidateCachedPredicateItemForPodAdd(pod, pod.Spec.NodeName)
        // MatchInterPodAffinity need to be reconsidered for this node,
        // as well as all nodes in its same failure domain.
        // TODO(resouer) can we just do this for nodes in the same failure domain
        c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(matchInterPodAffinitySet)
        // if this pod have these PV, cached result of disk conflict will become invalid.
        for _, volume := range pod.Spec.Volumes {
            if volume.GCEPersistentDisk != nil || volume.AWSElasticBlockStore != nil ||
                volume.RBD != nil || volume.ISCSI != nil {
                c.equivalencePodCache.InvalidateCachedPredicateItem(pod.Spec.NodeName, noDiskConflictSet)
            }
        }
    }
}

invalidateCachedPredicatesOnDeletePod's updates to the Equivalence Cache are summarized as follows:

Delete the GeneralPredicates PredicateCache in the Equivalence Cache on the node

If the pod references PVCs, the "MaxEBSVolumeCount", "MaxGCEPDVolumeCount", and "MaxAzureDiskVolumeCount" PredicateCaches in the Equivalence Cache on the node are deleted.

If AWSElasticBlockStore is used in pod volume, the MaxEBSVolumeCount PredicateCache in Equivalence Cache on that node is deleted

If GCEPersistentDisk is used in pod volume, the MaxGCEPDVolumeCount PredicateCache in Equivalence Cache on that node is deleted

If AzureDisk is used in pod volume, the MaxAzureDiskVolumeCount PredicateCache in Equivalence Cache on that node is deleted

Delete the MatchInterPodAffinity PredicateCache from the Equivalence Cache on all nodes

If the pod volume references one of GCEPersistentDisk, AWSElasticBlockStore, RBD, or ISCSI, delete the NoDiskConflict PredicateCache from the Equivalence Cache on that node.

Update Node

When a node update event occurs, the corresponding handler calls invalidateCachedPredicatesOnNodeUpdate to update the Equivalence Cache.

func (c *configFactory) invalidateCachedPredicatesOnNodeUpdate(newNode *v1.Node, oldNode *v1.Node) {
    if c.enableEquivalenceClassCache {
        // Begin to update equivalence cache based on node update
        // TODO(resouer): think about lazily initialize this set
        invalidPredicates := sets.NewString()
        if !reflect.DeepEqual(oldNode.Status.Allocatable, newNode.Status.Allocatable) {
            invalidPredicates.Insert(predicates.GeneralPred) // "PodFitsResources"
        }
        if !reflect.DeepEqual(oldNode.GetLabels(), newNode.GetLabels()) {
            invalidPredicates.Insert(predicates.GeneralPred, predicates.CheckServiceAffinityPred) // "PodSelectorMatches"
            for k, v := range oldNode.GetLabels() {
                // any label can be topology key of pod, we have to invalidate in all cases
                if v != newNode.GetLabels()[k] {
                    invalidPredicates.Insert(predicates.MatchInterPodAffinityPred)
                }
                // NoVolumeZoneConflict will only be affected by zone related label change
                if isZoneRegionLabel(k) {
                    if v != newNode.GetLabels()[k] {
                        invalidPredicates.Insert(predicates.NoVolumeZoneConflictPred)
                    }
                }
            }
        }
        oldTaints, oldErr := helper.GetTaintsFromNodeAnnotations(oldNode.GetAnnotations())
        if oldErr != nil {
            glog.Errorf("Failed to get taints from old node annotation for equivalence cache")
        }
        newTaints, newErr := helper.GetTaintsFromNodeAnnotations(newNode.GetAnnotations())
        if newErr != nil {
            glog.Errorf("Failed to get taints from new node annotation for equivalence cache")
        }
        if !reflect.DeepEqual(oldTaints, newTaints) ||
            !reflect.DeepEqual(oldNode.Spec.Taints, newNode.Spec.Taints) {
            invalidPredicates.Insert(predicates.PodToleratesNodeTaintsPred)
        }
        if !reflect.DeepEqual(oldNode.Status.Conditions, newNode.Status.Conditions) {
            oldConditions := make(map[v1.NodeConditionType]v1.ConditionStatus)
            newConditions := make(map[v1.NodeConditionType]v1.ConditionStatus)
            for _, cond := range oldNode.Status.Conditions {
                oldConditions[cond.Type] = cond.Status
            }
            for _, cond := range newNode.Status.Conditions {
                newConditions[cond.Type] = cond.Status
            }
            if oldConditions[v1.NodeMemoryPressure] != newConditions[v1.NodeMemoryPressure] {
                invalidPredicates.Insert(predicates.CheckNodeMemoryPressurePred)
            }
            if oldConditions[v1.NodeDiskPressure] != newConditions[v1.NodeDiskPressure] {
                invalidPredicates.Insert(predicates.CheckNodeDiskPressurePred)
            }
            if oldConditions[v1.NodeReady] != newConditions[v1.NodeReady] ||
                oldConditions[v1.NodeOutOfDisk] != newConditions[v1.NodeOutOfDisk] ||
                oldConditions[v1.NodeNetworkUnavailable] != newConditions[v1.NodeNetworkUnavailable] {
                invalidPredicates.Insert(predicates.CheckNodeConditionPred)
            }
        }
        if newNode.Spec.Unschedulable != oldNode.Spec.Unschedulable {
            invalidPredicates.Insert(predicates.CheckNodeConditionPred)
        }
        c.equivalencePodCache.InvalidateCachedPredicateItem(newNode.GetName(), invalidPredicates)
    }
}

Therefore, on a node update, the PredicateCaches of the following predicateKeys are deleted from that node's Equivalence Cache:

GeneralPredicates, premise: node.Status.Allocatable or node labels changes.

ServiceAffinity, premise: node labels has changed.

MatchInterPodAffinity, premise: node labels has changed.

NoVolumeZoneConflict, premise: the failure-domain.beta.kubernetes.io/zone or failure-domain.beta.kubernetes.io/region label changed.

PodToleratesNodeTaints, premise: the Node's Taints changed (whether in node.Spec.Taints or in the scheduler.alpha.kubernetes.io/taints Annotation).

CheckNodeMemoryPressure, CheckNodeDiskPressure, CheckNodeCondition, premise: the corresponding NodeCondition (or node.Spec.Unschedulable) changed.

Delete Node

When a node delete event occurs, the corresponding handler calls InvalidateAllCachedPredicateItemOfNode to update the Equivalence Cache.

// InvalidateAllCachedPredicateItemOfNode marks all cached items on given node as invalid
func (ec *EquivalenceCache) InvalidateAllCachedPredicateItemOfNode(nodeName string) {
    ec.Lock()
    defer ec.Unlock()
    delete(ec.algorithmCache, nodeName)
    glog.V(5).Infof("Done invalidating all cached predicates on node: %s", nodeName)
}

Therefore, on a node delete, the node's entire algorithmCache entry is removed from the Equivalence Cache.

Add or Delete PV

When a PV add or delete event occurs, the corresponding handler calls invalidatePredicatesForPv to update the Equivalence Cache.

func (c *configFactory) invalidatePredicatesForPv(pv *v1.PersistentVolume) {
    // You could have a PVC that points to a PV, but the PV object doesn't exist.
    // So when the PV object gets added, we can recount.
    invalidPredicates := sets.NewString()
    // PV types which impact MaxPDVolumeCountPredicate
    if pv.Spec.AWSElasticBlockStore != nil {
        invalidPredicates.Insert(predicates.MaxEBSVolumeCountPred)
    }
    if pv.Spec.GCEPersistentDisk != nil {
        invalidPredicates.Insert(predicates.MaxGCEPDVolumeCountPred)
    }
    if pv.Spec.AzureDisk != nil {
        invalidPredicates.Insert(predicates.MaxAzureDiskVolumeCountPred)
    }
    // If PV contains zone related label, it may impact cached NoVolumeZoneConflict
    for k := range pv.Labels {
        if isZoneRegionLabel(k) {
            invalidPredicates.Insert(predicates.NoVolumeZoneConflictPred)
            break
        }
    }
    if utilfeature.DefaultFeatureGate.Enabled(features.VolumeScheduling) {
        // Add/delete impacts the available PVs to choose from
        invalidPredicates.Insert(predicates.CheckVolumeBindingPred)
    }
    c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(invalidPredicates)
}

Therefore, on a PV add or delete, the PredicateCaches of the following predicateKeys are deleted from the Equivalence Cache on all nodes:

MaxEBSVolumeCount, MaxGCEPDVolumeCount, MaxAzureDiskVolumeCount, premise: the PV's type is one of these three

NoVolumeZoneConflict, premise: the PV carries a zone/region label

CheckVolumeBinding, premise: the VolumeScheduling Feature Gate is enabled

Update PV

When a PV update event occurs, the corresponding handler calls invalidatePredicatesForPvUpdate to update the Equivalence Cache.

func (c *configFactory) invalidatePredicatesForPvUpdate(oldPV, newPV *v1.PersistentVolume) {
    invalidPredicates := sets.NewString()
    for k, v := range newPV.Labels {
        // If PV update modifies the zone/region labels.
        if isZoneRegionLabel(k) && !reflect.DeepEqual(v, oldPV.Labels[k]) {
            invalidPredicates.Insert(predicates.NoVolumeZoneConflictPred)
            break
        }
    }
    c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(invalidPredicates)
}

Therefore, on a PV update, the PredicateCache of the following predicateKey is deleted from the Equivalence Cache on all nodes:

NoVolumeZoneConflict, premise: the PV's failure-domain.beta.kubernetes.io/zone or failure-domain.beta.kubernetes.io/region label changed

Add or Delete PVC

func (c *configFactory) invalidatePredicatesForPvc(pvc *v1.PersistentVolumeClaim) {
    // We need to do this here because the ecache uses PVC uid as part of equivalence hash of pod
    // The bound volume type may change
    invalidPredicates := sets.NewString(maxPDVolumeCountPredicateKeys...)
    // The bound volume's label may change
    invalidPredicates.Insert(predicates.NoVolumeZoneConflictPred)
    if utilfeature.DefaultFeatureGate.Enabled(features.VolumeScheduling) {
        // Add/delete impacts the available PVs to choose from
        invalidPredicates.Insert(predicates.CheckVolumeBindingPred)
    }
    c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(invalidPredicates)
}

When a PVC add or delete event occurs, the PredicateCaches of the following predicateKeys are deleted from the Equivalence Cache on all nodes:

"MaxEBSVolumeCount", "MaxGCEPDVolumeCount", "MaxAzureDiskVolumeCount" PredicateCaches

NoVolumeZoneConflict PredicateCaches

CheckVolumeBinding, on the premise that the VolumeScheduling Feature Gate is enabled

Update PVC

func (c *configFactory) invalidatePredicatesForPvcUpdate(old, new *v1.PersistentVolumeClaim) {
    invalidPredicates := sets.NewString()
    if old.Spec.VolumeName != new.Spec.VolumeName {
        if utilfeature.DefaultFeatureGate.Enabled(features.VolumeScheduling) {
            // PVC volume binding has changed
            invalidPredicates.Insert(predicates.CheckVolumeBindingPred)
        }
        // The bound volume type may change
        invalidPredicates.Insert(maxPDVolumeCountPredicateKeys...)
    }
    c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(invalidPredicates)
}

When a PVC update event occurs, the PredicateCaches of the following predicateKeys are deleted from the Equivalence Cache on all nodes:

CheckVolumeBinding, premise: the VolumeScheduling Feature Gate is enabled and the PV bound to the PVC changed.

"MaxEBSVolumeCount", "MaxGCEPDVolumeCount", "MaxAzureDiskVolumeCount" PredicateCaches. Premise: the PV corresponding to PVC has changed.

Add or Delete Service

func (c *configFactory) onServiceAdd(obj interface{}) {
    if c.enableEquivalenceClassCache {
        c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(serviceAffinitySet)
    }
    c.podQueue.MoveAllToActiveQueue()
}

func (c *configFactory) onServiceDelete(obj interface{}) {
    if c.enableEquivalenceClassCache {
        c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(serviceAffinitySet)
    }
    c.podQueue.MoveAllToActiveQueue()
}

When a Service Add or Delete event occurs, the PredicateCache corresponding to the following predicateKey for all nodes is deleted from the Equivalence Cache:

CheckServiceAffinity

Update Service

func (c *configFactory) onServiceUpdate(oldObj interface{}, newObj interface{}) {
    if c.enableEquivalenceClassCache {
        // TODO(resouer) We may need to invalidate this for specified group of pods only
        oldService := oldObj.(*v1.Service)
        newService := newObj.(*v1.Service)
        if !reflect.DeepEqual(oldService.Spec.Selector, newService.Spec.Selector) {
            c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(serviceAffinitySet)
        }
    }
    c.podQueue.MoveAllToActiveQueue()
}

When Service Update event occurs, the PredicateCache corresponding to the following predicateKey for all nodes is deleted from the Equivalence Cache:

CheckServiceAffinity, premise: the Selector of Service has changed.

Deficiencies of Equivalence Class

The hardest part of the Equivalence Class feature is maintaining and updating the Equivalence Cache optimally, so that every invalidation is as fine-grained and precise as possible. This is where optimization is still needed.

The Equivalence Cache only caches Predicate Results; it does not yet support caching and maintaining Priority Result data (the community is working on a Map-Reduce-based optimization there). Priority Funcs are generally more complex than Predicate Funcs, so supporting them would be all the more worthwhile.

Currently, the Equivalence Hash can only be computed from the Pod's OwnerReference and PVC information. If the OwnerReference restriction were dropped and the core fields of the Pod spec were fully considered instead, such as resource requests, Labels, and Affinity, the cache hit rate would likely be much higher and Predicate performance could improve significantly.

