What is Kubernetes Scheduler's NominatedPods?


This article explains what NominatedPods are in the Kubernetes Scheduler: how a Pod's .Status.NominatedNodeName is set during preemption, how NominatedPods influence the predicate phase, and how the scheduler's PriorityQueue maintains its NominatedPods cache.

What are NominatedPods?

With the PodPriority feature gate enabled, when cluster resources are insufficient the scheduler preempts lower-priority Pods (called victims) to free resources for a higher-priority Pod (the preemptor). The preemptor then rejoins the scheduling queue, waiting for the victims to terminate gracefully and for the next scheduling attempt.
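To make the idea concrete, here is a self-contained toy sketch of victim selection under that rule. All types and names here are hypothetical, and real victim selection (pickOneNodeForPreemption, seen later) also weighs PodDisruptionBudgets, victim priorities, and more; this only shows the core "evict lower-priority pods until the preemptor fits" idea.

    package main

    import "fmt"

    // pod is a simplified stand-in for v1.Pod: a name, a priority,
    // and a CPU request in milli-cores.
    type pod struct {
        name     string
        priority int32
        cpuMilli int64
    }

    // pickVictims walks the running pods in order and greedily picks
    // those with priority lower than the preemptor's until enough CPU
    // would be freed. It returns nil if preemption would not help.
    func pickVictims(running []pod, preemptor pod, freeMilli int64) []pod {
        var victims []pod
        for _, p := range running {
            if freeMilli >= preemptor.cpuMilli {
                break
            }
            if p.priority < preemptor.priority {
                victims = append(victims, p)
                freeMilli += p.cpuMilli
            }
        }
        if freeMilli < preemptor.cpuMilli {
            return nil
        }
        return victims
    }

    func main() {
        running := []pod{{"low-a", 0, 500}, {"low-b", 0, 500}, {"high", 100, 500}}
        preemptor := pod{"urgent", 1000, 800}
        fmt.Println(pickVictims(running, preemptor, 200)) // [{low-a 0 500} {low-b 0 500}]
    }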

Between the moment preemption happens and the moment the preemptor is actually scheduled again, the scheduler must remember that those resources have already been claimed, so that it does not hand them to other, lower-priority Pods in the meantime. Therefore, during the preemption phase, the scheduler sets pod.Status.NominatedNodeName on the preemptor, indicating that preemption has occurred on NominatedNodeName and that the preemptor expects to be scheduled on that node.
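Since .Status.NominatedNodeName is an ordinary field of the Pod's status, it can be inspected with client-go. A minimal sketch, assuming a reachable cluster via the default kubeconfig; the pod and namespace names are hypothetical, and the Get signature matches client-go of this vintage (before context arguments were added):

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Load the default kubeconfig (~/.kube/config).
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset := kubernetes.NewForConfigOrDie(config)

        // "preemptor" / "default" are hypothetical pod and namespace names.
        p, err := clientset.CoreV1().Pods("default").Get("preemptor", metav1.GetOptions{})
        if err != nil {
            panic(err)
        }
        // Non-empty while the pod has preempted victims on that node and
        // is still waiting to be scheduled there.
        fmt.Println("NominatedNodeName:", p.Status.NominatedNodeName)
    }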

The NominatedPods for each node are cached in the PriorityQueue. They represent the Pods that have been nominated to run on that node: they are expected to be scheduled there, but have not actually been scheduled yet.
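Conceptually, this cache is just a map from node name to the list of pods nominated onto that node. A minimal toy sketch of that shape (the type and method names are illustrative, not the scheduler's actual ones; values are simplified to pod names):

    package main

    import "fmt"

    // nominatedCache mirrors the idea of PriorityQueue.nominatedPods:
    // node name -> pods nominated to run on that node but not yet bound.
    type nominatedCache map[string][]string

    func (c nominatedCache) add(nodeName, podName string) {
        c[nodeName] = append(c[nodeName], podName)
    }

    func (c nominatedCache) waitingPodsForNode(nodeName string) []string {
        return c[nodeName]
    }

    func main() {
        c := nominatedCache{}
        c.add("node-1", "preemptor-a")
        c.add("node-1", "preemptor-b")
        fmt.Println(c.waitingPodsForNode("node-1")) // [preemptor-a preemptor-b]
        fmt.Println(c.waitingPodsForNode("node-2")) // []
    }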

What happens during preemptive scheduling?

Let's focus on the flow around the scheduler's preempt method.

    func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {
        ...
        node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr)
        ...
        var nodeName = ""
        if node != nil {
            nodeName = node.Name
            err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, nodeName)
            if err != nil {
                glog.Errorf("Error in preemption process. Cannot update pod %v annotations: %v", preemptor.Name, err)
                return "", err
            }
            ...
        }
        // Clearing nominated pods should happen outside of "if node != nil". Node could
        // be nil when a pod with nominated node name is eligible to preempt again,
        // but preemption logic does not find any node for it. In that case Preempt()
        // function of generic_scheduler.go returns the pod itself for removal of the annotation.
        for _, p := range nominatedPodsToClear {
            rErr := sched.config.PodPreemptor.RemoveNominatedNodeName(p)
            if rErr != nil {
                glog.Errorf("Cannot remove nominated node annotation of pod: %v", rErr)
                // We do not return as this error is not critical.
            }
        }
        return nodeName, err
    }

It invokes ScheduleAlgorithm.Preempt to perform the actual resource preemption, which returns the node where preemption occurs, the victims, and nominatedPodsToClear.

    func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
        ...
        candidateNode := pickOneNodeForPreemption(nodeToVictims)
        if candidateNode == nil {
            return nil, nil, nil, err
        }
        nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
        if nodeInfo, ok := g.cachedNodeInfoMap[candidateNode.Name]; ok {
            return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, err
        }
        return nil, nil, nil, fmt.Errorf("preemption failed: the target node %s has been deleted from scheduler cache", candidateNode.Name)
    }

    func (g *genericScheduler) getLowerPriorityNominatedPods(pod *v1.Pod, nodeName string) []*v1.Pod {
        pods := g.schedulingQueue.WaitingPodsForNode(nodeName)
        if len(pods) == 0 {
            return nil
        }
        var lowerPriorityPods []*v1.Pod
        podPriority := util.GetPodPriority(pod)
        for _, p := range pods {
            if util.GetPodPriority(p) < podPriority {
                lowerPriorityPods = append(lowerPriorityPods, p)
            }
        }
        return lowerPriorityPods
    }

node: the best node on which the preemption occurs.

victims: the Pods to be deleted in order to release their resources to the preemptor.

nominatedPodsToClear: the list of Pods whose .Status.NominatedNodeName will be cleared. These Pods come from the nominatedPods cache in the PriorityQueue, and their Pod Priority is lower than the preemptor's, meaning they are no longer suitable to be scheduled onto the node that was picked for them during their own earlier preemption.

If the preemption succeeds (node is non-nil), podPreemptor.SetNominatedNodeName is called to set the preemptor's .Status.NominatedNodeName to that node's name, indicating that the preemptor expects to be scheduled on that node.

    func (p *podPreemptor) SetNominatedNodeName(pod *v1.Pod, nominatedNodeName string) error {
        podCopy := pod.DeepCopy()
        podCopy.Status.NominatedNodeName = nominatedNodeName
        _, err := p.Client.CoreV1().Pods(pod.Namespace).UpdateStatus(podCopy)
        return err
    }

Whether or not the preemption succeeds (i.e. whether node is nil), nominatedPodsToClear may be non-empty. All Pods in nominatedPodsToClear are traversed, and podPreemptor.RemoveNominatedNodeName is called on each to reset its .Status.NominatedNodeName to the empty string.

    func (p *podPreemptor) RemoveNominatedNodeName(pod *v1.Pod) error {
        if len(pod.Status.NominatedNodeName) == 0 {
            return nil
        }
        return p.SetNominatedNodeName(pod, "")
    }

What happens after the preemptor preempts successfully?

After a successful preemption, the preemptor Pod is put back into the unschedulable sub-queue of the PriorityQueue, where it waits for a condition to trigger scheduling again. (For a deeper discussion of this part, see my post "In-depth analysis of the priority queue of Kubernetes Scheduler".) When the preemptor is scheduled again, podFitsOnNode runs the predicate logic against the node.

    func podFitsOnNode(
        pod *v1.Pod,
        meta algorithm.PredicateMetadata,
        info *schedulercache.NodeInfo,
        predicateFuncs map[string]algorithm.FitPredicate,
        ecache *EquivalenceCache,
        queue SchedulingQueue,
        alwaysCheckAllPredicates bool,
        equivCacheInfo *equivalenceClassInfo,
    ) (bool, []algorithm.PredicateFailureReason, error) {
        var (
            eCacheAvailable  bool
            failedPredicates []algorithm.PredicateFailureReason
        )
        predicateResults := make(map[string]HostPredicate)
        podsAdded := false
        for i := 0; i < 2; i++ {
            metaToUse := meta
            nodeInfoToUse := info
            if i == 0 {
                podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(util.GetPodPriority(pod), meta, info, queue)
            } else if !podsAdded || len(failedPredicates) != 0 {
                // Questionable? Arguably this should be podsAdded, not !podsAdded.
                break
            }
            // Bypass eCache if node has any nominated pods.
            // TODO(bsalamat): consider using eCache and adding proper eCache invalidations
            // when pods are nominated or their nominations change.
            eCacheAvailable = equivCacheInfo != nil && !podsAdded
            for _, predicateKey := range predicates.Ordering() {
                var (
                    fit     bool
                    reasons []algorithm.PredicateFailureReason
                    err     error
                )
                func() {
                    var invalid bool
                    if eCacheAvailable {
                        ...
                    }
                    if !eCacheAvailable || invalid {
                        // we need to execute predicate functions since equivalence cache does not work
                        fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
                        if err != nil {
                            return
                        }
                        ...
                    }
                }()
                ...
            }
        }
        return len(failedPredicates) == 0, failedPredicates, nil
    }

Up to two predicate passes are attempted:

On the first pass, addNominatedPods is called: it traverses all Pods in the PriorityQueue's nominatedPods cache and adds every nominatedPod whose PodPriority is greater than or equal to that of the Pod being scheduled into the SchedulerCache's NodeInfo. Those higher-priority nominatedPods must therefore be taken into account during predicates, for example their resourceRequests are subtracted from the node's capacity, and the PredicateMetadata is updated accordingly; then the normal predicate logic runs.

On the second pass, if any first-pass predicate failed, or if podsAdded was false (addNominatedPods returns podsAdded == false when the node's nominatedPods cache is empty), the pass ends immediately without running the real predicate logic.

On the second pass, if all first-pass predicates succeeded and podsAdded is true, the real predicate logic does run a second time, because the nominatedPods that were added can change the result, for example through inter-pod affinity.

Below is the code for addNominatedPods. It generates temporary schedulercache.NodeInfo and algorithm.PredicateMetadata objects, which are then passed to the concrete predicate functions.

    // addNominatedPods adds pods with equal or greater priority which are nominated
    // to run on the node given in nodeInfo to meta and nodeInfo. It returns 1) whether
    // any pod was found, 2) augmented meta data, 3) augmented nodeInfo.
    func addNominatedPods(podPriority int32, meta algorithm.PredicateMetadata,
        nodeInfo *schedulercache.NodeInfo, queue SchedulingQueue) (bool, algorithm.PredicateMetadata, *schedulercache.NodeInfo) {
        if queue == nil || nodeInfo == nil || nodeInfo.Node() == nil {
            // This may happen only in tests.
            return false, meta, nodeInfo
        }
        nominatedPods := queue.WaitingPodsForNode(nodeInfo.Node().Name)
        if nominatedPods == nil || len(nominatedPods) == 0 {
            return false, meta, nodeInfo
        }
        var metaOut algorithm.PredicateMetadata
        if meta != nil {
            metaOut = meta.ShallowCopy()
        }
        nodeInfoOut := nodeInfo.Clone()
        for _, p := range nominatedPods {
            if util.GetPodPriority(p) >= podPriority {
                nodeInfoOut.AddPod(p)
                if metaOut != nil {
                    metaOut.AddPod(p, nodeInfoOut)
                }
            }
        }
        return true, metaOut, nodeInfoOut
    }

    // WaitingPodsForNode returns pods that are nominated to run on the given node,
    // but they are waiting for other pods to be removed from the node before they
    // can be actually scheduled.
    func (p *PriorityQueue) WaitingPodsForNode(nodeName string) []*v1.Pod {
        p.lock.RLock()
        defer p.lock.RUnlock()
        if list, ok := p.nominatedPods[nodeName]; ok {
            return list
        }
        return nil
    }

The logic of addNominatedPods is as follows:

Call WaitingPodsForNode to fetch the node's nominatedPods from the PriorityQueue cache. If nominatedPods is empty, podsAdded is returned as false and addNominatedPods ends.

Clone the PredicateMetadata and NodeInfo objects, then traverse the nominatedPods and add each nominated pod whose priority is not lower than that of the pod being scheduled to the cloned NodeInfo, updating the cloned PredicateMetadata accordingly. These cloned NodeInfo and PredicateMetadata objects are what is eventually passed into the predicate functions. When the traversal completes, podsAdded (true) and the augmented NodeInfo and PredicateMetadata objects are returned (see the sketch after this list).
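The following toy sketch shows why cloning and augmenting matter: running a resource-fit check against a clone that carries the higher-priority nominated pods' requests can fail where the bare NodeInfo would pass. All types and names here are illustrative, not the scheduler's actual ones.

    package main

    import "fmt"

    // nodeInfo is a toy stand-in for schedulercache.NodeInfo, tracking
    // only CPU in milli-cores.
    type nodeInfo struct {
        allocatableMilli int64
        requestedMilli   int64
    }

    func (n *nodeInfo) clone() *nodeInfo { c := *n; return &c }

    func (n *nodeInfo) addPod(cpuMilli int64) { n.requestedMilli += cpuMilli }

    // fits is a toy resource-fit predicate.
    func fits(podMilli int64, n *nodeInfo) bool {
        return n.allocatableMilli-n.requestedMilli >= podMilli
    }

    func main() {
        info := &nodeInfo{allocatableMilli: 2000, requestedMilli: 1000}

        // First pass: predicates run against a clone that also carries a
        // higher-priority nominated pod's request (here, 800m).
        withNominated := info.clone()
        withNominated.addPod(800)
        fmt.Println(fits(500, withNominated)) // false: 2000-1800 < 500

        // Without accounting for the nominated pod, the pod would wrongly fit.
        fmt.Println(fits(500, info)) // true: 2000-1000 >= 500
    }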

How the PriorityQueue NominatedPods cache is maintained

My earlier post, "In-depth analysis of the priority queue of Kubernetes Scheduler", analyzes how the EventHandlers that the scheduler registers on podInformer, nodeInformer, serviceInformer, pvcInformer, and so on operate on the PriorityQueue. The EventHandler logic relevant to NominatedPods is as follows.
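For orientation, here is a condensed, hypothetical sketch of how such handlers are wired up with client-go. The wireHandlers function and the queue interface are illustrative stand-ins, not the scheduler factory's actual code; the real handlers also filter pods (e.g. assigned vs. unassigned) and handle deletion tombstones.

    package example

    import (
        "time"

        "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
    )

    // wireHandlers registers pod event handlers that keep a scheduling
    // queue (and hence its nominatedPods cache) in sync with the API server.
    func wireHandlers(clientset kubernetes.Interface, queue interface {
        Add(pod *v1.Pod) error
        Update(oldPod, newPod *v1.Pod) error
        Delete(pod *v1.Pod) error
    }) {
        factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
        factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                queue.Add(obj.(*v1.Pod))
            },
            UpdateFunc: func(oldObj, newObj interface{}) {
                queue.Update(oldObj.(*v1.Pod), newObj.(*v1.Pod))
            },
            DeleteFunc: func(obj interface{}) {
                // Real handlers must also unwrap cache.DeletedFinalStateUnknown.
                if pod, ok := obj.(*v1.Pod); ok {
                    queue.Delete(pod)
                }
            },
        })
    }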

Add Pod to PriorityQueue

When a Pod is added to the active queue of the PriorityQueue, addNominatedPodIfNeeded is called: any stale copy of the pod is first deleted from the PriorityQueue's nominatedPods cache, and the pod is then re-added to the cache.

    // Add adds a pod to the active queue. It should be called only when a new pod
    // is added so there is no chance the pod is already in either queue.
    func (p *PriorityQueue) Add(pod *v1.Pod) error {
        p.lock.Lock()
        defer p.lock.Unlock()
        err := p.activeQ.Add(pod)
        if err != nil {
            glog.Errorf("Error adding pod %v to the scheduling queue: %v", pod.Name, err)
        } else {
            if p.unschedulableQ.get(pod) != nil {
                glog.Errorf("Error: pod %v is already in the unschedulable queue.", pod.Name)
                p.deleteNominatedPodIfExists(pod)
                p.unschedulableQ.delete(pod)
            }
            p.addNominatedPodIfNeeded(pod)
            p.cond.Broadcast()
        }
        return err
    }

    func (p *PriorityQueue) addNominatedPodIfNeeded(pod *v1.Pod) {
        nnn := NominatedNodeName(pod)
        if len(nnn) > 0 {
            for _, np := range p.nominatedPods[nnn] {
                if np.UID == pod.UID {
                    glog.Errorf("Error: pod %v/%v already exists in the nominated map!", pod.Namespace, pod.Name)
                    return
                }
            }
            p.nominatedPods[nnn] = append(p.nominatedPods[nnn], pod)
        }
    }

When a Pod is added to the unschedulableQ of the PriorityQueue, addNominatedPodIfNeeded is likewise called to add or update the pod in the PriorityQueue's nominatedPods cache.

    func (p *PriorityQueue) AddUnschedulableIfNotPresent(pod *v1.Pod) error {
        p.lock.Lock()
        defer p.lock.Unlock()
        if p.unschedulableQ.get(pod) != nil {
            return fmt.Errorf("pod is already present in unschedulableQ")
        }
        if _, exists, _ := p.activeQ.Get(pod); exists {
            return fmt.Errorf("pod is already present in the activeQ")
        }
        if !p.receivedMoveRequest && isPodUnschedulable(pod) {
            p.unschedulableQ.addOrUpdate(pod)
            p.addNominatedPodIfNeeded(pod)
            return nil
        }
        err := p.activeQ.Add(pod)
        if err == nil {
            p.addNominatedPodIfNeeded(pod)
            p.cond.Broadcast()
        }
        return err
    }

Note that a pod is added to the nominatedPods cache only if its .Status.NominatedNodeName is not empty.
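For completeness, the NominatedNodeName helper used throughout the queue code simply reads that status field; a minimal definition consistent with its usage above is:

    // NominatedNodeName returns the nominated node name of a pod,
    // i.e. pod.Status.NominatedNodeName.
    func NominatedNodeName(pod *v1.Pod) string {
        return pod.Status.NominatedNodeName
    }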

Update Pod in PriorityQueue

When a Pod in the PriorityQueue is updated, updateNominatedPod is called to update the nominatedPods cache in the PriorityQueue.

    // Update updates a pod in the active queue if present. Otherwise, it removes
    // the item from the unschedulable queue and adds the updated one to the active
    // queue.
    func (p *PriorityQueue) Update(oldPod, newPod *v1.Pod) error {
        p.lock.Lock()
        defer p.lock.Unlock()
        // If the pod is already in the active queue, just update it there.
        if _, exists, _ := p.activeQ.Get(newPod); exists {
            p.updateNominatedPod(oldPod, newPod)
            err := p.activeQ.Update(newPod)
            return err
        }
        // If the pod is in the unschedulable queue, updating it may make it schedulable.
        if usPod := p.unschedulableQ.get(newPod); usPod != nil {
            p.updateNominatedPod(oldPod, newPod)
            if isPodUpdated(oldPod, newPod) {
                p.unschedulableQ.delete(usPod)
                err := p.activeQ.Add(newPod)
                if err == nil {
                    p.cond.Broadcast()
                }
                return err
            }
            p.unschedulableQ.addOrUpdate(newPod)
            return nil
        }
        // If pod is not in any of the two queues, we put it in the active queue.
        err := p.activeQ.Add(newPod)
        if err == nil {
            p.addNominatedPodIfNeeded(newPod)
            p.cond.Broadcast()
        }
        return err
    }

updateNominatedPod updates the PriorityQueue nominatedPods cache by deleting the oldPod first and then adding the newPod.

    // updateNominatedPod updates a pod in the nominatedPods.
    func (p *PriorityQueue) updateNominatedPod(oldPod, newPod *v1.Pod) {
        // Even if the nominated node name of the Pod is not changed, we must delete and add it again
        // to ensure that its pointer is updated.
        p.deleteNominatedPodIfExists(oldPod)
        p.addNominatedPodIfNeeded(newPod)
    }

Delete Pod from PriorityQueue

Before deleting a Pod from a PriorityQueue, deleteNominatedPodIfExists is called to delete the pod from the PriorityQueue nominatedPods cache.

    // Delete deletes the item from either of the two queues. It assumes the pod is
    // only in one queue.
    func (p *PriorityQueue) Delete(pod *v1.Pod) error {
        p.lock.Lock()
        defer p.lock.Unlock()
        p.deleteNominatedPodIfExists(pod)
        err := p.activeQ.Delete(pod)
        if err != nil {
            // The item was probably not found in the activeQ.
            p.unschedulableQ.delete(pod)
        }
        return nil
    }

deleteNominatedPodIfExists first checks whether the pod's .Status.NominatedNodeName is empty:

If it is empty, nothing is done and the function returns immediately.

If it is not empty, the nominatedPods cache for the pod's nominated node is traversed; once a pod with a matching UID is found, it is removed from the cache. If no nominatedPods remain for that node after the deletion, the node's entire entry is removed from the map.

    func (p *PriorityQueue) deleteNominatedPodIfExists(pod *v1.Pod) {
        nnn := NominatedNodeName(pod)
        if len(nnn) > 0 {
            for i, np := range p.nominatedPods[nnn] {
                if np.UID == pod.UID {
                    p.nominatedPods[nnn] = append(p.nominatedPods[nnn][:i], p.nominatedPods[nnn][i+1:]...)
                    if len(p.nominatedPods[nnn]) == 0 {
                        delete(p.nominatedPods, nnn)
                    }
                    break
                }
            }
        }
    }

This concludes the study of what Kubernetes Scheduler's NominatedPods are.
