What is a batch task?
In deep learning there are often multi-node, multi-GPU training jobs: several pods are started at the same time, and they all belong to a single task.
This leads to a problem.
Suppose a task needs 100 pods and each pod needs one GPU, so 100 GPU cards in total, while only 99 GPUs are available in the cluster. What does the default k8s scheduler do?
The default scheduler schedules one pod at a time and only checks whether a single pod's resources can be satisfied, so the first 99 pods are scheduled successfully and the last pod fails.
This easily leads to the following situation:
The first 99 pods hold their GPUs without releasing them, the task can never run, and new tasks cannot be scheduled either; the whole cluster ends up deadlocked, with resources occupied but doing no useful work.
Therefore, scheduling needs to check the resources required by the whole task: when the cluster cannot satisfy the entire task, none of the task's pods should be scheduled.
The community provides a scheduler that supports this feature.
However, it does not work well together with the native scheduler.
The biggest problem is that both schedulers keep their own cache, and the two caches conflict with each other and cause scheduling chaos, so this scheduler cannot run alongside the native one; once you use the batch scheduler, features such as affinity are no longer available.
So what we did is combine the two capabilities, and the approach we chose is to develop a customized kube-scheduler.
In fact the scheduler can be extended through an extender, but an extender is too weak: it can only add its own filtering logic to the predicate (pre-selection) and priority (selection) phases, which is not enough for batch tasks.
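For reference, here is a minimal sketch (my own illustration, not code from this article) of what an extender filter endpoint looks like; the request and response structs below are simplified stand-ins for the real ExtenderArgs/ExtenderFilterResult types. The extender only ever sees one pod and a candidate node list, so it can prune nodes but cannot reason about sibling pods or delay binding, which is why it is not enough here.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// extenderArgs and extenderFilterResult are trimmed stand-ins for the real
// scheduler extender API types, just to show the shape of the exchange.
type extenderArgs struct {
	Pod       map[string]interface{} `json:"pod"`
	NodeNames []string               `json:"nodenames"`
}

type extenderFilterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes"`
	Error       string            `json:"error"`
}

// filter receives one pod plus the candidate nodes and may only prune that list;
// it cannot see the other pods of a batch task or postpone binding.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := extenderFilterResult{
		NodeNames:   args.NodeNames, // pass every node through in this sketch
		FailedNodes: map[string]string{},
	}
	_ = json.NewEncoder(w).Encode(result)
}

func main() {
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```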
Difficulties in implementation
Add a batch-task check in the node selection (priority) phase:
take a pod -> if it is a batch pod -> check whether cluster resources can satisfy the whole batch task -> if not, fail the scheduling.
It must also be ensured that the other pods of the batch task can be scheduled.
Even if the cluster resources can satisfy the batch task and we go straight to bind, there is still a problem:
Suppose the scheduling queue looks like this, the cluster has three GPUs, and the batch task needs three GPUs:
Scheduling queue: A batch pod -> other pod -> other pod -> A batch pod -> A batch pod
What happens: cluster resources are enough, scheduled successfully -> scheduled, takes a GPU -> scheduled, takes a GPU -> GPUs already taken by the other pods, not enough GPU, fails -> not enough GPU, fails
So the end result is that batch task A occupies one GPU while the task as a whole fails to be scheduled, and that GPU is never released.
So should we change the order of pods in the scheduling queue and let the pods of batch task A be scheduled back to back? It is not that simple.
Pods are scheduled concurrently, so even if you adjust the order of pods in the queue, there is no guarantee that the other pods of the batch task get scheduled first:
```go
go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
```
And once a batch pod has entered the Bind logic, there is no turning back.
One option is to assume all the pods of the batch task first; if any of them fails, clean up the other pods that are already bound but not actually running, and either throw all the pods back into the queue, or directly return a failure, clean up the task's pods, and let the upper layer trigger the task again.
Scheduler flow (scheduler/scheduler.go, scheduleOne logic):
select a node -> assume the pod onto the node in the cache -> start a goroutine to bind
So checking at assume time and holding back already-scheduled pods is not feasible, because earlier pods of the batch task may already have been bound; only the last pod of the batch task can let the pods before it go to bind.
Pre-occupation policy
Pre-occupation policy: when the first pod of a batch task arrives, check whether cluster resources are sufficient for the whole task; if they are, mark (reserve) nodes for the task so that subsequent ordinary pods cannot occupy them, which guarantees that the remaining pods of the batch task actually have nodes to land on.
This still comes back to the problem of not being able to bind yet.
There are two issues here:
How do we know what kind of nodes the other pods of the batch task need? If all the pods are identical, the problem can be simplified.
If a later pod fails while the first pod is already bound, the same problem occurs again.
Fundamentally, no single pod may be bound before all the pods of the task have been assumed.
To sum up, several places need to be handled:
It is best to use a priority queue and raise the priority of the pods associated with a pod that is already being scheduled.
Check whether cluster resources are sufficient when selecting a node.
Check again when assuming the pod onto a node: if resources are insufficient, or the pod group is not yet complete, do not go to bind.
The problem is that earlier pods have already gone through the bind flow, so the crux is how to stop those earlier pods from binding, that is, how to delay the bind.
Final solution: delayed binding
Solution: handle batch tasks specially at bind time.
If a pod belongs to a batch task, throw it into the task cache instead of binding it.
When the last pod of the batch task arrives in the task cache, the task is ready and is pushed into the bind queue.
A worker takes ready tasks from the bind queue and binds their pods; task binding and ordinary pod binding are protected by a mutex so they are mutually exclusive.
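As a rough illustration of this design (an assumption of mine, not the article's actual code), the task cache could look something like the following: pods accumulate under their group name, and a task becomes ready once the count declared in the group-pod-num annotation is reached.

```go
package batch

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// task groups the pods of one batch job; it is complete once all expected pods
// have been handed to the cache.
type task struct {
	expected int
	pods     []*v1.Pod
}

// TaskCache is a simplified illustration of the cache described above.
type TaskCache struct {
	mu    sync.Mutex
	tasks map[string]*task
}

func NewTaskCache() *TaskCache {
	return &TaskCache{tasks: map[string]*task{}}
}

// AddPod stores an assumed batch pod and reports whether its task is now complete,
// i.e. ready to be moved into the bind queue.
func (c *TaskCache) AddPod(groupName string, expected int, pod *v1.Pod) (ready bool, pods []*v1.Pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.tasks[groupName]
	if !ok {
		t = &task{expected: expected}
		c.tasks[groupName] = t
	}
	t.pods = append(t.pods, pod)
	if len(t.pods) >= t.expected {
		delete(c.tasks, groupName)
		return true, t.pods
	}
	return false, nil
}
```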
Usage
To use batch scheduling, add two annotations to the pods:
```yaml
annotations:
  scheduling.k8s.io/group-name: qj-1
  scheduling.k8s.io/group-pod-num: 3
```
Pods carrying these two annotations belong to the same task, and group-pod-num tells the scheduler how many pods the task has.
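A possible helper for reading these two annotations might look like the sketch below; the constant names and the function itself (a stand-in for the article's GetPodBathMeta) are my own assumptions.

```go
package batch

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

const (
	groupNameAnnotation   = "scheduling.k8s.io/group-name"
	groupPodNumAnnotation = "scheduling.k8s.io/group-pod-num"
)

// getPodBatchMeta returns the task name and the expected pod count carried by the
// two annotations; ok is false when the pod is not part of a batch task.
func getPodBatchMeta(pod *v1.Pod) (name string, num int, ok bool) {
	name, ok = pod.Annotations[groupNameAnnotation]
	if !ok {
		return "", 0, false
	}
	num, err := strconv.Atoi(pod.Annotations[groupPodNumAnnotation])
	if err != nil {
		return "", 0, false
	}
	return name, num, true
}
```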
Originally the plan was to define a separate CRD to describe the task, which would reduce coupling, but it is more work to implement (one more CRD to watch), so out of laziness that was not done.
Implementation
Delayed binding process:
An ordinary pod is bound directly after a node has been found and the pod has been assumed. A batch pod is instead thrown into the batch cache and the function returns. A goroutine keeps checking the batch cache for tasks that are complete (all of their pods have arrived); complete tasks are pushed into the binding queue, and a worker takes tasks from that queue and binds all of their pods in one batch, mutually exclusive with ordinary pod binding.
Batch scheduler interface and members (an illustrative sketch follows this list):
Run starts a goroutine that checks for complete tasks and pushes them into the bind queue.
RunBind starts the task-binding goroutine.
PodQuePriority dynamically adjusts the priority of pods in the scheduling queue so that pods belonging to a task that is already being scheduled are scheduled first.
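Pieced together from the members listed above, the interface could look roughly like this; the exact signatures are assumptions, since the article does not show them in full.

```go
package batch

import v1 "k8s.io/api/core/v1"

// BatchScheduler is an illustrative interface assembled from the members described
// above; the real method signatures in the article's code may differ.
type BatchScheduler interface {
	// IsBatchPod reports whether the pod carries the batch annotations.
	IsBatchPod(pod *v1.Pod) bool
	// HandleBatchPod puts an assumed batch pod into the task cache instead of binding it.
	HandleBatchPod(pod *v1.Pod) error
	// HandleResourceNotEnough reacts to a failed whole-task resource check.
	HandleResourceNotEnough(pod *v1.Pod)
	// Run starts the goroutine that scans the task cache for complete tasks and
	// pushes them into the bind queue.
	Run(stopCh <-chan struct{})
	// RunBind starts the worker that takes ready tasks from the bind queue and
	// binds all of their pods.
	RunBind(stopCh <-chan struct{})
	// PodQuePriority raises the queue priority of pods whose task is already being
	// scheduled, so its sibling pods are picked first.
	PodQuePriority()
	// Lock and UnLock make task binding and ordinary pod binding mutually exclusive.
	Lock()
	UnLock()
	// GetTaskCache exposes the task cache to generic_scheduler for the resource check.
	GetTaskCache() *TaskCache
}
```

Lock and UnLock correspond to the bind mutex used in the scheduler.go excerpt further down.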
Execution flow:
Delayed binding
scheduler/scheduler.go:
```go
// fanux if it is a batch pod, return
if sched.Config.BatchScheduler.IsBatchPod(assumedPod) {
	err = sched.Config.BatchScheduler.HandleBatchPod(assumedPod)
	if err != nil {
		glog.Errorf("schedule batch pod failed: %v/%v", assumedPod.Namespace, assumedPod.Name)
	}
	return
}
```
A bind mutex is added so that batch tasks and ordinary pods do not bind at the same time:
```go
go func() {
	// fanux add bind mutex
	sched.Config.BatchScheduler.Lock()
	defer sched.Config.BatchScheduler.UnLock()

	err := sched.bind(assumedPod, &v1.Binding{
	...
```
Check whether resources are sufficient: CheckResourceIsEnough
It should not be implemented as a filterFunc, because it needs the whole node list.
scheduler/util/batch.go:
```go
package util

import v1 "k8s.io/api/core/v1"

// CheckResourceIsEnough reports whether the cluster can hold the whole batch task (stub to be filled in).
func CheckResourceIsEnough(pod *v1.Pod, nodes []*v1.Node) (bool, error) {
	return false, nil
}
```
scheduler/core/generic_scheduler.go:
```go
// fanux add checkBatchPodResource
flag, err := util.CheckResourceIsEnough(pod, filteredNodes)
if !flag || err != nil {
	return "", err
}

trace.Step("Prioritizing")
```
Handling the case where resources are not sufficient:
```go
suggestedHost, err := sched.schedule(pod)
// fanux add handle if resource not enough
if err != nil && strings.Contains(err.Error(), common.BatchResourceNotEnough) {
	sched.Config.BatchScheduler.HandleResourceNotEnough(pod)
} else if err != nil {
	...
```
How to get the number of GPUs available on a node
In NodeInfo, allocatableResource minus requestedResource is the available resource:
```go
requestedResource   *Resource
nonzeroRequest      *Resource
allocatableResource *Resource
```
GPUs are ScalarResources, and the resource name is NVIDIAGPUResourceName = "nvidia.com/gpu".
```go
type Resource struct {
	MilliCPU         int64
	Memory           int64
	EphemeralStorage int64
	// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
	// explicitly as int, to avoid conversions and improve performance.
	AllowedPodNumber int
	// ScalarResources
	ScalarResources map[v1.ResourceName]int64
}
```
A podConditionUpdater is added so that the batch scheduler can update the pod condition status:
```go
batchScheduler := batch.NewBatchScheduler(c.schedulerCache, c.podQueue, &binder{c.client}, &podConditionUpdater{c.client})
```
The batch scheduler cache then needs to be passed to generic_scheduler for the resource check.
The check needs to know which pods of the task have already been assumed; subtracting that number from the task's declared pod count gives how many more pods (and therefore GPUs) the batch task still needs.
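Building on the illustrative task cache sketched earlier (again an assumption, not the article's code), the counting that this check relies on could be as simple as:

```go
// GetTaskAssumedPodNum reports how many pods of this pod's task are already sitting
// in the task cache (assumed but not yet bound); the resource check subtracts this
// from the pod count declared in the annotations. Continues the illustrative
// "batch" package sketched above.
func (c *TaskCache) GetTaskAssumedPodNum(pod *v1.Pod) int {
	name, _, ok := getPodBatchMeta(pod)
	if !ok {
		return 0
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.tasks[name]
	if !ok {
		return 0
	}
	return len(t.pods)
}
```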
core/generic_scheduler.go:
```go
// fanux add batch Cache
// check batch pod resource is enough need batch scheduler cache
BatchCache common.TaskCache

// fanux add checkBatchPodResource
flag, err := common.CheckResourceIsEnough(pod, filteredNodes, g.cachedNodeInfoMap, g.BatchCache)
```
factory.go:
```go
// fanux check batch resource is enough need batch scheduler cache
batchCache := batchScheduler.GetTaskCache()

algo := core.NewGenericScheduler(
	...
	batchCache,
)
```
Then, in the resource check:
```go
// shoud not use metadata, need use metadata - assumed pod num in batch cache
_, podNum := GetPodBathMeta(pod)
podNum -= batchCache.GetTaskAssumedPodNum(pod)
```
The detailed algorithm for checking whether resources are sufficient has quite a few details:
```go
// Get the number of GPUs a pod needs; this sums the container limits in the pod.
func GetPodGPUCount(pod *v1.Pod) (count int) {
	for _, c := range pod.Spec.Containers {
		limit, ok := c.Resources.Limits[NVIDIAGPUResourceName]
		l, okay := limit.AsInt64()
		if !ok || !okay {
			continue
		}
		count += int(l)
	}
	glog.Infof("Pod [%s] need GPU [%d]", pod.GetName(), count)
	return
}

// Get a node's idle GPUs; the already-allocated ones must be subtracted.
func GetNodeFreeGPU(nodeInfo *cache.NodeInfo) int {
	if nodeInfo == nil {
		return 0
	}
	allocatable, ok := nodeInfo.AllocatableResource().ScalarResources[NVIDIAGPUResourceName]
	if !ok {
		glog.Errorf("can't fetch allocatable GPU: %v", nodeInfo)
		return 0
	}
	glog.Infof("node [%s] allocatable GPU [%d]", nodeInfo.Node().Name, allocatable)

	requested, ok := nodeInfo.RequestedResource().ScalarResources[NVIDIAGPUResourceName]
	if !ok {
		// glog.Errorf("can't fetch requested GPU: %v", nodeInfo)
		// return 0
		requested = 0
	}
	glog.Infof("node [%s] requested GPU [%d]", nodeInfo.Node().Name, requested)

	available := allocatable - requested
	glog.Infof("available node [%s] GPU: [%d]", nodeInfo.Node().Name, available)
	return int(available)
}

// The most important point: subtract the already-assumed batch pods from the total
// pod number taken from the annotations; that difference is what the task really still needs.
func CheckResourceIsEnough(pod *v1.Pod, nodes []*v1.Node, cachedNodeInfoMap map[string]*cache.NodeInfo, batchCache TaskCache) (bool, error) {
	// if it is not a batch pod, return true, nil
	if !IsBatch(pod) {
		glog.Infof("pod %s is not batch pod", pod.GetName())
		return true, nil
	}

	// shoud not use metadata, need use metadata - assumed pod num in batch cache
	_, podNum := GetPodBathMeta(pod)
	podNum -= batchCache.GetTaskAssumedPodNum(pod)

	everyPodNeedsGPU := GetPodGPUCount(pod)
	if everyPodNeedsGPU == 0 {
		glog.Infof("pod %s require 0 GPU", pod.GetName())
		return true, nil
	}

	// TODO maybe check nodes[1:], node[0] already allocated a pod, CPU and other metrics may reach the limit
	for _, node := range nodes {
		nodeInfo, ok := cachedNodeInfoMap[node.Name]
		if !ok {
			continue
		}
		nodeFree := GetNodeFreeGPU(nodeInfo)
		podNum -= nodeFree / everyPodNeedsGPU
		glog.Infof("pod: [%s] node: [%s] podNum [%d] nodeFree [%d] podNeed [%d]",
			pod.GetName(), node.Name, podNum, nodeFree, everyPodNeedsGPU)
		// the original snippet is truncated at this condition; the remainder is a
		// plausible reconstruction of the closing check
		if podNum <= 0 {
			return true, nil
		}
	}
	// assumed: return an error carrying the BatchResourceNotEnough marker that scheduleOne checks for
	return false, fmt.Errorf(BatchResourceNotEnough)
}
```
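To make the counting concrete, here is a tiny self-contained walk-through of the arithmetic with invented numbers: a task declares 3 pods of 1 GPU each, 1 pod is already assumed, and two nodes have 1 and 2 free GPUs, so the remaining need drops to zero and the check passes.

```go
package main

import "fmt"

// A toy walk-through of the counting in CheckResourceIsEnough, with the
// Kubernetes types stripped away. The figures are invented for illustration.
func main() {
	podNum := 3                 // from the group-pod-num annotation
	assumed := 1                // pods of this task already assumed (from the batch cache)
	everyPodNeedsGPU := 1       // GPUs per pod, summed over its containers
	nodeFreeGPUs := []int{1, 2} // free GPUs per node

	podNum -= assumed
	for _, free := range nodeFreeGPUs {
		podNum -= free / everyPodNeedsGPU
		if podNum <= 0 {
			fmt.Println("resources are enough for the whole task")
			return
		}
	}
	fmt.Println("not enough resources, fail the batch pod")
}
```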