
Merging a deep learning batch task scheduler with the Kubernetes default scheduler



What is a batch task?

In deep learning there are often multi-node, multi-GPU training jobs: several pods are started at the same time, and all of them belong to the same task.

This creates a problem.

Suppose a task needs 100 pods and each pod needs one GPU card, so 100 GPU cards are needed in total, but only 99 GPU cards are available in the cluster. What does the default k8s scheduler do?

Because the default scheduler schedules one pod at a time and only checks whether resources are sufficient for that single pod, the first 99 pods are scheduled successfully and only the last pod fails.

This is very likely to cause the following: the 99 scheduled pods hold their GPUs without releasing them, yet the task as a whole can never run; new tasks cannot be scheduled either, so the entire cluster deadlocks, with resources occupied but doing no useful work.

Therefore, the scheduler needs to check all the resources required by the entire task during scheduling: when the overall cluster resources are insufficient, none of the task's pods should be scheduled.
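To make the all-or-nothing requirement concrete, here is a small self-contained Go sketch (not the scheduler's code; canAdmitTask and its parameters are illustrative names): it adds up how many of the task's pods fit on each node's free GPUs and admits nothing unless every pod fits.

package main

import "fmt"

// canAdmitTask reports whether the cluster can hold every pod of a batch task.
// podsInTask is the total number of pods in the task, gpusPerPod the GPUs each
// pod needs, and freeGPUsPerNode the idle GPUs on each node.
func canAdmitTask(podsInTask, gpusPerPod int, freeGPUsPerNode []int) bool {
    if gpusPerPod <= 0 {
        return true // pods that need no GPU always fit, as far as this check goes
    }
    remaining := podsInTask
    for _, free := range freeGPUsPerNode {
        remaining -= free / gpusPerPod // pods of this task that fit on this node
        if remaining <= 0 {
            return true
        }
    }
    return false
}

func main() {
    // The example above: 100 pods needing 1 GPU each, 99 nodes with 1 free GPU each.
    free := make([]int, 99)
    for i := range free {
        free[i] = 1
    }
    fmt.Println(canAdmitTask(100, 1, free)) // false: none of the 100 pods should be scheduled
}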

The community provides a scheduler that supports this feature.

But this scheduler doesn't work well with the native scheduler.

The biggest problem is that both schedulers keep their own cache, and the contents of the two caches conflict, which leads to chaotic scheduling. Since this batch scheduler cannot run alongside the native scheduler, using it means giving up native features such as affinity.

So what we do is integrate the two sets of features, and the chosen approach is custom development of kube-scheduler itself.

In fact, the scheduler can be extended through an extender, but an extender is too weak: it can only add its own filtering policies to the predicate and priority phases, which is far from enough for batch tasks.

Difficulties in implementation

Add a batch task check in the prioritizing step

Get a pod -> if it is a batch pod -> check whether the cluster resources can satisfy the whole batch task -> if not, the scheduling fails.

We also need to ensure that the other pods in the batch task can be scheduled.

Even if the cluster resources can satisfy this batch task and we go straight to bind, there is a problem:

Suppose the scheduling queue looks like this, the cluster has three GPUs, and the batch task requires three GPUs:

Queue: A batch pod -> other pod -> A batch pod -> other pod -> A batch pod

The first A batch pod: cluster resources are enough, scheduled successfully. The other pods are scheduled in between and occupy GPUs. The remaining A batch pods: not enough GPUs, scheduling fails.

So the end result is that batch task A occupies a GPU while the task as a whole fails to be scheduled, and that GPU is never released.

So should we modify the order of the pod scheduling queue and let the A batch pods be scheduled back to back? It is not that simple.

Pod scheduling is done concurrently in goroutines, so even if you adjust the order of pods in the scheduling queue, there is no guarantee that the other pods of the batch task will be scheduled first.

go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)

Once a batch pod enters the bind logic, there is no turning back.

Could we let all pods in the batch task go through assume first, and if any one of them fails, clean up the other pods that are already bound but not actually running? Then either throw all the pods back into the queue, or simply return failure, clean up the task's pods, and let the upper layer trigger the task again?

Scheduler flow, the scheduleOne logic in scheduler/scheduler.go:

Select a node -> assume the pod onto the node in the cache -> start a goroutine to bind

Therefore, checking at assume time and sending the not-yet-scheduled pods back is not feasible, because earlier pods of the batch task may already have been bound; the only option is to let the last pod of the batch task trigger the binding of the pods before it.

Pre-occupation strategy

Pre-occupation strategy: when the first pod of a batch task arrives, check whether the cluster resources are sufficient. If they are, pre-occupy them by marking the other nodes so that subsequent non-batch pods cannot take them; that way the remaining pods of the batch task will actually have nodes available.

But this still brings us back to the bind problem.

There are two points to this problem:

How do we know what kind of nodes the other pods of the batch task need? If all the pods are identical, the problem can be simplified.

If a later pod fails, the first pod has already been bound, and the same problem occurs again.

In the end, a single pod still cannot be bound before all the pods have been assumed.

To sum up, several places need to be handled:

It is best to use the priority queue and raise the priority of the pods associated with the pod currently being scheduled.

When selecting a node, judge whether the overall cluster resources are sufficient.

Check again when assuming the pod on the selected node: if resources are not enough, or the pod group is not complete, do not go to bind.

The problem is that the earlier pods have already gone through the bind flow, so the key question is how to stop those earlier pods from binding, i.e. how to delay the bind.

Final solution: delayed binding

Solution: special handling during batch task bind

If a pod belongs to a batch task, throw it into the task cache and do not bind it. When the last pod of the batch task lands in the task cache, the task becomes ready and is put into the bind queue; tasks are taken from the bind queue for binding, and task binding is mutually exclusive with ordinary pod binding.
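Below is a minimal Go sketch of that flow. It is not the article's code: the task and delayedBinder types, the handleBatchPod and runBind functions, and the use of plain pod-name strings are all invented for illustration, but they mirror the task cache, the ready condition (all pods present), the bind queue, and the bind mutex described above.

package batch

import "sync"

// task collects the pods of one batch job until all of them have arrived.
type task struct {
    name     string
    wantPods int
    pods     []string // pod names; the real scheduler would keep richer objects
}

func (t *task) ready() bool { return len(t.pods) >= t.wantPods }

// delayedBinder is a toy stand-in for the batch scheduler's task cache plus bind queue.
type delayedBinder struct {
    mu        sync.Mutex       // the same mutex is taken around ordinary pod binding
    tasks     map[string]*task // the "task cache"
    bindQueue chan *task
}

// handleBatchPod parks a batch pod in the task cache instead of binding it; when the
// last pod of the task arrives, the whole task is pushed onto the bind queue.
func (b *delayedBinder) handleBatchPod(taskName, podName string, wantPods int) {
    b.mu.Lock()
    t, ok := b.tasks[taskName]
    if !ok {
        t = &task{name: taskName, wantPods: wantPods}
        b.tasks[taskName] = t
    }
    t.pods = append(t.pods, podName)
    isReady := t.ready()
    b.mu.Unlock()
    if isReady {
        b.bindQueue <- t
    }
}

// runBind binds ready tasks one by one, mutually exclusive with ordinary pod binding.
func (b *delayedBinder) runBind(stop <-chan struct{}) {
    for {
        select {
        case t := <-b.bindQueue:
            b.mu.Lock()
            for _, pod := range t.pods {
                _ = pod // bind each pod of the ready task here (omitted in this sketch)
            }
            b.mu.Unlock()
        case <-stop:
            return
        }
    }
}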

Usage

To use batch tasks, add two annotations to the pod:

annotations:
  scheduling.k8s.io/group-name: qj-1
  scheduling.k8s.io/group-pod-num: 3

Pods carrying these two annotations belong to the same task, and group-pod-num indicates how many pods the task contains.
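The helper functions that read these annotations, IsBatch and GetPodBathMeta, are used later in the resource check but their bodies are not shown here; the following is a minimal sketch of what they could look like, assuming the values are read straight from the annotations (the constant names are illustrative):

package common

import (
    "strconv"

    v1 "k8s.io/api/core/v1"
)

const (
    groupNameAnnotation   = "scheduling.k8s.io/group-name"
    groupPodNumAnnotation = "scheduling.k8s.io/group-pod-num"
)

// IsBatch reports whether a pod belongs to a batch task, i.e. carries both annotations.
func IsBatch(pod *v1.Pod) bool {
    _, hasName := pod.Annotations[groupNameAnnotation]
    _, hasNum := pod.Annotations[groupPodNumAnnotation]
    return hasName && hasNum
}

// GetPodBathMeta returns the task name and the declared number of pods in the task.
func GetPodBathMeta(pod *v1.Pod) (string, int) {
    name := pod.Annotations[groupNameAnnotation]
    num, err := strconv.Atoi(pod.Annotations[groupPodNumAnnotation])
    if err != nil {
        return name, 0
    }
    return name, num
}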

Ideally a separate CRD would be defined to describe the task, which would reduce the coupling, but that is more troublesome to implement since one more CRD has to be watched, so out of laziness it was not done.

Implementation

Delayed binding flow:

If it is an ordinary pod, it is bound directly after assume finds a node. If it is a batch task pod, it is thrown into the batch cache and we return. A goroutine keeps checking whether there is a ready task in the batch cache (one whose pods have all arrived); ready tasks are thrown into the bind queue, and a worker takes ready tasks from the queue and binds all their pods, mutually exclusive with ordinary pod binding.

Batch scheduler interface and members

Run starts a goroutine that checks for ready tasks and pushes them into the bind queue.

RunBind starts a goroutine that binds tasks.

PodQuePriority dynamically adjusts priorities in the pod queue so that the pods belonging to the task currently being scheduled are scheduled first.
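Collecting the members named above together with the calls that appear later (IsBatchPod, HandleBatchPod, HandleResourceNotEnough, Lock/UnLock, GetTaskCache), the batch scheduler's surface could look roughly like the sketch below; the exact signatures are assumptions, since the article only names the methods.

package batch

import v1 "k8s.io/api/core/v1"

// TaskCache holds the pods of each batch task until the task is ready.
// (In the article TaskCache lives in a "common" package; it is kept in one file here for brevity.)
type TaskCache interface {
    // GetTaskAssumedPodNum returns how many pods of this pod's task are already assumed.
    GetTaskAssumedPodNum(pod *v1.Pod) int
}

// BatchScheduler gathers the operations referenced throughout the article.
type BatchScheduler interface {
    // IsBatchPod tells scheduleOne whether the pod carries the batch annotations.
    IsBatchPod(pod *v1.Pod) bool
    // HandleBatchPod throws an assumed batch pod into the task cache instead of binding it.
    HandleBatchPod(pod *v1.Pod) error
    // HandleResourceNotEnough reacts when CheckResourceIsEnough fails for the pod's task.
    HandleResourceNotEnough(pod *v1.Pod)
    // Run starts the goroutine that moves ready tasks into the bind queue.
    Run(stopCh <-chan struct{})
    // RunBind starts the goroutine that takes ready tasks from the bind queue and binds them.
    RunBind(stopCh <-chan struct{})
    // Lock and UnLock make task binding mutually exclusive with ordinary pod binding.
    Lock()
    UnLock()
    // GetTaskCache exposes the cache so generic_scheduler can do the resource check.
    GetTaskCache() TaskCache
    // PodQuePriority raises the priority of the other pods of the task being scheduled.
    PodQuePriority(pod *v1.Pod)
}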

Execution flow:

Delayed binding

scheduler/scheduler.go:

// fanux: if it is a batch pod, handle it and return
if sched.Config.BatchScheduler.IsBatchPod(assumedPod) {
    err = sched.Config.BatchScheduler.HandleBatchPod(assumedPod)
    if err != nil {
        glog.Errorf("schedule batch pod failed: %v/%v", assumedPod.Namespace, assumedPod.Name)
    }
    return
}

A bind mutex is added to prevent batch tasks and ordinary pods from binding at the same time:

go func() {
    // fanux add bind mutex
    sched.Config.BatchScheduler.Lock()
    defer sched.Config.BatchScheduler.UnLock()
    err := sched.bind(assumedPod, &v1.Binding{

Check whether resources are sufficient: CheckResourceIsEnough

filterFunc shouldn't be used here; the node list is needed.

scheduler/util/batch.go:

package util

import "k8s.io/api/core/v1"

// CheckResourceIsEnough checks whether the cluster resources can satisfy the whole batch task (stub for now)
func CheckResourceIsEnough(pod *v1.Pod, nodes []*v1.Node) (bool, error) {
    return false, nil
}

scheduler/core/generic_scheduler.go:

// fanux add checkBatchPodResource
flag, err := util.CheckResourceIsEnough(pod, filteredNodes)
if !flag || err != nil {
    return "", err
}

trace.Step("Prioritizing")

Deal with the situation when resources are insufficient

suggestedHost, err := sched.schedule(pod)
// fanux add: handle the case where resources are not enough
if err != nil && strings.Contains(err.Error(), common.BatchResourceNotEnough) {
    sched.Config.BatchScheduler.HandleResourceNotEnough(pod)
} else if err != nil {

How to get the number of GPUs already allocated on a node

In NodeInfo, allocatableResource minus requestedResource is the available resource.

requestedResource   *Resource
nonzeroRequest      *Resource
allocatableResource *Resource

GPUs are tracked in ScalarResources, and the resource name is: NVIDIAGPUResourceName = "nvidia.com/gpu"

type Resource struct {
    MilliCPU         int64
    Memory           int64
    EphemeralStorage int64
    // We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
    // explicitly as int, to avoid conversions and improve performance.
    AllowedPodNumber int
    // ScalarResources
    ScalarResources map[v1.ResourceName]int64
}

Add a podConditionUpdater so the batch scheduler can update the pod condition status:

batchScheduler := batch.NewBatchScheduler(c.schedulerCache, c.podQueue, &binder{c.client}, &podConditionUpdater{c.client})

The batch scheduler's cache also needs to be passed to generic_scheduler, because it is needed for the resource check.

We need to know which pods have already been assumed; subtracting that number from the task's declared pod count tells us how much more GPU the batch task still needs.

core/generic_scheduler.go:

// fanux add batch cache
// checking whether a batch pod's resources are enough needs the batch scheduler cache
BatchCache common.TaskCache

// fanux add checkBatchPodResource
flag, err := common.CheckResourceIsEnough(pod, filteredNodes, g.cachedNodeInfoMap, g.BatchCache)

factory.go:

// fanux: checking whether batch resources are enough needs the batch scheduler cache
batchCache := batchScheduler.GetTaskCache()

algo := core.NewGenericScheduler(
    ...
    batchCache,
)

Then, in the resource check:

// should not use the annotation metadata directly; subtract the assumed pod num in the batch cache
_, podNum := GetPodBathMeta(pod)
podNum -= batchCache.GetTaskAssumedPodNum(pod)

The detailed algorithm for checking whether resources are sufficient:

There are many details.

// Get the number of GPUs required by a pod: this requires adding up the container limits in the pod
func GetPodGPUCount(pod *v1.Pod) (count int) {
    for _, c := range pod.Spec.Containers {
        limit, ok := c.Resources.Limits[NVIDIAGPUResourceName]
        l, okay := limit.AsInt64()
        if !ok || !okay {
            continue
        }
        count += int(l)
    }
    glog.Infof("Pod [%s] need GPU [%d]", pod.GetName(), count)
    return
}

// Get a node's idle GPUs: the already-requested GPUs need to be subtracted from the allocatable ones
func GetNodeFreeGPU(nodeInfo *cache.NodeInfo) int {
    if nodeInfo == nil {
        return 0
    }
    allocatable, ok := nodeInfo.AllocatableResource().ScalarResources[NVIDIAGPUResourceName]
    if !ok {
        glog.Errorf("can't fetch allocatable GPU: %v", nodeInfo)
        return 0
    }
    glog.Infof("node [%s] allocatable GPU [%d]", nodeInfo.Node().Name, allocatable)

    requested, ok := nodeInfo.RequestedResource().ScalarResources[NVIDIAGPUResourceName]
    if !ok {
        // glog.Errorf("can't fetch requested GPU: %v", nodeInfo)
        // return 0
        requested = 0
    }
    glog.Infof("node [%s] requested GPU [%d]", nodeInfo.Node().Name, requested)

    available := allocatable - requested
    glog.Infof("available node [%s] GPU: [%d]", nodeInfo.Node().Name, available)
    return int(available)
}

// The most important point here is to subtract the batch pods that have already been assumed
// from the total pod number in the annotations; that is the task's real remaining demand
func CheckResourceIsEnough(pod *v1.Pod, nodes []*v1.Node, cachedNodeInfoMap map[string]*cache.NodeInfo, batchCache TaskCache) (bool, error) {
    // if it is not a batch pod, return true, nil
    if !IsBatch(pod) {
        glog.Infof("pod %s is not batch pod", pod.GetName())
        return true, nil
    }

    // should not use the annotation metadata directly; subtract the assumed pod num in the batch cache
    _, podNum := GetPodBathMeta(pod)
    podNum -= batchCache.GetTaskAssumedPodNum(pod)

    everyPodNeedsGPU := GetPodGPUCount(pod)
    if everyPodNeedsGPU == 0 {
        glog.Infof("pod %s require 0 GPU", pod.GetName())
        return true, nil
    }

    // TODO maybe check nodes[1:]; node[0] already has a pod allocated, CPU and other metrics may have reached their limit
    for _, node := range nodes {
        nodeInfo, ok := cachedNodeInfoMap[node.Name]
        if !ok {
            continue
        }
        nodeFree := GetNodeFreeGPU(nodeInfo)
        podNum -= nodeFree / everyPodNeedsGPU
        glog.Infof("pod: [%s] node: [%s] podNum [%d] nodeFree [%d] podNeed [%d]", pod.GetName(), node.Name, podNum, nodeFree, everyPodNeedsGPU)
        // the original snippet breaks off at this check; the lines below are an assumed completion
        if podNum <= 0 {
            return true, nil
        }
    }
    return false, fmt.Errorf("%s, pod: %s", BatchResourceNotEnough, pod.GetName())
}
