Implementing GPU type scheduling based on Kubernetes


This article explains in detail how GPU type scheduling can be implemented on top of Kubernetes. I hope you will come away with a solid understanding of the relevant ideas after reading it.

Industry background

Nowadays, as companies invest more in machine learning and deep learning, they are finding that building an AI system from scratch is far from easy.

Take deep learning as an example: computing power is the foundation of everything. To train better models on massive data and to accelerate the whole process, an enterprise's IT systems must be able to quickly and efficiently provision and manage GPU resources at scale. At the same time, because computing resources are very expensive, cost control requires enterprises to maximize GPU utilization, for example through distributed training.

Facing these new requirements, cloud native technology built on Kubernetes offers a new way of working for artificial intelligence. Kubernetes can seamlessly extend model training, inference, and deployment to multi-cloud GPU clusters, allowing data scientists to automate the deployment, maintenance, scheduling, and operation of GPU-accelerated application containers across cluster nodes.

In versions 1.6 and 1.9, Kubernetes successively added support for managing and scheduling NVIDIA GPUs and AMD GPUs in container clusters, further improving its ability to manage and schedule extended resources such as GPUs.

However, as the foundation of a new generation of AI development, Kubernetes still has shortcomings. When allocating compute resources to a training task, it usually assigns whatever GPU happens to be on the node where the container lands, rather than letting the user specify a particular type of GPU.

While this is sufficient for most deep learning training scenarios, Kubernetes falls short when data scientists want more flexibility, such as using a higher-performance GPU or a specific GPU model.

So how can GPU type scheduling be implemented flexibly on top of Kubernetes?

Community solution

Question: how can native Kubernetes make a Pod use a specified type of GPU?

Suppose the cluster has two nodes with GPUs: two Tesla K80s on node A and two Tesla P100s on node B. Kubernetes can schedule Pods to the appropriate node through Node Labels and a Node Selector, as shown below.

First, label the nodes with the accelerator type they carry:

# Label your nodes with the accelerator type they have.
$ kubectl label nodes node-a accelerator=nvidia-tesla-k80
$ kubectl label nodes node-b accelerator=nvidia-tesla-p100

At this point, node A looks as follows:

$ kubectl describe node node-a
Name:         node-a
Roles:        ...
Labels:       beta.kubernetes.io/arch=amd64
              beta.kubernetes.io/os=linux
              kubernetes.io/hostname=node-a
              accelerator=nvidia-tesla-k80
Annotations:  kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
...

When a Pod needs to use an NVIDIA Tesla K80 GPU, it can request one as follows:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-k80

The above approach seems to solve the problem, but it actually treats the symptoms rather than the root causes.

Imagine a cluster where several GPUs are mounted on the same node: how do we filter among them? And what if a user mounts several NVIDIA Tesla K80 cards with different amounts of video memory on the same node and wants a GPU with more than 10 GiB of video memory?

Kubernetes's Node Label and Node Selector cannot solve these problems.

Developers in the upstream community have discussed this issue many times, but no practical solution has landed yet. Even so, the community has produced plenty of valuable ideas. Below is one of the most discussed proposals, and our own scheme borrows some of its design.

A new ResourceClass API is added to match extended resources in the cluster; the details are described below.

The Node API is modified so that NodeStatus carries a description of the node's extended resources:

type NodeStatus struct {
	...
	ComputeResources []ComputeResource
}

type ComputeResource struct {
	// unique and deterministically generated. "resourceName-propertyHash" naming convention,
	// where propertyHash is generated by calculating a hash over all resource properties
	Name string
	// raw resource name. E.g.: nvidia.com/nvidia-gpu
	ResourceName string
	// resource metadata received from device plugin.
	// e.g., gpuType: k80, zone: us-west1-b
	Properties map[string]string
	// list of deviceIds received from device plugin.
	// e.g., ["nvidia0", "nvidia1"]
	Devices []string
	// similar to the above but only contains allocatable devices.
	AllocatableDevices []string
}

The device plugin registers the extended resource's information with the Kubelet component through the Device Plugin API, and the Kubelet then updates the node status, i.e. the ComputeResources field described in the previous step, with the information it receives.

The scheduler filters and selects appropriate nodes according to the ResourceClass definition: it watches NodeStatus.ComputeResources for changes and caches the allocation state of each node's ComputeResources, so that a ResourceClass can be matched to a suitable node.
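
The ResourceClass API itself is not shown above. Reconstructed purely from the description of this proposal, it might look roughly like the following Go sketch; the type and field names here are illustrative guesses, not the actual upstream design.

// Package proposal is a rough reconstruction of the ResourceClass API
// discussed in the upstream community; names and fields are illustrative
// guesses based only on the description above, not the real proposal.
package proposal

// ResourceClass groups extended resources that share certain properties,
// so that a Pod can request "a K80 GPU" instead of "any nvidia.com/gpu".
type ResourceClass struct {
	// Name is what Pods put in their resource requests/limits,
	// e.g. "nvidia.tesla.k80".
	Name string

	// RawResourceName is the underlying extended resource reported by the
	// device plugin, e.g. "nvidia.com/nvidia-gpu".
	RawResourceName string

	// PropertySelector is matched against the Properties field of
	// NodeStatus.ComputeResources, e.g. {"gpuType": "k80"}.
	PropertySelector map[string]string
}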

The community proposal is more mature than Node Label and Node Selector. However, because it modifies Kubernetes core code and the core API, progress on this widely discussed problem has been very slow, and no final conclusion has been reached.

Caicloud's implementation of GPU type scheduling

In order to let Pods use a specified type of GPU as soon as possible and to integrate this capability into Caicloud Compass, we propose a new scheme based on the upstream community design.

It makes full use of Kubernetes's extensibility and plug-in mechanisms and follows the design principles of minimal intrusion and easy migration. However, to simplify usage for end users and to reduce development and maintenance effort, it does still modify the Kubelet and Scheduler components.

At the same time, because we adopt a multi-scheduler implementation, the changes to the Scheduler component do not affect the existing cluster or subsequent version upgrades, and the Kubelet changes are backward compatible, so applications already running in the cluster are not affected.

This scheme supports not only GPU resources but also other extended resources such as InfiniBand and FPGAs. It relies on the following existing Kubernetes mechanisms:

Scheduler Extender mechanism

Device Plugin mechanism

API server extension mechanism (CRD)

Admission extension mechanism (ResourceQuota)

In version 1.6, Kubernetes allowed custom resources to be created through ThirdPartyResource (TPR); version 1.7 introduced its replacement, CustomResourceDefinition (CRD).

CRD allows you to customize a resource type, so developers no longer need to modify the Kubernetes core API or add new resources through API server aggregation, making it much easier to develop and maintain.

In our scheme, we define two resources through CRD: ExtendedResource and ResourceClass. ExtendedResource describes a single extended resource, for example an NVIDIA GPU; ResourceClass defines which extended resources a container may choose. ResourceClass is consumed in the same way as an ordinary extended resource in Kubernetes: users can specify it directly on the container, just like CPU and memory.
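
To make these two custom resources more concrete, the following is a rough Go sketch of what their type definitions might look like. The field names are illustrative guesses based on the descriptions in this article, not the actual Caicloud API.

// Rough sketch of the two CRD-backed types; field names are illustrative,
// not the actual Caicloud API.
package types

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ExtendedResource describes one physical device on a node, for example a
// single NVIDIA GPU, together with the attributes used for matching
// (model, video memory, frequency, ...).
type ExtendedResource struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ExtendedResourceSpec `json:"spec"`
}

type ExtendedResourceSpec struct {
	// RawResourceName is the device-plugin resource name, e.g. "nvidia.com/gpu".
	RawResourceName string `json:"rawResourceName"`
	// DeviceID identifies the device on its node, e.g. "nvidia0".
	DeviceID string `json:"deviceID"`
	// NodeName is the node the device is attached to.
	NodeName string `json:"nodeName"`
	// Properties holds the attributes a ResourceClass selector can match,
	// e.g. model: "NVIDIA Tesla K80", memory: "11GiB".
	Properties map[string]string `json:"properties"`
}

// ResourceClass selects a class of ExtendedResources by their properties;
// containers request it like an ordinary resource (e.g. nvidia.tesla.k80: 1).
type ResourceClass struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ResourceClassSpec `json:"spec"`
}

type ResourceClassSpec struct {
	// Selector is matched against ExtendedResource properties.
	Selector metav1.LabelSelector `json:"selector"`
}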

The basic architecture of the Caicloud solution consists of two core modules:

Core module 1: Scheduler Extender. Taking advantage of the Scheduler component's extensibility, the Scheduler Extender is responsible for scheduling Pods whose containers request ResourceClass resources. It queries the definition of the ResourceClass object, filters the ExtendedResource objects on the candidate nodes, finds a suitable node and binds the Pod to it, and writes the chosen ExtendedResource into the Pod's annotations for the Kubelet component to use. Because the Scheduler Extender mechanism works over HTTP, we run it behind a separate scheduler that handles only Pods requesting extended resources, so the performance of the cluster's default scheduler is unaffected; this multi-scheduler approach is also easy to port.

Core module 2: NVIDIA Device Plugin. This component handles only NVIDIA GPU extended resources; in addition to communicating with the Kubelet component, it is also responsible for creating and maintaining the ExtendedResource objects.

So how does this solution let users specify a GPU type when several different types of GPU sit on the same node?

Let's assume there are two GPUs on node A: one NVIDIA Tesla K80 and one NVIDIA Tesla P100. The NVIDIA Device Plugin on this node then creates two ExtendedResource objects describing the basic attributes of the two cards, such as model, video memory, and frequency. At the same time, it registers with the Kubelet to inform it that there are two GPUs on node A.

At this point, if a user wants to run an application that uses a K80 GPU, they only need to create a ResourceClass resource that declares, for example through a selector, that GPUs of model NVIDIA Tesla K80 should be used, and then request that ResourceClass in the container:

kind: ResourceClass
metadata:
  name: nvidia.tesla.k80
spec:
  selector:
    matchLabels:
      model: "NVIDIA Tesla K80"
---
kind: Pod
metadata:
  name: example-pod
spec:
  containers:
    - name: example-container
      resources:
        limits:
          nvidia.tesla.k80: 1

After its own filtering passes, the Kubernetes scheduler calls the Scheduler Extender's Filter method, passing in the Pod to be scheduled and the filtered NodeList, so that the ResourceClass can be matched against the ExtendedResource objects that satisfy it and a suitable node can be found.

Once a suitable node is found, the scheduler calls the Scheduler Extender's Bind method to bind the Pod to the node and to write the chosen ExtendedResource object into the Pod's annotations for the Kubelet component to use.
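
As a minimal sketch of these Filter and Bind hooks, the extender can be a small HTTP server like the one below. The request and response structs are simplified stand-ins for the scheduler's extender API types, the routes and port are arbitrary, and the actual ResourceClass/ExtendedResource matching logic is elided, so this is only an illustration, not the Caicloud implementation.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Simplified stand-ins for the scheduler extender's request/response types;
// only the fields this sketch touches are kept.
type extenderFilterArgs struct {
	Pod       json.RawMessage `json:"pod"`
	NodeNames []string        `json:"nodenames"`
}

type extenderFilterResult struct {
	NodeNames   []string          `json:"nodenames"`
	FailedNodes map[string]string `json:"failedNodes,omitempty"`
	Error       string            `json:"error,omitempty"`
}

type extenderBindingArgs struct {
	PodName      string `json:"podName"`
	PodNamespace string `json:"podNamespace"`
	Node         string `json:"node"`
}

type extenderBindingResult struct {
	Error string `json:"error,omitempty"`
}

// filterHandler should keep only the nodes whose ExtendedResource objects
// satisfy the ResourceClass requested by the Pod; the actual matching is
// elided here and every candidate node is passed through unchanged.
func filterHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderFilterArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(extenderFilterResult{NodeNames: args.NodeNames})
}

// bindHandler would create the Binding for the Pod and write the chosen
// ExtendedResource (device IDs) into the Pod's annotations for the Kubelet;
// both API calls are elided in this sketch.
func bindHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderBindingArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	log.Printf("binding pod %s/%s to node %s", args.PodNamespace, args.PodName, args.Node)
	json.NewEncoder(w).Encode(extenderBindingResult{})
}

func main() {
	// The second scheduler is configured with an extender whose urlPrefix
	// points at this server and whose filterVerb/bindVerb map to these routes.
	http.HandleFunc("/filter", filterHandler)
	http.HandleFunc("/bind", bindHandler)
	log.Fatal(http.ListenAndServe(":8888", nil))
}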

After the Pod is bound to the node, the Kubelet on that node starts creating the containers. It reads from the Pod's annotations which GPU the container should use, and then calls the NVIDIA Device Plugin's Allocate method through the Device Plugin API.

Allocate takes as its parameter the DeviceIDs of the GPUs the container will use. It looks up the corresponding GPU information by DeviceID, returns it to the Kubelet as environment variables, and the Kubelet then actually creates the container.
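
As a minimal sketch of what such an Allocate implementation can look like, assuming the standard Device Plugin v1beta1 API and the NVIDIA container runtime's NVIDIA_VISIBLE_DEVICES convention (this is not the actual Caicloud code):

// Package gpuplugin sketches the Allocate call described above. This is a
// simplified illustration; a real device plugin must also implement
// ListAndWatch, registration with the Kubelet, and the other
// DevicePluginServer methods.
package gpuplugin

import (
	"context"
	"strings"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type gpuDevicePlugin struct{}

// Allocate receives, per container, the device IDs chosen for it and answers
// with the environment the container runtime needs to expose those GPUs.
func (p *gpuDevicePlugin) Allocate(ctx context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, containerReq := range req.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				// NVIDIA's container runtime reads NVIDIA_VISIBLE_DEVICES to
				// decide which GPUs to mount into the container; depending on
				// the setup the value is a list of indices or UUIDs rather
				// than names like "nvidia0".
				"NVIDIA_VISIBLE_DEVICES": strings.Join(containerReq.DevicesIDs, ","),
			},
		})
	}
	return resp, nil
}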

As this flow shows, whenever we want to use a specific GPU model or a class of GPUs with certain properties, we only need to declare a corresponding ResourceClass resource object, for example:

kind: ResourceClass
metadata:
  name: nvidia.high.mem
spec:
  selector:
    - matchExpressions:
        - key: "memory"
          operator: "Gt"
          values:
            - "10GiB"

Going further, by implementing a controller that watches ExtendedResource resources in the cluster, we can automatically create a ResourceClass object for each type of ExtendedResource, providing users with a set of default ResourceClass objects.

In a real production cluster, we not only need to satisfy the resource needs of different applications, but also need to limit how much of a resource each application may consume and to allocate resources across namespaces. In Kubernetes, ResourceQuota objects are normally used to restrict resources per namespace, for example:

kind: ResourceQuota
metadata:
  name: example-quota
  namespace: system
spec:
  hard:
    cpu: "10"
    memory: 20Gi
    nvidia.com/gpu: "5"

From the ResourceQuota definition above, we can see that the system namespace may use at most 5 NVIDIA GPUs, but nothing limits which type of GPU may be used.

So how do we implement the restrictions on GPU types?

First, extended resources such as GPUs are scalar resources, and their quantities can only be limited to integer values.

Second, from the scheme above we know that a ResourceClass represents one type of extended resource, so restricting extended resources by type really means restricting ResourceClass usage.

Understood this way, the problem becomes simple and clear. The corresponding ResourceQuota is given directly below:

kind: ResourceQuota
metadata:
  name: example-quota
  namespace: system
spec:
  hard:
    cpu: "10"
    memory: 20Gi
    nvidia.tesla.k80: "5"

In addition to GPU type scheduling, this solution can also solve the GPU sharing problem. This is also a hot topic of discussion in the upstream community.

ExtendedResource objects record information such as a GPU's frequency and video memory. When multiple containers want to use the same GPU, we can define a ResourceClass object that declares how much video memory is requested (it is the video memory that is shared). When deploying the application, we then only need to request that ResourceClass in the container, and the Scheduler Extender will filter the matching ExtendedResource objects and bind the Pod to a suitable node.

To make sharing work, we may also need to record video memory usage in the ExtendedResource object for the scheduler to consult. Of course, resource isolation and enforcement are not considered here; they require a separate implementation and further discussion.

That concludes this discussion of implementing GPU type scheduling on Kubernetes. I hope the content above has been helpful. If you found the article worthwhile, feel free to share it so that more people can see it.
