This article introduces how to use GPUs for AI training in Kubernetes clusters. Many people run into difficulties with this in practice, so let the editor walk you through how to handle these situations. I hope you read carefully and come away with something useful!
Matters needing attention
As of Kubernetes 1.8:
GPU support is still experimental (an Alpha feature), which means it is not recommended to use Kubernetes to manage and schedule GPU resources in a production environment.
Only NVIDIA GPUs are supported.
Pods cannot share the same GPU, and even different containers within the same Pod cannot share the same GPU. This is by far the hardest point to accept about Kubernetes GPU support: a GPU is very expensive, and a single training process usually cannot fully utilize one, so GPU resources are inevitably wasted.
The number of GPUs each container requests must be either 0 or a positive integer; fractions are not allowed, that is, requesting only part of a GPU is not supported (see the sketch after this list).
The differing compute capability of different GPU models is not taken into account; if you need that, consider using NodeAffinity to influence the scheduling process.
Only docker is supported as the container runtime for using GPUs; if you use rkt or another runtime, you may have to wait.
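As a minimal illustration of the whole-number rule above (using the 1.8-era resource name that appears throughout this article; the value 1 is just an example):

resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 1      # valid: whole GPUs only
    # alpha.kubernetes.io/nvidia-gpu: 0.5  # invalid: fractional GPUs are rejected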
Logic diagram
Currently, Kubernetes is only responsible for detecting and scheduling GPU resources; it is docker that actually talks to the NVIDIA driver. The overall flow is: Kubernetes (kube-scheduler and kubelet) schedules and allocates GPU resources, docker (with nvidia-docker) starts the containers, and the NVIDIA driver on the host exposes the GPUs to those containers.
Let kubelet discover GPU resources so they can be scheduled
Verify that the GPU servers in the Kubernetes cluster have the NVIDIA drivers installed and loaded; you can use nvidia-docker-plugin to confirm that the drivers are loaded.
Please refer to nvidia-docker 2.0 installation for how to install it.
How do you determine whether the NVIDIA drivers are ready? Run kubectl get node $GPU_Node_Name -o yaml to view the Node's information. If you see alpha.kubernetes.io/nvidia-gpu: $Gpu_num under .status.capacity, kubelet has successfully discovered the local GPU resources through the driver (a sketch of the relevant Node yaml fragment follows this list of checks).
Make sure the --feature-gates flag of kube-apiserver, kube-controller-manager, kube-scheduler, kubelet and kube-proxy contains Accelerators=true, i.e. --feature-gates=Accelerators=true (in practice not every component needs it, for example kube-proxy).
Check whether UEFI is enabled in the BIOS. If so, turn it off immediately, otherwise the NVIDIA driver installation may fail.
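For reference, once kubelet has detected the GPU through the driver as described above, the relevant fragment of the Node yaml looks roughly like the following sketch (the cpu and memory values are purely illustrative):

status:
  capacity:
    alpha.kubernetes.io/nvidia-gpu: "1"
    cpu: "24"               # illustrative
    memory: 131927300Ki     # illustrative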
Use the Nvidia k8s-device-plugin
If you are using Kubernetes 1.8, you can also take advantage of the Alpha device plugin feature, which lets third-party device plugins discover and report resource information to kubelet. NVIDIA provides a corresponding plugin; please refer to nvidia k8s-device-plugin. The Nvidia k8s-device-plugin is deployed to the GPU servers through a DaemonSet. The following is the content of its yaml description file:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
      - image: nvidia-device-plugin:1.0.0
        name: nvidia-device-plugin-ctr
        imagePullPolicy: Never
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: ALL
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: utility,compute
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
As for the Kubernetes Device Plugin, I will write a separate blog post with an in-depth analysis when I get the chance.
How to use GPU in Pod
Unlike cpu and memory, you must explicitly declare the number of GPUs you intend to use, by setting alpha.kubernetes.io/nvidia-gpu in the container's resources.limits to the number of GPUs you want. Setting it to 1 is usually enough; there should not be many training scenarios in which one worker needs to monopolize several GPUs.
kind: Pod
apiVersion: v1
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container-1
    image: gcr.io/google_containers/pause:2.0
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - mountPath: /usr/local/nvidia
      name: nvidia
  volumes:
  - hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.98
    name: nvidia
Note that the host's nvidia_driver directory needs to be mounted into the container at /usr/local/nvidia through a hostPath volume.
Some readers may already be wondering: why is resources.requests not set, and only resources.limits set directly?
Readers familiar with LimitRanger and Resource QoS in Kubernetes will recognize that this way of setting GPU resources corresponds to the Guaranteed QoS class, which means:
You can explicitly set only limits without requests; requests is then effectively equal to limits.
You can explicitly set both limits and requests, but they must be equal.
You cannot set only requests without limits; that would be the Burstable case.
A resources section that satisfies these rules is sketched below.
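To illustrate, here is a minimal sketch of a GPU resources section that satisfies these rules (the GPU count of 1 is just an example):

resources:
  requests:
    alpha.kubernetes.io/nvidia-gpu: 1   # if requests is set at all, it must equal limits
  limits:
    alpha.kubernetes.io/nvidia-gpu: 1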
Note that the Kubernetes 1.8.0 release contains a bug: setting GPU requests lower than limits is allowed. For the specific issue, please refer to Issue 1450. The code has been merged into v1.8.0-alpha.3, so pay attention when using it. The following is the corresponding fix:
pkg/api/v1/validation/validation.go

func ValidateResourceRequirements(requirements *v1.ResourceRequirements, fldPath *field.Path) field.ErrorList {
    ...
    // Check that request <= limit.
    limitQuantity, exists := requirements.Limits[resourceName]
    if exists {
        if quantity.Cmp(limitQuantity) > 0 {
            allErrs = append(allErrs, field.Invalid(reqPath, quantity.String(),
                fmt.Sprintf("must be less than or equal to %s limit", resourceName)))
        } else if resourceName == v1.ResourceNvidiaGPU && quantity.Cmp(limitQuantity) != 0 {
            allErrs = append(allErrs, field.Invalid(reqPath, quantity.String(),
                fmt.Sprintf("must be equal to %s request", v1.ResourceNvidiaGPU)))
        }
    }
    ...
    return allErrs
}
For more information about Kubernetes Resource QoS, please refer to my other blog post: Kubernetes Resource QoS Mechanism interpretation.
Enhanced GPU scheduling using NodeAffinity
As mentioned earlier, Kubernetes does not distinguish between GPU hardware models or schedule them differently by default. If you need this, you can achieve it through NodeAffinity or NodeSelector (NodeAffinity can express everything NodeSelector can and is much more powerful; NodeSelector is likely to be deprecated soon).
First, apply the corresponding label to the GPU servers. You can do this in either of two ways:
Add --node-labels=alpha.kubernetes.io/nvidia-gpu-name=$NVIDIA_GPU_NAME to the kubelet startup flags. You can of course use a different custom key, but keep it readable. Kubelet must be restarted for this to take effect, so it is the static approach.
Modify the corresponding Node object through the REST client to add the label, for example by running kubectl label node $GPU_Node_Name alpha.kubernetes.io/nvidia-gpu-name=$NVIDIA_GPU_NAME. This takes effect immediately, and labels can be added and removed at any time, so it is the dynamic approach. The resulting Node metadata is sketched below.
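Either way, the Node object ends up carrying the label in its metadata, roughly as in the following sketch (the value shown for $NVIDIA_GPU_NAME is illustrative and must be a valid label value):

metadata:
  labels:
    alpha.kubernetes.io/nvidia-gpu-name: Tesla-K80   # example value only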
Then, in the Pod Spec that needs to use the specified GPU hardware, add the corresponding NodeAffinity of type requiredDuringSchedulingIgnoredDuringExecution, as shown below:
kind: Pod
apiVersion: v1
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/affinity: >
      {
        "nodeAffinity": {
          "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
              {
                "matchExpressions": [
                  {
                    "key": "alpha.kubernetes.io/nvidia-gpu-name",
                    "operator": "In",
                    "values": ["Tesla K80", "Tesla P100"]
                  }
                ]
              }
            ]
          }
        }
      }
spec:
  containers:
  - name: gpu-container-1
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - mountPath: /usr/local/nvidia
      name: nvidia
  volumes:
  - hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.98
    name: nvidia

Here, Tesla K80 and Tesla P100 are both NVIDIA GPU models.

Use CUDA Libs
Typically, the CUDA libraries are installed on the GPU server, so Pods that use GPUs can consume the CUDA libs through a hostPath volume.
kind: Pod
apiVersion: v1
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container-1
    image: gcr.io/google_containers/pause:2.0
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - mountPath: /usr/local/nvidia
      name: nvidia
  volumes:
  - hostPath:
      path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.98
    name: nvidia

GPU training in TensorFlow
Refer to "How to land TensorFlow on Kubernetes" for running TensorFlow in the Kubernetes cluster; you can then create a distributed TensorFlow cluster and start training.
The difference is that in the Job yaml for the workers, following the description above, you need to:
Replace the docker image with tensorflow:1.3.0-gpu.
Add GPU resource limits to the container and remove the cpu and memory resource requests settings.
Mount the corresponding CUDA libs; you can then use /device:GPU:1, /device:GPU:2, ... for accelerated training.
A sketch of such a worker spec fragment follows.
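Putting these changes together, a minimal sketch of the worker part of the Job yaml might look like the following (the container name is made up for illustration, and the image and driver path reuse the values from the examples above):

containers:
- name: tensorflow-worker          # illustrative name
  image: tensorflow:1.3.0-gpu      # image as given in the text above
  resources:
    limits:
      alpha.kubernetes.io/nvidia-gpu: 1
  volumeMounts:
  - mountPath: /usr/local/nvidia
    name: nvidia
volumes:
- hostPath:
    path: /var/lib/nvidia-docker/volumes/nvidia_driver/384.98
  name: nvidia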
This concludes "How to use GPU for AI training in a Kubernetes cluster". Thank you for reading. If you want to learn more about the industry, you can follow this website; the editor will keep publishing practical, high-quality articles for you!