This article introduces how to achieve GPU scheduling and sharing in Kubernetes. Many people have questions about this topic in daily operation, so the following walks through the underlying mechanisms and a set of practical, easy-to-follow modifications; I hope it helps answer those doubts. Please follow along.
Introduction
In recent years, the flourishing and deepening of AI technology, and of deep learning in particular, has been inseparable from the growth of massive data and computing power. The use of Nvidia GPUs, which give deep learning workloads performance improvements of dozens of times, is what truly opened up AI's space of possibilities. Although AI chips have blossomed in recent years, Google's TPU being the most typical example, Nvidia GPUs still dominate in terms of general applicability and ecosystem.
However, Nvidia GPUs are undeniably expensive, so making the most of the GPU hardware is a question every computing platform has to consider. For example, when multiple users train on the same GPU server, allocating resources reasonably becomes very important. Thanks to Nvidia-Docker, the container runtime Nvidia wrote for Docker, GPUs can be used inside Docker containers. Managing and using GPUs at container granularity is much easier than doing so at host level, because configuring the environments in which GPU-based AI tasks run is usually complex: administrators must manage the allocation of GPU cards while users switch between different training environments. Containers can encapsulate those different training environments, which greatly reduces the complexity. On top of that, using Kubernetes to manage Nvidia-Docker makes the assignment of GPU tasks simpler and more reasonable, and this has become the approach of almost all mainstream AI computing platforms.
Kubernetes uses the Device-Plugin mechanism to add support for devices beyond the default resources (CPU, memory, and so on); third parties can support their own devices by writing a corresponding Device-Plugin. At present, Nvidia provides GPU support in exactly this way.
K8s + Nvidia-Device-Plugin already meets the needs of many AI computing scenarios, but it has two notable limitations: each GPU can only be used by one container at a time, and the channel affinity between GPU cards is not taken into account. Exclusive use per container is fine in training mode, but it wastes enormous resources in development and debugging mode, because most of the time users in that mode are not actually running anything on the GPU, yet they monopolize an expensive card. In addition, on a multi-card host the interconnects between GPU cards usually differ: some cards are connected via NVLink, others via PCIe, and the performance of these connections differs greatly. If the channel affinity of cards within the same host is ignored, multi-card computation (for example, an all_reduce operation) incurs excessive communication overhead when data is transferred between cards. Naturally, when several GPU cards are mounted into the same container, we would prefer to mount cards with better channel affinity.
This article introduces the general process by which K8s schedules GPUs and some of our modifications. It covers Nvidia-Docker, which lets containers mount GPUs; the Device-Plugin mechanism, which exposes GPUs as an extended schedulable resource in K8s; and a modification scheme for the problems in the native Nvidia-Device-Plugin.
A brief introduction to Nvidia-Docker
Nvidia-Docker is Nvidia's official container extension that enables containers to use Nvidia GPU devices. According to official statistics, Nvidia-Docker has been downloaded more than two million times, so using it to build AI system environments is clearly a very mainstream practice.
The principles of Nvidia-Docker are not covered in depth here; the official design documents describe them in detail, so only a brief introduction follows. Since 2015, Docker-style containers have been governed by the OCI (Open Container Initiative) set of container standards, which includes a container runtime specification (runtime-spec) and a container image specification (image-spec). The well-known runc is the default implementation of the runtime specification, but any implementation that meets the standard can be registered as a container runtime; containerd wraps runc together with functions such as lifecycle management and runs on the host as a daemon. Nvidia-Docker builds on this extension standard to add GPU support to containers. Its main principle is to put GPU support into an OCI-compatible component, libnvidia-container, which is invoked through the runtime API and supports the GPU by sharing and calling the nvidia-driver on the host. When a container starts, runc calls a hook named nvidia-container-runtime-hook, which checks whether the environment requires GPU support and performs the related environment checks; the container then starts, and processes in the container interact at run time through the interfaces exposed by libnvidia-container, achieving transparent pass-through of the GPU into the container and runtime support.
(photo source: https://devblogs.nvidia.com/gpu-containers-runtime)
It is worth noting that Nvidia-Docker containers use the nvidia-driver on the host, while the upper layers of the software stack, such as CUDA/cuDNN and the AI frameworks, are provided inside the container. In addition, multiple Nvidia-Docker containers can mount the same GPU simply by specifying it through an environment variable; there is no limit on how many containers share a card.
To make the Device-Plugin mechanism described later easier to understand, here is a brief look at how Nvidia-Docker mounts different GPU devices. Usage is very simple: the GPU devices to mount are specified through an environment variable, while Docker's runtime must be set to nvidia, either in Docker's configuration file or explicitly on the command line:
nvidia-docker run -e NVIDIA_VISIBLE_DEVICES=0,1 --runtime=nvidia -it tensorflow/tensorflow-gpu:v1.13 bash
If the relevant configuration has already been done in Docker's configuration file, this can be simplified to:
docker run -e NVIDIA_VISIBLE_DEVICES=0,1 -it tensorflow/tensorflow-gpu:v1.13 bash
Here the environment variable NVIDIA_VISIBLE_DEVICES specifies the logical IDs of the GPU cards to be bound into the container, which makes binding GPUs very easy to use.
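As for "the relevant configuration" in Docker mentioned above, it usually means registering nvidia as a runtime in /etc/docker/daemon.json and optionally making it the default. A typical configuration looks roughly like the following; the exact fields depend on the nvidia-container-runtime version installed, so treat this as an illustrative sketch rather than a canonical file:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Docker then needs to be restarted for the runtime registration to take effect.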
Device-Plugin Mechanism of K8s
K8s supports non-default resource devices through the Device-Plugin mechanism, such as RDMA devices, AMD GPUs and so on, and of course the Nvidia GPUs this article cares most about. By writing a corresponding Device-Plugin, third-party devices can be supported in K8s, giving users almost the same experience as native resources (CPU, memory, and so on).
The Device-Plugin mechanism is essentially an RPC service. K8s defines an RPC interface, and a third-party device vendor implements it so that its device is supported on the K8s side and used in much the same way as default resources. A Device-Plugin runs on the host as a DaemonSet and communicates with the kubelet through a socket file, reporting device information to K8s via the kubelet. Host nodes on which the DaemonSet is deployed then appear in K8s to contain the hardware resources registered by the Device-Plugin. The general principle of Device-Plugin is as follows:
(photo source: https://medium.com/@Alibaba_Cloud)
First, a Device-Plugin needs to register its resource with K8s; registration is done by implementing the following RPC interface:
service Registration {
    rpc Register(RegisterRequest) returns (Empty) {}
}
In this RPC call, the plugin reports its socket name, its Device-Plugin API version and other information, and most importantly the ResourceName. The ResourceName is registered by K8s as the name of the custom device, and its format is vendor-domain/resource; Nvidia GPUs, for example, are defined as nvidia.com/gpu. Users then request this name when applying for resources: in the resource configuration of a Pod, the resource is specified as follows:
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
  - name: demo-container-1
    image: k8s.gcr.io/pause:2.0
    resources:
      limits:
        nvidia.com/gpu: 2
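As an illustration of the registration step described above, here is a minimal Go sketch of the Register call. It assumes the v1beta1 device-plugin API (whose Go import path varies between Kubernetes versions) and uses a hypothetical plugin socket name, shulou-gpu.sock; it sketches the mechanism rather than reproducing any particular plugin's code.

// A minimal sketch of a device plugin registering its resource with the kubelet.
package main

import (
    "context"
    "log"
    "net"
    "time"

    "google.golang.org/grpc"
    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func registerWithKubelet() error {
    // The kubelet listens on a unix socket under /var/lib/kubelet/device-plugins/.
    conn, err := grpc.Dial(pluginapi.KubeletSocket, grpc.WithInsecure(), grpc.WithBlock(),
        grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
            return (&net.Dialer{}).DialContext(ctx, "unix", addr)
        }),
    )
    if err != nil {
        return err
    }
    defer conn.Close()

    client := pluginapi.NewRegistrationClient(conn)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Endpoint is the plugin's own socket file name (in the same directory);
    // ResourceName is what pods later request under resources.limits.
    _, err = client.Register(ctx, &pluginapi.RegisterRequest{
        Version:      pluginapi.Version,
        Endpoint:     "shulou-gpu.sock",
        ResourceName: "nvidia.com/gpu",
    })
    return err
}

func main() {
    if err := registerWithKubelet(); err != nil {
        log.Fatalf("device plugin registration failed: %v", err)
    }
}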
After registration, the Device-Plugin also needs to report the number and status of devices on the host. For example, if a host node has eight GPU cards, the Device-Plugin informs K8s of the quantity and the list of resource IDs. Later, when a Pod requests this resource, K8s selects, according to a certain policy, a set of IDs from the reported list that satisfies the request and returns that ID list to the Device-Plugin, which then maps the IDs to real devices according to its own policy. This procedure is implemented mainly through the following RPC calls:
service DevicePlugin {
    rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
    rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}
In ListAndWatch, the Device-Plugin calls the NVML library to obtain the GPU devices and their status on the host and returns the corresponding device list to K8s. Allocate is called when a container is created: it returns the special configuration the container needs in order to use the resource on that host, such as certain environment variables; this information is sent to the kubelet and passed to the container when it starts. For Nvidia GPUs this is mainly the NVIDIA_VISIBLE_DEVICES environment variable mentioned earlier, so that the corresponding GPU devices are mounted when the container starts.
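To make the two RPCs concrete, here is a stripped-down Go sketch of ListAndWatch and Allocate. It assumes the v1beta1 device-plugin API (whose import path differs between Kubernetes versions) and the go-nvml bindings, reports one device per physical card, and omits health monitoring and the other RPCs of the full interface; it illustrates the mechanism and is not the official plugin's code.

// Package gpuplugin sketches the two core device-plugin RPCs.
package gpuplugin

import (
    "context"
    "fmt"
    "strings"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type gpuPlugin struct{}

// ListAndWatch reports one device entry per physical GPU discovered through NVML.
func (p *gpuPlugin) ListAndWatch(_ *pluginapi.Empty, srv pluginapi.DevicePlugin_ListAndWatchServer) error {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        return fmt.Errorf("nvml init failed: %v", ret)
    }
    defer nvml.Shutdown()

    count, ret := nvml.DeviceGetCount()
    if ret != nvml.SUCCESS {
        return fmt.Errorf("device count failed: %v", ret)
    }
    devices := make([]*pluginapi.Device, 0, count)
    for i := 0; i < count; i++ {
        devices = append(devices, &pluginapi.Device{
            ID:     fmt.Sprintf("%d", i), // logical GPU id, later fed to NVIDIA_VISIBLE_DEVICES
            Health: pluginapi.Healthy,
        })
    }
    // A real plugin keeps streaming health updates; a single send is enough for this sketch.
    return srv.Send(&pluginapi.ListAndWatchResponse{Devices: devices})
}

// Allocate turns the device ids chosen on the K8s side into the environment
// variable that the Nvidia container runtime understands.
func (p *gpuPlugin) Allocate(_ context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    resp := &pluginapi.AllocateResponse{}
    for _, creq := range req.ContainerRequests {
        resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
            Envs: map[string]string{
                "NVIDIA_VISIBLE_DEVICES": strings.Join(creq.DevicesIDs, ","),
            },
        })
    }
    return resp, nil
}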
It is worth noting that there is no GPU virtualization or video-memory partitioning on the container side: the allocation granularity of Nvidia-Device-Plugin is a whole card, with a strict one-to-one mapping. The number of GPUs reported is the actual number of GPUs; the K8s side is responsible for resource allocation, and the Device-Plugin then mounts the GPU IDs returned by K8s, mapping the container to the actual GPU devices.
This approach is reasonable in some scenarios. For compute-intensive tasks, for example, it prevents different processes from fighting over a GPU card and causing out-of-memory errors on the device. But in other scenarios it wastes enormous resources. Some container environments exist only for users to develop and debug algorithms, and such tasks hardly use the GPU most of the time; in those cases, letting containers share a GPU appropriately is very valuable. After all, GPUs are very expensive, and we need a sharing mechanism to maximize resource utilization. In addition, Nvidia-Device-Plugin does not consider GPU affinity, which can lead to poor computing performance when a single container holds multiple cards. The following sections describe our implementation ideas.
How to make different containers share GPU?
As mentioned earlier, the Device-Plugin reports the GPU resource list through the ListAndWatch interface. A natural idea is therefore to fabricate a larger number of virtual GPU IDs and report them to K8s; K8s returns virtual IDs when allocating Pod resources, and the Device-Plugin then maps them back to real GPUs, achieving GPU card reuse much like the mapping from virtual memory addresses to physical memory addresses, where virtual memory can be far larger than physical memory. That is indeed the core idea: construct virtual GPU device IDs and map them to real GPUs only when the container actually starts. But one important question must be solved: how do we keep the GPU load roughly balanced, rather than having many containers bound to some cards and none to others? This can easily happen, because container lifetimes differ, and continuous creation and deletion of containers is very likely to make GPU allocation uneven. Therefore, the mapping from virtual ID to real ID cannot be a simple linear mapping; it must be decided dynamically, based on the real GPU load at the time the task is created.
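A minimal sketch of that idea, assuming the v1beta1 device-plugin API: the plugin simply advertises more (opaque) device IDs than there are physical cards, and defers the choice of real card to Allocate time. The sharing factor of 4 and the vgpu-N naming are illustrative choices, not our production values.

// Advertising virtual GPU devices so that one physical card can serve several containers.
package main

import (
    "fmt"

    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

const virtualPerPhysical = 4 // how many containers may share one card (illustrative)

// buildVirtualDevices reports opaque virtual ids instead of real GPU indices; which
// physical card a virtual id lands on is decided later, during Allocate, based on load.
func buildVirtualDevices(physicalCount int) []*pluginapi.Device {
    devices := make([]*pluginapi.Device, 0, physicalCount*virtualPerPhysical)
    for v := 0; v < physicalCount*virtualPerPhysical; v++ {
        devices = append(devices, &pluginapi.Device{
            ID:     fmt.Sprintf("vgpu-%d", v),
            Health: pluginapi.Healthy,
        })
    }
    return devices
}

func main() {
    for _, d := range buildVirtualDevices(2) {
        fmt.Println("advertised virtual device:", d.ID)
    }
}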
To solve the load-balance problem just described, we evaluated two options: call NVML to obtain the real-time status of each GPU card and allocate the least-loaded card; or store the GPU load information of each node in an external database and have the Device-Plugin query it during Allocate to check the load. We chose the second option (using Redis), because NVML can only report processes and resource consumption: if a container is bound to a GPU card but no process in the container is using the GPU, NVML cannot reveal the real container-to-card binding. At this stage we also prefer to define load as the number of containers bound to a card rather than as actual GPU utilization.
Having decided to use an external Redis database to store the real-time state of each node, we maintain a separate map for each node in Redis, recording the number of containers and the task names assigned to each GPU ID. Each time the Device-Plugin makes an Allocate decision, it queries Redis for the container load of every GPU ID on the node, selects the least-loaded cards to bind, and increments the container count of the corresponding GPU IDs by one. When the resource is released, the Device-Plugin itself receives no notification, so we use the Informer mechanism of the K8s scheduling service: a custom Informer captures the node and task name of the released Pod and decrements the count of the corresponding GPU ID in Redis by one. GPU allocation information is maintained and monitored in this way. Only the idea of the solution is presented here; the details are omitted.
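The following Go sketch illustrates the load-aware binding step. It assumes a Redis hash per node whose fields are physical GPU IDs and whose values are the number of containers currently bound to each card; the key layout and the go-redis client are illustrative choices rather than our exact implementation.

// Choosing the least-loaded physical GPU for a new binding, backed by a per-node Redis hash.
package main

import (
    "context"
    "fmt"
    "strconv"

    "github.com/go-redis/redis/v8"
)

// pickLeastLoadedGPU reads the per-node hash, chooses the card with the fewest bound
// containers, and records the new binding by incrementing that card's counter.
func pickLeastLoadedGPU(ctx context.Context, rdb *redis.Client, nodeName string) (string, error) {
    key := "gpu-load:" + nodeName // e.g. gpu-load:worker-01 -> {"0": 2, "1": 0, ...}
    loads, err := rdb.HGetAll(ctx, key).Result()
    if err != nil {
        return "", err
    }
    best, bestLoad := "", int(^uint(0)>>1)
    for gpuID, v := range loads {
        n, _ := strconv.Atoi(v)
        if n < bestLoad {
            best, bestLoad = gpuID, n
        }
    }
    if best == "" {
        return "", fmt.Errorf("no GPU entries under %s", key)
    }
    // The informer that watches Pod deletions would perform the mirror-image HIncrBy(..., -1).
    if err := rdb.HIncrBy(ctx, key, best, 1).Err(); err != nil {
        return "", err
    }
    return best, nil
}

func main() {
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    gpu, err := pickLeastLoadedGPU(context.Background(), rdb, "worker-01")
    if err != nil {
        fmt.Println("selection failed:", err)
        return
    }
    fmt.Println("bind container to physical GPU", gpu)
}

A production version would make the read-and-increment step atomic, for example with a Lua script or a transaction, to avoid races between concurrent Allocate calls.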
How to bind a high affinity card to a multi-card task?
Channel affinity between GPUs is very important in multi-card training tasks, because different connection media have a great influence on the data-transmission speed between cards. Below is a typical inter-card channel topology for an 8-card GPU host: some cards are connected via NVLink (NV1 and so on), while others are connected via PCIe (PIX).
Different GPU interconnects lead to completely different data-transfer performance, often differing by many times: NVLink can reach tens of GB/s, while PCIe typically offers only around 10 GB/s of throughput. The following figure shows the connection topology and channel transfer performance of the Nvidia Tesla P100 series:
(photo source: https://www.nvidia.com)
As mentioned earlier, Nvidia-Device-Plugin does not consider channel affinity. When a single container requests two cards, K8s scheduling may well bind two cards that are connected only through PCIe even though cards on the same NVLink channel are available, which is clearly unreasonable. High affinity and balanced container load are sometimes conflicting goals: pursuing absolute affinity may sacrifice load balance across cards. The strategy we adopt tries to strike a balance between the two.
If the real GPU workload is ignored, simply binding high-affinity cards together is easy: similar to the GPU-sharing idea, the Device-Plugin can call NVML to obtain the GPU connection topology and thus the affinity between cards, and then, at Allocate time, assign GPU cards that share a high-affinity channel together. However, if some of the cards on that high-affinity channel are already heavily loaded with container tasks, load may matter more at that moment. A more principled approach would score the candidate combinations with a scoring function according to some policy and pick the highest-scoring one, but we adopt a simple and direct strategy: first select the GPU ID with the lowest load, then, among the cards on the same high-affinity channel as that ID, select the least-loaded cards to bind, as the sketch below illustrates. The details are not expanded here; the main work is calling the NVML library to obtain the host's channel topology, and the rest is straightforward logic.
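The sketch below illustrates that strategy in plain Go: anchor on the least-loaded card, then fill the request from the cards with the highest link affinity to it, breaking ties by load. The load table and affinity matrix are illustrative inputs; in the plugin they would be populated from the per-node load records and from NVML topology queries (nvmlDeviceGetTopologyCommonAncestor) respectively.

// Affinity-aware card selection: least-loaded anchor first, then its best-connected peers.
package main

import (
    "fmt"
    "sort"
)

// pickGPUs returns `want` GPU indices: the least-loaded card, then the cards with the
// highest link affinity to it, breaking ties by lower load.
func pickGPUs(load []int, affinity [][]int, want int) []int {
    // 1. Anchor on the card with the fewest bound containers.
    anchor := 0
    for i := range load {
        if load[i] < load[anchor] {
            anchor = i
        }
    }
    // 2. Rank the remaining cards by affinity to the anchor (NVLink > PCIe), then by load.
    rest := make([]int, 0, len(load)-1)
    for i := range load {
        if i != anchor {
            rest = append(rest, i)
        }
    }
    sort.Slice(rest, func(a, b int) bool {
        ca, cb := rest[a], rest[b]
        if affinity[anchor][ca] != affinity[anchor][cb] {
            return affinity[anchor][ca] > affinity[anchor][cb]
        }
        return load[ca] < load[cb]
    })
    return append([]int{anchor}, rest[:want-1]...)
}

func main() {
    load := []int{2, 0, 1, 3} // containers currently bound to GPUs 0..3
    affinity := [][]int{      // higher means a faster link (e.g. 2 = NVLink, 1 = PCIe)
        {0, 2, 1, 1},
        {2, 0, 1, 1},
        {1, 1, 0, 2},
        {1, 1, 2, 0},
    }
    fmt.Println("bind GPUs:", pickGPUs(load, affinity, 2)) // -> [1 0]: GPU 1 is idle, GPU 0 is its NVLink peer
}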
Summary
This article briefly introduced Docker's support for Nvidia GPUs and the Device-Plugin mechanism of K8s, and presented solutions to some scenario-specific shortcomings of the existing Nvidia-Device-Plugin, mainly optimizing GPU sharing between tasks and channel affinity between GPU cards. Whether these modifications are worthwhile depends on the scenario: some scenarios urgently need them, while others are a poor fit. The changes add external dependencies and complicate the GPU binding strategy of Nvidia-Device-Plugin, so I strongly recommend making them only when necessary. The interactive testing and verification scenario of our platform is precisely the scenario that motivated this work.
At this point, this study of how to achieve GPU scheduling and sharing in Kubernetes is over. I hope it has resolved your doubts; combining theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow this site for more practical articles.