
What is the Device Plugin design of Kubernetes?


This article introduces the Device Plugin design of Kubernetes in detail. The content is laid out step by step, and hopefully it resolves any doubts you have about how device plugins work.

Interpreting the Device Plugin Design of Kubernetes

While recently investigating how Kubernetes schedules and runs GPU workloads, I found that the traditional alpha.kubernetes.io/nvidia-gpu resource is slated to be retired in version 1.11, and the GPU-specific scheduling and deployment code will be removed entirely from the main codebase.

In its place, two built-in Kubernetes mechanisms, Extended Resource and Device Plugin, combined with a Device Plugin implemented by the device vendor, handle the whole flow: scheduling the device from the cluster level down to a worker node, and then binding the device to the actual container.

The first question worth asking is why the alpha.kubernetes.io/nvidia-gpu GPU support, which lived in the main codebase for a year, is being removed entirely.

Out-of-tree is the right direction for Kubernetes, and the earlier refactoring of Cloud Provider was similar work. Rather than trying to be a Swiss Army knife, Kubernetes focuses on its core, general-purpose capabilities and leaves support for GPU, InfiniBand, FPGA, and public cloud features entirely to the community and domain experts. This reduces the complexity of the core software and the risk to its stability, and letting out-of-tree components iterate separately also makes feature upgrades more flexible.

Let's start with a brief introduction to the two Kubernetes mechanisms:

Extended Resource: a way to define custom resources. The resource name and total quantity are reported to the API server, and the Scheduler adds and subtracts the available amount as pods that consume the resource are created and deleted, so that at scheduling time it can decide whether any node satisfies the resource request. For now, Extended Resources can only be incremented and decremented in whole integers; you can allocate 1 GPU, for example, but not 0.5 GPU. The feature is already stable in 1.8, since it essentially just replaces Opaque Integer Resources and renames things. But with the word "integer" dropped from the name, one can imagine that fractional values such as 0.5 might become possible in the future.
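As a concrete illustration (not from the original article), here is a minimal Go sketch of advertising a custom extended resource by patching a node's status with client-go. The resource name example.com/dongle, the node name node-1, and the quantity 4 are placeholders, and a recent client-go release (with context-aware method signatures) is assumed.

package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumes ~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Strategic-merge patch that adds 4 units of the hypothetical extended
	// resource "example.com/dongle" to the node's capacity; the Scheduler
	// then tracks pod requests against this integer quantity.
	patch := []byte(`{"status":{"capacity":{"example.com/dongle":"4"}}}`)
	node, err := clientset.CoreV1().Nodes().PatchStatus(context.TODO(), "node-1", patch)
	if err != nil {
		panic(err)
	}
	fmt.Println("node capacity:", node.Status.Capacity)
}

A pod would then request the resource under resources.limits, just like the nvidia.com/gpu example later in this article.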

Device Plugin: a general device plugin mechanism with a standard device API. With it, device vendors only need to implement the API; they no longer have to modify the Kubelet core code to support GPUs, FPGAs, high-performance NICs, InfiniBand, and other devices. The capability is Alpha in Kubernetes 1.8 and 1.9, and becomes Beta in 1.10.

The feature is still relatively new and has to be enabled through a feature gate, that is, by configuring --feature-gates=DevicePlugins=true.

Device Plugin Design

API Design:

A device plugin is in fact just a simple gRPC server. It needs to implement two methods, ListAndWatch and Allocate, and listen on a Unix socket in the /var/lib/kubelet/device-plugins/ directory, such as /var/lib/kubelet/device-plugins/nvidia.sock.

service DevicePlugin {
    // returns a stream of []Device
    rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}
    rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}

Where:

ListAndWatch: Kubelet will call this API for device discovery and status updates (such as when the device becomes unhealthy)

Allocate: when Kubelet creates a container that uses the device, it calls this API so the plugin can perform any device-specific setup and tell Kubelet which devices, volumes, and environment variables the container needs.
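To make the API concrete, here is a minimal, illustrative Go sketch of a device plugin server. This is not the article's code and not any vendor's implementation; it assumes the v1beta1 device plugin API package under k8s.io/kubelet and google.golang.org/grpc, and the socket name example.sock, the fake device IDs, and the device paths are all made up.

package main

import (
	"context"
	"net"
	"os"
	"strings"

	"google.golang.org/grpc"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

const socketPath = pluginapi.DevicePluginPath + "example.sock"

type dummyPlugin struct{}

// ListAndWatch sends the current device list to Kubelet and keeps the stream
// open; a real plugin would resend the list whenever device health changes.
func (p *dummyPlugin) ListAndWatch(_ *pluginapi.Empty, s pluginapi.DevicePlugin_ListAndWatchServer) error {
	devs := []*pluginapi.Device{
		{ID: "dongle-0", Health: pluginapi.Healthy},
		{ID: "dongle-1", Health: pluginapi.Healthy},
	}
	if err := s.Send(&pluginapi.ListAndWatchResponse{Devices: devs}); err != nil {
		return err
	}
	select {} // block forever for this sketch
}

// Allocate tells Kubelet how to wire the assigned devices into the container:
// device nodes, mounts, and environment variables.
func (p *dummyPlugin) Allocate(_ context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for _, creq := range req.ContainerRequests {
		cresp := &pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{"DONGLE_IDS": strings.Join(creq.DevicesIDs, ",")},
		}
		for _, id := range creq.DevicesIDs {
			cresp.Devices = append(cresp.Devices, &pluginapi.DeviceSpec{
				HostPath:      "/dev/" + id,
				ContainerPath: "/dev/" + id,
				Permissions:   "rw",
			})
		}
		resp.ContainerResponses = append(resp.ContainerResponses, cresp)
	}
	return resp, nil
}

// The remaining interface methods can be no-ops for this sketch.
func (p *dummyPlugin) GetDevicePluginOptions(context.Context, *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
	return &pluginapi.DevicePluginOptions{}, nil
}
func (p *dummyPlugin) GetPreferredAllocation(context.Context, *pluginapi.PreferredAllocationRequest) (*pluginapi.PreferredAllocationResponse, error) {
	return &pluginapi.PreferredAllocationResponse{}, nil
}
func (p *dummyPlugin) PreStartContainer(context.Context, *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
	return &pluginapi.PreStartContainerResponse{}, nil
}

func main() {
	os.Remove(socketPath) // clean up a stale socket from a previous run
	lis, err := net.Listen("unix", socketPath)
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	pluginapi.RegisterDevicePluginServer(srv, &dummyPlugin{})
	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}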

Plug-in Lifecycle Management:

When the plugin starts, it registers itself with Kubelet over gRPC via /var/lib/kubelet/device-plugins/kubelet.sock, providing the Unix socket it listens on, its API version number, and the resource name of its devices (such as nvidia.com/gpu). Kubelet then exposes these devices in the node status and reports them to the API server as Extended Resources, and the Scheduler schedules against that information.
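For illustration, the registration call might look roughly like the sketch below (again an assumption, not the article's code), reusing the v1beta1 API package; example.sock and example.com/dongle match the placeholder server above.

package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func registerWithKubelet() error {
	// Dial Kubelet's registration socket; it is a local Unix socket, so no TLS.
	conn, err := grpc.Dial("unix://"+pluginapi.KubeletSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Tell Kubelet which socket the plugin listens on, which API version it
	// speaks, and which extended resource name its devices appear under.
	_, err = pluginapi.NewRegistrationClient(conn).Register(ctx, &pluginapi.RegisterRequest{
		Version:      pluginapi.Version,
		Endpoint:     "example.sock", // relative to /var/lib/kubelet/device-plugins/
		ResourceName: "example.com/dongle",
	})
	return err
}

func main() {
	if err := registerWithKubelet(); err != nil {
		panic(err)
	}
}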

Once the plugin is registered, Kubelet opens a long-lived ListAndWatch stream to it, and when the plugin detects that a device has become unhealthy it actively notifies Kubelet. If the device is idle, Kubelet removes it from the allocatable list; if the device is already in use by a pod, Kubelet kills that pod.

The plugin can also keep watching Kubelet's socket to monitor Kubelet's state. If Kubelet restarts, the plugin restarts as well and registers itself with Kubelet again.
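The article does not say how the plugin notices a Kubelet restart; one common approach (an assumption on my part, used by several open-source plugins) is to watch the device-plugins directory with fsnotify and treat re-creation of kubelet.sock as the cue to re-register, roughly like this:

package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// watchKubeletSocket blocks, calling onRestart whenever kubelet.sock is
// re-created, which indicates that Kubelet has restarted.
func watchKubeletSocket(onRestart func()) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	// Watch the whole device-plugins directory; kubelet.sock reappears
	// inside it when Kubelet comes back up.
	if err := watcher.Add(pluginapi.DevicePluginPath); err != nil {
		return err
	}
	for {
		select {
		case ev := <-watcher.Events:
			if ev.Name == pluginapi.KubeletSocket && ev.Op&fsnotify.Create == fsnotify.Create {
				log.Println("kubelet.sock re-created; re-registering device plugin")
				onRestart()
			}
		case err := <-watcher.Errors:
			return err
		}
	}
}

func main() {
	err := watchKubeletSocket(func() {
		log.Println("would re-register with Kubelet here")
	})
	if err != nil {
		log.Fatal(err)
	}
}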

Deployment mode

Both DaemonSet and non-containerized deployment are supported. DaemonSet deployment is currently the officially recommended approach.

Example implementations

NVIDIA's official GPU plugin

NVIDIA provides a GPU device plugin, NVIDIA/k8s-device-plugin, built on the Device Plugin interface. From a user's point of view it is simpler than the traditional alpha.kubernetes.io/nvidia-gpu: you no longer need to mount volumes to make the CUDA libraries available to the container.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  selector:            # required for apps/v1 Deployments
    matchLabels:
      app: tf-notebook
  template:            # define the pods specifications
    metadata:
      labels:
        app: tf-notebook
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            nvidia.com/gpu: 1

Google GCP GPU plugin

GCP also provides a GPU device plugin implementation, but it only supports running on the Google Container Engine platform; you can learn more from container-engine-accelerators.

Solarflare NIC plugin

Solarflare, a NIC vendor, has also implemented its own device plugin, sfc-device-plugin; you can try out the user experience through its demo.

That wraps up this look at the Device Plugin design of Kubernetes. To really master the material, you still need to practice with it yourself; hands-on use is the best way to understand how the pieces fit together.
