2025-04-05 Update | SLTechnology News & Howtos > Internet Technology
Shulou (Shulou.com) 06/01 Report
This article shows how to run TensorFlow deep learning workloads on Kubernetes with Bitfusion. The walkthrough is concise and practical; we hope the detailed steps below give you something you can apply.
Background
With the rapid development of AI technology, more and more enterprises are applying AI to their own business. Today, cloud AI computing power is mainly provided by three types of AI accelerators: GPUs, FPGAs, and AI ASIC chips. These accelerators deliver very high performance, but they are expensive and lack heterogeneous acceleration management and scheduling. Because most enterprises cannot build an efficient accelerator resource pool, they end up using these costly accelerators exclusively, which results in low resource utilization and high cost.
Take GPUs as an example: with the innovative Bitfusion GPU virtualization technology, users can transparently share and use AI accelerators on any server in the data center without modifying their tasks. This not only improves resource utilization, but also greatly simplifies the deployment of AI applications and enables a data-center-level AI accelerator resource pool.
Bitfusion helps solve these problems by providing remote GPU pools. It makes the GPU a first-class citizen that can be abstracted, partitioned, automated, and shared like traditional compute resources. Kubernetes, meanwhile, has become the standard platform for deploying and managing machine learning workloads.
This article explains how to use the newly developed Bitfusion Device Plugin to quickly consume the GPU resource pool provided by Bitfusion and run popular TensorFlow deep learning workloads on Kubernetes.
Concept understanding
Let's start with a brief introduction to two Kubernetes mechanisms:
Extended Resource: a mechanism for advertising custom resources. A node reports the resource name and total quantity to the API server; the scheduler then adds and subtracts from the available amount as pods that use the resource are created and deleted, and checks at scheduling time whether any node satisfies the resource request. Note that Extended Resources can only be counted in integer increments: you can allocate 1 GPU, but not 0.5 GPU. The feature became stable in Kubernetes 1.8, where it replaced (and renamed) the earlier Opaque Integer Resources.
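As a sketch of the mechanism, a cluster operator can advertise an Extended Resource by patching a node's status through the API server. The resource name example.com/dongle, the quantity, and the node name k8s-node-1 below are illustrative placeholders, following the pattern from the upstream Kubernetes documentation:

```shell
# Open a local proxy to the API server, then advertise 4 units of a
# hypothetical "example.com/dongle" resource on node k8s-node-1.
# Note: "/" in the resource name is encoded as "~1" in the JSON-Patch path.
kubectl proxy &
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]' \
  http://localhost:8001/api/v1/nodes/k8s-node-1/status
```

After this, pods can request the resource in their resources.limits, and the scheduler will only place them on nodes with remaining capacity.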
Device Plugin: a general device plug-in mechanism with a standard device API. Device vendors only need to implement the corresponding API; they do not have to modify the Kubelet core code to support GPUs, FPGAs, high-performance NICs, InfiniBand, and other devices. This capability was Alpha in Kubernetes 1.8 and 1.9 and entered Beta in 1.10. In the Alpha versions it must be enabled through a feature gate, i.e. by configuring --feature-gates=DevicePlugins=true on the kubelet.
As shown in the figure, the device plugin layer controls the resources assigned to a pod, while the Bitfusion Client inside the pod and the Bitfusion Server interact at the CUDA driver level. The Bitfusion Client software stack contains a CUDA driver agent that intercepts all CUDA calls on the client side and forwards the data and service requests over the network to the Bitfusion Server for processing.
Installation and use steps of Bitfusion Device Plugin
The following example uses Kubernetes v1.17.5 and Ubuntu 18.04 as the installation environment to illustrate how to build a TensorFlow environment for Benchmarks testing with the Bitfusion Device Plugin. Currently, the project and container images are hosted on an internal R&D server; if you cannot access it, please contact us through your VMware account representative.
First, let's download the Bitfusion Device Plugin project
Currently, the project code and the bitfusion-base image are not publicly available; they can be obtained by contacting us or your customer representative. Once you have them, you can continue with the following steps.
Next, we build the Device Plugin's Docker image. Because we also want to measure the overall performance of the platform by running some Benchmarks, we build the Docker image from the Dockerfile we provide:
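A sketch of the build command is below. The Dockerfile path and build context are assumptions based on the project layout; the image tag matches the one referenced later in device_plugin.yml, so adjust both to your checkout:

```shell
# Build the device plugin image from the project's Dockerfile.
# Path and context are assumed from the project layout; the tag matches
# the image used in device_plugin.yml below.
docker build \
  -f bitfusion-device-plugin/docker/device-plugin/Dockerfile \
  -t bitfusion_device_plugin/bitfusion-device:v0.1 \
  bitfusion-device-plugin/
```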
Then configure the yaml file for Bitfusion Device Plugin
Bitfusion Device Plugin is a device extension that conforms to the Kubernetes device plugin interface specification. It seamlessly adds bitfusion.io/gpu resources to a Kubernetes cluster, so that containers can use Bitfusion when applications are deployed.
Update the image field in the device_plugin.yml file as follows; the Device Plugin will be installed as a DaemonSet on the Kubernetes nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bitfusion-cli-device-plugin
  namespace: kube-system
  labels:
    tier: node
spec:
  selector:
    matchLabels:
      tier: node
  template:
    metadata:
      labels:
        tier: node
    spec:
      hostNetwork: true
      containers:
      - name: device-plugin-ctr
        image: bitfusion_device_plugin/bitfusion-device:v0.1
        securityContext:
          privileged: true
        command: ["./start.sh"]
        env:
        - name: REG_EXP_SFC
          valueFrom:
            configMapKeyRef:
              name: configmap
              key: reg-exp
        - name: SOCKET_NAME
          valueFrom:
            configMapKeyRef:
              name: configmap
              key: socket-name
        - name: RESOURCE_NAME
          valueFrom:
            configMapKeyRef:
              name: configmap
              key: resource-name
        volumeMounts:
        - mountPath: "/root/.bitfusion"
          name: bitfusion-cli
        - mountPath: /gopath/run
          name: docker
        - mountPath: /gopath/proc
          name: proc
        - mountPath: "/root/.ssh/id_rsa"
          name: ssh-key
        - mountPath: "/var/lib/kubelet"
          name: kubelet-socket
        - mountPath: "/etc/kubernetes/pki"
          name: pki
      volumes:
      - name: bitfusion-cli
        hostPath:
          path: "/root/.bitfusion"
      - name: docker
        hostPath:
          path: /var/run
      - name: proc
        hostPath:
          path: /proc
      - name: ssh-key
        hostPath:
          path: "/root/.ssh/id_rsa"
      - name: kubelet-socket
        hostPath:
          path: "/var/lib/kubelet"
      - name: pki
        hostPath:
          path: "/etc/kubernetes/pki"
Then deploy using the following command
kubectl apply -f bitfusion-device-plugin/device_plugin.yml
After the command completes, wait a short while. If the deployment succeeds, the Bitfusion Device Plugin pods show status Running, and their logs print the current state of the device plugin.
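To verify, you can list the DaemonSet pods in kube-system, tail a plugin log, and confirm that the bitfusion.io/gpu resource now appears in node capacity. The pod and node names below are placeholders to fill in from your cluster:

```shell
# List the device plugin pods (one per node, via the DaemonSet).
kubectl get pods -n kube-system -o wide | grep bitfusion

# Follow the log of one plugin pod; substitute the real pod name.
kubectl logs -n kube-system <bitfusion-device-plugin-pod-name>

# Each node should now report the extended resource under
# Capacity/Allocatable; substitute a real node name.
kubectl describe node <node-name> | grep bitfusion.io/gpu
```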
Build TensorFlow image for Benchmarks testing
We provide a bitfusion-base image and a pre-built bitfusion-tfl-cli image, which can be pulled and used directly, or you can build your own:
docker build -f bitfusion-device-plugin/docker/bitfusion-tfl-cli/Dockerfile -t bitfusion-tfl-cli:v0.1 .

The Dockerfile contains:

FROM bitfusion-base:v0.1
RUN conda install tensorflow-gpu==1.13.1
Add a tag in pod.yaml and modify the parameters by referring to the following:
Resource limit: set the number of bitfusion.io/gpu resources the application is allowed to use
Configure pod bitfusion-device-plugin/example/pod/pod.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: bfs-pod-configmap
---
apiVersion: v1
kind: Pod
metadata:
  name: bfs-demo
  labels:
    purpose: device-demo
spec:
  hostNetwork: true
  containers:
  - name: demo
    image: bitfusion-tfl-cli:v0.1
    imagePullPolicy: Always
    workingDir: /root
    securityContext:
      privileged: true
    command: ["/bin/bash", "-c"]
    args: ["python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --local_parameter_device=gpu --batch_size=32 --model=inception3"]
    volumeMounts:
    - mountPath: "/root/.bitfusion"
      name: config-volume
    resources:
      limits:
        bitfusion.io/gpu: 1
  volumes:
  - name: config-volume
    hostPath:
      path: "/root/.bitfusion"
Run the benchmark of TensorFlow on Kubernetes for testing
TensorFlow has an official benchmark suite, tensorflow/benchmarks, which contains models such as resnet50, resnet152, inception3, vgg16, googlenet, and alexnet. You only need to provide a few parameters to start a test.
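For illustration, assuming the benchmarks repository is checked out at /benchmarks (as in the pod image above), a run with a different model would only change the --model flag:

```shell
# Example invocation of the official tf_cnn_benchmarks script with the
# resnet50 model; the /benchmarks path matches the pod image used above,
# adjust it to your own checkout.
python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --local_parameter_device=gpu \
  --batch_size=32 \
  --model=resnet50
```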
Here we choose the inception3 model for the benchmark test, to verify that the Bitfusion client in the pod successfully connects to the Bitfusion server.
kubectl apply -f bitfusion-device-plugin/example/pod/pod.yaml
After the command completes, wait a short while; you should then see the bfs-demo Pod in the default namespace.
If the deployment is successful, the benchmark output appears in the Pod's log.
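The Pod status and log can be checked with standard kubectl commands; the pod name bfs-demo follows the pod.yaml above:

```shell
# Confirm the demo pod reached Running/Completed state.
kubectl get pod bfs-demo

# Stream the benchmark output from the pod's log.
kubectl logs -f bfs-demo
```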
The above is how to use Bitfusion on Kubernetes for TensorFlow deep learning. We hope you have picked up some useful knowledge or skills; to learn more, you are welcome to follow the industry information channel.