
How to use TensorFlow


This article mainly explains how to use TensorFlow. The content is simple, clear, and easy to learn and understand. Please follow the editor's train of thought to study how to use TensorFlow.

Distributed TensorFlow

In April 2016, TensorFlow released version 0.8, announcing support for distributed computing, a feature we call Distributed TensorFlow.

This is a very important feature, because in the world of AI the size of training data is usually eye-popping. For example, "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", a paper published by the Google Brain lab this year, mentions that the MoE layer model in the picture below can reach a scale of 68 billion parameters. For such a complex model, training on a single machine is unacceptably time-consuming. Through Distributed TensorFlow, many servers can be used to build a TensorFlow cluster and improve training efficiency.
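
As a minimal illustration (not from the original article), the TF 1.x bootstrap for one cluster member looks roughly like this; the host:port pairs are placeholders:

import tensorflow as tf

# A hedged sketch of the TF 1.x distributed bootstrap; hostnames are placeholders.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222", "worker2:2222"],
})

# Every process (ps or worker) starts a server for its own task.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# PS tasks then block and serve variables, while workers go on to build the graph:
# if job_name == "ps":
#     server.join()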

For more information about Distributed TensorFlow, please refer to the official content at www.tensorflow.org/deploy/distributed. Here is the structure diagram of Distributed TensorFlow:

Why TensorFlow on Kubernetes

Although Distributed TensorFlow provides distributed capabilities and lets you use server clusters to speed up training, it still has many shortcomings, such as the lack of resource isolation and leftover PS processes, which are exactly what Kubernetes is good at solving. The following figure summarizes the reasons to run TensorFlow on Kubernetes:

For us, the biggest pain point for users in the early stage was that the HDFS read performance experienced by the algorithm team was not as good as expected. After searching for information online and running our own simple comparative tests, we found that GlusterFS was probably the most suitable distributed storage for us. Therefore, in our TensorFlow on Kubernetes project, we use GlusterFS to store training data, and workers read the training data from GlusterFS for computation.
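
Because GlusterFS is simply mounted into the container, a worker reads training data through ordinary file APIs. A minimal sketch, assuming TFRecord files under the mounted /data directory (the file layout is hypothetical):

import tensorflow as tf

# /data is the GlusterFS-backed mount (the --data_dir flag in this setup).
filenames = tf.gfile.Glob("/data/*.tfrecord")  # hypothetical dataset layout

# Standard TF 1.x input pipeline over the mounted files.
filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)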

Integrated Architecture

Description:

Supports both Between-Graph and In-Graph replication scenarios (see the worker-side Between-Graph sketch after this list)

PS tasks are deployed through a Kubernetes Deployment and worker tasks through a Kubernetes Job; service discovery is provided by Kubernetes Services and KubeDNS

Each TensorFlow cluster dynamically provisions PVs through a StorageClass; a StorageClass that interfaces with the Gluster cluster through Heketi is created in advance

The GlusterFS cluster uses Heketi to expose a REST API for interacting with Kubernetes. For the deployment of Heketi, please refer to the official documentation.

Each TensorFlow cluster eventually creates two PVs: one to store training data (mounted at /data in the container, corresponding to the TensorFlow --data_dir flag), and one to store training logs (mounted at /log, corresponding to the TensorFlow --log_path flag)

A namespace is created in Kubernetes for each user.

Each user gets a Jupyter Notebook Deployment and Service; the Service is exposed outside the cluster through NodePort.

There is a special node we call the User Node. Through a Taint, this node is guaranteed not to run ordinary Pods, but it still exposes in-cluster services through kube-proxy. For example, the Jupyter Notebook service above is only exposed on this node.

The User Node stores the Python algorithm files written by users, and these files can be viewed and downloaded over HTTP. In the Between-Graph scenario, these algorithm files are downloaded with curl after the container starts.

A TensorBoard Deployment and Service are also created for each user; the Service is exposed outside the cluster through NodePort (again, only on the User Node), and the TensorBoard Pod mounts the log PV so that it can read the TensorFlow Graph.
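
To make the Between-Graph replication mentioned above concrete, here is a minimal worker-side sketch under TF 1.3. The cluster host names and the tiny model are placeholders that mirror the Jinja template shown later; treat this as an illustration, not the production code:

import tensorflow as tf

# Between-Graph replication: each worker builds its own copy of the graph,
# while replica_device_setter pins variables onto the PS tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["imagenet-ps-0:2222", "imagenet-ps-1:2222"],
    "worker": ["imagenet-worker-0:2222", "imagenet-worker-1:2222"],
})
task_index = 0  # would come from --task_index in the real job
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
    data = tf.random_normal([8, 1])            # stand-in for real input
    w = tf.get_variable("w", shape=[1, 1])     # placed on a PS task
    loss = tf.reduce_mean(tf.square(tf.matmul(data, w)))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.AdagradOptimizer(0.01).minimize(loss, global_step=global_step)

# The chief worker (task 0) also checkpoints to the log PV (/log).
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0),
                                       checkpoint_dir="/log") as sess:
    for _ in range(100):
        sess.run(train_op)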

Deployment Architecture

The whole system involves the following core components:

TensorFlow: 1.3.0

Kubernetes: 1.7.4

Docker: 1.12.6

Harbor: 1.1.2

Contiv netplugin: 0.1-12-23-2016.19-44-42.UTC

Keepalived: 1.3.5

Haproxy: 1.7.8

Etcd2: 2.3.7

Etcd3: 3.2.1

Glusterfs: 3.10.5

Network scheme: contiv netplugin + ovs + vlan.

Log scheme: fluentd + Kafka + ES + Kibana.

Monitoring scheme: cadvisor + prometheus + Grafana.

The details of the CaaS layer are not discussed here; it is a fairly standard, familiar solution.

Demo

For this demo, I changed the Jupyter Notebook Service to NodePort to expose it. Just enter the correct token when logging in:

This is an In-Graph cluster. Click master_client.ipynb to see the specific training algorithm:

Click execute to see the output below:

This is just a simple demo. In practice, the Kubernetes YAML for each ps, worker, and PVC is generated automatically, and domain names are used for service discovery. Otherwise, if you use IPs, you may need a Pod PostStart hook to report each task's IP, which is troublesome.

Thinking

Q: The leftover PS process problem is often discussed in the community (issue #4173). How do we combine this with Kubernetes to recycle PS processes?

A: In the TaaS module of our DevOps platform, a coroutine is started for each TensorFlow cluster to check whether a counter has reached the number of workers (workers run as Jobs; when a Job is watched going to Succeeded, the counter is incremented by 1). When the counter equals the number of workers, training is over; after waiting 30 seconds, the Kubernetes apiserver API is called to delete the ps Deployment/Service, so PS processes are recycled automatically.
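
The article does not include the watcher code; below is a hedged sketch of the idea using the official Kubernetes Python client. The label scheme, resource names, and the 30-second grace period are assumptions taken from the description above and the Jinja template shown later:

import time
from kubernetes import client, config, watch

def reclaim_ps(namespace, name, num_workers, num_ps):
    """Delete the ps Deployment/Service once every worker Job has succeeded."""
    config.load_incluster_config()      # load_kube_config() outside the cluster
    batch = client.BatchV1Api()

    done = set()
    w = watch.Watch()
    # Watch this cluster's worker Jobs (label scheme matches the template below).
    for event in w.stream(batch.list_namespaced_job, namespace=namespace,
                          label_selector="name=%s,job=worker" % name):
        job = event["object"]
        if job.status.succeeded:
            done.add(job.metadata.name)  # counter: one per succeeded worker Job
        if len(done) == num_workers:
            w.stop()

    time.sleep(30)                       # the 30-second grace period

    apps = client.ExtensionsV1beta1Api() # Deployments were extensions/v1beta1 on k8s 1.7
    core = client.CoreV1Api()
    for i in range(num_ps):
        ps_name = "%s-ps-%d" % (name, i)
        apps.delete_namespaced_deployment(
            ps_name, namespace,
            body=client.V1DeleteOptions(propagation_policy="Foreground"))
        core.delete_namespaced_service(ps_name, namespace)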

Q: Workers are stateless and PS tasks are stateful, but checkpoints cannot be taken on the PS side. How are training save and restore done?

A: Workers are stateless, but tf.train.Saver provides the ability to checkpoint from a worker. Roughly, the principle is to fetch the parameters one by one from the PS tasks and persist them.
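
A minimal sketch of that mechanism in TF 1.x: session.run pulls each parameter's value from wherever it lives (the PS tasks, in the distributed case) and the worker writes the checkpoint. The /log path is the log PV from the architecture above:

import tensorflow as tf

w = tf.get_variable("w", shape=[2, 2])  # in real training this lives on a PS task
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Fetch the parameter values through the session and persist them to /log.
    saver.save(sess, "/log/model.ckpt", global_step=0)
    # Restore later from the latest checkpoint on the shared log volume.
    saver.restore(sess, tf.train.latest_checkpoint("/log"))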

Q: How can users specify just a few parameters, such as the number of ps and worker tasks, and have the Kubernetes YAML generated automatically?

A: Since we have not yet built a front-end portal for TaaS, the YAML is generated automatically from a Jinja template. Users only need to specify a few parameters to generate the Kubernetes YAML needed for ps and worker.

For example, here is one of my Jinja templates, tfcluster_template.yaml.jinja:

{%- set name = "imagenet" -%}
{%- set worker_replicas = 3 -%}
{%- set ps_replicas = 2 -%}
{%- set script = "http://xxx.xx.xx.xxx:80/imagenet/imagenet.py" -%}
{%- set image = "tensorflow/tensorflow:1.3.0" -%}
{%- set data_dir = "/data" -%}
{%- set log_dir = "/log" -%}
{%- set port = 2222 -%}
{%- set replicas = {"worker": worker_replicas, "ps": ps_replicas} -%}

{%- macro worker_hosts() -%}
  {%- for i in range(worker_replicas) -%}
    {%- if not loop.first -%},{%- endif -%}
    {{ name }}-worker-{{ i }}:{{ port }}
  {%- endfor -%}
{%- endmacro -%}

{%- macro ps_hosts() -%}
  {%- for i in range(ps_replicas) -%}
    {%- if not loop.first -%},{%- endif -%}
    {{ name }}-ps-{{ i }}:{{ port }}
  {%- endfor -%}
{%- endmacro -%}

{%- for job in ["worker", "ps"] -%}
{%- for i in range(replicas[job]) -%}
kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  selector:
    name: {{ name }}
    job: {{ job }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
    targetPort: 2222
{% if job == "worker" %}
---
kind: Job
apiVersion: batch/v1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ job }}
        task: "{{ i }}"
    spec:
      containers:
      - name: {{ name }}-{{ job }}-{{ i }}
        image: {{ image }}
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c"]
        args: ["curl {{ script }} -o /opt/{{ name }}.py;
                python /opt/{{ name }}.py
                --ps_hosts={{ ps_hosts() }}
                --worker_hosts={{ worker_hosts() }}
                --job_name={{ job }}
                --task_index={{ i }}
                --log_path={{ log_dir }}
                --data_dir={{ data_dir }}"]
        volumeMounts:
        - name: data
          mountPath: {{ data_dir }}
        - name: log
          mountPath: {{ log_dir }}
      restartPolicy: Never
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: {{ name }}-data-pvc
      - name: log
        persistentVolumeClaim:
          claimName: {{ name }}-log-pvc
{% endif %}
{% if job == "ps" %}
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: {{ name }}-{{ job }}-{{ i }}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ job }}
        task: "{{ i }}"
    spec:
      containers:
      - name: {{ name }}-{{ job }}-{{ i }}
        image: {{ image }}
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c"]
        args: ["curl {{ script }} -o /opt/{{ name }}.py;
                python /opt/{{ name }}.py
                --ps_hosts={{ ps_hosts() }}
                --worker_hosts={{ worker_hosts() }}
                --job_name={{ job }}
                --task_index={{ i }}
                --log_path={{ log_dir }}"]
        volumeMounts:
        - name: log
          mountPath: {{ log_dir }}
      restartPolicy: Never
      volumes:
      - name: log
        persistentVolumeClaim:
          claimName: {{ name }}-log-pvc
{% endif %}
---
{% endfor %}
{%- endfor -%}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ name }}-log-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ name }}-data-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

Then execute python render_template.py tfcluster_template.yaml.jinja | kubectl apply -f - to create and start the corresponding Between-Graph TensorFlow cluster.
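
render_template.py itself is not shown in the article; a minimal version (all parameters here have defaults set inside the template, so render() needs no arguments) could look like this:

import sys
import jinja2

# Render the Jinja template named on the command line and print the YAML to stdout.
if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        print(jinja2.Template(f.read()).render())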

Thank you for reading. The above is the content of "How to use TensorFlow". After studying this article, I believe you have a deeper understanding of how to use TensorFlow; the specific usage still needs to be verified in practice. The editor will push more articles on related knowledge points for you. Welcome to follow!
