This article explains the architecture and practice of running distributed TensorFlow on Kubernetes.
Distributed TensorFlow
TensorFlow is an open source software library that uses dataflow graphs for numerical computation. The nodes in the graph represent mathematical operations, while the edges represent the multi-dimensional arrays (tensors) passed between them. This flexible architecture lets you deploy computation to one or more CPUs or GPUs on a desktop, server, or mobile device with a single API. I won't dwell on the basic concepts of TensorFlow here.
Stand-alone TensorFlow
In stand-alone TensorFlow training, the client submits the graph to the local worker through a Session and specifies which CPU/GPU devices the worker should place each operation on.
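As a concrete illustration, here is a minimal stand-alone sketch in the TensorFlow 1.x style of this article; the device strings and the allow_soft_placement setting are illustrative, not taken from the original text.

import tensorflow as tf

# Nodes are operations, edges carry tensors between them.
with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
with tf.device("/gpu:0"):
    c = tf.matmul(a, b)

# The client submits the graph to the local worker through a Session;
# allow_soft_placement falls back to CPU if no GPU is available.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(c))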
Distributed TensorFlow
In April 2016, TensorFlow released version 0.8 and announced support for distributed computing, which we call Distributed TensorFlow. This is a very important feature, because in AI the training data and model parameters are usually very large. For example, the paper OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER published by Google Brain this year mentions a model with 68 billion parameters. Training such a model on a single machine would be unacceptably slow. With Distributed TensorFlow, a large number of servers can be used to build a distributed TensorFlow cluster, improving training efficiency and reducing training time.
Through the TensorFlow replication mechanism, users can distribute subgraphs to different servers for distributed computing. TensorFlow has two replication mechanisms: In-graph and Between-graph.
To put it simply, In-graph replication defines the work of all the tasks in the TensorFlow cluster through a single client session.
Between-graph replication, by contrast, means that each worker has its own client that defines its own work.
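A minimal between-graph sketch (TensorFlow 1.x; the cluster addresses, model, and task_index below are illustrative): every worker process runs the same client code and builds its own graph, while tf.train.replica_device_setter places the variables on the ps tasks.

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"]})

task_index = 0  # in practice this comes from a flag passed to each worker
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
    x = tf.placeholder(tf.float32, [None, 784])
    w = tf.Variable(tf.zeros([784, 10]))  # placed on a ps task
    b = tf.Variable(tf.zeros([10]))       # placed on a ps task
    logits = tf.matmul(x, w) + b          # computed on this worker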
An abstract distributed TensorFlow framework involves a few core concepts:
Cluster
A TensorFlow Cluster consists of one or more jobs, and each job consists of one or more tasks. A Cluster is defined by tf.train.ClusterSpec. For example, the ClusterSpec defining a TensorFlow Cluster with 3 workers and 2 ps servers is as follows:
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",  # hostname can also be an IP
        "worker1.example.com:2222",
        "worker2.example.com:2222"
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222"
    ]
})
Client
A Client builds a TensorFlow Graph and creates a tensorflow::Session to communicate with the cluster. A Client can interact with multiple TensorFlow Servers, and a Server can serve multiple Clients.
Job
A Job consists of a list of tasks and comes in two types: ps and worker. ps stands for parameter server, which stores and updates variables; workers can be considered stateless and perform the computation. Among the workers, a chief worker (usually worker 0) is typically chosen to save training state checkpoints, so that if a worker fails, training can be restored from the latest checkpoint.
Task
Each Task corresponds to one TensorFlow Server and runs in a separate process. A Task belongs to a Job, and its position in that Job's task list is marked by an index. Each TensorFlow Server implements a Master service and a Worker service. The Master service communicates over gRPC with the worker services in the cluster; the Worker service executes subgraphs on local devices.
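A sketch of how each Task starts its Server (TensorFlow 1.x). The flag names mirror the arguments that run.sh receives in the manifests below, but they are assumptions for illustration, not the actual script.

import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument("--ps_hosts", type=str)      # e.g. "ps0:2222,ps1:2222"
parser.add_argument("--worker_hosts", type=str)  # e.g. "worker0:2222,worker1:2222"
parser.add_argument("--job_name", type=str)      # "ps" or "worker"
parser.add_argument("--task_index", type=int)
args = parser.parse_args()

cluster = tf.train.ClusterSpec({"ps": args.ps_hosts.split(","),
                                "worker": args.worker_hosts.split(",")})

# One tf.train.Server per Task; it exposes the Master and Worker services.
server = tf.train.Server(cluster, job_name=args.job_name, task_index=args.task_index)

if args.job_name == "ps":
    server.join()  # ps tasks only serve variables, so they block here
else:
    print(server.target)  # workers build their graph and train against this target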
For more information about Distributed TensorFlow, please refer to the official documentation: www.tensorflow.org/deploy/distributed
Defects of distributed TensorFlow
Distributed TensorFlow can use the resource pool formed by all the servers in a data center, so that a large number of ps and worker tasks can be spread across servers for parameter storage and training. This is undoubtedly key to whether TensorFlow can land in the enterprise. However, it is not enough on its own; it has several inherent shortcomings:
TensorFlow Task resources cannot be isolated during training, so tasks may interfere with each other through resource contention.
It lacks scheduling capability; users must manually configure and manage the compute resources of each task.
When the cluster is large, managing training tasks is very troublesome; tracking and managing the status of each task requires a lot of development work at a higher layer.
To view the training logs of a Task, you have to find the corresponding server and ssh into it, which is very inconvenient.
TensorFlow natively supports only a few backend file systems: standard POSIX file systems (such as NFS), HDFS, GCS, and memory-mapped files. In most enterprises the data lives on a big data platform, mainly HDFS, but HDFS read performance is not great (a minimal HDFS read sketch follows this list).
Creating a large-scale TensorFlow cluster is not easy.
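As a side note on the HDFS point above, TensorFlow's file APIs can read hdfs:// paths when libhdfs and the Hadoop CLASSPATH are configured (which is what the CLASSPATH export in the manifests below prepares). A minimal sketch, with a hypothetical path:

import tensorflow as tf

# Requires libhdfs on the node and CLASSPATH set via `hadoop classpath --glob`.
files = tf.gfile.Glob("hdfs://namenode:8020/data/train/part-*")  # hypothetical path
filename_queue = tf.train.string_input_producer(files, num_epochs=1)
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)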
Architecture and principle of TensorFlow on Kubernetes
These shortcomings of distributed TensorFlow are precisely the strengths of Kubernetes:
It provides ResourceQuota, LimitRanger, and other resource management mechanisms, enabling good resource isolation between tasks.
It supports configuring and scheduling compute resources for tasks.
Training tasks run as containers, and Kubernetes provides a full set of container lifecycle (PLEG) interfaces, so managing task status is very convenient.
It integrates easily with logging solutions such as EFK/ELK, so users can conveniently view task logs.
It supports distributed storage (GlusterFS) with better read performance; we have not integrated GlusterFS yet, as it is planned but we lack the manpower.
Large-scale TensorFlow clusters can be created easily and quickly through declarative files.
TensorFlow on Kubernetes architecture
TensorFlow on Kubernetes principle
In our TensorFlow on Kubernetes scenario, the following Kubernetes objects are mainly used:
Kubernetes Job
We use a Kubernetes Job to deploy TensorFlow workers: when a worker's training finishes and it exits normally, the container is not restarted. Note that the restartPolicy of the Pod Template in a Job can only be Never or OnFailure, not Always. Here we set restartPolicy to OnFailure, so a worker is automatically restarted if it exits abnormally. Note, however, that after a worker restarts, training must resume from a checkpoint; otherwise, if the worker restarts from step 0, days of training may be wasted. If your algorithm is written with the TensorFlow high-level API, this is handled by default; if you use the low-level core API, be sure to implement it yourself (a minimal checkpoint-restore sketch follows the manifest below).
kind: Job
apiVersion: batch/v1
metadata:
  name: {{name}}-{{task_type}}-{{I}}
  namespace: {{name}}
spec:
  template:
    metadata:
      labels:
        name: {{name}}
        job: {{task_type}}
        task: "{{I}}"
    spec:
      imagePullSecrets:
      - name: harborsecret
      containers:
      - name: {{name}}-{{task_type}}-{{I}}
        image: {{image}}
        resources:
          requests:
            memory: "4Gi"
            cpu: "500m"
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c",
          "export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob);
           wget -r -nH -np --cut-dirs=1 -R 'index.html*,*gif' {{script}};
           cd ./{{name}};
           sh ./run.sh {{ps_hosts()}} {{worker_hosts()}} {{task_type}} {{I}} {{ps_replicas}} {{worker_replicas}}"]
      restartPolicy: OnFailure
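The checkpoint-restore sketch mentioned above, using the TensorFlow 1.x high-level API. The model, checkpoint directory, and step limit are illustrative (in our setup the directory would be an HDFS path), and in a real worker the master and is_chief values come from server.target and the task index.

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
loss = tf.reduce_mean(tf.square(tf.get_variable("w", [10])))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, global_step=global_step)

# MonitoredTrainingSession restores from checkpoint_dir automatically on restart,
# so a worker restarted by the Job resumes training instead of starting again
# from step 0. Only the chief writes checkpoints.
with tf.train.MonitoredTrainingSession(
        master="",                          # server.target in a real cluster
        is_chief=True,                      # (task_index == 0) in a real cluster
        checkpoint_dir="/tmp/tf-ckpt-demo", # would be an hdfs:// path in practice
        save_checkpoint_secs=60,
        hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)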
Kubernetes Deployment
TensorFlow PS is deployed using a Kubernetes Deployment. Why not use a Job as for the workers? That would also work, but since the PS processes do not exit automatically when all the workers finish training (they hang forever), deploying them with a Job makes little sense.
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: {{name}}-{{task_type}}-{{I}}
  namespace: {{name}}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: {{name}}
        job: {{task_type}}
        task: "{{I}}"
    spec:
      imagePullSecrets:
      - name: harborsecret
      containers:
      - name: {{name}}-{{task_type}}-{{I}}
        image: {{image}}
        resources:
          requests:
            memory: "4Gi"
            cpu: "500m"
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c",
          "export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob);
           wget -r -nH -np --cut-dirs=1 -R 'index.html*,*gif' {{script}};
           cd ./{{name}};
           sh ./run.sh {{ps_hosts()}} {{worker_hosts()}} {{task_type}} {{I}} {{ps_replicas}} {{worker_replicas}}"]
      restartPolicy: Always
For the problem of hanging TensorFlow PS processes, see https://github.com/tensorflow/tensorflow/issues/4713. We solved it by developing a module that watches the worker state of every TensorFlow cluster: when all the worker Jobs are Completed, the corresponding PS Deployment is deleted automatically, which kills the PS processes and releases their resources.
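A hedged sketch of such a watcher using the official Kubernetes Python client. This is not our production module; the name-based matching follows the {{name}}-{{task_type}}-{{I}} convention of the manifests here, and the API group used for Deployments depends on your cluster and client version.

from kubernetes import client, config

def reap_ps_if_workers_done(namespace):
    """Delete the ps Deployments once every worker Job in the namespace has completed."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    batch = client.BatchV1Api()
    apps = client.AppsV1Api()       # older clusters: client.ExtensionsV1beta1Api()

    worker_jobs = [j for j in batch.list_namespaced_job(namespace).items
                   if "-worker-" in j.metadata.name]
    all_done = worker_jobs and all(
        (j.status.succeeded or 0) >= (j.spec.completions or 1) for j in worker_jobs)

    if all_done:
        for d in apps.list_namespaced_deployment(namespace).items:
            if "-ps-" in d.metadata.name:
                apps.delete_namespaced_deployment(
                    d.metadata.name, namespace, body=client.V1DeleteOptions())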
Kubernetes Headless Service
A Headless Service is usually used for internal communication between instances of an application cluster deployed on Kubernetes. We use it the same way here: for each TensorFlow Job and Deployment object we create a corresponding Headless Service as the communication endpoint for the workers and ps servers.
kind: Service
apiVersion: v1
metadata:
  name: {{name}}-{{task_type}}-{{I}}
  namespace: {{name}}
spec:
  clusterIP: None
  selector:
    name: {{name}}
    job: {{task_type}}
    task: "{{I}}"
  ports:
  - port: {{port}}
    targetPort: 2222
The advantage of a Headless Service is that in KubeDNS the Service name resolves directly to the Pod IP; there is no Service VIP layer, so it does not depend on kube-proxy to create iptables rules. Skipping the kube-proxy iptables layer brings a performance improvement.
In the TensorFlow scenario this is not to be underestimated, because each TensorFlow Task creates a Service, and tens of thousands of Services are normal. With normal Services and iptables rules, adding or deleting a single iptables rule can take hours or even days at that scale, and the cluster would have collapsed long before then. For performance test data on kube-proxy's iptables mode, see the material shared by the Huawei PaaS team.
KubeDNS Autoscaler
As mentioned earlier, each TensorFlow Task creates a Service with a corresponding resolution record in KubeDNS. However, when the number of Services becomes very large, we found that some workers' domain name resolution fails with high probability, sometimes taking more than a dozen attempts to succeed. This affects session establishment for every task in the TensorFlow cluster and may cause the cluster to fail to start.
To solve this problem, we introduced the Kubernetes incubator project kubernetes-incubator/cluster-proportional-autoscaler to dynamically scale KubeDNS. For details, interested readers can check my blog post: https://my.oschina.net/jxcdwangtao/blog/1581879.
TensorFlow on Kubernetes practice
Based on the above scheme, we developed a TaaS platform that implements the basic functions: algorithm management, training cluster creation and management, model management, model serving (TensorFlow Serving), one-click creation of TensorBoard services, task resource monitoring, cluster resource monitoring, scheduled training management, online viewing of task logs, batch packaging and download, and so on. For this part you can refer to my earlier article shared on DockOne: http://dockone.io/article/3036.
This is just the beginning; I am working on the following features:
Support preemptive task scheduling based on training priority: when creating a TensorFlow training project on TaaS, you can specify the project priority as Production, Iteration, or PTR; the default is Iteration. The priority from high to low is Production > Iteration > PTR. When cluster resources are insufficient, preemptive scheduling is performed according to task priority.
Provide a Yarn-style view of resource allocation, so users can clearly see the resource usage of all their training projects.
Mixed deployment of training and prediction workloads to improve data center resource utilization.
...
Experience and pitfalls
Throughout the process we ran into many pitfalls, involving both TensorFlow and Kubernetes, but the most frequent problems came from the CNI network plugin contiv netplugin; almost every incident was caused by this plugin. Kubernetes itself caused the fewest problems, and its stability was better than I expected.
The contiv netplugin is stable in a DevOps environment, but in large-scale, high-concurrency AI scenarios problems emerge one after another: it produces a large number of stale IPs and OpenFlow flow tables, directly turning Nodes NotReady. I won't go into details here, because as far as I know very few companies use this plugin; contact me if you want to know more.
In our solution, each TensorFlow training cluster corresponds to a Kubernetes Namespace. At the beginning of the project we did not clean up stale Namespaces in time; later, when there were tens of thousands of Namespaces in the cluster, the performance of the related Kubernetes APIs became very poor, which made the TaaS user experience very poor as well.
TensorFlow's gRPC performance is poor. With training clusters of thousands of workers, errors like "grpc_chttp2_stream request on server; last grpc_chttp2_stream id=xxx, new grpc_chttp2_stream id=xxx" occur with noticeable probability. This is a performance problem in TensorFlow's underlying gRPC: in low gRPC versions, HandlerGRPC is still single-threaded. You can try to get a newer gRPC by upgrading TensorFlow, or upgrade the gRPC version separately when compiling TensorFlow; but if you upgrade TensorFlow, your algorithms may also need API adaptation. For now we reduce gRPC pressure by increasing the compute load of each worker so that fewer workers are needed.
There are also other issues, such as TensorFlow's own OOM mechanism, and so on.