
How to understand the performance of TensorFlow on Kubernetes


This article walks through how to understand the performance of TensorFlow on Kubernetes. Many people run into the problems described below in practice, so the goal here is to show how to analyze and deal with them. I hope you find it useful.

Current performance problem description

Increasing the number of workers improves training performance within a certain range, but beyond that point, adding more workers yields little further improvement.

Likewise, increasing the number of ps improves performance within a certain range, but beyond that point, adding more ps yields little further improvement.

Possible reasons:

The performance is strongly tied to how ps and worker tasks are distributed across servers:

The current scheduling strategy mainly balances pods across servers according to each server's cpu and memory usage, so that cpu and memory utilization is roughly equal across the cluster. As a result, where the ps and worker pods land has a certain degree of randomness.

Would training performance be better if, at scheduling time, every server that hosts workers also hosted a corresponding ps? If so, how large would the improvement be?

Do the workers running in Kubernetes hit an IO bottleneck when reading training data from the HDFS cluster? The network or the HDFS configuration itself may be the limiting factor; this needs to be checked further through monitoring of the HDFS cluster.

Below are the test scenarios and test cases designed for the first possible cause, the distribution of ps and worker:

Scenario 1: every server that hosts workers also hosts a corresponding ps.

Test case ID | Servers | Workers | PS | Deployment
1            | 1       | 10      | 1  | the server deploys 10 workers and 1 ps
2            | 5       | 50      | 5  | each server deploys 10 workers and 1 ps
3            | 10      | 100     | 10 | each server deploys 10 workers and 1 ps
4            | 20      | 200     | 20 | each server deploys 10 workers and 1 ps

TensorFlow task scheduling design diagram for scenario 1

Scheduling implementation

TensorFlow object template scene1.jinja for scenario 1

# scene1.jinja - object template
{%- set name = "##NAME##" -%}
{%- set worker_replicas = ##WN## -%}
{%- set ps_replicas = ##PN## -%}
{%- set script = "##SCRIPT##" -%}
{%- set case = "##CASE##" -%}
{%- set port = 2222 -%}
{%- set log_host_dir = "/var/log/tensorflow" -%}
{%- set log_container_dir = "/var/log" -%}
{%- set image = "registry.vivo.xyz:4443/bigdata_release/tensorflow1.3.0" -%}
{%- set replicas = {"worker": worker_replicas, "ps": ps_replicas} -%}

{%- macro worker_hosts() -%}
  {%- for i in range(worker_replicas) -%}
    {%- if not loop.first -%},{%- endif -%}
    {{ name }}-worker-{{ i }}:{{ port }}
  {%- endfor -%}
{%- endmacro -%}

{%- macro ps_hosts() -%}
  {%- for i in range(ps_replicas) -%}
    {%- if not loop.first -%},{%- endif -%}
    {{ name }}-ps-{{ i }}:{{ port }}
  {%- endfor -%}
{%- endmacro -%}

{%- for i in range(begin_index, end_index) -%}
{%- if task_type == "worker" %}
---
kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  clusterIP: None
  selector:
    name: {{ name }}
    job: {{ task_type }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
    targetPort: 2222
---
kind: Job
apiVersion: batch/v1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ task_type }}
        task: "{{ i }}"
    spec:
      imagePullSecrets:
      - name: harborsecret
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "CASE"
                operator: In
                values:
                - "{{ case }}"
              - key: "INDEX"
                operator: In
                values:
                - "{{ i // 10 }}"
              - key: "SCENCE"
                operator: In
                values:
                - "1"
      containers:
      - name: {{ name }}-{{ task_type }}-{{ i }}
        image: {{ image }}
        resources:
          requests:
            memory: "4Gi"
            cpu: "300m"
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c", "export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); wget -r -nH -np --cut-dirs=1 -R 'index.html*,*gif' {{ script }}; cd ./{{ name }}; sh ./run.sh {{ ps_hosts() }} {{ worker_hosts() }} {{ task_type }} {{ i }} {{ ps_replicas }} {{ worker_replicas }}"]
      restartPolicy: OnFailure
{%- endif -%}
{%- if task_type == "ps" -%}
---
kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  clusterIP: None
  selector:
    name: {{ name }}
    job: {{ task_type }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
    targetPort: 2222
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ task_type }}
        task: "{{ i }}"
    spec:
      imagePullSecrets:
      - name: harborsecret
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "CASE"
                operator: In
                values:
                - "{{ case }}"
              - key: "INDEX"
                operator: In
                values:
                - "{{ i }}"
              - key: "SCENCE"
                operator: In
                values:
                - "1"
      containers:
      - name: {{ name }}-{{ task_type }}-{{ i }}
        image: {{ image }}
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c", "export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); wget -r -nH -np --cut-dirs=1 -R 'index.html*,*gif' {{ script }}; cd ./{{ name }}; sh ./run.sh {{ ps_hosts() }} {{ worker_hosts() }} {{ task_type }} {{ i }} {{ ps_replicas }} {{ worker_replicas }}"]
      restartPolicy: Always
{%- endif -%}
{%- endfor -%}
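The tooling that fills in the ##NAME##, ##WN##, ##PN##, ##SCRIPT## and ##CASE## placeholders and submits the rendered objects is not shown in this article. As a rough idea only, a minimal Python driver might look like the sketch below; it assumes plain string substitution for the ##...## placeholders followed by a jinja2 render for the remaining variables (task_type, begin_index, end_index). The job name and script URL are made up for illustration.

# Hypothetical rendering driver for scene1.jinja; not the authors' actual tooling.
from jinja2 import Template

def render(template_path, name, worker_num, ps_num, script, case,
           task_type, begin_index, end_index):
    raw = open(template_path).read()
    # Substitute the ##...## placeholders before handing the text to Jinja.
    for key, value in {"##NAME##": name, "##WN##": str(worker_num),
                       "##PN##": str(ps_num), "##SCRIPT##": script,
                       "##CASE##": case}.items():
        raw = raw.replace(key, value)
    # Remaining variables are resolved at Jinja render time.
    return Template(raw).render(task_type=task_type,
                                begin_index=begin_index, end_index=end_index)

if __name__ == "__main__":
    # Generate manifests for the 10 workers of test case 1 (1 server, 10 workers, 1 ps),
    # then pipe the output to `kubectl create -f -`.
    print(render("scene1.jinja", name="tf-case1", worker_num=10, ps_num=1,
                 script="http://fileserver/tf_cnn_benchmarks.tar.gz",  # illustrative URL
                 case="1", task_type="worker", begin_index=0, end_index=10))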

Label Nodes

Select the corresponding nodes and apply the corresponding labels.
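The kubectl command used for this is shown below. If you would rather apply the labels from Python, the official kubernetes client can patch node labels as well; the following is only a sketch, and the node name and label values are illustrative.

# Label a node from Python (sketch using the official `kubernetes` client).
from kubernetes import client, config

def label_node(node_name, labels):
    config.load_kube_config()                # or config.load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    body = {"metadata": {"labels": labels}}  # strategic-merge patch of the node's labels
    v1.patch_node(node_name, body)

# e.g. mark the first server used by scenario 1, test case 2 (values are illustrative):
label_node("node-01", {"SCENCE": "1", "CASE": "2", "INDEX": "0"})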

kubectl label node $node_name SCENCE=1 CASE=? INDEX=?

Test result

Test screenshot of use case 2:

Scenario 2: all ps and all workers are physically isolated on separate servers.

Test case ID | Servers | Workers | PS | Deployment
1            | 2       | 10      | 1  | one server deploys 10 workers, another deploys 1 ps
2            | 10      | 50      | 5  | 5 servers each deploy 10 workers, 5 servers each deploy 1 ps
3            | 20      | 100     | 10 | 10 servers each deploy 10 workers, 10 servers each deploy 1 ps
4            | 40      | 200     | 20 | 20 servers each deploy 10 workers, 20 servers each deploy 1 ps

TensorFlow task scheduling design diagram for scenario 2

Scheduling implementation

TensorFlow object template scene2.jinja for scenario 2

# scene2.jinja - object template
{%- set name = "##NAME##" -%}
{%- set worker_replicas = ##WN## -%}
{%- set ps_replicas = ##PN## -%}
{%- set script = "##SCRIPT##" -%}
{%- set case = "##CASE##" -%}
{%- set port = 2222 -%}
{%- set log_host_dir = "/var/log/tensorflow" -%}
{%- set log_container_dir = "/var/log" -%}
{%- set image = "registry.vivo.xyz:4443/bigdata_release/tensorflow1.3.0" -%}
{%- set replicas = {"worker": worker_replicas, "ps": ps_replicas} -%}

{%- macro worker_hosts() -%}
  {%- for i in range(worker_replicas) -%}
    {%- if not loop.first -%},{%- endif -%}
    {{ name }}-worker-{{ i }}:{{ port }}
  {%- endfor -%}
{%- endmacro -%}

{%- macro ps_hosts() -%}
  {%- for i in range(ps_replicas) -%}
    {%- if not loop.first -%},{%- endif -%}
    {{ name }}-ps-{{ i }}:{{ port }}
  {%- endfor -%}
{%- endmacro -%}

{%- for i in range(begin_index, end_index) -%}
{%- if task_type == "worker" %}
---
kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  clusterIP: None
  selector:
    name: {{ name }}
    job: {{ task_type }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
    targetPort: 2222
---
kind: Job
apiVersion: batch/v1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ task_type }}
        task: "{{ i }}"
    spec:
      imagePullSecrets:
      - name: harborsecret
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "CASE"
                operator: In
                values:
                - "{{ case }}"
              - key: "INDEX"
                operator: In
                values:
                - "{{ i // 10 }}"
              - key: "SCENCE"
                operator: In
                values:
                - "2"
              - key: "TYPE"
                operator: In
                values:
                - "worker"
      containers:
      - name: {{ name }}-{{ task_type }}-{{ i }}
        image: {{ image }}
        resources:
          requests:
            memory: "4Gi"
            cpu: "300m"
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c", "export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); wget -r -nH -np --cut-dirs=1 -R 'index.html*,*gif' {{ script }}; cd ./{{ name }}; sh ./run.sh {{ ps_hosts() }} {{ worker_hosts() }} {{ task_type }} {{ i }} {{ ps_replicas }} {{ worker_replicas }}"]
      restartPolicy: OnFailure
{%- endif -%}
{%- if task_type == "ps" -%}
---
kind: Service
apiVersion: v1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  clusterIP: None
  selector:
    name: {{ name }}
    job: {{ task_type }}
    task: "{{ i }}"
  ports:
  - port: {{ port }}
    targetPort: 2222
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: {{ name }}-{{ task_type }}-{{ i }}
  namespace: {{ name }}
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: {{ name }}
        job: {{ task_type }}
        task: "{{ i }}"
    spec:
      imagePullSecrets:
      - name: harborsecret
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "CASE"
                operator: In
                values:
                - "{{ case }}"
              - key: "INDEX"
                operator: In
                values:
                - "{{ i }}"
              - key: "SCENCE"
                operator: In
                values:
                - "2"
              - key: "TYPE"
                operator: In
                values:
                - "ps"
      containers:
      - name: {{ name }}-{{ task_type }}-{{ i }}
        image: {{ image }}
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
        ports:
        - containerPort: 2222
        command: ["/bin/sh", "-c", "export CLASSPATH=.:/usr/lib/jvm/java-1.8.0/lib/tools.jar:$(/usr/lib/hadoop-2.6.1/bin/hadoop classpath --glob); wget -r -nH -np --cut-dirs=1 -R 'index.html*,*gif' {{ script }}; cd ./{{ name }}; sh ./run.sh {{ ps_hosts() }} {{ worker_hosts() }} {{ task_type }} {{ i }} {{ ps_replicas }} {{ worker_replicas }}"]
      restartPolicy: Always
{%- endif -%}
{%- endfor -%}

Label Nodes

Select the corresponding nodes and apply the corresponding labels.

kubectl label node $node_name SCENCE=2 CASE=? INDEX=? TYPE=?

Test result

Test screenshot of use case 2:

Test conclusion and thinking

Comparing the monitoring data of test case 2 (5 ps, 50 workers) across the two scenarios, the following was observed:

In both scenarios, although 5 ps were created, only one ps carries a high load; the cpu usage of the other ps stays below 10%, and in some cases is close to zero.

With the same number of workers and ps, the total cpu and memory consumed by the whole TensorFlow cluster differs very little between the two scenarios.

Test conclusion

In distributed TensorFlow, which ps each worker uses as its parameter server has nothing to do with how we force the placement of ps and worker pods; it is controlled by distributed TensorFlow itself, specifically by the placement strategy used by tf.train.replica_device_setter().
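A minimal TensorFlow 1.x sketch of this point follows (the cluster addresses are illustrative, not the ones used in these tests): variables are assigned to ps tasks by tf.train.replica_device_setter(), round-robin by default, regardless of which physical server each ps pod landed on.

# Sketch: variable placement onto ps tasks is decided by replica_device_setter,
# not by Kubernetes scheduling. TensorFlow 1.x API; addresses are illustrative.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["tf-ps-0:2222", "tf-ps-1:2222"],
    "worker": ["tf-worker-0:2222", "tf-worker-1:2222"],
})

# Default ps_strategy is round-robin over ps tasks, one variable at a time.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.get_variable("w", shape=[1024, 1024])  # placed on /job:ps/task:0
    b = tf.get_variable("b", shape=[1024])        # placed on /job:ps/task:1
    # A model dominated by a single huge variable would therefore load only one ps.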

Problem thinking

Why is only one ps doing work in this training job? Does the model have only one big parameter (variable)? If so, the default round-robin placement policy would put it on a single ps, which would explain the behavior. This needs to be confirmed with the algorithm team.

If the big parameter is split into many smaller ones, then with a round-robin, load-balancing, or partitioned placement strategy, multiple ps can share the parameter updates, and training performance should improve noticeably. A sketch of these options follows.
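As a rough sketch of the options mentioned above (TensorFlow 1.x APIs; cluster addresses are illustrative and this is not the actual training code used in these tests): a greedy load-balancing placement strategy instead of the default round-robin, and partitioning one large variable across several ps.

# Sketch: spreading parameters over several ps tasks (TensorFlow 1.x).
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["tf-ps-0:2222", "tf-ps-1:2222", "tf-ps-2:2222"],
    "worker": ["tf-worker-0:2222"],
})

# Option 1: greedy load-balancing placement by variable byte size.
strategy = tf.contrib.training.GreedyLoadBalancingStrategy(
    num_tasks=3, load_fn=tf.contrib.training.byte_size_load_fn)
setter = tf.train.replica_device_setter(cluster=cluster, ps_strategy=strategy)

with tf.device(setter):
    # Option 2: partition one big variable into shards so several ps serve it.
    embedding = tf.get_variable(
        "embedding", shape=[1000000, 128],
        partitioner=tf.fixed_size_partitioner(num_shards=3))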

This exercise was not wasted: at the very least it showed that we do not yet understand the internal workings of distributed TensorFlow well enough, and a deeper dive into the source code is clearly necessary.

This is the end of "how to understand the performance of TensorFlow on Kubernetes". Thank you for reading; follow the site for more practical articles on the topic.
