This article introduces elastic distributed training based on AllReduce: what it is, why it matters, and how to run it on Kubernetes.
Background
First, let's briefly review model training in deep learning. Training here refers to the process of iteratively optimizing the parameters of a neural network with data by computing gradient descent, finally producing a network model. The iterative computation is usually accelerated with GPUs, which can deliver a 10-100x speedup over CPUs. Distributed model training was first proposed by Mu Li at OSDI '14. In traditional model training, the iterative computation can only use the hardware resources on the host where the current process runs, but the scalability of a single machine is always limited, and single-machine training becomes a bottleneck when the dataset is very large or the model is particularly complex. Distributed training accelerates the process with hardware resources on different hosts, which greatly improves training speed.
Horovod is a distributed training framework based on AllReduce. With its support for mainstream deep learning frameworks such as TensorFlow and PyTorch, as well as its communication optimizations, Horovod is widely used in data-parallel training. In Horovod, every training process is an equal participant: each process is responsible both for distributing gradients and for computing them. For example, with three workers, the gradients on each worker are divided evenly into three parts, and the computation and synchronization of the cluster-wide gradients can be completed in four rounds of communication.
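As a minimal sketch of the "equal participant" idea, the following snippet averages a per-rank tensor across all workers with a single Horovod allreduce call. It assumes Horovod with PyTorch support is installed and the script is launched with horovodrun (for example, horovodrun -np 3 python allreduce_demo.py).

import torch
import horovod.torch as hvd

hvd.init()  # each launched process becomes one equal participant (a rank)

# Every rank holds its own local "gradient"; here it is just a dummy tensor.
local_grad = torch.full((4,), float(hvd.rank()))

# AllReduce: every rank both contributes and reduces chunks of the tensor,
# and all ranks end up with the same averaged result.
avg_grad = hvd.allreduce(local_grad, name="demo_grad")

print(f"rank {hvd.rank()}/{hvd.size()} -> {avg_grad.tolist()}")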
Distributed training based on AllReduce has gradually become the mainstream approach to distributed training because of its easy-to-understand programming logic and the large speedups it brings. However, the current model still has several problems:
First, the cost of AI training is significant. With the help of Kubernetes, running large-scale distributed training is no longer complex, but the high cost of training keeps the technology from being broadly accessible.
Second, compared with single-machine training, distributed training is more prone to task failure. Multiple processes participate in the training at the same time, and a problem in any one of them fails the whole training task. This problem is especially serious when a training task lasts for days or even weeks.
At the same time, because business pressure fluctuates periodically in some co-located clusters, GPU utilization is usually below 40% during idle periods, while cluster resources become tight when tasks are submitted intensively. The imbalance of resource utilization over time is very prominent.
Elastic training
To solve the above problems and better bring the cloud native dividend to distributed training, the industry has put forward the concept of elastic training.
In traditional deep learning distributed training tasks, the instance configuration of a task is usually fixed. This greatly limits the flexibility and training speed of tasks and is unfriendly to the resource utilization of the whole cluster. Elastic training means that a training task can dynamically adjust the number of instances participating in the computation at run time. This makes training more flexible and allows it to scale and be scheduled together with the load of the cluster. The feature brings many benefits to training scenarios:
Improved fault tolerance. With elasticity, the failure of individual instances can be tolerated, and a task no longer fails as a whole because of an error in a single process.
Improved resource utilization. When cluster resources are tight, reducing the number of instances of low-priority training tasks guarantees the resource quota of high-priority tasks and the SLA of services. When cluster resources are idle, the idle GPUs and other resources can be used to speed up training by creating more instances to join it. This not only improves the training speed of the task but also improves the resource utilization of the cluster.
Cloud native AI training that cooperates with spot instances and other cloud resources to further reduce cloud costs. Spot instances have a large cost advantage over pay-as-you-go instances, but they may be reclaimed at any time. Elastic training fits this scenario perfectly: when spot instances are available, training instances are created on them, and when a spot instance is reclaimed, the training task can still continue.
Elastic distributed training can address the cost, resource utilization, and fault tolerance problems of distributed training. Although elastic training may look like nothing more than dynamically adjusting the instances of a training task, it can interact with the cloud native capabilities of the public cloud and produce even greater value.
In our tests, elastic training based on Horovod reduced the cost per GPU from 16.21 yuan to 1.62 yuan, and the cost of the whole model training was reduced by nearly 70%. If the budget stays the same, elastic training on spot instances can buy more GPU cards, increasing training speed by 5 to 10 times: a training task that would have taken a day can be completed in a few hours. Combined with cluster scheduling, elastic training opens up even more possibilities to explore.
Horovod is one of the most widely used frameworks for data-parallel distributed training, so we take Horovod as an example to introduce how its elastic training scheme lands in a cloud native environment.
Horovod Elastic
Uber's open source Horovod framework, a widely used data-parallel training framework, also began to address elastic training in the summer of 2020, and Elastic Horovod was released in Horovod v0.20.0.
To achieve elastic training, Horovod Elastic makes several changes to Horovod's architecture and implementation (a minimal sketch of the resulting training loop follows the list):
Aggregation (collective) operations need to be defined inside a function wrapped by hvd.elastic.run
Each worker has its own state, which is synchronized before training starts
The addition or removal of a worker triggers a reset event on the other workers
A reset event triggers the following actions (not necessarily all of them): a. determine whether each worker should continue to run; b. add failed workers to the blacklist; c. start worker processes on the new hosts; d. update the rank information of the workers
After the reset event, the state of each worker is synchronized again
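The following is a minimal sketch of what such an elastic training loop looks like with PyTorch, based on Horovod's public elastic API (hvd.elastic.run, hvd.elastic.TorchState, state.commit). The model, optimizer, and data are placeholders; a real job would plug in its own network and dataset.

import torch
import horovod.torch as hvd

hvd.init()

# Placeholder model and optimizer; replace with your own network and data loading.
model = torch.nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
loss_fn = torch.nn.CrossEntropyLoss()

def on_state_reset():
    # Called after a reset event: rescale the learning rate to the new world size.
    for group in optimizer.param_groups:
        group["lr"] = 0.01 * hvd.size()

@hvd.elastic.run  # collective operations run inside this wrapped function
def train(state):
    for state.epoch in range(state.epoch, 5):
        for state.batch in range(state.batch, 100):
            data = torch.randn(32, 784)              # dummy batch for illustration
            target = torch.randint(0, 10, (32,))
            optimizer.zero_grad()
            loss = loss_fn(model(data), target)
            loss.backward()
            optimizer.step()
            if state.batch % 10 == 0:
                state.commit()  # checkpoint the state so it can be restored after a reset
        state.batch = 0

# The worker state that is synchronized across workers before training and after resets.
state = hvd.elastic.TorchState(model, optimizer, epoch=0, batch=0)
state.register_reset_callbacks([on_state_reset])
train(state)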
In practice, users need to provide horovodrun with a host-discovery script that reports in real time the currently available hosts and the slots on each host (the script is referred to as discover_hosts.sh below, but it does not have to be named discover_hosts.sh).
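The discovery script can be any executable that prints one hostname:slots line per available host. As a sketch, the stand-in below uses a hard-coded, assumed host list (a real script would query the cluster or scheduler), and the trailing comment shows how it would be passed to an elastic horovodrun launch.

#!/usr/bin/env python3
# discover_hosts.py -- prints the currently available hosts and their slots,
# one "hostname:slots" line per host, in the format expected by
# horovodrun --host-discovery-script. The host list is a hard-coded placeholder.
AVAILABLE_HOSTS = {
    "worker-0": 1,
    "worker-1": 1,
}

for host, slots in AVAILABLE_HOSTS.items():
    print(f"{host}:{slots}")

# Example elastic launch from the shell, assuming the script is executable:
#   horovodrun -np 2 --min-np 1 --max-np 3 \
#       --host-discovery-script ./discover_hosts.py python train_elastic.py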
Horovod Elastic on Kubernetes
Before the Elastic feature was introduced, MPI-Operator from the Kubeflow community was the mainstream solution for deploying and running Horovod on a Kubernetes cluster. Although MPI-Operator has gone through three API versions (v1alpha1, v1alpha2, and v1), the overall idea is the same. The main workflow is:
The MPIJob Controller generates a launcher pod and the corresponding number of worker pods according to the configuration of each MPIJob.
The MPIJob Controller generates a ConfigMap for each MPIJob, which contains two files: the hostfile listing all worker pods of the job, and the kubexec.sh script.
mpirun on the launcher pod uses the kubexec.sh script from the ConfigMap to launch processes in the worker pods (a rough sketch of this kubexec-style launch follows the list). Note that kubectl execution depends on the RBAC resources pre-created by the MPIJob Controller: if the corresponding Role does not grant the launcher pod exec permission on the worker pods, kubectl exec from the launcher pod will be denied.
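To illustrate the mechanism (and why the Role matters), here is a rough Python sketch of the kubexec-style remote launch: each worker process is effectively started with kubectl exec, which the API server rejects unless the launcher's ServiceAccount has the pods/exec permission. The pod names and command are placeholders for illustration.

import subprocess

# Placeholder worker pod names; in MPI-Operator they come from the hostfile in the ConfigMap.
WORKER_PODS = ["example-job-worker-0", "example-job-worker-1"]

def kubexec(pod, command):
    """Start a command inside a worker pod, the way mpirun delegates to kubexec.sh."""
    # This requires the launcher's ServiceAccount to be bound to a Role that allows
    # "create" on pods/exec for these worker pods; otherwise the API server denies it.
    subprocess.run(["kubectl", "exec", pod, "--"] + command, check=True)

for pod in WORKER_PODS:
    kubexec(pod, ["hostname"])  # placeholder command; mpirun launches its own daemons here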
Previously, there were several compatibility issues between MPI-Operator and Elastic Horovod. Since the three versions of MPI-Operator differ somewhat, we only discuss the v1 version here:
MPI-Operator does not yet provide discover_hosts.sh, which directly prevents Elastic Horovod from being used.
When the user reduces the worker replicas, the controller does not take any action on the "extra" worker pods, so those pods are never released and the instance count of the training task cannot actually be reduced.
When the user increases the worker replicas, the controller does not add exec permission on the new workers to the Role bound to the launcher pod, so when horovodrun on the launcher pod tries to use kubectl to launch processes on the newly created worker pods, it is rejected by Kubernetes's permission mechanism.
To address these compatibility problems, we proposed Elastic Horovod on MPIJob in the community: https://github.com/kubeflow/mpi-operator/pull/335. Together with the changes to Horovod in https://github.com/horovod/horovod/pull/2199, elastic Horovod training can be realized on Kubernetes.
In this scheme, the most critical problem is how to implement discover_hosts.sh on the launcher pod, and the key on Kubernetes is how to obtain the worker pods that are currently in the Running state. There are two approaches.
1. The MPIJob Controller builds discover_hosts.sh and syncs it to the launcher pod through a ConfigMap.
The MPIJob Controller already watches pod-related information; using the podLister in the controller, it can quickly list the worker pods of each MPIJob.
Based on the status.phase of the pods, the controller can filter out the worker pods in the Running state and build a discover_hosts.sh that reflects the current worker pod status.
Through the ConfigMap, the controller can then sync discover_hosts.sh to the launcher pod, just like the hostfile and kubexec.sh scripts (a Python sketch of this controller-side logic follows).
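The real controller is written in Go inside MPI-Operator; the Python sketch below only illustrates the logic described above under some assumptions: the label selector and ConfigMap name are hypothetical, and one slot per worker is assumed.

from kubernetes import client, config

config.load_kube_config()  # inside a controller this would be load_incluster_config()
core = client.CoreV1Api()

NAMESPACE = "default"
JOB_NAME = "tensorflow-mnist-elastic"                          # example MPIJob name
LABEL_SELECTOR = f"training.kubeflow.org/job-name={JOB_NAME}"  # assumed worker label

# 1. List the job's worker pods and keep only those in the Running phase.
pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
running = [p.metadata.name for p in pods.items if p.status.phase == "Running"]

# 2. Render a discover_hosts.sh that echoes "pod-name:slots" for each running worker.
script = "#!/bin/sh\n" + "\n".join(f"echo {name}:1" for name in sorted(running)) + "\n"

# 3. Patch the job's ConfigMap so the script is synced into the launcher pod,
#    alongside the hostfile and kubexec.sh (the ConfigMap name is an assumption).
core.patch_namespaced_config_map(
    name=f"{JOB_NAME}-config",
    namespace=NAMESPACE,
    body={"data": {"discover_hosts.sh": script}},
)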
2. Use the kubectl already present in the launcher pod to obtain real-time worker pod information from the APIServer.
The launcher pod is already bound with the "get" and "list" permissions on pods, so it can obtain the corresponding pod information by calling kubectl or another Kubernetes client directly, and return the information Elastic Horovod expects by applying the same filtering criteria (a short sketch of such a script follows).
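For comparison, a discovery script under this second approach could simply shell out to the kubectl available in the launcher pod. This sketch assumes a hypothetical label selector for the job's pods and one slot per worker.

#!/usr/bin/env python3
# Sketch of a launcher-side discovery script: query the APIServer directly
# through the kubectl that already exists in the launcher pod.
import subprocess

LABEL_SELECTOR = "training.kubeflow.org/job-name=tensorflow-mnist-elastic"  # assumed label

out = subprocess.check_output([
    "kubectl", "get", "pods",
    "-l", LABEL_SELECTOR,
    "--field-selector=status.phase=Running",
    "-o", "jsonpath={.items[*].metadata.name}",
], text=True)

for name in out.split():
    if "worker" in name:        # keep only worker pods, not the launcher itself
        print(f"{name}:1")      # one slot per worker pod, matching the demo below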
The second approach, however, cannot limit how frequently users execute discover_hosts.sh; if users run it too often or the MPIJob is large, it puts considerable pressure on the Kubernetes cluster. The first approach is therefore more manageable from a control point of view.
One amendment to the second approach is to replace kubectl or the client with a podLister process running in the launcher pod, thereby reducing the pressure on the APIServer. However, this means two processes run in the launcher pod, and when the podLister process fails there is no suitable mechanism to restart it, which would cause subsequent elastic training to fail.
Therefore, we chose the first approach in our proposal: the controller syncs discover_hosts.sh into the launcher pod through a ConfigMap and mounts it at /etc/mpi/discover_hosts.sh. The proposal also makes corresponding changes to the controller for the other two compatibility issues. These changes do not affect MPI tasks that do not run in elastic mode; such tasks can simply ignore discover_hosts.sh.
Of course, this scheme also has a drawback: there is a delay in synchronizing the ConfigMap to the launcher pod. However, this delay can be tuned by the Kubernetes administrator, and it is acceptable compared with the duration of the whole training and the time Elastic Horovod spends on a reset.
Elastic training demonstration
Finally, we use an example to demonstrate how to run a Horovod elastic training task on Kubernetes. The task is created in the same way as a normal training task, through an MPIJob.
bash-5.0$ kubectl create -f ./tensorflow-mnist-elastic.yaml
mpijob.kubeflow.org/tensorflow-mnist-elastic created
bash-5.0$ kubectl get po
NAME                                READY   STATUS    RESTARTS   AGE
tensorflow-mnist-elastic-launcher   1/1     Running   0          14s
tensorflow-mnist-elastic-worker-0   1/1     Running   0          14s
tensorflow-mnist-elastic-worker-1   1/1     Running   0          14s
In the example, we create two workers to participate in the training. After training starts, we adjust MPIJob.Spec.MPIReplicaSpecs["Worker"].Replicas to add a new worker and observe the change in the number of instances. The new worker joins the training, and after it finishes fetching and initializing the dataset, the training task continues without interruption. The output of discover_hosts.sh before and after the change is as follows:
bash-5.0$ kubectl exec tensorflow-mnist-elastic-launcher -- /etc/mpi/discover_hosts.sh
tensorflow-mnist-elastic-worker-0:1
tensorflow-mnist-elastic-worker-1:1
bash-5.0$ kubectl edit mpijob/tensorflow-mnist-elastic
mpijob.kubeflow.org/tensorflow-mnist-elastic edited
bash-5.0$ kubectl exec tensorflow-mnist-elastic-launcher -- /etc/mpi/discover_hosts.sh
tensorflow-mnist-elastic-worker-0:1
tensorflow-mnist-elastic-worker-1:1
tensorflow-mnist-elastic-worker-2:1
Finally, we try to adjust the number of instances down to one. Two instances in the training cluster are reclaimed, and the training continues:
bash-5.0$ kubectl edit mpijob/tensorflow-mnist-elastic
mpijob.kubeflow.org/tensorflow-mnist-elastic edited
bash-5.0$ kubectl get po
NAME                                READY   STATUS        RESTARTS   AGE
tensorflow-mnist-elastic-launcher   1/1     Running       0          4m48s
tensorflow-mnist-elastic-worker-0   1/1     Running       0          4m48s
tensorflow-mnist-elastic-worker-1   1/1     Terminating   0          4m48s
tensorflow-mnist-elastic-worker-2   1/1     Terminating   0          2m21s
Thu Mar 11 01:53:18 2021[1]: Step #40 Loss: 0.284265
Thu Mar 11 01:53:18 2021[0]: Step #40 Loss: 0.259497
Thu Mar 11 01:53:18 2021[2]: Step #40 Loss: 0.229993
Thu Mar 11 01:54:27 2021[2]: command terminated with exit code 137
Process 2 exit with status code 137.
Thu Mar 11 01:54:27 2021[0]: command terminated with exit code 137
Process 0 exit with status code 137.
Thu Mar 11 01:54:57 2021[1]: [2021-03-11 01 Horovod background loop uncaught exception: [/tmp/pip-install-2jy0u7mn/horovod/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [10.244.2.27]:54432
WARNING:root:blacklist failing host: tensorflow-mnist-elastic-worker-2
WARNING:root:blacklist failing host: tensorflow-mnist-elastic-worker-1
Thu Mar 11 01:54:58 2021[1]: Step #50 Loss: 0.207741
Thu Mar 11 01:55:00 2021[1]: Step #60 Loss: 0.119361
Thu Mar 11 01:55:02 2021[1]: Step #70 Loss: 0.131966
This shows that, with the support of MPIJob, Horovod Elastic can be scaled out and scaled in manually to meet business needs. In follow-up work we will continue to add advanced features, such as automatic scaling with HorizontalPodAutoscaler and scaling down specified instances, to cover more scenarios.