On September 11, Ant Financial announced at Google Developer Days Shanghai 2019 that it had open-sourced ElasticDL, a distributed deep learning system based on TensorFlow 2.0 eager execution. As far as we know, ElasticDL is the first TensorFlow-based deep learning system that supports elastic scheduling. Wang Yi, the project lead, shared the design intent and current status of ElasticDL, and in particular its technical relationship to TensorFlow 2.0 and Kubernetes.
Technical approaches to distributed deep learning
TensorFlow-based distributed training systems can be divided into the following four categories:
Among them, ElasticDL sits in the upper-right cell of this two-by-two grid. We chose this approach in order to leverage Kubernetes for fault tolerance and elastic scheduling.
High performance computing and cloud computing
In the early days of deep learning R&D, relatively few people were involved, few people shared a computing cluster, and coordination among computing jobs could be handled by talking to one another. What people cared about most was shortening run time, that is, the time from a job's start to its finish. High-performance computing (HPC) is an effective way to address this problem; for example, NVIDIA's cuBLAS and cuDNN optimize high-performance math kernels, and NCCL optimizes communication efficiency between GPUs.
As deep learning came into large-scale use, many engineers and researchers began to share a single cluster, and coordinating scheduling by discussion was obviously no longer feasible, so people turned to cluster management systems to schedule distributed jobs. Among these, Kubernetes has stood out in recent years and is now widely available on major public clouds.
Cloud computing and elastic scheduling
A common way to launch distributed TensorFlow jobs on Kubernetes is to use Kubeflow, which Google open-sourced. Kubeflow is a Kubernetes "add-on" that asks Kubernetes which machines it plans to allocate for each process of a distributed job, and then tells each process the IP addresses and ports of all the other processes, so that every process in a job knows about all the others.
Why do all processes need to know one another? Because TensorFlow's parameter-server (ps) based distribution approach (top-left in the grid above) requires it. TensorFlow 1.x's native distributed training has every process in a job run the TensorFlow 1.x runtime. These processes communicate and coordinate with one another, forming a "distributed runtime" that interprets and executes the computation graph describing the deep learning computation. When distributed training starts, the TensorFlow runtime partitions the graph into several subgraphs, and each process executes one of them; if any process fails (for example, because it is preempted by a higher-priority job), execution of the whole graph fails. TensorFlow's native distributed training is therefore not fault-tolerant. It is, however, fault-recoverable: the TensorFlow API provides checkpointing, so a failed job can be restarted and resume from its most recent checkpoint.
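As a minimal sketch of this checkpoint-based recovery (the model, optimizer, checkpoint directory, and save interval below are illustrative assumptions, not tied to any particular job):

```python
import tensorflow as tf

# Illustrative model and optimizer; a real job defines its own.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="/tmp/ckpt", max_to_keep=3)

# On (re)start, resume from the most recent checkpoint if one exists.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)

for step in range(1000):
    # ... run one training step here ...
    if step % 100 == 0:
        manager.save()  # periodically persist model and optimizer state
```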
Kubeflow can launch jobs on Kubernetes that use TensorFlow's native distributed training, but because the latter is not fault-tolerant, Kubeflow cannot conjure fault tolerance out of nothing. And without fault tolerance, elastic scheduling is impossible.
The demand for elastic scheduling
When a computing cluster is shared by many people, elastic scheduling greatly improves both team productivity and overall cluster utilization. The former supports rapid iteration and keeps the technology ahead; the latter determines enterprise cost and the profitability of a cloud computing business.
Here is an example of the effect of elastic scheduling. Suppose a cluster has N GPUs and one job, consisting of one process, occupies N/2 of them. A second job needs more than the N/2 GPUs that remain free. Without elastic scheduling, the second job is forced to wait until the first finishes and releases its resources, a wait that is often of the same order of magnitude as the second job's own run time; during that time the cluster utilization is very low, only 50%. With elastic scheduling, the second job can start immediately on the N/2 free GPUs, and if more resources free up later, the scheduling system can increase its number of processes to make full use of them.
Another example: suppose a job is already running when a new, higher-priority job needs resources, so the scheduling system preempts some of the first job's processes to free resources for the second. Without elastic scheduling and fault tolerance, the first job fails and all of its processes exit; it cannot resume, from its most recent checkpoint, until enough resources are available to restart it. With elastic scheduling, the first job's remaining processes keep running, merely more slowly because fewer processes (GPUs) are available.
These two examples show how elastic scheduling improves cluster utilization and protects team productivity. Note that fault tolerance and elastic scheduling are mutually cause and effect. Fault tolerance means a job is unaffected by changes in the number of its processes; under elastic scheduling, that number rises and falls with the cluster's workload, so a job must be fault-tolerant in order to cooperate with the scheduling system. For the same reason, elastic scheduling depends on cooperation between the distributed programming framework and the scheduling system.
Today, many distributed programming frameworks can work with Kubernetes to achieve fault tolerance and elastic scheduling, for example Spark for offline data processing, Storm for online data processing, the streaming engine Flink, and the distributed storage systems Redis and HBase. Among frameworks suited to deep learning there is Paddle EDL. As far as we know, ElasticDL is the first TensorFlow-based deep learning system to support elastic scheduling.
Kubernetes-native elastic scheduling
ElasticDL implements elastic deep learning through a Kubernetes-native framework that calls TensorFlow 2.0.
"Kubernetes-native" means that a program itself calls the Kubernetes API to start and stop processes. By analogy, Google MapReduce is a Borg-native distributed computing framework: a user starts a MapReduce job by running a Borg client, the Borg client calls the Borg API to submit the job and start a master process, and the master then calls the Borg API to start the worker processes. Similarly, an ElasticDL user runs ElasticDL's command-line client to start a job; the client calls the Kubernetes API to start the master process, and the master process in turn calls the Kubernetes API to start, and to monitor, the other processes.
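To make the Kubernetes-native idea concrete, here is a minimal sketch of a master starting a worker pod through the official Kubernetes Python client; the pod name, image, and command are hypothetical placeholders, not ElasticDL's actual implementation:

```python
from kubernetes import client, config

# Load in-cluster credentials; the master itself runs as a pod on Kubernetes.
config.load_incluster_config()
v1 = client.CoreV1Api()

def start_worker_pod(job_name, worker_id, namespace="default"):
    # Pod name, image, and command below are illustrative placeholders.
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=f"{job_name}-worker-{worker_id}"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="elasticdl-worker:latest",
                    command=["python", "-m", "worker_main"],
                )
            ],
        ),
    )
    return v1.create_namespaced_pod(namespace=namespace, body=pod)
```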
If a worker fails, the mathematical properties of the distributed training algorithm guarantee that training can continue without any special handling. If a parameter server process dies, the master picks a worker process and asks it to switch roles to replace the failed parameter server. In both cases, the master calls the Kubernetes API to start an additional worker process; once it is up, the master brings it into the collaboration with the other processes. The master's own state (mainly three task queues: todo, doing, and done) can be kept in the Kubernetes cluster's etcd store, so that if the master fails, the restarted master process can inherit the state of its previous life from etcd.
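A sketch of how such master state could be persisted to and recovered from etcd, assuming the python-etcd3 client; the key layout and JSON serialization are illustrative assumptions, not ElasticDL's actual scheme:

```python
import json
import etcd3

etcd = etcd3.client(host="etcd", port=2379)  # etcd endpoint is an assumption

def save_master_state(job_name, todo, doing, done):
    # Persist the three task queues so a restarted master can recover them.
    state = {"todo": todo, "doing": doing, "done": done}
    etcd.put(f"/elasticdl/{job_name}/master_state", json.dumps(state))

def load_master_state(job_name):
    value, _ = etcd.get(f"/elasticdl/{job_name}/master_state")
    if value is None:
        return [], [], []  # fresh start: no previous life to inherit
    state = json.loads(value)
    return state["todo"], state["doing"], state["done"]
```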
The description above is simplified. ElasticDL implements several distributed computing patterns, each of which implements fault tolerance in a slightly different way; we will describe them in detail in follow-up articles.
The Kubernetes-native architecture gives the master process the opportunity to cooperate with Kubernetes to achieve fault tolerance and elastic scheduling. However, because ElasticDL calls the Kubernetes API, it can only run on Kubernetes.
TensorFlow's native distributed training is not Kubernetes-native, which is why TensorFlow is not bound to the Kubernetes platform; it is also why running TensorFlow jobs on Kubernetes with the existing technology requires Kubeflow, an extension to Kubernetes.
In theory, a certain degree of fault tolerance can be achieved without calling the Kubernetes API. Even without notifications from Kubernetes, the master can determine whether the other processes are alive by checking their heartbeats or the status of their TCP connections. However, without calling the Kubernetes API (or the API of some other scheduling system), the master cannot ask the scheduling system to restart a failed process, nor learn about a newly started process and help it join the job. This "non-Kubernetes-native" style of fault tolerance is rather passive: it can only accept the fact that some processes are preempted and killed when resources are tight, and it cannot add processes to exploit idle resources after other jobs release them.
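For illustration, a passive liveness check based on heartbeats could be as simple as the sketch below; the timeout value and bookkeeping are assumptions made for the example:

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; an illustrative choice

# The master records the time of the last heartbeat received from each worker.
last_heartbeat = {}  # worker_id -> timestamp

def on_heartbeat(worker_id):
    last_heartbeat[worker_id] = time.time()

def alive_workers():
    now = time.time()
    return [w for w, t in last_heartbeat.items() if now - t < HEARTBEAT_TIMEOUT]
```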
TensorFlow 2.0
As explained above, to keep TensorFlow's core runtime platform-independent, we cannot achieve fully active fault tolerance and elastic scheduling by modifying the runtime. That is why, as shown in the two-by-two grid at the beginning of this article, both ElasticDL and Uber's Horovod are built as wrappers over the TensorFlow API.
Horovod is based on TensorFlow 1.x. Each process of a Horovod job calls single-machine TensorFlow to do its local computation, then collects the gradients and aggregates them and updates the model through AllReduce calls. Horovod is also platform-independent, so the AllReduce operations it provides do not support fault tolerance or elastic scheduling. This is where it differs from ElasticDL.
Like ElasticDL, Horovod needs to "intercept" gradients from TensorFlow behind the scenes. In TensorFlow 1.x, a deep learning computation is represented as a graph and interpreted by the TensorFlow runtime, so Horovod has to hook into graph execution to obtain each process's computed gradients and AllReduce them. For this reason, Horovod requires users to wrap TensorFlow's optimizer with a Horovod-specific optimizer, so that the gradients are exposed during the model update phase.
The structure of a user program that calls Horovod is shown below. The parts marked with (*) are the code Horovod requires users to write so that it can intercept the gradients computed by TensorFlow; if the user forgets to write them, the program's results will be incorrect.
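The original listing is not reproduced here. The following is a hedged reconstruction of a typical Horovod + TensorFlow 1.x training program, with the Horovod-specific lines marked (*); the model, loss, and data are placeholder choices:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                      # (*) initialize Horovod

# (*) pin this process to one GPU based on its local rank
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model and loss; the details depend on the user's task.
features = tf.placeholder(tf.float32, [None, 32])
labels = tf.placeholder(tf.float32, [None, 1])
logits = tf.layers.dense(features, 1)
loss = tf.losses.mean_squared_error(labels, logits)

opt = tf.train.GradientDescentOptimizer(0.01)
opt = hvd.DistributedOptimizer(opt)             # (*) wrap the optimizer so gradients are AllReduced
train_op = opt.minimize(loss)

# (*) broadcast the initial model from rank 0 so all workers start identically
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(100):
        x = np.random.rand(16, 32).astype(np.float32)   # dummy batch
        y = np.random.rand(16, 1).astype(np.float32)
        sess.run(train_op, feed_dict={features: x, labels: y})
```

The wrapped optimizer is the point at which Horovod gets to see, and AllReduce, the gradients.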
ElasticDL does not have these problems, because it relies on TensorFlow 2.0. The eager execution mode promoted by TensorFlow 2.0 computes in a way completely different from interpreting a graph. Similar to PyTorch, the forward pass records calls to basic compute units (operators) in an in-memory data structure called a tape, and the backward pass (gradient computation) walks the tape backwards, invoking the gradient operator corresponding to each recorded operator. The tape provides an operation that lets users retrieve the gradient of every parameter.
ElasticDL can obtain the gradients directly by calling the TensorFlow 2.0 API:
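The original listing is not reproduced here; a minimal sketch of this kind of gradient collection with tf.GradientTape (the model, loss, and batch below are placeholders) is:

```python
import tensorflow as tf

# Placeholders for illustration; in ElasticDL these come from the user's model definition.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
features = tf.random.normal([16, 32])
labels = tf.random.normal([16, 1])

with tf.GradientTape() as tape:
    outputs = model(features, training=True)
    loss = loss_fn(labels, outputs)

# The tape exposes the gradient of the loss w.r.t. each trainable parameter.
grads = tape.gradient(loss, model.trainable_variables)
```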
The code above does not need to be written by the user; it is part of ElasticDL. What an ElasticDL user needs to write corresponds to just one part of the Horovod example above: defining the model.
Minimalist API and usage
Training a model requires not only the model definition above, but also data, an optimization objective (the cost, or loss), and an optimization algorithm (the optimizer). Users want to specify this information concisely and describe a training job with as little code as possible.
ElasticDL, like TensorFlow's other high-level APIs such as Keras and TensorFlow Estimator, launches a distributed training job with what amounts to a single API call. The following program uses Keras; note that Keras relies on TensorFlow's native distributed training and therefore supports neither fault tolerance nor elastic scheduling.
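The original Keras listing is not reproduced here; a hedged sketch of such a program, using a tf.distribute multi-worker strategy with an illustrative model and dummy data, looks like:

```python
import tensorflow as tf

# Distributed execution strategy chosen explicitly by the user.
# In a real multi-worker job, the TF_CONFIG environment variable describes the cluster.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

def make_dataset():
    # Dummy data for illustration; a real job would read its training data here.
    x = tf.random.normal([1024, 32])
    y = tf.random.normal([1024, 1])
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(64)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

model.fit(make_dataset(), epochs=5)
```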
ElasticDL's API is even more concise. The ElasticDL version of the sample program above is as follows:
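The original listing is not reproduced here either. Purely as a hypothetical sketch (the function names below are illustrative and do not reflect ElasticDL's verified interface; consult the ElasticDL repository for the real one), the user essentially supplies the model definition together with its loss and optimizer, with no distributed strategy anywhere:

```python
import tensorflow as tf

# Hypothetical sketch of a user-side ElasticDL model definition file.
# Function names are illustrative assumptions, not ElasticDL's actual API.

def custom_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

def loss(labels, outputs):
    return tf.reduce_mean(tf.square(labels - outputs))

def optimizer(lr=0.1):
    return tf.keras.optimizers.SGD(lr)
```

The job itself is then launched through ElasticDL's command-line client, which, as described earlier, calls the Kubernetes API to start the master and worker processes.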
The main difference is that the Keras program requires the user to choose a distributed execution strategy, while the ElasticDL program does not, because ElasticDL chooses the distributed training algorithm and strategy automatically.
In a nutshell, for models with large parameters (which require model parallelism), ElasticDL uses asynchronous SGD. Combined with delayed model updates, this method can reduce network traffic by an order of magnitude. Many models in NLP, search, recommendation, and advertising fall into this category, and asynchronous SGD performs quite stably on them. For models with small parameters, such as image recognition and speech recognition models, the ElasticDL team is developing a Kubernetes-native AllReduce. Like the AllReduce used by Horovod, ElasticDL AllReduce organizes inter-process communication into a ring to achieve high-performance model updates. The difference is that ElasticDL AllReduce is fault-tolerant: when a process failure causes an AllReduce call to fail, the master organizes the remaining live processes into a new ring.
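As a conceptual illustration only (not ElasticDL's actual implementation), re-forming the ring from the surviving processes can be thought of as recomputing each process's neighbors:

```python
def build_ring(alive_workers):
    """Given the addresses of live workers, assign each its ring neighbors.

    Conceptual sketch: when a worker drops out, calling this again with the
    surviving addresses yields a new, smaller ring.
    """
    n = len(alive_workers)
    ring = {}
    for i, worker in enumerate(alive_workers):
        ring[worker] = {
            "send_to": alive_workers[(i + 1) % n],
            "recv_from": alive_workers[(i - 1) % n],
        }
    return ring

# Example: worker-2 failed, so the ring is rebuilt without it.
print(build_ring(["worker-0", "worker-1", "worker-3"]))
```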
Through this divide-and-conquer strategy, the ElasticDL project hopes to provide a deep learning system that is both high-performance and easy to use.
The relationship between ElasticDL and SQLFlow
Earlier this year, Wang Yi's team open-sourced SQLFlow, which lets users describe an entire data and AI pipeline very concisely with an extended SQL syntax.
For example, to build a recommendation system for an e-commerce site, we need to develop modules such as log collection, online data cleaning, feature engineering, model training, validation, and prediction, and each module may require a team weeks or even months of work.
In recent years, many internet services have begun to deposit data directly into general-purpose databases. For example, much of Ant Financial's data sits in ODPS (the MaxCompute service on Alibaba Cloud) and a new generation of intelligent data systems. This prompted us to do data cleaning and preprocessing inside the database, while feature engineering, automated machine learning, and training are done in AI engines such as ElasticDL. SQLFlow translates a SQL program written in the extended syntax into a Python program that links the two parts together.
In such a scenario, if the AI engine requires many parameters, the user also has to supply them in the SQL program. For example, the following SQL statement extracts a user's age, department, and workplace from the database to predict their income.
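The original SQL listing is not reproduced here; a hedged sketch in SQLFlow's extended syntax, with illustrative table, column, model, and parameter names, might look like:

```sql
-- Illustrative sketch; table, column, model, and parameter names are assumptions.
SELECT age, department, workplace, income
FROM employee
TRAIN DNNRegressor
WITH hidden_units = [64, 32],
     dist_strategy = "MultiWorkerMirroredStrategy",
     gpus = 2
COLUMN age, department, workplace
LABEL income
INTO my_income_model;
```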
Here, the TRAIN clause specifies the model to train; the COLUMN clause specifies how to map the data into features; LABEL specifies the value to predict; and WITH specifies the training parameters, among them dist_strategy, the distributed strategy that has to be specified when training with Keras/TensorFlow, and gpus, the required resources. None of these are needed when SQLFlow calls ElasticDL, because ElasticDL selects the distributed strategy and algorithm automatically.
As this example shows, if users are to supply as few parameters as possible, the AI engine needs to become more automated, including AutoML and automatic feature engineering. The ElasticDL project still has a long way to go. We look forward to simplifying the SQL program above to the following form:
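Again as a hedged sketch with illustrative names:

```sql
-- Illustrative sketch; names are assumptions.
SELECT age, department, workplace, income
FROM employee
TRAIN DNNRegressor
COLUMN age, department, workplace
LABEL income
INTO my_income_model;
```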
Current status of the ElasticDL project
The ElasticDL project is at an early, exploratory stage, and its API is still evolving. The open-sourced version does not yet include the code that automatically selects distribution strategies and algorithms. Compared with distributed computing implemented inside the TensorFlow runtime, distributed training driven through the Python API in TensorFlow 2.0 eager mode still lags considerably in performance. The ElasticDL team is working with the Google Brain team on the asynchronous SGD plus delayed model update approach described above, as well as on Kubernetes-native AllReduce, and hopes to make them available in the next release.
At present, ElasticDL's parameter-server-based distributed SGD training has validated fault tolerance and elastic scheduling. It runs on Kubernetes 1.12 clusters on Google Cloud as well as on Alibaba Sigma 3.1 (a high-performance implementation of Kubernetes). The ElasticDL team has also developed a code generator for SQLFlow that generates ElasticDL programs.
By open-sourcing ElasticDL and sharing its design intent as early as possible, we hope to bring together efforts from different companies and communities, explore the distributed training ecosystem around Google TensorFlow 2.0 and Kubernetes, and deliver a convenient end-to-end AI development kit as soon as possible.