How to use DC/OS to build a deep learning platform

This article looks at how to use DC/OS to build a deep learning platform, analyzing the question in detail in the hope of giving readers facing the same problem a simple, workable approach.
The following is a summary of the talk "Building a Distributed Deep Learning Platform using Apache Mesos (GPU Resource Scheduler and Gang Scheduler)" given by Min Cai, Alex Sergeev, Paul Mikesell and Anne Holler of Uber at MesosCon North America this year. The Uber team shared how they built a distributed deep learning platform on Apache Mesos (the core technology underlying Mesosphere DC/OS) for self-driving research, ride and ride-sharing demand forecasting, and fraud prevention. Uber's platform uses Horovod, a distributed deep learning framework that trains models on GPUs, and schedules GPU resources with a custom scheduler called Peloton, so that training on many GPUs is nearly as easy as training on one.
Distributed deep learning to optimize Uber's operations
Distributed deep learning underpins Uber's success in several key areas. Self-driving is a core part of the company's future business: its computing clusters must process large volumes of data covering three-dimensional models for computer vision and depth perception, regional maps, weather, and many other navigation factors, and a key measure of success is how quickly a new model can be trained on the deep learning cluster. Uber has also built many models around its ride-sharing services, which face the same problems as any other supply chain model. In travel forecasting, demand is predictable over time, and the input data consists of the drivers who are online and available. Uber analyzes historical data, combined with weather, upcoming regional events, and other signals, to alert drivers to approaching peaks in travel demand so they can respond. The same deep learning system is also used to flag and prevent fraud by customers and drivers.
Distributed deep learning for speed and scale
Uber processes data at considerable scale. The datasets and deep learning models Uber trains are far larger than a single host, a single GPU, or a single task can handle. The conventional single-threaded pattern of one main process with multiple worker threads does not scale to the big-data deep learning models that Uber and other leading Silicon Valley companies need to deliver on tight schedules. Uber's engineers realized that a fully distributed, many-to-many model would be more valuable, because it could be orchestrated, prioritized, resource-planned, and processed by groups of services. Training efficiency improves by sharding data across the cluster to minimize network latency and by training models across hundreds of GPUs. The core of Uber's distributed deep learning platform is the combination of Apache Mesos, Peloton, and Horovod.
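To make the data-slicing idea concrete, here is a minimal sketch of data-parallel sharding, assuming TensorFlow's tf.data API and Horovod for the worker rank and count; the file path is a hypothetical placeholder and not taken from Uber's talk.

```python
# Sketch: shard a training dataset across workers so each GPU process
# reads only its slice of the data (data-parallel training).
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU

files = tf.data.Dataset.list_files("/data/train/*.tfrecord", shuffle=False)
# Each worker keeps every N-th file, where N is the total number of workers.
files = files.shard(num_shards=hvd.size(), index=hvd.rank())

dataset = (files.interleave(tf.data.TFRecordDataset, cycle_length=4)
                .shuffle(10_000)
                .batch(64)
                .prefetch(tf.data.experimental.AUTOTUNE))
```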
Deep learning based on Apache Mesos
After weighing community strength against platform features across many candidate solutions, Uber chose Apache Mesos. The platform's wide adoption by both traditional enterprises and Internet companies caught Uber's attention, and the maturity of the Mesos community keeps upstream code submissions timely, relevant, and stable. In choosing a scheduling platform for distributed deep learning, Uber also weighed proven scalability, reliability, and deep customizability, along with features such as native GPU support, nested containers, and Nvidia GPU isolation. Apache Mesos supports GPUs natively; with other container frameworks, Uber might have had to carry its own upstream patches, maintain a fork, and support it in production. Mesos can also make the correct version of CUDA available inside the container, which simplifies deploying a distributed deep learning framework. Uber uses these features to run distributed TensorFlow in containers, encapsulating the TensorFlow management code in nested sub-containers so that it stays independent of the developer's workflow. As a result, Uber's developers are more productive and can focus on model development and training rather than managing TensorFlow itself.
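For illustration, the snippet below sketches what requesting a GPU through Marathon on a DC/OS cluster can look like from Python. The URL, image, and command are placeholders, the exact fields (including the top-level "gpus" and the Mesos/UCR containerizer requirement for GPU apps) depend on your DC/OS and Marathon versions, and none of this is taken from Uber's talk.

```python
# Sketch: submit a single-GPU TensorFlow job to Marathon on DC/OS.
import requests

MARATHON_URL = "http://marathon.mesos:8080"  # adjust for your cluster

app = {
    "id": "/deep-learning/train-demo",
    "cmd": "python /workspace/train.py",
    "cpus": 4,
    "mem": 16384,        # MiB
    "gpus": 1,           # ask Mesos for one GPU
    "instances": 1,
    "container": {
        "type": "MESOS",  # UCR containerizer, needed for GPU isolation
        "docker": {"image": "example/tensorflow-gpu:latest"},
    },
}

resp = requests.post(f"{MARATHON_URL}/v2/apps", json=app, timeout=30)
resp.raise_for_status()
print(resp.json().get("id"), "submitted")
```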
Uber Peloton batch scheduling
Although Apache Mesos is an excellent tool for task scheduling, resource management, and task failure recovery, Uber wanted to layer its own capabilities on top to run distributed deep learning workflows its own way, in particular adding finer-grained management and scheduling of TensorFlow workflows. To that end, the Uber R&D team developed Peloton, Uber's own workload manager, focused on customized workflow and task lifecycle management, task scheduling, and task preemption. Uber can now manage tasks at finer granularity: a single workflow submission can contain hundreds of tasks, which are treated as one batch. A batch is handled as a single atomic unit in Peloton, with each task running in its own container; a conceptual sketch of that gang-scheduling idea follows below.
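Peloton's real API is not shown in the talk summary above, so the following is a purely hypothetical Python sketch of the gang-scheduling behavior it implements: a batch of tasks is placed all at once or not at all, and is treated as a single atomic unit.

```python
# Hypothetical illustration (not Peloton's actual API): a "gang" of tasks
# is admitted only if the cluster can place every member at once.
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    name: str
    gpus: int

@dataclass
class Gang:
    tasks: List[Task]

    def required_gpus(self) -> int:
        return sum(t.gpus for t in self.tasks)

def try_schedule(gang: Gang, free_gpus: int) -> bool:
    """All-or-nothing placement: either every task in the gang gets its
    GPUs, or nothing is launched and the gang waits in the queue."""
    if gang.required_gpus() > free_gpus:
        return False
    for task in gang.tasks:
        print(f"launching {task.name} with {task.gpus} GPU(s)")
    return True

# A 4-worker training job submitted as one atomic batch of containers.
job = Gang([Task(f"worker-{i}", gpus=1) for i in range(4)])
try_schedule(job, free_gpus=8)
```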
Uber's Horovod enhances TensorFlow's multi-GPU support
With core resource management and batch scheduling solved, the Uber R&D team turned its attention to a developer-friendly way to schedule multi-GPU jobs. Uber builds its deep learning models with TensorFlow, but found that multi-GPU support was cumbersome and error-prone for developers. To improve this, the team developed Horovod, which lets a single-GPU model job be submitted as a multi-GPU workflow. As a result, developers make far fewer mistakes and finish their work faster. In an example presented at MesosCon North America, Horovod raised their GPU cluster efficiency from 56% to 82%.
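As a rough sketch of the Horovod pattern described above (using Horovod's public TensorFlow API on a toy model; the learning rate, step count, and model itself are placeholder choices), a single-GPU TensorFlow script becomes a multi-GPU job by pinning each process to one GPU, wrapping the optimizer so gradients are averaged across workers, and broadcasting the initial weights from rank 0:

```python
# Minimal Horovod + TensorFlow sketch: each process drives one GPU and the
# wrapped optimizer averages gradients across all workers.
import tensorflow as tf
import horovod.tensorflow as hvd

tf.compat.v1.disable_eager_execution()
hvd.init()

# Pin this process to a single local GPU.
config = tf.compat.v1.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy linear model standing in for a real network.
x = tf.random.normal([32, 10])
y = tf.random.normal([32, 1])
w = tf.Variable(tf.random.normal([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

global_step = tf.compat.v1.train.get_or_create_global_step()
opt = tf.compat.v1.train.AdamOptimizer(0.001 * hvd.size())  # scale LR by worker count
opt = hvd.DistributedOptimizer(opt)                         # allreduce the gradients
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    hvd.BroadcastGlobalVariablesHook(0),                    # sync initial weights
    tf.compat.v1.train.StopAtStepHook(last_step=1000 // hvd.size()),
]
with tf.compat.v1.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

Launched with, for example, `horovodrun -np 8 python train.py`, the same script runs on eight GPUs with no further code changes; the scheduler's job (Peloton on Mesos, in Uber's case) is to place those eight processes on GPU-equipped containers at the same time.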
That is how DC/OS can be used to build a deep learning platform. I hope the above content has been helpful.