Large models are the core technology driving development and innovation in the general artificial intelligence industry, and more than 100 generative AI models have already been released in China. To address development and application scenarios built around large models, Inspur Information recently released OGAI (Open GenAI Infra), a full-stack, full-process intelligent computing software stack for large model workloads, covering AI computing system environment deployment, computing power scheduling and assurance, model development management, and more. The OGAI stack consists of five layers, L0 through L4: the intelligent computing center OS at the infrastructure layer, PODsys at the system environment layer, AIStation at the scheduling platform layer, YLink at the model tool layer, and MModel at the multi-model management layer.
Within the stack, the L2-layer AIStation is an AI computing power scheduling platform built for large model development. AIStation systematically optimizes resource usage and scheduling, the training workflow and its safeguards, and algorithm and application management for large model training, and supports checkpoint-based resumption so that long training runs can continue after interruptions. Inspur Information's "Yuan" (Source) large model, trained on AIStation, achieved a training compute efficiency of 44.8%. A large commercial bank built a large-scale parallel computing cluster on AIStation to fully tap its computing potential for large model training, winning the 2022 IDC "Future Digital Infrastructure Leader" award.
This article examines the challenges of large model training, how AIStation improves training efficiency, and the results achieved.
First, large model training faces major challenges
1. Enormous computing costs and difficult computing power utilization in large model training
The first challenge of large model training is the massive data and computation involved, which makes the cost enormous. GPT-3, for example, was trained on roughly 10,000 GPUs, and the "Yuan 1.0" model was trained on 180 billion tokens across 2,128 GPUs via the AIStation platform. Training a 70-billion-parameter model on a trillion tokens costs millions of dollars. Moreover, platform performance does not scale linearly with raw computing power; efficiency is lost as clusters grow, so large model training also requires efficient scheduling to fully exploit the platform. This depends not only on algorithm and framework optimization, but also on a scheduling platform that allocates computing power optimally according to the hardware characteristics and workload profile of the cluster, raising overall utilization and training efficiency.
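To make that scale concrete, here is a rough back-of-the-envelope estimate in Python of the trillion-token, 70-billion-parameter figure quoted above. The 6ND FLOPs rule of thumb, the accelerator peak throughput, the achieved utilization, and the hourly price are all illustrative assumptions, not numbers from the article:

```python
# Back-of-the-envelope training cost estimate; all constants are assumptions.
params = 70e9                 # 70-billion-parameter model
tokens = 1e12                 # one trillion training tokens
flops = 6 * params * tokens   # ~6*N*D FLOPs rule of thumb for one training run

peak_tflops = 312e12          # assumed BF16 peak of a single accelerator
mfu = 0.45                    # assumed achieved model FLOPs utilization
effective = peak_tflops * mfu

gpu_hours = flops / effective / 3600
cost = gpu_hours * 2.0        # assumed $2 per GPU-hour

print(f"{gpu_hours:,.0f} GPU-hours, ~${cost / 1e6:.1f}M")
# -> roughly 830,000 GPU-hours, i.e. on the order of millions of dollars
```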
2. Time-consuming, complex compatibility and adaptation across multiple network types
During large model training, thousands of GPUs communicate constantly within and across nodes. To obtain the best training performance, a single GPU server may carry multiple InfiniBand, RoCE, or other high-performance NICs to provide high-throughput, low-latency inter-node communication. Each network scheme has trade-offs: InfiniBand is widely regarded as the first choice for large model training because of its excellent performance, but it is expensive; RoCE is cheaper, but in very large deployments its performance and stability trail InfiniBand. Meeting the communication demands of large model training therefore requires deliberate design for the adaptive use of the communication devices and network conditions within the cluster.
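As an illustration of what adapting to the available fabric can look like at the framework level, the sketch below steers NCCL (the collective communication library used by most GPU training stacks) toward InfiniBand, RoCE, or plain TCP via its standard environment variables. The adapter and interface names are placeholders, and this is not a description of AIStation's internal mechanism:

```python
import os
import torch.distributed as dist

def configure_nccl(fabric: str) -> None:
    """Point NCCL at the desired fabric before init_process_group.

    `fabric` and the device names below are placeholders for whatever
    the cluster actually exposes; this is an illustrative sketch.
    """
    if fabric == "infiniband":
        os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"   # assumed IB adapter names
    elif fabric == "roce":
        os.environ["NCCL_IB_HCA"] = "mlx5_2"          # assumed RoCE-capable port
        os.environ["NCCL_IB_GID_INDEX"] = "3"         # RoCEv2 commonly uses GID 3
    else:                                             # fall back to TCP sockets
        os.environ["NCCL_IB_DISABLE"] = "1"
        os.environ["NCCL_SOCKET_IFNAME"] = "eth0"     # assumed Ethernet interface

configure_nccl("roce")
# Rank, world size, and master address are expected from the job launcher.
dist.init_process_group(backend="nccl")
```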
3. Unstable large model training and a high bar for system-level optimization
The training process of a large model is more complex than traditional distributed training, and a training cycle can last several months. Low cluster computing efficiency, frequent failures, and complex fault handling mean that interrupted training may not recover in time, reducing the probability of success and keeping training costs high. Large models therefore place higher demands on training stability, fault detection, and fault tolerance. At the same time, simplifying distributed task submission and automating intelligent task-resource matching and training robustness are important guarantees of training efficiency.
One of the major engineering problems Meta encountered in training Open Pre-trained Transformer (OPT)-175B, a model of the same scale as GPT-3, was training instability. As the figure below shows, training halted at many points, caused by GPU dropouts, abnormal GPU behavior, and other unexpected interruptions. Training stability and effective checkpoint resumption are pressing problems in large model training today.
Fig. 1 Unexpected interruptions during OPT-175B training (abscissa: training time; ordinate: perplexity, PPL; source: https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/56_percent_update.md#infrastructure-stability)
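The usual remedy for such interruptions is checkpoint-based resumption. The minimal PyTorch sketch below illustrates the idea: periodically persist model and optimizer state so an interrupted run resumes from the last saved step instead of restarting. The path and state layout are illustrative only:

```python
import os
import torch

CKPT = "/shared/ckpt/latest.pt"   # assumed path on shared storage

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to continue training after a crash.
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, CKPT)

def resume_if_possible(model, optimizer):
    if not os.path.exists(CKPT):
        return 0                              # fresh run, start at step 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"] + 1                  # continue after the saved step
```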
In short, shortening the training cycle and reducing training costs for large model training in a super-large-scale distributed environment requires solving challenges in computing power scheduling, network communication, and training stability. The platform must make flexible, full use of all cluster resources and optimize data handling and communication by multiple means, while also handling the exceptions of large computing clusters promptly.
Second, AIStation simplifies and accelerates large model training end to end

Inspur Information's AIStation provides a systematically optimized, hardware-software integrated platform and software stack to meet the needs of large model training. The platform is optimized across resource usage and scheduling, the training process and its safeguards, and algorithms and applications, achieving end-to-end optimization and acceleration of large model training.
Fig. 2 AIStation fully supports and safeguards large model workloads
1. Millisecond-level scheduling for efficient use of large-scale computing power, solving low utilization
In large model training practice, AIStation optimizes the performance of the cloud-native scheduling system, achieving startup and environment readiness for thousands of pods. Compared with the native community scheduler, the AIStation scheduler substantially improves scheduling performance for large-scale pod-based tasks, ensuring that the computing resources of large model training are scheduled and used effectively.
In addition, the AIStation platform supports the distinctive development patterns of large models, offering job resources at multiple scales: small-scale scheduling, large-scale scheduling, high-performance scheduling, and more. Its computing power scheduler derives a sound job execution plan by dynamically and intelligently managing and allocating cluster resources, maximizing resource usage, meeting the latency and throughput requirements of diverse training tasks, and keeping jobs running efficiently and stably, achieving high utilization, strong scalability, and high fault tolerance of the computing platform.
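One building block behind such execution plans is gang scheduling: a distributed job launches only when every worker can be placed at once, so partially allocated GPUs never sit idle waiting for stragglers. The toy Python sketch below illustrates the all-or-nothing decision; the data structures are invented for illustration and do not reflect AIStation's actual scheduler:

```python
def gang_schedule(job_gpus_needed, free_gpus_per_node):
    """Return a {node: gpus} placement, or None if the full gang won't fit."""
    placement, remaining = {}, job_gpus_needed
    for node, free in sorted(free_gpus_per_node.items(),
                             key=lambda kv: -kv[1]):   # pack big nodes first
        take = min(free, remaining)
        if take:
            placement[node] = take
            remaining -= take
        if remaining == 0:
            return placement
    return None  # don't hold partial resources; retry when the gang fits

print(gang_schedule(16, {"node1": 8, "node2": 4, "node3": 8}))
# -> {'node1': 8, 'node3': 8}: the job starts only with all 16 GPUs secured
```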
Through a range of efficient resource management and scheduling strategies, AIStation achieves millisecond-level scheduling and raises overall resource utilization above 70%, helping customers better exploit the computing power of their clusters and realize its full value.
2. Efficient network resource management, with multi-card speedup ratios reaching 90%, accelerating the training process
AIStation defines mutually independent high-performance networks for computing and for storage, supports switch-level resource scheduling to reduce cross-switch traffic, and can automatically identify and handle network faults. For the demanding communication of large model training, AIStation provides cluster topology awareness and keeps the container network consistent with the physical cluster network, guaranteeing container interconnection performance and meeting training communication requirements. With distributed communication optimization, clustered InfiniBand or RoCE high-performance networks, and specially tuned communication topologies, AIStation achieved a multi-card speedup ratio of 90% in tests on a thousand-card-scale cluster. AIStation has also specifically optimized large model training over large-scale RoCE lossless networks, with measured network performance stability reaching an industry-leading level.
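To illustrate the switch-level scheduling idea, the toy sketch below prefers placing all of a job's nodes under a single leaf switch so that traffic never crosses switches. The topology map and node names are invented; AIStation's actual placement logic is not public:

```python
# Hypothetical leaf-switch topology: switch -> nodes attached to it.
TOPOLOGY = {
    "leaf-1": ["node1", "node2", "node3", "node4"],
    "leaf-2": ["node5", "node6", "node7", "node8"],
}

def place_with_affinity(nodes_needed, free_nodes):
    """Pick nodes from a single leaf switch if possible, else spill over."""
    for switch, members in TOPOLOGY.items():
        candidates = [n for n in members if n in free_nodes]
        if len(candidates) >= nodes_needed:
            return candidates[:nodes_needed]   # zero cross-switch hops
    # Fall back: span switches, accepting some cross-switch traffic.
    pool = [n for ms in TOPOLOGY.values() for n in ms if n in free_nodes]
    return pool[:nodes_needed] if len(pool) >= nodes_needed else None

print(place_with_affinity(3, {"node2", "node3", "node4", "node5"}))
# -> ['node2', 'node3', 'node4']: all three land under leaf-1
```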
With the help of the AIStation platform, a large commercial bank brought up mainstream large model training frameworks such as DeepSpeed and Megatron-LM for large language model training in a RoCE network environment, quickly moving its large model work into practice.
3. System-level optimization of large-scale training cuts fault handling time by 90%, minimizing experimental cost

Submitting a large model task usually involves extensive environment configuration, dependency-library adaptation, and hyperparameter tuning. AIStation automates the configuration of the computing, storage, and network environments, lets users customize key hyperparameters, and allows distributed large model training to be launched in just a few steps. Many large model training frameworks and open-source projects are supported, including Megatron-LM and DeepSpeed.
Fig. 3 Rapid deployment of Megatron-LM on AIStation, supporting the whole training process
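For context, a minimal DeepSpeed training job of the kind such templates would launch might look like the sketch below. The model, data, and configuration values are placeholders, not AIStation's actual templates:

```python
import torch
import deepspeed

# Illustrative DeepSpeed config; real jobs tune these per model and cluster.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # shard optimizer state across ranks
}

model = torch.nn.Linear(1024, 1024)      # stand-in for a real transformer
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Stand-in data; a real job would stream tokenized text from storage.
loader = [torch.randn(4, 1024).to(engine.device) for _ in range(8)]

for batch in loader:
    loss = engine(batch.bfloat16()).pow(2).mean()  # placeholder loss
    engine.backward(loss)                          # precision-aware backward
    engine.step()                                  # optimizer step + grad reset
```

A script like this would typically be started with the DeepSpeed launcher, e.g. `deepspeed --num_gpus=8 train.py`, with the platform supplying the multi-node environment.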
On large training clusters, AIStation uses a self-developed data caching system to raise data read rates before and during training, greatly reducing dependence on the storage system and network. Combined with optimized scheduling strategies, model training efficiency improves by 200%-300% compared with reading directly from the storage system, fully unleashing hardware performance.
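The caching idea can be illustrated with a toy PyTorch dataset that copies each sample from shared storage to local NVMe on first access, so later epochs read locally. This is a conceptual sketch only; AIStation's cache system is proprietary, and the paths are placeholders:

```python
import os
import shutil
from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Serve samples from a local NVMe cache, filling it on first access."""

    def __init__(self, remote_paths, cache_dir="/nvme/cache"):
        self.remote_paths = remote_paths      # files on shared storage
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.remote_paths)

    def __getitem__(self, idx):
        remote = self.remote_paths[idx]
        local = os.path.join(self.cache_dir, os.path.basename(remote))
        if not os.path.exists(local):         # cache miss: pull once
            shutil.copy(remote, local)
        with open(local, "rb") as f:          # cache hit: fast local read
            return f.read()
```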
Robustness and stability are prerequisites for completing large model training efficiently. In response to cluster emergencies such as resource failures, AIStation automatically applies fault-tolerant handling or elastic scaling strategies so that training tasks recover as quickly as possible after an interruption, providing a reliable environment for the long-running training that large models require and cutting exception handling time by more than 90% on average.
Fig. 4 Exception handling and checkpoint-resumption workflow for large-scale pre-training tasks
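A simple way to picture the recovery loop is a supervisor that relaunches a failed training process, which then resumes from its latest checkpoint (as in the earlier checkpoint sketch). The command, flags on `train.py`, and retry budget below are illustrative, not AIStation internals:

```python
import subprocess
import time

# Hypothetical training command; --resume would call resume_if_possible().
CMD = ["torchrun", "--nproc_per_node=8", "train.py", "--resume"]
MAX_RESTARTS = 5

for attempt in range(1 + MAX_RESTARTS):
    rc = subprocess.run(CMD).returncode
    if rc == 0:
        break                                # training finished normally
    print(f"training exited with {rc}, restarting (attempt {attempt + 1})")
    time.sleep(30)                           # back off before relaunching
```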
In summary, for large-scale distributed computing, AIStation has a built-in adaptive distributed training system covering the entire training life cycle. It meets the many demands of large model training by providing a resource usage view, computing and network scheduling strategies, distributed training acceleration, training monitoring, and fault tolerance with self-healing. While accelerating training, it automatically locates faults and recovers tasks, ensuring stability and efficiency. Under AIStation's intelligent fault-tolerance mechanisms, a bank customer achieved rapid fault troubleshooting and recovery in extremely stringent business commissioning tests, greatly reducing commissioning time.
Third, AIStation helps industry improve the efficiency of large model development
The AIStation platform has accumulated valuable experience and technology in AI development, application deployment, and large model engineering practice, helping customers across industries cut costs and raise efficiency in resources, development, and deployment. In vertical industries, AIStation helps leading financial customers and biopharmaceutical service companies quickly use dense data to train and validate large models, greatly reducing the business cost of large model work. The parallel computing cluster that a large commercial bank built on AIStation won the 2022 IDC "Future Digital Infrastructure Leader" award for its leading large-scale distributed training support.
Inspur Information's AIStation has built industry-leading experience with large models and achieved end-to-end optimization, making it an AI platform well suited to the large model era. Going forward, AIStation will evolve alongside the Inspur Information OGAI software stack, helping customers rapidly develop and deploy large models and seize the initiative through low-code, standardized large model development workflows and low-cost, efficient inference service deployment.