Everyone can have a ChatGPT: Microsoft's DeepSpeed Chat released, with one-click RLHF training of hundred-billion-parameter models

2025-02-22 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)11/24 Report--

Microsoft's open-source DeepSpeed Chat lets developers realize the dream of having their own ChatGPT!

Everyone's dream of a ChatGPT is about to come true?

Just now, Microsoft open-sourced a system framework, DeepSpeed Chat, which adds a complete RLHF pipeline to model training.

In other words, high-quality ChatGPT-like models of all sizes are now readily available!

Project address: https://github.com/microsoft/DeepSpeed

Unlock hundred-billion-parameter ChatGPT with one click, and save 15x in cost. As we all know, because OpenAI is not Open, the open-source community has launched models such as LLaMA, Alpaca, Vicuna, and Databricks-Dolly so that more people can use ChatGPT-like models.

However, due to the lack of an end-to-end RLHF training system at scale, training ChatGPT-like models remains very difficult. The arrival of DeepSpeed Chat fills exactly this gap.

Even better, DeepSpeed Chat dramatically lowers the cost.

Previously, expensive multi-GPU setups were beyond the reach of many researchers; and even with access to multi-GPU clusters, existing methods could not afford to train ChatGPT-like models with hundreds of billions of parameters.

Now, for as little as $1,620, you can train an OPT-66B model in 2.1 days with the hybrid engine DeepSpeed-HE.

If you use a multi-node, multi-GPU system, DeepSpeed-HE can train an OPT-13B model in 1.25 hours for $320 and an OPT-175B model in less than a day for $5120.

Elvis, a former Meta AI expert, excitedly retweeted it, calling it a big deal, and wondered how DeepSpeed Chat compares with ColossalChat.

Next, let's see how it works.

After training with DeepSpeed-Chat, the 1.3-billion-parameter version of "ChatGPT" performs remarkably well in Q&A: it not only grasps the context of the question, but also gives coherent answers.

In multi-turn conversations, this 1.3-billion-parameter "ChatGPT" also performs far beyond what one would expect at this scale.

Generate your first ChatGPT with a few lines of code. Of course, before you try it, you need to set up the environment:

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
pip install .

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/
pip install -r requirements.txt

One cup of coffee: train the 1.3-billion-parameter version of ChatGPT. If you have only about 1-2 hours (a coffee or lunch break), you can still try to train a "little toy" with DeepSpeed-Chat.

The team specially prepared a training example for the 1.3B model, which can be tested on the consumer GPU. Best of all, when you come back from your lunch break, everything is ready.

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1

Hardware: a consumer-grade NVIDIA A6000 GPU with 48GB of video memory.

Half a day, one GPU node: a 13-billion-parameter model. If you have only half a day and a single server node, you can use the pretrained OPT-13B as the actor model and OPT-350M as the reward model to generate a ChatGPT-like model with 13 billion parameters:

python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8

Hardware: a single DGX node with 8 NVIDIA A100-40G GPUs.

An ultra-cost-saving cloud solution: training a 66-billion-parameter model. If you have access to a multi-node cluster or cloud resources and want to train a larger, higher-quality model, just specify the model size (e.g., 66B) and the number of GPUs (e.g., 64) in the following line:

python train.py --actor-model facebook/opt-66b --reward-model facebook/opt-350m --num-gpus 64

Hardware: 8 DGX nodes, each equipped with 8 NVIDIA A100-80G GPUs.

Specifically, the time and cost of the DeepSpeed-RLHF system for different models and hardware configurations are as follows:

What is DeepSpeed Chat? DeepSpeed Chat is a general system framework that enables complete end-to-end RLHF training of ChatGPT-like models, helping us build our own high-quality ChatGPT-like models.

DeepSpeed Chat has the following three core functions:

1. A simplified training and inference experience for ChatGPT-like models

Developers need only one script to run all the training steps, and afterwards they can use the inference API for interactive testing.

2. DeepSpeed-RLHF module

DeepSpeed-RLHF replicates the training pipeline from the InstructGPT paper, and provides data abstraction and blending capabilities that let developers train with data from multiple different sources.

3. DeepSpeed-RLHF system

The team integrates DeepSpeed's training engine and inference engine into a unified hybrid engine (DeepSpeed Hybrid Engine, or DeepSpeed-HE) for RLHF training. Because DeepSpeed-HE can seamlessly switch between inference and training modes, it can take advantage of the various optimizations from DeepSpeed-Inference.

The DeepSpeed-RLHF system has unparalleled efficiency in large-scale training, making complex RLHF training fast, economical, and easy to popularize on a large scale:

Efficient and economical:

DeepSpeed-HE is more than 15 times faster than the existing system, making RLHF training fast and economical.

For example, DeepSpeed-HE can train an OPT-13B model in 9 hours on the Azure cloud and an OPT-30B model in 18 hours. The two types of training cost less than $300 and $600, respectively.

Superior scalability:

DeepSpeed-HE can support the training of models with hundreds of billions of parameters, and demonstrate excellent scalability on multi-node and multi-GPU systems.

Therefore, even a model with 13 billion parameters can be trained in 1.25 hours, and a model with 175 billion parameters takes less than a day to train with DeepSpeed-HE.

Democratizing RLHF training:

With a single GPU, DeepSpeed-HE can support training models with more than 13 billion parameters. This enables data scientists and researchers without access to multi-GPU systems not only to easily create lightweight RLHF models, but also to create large and powerful models for different usage scenarios.

A complete RLHF training pipeline. To provide a seamless training experience, the researchers follow InstructGPT and include a complete end-to-end training pipeline in DeepSpeed-Chat.

DeepSpeed-Chat's RLHF training pipeline diagram, including some optional features

The process consists of three main steps:

Step 1:

Supervised fine-tuning (SFT): fine-tune the pretrained language model on selected human answers so that it responds to a variety of queries.
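Conceptually, this first step is just next-token cross-entropy on human-written demonstrations. A minimal NumPy sketch of that loss (toy shapes and values; not the actual DeepSpeed-Chat code):

```python
import numpy as np

def sft_loss(logits, target_ids):
    """Supervised fine-tuning loss: average next-token cross-entropy
    over a sequence of target token ids."""
    # logits: (seq_len, vocab) unnormalized scores; target_ids: (seq_len,)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())

# Toy example: 3 positions, vocabulary of 5 tokens
logits = np.zeros((3, 5))
logits[np.arange(3), [1, 2, 3]] = 10.0   # model is confident in the targets
print(sft_loss(logits, np.array([1, 2, 3])))  # near zero when predictions match
```

With uniform logits the loss equals ln(vocab_size), which is the usual sanity check before fine-tuning begins.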

Step 2:

Reward model fine-tuning: train a separate reward model (RW, usually smaller than the SFT model) on a dataset containing human rankings of multiple answers to the same query.
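InstructGPT-style reward models are typically trained with a pairwise ranking loss that pushes the score of the human-preferred answer above the rejected one. A hedged sketch of that loss (illustrative function, not DeepSpeed-Chat's implementation):

```python
import numpy as np

def reward_pair_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    diff = np.asarray(r_chosen) - np.asarray(r_rejected)
    # log1p(exp(-d)) is a numerically safer form of -log sigmoid(d) for d >= 0
    return float(np.mean(np.log1p(np.exp(-diff))))

# The loss shrinks as the preferred answer's score pulls ahead:
print(reward_pair_loss([5.0], [0.0]))  # small
print(reward_pair_loss([0.0], [0.0]))  # ln 2: no preference learned yet
```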

Step 3:

RLHF training: the SFT model is further fine-tuned with the Proximal Policy Optimization (PPO) algorithm, using reward feedback from the RW model.
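The heart of PPO is its clipped surrogate objective, which limits how far each update can move the policy away from the one that generated the experience. A small NumPy sketch, simplified to scalar log-probabilities and advantages (not the DeepSpeed-Chat implementation):

```python
import numpy as np

def ppo_clip_objective(new_logp, old_logp, advantage, eps=0.2):
    """Clipped PPO surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A),
    where r is the probability ratio between the new and old policies."""
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))

# With a positive advantage, the objective stops rewarding ratios beyond 1 + eps:
print(ppo_clip_objective(np.array([0.0]), np.array([-1.0]), np.array([1.0])))  # 1.2
```

The clipping is what keeps step 3 stable: even if the actor's probability for a sampled token jumps, the gradient incentive is capped.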

In step 3, the researchers also provided two additional features to help improve the quality of the model:

- Exponential moving average (EMA) collection: an EMA-based checkpoint can be selected for the final evaluation.

- Mixed training: the pretraining objective (i.e., next-word prediction) is mixed with the PPO objective to prevent performance regression on public benchmarks such as SQuAD 2.0.
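Both optional features are simple at their core. A toy sketch of an EMA weight update and a blended PPO-plus-pretraining loss (the coefficient name `ptx_coef` is illustrative, not DeepSpeed-Chat's actual parameter):

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One exponential-moving-average step over a list of scalar weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

def mixed_loss(ppo_loss, ptx_loss, ptx_coef=0.5):
    """Blend the RL objective with the pretraining objective; the pretraining
    term anchors the model to its original language-modeling behavior."""
    return ppo_loss + ptx_coef * ptx_loss

# The EMA copy drifts slowly toward the live weights, smoothing out PPO noise:
ema = [0.0]
for _ in range(1000):
    ema = ema_update(ema, [1.0])
print(ema[0])            # roughly 1 - 0.999**1000, about 0.63
print(mixed_loss(1.0, 0.4))  # 1.2
```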

EMA and mixed training are two features often omitted by other open-source frameworks, since leaving them out does not prevent training from running.

However, as in InstructGPT, EMA checkpoints often provide better response quality than the final training checkpoint, and mixed training helps the model retain the benchmark abilities it had before RLHF.

Therefore, the researchers provide users with these features so that they can fully gain the training experience described in InstructGPT.

In addition to being highly consistent with the InstructGPT paper, the researchers also provide features that let developers train their own RLHF models with a variety of data resources:

Data abstraction and mixing capabilities:

DeepSpeed-Chat is equipped with (1) an abstract dataset layer to unify the format of different datasets, and (2) data splitting/blending functions, so that multiple datasets are properly blended and then split across the three training stages.
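A toy illustration of the two ideas: pool samples from several (hypothetical) sources into one format, then carve the pool into the three training stages. The source names and split ratios below are made up for the example:

```python
import random

def split_and_mix(datasets, ratios=(0.2, 0.4, 0.4), seed=0):
    """Blend samples from several sources, then split the pool into the
    three RLHF stages (SFT, reward-model training, PPO)."""
    rng = random.Random(seed)
    pool = [sample for samples in datasets.values()
            for sample in samples]          # unified, source-agnostic pool
    rng.shuffle(pool)                       # blend the sources together
    n = len(pool)
    cut1 = int(n * ratios[0])
    cut2 = cut1 + int(n * ratios[1])
    return {"sft": pool[:cut1], "reward": pool[cut1:cut2], "rlhf": pool[cut2:]}

splits = split_and_mix({"source_a": list(range(50)),
                        "source_b": list(range(100, 150))})
print({k: len(v) for k, v in splits.items()})  # {'sft': 20, 'reward': 40, 'rlhf': 40}
```

Keeping the three stage splits disjoint matters: reusing the same prompts across stages would let the reward model memorize answers it later has to score.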

DeepSpeed Hybrid Engine. Steps 1 and 2 of the instruction-guided RLHF pipeline resemble regular fine-tuning of large models; they are powered by the combination of ZeRO-based optimization and flexible parallelism strategies in DeepSpeed training to achieve scale and speed.

Step 3 of the pipeline is the most complex part in terms of performance impact.

Each iteration must efficiently handle two phases: a) the inference phase, for token/experience generation that produces training inputs; and b) the training phase, which updates the weights of the actor and reward models, plus the interaction and scheduling between the two.

This introduces two major difficulties: (1) memory cost, since multiple SFT and RW models must run throughout the third phase; and (2) slow answer generation, which, if not properly accelerated, significantly slows down the entire third phase.

In addition, the two important features the researchers added in the third phase, exponential moving average (EMA) collection and mixed training, incur additional memory and training costs.

To meet these challenges, the researchers consolidated all the system capabilities of DeepSpeed training and inference into a unified infrastructure, the Hybrid Engine.

It uses the original DeepSpeed engines for the fast training mode and effortlessly applies the DeepSpeed inference engine for the generation/evaluation mode, providing a significantly faster training system for the third stage of RLHF training.

As shown in the figure below, the transition between the DeepSpeed training and inference engines is seamless: by enabling the typical eval and train modes for the actor model, DeepSpeed applies different optimizations when running inference and training, making the model run faster and improving the throughput of the entire system.

Design of the DeepSpeed hybrid engine, used to accelerate the most time-consuming part of the RLHF process.

During inference execution in the experience-generation phase of RLHF training, the DeepSpeed hybrid engine uses a lightweight memory management system to handle the KV cache and intermediate results, together with highly optimized inference CUDA kernels and tensor-parallel computation. Compared with existing solutions, throughput (tokens per second) is greatly improved.
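The KV cache mentioned above stores each layer's attention keys and values so that every new token attends over the cached prefix instead of recomputing it. A minimal Python sketch of the idea (a conceptual illustration, not the engine's actual CUDA-level implementation):

```python
class KVCache:
    """Minimal sketch of a per-layer key/value cache: each decoding step
    appends one key/value pair instead of recomputing the whole prefix."""
    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, k, v):
        self.keys[layer].append(k)
        self.values[layer].append(v)
        # Return the full cached prefix, which attention would consume
        return self.keys[layer], self.values[layer]

    def clear(self):
        """Free the cache between generated sequences."""
        for ks, vs in zip(self.keys, self.values):
            ks.clear()
            vs.clear()

cache = KVCache(num_layers=2)
for step in range(3):                      # three decoding steps
    ks, vs = cache.append(0, f"k{step}", f"v{step}")
print(len(ks))  # 3: attention sees all cached keys without recomputation
```

Managing this cache carefully is what makes the generation phase memory-bound rather than compute-bound, which is why the hybrid engine pairs it with optimized inference kernels.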

During training, the hybrid engine enables memory optimization techniques such as DeepSpeed's ZeRO family and Low-Rank Adaptation (LoRA).

The researchers designed and implemented these system optimizations to be compatible and composable, providing the highest training efficiency under the unified hybrid engine.

The hybrid engine can seamlessly change the model partitioning between training and inference, supporting tensor-parallel inference and ZeRO-based sharded training.

It can also reconfigure the memory system to maximize memory availability in each mode.

This avoids memory allocation bottlenecks, supports large batch size, and greatly improves performance.

In short, the hybrid engine pushes the boundaries of modern RLHF training, delivering unparalleled scale and system efficiency for RLHF workloads.

Compared with existing systems such as Colossal-AI or HuggingFace-DDP, DeepSpeed-Chat delivers more than an order of magnitude higher throughput, and under the same latency budget can train larger actor models, or comparable models at lower cost.

For example, on a single GPU, DeepSpeed boosts RLHF training throughput by more than 10x. While both CAI-Coati and HF-DDP can run a 1.3B model, DeepSpeed can run a 6.5B model on the same hardware, a direct 5x increase in model scale.

On multiple GPUs in a single node, DeepSpeed-Chat's system throughput is 6-19x that of CAI-Coati and 1.4-10.5x that of HF-DDP.

The team said that one of the key reasons DeepSpeed-Chat achieves such results is the acceleration the hybrid engine provides during the generation phase.

Reference:

https://github.com/microsoft/DeepSpeed

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)
