Microsoft launches FP8 mixed-precision training framework: 64% faster than BF16, up to 42% less memory

2025-01-22 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 11/24 report --

CTOnews.com, November 10 report: Large language models (LLMs) are rising rapidly, showing bright prospects in language generation and understanding, and their influence extends beyond language into logic, mathematics, physics, and other fields.

Unlocking these extraordinary abilities, however, comes at a high price: training the 540B PaLM model required 6,144 TPUv4 chips, and training the 175B GPT-3 consumed thousands of petaflop/s-days of compute.

A promising remedy is low-precision training, which improves processing speed and reduces memory usage and communication costs. Mainstream training systems such as Megatron-LM, MetaSeq, and Colossal-AI default to FP16/BF16 mixed precision or FP32 full precision when training large language models.

Although these precision levels are essential for large language models, their computational cost is high.

Adopting FP8 low precision instead can double training speed, cut memory costs by 50% to 75%, and save communication costs.
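
The arithmetic behind those savings is straightforward: FP8 stores one byte per element, versus two for FP16/BF16 and four for FP32. A quick back-of-the-envelope sketch (the parameter count below is illustrative):

```python
# Bytes per element for common training dtypes.
BYTES_PER_ELEMENT = {"FP32": 4, "FP16/BF16": 2, "FP8": 1}

def weights_gb(n_params: int, dtype: str) -> float:
    """Memory for one copy of the weights, in gigabytes."""
    return n_params * BYTES_PER_ELEMENT[dtype] / 1e9

n = 175_000_000_000  # GPT-3-scale parameter count, for illustration
for dtype in BYTES_PER_ELEMENT:
    print(f"{dtype}: {weights_gb(n, dtype):.0f} GB")
# FP8 vs FP16/BF16: 1 - 1/2 = 50% less; FP8 vs FP32: 1 - 1/4 = 75% less
```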

At present, only Nvidia's Transformer Engine is compatible with FP8. It mainly uses this precision for GEMM (general matrix multiplication) computation while keeping master weights and gradients at higher FP16 or FP32 precision.
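
For context, here is a minimal sketch of how an FP8 GEMM is typically invoked through Transformer Engine's PyTorch API. It assumes the transformer_engine package and an FP8-capable GPU such as the H100; the layer sizes and recipe settings are illustrative:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID format: E4M3 for the forward pass, E5M2 for backward gradients,
# trading precision for range where each matters most.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

layer = te.Linear(4096, 4096, bias=True).cuda()  # master weights stay in high precision
x = torch.randn(16, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)      # the GEMM itself runs in FP8

y.sum().backward()    # gradients are produced and accumulated in higher precision
```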

To meet this challenge, a team of researchers from Microsoft Azure and Microsoft Research has launched an efficient FP8 mixed-precision framework tailored for large language model training.

Microsoft introduced three optimization levels that apply FP8 to distributed and mixed-precision training. As these levels progress, the degree of FP8 integration grows, indicating a greater impact on the LLM training process.
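
To see why deeper FP8 integration matters, consider the per-parameter memory of Adam-style optimizer states. The breakdown below is an assumed illustration of the idea, not Microsoft's exact recipe:

```python
def adam_bytes_per_param(weight: int, grad: int, master: int, m1: int, m2: int) -> int:
    """Total bytes per parameter across weight, gradient, master copy, and Adam moments."""
    return weight + grad + master + m1 + m2

# Conventional mixed precision: BF16 weight/grad, FP32 master weight and moments.
baseline = adam_bytes_per_param(weight=2, grad=2, master=4, m1=4, m2=4)  # 16 bytes

# Hypothetical FP8-integrated mix: FP8 gradients and first moment, FP16 elsewhere.
fp8_mix = adam_bytes_per_param(weight=2, grad=1, master=2, m1=1, m2=2)   # 8 bytes

print(f"baseline: {baseline} B/param, FP8 mix: {fp8_mix} B/param")
```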

In addition, to overcome data overflow and underflow, Microsoft's researchers propose two key methods: precision decoupling and automatic scaling. The former assigns reduced precision to components that are not precision-sensitive, while the latter dynamically adjusts tensor scaling factors to keep gradient values within FP8's representable range. This prevents underflow and overflow events during all-reduce communication and ensures a smoother training process.
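
Here is a minimal sketch of the automatic-scaling idea in plain PyTorch; the helper names and headroom value are illustrative, not Microsoft's implementation:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def scale_for_fp8(grad: torch.Tensor, headroom: float = 2.0):
    """Scale a gradient tensor so its dynamic range fits FP8 before
    low-precision communication; returns the scaled tensor and the factor used."""
    amax = grad.abs().max().clamp(min=1e-12)   # dynamic per-tensor maximum
    scale = FP8_E4M3_MAX / (amax * headroom)   # leave headroom against overflow
    return grad * scale, scale

def unscale(grad: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return grad / scale

g = torch.randn(1024) * 1e-5                   # tiny gradients prone to underflow
g_scaled, s = scale_for_fp8(g)
# ... cast g_scaled to FP8 and all-reduce it across workers here ...
g_restored = unscale(g_scaled, s)
print(torch.allclose(g, g_restored))           # True up to float rounding
```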

In Microsoft's tests, compared with the widely used BF16 mixed-precision method, the framework reduced memory footprint by 27% to 42% and cut weight-gradient communication overhead by 63% to 65%. It ran 64% faster than widely used BF16 frameworks (such as Megatron-LM) and 17% faster than Nvidia Transformer Engine.

When training the GPT-175B model, the FP8 mixed-precision framework saved 21% of memory on the H100 GPU platform and reduced training time by 17% compared with TE (Transformer Engine).

CTOnews.com attaches the paper address here: https://doi.org/10.48550/arXiv.2310.18313
