65-Billion-Parameter Large Model Pre-Training Solution, Open Source and Commercially Usable! LLaMA Training Accelerated by 38%, from a Star Open-Source Project


Shulou (Shulou.com) 11/24 report --

A pre-training solution for a 65-billion-parameter large model, open-sourced the moment it was released, with training 38% faster than conventional solutions.

That is the newly released Colossal-AI pre-training solution for LLaMA-like foundation models.

In the midst of the "war of a hundred models", owning one's own large model is often seen as a core competitive advantage, so very few companies are willing to open-source theirs. At the same time, training a large model from scratch demands a great deal of both technical expertise and capital.

Colossal-AI's latest open-source release therefore arrives at just the right moment: it places no restrictions on commercial use, and getting started takes only four steps.

What exactly does the project include? Let's take a look.

Open source address: https://github.com/hpcaitech/ColossalAI

Pre-training with just 32 A100 / A800 GPUs

In fact, ever since Meta open-sourced LLaMA, a wave of fine-tuning projects has sprung up on top of it, such as Alpaca, Vicuna and ColossalChat.

However, LLaMA only releases the model weights and restricts commercial use, and the amount of knowledge and capability that fine-tuning can enhance or inject is relatively limited.

For enterprises that genuinely want to join the wave of large models, training their own core large model is therefore essential.

The open-source community has already produced a series of related efforts:

RedPajama: an open-source, commercially usable LLaMA-like dataset (no training code or models)

OpenLLaMA: open-source, commercially usable LLaMA 7B / 13B models, trained with EasyLM on JAX and TPUs

Falcon: open-source, commercially usable LLaMA-like 7B / 40B models (no training code)

But none of these is sufficient: for the most mainstream PyTorch + GPU ecosystem, an efficient, reliable and easy-to-use pre-training solution for LLaMA-like foundation models has still been missing.

So Colossal-AI has delivered its latest open-source answer: pre-training a 65-billion-parameter LLaMA-scale model on just 32 A100 / A800 GPUs, with training speed improved by 38%.

By contrast, native PyTorch, FSDP and the like cannot even run the task on this hardware because GPU memory overflows.

Hugging Face Accelerate, DeepSpeed and Megatron-LM likewise provided no official support for LLaMA pre-training.
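To see why GPU memory is the bottleneck, here is a rough back-of-the-envelope estimate (ours, not from the article) in Python; the 16-bytes-per-parameter figure is the standard mixed-precision Adam accounting (fp16 weight and gradient plus fp32 master weight, momentum and variance).

# Rough estimate of the model states alone for a 65B-parameter model
params = 65e9
bytes_per_param = 16                                 # fp16 weight + fp16 grad + 3 x fp32 optimizer states
model_states_gb = params * bytes_per_param / 1024**3
print(f"model states: ~{model_states_gb:.0f} GB")    # roughly 970 GB

# A single 80 GB A100/A800 cannot hold this, so plain data parallelism fails outright;
# even sharded approaches must still fit activations and communication buffers into
# the 32 x 80 GB = 2560 GB available across the cluster.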

Ready to use out of the box in four steps

Getting started with this project is genuinely easy; there are just four steps:

1. Install Colossal-AI

2. Install other dependencies

3. Prepare the dataset

4. Run the command

The specific code is as follows:

Step 1: install Colossal-AI.

git clone -b example/llama https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install and enable CUDA kernel fusion
CUDA_EXT=1 pip install .

Step 2: install other dependencies.

cd examples/language/llama
# install other dependencies
pip install -r requirements.txt
# use flash attention
pip install xformers

Step 3: prepare the dataset.

The default dataset, togethercomputer/RedPajama-Data-1T-Sample, is downloaded automatically on the first run; alternatively, a custom dataset can be specified with -d or --dataset.
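For reference, the default corpus is hosted on Hugging Face, so a manual pre-download (for example, to warm a shared cache before launching the job) could look roughly like the sketch below. The pretrain.py script performs this download itself, and the "text" field name is an assumption about the dataset layout.

# Minimal sketch: pre-downloading the default RedPajama sample with Hugging Face datasets
from datasets import load_dataset

# newer `datasets` releases may additionally require trust_remote_code=True
ds = load_dataset("togethercomputer/RedPajama-Data-1T-Sample", split="train")
print(ds)                           # row count and columns
print(ds[0]["text"][:200])          # peek at the first document (assumes a "text" field)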

Step 4: run the command.

Speed-test scripts for both 7B and 65B are provided; you only need to set the multi-node hostnames according to your actual hardware environment to run the performance test.

cd benchmark_65B/gemini_auto
bash batch12_seq2048_flash_attn.sh

For an actual pre-training task, the usage is the same as for the speed test: launch the corresponding command, for example training the 65B model with 4 nodes of 8 GPUs each.

colossalai run --nproc_per_node 8 --hostfile YOUR_HOST_FILE --master_addr YOUR_MASTER_ADDR pretrain.py -c '65b' --plugin "gemini" -l 2048 -g -b 8 -a

With Colossal-AI's gemini_auto parallel strategy, multi-node, multi-GPU parallel training is easy to set up, reducing GPU memory consumption while maintaining high-speed training.

Depending on the hardware environment and actual requirements, more complex combinations of parallel strategies such as pipeline parallelism, tensor parallelism and ZeRO1 can also be selected.

Through Colossal-AI's Booster Plugins, users can easily customize parallel training, for example choosing among parallel strategies such as Low Level ZeRO, Gemini and DDP.
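As a rough illustration of how plugin selection looks in code, the sketch below follows Colossal-AI's Booster API as we understand it from the repository; the toy model, optimizer choice and the assumption that the script is launched under colossalai run (or torchrun) are ours, so treat it as a sketch rather than the project's exact training code.

# Minimal sketch: choosing a parallel strategy via a Booster plugin
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin, LowLevelZeroPlugin, TorchDDPPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})     # distributed setup; exact signature varies by release

plugin = GeminiPlugin()                     # or LowLevelZeroPlugin(), TorchDDPPlugin(), ...
booster = Booster(plugin=plugin)

# Placeholder model standing in for a LLaMA-like network
model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024))
optimizer = HybridAdam(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()

# Booster wraps model/optimizer/criterion for the chosen strategy; swapping
# plugins does not require touching the model code.
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)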

Gradient checkpointing reduces memory usage by recomputing the model's activations during backpropagation.
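To make the mechanism concrete, here is a generic PyTorch sketch (not Colossal-AI-specific code) of activation recomputation with torch.utils.checkpoint:

# Minimal sketch: trade compute for memory with activation checkpointing
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)   # forward: activations inside `block` are not stored
y.sum().backward()                              # backward: `block` is re-run to rebuild them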

Introducing the Flash Attention mechanism speeds up computation and saves GPU memory. Dozens of similar custom options can be controlled through command-line arguments, keeping performance high while preserving flexibility for custom development.
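For intuition, the sketch below shows the xformers memory-efficient attention kernel that the flash-attention install step above enables; the shapes and dtypes are illustrative, and a CUDA GPU is assumed.

# Minimal sketch: memory-efficient (flash-style) attention via xformers
import torch
import xformers.ops as xops

# (batch, seq_len, heads, head_dim), half precision on GPU
q = torch.randn(2, 2048, 32, 128, device="cuda", dtype=torch.float16)
k = torch.randn(2, 2048, 32, 128, device="cuda", dtype=torch.float16)
v = torch.randn(2, 2048, 32, 128, device="cuda", dtype=torch.float16)

# Computes softmax(q @ k^T / sqrt(d)) @ v without materializing the full
# seq_len x seq_len attention matrix, saving memory and bandwidth.
out = xops.memory_efficient_attention(q, k, v)
print(out.shape)                                # torch.Size([2, 2048, 32, 128])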

Colossal-AI's latest ShardFormer greatly reduces the cost of training LLMs with multi-dimensional parallelism.

It now supports a variety of mainstream models including LLaMA, with native support for the Hugging Face / transformers model library.

Without modifying the model, it supports various multi-dimensional parallelism configurations (pipeline, tensor, ZeRO, DDP, etc.) and performs well across a wide range of hardware setups.
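The snippet below is only a loosely hedged sketch of what this looks like in practice: it loads a stock Hugging Face LLaMA model unmodified and hands it to ShardFormer. The class and argument names (ShardConfig, ShardFormer, optimize) follow the Colossal-AI shardformer examples as we understand them and should be verified against the current repository; the tiny model config is purely illustrative.

# Hedged sketch: sharding an unmodified Hugging Face LLaMA model (verify API names against the repo)
from transformers import LlamaConfig, LlamaForCausalLM
from colossalai.shardformer import ShardConfig, ShardFormer

# A stock transformers LLaMA model, deliberately small for the demo
config = LlamaConfig(hidden_size=512, num_hidden_layers=4, num_attention_heads=8)
model = LlamaForCausalLM(config)

# ShardFormer rewrites the model for the chosen parallel layout without
# requiring any changes to the model definition itself.
shard_config = ShardConfig()    # TP/PP/fusion knobs are set here (assumed defaults)
sharded_model, shared_params = ShardFormer(shard_config=shard_config).optimize(model)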

Colossal-AI: infrastructure for large model systems

Colossal-AI, which delivered the work above, has become a star development tool and community amid the current wave of large models.

The Colossal-AI solution above has already been deployed at a Fortune 500 company, where it performs excellently on a thousand-GPU cluster, completing pre-training of a private large model with hundreds of billions of parameters in only a few weeks.

InternLM, recently released by Shanghai AI Lab and SenseTime, is likewise built on Colossal-AI to achieve efficient pre-training at thousand-GPU scale.

Since going open source, Colossal-AI has repeatedly topped the worldwide GitHub trending list, earned more than 30,000 GitHub stars, and been selected for official tutorials at top international AI and HPC conferences such as SC, AAAI, PPoPP, CVPR and ISC, while hundreds of enterprises have joined in building the Colossal-AI ecosystem.

It is developed under the leadership of James Demmel, Distinguished Professor at the University of California, Berkeley, and Yang You, Presidential Young Professor at the National University of Singapore.

Colossal-AI is based on PyTorch and uses efficient multi-dimensional parallelism, heterogeneous memory management and other techniques to reduce GPU requirements, lowering the development and deployment costs of training, fine-tuning and inference for large AI models.

Luchen Technology, the company behind it, recently secured hundreds of millions of yuan in Series A financing and has completed three funding rounds within 18 months of its founding.

Open source address: https://github.com/hpcaitech/ColossalAI

Reference link: https://www.hpc-ai.tech/blog/large-model-pretraining
