
PyTorch Builds a Large Model "Acceleration Pack" in Under 1,000 Lines of Code, 10x Faster! Nvidia Scientist: One of the Best Tutorial Repos Since minGPT


The PyTorch team has made large model inference 10 times faster, using fewer than 1,000 lines of pure, native PyTorch code!

The project is called GPT-fast, and the acceleration looks like this:

Silky smooth, just silky smooth!

Better yet, the team released the code along with a detailed "tutorial", written in a simple, easy-to-follow style.

Horace He, a member of the development team, said:

We don't think of it as a library or a framework, but rather as an example that people can "copy and paste" and adapt to their own needs.

The post quickly blew up online, and Nvidia AI scientist Jim Fan commented:

This is one of the best tutorial repos since Andrej Karpathy released minGPT!

The open source world needs more projects like minGPT and GPT-Fast!

So how exactly does GPT-fast speed up large models?

In general, the "acceleration package" of the open-box model uses these methods:

torch.compile: a compiler designed specifically for PyTorch models that improves execution efficiency.

GPU quantization: speeds up the model by reducing the numerical precision of its computations.

Speculative decoding: uses a smaller model to predict the output of the larger model, accelerating generation from the large language model.

Tensor parallelism: speeds up processing by distributing the model's computation across multiple hardware devices.

Let's unpack them one by one.

The development team started with a straightforward PyTorch implementation, but performance was underwhelming (25.5 tok/s):
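
For context, such a plain eager-mode generation loop looks roughly like the sketch below. The model here is a hypothetical causal LM that returns logits of shape (batch, seq_len, vocab_size); the actual GPT-fast baseline differs in its details.

```python
import torch

@torch.no_grad()
def generate_naive(model, prompt_ids, max_new_tokens):
    # Plain eager-mode autoregressive loop: one full forward pass per new token,
    # greedy decoding for simplicity (assumes `model(tokens)` returns logits).
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(tokens)                       # re-runs the whole sequence each step
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```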

When they examined the trace, they found that one reason was that inference performance was bottlenecked by CPU overhead.

So how do you solve this?

Picture the GPU as a huge factory (with plenty of available compute) and the CPU as a small cart shuttling back and forth to "supply" that factory.

In many cases, the CPU simply cannot "feed" the GPU fast enough.

So the development team's suggestion is to give the GPU more work at once, handing it larger "chunks" of tasks at a time.

During inference, this can be done by introducing torch.compile.

torch.compile can capture a larger region of the model and compile it into a single compiled region. It is very effective at reducing CPU overhead, especially when run in "reduce-overhead" mode.
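
As a minimal sketch of how this is wired up, a single decoding step can be compiled as shown below. The decode_one_token function and the model(cur_token, input_pos) signature are illustrative assumptions modeled on a static-KV-cache setup, not GPT-fast's exact code.

```python
import torch

def decode_one_token(model, cur_token, input_pos):
    # One decoding step: a single forward pass plus greedy sampling.
    logits = model(cur_token, input_pos)
    return logits[:, -1, :].argmax(dim=-1, keepdim=True)

# "reduce-overhead" mode uses CUDA graphs to cut per-step CPU launch cost;
# fullgraph=True insists the whole step compiles into one region.
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
```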

The effect was immediate: a direct 4x improvement in performance, from 25 tok/s to 107 tok/s:

Next, the development team wanted to speed up further, but encountered memory bandwidth bottlenecks.

The development team calculated the model's memory-bandwidth utilization and found it had already reached 72%:
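
The calculation amounts to: bytes of weights read per token, times tokens per second, divided by the GPU's peak memory bandwidth. A rough back-of-the-envelope version is below; the ~6.7B parameter count and ~2 TB/s peak bandwidth are illustrative assumptions, not figures from the article.

```python
# Every generated token has to read all of the model weights once.
params = 6.7e9          # ~Llama-7B parameter count (assumed)
bytes_per_param = 2     # fp16 / bf16 weights
tokens_per_s = 107      # decoding speed after torch.compile
peak_bw = 2.0e12        # ~2 TB/s peak memory bandwidth (assumed, A100-class GPU)

achieved_bw = params * bytes_per_param * tokens_per_s   # bytes read per second
print(f"bandwidth utilization ~ {achieved_bw / peak_bw:.0%}")   # roughly 72%
```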

In other words, the room for further speed improvement may be limited.

Looking back at this calculation, the team noted that while you cannot really change the number of model parameters or the GPU's memory bandwidth (at least not without spending more money), you can change the number of bytes used to store each parameter.

This means that although you cannot resize the model or upgrade the hardware to improve performance, you can improve efficiency by reducing the amount of data required to store the model parameters.

This is usually achieved through quantization techniques, that is, reducing the number of bits required to represent each parameter.

So the development team introduced the next technique: int8 quantization.

Int8 weight quantization reduces the memory traffic and further improves performance (157.4 tok/s):
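
For illustration, weight-only int8 quantization can be sketched in native PyTorch as below: weights are stored as int8 with one scale per output channel and dequantized at matmul time. This is a simplified stand-in, not GPT-fast's actual int8 kernel.

```python
import torch
from torch import nn

class Int8WeightLinear(nn.Module):
    # Weight-only int8 linear layer: int8 storage plus per-output-channel fp scales.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()                    # (out_features, in_features)
        scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
        self.register_buffer("weight_int8", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize just in time (a fused kernel would avoid materializing this fp copy).
        w = self.weight_int8.to(x.dtype) * self.scale.to(x.dtype)
        return nn.functional.linear(x, w, self.bias)
```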

One problem remains after quantization: to generate 100 tokens, the model weights must be loaded (and the model invoked) 100 times, and this repeated loading of the weights is itself a source of inefficiency.

At first glance there seems to be no way around this, because autoregressive generation has a strict sequential dependency.

However, the development team pointed out that this strict sequence dependency can be broken by using speculative decoding.

For example, imagine a senior engineer, Verity, who is always right in technical decisions but relatively slow to write code.

At the same time, there is a junior engineer, Drake, who, in contrast to Verity, is not good at technical decision-making, but writes code faster and cheaper.

So how do you combine their different strengths to improve overall efficiency?

The method is simple: let Drake write the code, making the technical decisions along the way; then hand the code to Verity for review, and have Drake redo whatever turns out to be wrong.

In Transformer inference, the large verifier model plays the role of Verity, while Drake is a smaller draft model that can generate text much more quickly.

The development team uses the draft model to generate 8 tokens at a time, then has the verifier model process them in parallel, discarding any that do not match.

As a result, the serial dependency is broken and the speed is improved again.

It is worth noting that speculative decoding does not change the quality of the output. The approach pays off as long as drafting the tokens with the small model and verifying them in parallel with the large model takes less time than generating those tokens one by one with the large model alone.

Implementing this technique in native PyTorch is actually very simple, requiring only about 50 lines of code.
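
A simplified greedy version of the idea looks roughly like the sketch below. It assumes batch size 1 and models that return logits of shape (batch, seq_len, vocab_size); the real implementation samples probabilistically and reuses a KV cache, so the details differ.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, tokens, k=8):
    # 1. The draft model proposes k tokens autoregressively (cheap per step).
    draft = tokens
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1, :].argmax(dim=-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=1)

    # 2. The target model scores every proposed position in ONE parallel forward pass.
    logits = target_model(draft)                        # (1, n + k, vocab)
    n = tokens.shape[1]
    target_preds = logits[:, n - 1:, :].argmax(dim=-1)  # target's choice at each of the k+1 positions
    proposed = draft[:, n:]                             # the k draft tokens

    # 3. Accept the longest matching prefix, then take one token from the target itself.
    agree = (proposed == target_preds[:, :k])[0].long()
    num_accepted = int(agree.cumprod(dim=0).sum())
    accepted = proposed[:, :num_accepted]
    bonus = target_preds[:, num_accepted:num_accepted + 1]
    return torch.cat([tokens, accepted, bonus], dim=1)
```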

Because AMD GPUs also support Triton and the torch.compile backend, all of the optimizations previously applied on Nvidia GPUs can be reapplied on AMD GPUs.

With int8 quantization, the development team observed a speedup from 22 tok/s to 102 tok/s:

After that, the team used int4 quantization to push the speed further, but the model's accuracy dropped.

To compensate, they turned to group-wise quantization and to GPTQ, which adjusts the weights to reduce the quantization error.

In the end, while preserving accuracy, the speed rose to 202.1 tok/s:
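
The group-wise part of the scheme can be sketched as follows: instead of one scale per output channel, each small group of weights (here 32 elements, an assumed group size) gets its own scale, which recovers much of the lost accuracy. GPTQ's weight adjustment is omitted from this sketch.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    # Symmetric 4-bit weight quantization with one scale per group of `group_size`
    # elements (requires in_features to be divisible by group_size).
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    scale = (wg.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)   # int4 range [-7, 7]
    q = torch.round(wg / scale).clamp(-7, 7).to(torch.int8)   # kept unpacked as int8 for clarity
    return q.reshape(out_features, in_features), scale

def dequantize_int4_groupwise(q: torch.Tensor, scale: torch.Tensor, group_size: int = 32):
    out_features, in_features = q.shape
    qg = q.reshape(out_features, in_features // group_size, group_size)
    return (qg.to(scale.dtype) * scale).reshape(out_features, in_features)
```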

Combining all of the above techniques yields an even higher speed of 244.7 tok/s:

So far, the development team had been optimizing on a single GPU, but in many cases multiple GPUs are available.

Using multiple GPUs increases the aggregate memory bandwidth and thus improves the model's overall performance.

When choosing a parallelization strategy, the processing of a single token needs to be split across multiple devices, which calls for tensor parallelism.

PyTorch also provides low-level tools for tensor parallelism, which can be used in combination with torch.compile.

The development team also revealed that a higher-level API for expressing tensor parallelism is in development.

Even without that higher-level API, however, adding tensor parallelism is easy: about 150 lines of code, with no changes to the model (see the sketch below).
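
As a rough illustration of the idea (not the actual tp.py), a linear layer can be split column-wise across GPUs: each rank holds a shard of the output features, computes its slice locally, and the shards are gathered afterwards with a collective. This hand-rolled sketch uses only basic torch.distributed calls and assumes the process group is already initialized.

```python
import torch
import torch.distributed as dist
from torch import nn

class ColumnParallelLinear(nn.Module):
    # Each rank stores and computes a contiguous slice of the output features.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        rank, world = dist.get_rank(), dist.get_world_size()
        shard = linear.out_features // world            # assumes out_features % world == 0
        sl = slice(rank * shard, (rank + 1) * shard)
        self.weight = nn.Parameter(linear.weight[sl].detach())
        self.bias = nn.Parameter(linear.bias[sl].detach()) if linear.bias is not None else None

    def forward(self, x):
        local = nn.functional.linear(x, self.weight, self.bias)
        # Gather every rank's output shard and concatenate along the feature dimension.
        parts = [torch.empty_like(local) for _ in range(dist.get_world_size())]
        dist.all_gather(parts, local.contiguous())
        return torch.cat(parts, dim=-1)
```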

All of the optimizations above can be combined with tensor parallelism. Put together, they serve Llama-70B with int8 quantization at 55 tokens/s.

To sum up the results: leaving quantization aside, fast inference, speculative decoding and tensor parallelism are implemented in only 766 lines of code (model.py: 244 lines, generate.py: 371 lines, tp.py: 151 lines).

For Llama-7B, compile + int4 quantization + speculative decoding reaches 241 tok/s. For Llama-70B, adding tensor parallelism achieves 80 tok/s.

These results are close to, or exceed, the current SOTA.

Reference link:

[1] https://pytorch.org/blog/accelerating-generative-ai-2/?utm_content=273712248&utm_medium=social&utm_source=twitter&hss_channel=tw-776585502606721024

[2] https://twitter.com/DrJimFan/status/1730298947376443698

[3] https://twitter.com/cHHillee/status/1730293330213531844
