Microsoft has a new large-model architecture, and it is an open challenge to Transformer. The paper's title says it plainly:
Retentive Network (RetNet): A Successor to Transformer for Large Language Models.
The paper proposes a new retention mechanism to replace attention. The researchers, from Microsoft Research Asia and Tsinghua University, make no secret of their ambition and state it boldly:
RetNet achieves strong scaling results, parallel training, low-cost deployment, and efficient inference.
These properties make the architecture a powerful successor to Transformer for large language models.
The experiments also show that, on language modeling tasks:
RetNet achieves perplexity comparable to Transformer's
Inference is 8.4x faster
Memory footprint is reduced by 70%
It scales well
And once the model exceeds a certain size, RetNet outperforms Transformer.
Does Transformer finally have a successor? Let's look at the details.
Solving the "impossible triangle"
Transformer's importance for large language models is beyond doubt: OpenAI's GPT series, Google's PaLM, and Meta's LLaMA are all built on it.
But Transformer is not perfect: its parallel training comes at the cost of inefficient inference, with O(N) complexity per decoding step, and it is memory-hungry: the longer the sequence, the more memory it consumes.
It is not that no one has tried to improve Transformer before, but the main research directions tend to trade one property for another:
Linear attention reduces inference cost, but its modeling performance is worse.
Recurrent neural networks cannot be trained in parallel.
In other words, these architectures face an "impossible triangle" whose three corners are parallel training, low-cost inference, and good scalability.
What the RetNet researchers want to do is to make the impossible possible.
Specifically, RetNet starts from the Transformer architecture and replaces standard self-attention with a multi-scale retention mechanism.
Compared with standard self-attention, the retention mechanism has several distinguishing features:
It introduces a position-dependent exponential decay term in place of softmax, which simplifies the computation and retains information from earlier steps in decayed form (sketched after this list).
It encodes position information in the complex domain instead of using absolute or relative positional encodings, which makes it easy to convert into a recurrent form.
In addition, the retention mechanism uses multiple decay rates across heads (multi-scale decay) to increase the model's expressiveness, and exploits the scale invariance of GroupNorm to improve the numerical precision of the retention layers.
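To make the decay idea concrete, here is a minimal sketch of the parallel form of retention for a single head, with a real-valued decay rate. It omits the complex-domain (xPos-style) rotation, GroupNorm, and multi-head handling described above; the function name and shapes are illustrative, not taken from the paper's code.

```python
import torch

def parallel_retention(q, k, v, gamma):
    """Single-head retention, parallel form (simplified sketch).

    q, k, v: (seq_len, d) projections of the input sequence.
    gamma:   decay rate in (0, 1).

    Instead of softmax(QK^T), the scores are weighted by a causal,
    position-dependent exponential decay gamma^(n - m) for m <= n.
    """
    seq_len = q.shape[0]
    idx = torch.arange(seq_len)
    # D[n, m] = gamma^(n - m) for m <= n, zeroed above the diagonal
    decay = torch.tril(gamma ** (idx.unsqueeze(1) - idx.unsqueeze(0)).float())
    scores = (q @ k.T) * decay   # no softmax normalization
    return scores @ v            # (seq_len, d)

# toy usage with random projections
q, k, v = (torch.randn(5, 8) for _ in range(3))
out = parallel_retention(q, k, v, gamma=0.9)
```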
▲ The dual forms of RetNet
Each RetNet block contains two modules: a multi-scale retention (MSR) module and a feed-forward network (FFN) module.
The retention mechanism supports three ways of representing a sequence:
Parallel
Recurrent
Chunkwise recurrent, a hybrid of the parallel and recurrent forms: the input sequence is split into chunks, computed with the parallel representation within each chunk and with the recurrent representation across chunks.
The parallel representation lets RetNet exploit GPUs for parallel training just as efficiently as Transformer.
The recurrent representation enables O(1) per-step inference, reducing memory usage and latency.
The chunkwise recurrent representation handles long sequences more efficiently.
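As a rough illustration of why the recurrent form gives O(1) per-step inference, here is a sketch of a single decoding step under the same simplifying assumptions as above (single head, real-valued decay, no normalization); the function name and shapes are again illustrative.

```python
import torch

def recurrent_retention_step(q_n, k_n, v_n, state, gamma):
    """One decoding step of retention in its recurrent form (sketch).

    state: (d, d) running summary S_{n-1}; q_n, k_n, v_n: (d,) vectors.
    The update S_n = gamma * S_{n-1} + k_n^T v_n keeps the cost of each
    step constant, independent of how many tokens came before.
    """
    state = gamma * state + torch.outer(k_n, v_n)   # S_n
    out = q_n @ state                               # o_n, shape (d,)
    return out, state

# toy usage: decode a short sequence token by token
d = 8
state = torch.zeros(d, d)
for _ in range(5):
    q_n, k_n, v_n = torch.randn(d), torch.randn(d), torch.randn(d)
    out, state = recurrent_retention_step(q_n, k_n, v_n, state, gamma=0.9)
```

Unrolling the update gives o_n = sum over m <= n of gamma^(n-m) * (q_n . k_m) * v_m, which matches the parallel sketch earlier: the same model can be trained in parallel and decoded recurrently.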
In this way, RetNet makes the "impossible triangle" possible. Here is how RetNet compares with other architectures:
Experimental results on language modeling further demonstrate RetNet's effectiveness.
The results show that RetNet achieves perplexity (PPL, a metric for evaluating language models; lower is better) comparable to Transformer's.
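For readers unfamiliar with the metric, perplexity is simply the exponential of the average per-token negative log-likelihood; the numbers below are made up for illustration, not taken from the paper.

```python
import math

# Perplexity = exp(average negative log-likelihood per token).
# Hypothetical per-token log-probabilities from some language model:
log_probs = [-2.1, -0.7, -1.5, -3.0, -0.9]
ppl = math.exp(-sum(log_probs) / len(log_probs))
print(f"perplexity = {ppl:.2f}")  # lower is better
```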
Meanwhile, at 7 billion parameters and an 8k input sequence length, RetNet's inference is 8.4x faster than Transformer's, with a 70% smaller memory footprint.
During training, RetNet also outperforms standard Transformer + FlashAttention, saving 25-50% of memory and accelerating training by 7x.
Notably, RetNet's inference cost is independent of sequence length, and its inference latency is insensitive to batch size, allowing high throughput.
In addition, once the model exceeds 2 billion parameters, RetNet performs better than Transformer.
Research team
RetNet's research team comes from Microsoft Research Asia and Tsinghua University. The co-first authors are Sun Yutao and Dong Li.
Sun Yutao holds a bachelor's degree in computer science from Tsinghua University and is currently an intern at Microsoft Research Asia.
Dong Li is a researcher at Microsoft Research Asia. He is also one of the authors of the much-discussed paper on a Transformer that can handle 1 billion tokens.
The corresponding author of the RetNet paper is Wei Furu. He is a Global Research Partner at Microsoft Research Asia, and the 1-billion-token Transformer also came from his research team.
Paper address:
https://arxiv.org/abs/2307.08621