

StreamingLLM framework debuts, claiming to "allow large models to handle infinite-length text"

2025-02-23 Update. From: SLTechnology News&Howtos


Shulou(Shulou.com)11/24 Report--

CTOnews.com, October 6 -- Researchers at the Massachusetts Institute of Technology (MIT) and Meta AI have developed a framework called StreamingLLM, which proposes a set of solutions to the memory and generalization problems that large language models encounter, claiming to "allow language models to handle infinite-length text content."

▲ Image source: GitHub. The research focus of StreamingLLM is removing the obstacles to deploying Efficient Streaming Language Models (ESLM), especially the problems that arise in "multi-round dialogue scenarios with long-term interaction".

The researchers point out that streaming language models face two main challenges:

The first challenge: caching the key-value (KV) states of previous tokens during the decoding phase consumes a large amount of memory.

The second challenge: current popular large language models struggle to generalize to texts longer than their training sequence length.
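To see why the first challenge matters, a back-of-envelope estimate of KV-cache growth helps. The model shape below (32 layers, 32 heads, head dimension 128, fp16) is an illustrative assumption in the style of a Llama-2-7B-class model, not a figure from the article:

```python
# Rough estimate of KV-cache memory during decoding.
# Model shape is an illustrative assumption, not from the article.

def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # Each decoded token stores one key and one value vector
    # per attention head, per layer.
    return 2 * seq_len * n_layers * n_heads * head_dim * dtype_bytes

for tokens in (4_096, 100_000, 4_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9} tokens -> {gib:,.1f} GiB of KV cache")
```

Since the cache grows linearly with sequence length, an uncapped cache at millions of tokens would need terabytes of memory, which is why streaming inference must bound what it keeps.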

CTOnews.com noted that many past studies have tried to address these challenges. One approach is to "expand the attention window" so that language models can handle text longer than the pre-training sequence length. Another is to maintain a fixed-size sliding window that attends only to the KV states of the most recent tokens, which keeps memory usage and decoding speed stable, but this strategy fails once the "sequence length exceeds the cache size".
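The sliding-window strategy and its failure mode can be sketched in a few lines. This is a hypothetical toy cache, not code from the StreamingLLM repository; it only illustrates that once the window is full, the earliest entries are evicted:

```python
from collections import deque

# Toy fixed-size "recent tokens only" KV cache -- the strategy the article
# says breaks down once the sequence outgrows the cache, because evicting
# the earliest entries discards exactly the initial tokens the model
# (per the attention-sink observation) relies on.

class WindowKVCache:
    def __init__(self, window_size):
        # deque with maxlen evicts the oldest entry automatically
        self.window = deque(maxlen=window_size)

    def append(self, key, value):
        self.window.append((key, value))

    def keys(self):
        return [k for k, _ in self.window]

cache = WindowKVCache(window_size=3)
for t in range(5):                 # decode tokens 0..4
    cache.append(f"k{t}", f"v{t}")
print(cache.keys())                # ['k2', 'k3', 'k4'] -- tokens 0 and 1 evicted
```

Tokens 0 and 1, including any would-be attention-sink token, are gone once the sequence exceeds the window, which is when this approach's output quality collapses.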

The central challenge for streaming language models is therefore "how to handle long text input without consuming excessive memory and without harming model performance".

The strategy adopted by StreamingLLM is to exploit the "attention sink" phenomenon. The researchers observed that in autoregressive language models, a few initial tokens receive a disproportionately large share of attention during generation, regardless of their actual relevance to the content. These highly-attended tokens act as attention sinks: even when they are not semantically important, the model still assigns them strong attention. By always retaining the KV states of these sink tokens alongside a sliding window of recent tokens, StreamingLLM keeps the model's attention computation stable no matter how long the input sequence grows.
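The cache policy described above can be sketched as follows. This is a hypothetical illustration of the described idea (pin the first few sink tokens, keep a sliding window of recent ones), not the actual StreamingLLM implementation:

```python
# Toy sketch of the described StreamingLLM cache policy: always keep the
# first n_sink "attention sink" tokens, plus a sliding window of the most
# recent tokens. Hypothetical illustration, not the real implementation.

class SinkKVCache:
    def __init__(self, n_sink=4, window_size=4):
        self.n_sink = n_sink
        self.window_size = window_size
        self.entries = []  # list of (position, key, value)

    def append(self, pos, key, value):
        self.entries.append((pos, key, value))
        sinks = self.entries[:self.n_sink]             # pinned forever
        rest = self.entries[self.n_sink:]
        recent = rest[-self.window_size:]              # sliding window
        self.entries = sinks + recent

    def positions(self):
        return [pos for pos, _, _ in self.entries]

cache = SinkKVCache(n_sink=2, window_size=3)
for t in range(10):                    # decode tokens 0..9
    cache.append(t, f"k{t}", f"v{t}")
print(cache.positions())               # [0, 1, 7, 8, 9]
```

The cache size stays constant (`n_sink + window_size`) however long the stream runs, which is the property that lets the memory footprint and decoding speed remain stable for arbitrarily long inputs.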

▲ Image source: GitHub. The key contribution of StreamingLLM is a simple and efficient scheme that lets a language model handle infinite-length text without fine-tuning, resolving the dilemma current language models face in streaming applications. Although streaming language models seem inevitable in the future, their development has been held back by limits on memory efficiency and by model performance on long sequences.

The research team reports that StreamingLLM enables Llama 2, MPT, Falcon, and Pythia to reliably process text of up to 4 million tokens, opening up more deployment possibilities for streaming language models.

References

Efficient Streaming Language Models with Attention Sinks

mit-han-lab/streaming-llm





© 2024 shulou.com SLNews company. All rights reserved.
