
Two lines of code to overcome the dialogue limits of large language models! Jia Jiaya's team at the Chinese University of Hong Kong, together with MIT, releases an ultra-long text extension technology


Shulou(Shulou.com)11/24 Report--

Getting lost halfway through, a model that turns lazy, a model that gets dumber as the context grows: anyone who has used a large language model product has run into limits on how much text can be entered. For example, when you want to discuss a longer piece of content with a large model, you have to split the input, and the model soon forgets the main points of what you entered earlier.

This is a typical dialogue defect of large language models. Like a child with congenital attention deficit, the model struggles to concentrate on reading a whole new book. The root of the defect is the model's lack of long-text processing ability. That situation has now changed.

Recently, the new technology and models released by Jia Jiaya's team together with MIT have quietly climbed the hot lists of the major open-source sites: topping the Hugging Face hot list, topping Papers with Code, and ranking fifth among GitHub's trending Python projects. GitHub stars exceeded 1,000 within a week, and related technical posts on Twitter have been viewed nearly 180,000 times.

GitHub stars have reached 1.3k.


The technology, called LongLoRA, is practical yet surprisingly simple: with only two lines of code and a single 8-card A100 machine, it extends the text length of a 7B model to 100k tokens and of a 70B model to 32k tokens. At the same time, the research team also released LongAlpaca, the first long-text dialogue language model with 70B parameters.

The world's first 70B long-text large language model is released

The introduction of LongLoRA makes it possible, for the first time, to address the dialogue defect shared by large language models everywhere. From now on, papers dozens of pages long, reports hundreds of pages long, and full-length literary works are no longer blind spots for large models.

Some professionals enthusiastically called LongLoRA a lamp of hope in the labyrinth of large language models: it represents the industry's renewed attention to long-text large language models, effectively expands the context window, lets a model consider and process longer text sequences, and is an innovative contribution to large language models.

Beyond the technical innovation itself, one of the major difficulties large language models face with long text is the lack of open long-text dialogue data.

To this end, the research team collected 9k long-text question-answer pairs, covering Q&A over classic works, papers, in-depth reports and even financial statements.

Answering long questions alone is not enough. The team therefore mixed 3k short Q&A pairs with the 9k long Q&A pairs, so that the long-text model keeps its short-text dialogue ability. This complete dataset, called LongAlpaca-12k, is now open source.
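For a concrete picture, here is a minimal sketch of how a LongAlpaca-12k-style mixture could be assembled. The file names and JSON format are assumptions for illustration, not the project's actual data pipeline.

```python
# Hypothetical sketch: mix ~9k long-text QA pairs with ~3k short QA pairs
# into one 12k-example training set. File names and format are assumed.
import json
import random

with open("long_qa_9k.json", encoding="utf-8") as f:
    long_qa = json.load(f)      # list of {"instruction": ..., "output": ...} items
with open("short_qa_3k.json", encoding="utf-8") as f:
    short_qa = json.load(f)

dataset = long_qa + short_qa    # roughly 12k examples in total
random.shuffle(dataset)         # interleave long and short samples

with open("longalpaca_12k.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```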

Based on the LongAlpaca-12k dataset, the research team trained and evaluated models of different parameter sizes, 7B, 13B and 70B, namely LongAlpaca-7B, LongAlpaca-13B and LongAlpaca-70B.

Reading novels, revising papers, commenting on the economy: an all-around performer

Without further ado, let's pick a few demos at random and see how LongAlpaca, the large model built by combining LongLoRA technology with the 12k Q&A corpus, actually performs.

First, have the system read a new paper and propose revisions according to the ICLR review guidelines, so as to improve the paper's chance of acceptance. LongAlpaca's view: the paper's odds of acceptance improve by stating the novelty more precisely, providing more rigorous and comparable experimental results (including specific datasets and metrics), broadening the applications and future directions, and focusing on the key contributions and impact.

Next, have the system read two different new papers and ask LongAlpaca to summarize the stylistic differences between ICLR and CVPR papers. LongAlpaca concludes that CVPR papers tend to be more structured and experimental, emphasizing practicality and technical detail, while ICLR papers are more flexible in style, focusing on key theoretical analysis and mathematical derivation rather than a standard format.

It can be seen that the trained LongAlpaca model readily takes in new, long academic papers and is quite accurate when answering academic questions.

Next, let's look at how the LongAlpaca model handles the economic domain, where the barriers to reading and comprehension are high.

Fed the International Monetary Fund's World Economic Outlook summaries from 2012 to 2023, the model is asked to summarize the economic situation year by year and predict future trends. LongAlpaca tells us that the global economic situation in 2023 is uncertain, with growth expected to slow compared with 2022; over the next few years, global expansion will be mild and below expectations, while major challenges such as geopolitical tensions and financial-market volatility remain and need to be handled carefully through multilateral cooperation and structural reforms.

LongAlpaca can also read new novels and analyze their content. The comparison below pits LongAlpaca against Llama2, both with 13B parameters, to see how they perform.

In the classic Journey to the West, why is Sun Wukong so powerful when he wreaks havoc in Heaven, yet repeatedly frustrated on the journey to fetch the scriptures? LongAlpaca gives five reasons, which can be summarized as: "Sun Wukong's immaturity, powerful opponents, the limits of his strength, the enemies' deception, and his lack of companions all led to his failures on the journey. But as time went on, he gained wisdom, experience and comrades-in-arms, and finally defeated his most powerful enemies."

Llama2's answer is comparatively thin: "Although he did wreak havoc in Heaven, it is not necessarily accurate to say that he represents invincible power. In fact, he was eventually defeated by the powerful Tang Sanzang in Journey to the West." Llama2 did not give the right answer and said more contextual information was needed.

This time, have the system read The Three-Body Problem afresh, then ask why Ye Wenjie contacted the aliens in the first part, and why she regretted it later. LongAlpaca answers that Ye Wenjie first contacted the aliens out of idealism and revenge, but as time went on her view of humanity changed, and her remorse grew as she realized the danger she might pose to all humankind. The answer is very clear.

Llama2 replied, "by interacting with aliens, she hopes to gain a deeper understanding of the nature of the universe and her place in the universe. Ye Wenjie gradually realized that aliens and their technology will not solve her problems." The answer is general, and then begins to comment on the novel as a whole, which is not the answer to the question.

From these answers it can be seen that some models, such as Llama2 [2], may have seen the relevant novels during pre-training, but when asked questions based only on the novel's title, their answers are not ideal.

Comparing the two models' answers, LongAlpaca excels at revising academic papers, commenting on global economic trends and reading novels, beating Llama2 across the board.

Two lines of code and three key conclusions

Llama2 is arguably one of the most capable open-source models in the AI community and an industry leader, yet LongAlpaca beats it outright. The LongLoRA technology behind it has naturally drawn attention. How exactly does it work?

In the original large language models, the main computational overhead for long text lies in the self-attention mechanism (self-attention), whose cost grows quadratically with the length of the text.

To solve this problem, the research team proposed LongLoRA, which uses grouping and shifting to approximate the global self-attention mechanism.

Simply put, the tokens of the long text are divided into different groups, and attention is computed within each group, while the grouping pattern is shifted for different attention heads. This approach greatly reduces the amount of computation while still preserving a global receptive field as information propagates across groups.

And the implementation is very simple: the key change takes only two lines of code!
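The core idea can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration of the shifted group attention described above, not the team's exact implementation: the tensor names, the use of scaled_dot_product_attention and the head-splitting details are assumptions, and the official code lives in the project's GitHub repository.

```python
import torch
import torch.nn.functional as F

def shifted_group_attention(qkv: torch.Tensor, group_size: int) -> torch.Tensor:
    """Sketch of grouped attention with a shifted pattern across heads.

    qkv: projected queries/keys/values of shape (B, N, 3, H, D),
    where the sequence length N must be divisible by group_size.
    """
    B, N, _, H, D = qkv.shape
    qkv = qkv.clone()
    # Shift half of the heads by half a group along the token axis,
    # so information can flow across group boundaries.
    qkv[:, :, :, H // 2:] = qkv[:, :, :, H // 2:].roll(-group_size // 2, dims=1)
    # Fold each group of tokens into the batch dimension: attention is then
    # computed only within each group, cutting the quadratic cost.
    grouped = qkv.reshape(B * N // group_size, group_size, 3, H, D)
    q, k, v = (t.transpose(1, 2) for t in grouped.unbind(dim=2))  # (B*N/G, H, G, D)
    out = F.scaled_dot_product_attention(q, k, v)                 # per-group attention
    out = out.transpose(1, 2).reshape(B, N, H, D)
    # Roll the shifted heads back so token positions line up again.
    out[:, :, H // 2:] = out[:, :, H // 2:].roll(group_size // 2, dims=1)
    return out
```

In essence, the "two lines" are the shift before attention and the shift back afterwards; everything else is ordinary grouped attention.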

LongLoRA also explores low-rank training. The original low-rank methods, such as LoRA [5], do not transfer well to longer text lengths. On top of low-rank training, LongLoRA additionally fine-tunes the embedding layers and normalization layers, thereby approaching the effect of full-parameter fine-tuning (full fine-tune).
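As a rough illustration of this recipe, the sketch below combines standard LoRA adapters (via the Hugging Face PEFT library) with trainable embedding and normalization layers. The base checkpoint, target module names and hyperparameters are assumptions for illustration, not the team's released configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model id is illustrative; any Llama-style checkpoint would do.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)

# On top of the low-rank adapters, also unfreeze embedding and norm layers,
# which is the extra step LongLoRA adds over plain LoRA.
for name, param in model.named_parameters():
    if "embed" in name or "norm" in name:
        param.requires_grad = True
```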

When extending and training on texts of different lengths, the effects of LongLoRA, LoRA and full-parameter fine-tuning can be compared along three dimensions:

In terms of perplexity, the performance of the original LoRA method keeps deteriorating, while LongLoRA and full-parameter fine-tuning maintain good results across text lengths.

In terms of GPU memory consumption, both LongLoRA and the original LoRA save significantly compared with full-parameter fine-tuning. For example, for 8k-length model training, LongLoRA reduces memory consumption from 46.3GB to 25.6GB relative to full-parameter fine-tuning.

In terms of training time, for 64k-length model training, LongLoRA cuts the time from roughly 90 to 100 hours with conventional LoRA down to 52.4 hours, while full-parameter fine-tuning takes more than 1,000 hours.

Minimalist training methods, minimal compute and time consumption, and excellent accuracy make large-scale adoption of LongLoRA possible. All the relevant technology and models are open source, and interested users can deploy and try them for themselves.
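For those who want to try it, a minimal inference sketch with the transformers library might look like the following. The checkpoint id, file name and generation settings are assumptions, so check the project's GitHub page for the actually released models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yukang/LongAlpaca-13B"   # assumed checkpoint id; verify on the project page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Feed a long document followed by a question about it.
prompt = (open("paper.txt", encoding="utf-8").read()
          + "\n\nSummarize the key contributions of this paper.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```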

It is worth mentioning that this is another effort from Jia Jiaya's team, coming only two months after LISA, the "segment everything" multimodal large model released on August 9. The speed and capability of this research are as impressive as LongLoRA itself.
