Shulou(Shulou.com)11/24 Report--
With so many large models available, which one should you use? And when generation quality is hit-or-miss, what can you do?
Now there is a way to combine the expertise of multiple LLMs at full performance value, so you no longer have to choose.
With the LLM-Blender ensemble framework, you enter a question, and it automatically ranks the outputs of several LLMs, then "fuses" them to produce the best answer.
It works like a juicer: each open-source LLM is a different fruit, you put in a basketful, and the juice is blended in proportion to each fruit's strengths.
This approach not only reduces the biased errors and uncertain information of any single LLM, but its output also clearly beats even "the single best-performing LLM."
Netizens exclaimed: "it's amazing!"
A mixer that draws on many models' strengths
There are now plenty of open-source large models, and their performance varies widely. Usually we simply pick the best-performing one and prompt it for results.
However, this approach has limitations. We cannot consider every possible model output, and on new data a different model may turn out to be the best choice.
Current methods either score and compare the candidates and output an answer directly, or merge the answers of different LLMs; either way, harmful or incorrect information can survive the process, and output quality suffers accordingly.
Therefore, to address this problem and improve the robustness, generalization, and accuracy of LLMs, the Allen Institute for AI, the University of Southern California (USC), and Zhejiang University jointly published a new research paper.
The paper proposes an ensemble framework, LLM-Blender, that combines the strengths of multiple open-source large language models (LLMs): through two stages, ranking and fusion, it compares the outputs of different LLMs and then integrates the preferred candidates into a single output.
LLM-Blender mainly consists of two modules: "PairRanker" and "GenFuser".
The PairRanker module is a BERT-style cross-encoder that captures subtle differences between candidate outputs through bidirectional attention, then ranks the candidates.
PairRanker works as follows.
First, it collects the outputs of the N models for each input and forms all possible output pairs.
Then, a dedicated encoder compares each output pair to determine which candidate is better, capturing even small differences between them.
In the inference phase, a matrix containing the comparison results for all output pairs is computed; from this matrix, the ranking of all outputs for a given input is determined.
Finally, the top-ranked candidate output for each input is selected as the final result.
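The ranking step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the real PairRanker is a BERT-style cross-encoder, so the `compare` function here is a hypothetical stand-in for its pairwise score.

```python
from itertools import permutations

def rank_candidates(candidates, compare):
    """Rank candidates by aggregating pairwise comparison results.

    `compare(a, b)` is a hypothetical stand-in returning the probability
    that answer `a` beats answer `b`. A full N x N comparison matrix is
    built, then each candidate is scored by the sum of its wins (row sum).
    """
    n = len(candidates)
    # matrix[i][j] = P(candidate i is better than candidate j)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in permutations(range(n), 2):
        matrix[i][j] = compare(candidates[i], candidates[j])
    # Aggregate the matrix into per-candidate scores and sort descending
    scores = [sum(row) for row in matrix]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]

# Toy comparator for demonstration: pretend longer answers are better
demo = rank_candidates(["a", "abc", "ab"], lambda x, y: float(len(x) > len(y)))
print(demo)  # -> ['abc', 'ab', 'a']
```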
The GenFuser module is a Transformer-based encoder-decoder: a single encoder encodes the input text together with several candidates, and a single decoder generates the fused output.
Notably, only the top-ranked candidates are fed into the encoder, which both avoids "noise" contamination from weaker candidates and improves output quality.
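One way to picture how the encoder input is assembled: concatenate the instruction with only the top-K ranked candidates into a single string. The separator format below is an assumption for illustration, not GenFuser's exact template.

```python
def build_fuser_input(instruction, ranked_candidates, top_k=3):
    """Join the instruction with only the top-K ranked candidates into
    one encoder input string. The "Instruction:"/"Candidate i:" labels
    and the " | " separator are assumptions for illustration; the point
    is that lower-ranked, noisier candidates are simply dropped."""
    parts = [f"Instruction: {instruction}"]
    for i, cand in enumerate(ranked_candidates[:top_k], start=1):
        parts.append(f"Candidate {i}: {cand}")
    return " | ".join(parts)

print(build_fuser_input("Summarize the article.", ["ans A", "ans B", "ans C", "ans D"]))
# -> Instruction: Summarize the article. | Candidate 1: ans A | Candidate 2: ans B | Candidate 3: ans C
```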
To sum up, LLM-Blender works like this:
PairRanker compares the outputs of the N LLMs and ranks them; GenFuser then merges the top-ranked outputs to produce the final result.
According to the paper, this pipeline effectively filters and aggregates candidates to generate high-quality answers.
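Putting the two modules together, the whole pipeline can be sketched as below. All three callables (`llms`, `compare`, `fuse`) are toy stand-ins for the source LLMs, the PairRanker cross-encoder, and the GenFuser seq2seq model; this is a sketch of the data flow, not the authors' code.

```python
def llm_blender(question, llms, compare, fuse, top_k=3):
    """Rank N candidate answers pairwise, then fuse only the top-K."""
    # 1. Collect one candidate answer from each source LLM
    candidates = [llm(question) for llm in llms]
    # 2. PairRanker step: score each candidate by its pairwise wins
    n = len(candidates)
    scores = [
        sum(compare(candidates[i], candidates[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda t: -t[0])]
    # 3. GenFuser step: fuse only the top-K ranked candidates
    return fuse(question, ranked[:top_k])

# Toy stand-ins for demonstration only
llms = [lambda q: "short", lambda q: "a longer answer", lambda q: "medium one"]
out = llm_blender(
    "q", llms,
    compare=lambda a, b: float(len(a) > len(b)),  # pretend longer is better
    fuse=lambda q, cands: cands[0],               # trivial fusion: keep the best
)
print(out)  # -> a longer answer
```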
So how effective is it? Below is the team's evaluation.
The ensembled LLMs beat any single LLM
To evaluate on a large amount of data, the team introduced a benchmark dataset, MixInstruct, for benchmarking LLM ensemble models on instruction-following tasks.
The dataset provides 100,000 training samples, 5,000 validation samples, and 5,000 test samples. The team then ran 11 popular open-source models, including Vicuna, OpenAssistant, Alpaca, and MPT, on these 110,000 examples.
From the performance of the various LLMs on the MixInstruct dataset, it is clear that the models differ significantly, each with its own strengths and weaknesses. Open Assistant, Vicuna, and Alpaca are the three best-performing models, while Mosaic MPT, StableLM, and Flan-T5 rank lower.
The experiments also show that even strong models are unstable: models ranked below them frequently beat them.
For example, Koala's average GPT-Rank is 6.76, yet in about 40% of the test cases Koala produces results as good as or better than those of Open Assistant and Vicuna.
In the ranking comparison, PairRanker outperforms both BARTScore and the best single model under GPT-Rank (Open Assistant).
The team then used the top three candidates selected by PairRanker as input to GenFuser. On this basis, LLM-Blender delivers the expected strong performance: its average GPT-Rank reaches 3.01, clearly better than that of the best single model, Open Assistant (3.90).
It also scores well on BERTScore (79.09), BARTScore (-3.02), and BLEURT (-0.17).
The experiments show that ranking and then fusing within the LLM-Blender framework significantly improves the quality of the final output.
The team's LLM-Blender framework is an innovative ensemble approach: ranking mitigates the weaknesses of individual LLMs, and fusion-based generation combines their strengths, improving overall LLM capability in a genuinely novel way.
That said, the method still has room for optimization, such as incorporating more language models or trying different similarity-computation methods.
The research team
The paper was published by the Allen Institute for AI together with the University of Southern California (USC) and Zhejiang University. All three authors are from China and have ties to USC.
Dongfu Jiang is a senior majoring in computer engineering at Zhejiang University and is about to begin a PhD at the University of Waterloo in Canada. He previously worked as a research intern at USC, where he was mentored by the other two authors.
Xiang Ren is an associate professor of computer science at USC and director of the INK Lab. He studies machine common sense at the Allen Institute for AI and is also a Google scholar.
Bill Yuchen Lin is currently a young researcher at the Allen Institute for AI. He studied computer science at Shanghai Jiao Tong University as an undergraduate and earned his PhD in a computer-science-related field at USC.
In addition, scholars from USC's INK Lab and the Mosaic team at the Allen Institute for AI also contributed to the project.
Paper:
https://arxiv.org/abs/2306.02561
Reference link:
[1] https://yuchenlin.xyz/LLM-Blender/#bg
[2] https://twitter.com/billyuchenlin/status/1668666357058277377
This article is from the WeChat official account Qubit (ID: QbitAI), by Sean.