
UC Berkeley releases a large language model leaderboard: Vicuna takes first place, Tsinghua's ChatGLM enters the top 5


Shulou (Shulou.com) 11/24 report --

Who would have thought that large language models would one day play ranked matches, like players in Arena of Valor, LoL, or Dota! And word is that the closed-source models will soon be pulled into the arena as well.

Recently, researchers from LMSYS Org (led by UC Berkeley) made big news again: a qualifying tournament for large language models!

As the name implies, "LLM qualifying" pits a group of large language models against one another in random battles and ranks them by their Elo scores.

Then we can see at a glance whether a chatbot is all talk or truly "the strongest king."

Key point: the team also plans to bring in closed-source models from home and abroad, to see once and for all what they are really made of. (GPT-3.5 has already entered the anonymous arena.)

The anonymous chatbot arena looks like this:

Obviously, Model B got the right answer and won the game, while Model A didn't even understand the question.

Project address: https://arena.lmsys.org/

In the current rankings, Vicuna, with 13 billion parameters, ranks first at 1169 points; Koala, also with 13 billion parameters, ranks second; and LAION's Open Assistant ranks third.

Tsinghua University's ChatGLM, despite having only 6 billion parameters, still makes the top five, just 23 points behind the 13-billion-parameter Alpaca.

By contrast, Meta's original LLaMA ranks eighth (second from the bottom), while Stability AI's StableLM, the only model scoring in the 800s, comes in last.

The team says that the leaderboard will not only be updated regularly, but the algorithms and mechanisms will also be refined, with more detailed rankings provided for different task types.

All of the evaluation code and data analysis have already been released.

Putting LLMs into ranked play

For this evaluation, the team selected nine well-known open-source chatbots.

In each 1v1 match, the system randomly picks two contenders. The user chats with both bots at the same time and then decides which one chats better.

As you can see, there are four options at the bottom of the page: the left model (A) is better, the right model (B) is better, both are equally good, or both are bad.

Once the user submits a vote, the system reveals the models' names. The user can then keep chatting or start a new battle with a fresh pair of models.

However, the analysis only uses votes cast while the models were still anonymous. After about a week of data collection, the team had gathered a total of 4.7k valid anonymous votes.
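As a minimal sketch, here is one way to represent and filter those votes in Python. The field and option names are assumptions for illustration, not the actual LMSYS data schema, and treating "both are bad" as a tie is likewise just one possible convention:

```python
from dataclasses import dataclass

# Hypothetical battle record; the field names are assumptions,
# not the actual LMSYS data schema.
@dataclass
class Battle:
    model_a: str
    model_b: str
    vote: str        # "A is better" | "B is better" | "tie" | "both bad"
    anonymous: bool  # True if the model names were hidden at vote time

# Map each of the four vote options to an Elo outcome S for model A:
# win = 1.0, tie = 0.5, loss = 0.0 ("both bad" treated as a tie here).
VOTE_TO_SCORE = {
    "A is better": 1.0,
    "B is better": 0.0,
    "tie": 0.5,
    "both bad": 0.5,
}

def valid_outcomes(battles):
    """Keep only anonymous votes, as the team did, and yield
    (model_a, model_b, S_a) triples ready for Elo updates."""
    for b in battles:
        if b.anonymous:
            yield b.model_a, b.model_b, VOTE_TO_SCORE[b.vote]
```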

Before launch, the team had already formed rough prior rankings of the models from existing benchmark results.

Based on these priors, each model is preferentially matched against opponents of similar strength.

Uniform sampling is then mixed in to obtain better overall coverage of the rankings.

Toward the end of the qualifying run, the team also introduced a new model, fastchat-t5-3b.

Together, these operations lead to non-uniform battle frequencies across models.
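A minimal sketch of such a matchmaking step might look like the following. The mixing probability p_close and the neighbor window are illustrative assumptions, not values published by the team:

```python
import random

def pick_pair(models, ratings, p_close=0.7, window=2):
    """Pick two models for the next battle.

    With probability p_close, match a model against a nearby opponent
    in the current ranking (more informative games); otherwise sample
    a pair uniformly at random for better overall coverage.
    """
    if random.random() < p_close:
        ranked = sorted(models, key=lambda m: ratings[m], reverse=True)
        i = random.randrange(len(ranked))
        offsets = [d for d in range(-window, window + 1) if d != 0]
        j = min(max(i + random.choice(offsets), 0), len(ranked) - 1)
        if j == i:  # clamped onto itself at an edge of the ranking
            j = i + 1 if i == 0 else i - 1
        return ranked[i], ranked[j]
    return tuple(random.sample(list(models), 2))
```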

Number of battles per model pair

Statistically, most users chat in English, with Chinese ranking second.

Number of battles among the top 15 languages

Evaluating LLMs is really hard

Since ChatGPT took off, instruction-fine-tuned open-source language models have sprung up like bamboo shoots after a spring rain. It is fair to say that a new open-source LLM is released almost every week.

But the problem is that it is very difficult to evaluate these large language models.

Specifically, model quality is currently measured mostly with academic benchmarks: build a test set for some NLP task, then check the model's accuracy on it.

However, these academic benchmarks (HELM, for example) do not work well for large models and chatbots, for the following reasons:

1. Whether a chatbot is good is highly subjective, so it is hard to measure with existing methods.

2. These large models scan almost all of the data on the Internet during training, so it is hard to guarantee that the test set has never been seen. Worse still, if the test set is used to "train" the model directly, it is bound to score better.

3. In theory we can chat with a chatbot about anything, but many topics and tasks simply do not exist in current benchmarks.

So if you do not want to use these benchmarks, there is another way to go: pay people to grade the models.

In fact, that's what OpenAI does. But this method is obviously slow and, more importantly, too expensive.

To solve this thorny problem, a team from UC Berkeley, UCSD, and CMU devised a mechanism that is both fun and practical: the Chatbot Arena.

Compared with those approaches, a battle-based benchmark system has the following advantages:

Scalability

The system should scale to as many models as possible, even when it is not feasible to collect enough data for every potential model pair.

Incrementality

The system should be able to evaluate a new model with a relatively small number of trials.

Unique order

The system should provide a unique ordering of all models: given any two models, we should be able to tell which ranks higher, or whether they are tied.

The Elo rating system

The Elo rating system is a method for calculating the relative skill levels of players, widely used in competitive games and many sports. The higher the Elo score, the stronger the player.

League of Legends, Dota 2, PUBG, and the like all use this kind of mechanism to rank players.

For example, after you play enough ranked games in League of Legends, you get a hidden rating. It not only determines your rank but also ensures that the opponents you meet in ranked play are roughly at your level.

Moreover, an Elo score is absolute. In other words, when new chatbots are added in the future, we can still tell which chatbot is better directly from its Elo score.

Specifically, if player A's rating is R_A and player B's rating is R_B, the exact formula for player A's probability of winning (using a logistic curve with base 10) is:

E_A = 1 / (1 + 10^((R_B - R_A) / 400))

Then, a player's rating is updated linearly after each game.

Suppose player A (rated R_A) has an expected score of E_A but actually scores S_A (1 for a win, 0.5 for a draw, 0 for a loss). Player A's rating is updated as:

R_A' = R_A + K * (S_A - E_A)

where K is a constant that controls how much a single game can move the rating.
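Here is a compact sketch of these two formulas in Python. K = 32 is a common default in chess-style Elo, not necessarily the value the LMSYS team used, and the 1077-rated opponent below is made up for the example:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_rating(r_a: float, s_a: float, e_a: float, k: float = 32.0) -> float:
    """Linear Elo update from actual score s_a and expected score e_a."""
    return r_a + k * (s_a - e_a)

# Example: a 1169-rated model (Vicuna's leaderboard score) against a
# hypothetical 1077-rated opponent.
r_a, r_b = 1169.0, 1077.0
e_a = expected_score(r_a, r_b)
print(round(e_a, 2))                           # ~0.63: the favorite
# If A nevertheless loses this game (s_a = 0), its rating drops:
print(round(update_rating(r_a, 0.0, e_a), 1))  # ~1148.9
```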

1v1 win rates

In addition, the authors show each model's actual win rate in qualifying alongside the win rate predicted from its Elo score.

The results show that Elo scores predict win rates fairly accurately.
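As a sketch of how such a predicted win-rate table can be derived from the ratings alone: only Vicuna's 1169 appears in the article; the other two ratings here are invented for illustration.

```python
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Illustrative ratings: only Vicuna's 1169 comes from the article;
# the other two numbers are made up for the example.
ratings = {"vicuna-13b": 1169.0, "koala-13b": 1082.0, "chatglm-6b": 985.0}

# Predicted win rate of the row model over the column model.
for a, ra in ratings.items():
    row = {b: round(expected_score(ra, rb), 2)
           for b, rb in ratings.items() if b != a}
    print(a, row)
```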

Fraction of battles won by Model A, over all non-tied A vs. B battles

Predicted win rate of Model A in A vs. B battles, estimated from Elo ratings

About the authors

Chatbot Arena was released by LMSYS Org, the same organization that released Vicuna not long ago.

Co-founded by UC Berkeley Ph.D. student Lianmin Zheng and UCSD assistant professor Hao Zhang, the organization aims to make large models accessible to everyone through jointly developed open datasets, models, systems, and evaluation tools.

Lianmin Zheng

Lianmin Zheng is a Ph.D. student in the EECS department at UC Berkeley. His research interests include machine learning systems, compilers, and distributed systems.

Hao Zhang

Hao Zhang is currently a postdoctoral researcher at UC Berkeley. Starting in fall 2023, he will be an assistant professor at the Halıcıoğlu Data Science Institute and the Department of Computer Science at UC San Diego.

Reference:

https://lmsys.org/blog/2023-05-03-arena/

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)
