Large language models now have to play ranked matches, just like players in games such as Arena of Valor, LoL, and Dota!
Not long ago, researchers from LMSYS Org (led by UC Berkeley) made big news: a ranked leaderboard for large language models.
This time, the team brought not only four new players but also a (quasi) Chinese leaderboard.
OpenAI GPT-4
OpenAI GPT-3.5-turbo
Anthropic Claude-v1
RWKV-4-Raven-14B (open source)
There is no doubt that whenever GPT-4 enters the fray, it firmly holds first place.
Unexpectedly, however, Claude not only overtook GPT-3.5, the model that put OpenAI on the map, to claim second place, but also trails GPT-4 by only about 50 points.
By contrast, third-place GPT-3.5 is a mere 72 points ahead of Vicuna, the strongest open-source model, which has 13 billion parameters.
The 14-billion-parameter pure-RNN model RWKV-4-Raven-14B performed remarkably well, beating a group of Transformer models to take 6th place: excluding Vicuna, RWKV won more than 50% of its non-tied matches against every other open-source model.
In addition, the team produced two separate leaderboards: "English only" and "non-English" (mostly Chinese).
The rankings of many models change noticeably between the two.
For example, ChatGLM-6B, which was trained on more Chinese data, indeed performs better on the non-English board, and GPT-3.5 overtakes Claude to take second place.
The main contributors to this update are Sheng Ying, Lianmin Zheng, Hao Zhang, Joseph E. Gonzalez and Ion Stoica.
Sheng Ying is one of the three founders of LMSYS Org (the other two are Lianmin Zheng and Hao Zhang) and a doctoral student in the Department of Computer Science at Stanford University.
She is also the first author of FlexGen, a popular system that can run inference for a 175B-parameter model on a single GPU and has already earned 8k GitHub stars.
Paper address: https://arxiv.org/abs/2303.06865
Project address: https://github.com/FMInference/FlexGen
Personal homepage: https://sites.google.com/view/yingsheng/home
"Open Source" VS "closed Source" with the help of the community, the team collected 13k anonymous votes and made some interesting findings.
The gap between proprietary and open source among the three proprietary models, Anthropic's Claude model is more popular than GPT-3.5-turbo.
Moreover, Claude is highly competitive even against the most powerful model, GPT-4.
Judging from the win-rate chart below, Claude won 32 (48%) of the 66 non-tied games between GPT-4 and Claude.
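The win-rate numbers quoted here come from tallying pairwise votes and dropping ties. Below is a minimal Python sketch of that kind of tally; the battle-record format and model names are illustrative assumptions, not LMSYS's actual data schema.

```python
# Hypothetical battle records: (model_a, model_b, winner), where winner is "a", "b", or "tie".
battles = [
    ("gpt-4", "claude-v1", "a"),
    ("gpt-4", "claude-v1", "b"),
    ("gpt-4", "claude-v1", "tie"),
]

def non_tie_win_rate(battles, model, opponent):
    """Fraction of non-tied battles between `model` and `opponent` won by `model`."""
    wins = losses = 0
    for a, b, winner in battles:
        if {a, b} != {model, opponent} or winner == "tie":
            continue  # ignore unrelated pairs and ties, as the non-tie win rates do
        if (winner == "a" and a == model) or (winner == "b" and b == model):
            wins += 1
        else:
            losses += 1
    return wins / (wins + losses)

print(non_tie_win_rate(battles, "claude-v1", "gpt-4"))  # 0.5 on this toy data
```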
However, the other open-source models still lag far behind these three proprietary models.
In particular, GPT-4 leads the leaderboard with an Elo score of 1274, nearly 200 points higher than Vicuna-13B, the best open-source alternative on the list.
Excluding ties, GPT-4 won 82% of its games against Vicuna-13B and 79% of its games against the previous-generation GPT-3.5-turbo.
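For context on what those point gaps mean, the standard Elo formula maps a rating difference to an expected win probability. Here is a minimal sketch; the opponent ratings (other than GPT-4's 1274) are assumed for illustration, and LMSYS's exact computation (tie handling, K-factor) may differ.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score (win probability) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Rough illustration using the gaps quoted above (opponent ratings are assumptions):
print(round(elo_expected_score(1274, 1274 - 200), 2))  # ~0.76 for a ~200-point lead (GPT-4 vs. Vicuna-13B)
print(round(elo_expected_score(1274, 1274 - 50), 2))   # ~0.57 for a ~50-point lead (GPT-4 vs. Claude)
```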
It is worth noting, however, that the open-source models on the leaderboard generally have far fewer parameters than the proprietary models, ranging from 3 billion to 14 billion.
In fact, recent advances in LLMs and data curation have made it possible to achieve significant performance improvements with smaller models.
Google's latest PaLM 2 is a good example: it achieves better performance than its predecessors while using smaller model sizes.
As a result, the team is optimistic that open-source language models will catch up.
When will GPT-4 "roll over"?
In the figure below, the user poses a thorny problem that requires careful reasoning and planning. Although Claude and GPT-4 give similar answers, Claude's response is slightly better.
However, because of the randomness of sampling, the team found that this outcome could not always be reproduced: sometimes GPT-4 gives the same ordering as Claude, but it failed to do so in this particular generation.
In addition, the team noticed that GPT-4 behaves slightly differently through the OpenAI API than through the ChatGPT interface, which may be due to different prompts, sampling parameters, or other unknown factors.
An example where users prefer Claude over GPT-4.
In the following figure, despite their impressive capabilities, Claude and GPT-4 still struggle with this kind of complex reasoning problem.
An example where users judge both Claude and GPT-4 to be wrong.
Beyond these thorny cases, there are also many simple problems that require neither complex reasoning nor specialized knowledge.
For such cases, an open-source model like Vicuna can perform on par with GPT-4, so a slightly weaker (but smaller or cheaper) large language model (LLM) may be able to stand in for a more powerful one like GPT-4.
Changes in Elo scores.
Since the three powerful proprietary models joined, competition in the Chatbot Arena has never been fiercer.
Because the open-source models lost many games against the proprietary models, their Elo scores have all declined.
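As a rough illustration of why losing games lowers a score, here is a minimal sketch of the standard Elo update rule; the K-factor of 32 and the starting ratings are assumptions for the example, not the arena's actual parameters.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a game; score_a is 1 for an A win, 0.5 for a tie, 0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# An open-source model (assumed rating 1050) losing to a proprietary model (assumed 1250):
open_src, proprietary = elo_update(1050, 1250, score_a=0.0)
print(round(open_src), round(proprietary))  # 1042 1258: the loser's rating drops, the winner's rises
```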
Finally, the team plans to open up some APIs so that users can register their own chatbots to join the ranked matches.
Reference:
https://lmsys.org/blog/2023-05-10-leaderboard/
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).