
The first comprehensive evaluation of Llama-2: how it stacks up against open-source models at home and abroad


In July 2023, the development of large language models (LLMs) entered a new stage, and open source became the hottest topic.

On July 6, Shanghai AI Laboratory and SenseTime jointly released the InternLM (Scholar Puyu) open-source system (https://github.com/InternLM), which not only open-sourced the lightweight InternLM-7B but also led the way in open-sourcing the full-chain tool system spanning data, training, and evaluation, with a completely free commercial license.

On July 14, Zhipu AI made ChatGLM2-6B free for commercial use.

On July 19, Meta open-sourced the more powerful Llama-2 with a looser commercial license.

Facing this new wave of open-source language models, Turing Award laureate Yann LeCun commented on Twitter:

This is going to change the landscape of the LLM market.

But can the performance of these open-source models live up to the industry's enthusiasm?

As soon as we obtained the Llama-2 series of open-source models, we ran a comprehensive evaluation of them with OpenCompass (https://opencompass.org.cn).

How strong is Llama-2? Compared with Llama-1, Llama-2 incorporates many technical improvements, which effectively boost model performance, inference efficiency, and safety. The key improvements are as follows:

In the model architecture, Grouped-Query Attention (GQA) is used to improve inference efficiency, and the context length is doubled from 2K to 4K (a minimal sketch of GQA follows this list).

The pre-training corpus grew from 1.4T tokens to 2T tokens.

In the supervised fine-tuning (SFT) stage, Meta paid more attention to dataset quality: using fewer but higher-quality SFT examples worked significantly better than using millions of public SFT examples.

Three safety training techniques (Supervised Safety Fine-Tuning, Safety RLHF, and Safety Context Distillation) are introduced to improve model safety.
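
To make the Grouped-Query Attention change above concrete, here is a minimal PyTorch sketch of GQA; the head counts and dimensions are illustrative, not Llama-2's actual configuration:

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    n_heads, n_kv_heads = q.shape[1], k.shape[1]
    group = n_heads // n_kv_heads          # query heads sharing one KV head
    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)  # -> (batch, n_heads, seq, head_dim)
    v = v.repeat_interleave(group, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 8 query heads share 2 KV heads (hypothetical sizes, not Llama-2's),
# shrinking the KV cache 4x versus standard multi-head attention.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```

The efficiency gain comes from caching only the smaller set of key/value heads during decoding, which is why GQA speeds up inference.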

These changes put Llama-2 clearly ahead of the previous generation, but still short of ChatGPT. So what is its overall capability?

Although the official technical report already shows results on about 20 datasets, the ability dimensions it covers are still limited, and the comparison models are not comprehensive enough.

Here, with the help of the open-source evaluation tool OpenCompass, we comprehensively evaluated every model released in the Llama-2 series on more than 40 evaluation sets, measuring large-model ability along five dimensions: academic subjects, language, knowledge, comprehension, and reasoning.
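
As a rough illustration of what "measuring along five dimensions" means in practice, the sketch below averages per-dataset scores into per-dimension scores. The dataset-to-dimension grouping and most of the numbers are made up for illustration; they are not OpenCompass's actual configuration or results:

```python
from statistics import mean

# Hypothetical grouping of evaluation sets into ability dimensions.
DIMENSIONS = {
    "subject":       ["MMLU", "C-Eval"],
    "language":      ["Flores", "CHID"],
    "knowledge":     ["TriviaQA", "NaturalQuestions"],
    "comprehension": ["RACE", "C3"],
    "reasoning":     ["GSM8K", "MATH", "HumanEval"],
}

def dimension_scores(results: dict) -> dict:
    """Average each ability dimension over whichever datasets were evaluated."""
    out = {}
    for dim, datasets in DIMENSIONS.items():
        vals = [results[d] for d in datasets if d in results]
        if vals:
            out[dim] = round(mean(vals), 2)
    return out

# Placeholder per-dataset scores for one model, purely for illustration.
scores = {"MMLU": 69.75, "GSM8K": 63.46, "TriviaQA": 75.0, "RACE": 80.0}
print(dimension_scores(scores))
# {'subject': 69.75, 'knowledge': 75.0, 'comprehension': 80.0, 'reasoning': 63.46}
```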

The results can be summarized as follows:

The following table lists the performance of Llama, Llama-2, and ChatGPT on several representative evaluation sets:

For more comprehensive and detailed evaluation results, please refer to https://opencompass.org.cn.

Compared with the previous generation model, the overall improvement is as follows:

In terms of overall ability, Llama-2-70B outperforms Llama-1-65B, with significant gains across the language, knowledge, reasoning, comprehension, and subject dimensions. For example, the comprehensive test set MMLU rises from 63.71 to 69.75, and GSM8K from 54.51 to 63.46.

The chat model and the base model are broadly on par:

Compared with the base model Llama-2-70B, the fine-tuned and aligned Llama-2-70B-Chat shows almost identical overall ability: it improves on the base model in language, reasoning, and comprehension, while slipping slightly in subject and knowledge abilities. For example, on the translation evaluation set Flores and the code evaluation set HumanEval, the Chat model improves relatively by more than 40% and 20%, respectively, while on MMLU and TriviaQA it drops relatively by about 10%.
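
For clarity, "relative improvement/decrease" here means change measured as a percentage of the base model's score. A minimal sketch, with placeholder numbers rather than the actual evaluation scores:

```python
def relative_change(base: float, new: float) -> float:
    """Signed change of `new` versus `base`, as a percentage of `base`."""
    return (new - base) / base * 100

# Placeholder scores only: 50 -> 72 is a +44% relative improvement,
# and 69.8 -> 62.5 is roughly a -10% relative decrease.
print(relative_change(50.0, 72.0))   # 44.0
print(relative_change(69.8, 62.5))   # about -10.46
```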

But a large gap with ChatGPT remains:

Compared with ChatGPT-0613, Llama-2-70B-Chat still has ground to make up, especially in reasoning, comprehension, and subject abilities. On the math evaluation set MATH and the code evaluation set HumanEval, the gap is more than twofold.

Chinese ability is a clear weak spot. Chinese makes up a relatively small share of Llama's training corpus, and the fine-tuning stage does not tune for Chinese, so the current Llama-2-Chat remains weak in Chinese.

A typical symptom: given a question in Chinese, the model will often still answer in English.

To better understand Llama-2's Chinese and English abilities, we analyzed the Chinese and English datasets in OpenCompass separately.

The results show that:

In English language ability, knowledge level, and comprehension, Llama-2 is close to ChatGPT.

In Chinese, Llama-2 lags behind ChatGPT across the board. This suggests that Llama-2 on its own is not a particularly good choice as a base model for directly supporting Chinese applications.

In reasoning ability, Llama-2 still trails ChatGPT by a wide margin in both Chinese and English. Evidently, for large models, improving reasoning ability is much harder than improving basic language ability.

Safety alignment makes the model overcautious. A major feature of Llama-2 is the relatively thorough safety alignment scheme used during training, which brings large gains in value alignment and safety.

But in testing, we also found that Llama-2 does not balance safety and capability very well: the model is extremely cautious and refuses to answer many ordinary questions.

Domestic models are not falling behind. In recent months, Chinese large models have developed rapidly, with a number of companies and research institutions releasing their own models, including some with over a hundred billion parameters.

So how do the domestic models compare with Llama-2? Many readers care about this question.

Among domestic models at the 70B scale or above, the heavyweights are generally not open-sourced, and many are available only through limited invite-only APIs, so full evaluation data is hard to obtain for most of them.

On OpenCompass, the hundred-billion-parameter InternLM-104B, released by Shanghai AI Laboratory and SenseTime, has complete evaluation results.

Based on these results, we compare InternLM's performance with ChatGPT and Llama-2:

In this heavyweight comparison, InternLM performs well, leading Llama-2 and ChatGPT on most mainstream evaluation sets. Specifically, of the 43 evaluation sets, InternLM-104B surpasses ChatGPT on 34 and Llama-2-70B on 41.

A commanding lead on Chinese exams:

On the Chinese exam evaluation set C-Eval and the college entrance exam evaluation set GAOKAO-Bench, InternLM-104B far exceeds Llama-2-70B.

A slight edge in language proficiency:

InternLM-104B holds an advantage on basic language tasks in both Chinese and English, including word comprehension, idioms, and translation, with the gap especially pronounced in Chinese.

In reading comprehension, a "Scholar" worthy of the name:

On reading comprehension evaluation sets of all kinds, in both Chinese and English, InternLM-104B shows a clear advantage, better summarizing and grasping the key information in text passages.

Superior reasoning skills:

InternLM-104B performs consistently across common-sense reasoning, mathematical reasoning, and comprehensive reasoning datasets, holding a clear advantage over Llama-2-70B.

A draw on knowledge question answering:

On knowledge question-answering evaluation sets such as BoolQ, CommonsenseQA, TriviaQA, and Natural Questions, the two models perform similarly, showing no significant difference in knowledge level.

Code abilities split the wins:

The code abilities of InternLM-104B and Llama2-70B are roughly equal, with each taking wins on the HumanEval and MBPP datasets.

The lightweight model comparison. While domestic models are catching up on the heavyweight track, competition among open-source models on the 7B lightweight track is also very lively.

Among the many domestic open-source models, Baichuan-7B from Baichuan Intelligence, ChatGLM2-6B from Tsinghua University and Zhipu AI, and InternLM-7B from Shanghai AI Laboratory have attracted wide attention in the industry.

We compared these domestic models with Llama-2-7B across the board:

The following table lists the performance of these 7B models on several representative evaluation sets:

The results show that Llama-2 has a clear advantage in knowledge ability.

But in subjects, language, reasoning, and comprehension, InternLM and ChatGLM2 have both surpassed Llama-2, with InternLM clearly in the lead.

A few months ago, Llama's open-sourcing set the community ablaze, benefiting many developers and researchers and spawning the whole "alpaca family" of models, but regrettably its license restricted commercial use, shutting businesses out.

On July 6, at the World Artificial Intelligence Conference, the InternLM open-source system was officially released, open-sourcing InternLM-7B with a free commercial license.

Since then, open-source models such as ChatGLM2-6B and Llama-2 have successively moved to free commercial use, in keeping with the trend of development and the community's calls.

It is believed that this single spark in the open-source community will set a prairie fire across the industry, further lowering the barrier to deploying large-model applications.

* This article is published with the authorization of qubit; the views belong to the author alone.

-end-
