Shanghai AI Lab releases the "Scholar Puyu" (InternLM) model: surpasses ChatGPT on Chinese exams

2026-02-11 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)11/24 Report--

Thanks to CTOnews.com readers Wu Yanzu in South China and Mr. Aviation for the tip. CTOnews.com reported on June 7 that, according to the official account of the Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab), the lab released the 100-billion-parameter-scale large language model "Scholar Puyu" (InternLM) that day, together with SenseTime, the Chinese University of Hong Kong, Fudan University and Shanghai Jiao Tong University. Scholar Puyu has 104 billion parameters and was trained on a high-quality multilingual dataset containing 1.6 trillion tokens.

According to the Shanghai Artificial Intelligence Laboratory, comprehensive evaluation results show that Scholar Puyu performs well on many individual tasks, such as knowledge mastery, reading comprehension, mathematical reasoning and multilingual translation, and also has strong overall ability. As a result it does well on comprehensive exams, surpassing ChatGPT on a number of Chinese exams, including the dataset (GaoKao) covering the subjects of the Chinese national college entrance examination.

According to reports, the joint team behind Scholar Puyu selected more than 20 benchmarks to evaluate it, including four of the most influential comprehensive exam evaluation sets in the world: MMLU, a multi-task test set built by universities including the University of California, Berkeley; AGIEval, a subject-exam evaluation set proposed by Microsoft Research (covering the Chinese college entrance examination, the judicial examination, and the American SAT, LSAT, GRE and GMAT); C-Eval, a comprehensive exam evaluation set for Chinese language models jointly built by Shanghai Jiao Tong University, Tsinghua University and the University of Edinburgh; and Gaokao, a college entrance examination question set built by a research team at Fudan University.

The joint team of the laboratory conducted a comprehensive test on "Scholar Puyu", GLM-130B, LLaMA-65B, ChatGPT and GPT-4. The results of the above four evaluation sets are compared as follows (out of 100).

Scholar Puyu not only significantly surpasses academic open-source models such as GLM-130B and LLaMA-65B, but also leads ChatGPT on several comprehensive exams, including AGIEval, C-Eval and Gaokao, and reaches the same level as ChatGPT on MMLU, which is dominated by American exams. These exam results reflect Scholar Puyu's solid knowledge and strong overall ability.

Although Scholar Puyu achieved excellent results in these evaluations, large language models still have many limitations. Scholar Puyu is limited to a 2K context window (GPT-4's context window is 32K) and shows clear weaknesses in long-text comprehension, complex reasoning, code writing and mathematical and logical deduction. In addition, in real dialogue, large language models still commonly suffer from problems such as hallucination and conceptual confusion, so they have a long way to go before they can be used reliably in open scenarios.
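To make the context window limit concrete, here is a minimal sketch of how a long document would have to be split to fit a fixed window. It is an illustration only: the function name `split_into_windows` is hypothetical, "2K" is assumed to mean 2,048 tokens, and whitespace splitting stands in for the model's real tokenizer, which the article does not describe.

```python
def split_into_windows(text, max_tokens=2048):
    """Split text into consecutive chunks of at most max_tokens tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    A real tokenizer would usually produce a different token count.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


if __name__ == "__main__":
    document = "word " * 5000  # stand-in for a long input
    chunks = split_into_windows(document)
    print(len(chunks), [len(c.split()) for c in chunks])
```

Under these assumptions, a 5,000-word input does not fit in one window and must be handled in several pieces, which is one reason long-text comprehension is harder with a small window.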

Results on the four comprehensive exam evaluation sets

MMLU is a multi-task exam evaluation set built by the University of California, Berkeley (UC Berkeley), Columbia University, the University of Chicago and UIUC, covering elementary mathematics, physics, chemistry, computer science, American history, law, economics, diplomacy and other disciplines. The per-subject results are shown in the table below.

(Bold indicates the best result; underline indicates the second best.)

AGIEval is a new subject-exam evaluation set proposed by Microsoft Research this year. Its main goal is to evaluate language models on exams designed for humans, so that model intelligence can be compared with human intelligence. The evaluation set consists of 19 major test items based on various examinations in China and the United States, including the Chinese college entrance examination and judicial examination, as well as important American tests such as the SAT, LSAT, GRE and GMAT. Notably, nine of the 19 items come from the Chinese college entrance examination, and they are usually reported as an important subset, AGIEval (GK). In the following table, subjects marked GK are Chinese college entrance examination subjects.

C-Eval is a comprehensive exam evaluation set for Chinese language models, jointly built by Shanghai Jiao Tong University, Tsinghua University and the University of Edinburgh. It contains nearly 14,000 exam questions in 52 subjects, covering mathematics, physics, chemistry, biology, history, politics, computer science and other subjects, as well as professional examinations for civil servants, certified public accountants, lawyers and doctors. Test results are available through the leaderboard.

Gaokao is a comprehensive exam evaluation set based on Chinese college entrance examination questions, built by a research team at Fudan University. It covers the various subjects of the examination and includes multiple-choice, fill-in-the-blank and question-and-answer items. In the GaoKao evaluation, Scholar Puyu leads ChatGPT on more than 75% of the items.

Itemized evaluation: excellent reading comprehension and reasoning ability

To check that the model is not lopsided, strong in some areas and weak in others, the researchers also evaluated and compared the individual abilities of Scholar Puyu and other language models on a number of academic evaluation sets. The results show that Scholar Puyu not only performs well in Chinese and English reading comprehension, but also achieves good results in mathematical reasoning and programming ability.

In knowledge question answering, Scholar Puyu scores 69.8 on TriviaQA and 27.6 on NaturalQuestions, higher than LLaMA-65B (68.2 and 23.8).

In English reading comprehension, Scholar Puyu is clearly ahead of LLaMA-65B and ChatGPT. Scholar Puyu scores 92.7 and 88.9 on junior high school and senior high school English reading comprehension, respectively, while ChatGPT scores 85.6 and 81.2, and LLaMA-65B scores lower still.

In Chinese comprehension, Scholar Puyu surpasses the two main Chinese language models, ERNIE-260B and GLM-130B, across the board.

In multilingual translation, Scholar Puyu's average score is 33.9, significantly higher than LLaMA's (average score 15.1).

In mathematical reasoning, Scholar Puyu scores 62.9 on GSM8K and 14.9 on MATH, significantly ahead of Google's PaLM-540B (56.5 and 8.8) and LLaMA-65B (50.9 and 10.9).

In programming ability, Scholar Puyu scores 28.1 and 41.4 on the two most representative benchmarks, HumanEval and MBPP, respectively (after fine-tuning on code, the HumanEval score rises to 45.7), significantly ahead of PaLM-540B (26.2 and 36.8) and LLaMA-65B (23.7 and 37.7).
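For readers who want to compare the quoted figures side by side, the sketch below collects the knowledge QA, math and code scores reported above and computes Scholar Puyu's lead over the best other model listed for each benchmark. The scores are copied from the article; the helper name `margins` and the table layout are illustrative choices, not part of the original evaluation.

```python
# Scores as quoted in the article (higher is better).
SCORES = {
    "TriviaQA": {"Scholar Puyu": 69.8, "LLaMA-65B": 68.2},
    "NaturalQuestions": {"Scholar Puyu": 27.6, "LLaMA-65B": 23.8},
    "GSM8K": {"Scholar Puyu": 62.9, "PaLM-540B": 56.5, "LLaMA-65B": 50.9},
    "MATH": {"Scholar Puyu": 14.9, "PaLM-540B": 8.8, "LLaMA-65B": 10.9},
    "HumanEval": {"Scholar Puyu": 28.1, "PaLM-540B": 26.2, "LLaMA-65B": 23.7},
    "MBPP": {"Scholar Puyu": 41.4, "PaLM-540B": 36.8, "LLaMA-65B": 37.7},
}


def margins(scores, model="Scholar Puyu"):
    """Return, per benchmark, the model's lead over the best other listed model."""
    result = {}
    for bench, by_model in scores.items():
        best_other = max(v for name, v in by_model.items() if name != model)
        result[bench] = round(by_model[model] - best_other, 1)
    return result


if __name__ == "__main__":
    for bench, lead in margins(SCORES).items():
        print(f"{bench}: {lead:+.1f} points vs. the next-best listed model")
```

Note that the comparison set differs by benchmark (PaLM-540B is only quoted for the math and code tests), so the margins are relative to whichever models the article lists for that benchmark.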

In addition, the researchers also evaluated the safety of "Scholar Puyu". In TruthfulQA (mainly evaluating the factual accuracy of answers) and CrowS-Pairs (mainly evaluating whether the answers are biased), Scholar Puyu reached the leading level.
