
The Tiangong large model evaluation results are out: strong in both arts and sciences, it ranks in the first echelon of Chinese closed-source models | SuperCLUE



The results of this evaluation are intended for academic research purposes only.

A brief introduction to the Tiangong large model

The Tiangong large model is a hundred-billion-parameter-scale large language model developed by Kunlun Wanwei, which began internal testing on April 17 of this year. Recently, the CLUE community found that Tiangong large model v3.5 outperforms GPT-3.5 and LLaMA2-70B on multiple evaluation datasets, especially on the reasoning benchmark GSM8K, which has sparked heated discussion among developers in the CLUE community.

So, how does the Tiangong large model perform on our evaluation set? How does it compare with representative models developed by major domestic and foreign companies and research institutions, and how does it do on abilities of particular interest, such as generation and creation, logical reasoning, and code generation? Based on the SuperCLUE comprehensive evaluation benchmark, which includes the multi-turn open-ended question assessment SuperCLUE-OPEN and the objective-question assessment SuperCLUE-OPT covering three major ability dimensions, we used 3,337 questions to conduct an all-round evaluation of the Tiangong large model.

Evaluation environment

Reference standard: SuperCLUE comprehensive evaluation benchmark

Evaluated model: Tiangong large model v3.5.20230915.a

Evaluation set: 3,337 Chinese questions in total, including 623 open-ended (short-answer) questions and 2,714 multiple-choice questions, covering 74 evaluation tasks across three dimensions: basic abilities, professional/academic knowledge, and Chinese-specific characteristics.

Model GenerationConfig configuration:

generate_length: 2048

repetition_penalty: 1

temperature: 0.8

top_k: 3

top_p: 1
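
For readers who want to reproduce a comparable sampling setup, here is a minimal sketch of the configuration above expressed with Hugging Face transformers' GenerationConfig. This is illustrative only: the article does not say the Tiangong model is served through this API, and mapping "generate_length" to max_new_tokens is an assumption.

```python
# Sketch of the sampling configuration listed above (assumptions noted inline).
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=2048,     # "generate_length: 2048" (assumed to mean new tokens)
    repetition_penalty=1.0,  # repetition penalty of 1 = no penalty
    temperature=0.8,         # slightly sharpens/softens the output distribution
    top_k=3,                 # sample only among the 3 most likely tokens
    top_p=1.0,               # nucleus sampling effectively disabled
    do_sample=True,          # temperature/top_k/top_p only apply when sampling
)
```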

Evaluation method: this is an automated evaluation; the detailed evaluation scheme can be found in the SuperCLUE comprehensive evaluation standard. The results have been spot-checked manually.

Let's start with the conclusions.

Conclusion 1: on the SuperCLUE benchmark, the Tiangong large model's comprehensive ability places it in the first echelon of Chinese closed-source models, making it a very competitive model.

Conclusion 2: the Tiangong large model further narrows the gap between Chinese closed-source models and GPT-3.5.

Conclusion 3: the Tiangong large model is a well-balanced model, with no obvious deficiency in any task, and it is particularly strong in language comprehension, calculation, and logical reasoning.

The following is the evaluation and analysis of the model from both quantitative and qualitative perspectives.

Evaluation and analysis

1. Quantitative analysis

We compare the Tiangong large model against representative domestic and foreign models from the August SuperCLUE leaderboard.

SuperCLUE comprehensive evaluation of large models

Note: total score = 50% OPEN + 50% OPT

From the evaluation results, we can see that the Tiangong large model performs well among domestic closed-source models on the August SuperCLUE evaluation set.

Performance of the Tiangong large model across the ten basic abilities

Note: each of the ten ability scores is a weighted average of the OPEN score and the OPT score.
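
For concreteness, here is a minimal sketch of how these aggregate scores might be computed. The 50/50 split for the total score follows the note above; the per-ability weights are assumptions, since the article only says the ability scores are a weighted average of the OPEN and OPT scores.

```python
# Sketch of the score aggregation described in the notes above.
def total_score(open_score: float, opt_score: float) -> float:
    """Overall score: 50% SuperCLUE-OPEN + 50% SuperCLUE-OPT."""
    return 0.5 * open_score + 0.5 * opt_score

def ability_score(open_score: float, opt_score: float,
                  w_open: float = 0.5, w_opt: float = 0.5) -> float:
    """Per-ability score as a weighted average of OPEN and OPT scores.
    The weights here are placeholders; the article does not state them."""
    return (w_open * open_score + w_opt * opt_score) / (w_open + w_opt)

# Example with made-up numbers (not actual results from the article):
print(total_score(70.0, 65.0))  # 67.5
```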

From the evaluation results, we can see that the Tiangong large model is very balanced across the ten major tasks; comparing it against the average score of Chinese closed-source models, the Tiangong model is above average on every task, which is relatively rare among current Chinese models.

Summary:

From the evaluation data, we find that the Tiangong large model is very strong in comprehensive ability, placing in the first echelon of domestic large models by total score, with well-balanced abilities and no obvious deficiency. It performs well in language understanding, generation and creation, calculation, and logical reasoning. Note that the Tiangong model evaluated here is the version updated on September 15, while the other domestic models are compared using their August evaluation results; SuperCLUE will make further comparisons in subsequent rounds.

2. Qualitative analysis

Through some typical examples, we compare and analyze the characteristics of the Tiangong model.

Logic and reasoning

In the first round of questions in this example, both models answer correctly. The answer from gpt-3.5-turbo is relatively brief, while the Tiangong model's reasoning steps are more complete. On the second question, gpt-3.5-turbo fails to give a correct answer: although it mentions some factors that might affect Mrs. Wang's statement, it does not point out that the reason is that February has 29 days in a leap year. The Tiangong model gives the correct answer directly. Tiangong performs better in this example.
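
The reasoning point in the second question hinges on February having 29 days in a leap year. Since the article does not show the question wording, the sketch below only illustrates the calendar rule itself, with a hypothetical year.

```python
# Minimal check of the leap-year rule behind the second question.
import calendar

year = 2024                              # hypothetical year, for illustration only
print(calendar.isleap(year))             # True: divisible by 4 and not a skipped century year
print(calendar.monthrange(year, 2)[1])   # 29 -> February has 29 days in a leap year
```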

Calculation

In this example, both models give the correct first-order and second-order derivatives and explain the calculation process in detail, so the two models perform equally well in terms of correctness and working.
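
The article does not show the actual function used in this example, so the expression below is purely hypothetical; the sketch only illustrates how first- and second-order derivatives of this kind can be verified symbolically with SymPy.

```python
# Symbolic check of first- and second-order derivatives (hypothetical function).
import sympy as sp

x = sp.symbols("x")
f = x**3 * sp.exp(x)       # hypothetical function, not taken from the article

f1 = sp.diff(f, x)         # first derivative: (x**3 + 3*x**2) * exp(x)
f2 = sp.diff(f, x, 2)      # second derivative: (x**3 + 6*x**2 + 6*x) * exp(x)
print(f1)
print(f2)
```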

Generation and creation

Both models answer the user's question well. The Tiangong model's answer focuses more on life philosophy, emphasizing the power and meaning of hope, while gpt-3.5-turbo's answer reads more like an actual survival story. In terms of practicality, relevance, accuracy, depth, and creativity, both answers are good.

Language comprehension and extraction

In this example, the Tiangong model recognizes that each part of the text contains positive emotional elements. Its answer is in-depth, accurate, and responds directly to the user's question. gpt-3.5-turbo's answer takes a step-by-step approach, first identifying the negative emotions at the beginning of the text and then gradually pointing out the emergence and eventual dominance of positive emotions. This answer is also accurate and detailed, but provides more steps and detail. Taken together, both models' answers are very good.

Summary:

From these qualitative examples, we can see that several key basic capabilities of the Tiangong model are very close to gpt-3.5-turbo, especially in logic, reasoning, and calculation.
