AGI is just around the corner? Microsoft's Chinese team has released AGIEval, a new benchmark built from human exams.

How many offers could an AI collect if it sat for the bar exam, the college entrance examination, and the civil service exam?

As language models grow ever more capable, existing evaluation benchmarks are starting to look too easy for them, even though on some tasks model performance still lags behind humans.

An important feature of artificial general intelligence (AGI) is the ability to generalize to human-level tasks, and traditional benchmarks built on artificial datasets cannot accurately represent human capabilities.

Recently, Microsoft researchers released AGIEval, a new benchmark designed to assess the performance of foundation models on standardized, human-centric tests such as the Gaokao (China's college entrance examination), civil service exams, law school admission tests, math competitions, and bar exams.

Paper link: https://arxiv.org/pdf/2304.06364.pdf

Data link: https://github.com/microsoft/AGIEval
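
For readers who want to poke at the data, here is a minimal sketch of loading one task file, assuming the repository ships each task as a JSON Lines file (one JSON object per line); the file path and field layout are illustrative assumptions, not guaranteed by the repo:

```python
# Minimal sketch: load one AGIEval task file, assuming JSON Lines format.
import json

def load_task(path: str) -> list[dict]:
    # One JSON object per non-empty line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

items = load_task("data/v1/sat-math.jsonl")  # hypothetical path
print(len(items), sorted(items[0].keys()))   # inspect the record fields
```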

The researchers used AGIEval to evaluate three state-of-the-art foundation models: GPT-4, ChatGPT, and Text-Davinci-003. The results show that GPT-4 exceeds the average human score on the SAT, the LSAT, and math competitions, reaching 95% accuracy on the SAT math section and 92.5% accuracy on the English test of the Chinese college entrance examination, an extraordinary showing for current foundation models.

However, GPT-4 is less proficient at tasks that require complex reasoning or domain-specific knowledge. A comprehensive analysis of model capabilities along four axes (understanding, knowledge, reasoning, and calculation) reveals the strengths and limitations of these models.

The AGIEval dataset

In recent years, large foundation models such as GPT-4 have shown powerful capabilities across a wide range of fields; they can assist humans with everyday matters and even offer decision-making advice in professional fields such as law, medicine, and finance.

In other words, artificial intelligence systems are gradually approaching artificial general intelligence (AGI).

However, as AI is woven into daily life, it becomes essential to evaluate models' generalization ability, identify potential defects, verify that they can handle complex, human-centric tasks, and assess their reasoning ability to ensure reliability and trustworthiness across different environments.

The researchers constructed the AGIEval dataset around two main design principles:

1. Human-centric: cognitive tasks at human level. The main goal is to focus on tasks closely tied to human cognition and problem solving, so that the generalization ability of foundation models can be evaluated in a more meaningful and comprehensive way.

To achieve this, the researchers chose a variety of official, public, high-standard admission and qualification examinations administered to ordinary human examinees, including college entrance exams, law school admission tests, math competitions, bar exams, and national civil service exams, which millions of people take every year in pursuit of higher education or new career paths.

By adhering to these officially recognized standards for assessing human-level abilities, AGIEval ensures that its measurement of model performance is directly relevant to human decision-making and cognition.

2. Relevance to real-world scenarios. Selecting tasks from high-standard entrance and qualification examinations ensures that the evaluation results reflect the complexity and practicality of challenges that individuals routinely face across different fields and contexts.

This approach not only measures model performance against human cognitive abilities but also sheds light on applicability and effectiveness in real life, which in turn should help build AI systems that are more reliable, more practical, and better suited to solving a wide range of real-world problems.

Based on these design principles, the researchers chose a variety of high-quality standardized tests that emphasize human-level reasoning and real-world relevance, including:

1. College entrance examinations

College entrance exams cover a variety of subjects and demand critical thinking, problem solving, and analytical skills, making them an ideal yardstick for comparing large language models against human cognition.

These include the Graduate Record Examinations (GRE), the SAT (Scholastic Assessment Test), and the Chinese National College Entrance Examination (Gaokao), which assess both the general abilities and the subject-specific knowledge of students seeking admission to higher-education institutions.

The benchmark collects eight subjects from the Chinese National College Entrance Examination (history, mathematics, English, Chinese, geography, biology, chemistry, and physics), math questions from the GRE, and the English and math sections of the SAT.

2. Law school entrance examination

Law school entrance exams such as the LSAT are designed to measure the reasoning and analytical abilities of future law students, spanning logical reasoning, reading comprehension, and analytical reasoning. They require candidates to analyze complex information and draw accurate conclusions, and can thus assess a language model's capacity for legal reasoning and analysis.

3. Bar examination

The bar exam assesses the legal knowledge, analytical ability, and ethical understanding of those pursuing the legal profession. It covers a wide range of legal topics, including constitutional law, contract law, criminal law, and property law, and requires candidates to apply legal principles and reasoning effectively, making it a test of a language model's professional legal knowledge and ethical judgment.

4. Graduate Management Admission Test (GMAT)

The GMAT is a standardized test that evaluates the analytical, quantitative, verbal, and integrated reasoning skills of prospective business school students. It consists of an analytical writing assessment plus integrated reasoning, quantitative reasoning, and verbal reasoning sections, and measures candidates' ability to think critically, analyze data, and communicate effectively.

5. High school math competition

These competitions cover a wide range of mathematical topics, including number theory, algebra, geometry, and combinatorics, and often pose unconventional problems that must be solved in creative ways.

They include the American Mathematics Competitions (AMC) and the American Invitational Mathematics Examination (AIME), which test students' mathematical ability, creativity, and problem solving, and thus probe a language model's capacity to handle complex, creative mathematical problems and to generate novel solutions.

6. Chinese civil service examination

The civil service exam assesses candidates seeking to enter China's civil service, covering general knowledge, reasoning, and language skills, along with professional knowledge tied to the roles and responsibilities of specific positions. It can therefore measure how language models perform in the context of public administration, and their potential in policy formulation, decision-making, and public service delivery.

Evaluation results

The models selected for evaluation include:

ChatGPT, a conversational AI model developed by OpenAI, engages in dynamic dialogue with users. It is trained on a large instruction dataset and further aligned through reinforcement learning from human feedback (RLHF) to provide context-aware, coherent responses consistent with human expectations.

GPT-4, the fourth-generation GPT model, draws on a broader knowledge base and shows human-level performance in many application scenarios. It was iteratively refined using adversarial testing and lessons learned from ChatGPT, yielding significant improvements in factuality, steerability, and adherence to guardrails.

Text-Davinci-003 is an instruction-tuned intermediate version between GPT-3 and GPT-4 that performs better than GPT-3.

In addition, the experiments report the average and top scores of human test-takers, which serve as a human-level reference for each task, though they do not fully capture the range of skills and knowledge humans may possess.

Zero-shot / few-shot evaluation

Under the zero-shot setting, the model is evaluated directly on each question; under the few-shot setting, a handful of solved examples from the same task (e.g., five) are placed before the test sample.
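
As an illustration of the two settings, here is a minimal sketch of prompt construction; the field names ("question", "options", "answer") are illustrative assumptions rather than AGIEval's exact schema:

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction.
# Passing demos=None yields a zero-shot prompt; passing k solved
# examples yields a k-shot prompt.

def build_prompt(test_item: dict, demos: list | None = None) -> str:
    def fmt(item: dict, with_answer: bool) -> str:
        # Question text, then the answer options, then the answer slot.
        lines = [item["question"]] + list(item["options"])
        lines.append(f"Answer: {item['answer']}" if with_answer else "Answer:")
        return "\n".join(lines)

    parts = [fmt(d, with_answer=True) for d in (demos or [])]  # solved demos first
    parts.append(fmt(test_item, with_answer=False))            # then the test question
    return "\n\n".join(parts)
```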

To further probe reasoning ability, the experiments also use chain-of-thought (CoT) prompting: the model is first prompted with "Let's think step by step" to generate an explanation for a given question, and then prompted with "Explanation is" to produce the final answer based on that explanation.
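
A rough sketch of that two-stage protocol follows; `complete` is a placeholder for whatever text-completion API is in use, and the exact prompt wiring is an assumption based on the description above:

```python
# Hedged sketch of two-stage chain-of-thought prompting.

def complete(prompt: str) -> str:
    """Placeholder for a call to a language-model completion endpoint."""
    raise NotImplementedError

def cot_answer(question: str) -> str:
    # Stage 1: elicit a step-by-step explanation.
    explanation = complete(f"{question}\nLet's think step by step.")
    # Stage 2: condition on the explanation to produce the final answer.
    return complete(f"{question}\nExplanation is: {explanation}\nTherefore, the answer is:")
```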

The "multiple choice questions" in the benchmark use the standard classification accuracy; the "fill in the blanks" use exact matching (EM) and F1 indicators.

The experimental results show that:

1. GPT-4 clearly outperforms its counterparts across all task settings, including 93.8% accuracy on Gaokao-English and 95% accuracy on SAT-MATH, indicating an excellent general ability to handle human-centric tasks.

2. ChatGPT is significantly better than Text-Davinci-003 on tasks that require external knowledge, such as those involving geography, biology, chemistry, physics, and mathematics, suggesting that ChatGPT has a stronger knowledge base and handles tasks demanding deep domain understanding more capably.

On the other hand, across all evaluation settings, ChatGPT slightly outperforms Text-Davinci-003, or achieves comparable results, on tasks that hinge on pure comprehension rather than external knowledge, such as the English and LSAT tasks. This suggests that both models can handle tasks centered on language understanding and logical reasoning without specialized domain knowledge.

3. Despite strong overall performance, all of the language models fare poorly on complex reasoning tasks such as MATH, LSAT-AR, GK-physics, and GK-Math, highlighting their limitations on tasks that demand advanced reasoning and problem-solving skills.

These observed difficulties with complex reasoning point to opportunities for future research aimed at improving models' general reasoning ability.

4. Few-shot learning usually brings only limited improvement over zero-shot, indicating that the zero-shot ability of today's large language models is approaching their few-shot ability. This marks great progress over the original GPT-3, whose few-shot performance far exceeded its zero-shot performance.

A plausible explanation is that current language models are strengthened by human alignment and instruction tuning; these improvements let a model grasp the meaning and context of a task in advance, so that it performs well even zero-shot, demonstrating the effectiveness of instruction tuning.

Reference:

https://arxiv.org/pdf/2304.06364.pdf

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).
