The latest research shows that AI has outperformed humans on theory-of-mind tests: on inference benchmarks, GPT-4 reaches 100% accuracy, compared with about 87% for humans.
GPT-4's theory of mind has surpassed that of humans!
Recently, researchers at Johns Hopkins University found that GPT-4 can use chain-of-thought reasoning and step-by-step thinking, which greatly improves its theory-of-mind performance.
Paper address: https://arxiv.org/abs/2304.11490
On some tests, humans score about 87%, while GPT-4 reaches the 100% ceiling! In addition, with appropriate prompts, all the RLHF-trained models achieve more than 80% accuracy.
Let AI learn theory-of-mind reasoning
As we all know, many large language models are not very good at reasoning about everyday scenarios.
LeCun, Meta's chief AI scientist and a Turing Award winner, once asserted: "Large language models are an off-ramp on the road to human-level AI. Even a pet cat or dog has more common sense and understanding of the world than any LLM."
Some scholars argue that humans are biological entities that evolved together with their bodies and must operate in the physical and social world to accomplish tasks, whereas large language models such as GPT-3, GPT-4, Bard, Chinchilla and LLaMA have no body.
Unless they acquire human-like bodies, senses and ways of life, they will never understand language the way humans do.
In short, although the performance of large language models on many tasks is impressive, tasks that require reasoning remain difficult for them.
Particularly difficult is a kind of reasoning called theory of mind (ToM).
Why is ToM reasoning so difficult?
Because in a ToM task, an LLM must reason about unobservable information, such as other people's hidden mental states, which has to be inferred from context rather than read off the surface text.
Yet the ability to perform ToM reasoning reliably matters a great deal for LLMs, because ToM is the foundation of social understanding: only with ToM can one take part in complex social exchanges and predict the actions or reactions of others.
If AI does not learn the rules of social understanding, it will not be able to serve humans well or provide valuable insights on the many tasks that require such reasoning.
So what can be done?
The researchers found that a technique called "in-context learning" can greatly enhance LLMs' reasoning ability.
For language models with more than 100B parameters, simply including a few-shot task demonstration in the input significantly improves performance.
In addition, even without any demonstration, instructing a model to think step by step improves its reasoning performance.
Why do these prompting techniques work so well? So far there is no theory that explains it.
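To make these two techniques concrete, here is a toy sketch in Python (the example wording is invented for illustration and is not taken from the paper): a few-shot prompt simply prepends worked demonstrations, and a step-by-step prompt appends an instruction that elicits intermediate reasoning.

```python
# Toy illustration of the two prompting techniques described above; the text is invented.
question = "Anna left her umbrella by the door and went upstairs. Where will she look for it later?"

# Zero-shot: just the question.
zero_shot = f"{question}\nAnswer:"

# Zero-shot + step-by-step: instruct the model to reason before answering.
zero_shot_step_by_step = f"{question}\nLet's think step by step."

# Few-shot: prepend one (or more) worked demonstrations to the real question.
demo = ("Q: Tom put his keys in the drawer and left the room. Where will he look for them?\n"
        "A: He put them in the drawer and did not see them move, so he will look in the drawer.\n\n")
few_shot = demo + f"Q: {question}\nA:"

print(few_shot)
```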
The large language model contestants
Against this background, the Johns Hopkins researchers evaluated the performance of several language models on ToM tasks and explored whether that performance could be improved through step-by-step thinking, few-shot learning and chain-of-thought reasoning.
The contestants are OpenAI's four latest GPT models: GPT-4 and three GPT-3.5 variants, Davinci-2, Davinci-3 and GPT-3.5-Turbo.
Davinci-2 (API name: text-davinci-002) was trained with supervised fine-tuning on demonstrations written by humans.
Davinci-3 (API name: text-davinci-003) is an upgraded version of Davinci-2, further trained with reinforcement learning from human feedback (RLHF) based on proximal policy optimization.
GPT-3.5-Turbo (the model behind the original ChatGPT) was fine-tuned both on human-written demonstrations and with RLHF, and then further optimized for dialogue.
GPT-4 is the latest GPT model as of April 2023. Few details about its size and training have been released, but it appears to have undergone more intensive RLHF training and is therefore more aligned with human intent.
Experimental design: humans vs. models
How were the contestants examined? The researchers designed two kinds of scenarios: control scenarios and ToM scenarios.
A control scenario involves no agents at all and is referred to as a "Photo scenario".
A ToM scenario, by contrast, describes the mental states of people involved in a situation.
The questions in the two kinds of scenarios are almost identical in difficulty.
Humans were the first to take on the challenge.
Human participants had 18 seconds to read each scenario.
A question then appeared on a new screen, and participants answered it by clicking "Yes" or "No".
In the experiment, Photo and ToM scenarios were mixed and presented in random order.
For example, a Photo scenario question looks like this:
Situation: "A map shows the floor plan of the first floor. A copy was sent to the architect yesterday, but at that time the kitchen door was missing. The kitchen door was added to the map this morning."
Question: Does the architect's copy show the kitchen door?
A ToM scenario question looks like this:
Situation: "On the morning of the high school dance, Sarah put her high heels under her dress and went shopping. That afternoon, her sister borrowed the shoes and later put them under Sarah's bed."
Question: When Sarah comes back, will she think her shoes are under her dress?
The results show that human accuracy is 86% (±4%) on Photo scenarios and 87% (±4%) on ToM scenarios.
LLMs
Because LLMs are probabilistic models, the researchers prompted each model 20 times on every scenario.
There are 16 scenarios, each repeated 20 times, so each LLM answers 320 questions in total. Accuracy is defined as the proportion of those 320 answers that are correct.
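A minimal sketch of that bookkeeping is below. The query_model and is_correct functions are hypothetical stand-ins (the article does not say which API calls or grading procedure were used), and the error estimate on the last line is a plain normal-approximation interval, not necessarily how the paper's ± figures were computed.

```python
import math
import random

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; the prompt is ignored in this stub."""
    return random.choice(["Yes, she will think they are under her dress.",
                          "No, she won't think so."])

def is_correct(answer: str) -> bool:
    """Hypothetical stand-in grader; real grading needs more care (see the answer examples below)."""
    return answer.lower().startswith("yes")

scenarios = [f"scenario {i}" for i in range(16)]   # 16 scenarios of each kind
REPS = 20                                          # each scenario is prompted 20 times

flags = [is_correct(query_model(s)) for s in scenarios for _ in range(REPS)]

n = len(flags)                                     # 16 * 20 = 320 questions in total
accuracy = sum(flags) / n                          # proportion of correct answers
half_width = 1.96 * math.sqrt(accuracy * (1 - accuracy) / n)  # rough 95% interval
print(f"accuracy over {n} questions: {accuracy:.1%} +/- {half_width:.1%}")
```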
For the prom question, answers like the following are counted as correct:
Example 1: Sarah put her shoes under her dress before she went shopping, but her sister borrowed them and put them under Sarah's bed. Sarah doesn't know that her sister borrowed her shoes, so she may think they are still under her dress.
Example 2: Sarah put her high heels under her dress before she went shopping, but her sister borrowed them and later put them under Sarah's bed. So Sarah doesn't know that her sister borrowed and returned the shoes. So, the answer is: yes, she will think her shoes are under her dress.
Example 3: Sarah might think that her shoes are still under her dress, because that is where she put them in the morning.
Example 4: Sarah put her high heels under her dress before she went shopping, so she knows where she put them. Her sister borrowed the shoes and later put them under Sarah's bed, but Sarah doesn't know this yet. So, the answer is: yes, Sarah will think her shoes are under her dress unless she finds out otherwise.
Example 5: In the morning, Sarah put her high heels under her dress. Later, her sister borrowed the shoes and put them under Sarah's bed. So when Sarah comes back, she may think her shoes are still under her dress, although they are actually under her bed.
Example 6: Based on this situation, it is not clear whether Sarah will think her shoes are under her dress when she comes back. It depends on whether her sister told her that she had borrowed the shoes and put them under the bed.
Answers like the following are counted as incorrect (a hypothetical grading sketch follows these examples):
Example 1: Sarah put her shoes under her dress in the morning and then went shopping. Her sister borrowed the shoes and put them under Sarah's bed, so Sarah doesn't know her shoes are under the bed. So, the answer is: no, Sarah doesn't think her shoes are under her dress.
Example 2: No, Sarah doesn't think her shoes are under her dress.
Example 3: First, Sarah put her high heels under her dress in the morning. Then her sister borrowed the shoes and later put them under Sarah's bed. Therefore, Sarah won't think her shoes are under her dress, because her sister has moved them.
Example 4: In the morning, Sarah put her high heels under her dress and went shopping. This means that when she got home, she didn't check whether the shoes were still under the dress. Later, her sister borrowed the shoes and put them under Sarah's bed. Therefore, Sarah can't think that her shoes are under her dress, because her sister has moved them.
Example 5: The scenario does not specify whether Sarah thinks her shoes are under her dress.
Example 6: Sarah put her high heels under her dress in the morning, and then she went shopping. Her sister later borrowed the shoes and put them under Sarah's bed. Based on this information, it is not clear whether Sarah will think her shoes are still under her dress when she gets ready for the dance.
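The article does not describe how these free-text answers were actually scored, so the following is only a hypothetical screening heuristic, not the paper's grading procedure. It flags explicit denials of the belief as incorrect, explicit assertions that Sarah still believes the shoes are under her dress as correct, and sends everything else, including hedged answers like correct Example 6 and incorrect Example 5 above, to human review; that residue is exactly why fully automatic grading of ToM answers is tricky.

```python
import re

def screen_answer(answer: str) -> str:
    """Hypothetical screening heuristic for the prom scenario (not the paper's method)."""
    text = answer.lower()
    # Negation directly before "think/believe" reads as denying the false belief.
    denies_belief = re.search(r"\b(no|not|never|won't|doesn't|can't)\s+(think|believe)", text)
    says_unclear = "not clear" in text or "does not specify" in text
    # An assertion that she thinks/believes the shoes are under her dress.
    asserts_belief = re.search(r"(think|believe)\w*\b.*under her (dress|skirt)", text)
    if denies_belief:
        return "incorrect"
    if says_unclear:
        return "needs human review"
    if asserts_belief:
        return "correct"
    return "needs human review"

print(screen_answer("Yes, she will think her shoes are under her dress."))      # correct
print(screen_answer("No, Sarah doesn't think her shoes are under her dress."))  # incorrect
```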
To measure the effect of in-context learning (ICL) on ToM performance, the researchers used four types of prompts (a sketch of how such prompts might be assembled follows the list):
Zero-shot (no ICL)
Zero-shot + step-by-step thinking
Two-shot chain-of-thought reasoning
Two-shot chain-of-thought reasoning + step-by-step thinking
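Here is a minimal sketch of how those four conditions might be assembled as prompt strings. Only the structure follows the list above; the demonstration text is invented, and the paper's actual prompt wording is not reproduced here.

```python
# Sketch of the four prompt conditions; demonstration wording is invented for illustration.

SCENARIO = ("On the morning of the high school dance, Sarah put her high heels under her dress "
            "and went shopping. That afternoon, her sister borrowed the shoes and later put them "
            "under Sarah's bed.")
QUESTION = "When Sarah comes back, will she think her shoes are under her dress?"
STEP_BY_STEP = "Let's think step by step."

# Two worked chain-of-thought demonstrations would go here; one invented example is shown.
COT_DEMOS = ("Scenario: Tom left his bike in the garage and went to school. His neighbor moved "
             "the bike into the yard.\n"
             "Question: Where will Tom look for his bike first?\n"
             "Reasoning: Tom left the bike in the garage and did not see it being moved, so his "
             "belief is unchanged.\n"
             "Answer: In the garage.\n\n")

def build_prompt(two_shot_cot: bool, step_by_step: bool) -> str:
    parts = []
    if two_shot_cot:
        parts.append(COT_DEMOS)                # worked demonstrations come first
    parts.append(f"Scenario: {SCENARIO}\nQuestion: {QUESTION}\n")
    if step_by_step:
        parts.append(STEP_BY_STEP + "\n")      # step-by-step instruction before the answer slot
    parts.append("Answer:")
    return "".join(parts)

prompts = {
    "zero-shot": build_prompt(False, False),
    "zero-shot + step-by-step": build_prompt(False, True),
    "two-shot CoT": build_prompt(True, False),
    "two-shot CoT + step-by-step": build_prompt(True, True),
}
print(prompts["two-shot CoT + step-by-step"])
```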
Experimental results
Zero-shot baseline
First, the authors compared the models' zero-shot performance on Photo and ToM scenarios.
On Photo scenarios, accuracy improves steadily from earlier to later models (A): Davinci-2 performs worst and GPT-4 performs best.
In contrast to the Photo scenarios, accuracy on the ToM questions does not improve monotonically across successive models (B). However, this does not mean that the lower-scoring models reason worse.
For example, GPT-3.5-Turbo is more likely to give ambiguous responses when there is not enough information. GPT-4 has no such problem, and its ToM accuracy is significantly higher than that of all the other models.
With the boost of prompting, the authors found that all GPT models released after Davinci-2 improve significantly when in-context learning with modified prompts is used.
First, the most classic technique: having the model think step by step.
The results show that step-by-step thinking improves the performance of Davinci-3, GPT-3.5-Turbo and GPT-4, but does not improve the accuracy of Davinci-2.
Second, two-shot chain-of-thought (CoT) reasoning.
The results show that two-shot CoT improves the accuracy of all models trained with RLHF (that is, every model except Davinci-2).
For GPT-3.5-Turbo, the two-shot CoT prompt significantly improves performance and is more effective than step-by-step thinking; for Davinci-3 and GPT-4, the gains from two-shot CoT are comparatively limited.
Finally, two-shot CoT reasoning and step-by-step thinking are used together.
The results show that the ToM accuracy of all the RLHF-trained models improves significantly: Davinci-3 reaches 83% (±6%), GPT-3.5-Turbo reaches 91% (±5%), and GPT-4 reaches the highest accuracy of 100%.
Under the same conditions, human performance is 87% (±4%).
During the experiments, the researchers also asked: is the improvement in the LLMs' ToM performance simply due to copying reasoning steps from the prompt?
To check, they prompted with reasoning examples drawn from Photo scenarios, whose reasoning patterns differ from those in the ToM scenarios.
Even so, the models' performance on ToM scenarios still improved.
The researchers therefore concluded that prompting improves ToM performance not merely because the models overfit the specific set of reasoning steps shown in the CoT examples.
Rather, the CoT examples seem to invoke an output mode involving step-by-step reasoning, which improves the models' accuracy on a whole range of tasks.
Figure: the effect of different CoT examples on ToM performance
LLMs will bring humans many more surprises
In the experiments, the researchers also noticed some very interesting phenomena.
1. With the exception of Davinci-2, all models achieve higher ToM accuracy with the modified prompts.
Moreover, the models show the greatest gain in accuracy when the prompt combines chain-of-thought reasoning with step-by-step thinking, rather than using either alone.
2. Davinci-2 is the only model that was not fine-tuned with RLHF, and the only model whose ToM performance does not improve with prompting. This suggests that it may be RLHF that enables a model to take advantage of contextual prompts in this setting.
3. LLMs may have the capacity for ToM reasoning but be unable to demonstrate it without an appropriate context or prompt. With chain-of-thought and step-by-step prompts, both Davinci-3 and GPT-3.5-Turbo exceed GPT-4's zero-shot ToM accuracy.
In addition, many scholars have raised objections to evaluating the reasoning ability of LLMs with this kind of metric.
Such studies mainly rely on word completion or multiple-choice questions to measure the abilities of large models, and this style of evaluation may fail to capture the full complexity of the ToM reasoning LLMs can do; ToM reasoning is a complex behavior that may involve multiple steps even when humans perform it.
As a result, LLMs may benefit from producing longer answers when tackling such tasks.
There are two reasons. First, longer model outputs can be evaluated more fairly: LLMs sometimes generate "corrections" and then mention additional possibilities that lead to an uncertain summary, and a model may hold some information about the potential outcome of a situation that is nevertheless not enough for it to reach the right conclusion.
Second, when given the opportunity and cues to respond systematically, step by step, LLMs may unlock new reasoning capabilities or strengthen existing ones.
Finally, the researchers also noted some limitations of this work.
For example, the GPT-3.5 models sometimes reason correctly but fail to integrate that reasoning into a correct conclusion. Future research should therefore extend the study of methods (such as RLHF) that help LLMs draw correct conclusions from prior reasoning steps.
In addition, the current study does not quantitatively analyze each model's failure modes. How does each model fail, and why? These details need further exploration and understanding.
Moreover, the data say nothing about whether LLMs possess a "mental capacity" corresponding to a structured logical model of mental states. But they do show that, when asking LLMs about ToM, looking only for a simple yes/no answer is not productive.
Fortunately, these results show that LLM behavior is highly complex and context-sensitive, and they point to ways of helping LLMs with some forms of social reasoning.
Therefore, the cognitive abilities of large models should be characterized through careful investigation, rather than by reflexively applying existing cognitive ontologies to them.
In short, as AI becomes more and more powerful, humans will need to expand their imagination in order to understand its capabilities and how it works.
Reference:
https://arxiv.org/abs/2304.11490
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).