GPT-4 doesn't know it's wrong! A new LLM defect is exposed: the success rate after self-correction is only 1%.


[Xin Zhiyuan editor's note] GPT-4 doesn't know when it has made a mistake? New research finds that on reasoning tasks, self-correction cannot rescue an LLM from degraded performance, drawing the attention of AI heavyweights LeCun and Marcus.

A major defect in large models has been exposed, attracting the attention of both LeCun and Marcus at the same time.

In a reasoning experiment, self-correction, a technique claimed to improve accuracy, actually dropped the correct rate from 16% to 1%!

Put simply, an LLM cannot improve its output through self-correction on reasoning tasks unless it already knows the correct answer during the self-correction process.

Two papers published by ASU researchers refute the "self-correction" approach proposed by many previous studies, namely the claim that letting large models correct their own output improves output quality.

Paper address: https://arxiv.org/abs/2310.12397

Paper address: https://arxiv.org/abs/2310.08118

Professor Subbarao Kambhampati, a co-author of the papers, has long been devoted to research on AI reasoning ability; in September he published a paper that went so far as to completely deny GPT-4's reasoning and planning ability.

Paper address: https://arxiv.org/pdf/2206.10498.pdf

Beyond this professor, researchers at DeepMind and UIUC have also recently questioned LLMs' ability to "self-correct" on reasoning tasks.

That paper even calls on all scholars doing related research to take it seriously and not feed the correct answer to the large model before letting it carry out so-called "self-correction".

Because if the model does not know the correct answer, output quality declines after it "self-corrects".

https://arxiv.org/abs/2310.01798

Next, let's take a look at these two latest papers.

GPT-4 "self-correcting", but the output is even worse. The first paper focuses on GPT-4, asking GPT-4 to provide a solution to the graphic coloring problem, and then asking GPT-4 to "self-correct" its own proposal.

At the same time, the authors introduce an external evaluation system to assess both GPT-4's direct output and its output after the "self-correction" loop.

The experimental results show that GPT-4's accuracy at guessing colorings directly is below 20%, which is not particularly surprising.

What is surprising is that accuracy in "self-correction" mode drops significantly (the second bar in the figure below), completely contrary to the whole point of self-correction!

The authors believe this seemingly counterintuitive result can be explained as follows: GPT-4 is also poor at verifying answers!

Even when GPT-4 happens to guess the right coloring, its "self-correction" convinces it that the correct answer is problematic, and it then replaces that correct answer.

Further investigation also found that if an external verifier provides correct feedback on the coloring GPT-4 guessed, GPT-4 does indeed improve its solution.

In that case, the backprompts generated during "self-correction" really can improve output quality (bars 3-5 in the figure above).

To sum up, on the "coloring problem" task, GPT-4's unassisted "self-correction" hurts output performance, because GPT-4 cannot verify whether an answer is correct.

But the "self-correcting" generated by GPT-4 does improve performance if the correct external verification process is provided.

The other paper studies large language models' "self-correction" ability from the perspective of planning tasks, and the results are similar to those of the first paper.

Moreover, the researchers found that what really improves output accuracy is not the LLM's "self-correction" but feedback from an external, independent verifier.

In the end, the LLM has no way to carry out verification on its own; it must rely on the "correct answer" given by an external verifier in order to "self-correct" effectively.

The "coloring problem" performed poorly, and LLM could not independently verify the correct answer to the research design framework.

Graph coloring is a very classical reasoning problem: even though it is not especially hard, the answers are diverse enough, and the correctness of an answer is easy to verify.

This diversity of solutions makes it hard for an LLM's training data to cover them all, which minimizes the possibility of training-data contamination.

These properties make the "coloring problem" well suited for studying LLMs' reasoning ability, and also convenient for studying their "self-correction" ability in reasoning.

The researchers built their own dataset and used GrinPy to handle common graph operations. Each graph is constructed using the Erdős–Rényi method.

Once a correct answer is found, the instance is compiled into standard DIMACS format, with a comment recording its precomputed chromatic number.

For the experiments, the researchers generated 100 instances with an average of 24 edges each, with node counts ranging from 10 to 17, a range chosen because experience shows it provides sufficient variation.
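For readers who want a concrete picture of the setup, here is a minimal sketch of how such instances could be generated. It uses networkx in place of GrinPy, and the edge probability, the greedy stand-in for the precomputed chromatic number, and all function names are illustrative assumptions rather than the authors' actual code.

```python
import random
import networkx as nx

def make_instance(n_nodes: int, p: float = 0.25) -> nx.Graph:
    # One Erdős–Rényi graph; p is an assumed edge probability chosen so that
    # graphs with 10-17 nodes average roughly 24 edges.
    return nx.erdos_renyi_graph(n_nodes, p)

def chromatic_upper_bound(g: nx.Graph) -> int:
    # Stand-in for the paper's precomputed chromatic number: a greedy
    # coloring gives an upper bound, not necessarily the exact value.
    coloring = nx.coloring.greedy_color(g, strategy="largest_first")
    return max(coloring.values()) + 1

def to_dimacs(g: nx.Graph, chromatic: int) -> str:
    # Standard DIMACS edge format, with the chromatic number kept in a
    # comment line, as the paper describes.
    lines = [f"c chromatic number {chromatic}",
             f"p edge {g.number_of_nodes()} {g.number_of_edges()}"]
    lines += [f"e {u + 1} {v + 1}" for u, v in g.edges()]  # DIMACS is 1-indexed
    return "\n".join(lines)

# 100 instances with 10-17 nodes, mirroring the paper's setup.
instances = []
for _ in range(100):
    g = make_instance(random.randint(10, 17))
    instances.append(to_dimacs(g, chromatic_upper_bound(g)))
```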

The researchers' overall pipeline is shown in Figure 1 below; the process includes the LLM's first reply, the backprompt issued in response to that reply, and the final correct coloring scheme.

Iterative backprompting architecture (Iterative Backprompting)

Prompt generator (Prompt Generator):

The prompt generator takes a DIMACS instance, translates each edge into a sentence, and then wraps the whole thing in a set of general instructions to construct a natural-language prompt.

The researchers deliberately minimized the differences between prompts for different instances to reduce the amount of problem-specific information leaked to the LLM. Examples of the various prompt types can be found in the paper's appendix.
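As an illustration only, a prompt generator along these lines might look like the sketch below; the exact instruction wording used in the paper is given in its appendix, so the phrasing here is an assumption.

```python
def edges_to_sentences(dimacs_text: str) -> list[str]:
    # Translate each DIMACS edge line ("e u v") into a plain-English sentence.
    sentences = []
    for line in dimacs_text.splitlines():
        if line.startswith("e "):
            _, u, v = line.split()
            sentences.append(f"Vertex {u} is connected to vertex {v}.")
    return sentences

def build_prompt(dimacs_text: str) -> str:
    # Wrap the edge sentences in a fixed set of general instructions, kept
    # identical across instances to avoid leaking problem-specific hints.
    instructions = (
        "Color the following graph so that no two connected vertices share "
        "a color, using as few colors as possible. Reply with one line per "
        "vertex in the form '<vertex>: <color>'."
    )
    return instructions + "\n" + "\n".join(edges_to_sentences(dimacs_text))
```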

Large language models:

The researchers call GPT-4, the most advanced model currently available, through the OpenAI API.

The researchers provide a system role: "you are a constraint satisfaction solver that solves all kinds of CSP (constraint satisfaction problems)."
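A minimal sketch of such a call with the current openai Python client might look like this; the model identifier, temperature, and helper name are assumptions, not the paper's exact configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_gpt4(user_prompt: str) -> str:
    # One chat completion using the system role quoted above.
    response = client.chat.completions.create(
        model="gpt-4",   # assumption: the paper's exact model snapshot is not given here
        temperature=0,   # assumption: deterministic decoding for evaluation
        messages=[
            {"role": "system",
             "content": "You are a constraint satisfaction solver that solves "
                        "all kinds of CSPs (constraint satisfaction problems)."},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```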

Backprompt generation (Backprompt Generation)

In verification mode, the LLM receives a different type of prompt.

Besides the standard instructions, this prompt contains only the description of the graph and the proposed coloring scheme. The model's task is to verify correctness, optimality, and whether every vertex has been assigned a color.

If any edge in the proposed coloring is contradictory, i.e. both of its endpoints share a color, the coloring scheme is wrong.

For comparison, the researchers also built an exact verifier that lists every contradictory edge.
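The external verifier's core check is simple enough to sketch in a few lines; representing a coloring as a vertex-to-color dictionary is an assumption made for illustration.

```python
def find_contradictory_edges(edges: list[tuple[int, int]],
                             coloring: dict[int, str]) -> list[tuple[int, int]]:
    # Every edge whose two endpoints share a color (or lack one) is a violation.
    return [(u, v) for u, v in edges
            if u not in coloring or v not in coloring or coloring[u] == coloring[v]]

def verdict(edges, coloring, chromatic_number: int) -> str:
    # "Correct" means proper AND optimal (no more colors than the chromatic number).
    bad = find_contradictory_edges(edges, coloring)
    if bad:
        return f"incorrect, contradictory edges: {bad}"
    if len(set(coloring.values())) > chromatic_number:
        return "proper but non-optimal"
    return "correct"
```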

Because the LLM's responses are also in natural-language form, the researchers first translate them into a format that is easy to analyze. To make this process more consistent, the initial prompts describe the precise output format the model needs to follow. Each response is then evaluated for correctness.

Verification

To better understand the LLM's verification capability, the researchers studied its performance at finding errors in proposed coloring schemes.

Intuitively, these errors should be easy to spot: if the two vertices of an edge share a color, report that edge. Algorithmically, all that is needed is to traverse every edge and compare the color of each endpoint with the color of the other endpoint.

The researchers used the same analysis pipeline, but built a new domain they call color_verification. The LLM is directed to check whether a coloring is correct, whether it is optimal, and whether every vertex has been assigned a color.

If the coloring is incorrect, the model is instructed to list the errors, i.e., to return any edge whose two connected nodes share a color. No backprompts are given in this mode.

The researchers used the same graph instances as before, but generated five types of coloring schemes to test the model (a rough sketch of how such variants could be produced follows the list):

Correct: an error-free, optimal coloring scheme generated by an iterated randomized greedy algorithm (using the precomputed chromatic number to guarantee optimality).

Ablated: the correct scheme with one random node recolored to the color of one of its neighbors.

Non-optimal: the correct scheme with part of one randomly chosen color class recolored to a new hue.

Random: colors assigned completely at random, with the number of distinct colors equal to the graph's chromatic number.

LLM: a coloring scheme randomly selected from the outputs the LLM generated in the previous experiment.
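For illustration, the perturbed variants could be produced roughly as follows (the "correct" scheme comes from the greedy solver and the "LLM" scheme from the earlier experiment); the exact perturbation procedure belongs to the paper, so treat this as an approximation.

```python
import random
from collections import Counter

def ablated(correct: dict[int, str], edges) -> dict[int, str]:
    # Recolor one random node with a neighbor's color, guaranteeing at least
    # one contradictory edge.
    u, v = random.choice(list(edges))
    out = dict(correct)
    out[u] = correct[v]
    return out

def non_optimal(correct: dict[int, str]) -> dict[int, str]:
    # Recolor one node of a shared color class to a brand-new hue: the scheme
    # stays proper but now uses more colors than necessary.
    counts = Counter(correct.values())
    candidates = [n for n, c in correct.items() if counts[c] > 1]
    out = dict(correct)
    out[random.choice(candidates or list(correct))] = "fresh_hue"
    return out

def random_coloring(nodes, chromatic_number: int) -> dict[int, str]:
    # Colors assigned uniformly at random from exactly chromatic-number hues.
    palette = [f"color_{i}" for i in range(chromatic_number)]
    return {n: random.choice(palette) for n in nodes}
```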

Conclusion

Prompting the LLM, evaluating its answer, and moving straight on to the next instance without any backprompts yields a baseline score of 16%.

When the researchers ran the same instances but this time backprompted with the same language model acting as the verifier to generate feedback, performance degraded sharply: only one of the 100 instances was answered correctly.
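For concreteness, the evaluation loop behind these conditions (self-critique versus an external verifier) could be wired up roughly as below, reusing the query_gpt4 and find_contradictory_edges helpers sketched earlier; parse_coloring and ask_gpt4_to_verify are assumed helpers, and the round cap is an assumption.

```python
def solve_with_backprompts(prompt: str, edges, self_critique: bool,
                           max_rounds: int = 15) -> bool:
    # self_critique=True: GPT-4 critiques its own answer (the "self-correction"
    # condition). self_critique=False: an exact external verifier gives feedback.
    answer = query_gpt4(prompt)
    for _ in range(max_rounds):
        coloring = parse_coloring(answer)  # assumed helper: text -> {vertex: color}
        if self_critique:
            bad_edges = ask_gpt4_to_verify(prompt, answer)  # assumed helper: LLM as critic
        else:
            bad_edges = find_contradictory_edges(edges, coloring)
        if not bad_edges:
            break  # the critic accepts the answer
        feedback = ("These connected vertices share a color: "
                    + ", ".join(map(str, bad_edges)))
        answer = query_gpt4(prompt + "\n" + feedback + "\nPlease try again.")
    # Final answers are always scored by the exact checker, whatever the critic said.
    return not find_contradictory_edges(edges, parse_coloring(answer))
```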

Backprompting with an external, sound verifier seems more effective at first glance.

Nearly 40% of instances are answered correctly. But if this meant GPT-4 was listening to the feedback, improving, and reasoning from it, the researchers would expect more informative backprompts to lead to better results.

In this domain, however, the raw scores (see Figure 2 above) do not bear that out.

Verification capabilities of LLM

The researchers tested GPT-4's ability to verify graph colorings on the same instances, generating the five types of coloring schemes described above for each instance.

The headline result matches the self-correction results above: the model is almost never willing to mark an answer as correct. Of the 100 optimal coloring schemes, it agreed that only 2 were correct.

Of the full set of 500 coloring schemes, 118 are correct, yet the model claims only 30 are correct, and only 5 of those 30 actually are.

This pattern holds overall. In fewer than 10% of cases the LLM gave a "correct", "non-optimal", or "missing assignment" verdict, and in those cases its behavior appears random.

In about a quarter of cases, the model gives an "this is incorrect" verdict whose explanation is consistent with reality, but it achieves this only by naming no more than one edge, thereby minimizing its chance of misstating something.

The results are shown in Table 2 above. Note that as the error rate in the domain increases, the proportion of hallucinations decreases; that is, when there are more incorrect edges, the model is more likely to point out an error.

LLM self-criticism: performance goes down, not up

In the paper submitted on the 12th, the authors reach the same conclusion as above.

Whether in planning or in simple arithmetic and logic, GPT-4, the most advanced large model, is not fully competent.

Many researchers have explored improvements, including having the LLM self-iterate, self-verify, and apply other strategies to boost performance.

As a result, people in the industry have been optimistic that large models can still be saved!

However, the classical complexity of reasoning tasks is beside the point for large models, because an LLM performs approximate retrieval rather than exact reasoning.

In the paper submitted to arXiv on the 12th, the ASU researchers systematically evaluate and analyze LLMs' ability to self-critique on planning tasks and to iteratively refine their plans.

In the study, the authors propose a planning system consisting of a generator LLM and a verifier LLM.

The GPT-4 generator is responsible for producing candidate plans, while the GPT-4 verifier is responsible for checking the plans' correctness and providing feedback.
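Reusing the query_gpt4 helper from the earlier sketch, such a generator/verifier loop might be wired up roughly like this; the prompt wording, the "VALID" convention, and the iteration cap are all assumptions.

```python
def plan_with_llm_verifier(problem: str, max_iters: int = 10) -> str:
    # One GPT-4 call proposes a plan; a second GPT-4 call plays the verifier,
    # and its critique is fed back until it accepts the plan or we give up.
    plan = query_gpt4(f"Propose a plan for this Blocksworld problem:\n{problem}")
    for _ in range(max_iters):
        critique = query_gpt4(
            "Check whether the plan below is valid for the problem. "
            "Reply 'VALID' if it is, otherwise list the errors.\n"
            f"Problem:\n{problem}\nPlan:\n{plan}")
        if critique.strip().upper().startswith("VALID"):
            break  # the verifier LLM accepts -- possibly a false positive
        plan = query_gpt4(
            "Revise the plan using this feedback.\n"
            f"Feedback:\n{critique}\nProblem:\n{problem}\nPrevious plan:\n{plan}")
    return plan
```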

The researchers then ran experiments in the Blocksworld planning domain and empirically evaluated the following:

- the impact of self-critique on the plan-generation performance of the overall LLM+LLM system

- the performance of the verifier LLM relative to ground-truth verification

- the impact of the level of detail in the feedback on overall system performance when critiquing the LLM's generations.

The results show that self-critique reduces the LLM's plan-generation performance compared with using an external, reliable verifier.

The degradation can be directly attributed to the verifier LLM's poor results: it produces a large number of false positives, which can seriously damage the system's reliability.

The verifier LLM's binary classification accuracy is only 61%, with a large number of false positives (incorrect plans judged to be correct).

In addition, comparing different levels of feedback detail shows that the level of detail has little effect on plan-generation performance.

Overall, this study's systematic investigation provides preliminary evidence calling into question the effectiveness of LLMs as verifiers for planning tasks within an iterative, self-critiquing framework.

About the author: Subbarao Kambhampati

Subbarao Kambhampati is a professor of computer science at Arizona State University. He studies fundamental problems in planning and decision making, motivated in particular by the challenges of human-aware AI systems.

Reference:

https://twitter.com/rao2z/status/1715800819239678013

https://twitter.com/GaryMarcus/status/1715804178470387736
