2025-04-06 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)11/24 Report--
Thanks to a CTOnews.com reader for the tip! [New Zhiyuan Guide] The "strongest model on the surface," GPT-4, makes one reasoning mistake after another! New research by an MIT alumnus and Chinese researchers at UCLA has drawn a crowd of onlookers online.
GPT-4 can't reason at all!
Recently, two studies have found that GPT-4 does not perform well in reasoning.
Konstantine Arkoudas, an MIT alumnus, evaluated GPT-4 on 21 different types of reasoning problems, then analyzed its performance on them in qualitative detail.
The study found that GPT-4 occasionally shows the talent of "the strongest brain", but for now, GPT-4 has no reasoning ability at all.
Paper address: https://www.preprints.org/manuscript/202308.0148/v2
As soon as the study came out, it attracted many netizens to watch.
Marcus commented: "If this is true, as I have said before, we are still a long way from AGI. We may need to do a lot of recalibration: there can be no AGI without reasoning."
Another study from UCLA and the University of Washington also found that GPT-4 and GPT-3.5 performed poorly in reasoning about math, physics and chemistry tasks in universities.
Paper address: https://arxiv.org/pdf/2307.10635.pdf
The researchers introduced SCIBENCH, a benchmark for university-level science problem solving, which contains two datasets: an open set and a closed set.
In an in-depth study of GPT-4 and GPT-3.5 under different prompting strategies, the results show that GPT-4's average overall score is only 35.8%.
The study also caught Marcus's attention again:
Systematic evaluation of mathematical, chemical and physical reasoning shows that current LLMs do not deliver satisfactory performance, and no single prompting strategy is significantly better than the others.
Let's take a look at how GPT-4 failed, first on the 21 problem sets and then on university math, physics and chemistry.
GPT-4 flops on 21 problem sets. Before watching GPT-4 answer questions, the author adds a note:
GPT-4 is a nondeterministic system and may produce different answers across runs even with identical parameter settings.
The test exchanges below are recorded verbatim, and in the author's experience the GPT-4 errors discussed in the paper tend to be robust.
1. Simple arithmetic. The ability to perform basic arithmetic is a necessary condition for reasoning, yet GPT-4 still cannot reliably carry out basic operations such as addition and multiplication.
For example, GPT-4 was asked to pick two random numbers between 1381 and 1453, multiply them, and give the result.
GPT-4 chose 1405 and 1421, but its final result was clearly wrong, since 1405 × 1421 = 1,996,505.
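The product the model got wrong is trivial to verify with a one-line check:

```python
# Verify the multiplication GPT-4 failed at: 1405 x 1421.
product = 1405 * 1421
print(product)  # 1996505
assert product == 1996505
```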
2. Simple counting. Although counting is not necessarily a reasoning activity, it is certainly a prerequisite for any generally capable reasoning system.
Here, GPT-4 is given a propositional variable preceded by 27 negation signs and asked to count the negation signs.
That is easy for us, especially since the negation signs are grouped in fives: five groups followed by a final pair.
However, GPT-4 answered "28."
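The counting task itself is mechanical; a minimal sketch of it in Python (the variable name `p` is illustrative):

```python
# Build a formula with 27 negation signs before a propositional variable,
# then count the negation signs.
formula = "¬" * 27 + "p"
count = formula.count("¬")
print(count)  # 27
assert count == 27
```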
3. Common sense. At present, we can think of a common-sense argument as a simple inference drawn from given information plus unstated premises (default, generally accepted background knowledge).
In this particular case, the common-sense knowledge is the proposition that "a person is alive until they die, and is not alive afterwards."
For example, ask GPT-4: Mable's heart rate was 75 bpm at 9 a.m. and her blood pressure was 120/80 at 7 p.m. She died at 11 p.m. Was she alive at noon?
GPT-4 surprisingly replied: based on the information provided, it is impossible to determine whether Mable was alive at noon.
Yet the common-sense inference from the given information is immediate (no thought required): of course she was.
4. Elementary logic. If P(x) implies Q(x), and Q(a) does not hold, then we can infer that P(a) does not hold either (because if P(a) were true, Q(a) would be true as well).
This is among the most basic of tautologies, yet GPT-4 offered a complete countermodel:
Notably, GPT-4 recognized that P(x) does not actually imply Q(x) here, and proposed that x might be a negative even number, which "does not rule out models satisfying the other given conditions."
In fact, a countermodel must satisfy all the given premises while falsifying the conclusion.
Moreover, just a few sentences later, GPT-4 claimed that P(x) does imply Q(x) under the given interpretation, contradicting its own earlier statement.
This shows that GPT-4 can be internally inconsistent.
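The tautology in question can be confirmed mechanically by enumerating every interpretation over a small domain; a minimal sketch (the 2-element domain and the role of the constant a are illustrative):

```python
from itertools import product

# Check: if (forall x: P(x) -> Q(x)) and not Q(a), then not P(a),
# by enumerating every truth assignment for P and Q over a 2-element domain.
domain = [0, 1]  # element 0 plays the role of the constant a
for p_bits, q_bits in product(product([False, True], repeat=2), repeat=2):
    P = dict(zip(domain, p_bits))
    Q = dict(zip(domain, q_bits))
    premises = all(not P[x] or Q[x] for x in domain) and not Q[0]
    if premises:
        assert not P[0]  # the conclusion holds in every model of the premises
print("no countermodel exists")
```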
5. Simple quantifier semantics. Consider the following three sentences:
1. [forall x. P(x) ==> Q(x)]
2. [exists x. P(x)]
3. [exists x. ~Q(x)]
Please falsify or prove the following claim: these three sentences are jointly satisfiable.
Obviously, the three sentences are jointly satisfiable. A simple model is the domain {a1, a2} with P(a1), Q(a1), ¬P(a2) and ¬Q(a2); but GPT-4 drew the opposite conclusion.
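Joint satisfiability can be settled by the same kind of model enumeration; the sketch below finds every model over a 2-element domain and confirms the one just described:

```python
from itertools import product

# Find all models over a 2-element domain satisfying:
#   1. forall x. P(x) ==> Q(x)
#   2. exists x. P(x)
#   3. exists x. ~Q(x)
domain = ["a1", "a2"]
models = []
for p_bits, q_bits in product(product([False, True], repeat=2), repeat=2):
    P = dict(zip(domain, p_bits))
    Q = dict(zip(domain, q_bits))
    s1 = all(not P[x] or Q[x] for x in domain)
    s2 = any(P[x] for x in domain)
    s3 = any(not Q[x] for x in domain)
    if s1 and s2 and s3:
        models.append((P, Q))

print(len(models) > 0)  # True: the three sentences are jointly satisfiable
# The model from the text: P(a1), Q(a1), not P(a2), not Q(a2)
assert ({"a1": True, "a2": False}, {"a1": True, "a2": False}) in models
```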
6. Simple graph coloring. First consider a graph-coloring problem with no solution.
It is not hard to see that two colors are not enough for the graph described in the problem (for example, vertices 0, 2 and 4 form a clique, so at least three colors are needed).
In this short output, there are a slew of jaw-dropping errors.
GPT-4 begins by falsely claiming that the graph is complete (it obviously is not; for example, there is no edge between vertices 2 and 3).
Moreover, if the graph really were complete, it could not be colored with two colors, since a complete graph on six vertices needs at least six colors.
In other words, GPT-4's claims are not merely wrong but mutually contradictory: one moment it tells us the six-vertex graph is complete, which would make two-coloring impossible, and the next it offers a two-color "solution."
Notably, GPT-4 performs this badly not for lack of knowledge or data about graphs.
When the researchers asked GPT-4 what it knows about "complete graphs," it rattled off the correct definition along with a long list of results about K_n (the complete graph on n vertices).
Clearly, GPT-4 has memorized all this information but cannot apply it in new settings.
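The impossibility claim can itself be checked by brute force. The edge list below is hypothetical: the source only tells us that vertices 0, 2 and 4 form a triangle and that there is no edge between 2 and 3, so the remaining edges are illustrative.

```python
from itertools import product

# Hypothetical 6-vertex graph: the triangle 0-2-4 and the missing edge (2,3)
# match the facts stated in the text; the other edges are illustrative.
edges = [(0, 2), (2, 4), (0, 4), (0, 1), (1, 2), (3, 4), (4, 5)]
vertices = range(6)

def colorable(k):
    """Return True if some assignment of k colors leaves no monochromatic edge."""
    return any(
        all(coloring[u] != coloring[v] for u, v in edges)
        for coloring in product(range(k), repeat=len(vertices))
    )

print(colorable(2))  # False: the triangle 0-2-4 alone rules out two colors
print(colorable(3))  # True
```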
7. Subset sum. Let S = {2, 8, 6, 32, 22, 44, 28, 12, 18, 10, 14}. How many subsets of S sum to 37?
Every element of S is even, and a sum of even numbers cannot be odd, so the answer is 0.
However, instead of pausing to consider what S contains, GPT-4 reflexively generated what it deemed a plausible-looking response, and then went on to "hallucinate" an answer: 4.
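The parity argument can be confirmed by exhaustively counting subsets:

```python
from itertools import combinations

# Count the subsets of S whose elements sum to 37.
S = [2, 8, 6, 32, 22, 44, 28, 12, 18, 10, 14]
count = sum(
    1
    for r in range(len(S) + 1)
    for subset in combinations(S, r)
    if sum(subset) == 37
)
print(count)  # 0: every element is even, so no subset can have an odd sum
assert count == 0
```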
8. Elementary discrete mathematics. After telling GPT-4 that A × B denotes the Cartesian product of sets A and B, that a relation R from A to B is a subset of A × B, and that ∩ denotes set intersection, the author asked it to prove or disprove the claim dom(R1 ∩ R2) = dom(R1) ∩ dom(R2),
where R1 and R2 are binary relations from A to B, and dom(R) denotes the domain of the binary relation R.
The equality requires the subset relation to hold in both directions, but it holds only from left to right; a counterexample in the other direction is easy to find (for example, take R1 = {(1, 2)} and R2 = {(1, 3)}).
GPT-4, however, inferred that the claim is true, which it obviously is not.
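The counterexample takes a few lines to check:

```python
# Counterexample to dom(R1 ∩ R2) = dom(R1) ∩ dom(R2).
R1 = {(1, 2)}
R2 = {(1, 3)}

def dom(R):
    """Domain of a binary relation: the set of first components."""
    return {x for (x, _) in R}

lhs = dom(R1 & R2)       # dom of the empty relation: set()
rhs = dom(R1) & dom(R2)  # {1}
print(lhs, rhs)
assert lhs <= rhs        # the left-to-right inclusion holds
assert lhs != rhs        # but the equality fails
```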
9. Simple scheduling. GPT-4 also gets schedule planning wrong.
10. Russell's paradox. The barber version of Russell's paradox says: there is a barber b who shaves all and only those who do not shave themselves.
The negation of this sentence is a tautology that is easily derived in first-order logic.
If we read R(x, y) as "x is shaved by y," we can present this tautology and ask GPT-4 to prove or disprove it, as in the following prompt:
If such a barber x existed, then for all y we would have R(y, x) <=> ~R(y, y); substituting x for y gives R(x, x) <=> ~R(x, x), a contradiction.
GPT-4 understood the structure of the given sentence and what needed to be done impeccably; its subsequent case analysis, however, was muddled.
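The contradiction can also be exhibited by brute force: no "shaves" relation on a small domain admits such a barber. A sketch over a 2-element domain:

```python
from itertools import product

# Search every "shaves" relation on a 2-element domain for a barber x
# satisfying: for all y, shaves(y, x) <=> not shaves(y, y).
domain = [0, 1]
pairs = [(a, b) for a in domain for b in domain]
barber_found = False
for bits in product([False, True], repeat=len(pairs)):
    shaves = dict(zip(pairs, bits))
    for x in domain:
        if all(shaves[(y, x)] == (not shaves[(y, y)]) for y in domain):
            barber_found = True

print(barber_found)  # False: taking y = x forces shaves(x,x) == not shaves(x,x)
assert not barber_found
```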
11. Blocks world. This is a simple reasoning task that turns on a case analysis of block B3.
First, B3 is either green or it is not.
If it is green, then it sits on top of the non-green block B4, so the conclusion holds.
If it is not, then the green block B2 sits on top of the non-green block B3, so the conclusion still holds.
Yet GPT-4's performance here is also unsatisfactory.
There are five blocks stacked from top to bottom:
1. The second block from the top is green.
2. The fourth block from the top is not green.
Given these premises, prove or disprove the following conclusion: there is a green block directly on top of a non-green block.
GPT-4's proof strategy goes wrong from the start: it reasons by assuming two special cases.
Moreover, GPT-4 does reach a conclusion in its own reasoning (albeit a wrong one), yet still tells the user the problem is unsolved, reflecting the model's internal inconsistency.
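The case analysis above can be verified exhaustively, since there are only 2^5 ways to color the five blocks:

```python
from itertools import product

# Enumerate all colorings of five stacked blocks B1..B5 (top to bottom)
# consistent with the premises: B2 is green, B4 is not green.
conclusion_always_holds = True
for green in product([False, True], repeat=5):  # green[i] is True iff B(i+1) is green
    if not green[1] or green[3]:
        continue  # violates a premise
    # Conclusion: some green block sits directly on top of a non-green block.
    if not any(green[i] and not green[i + 1] for i in range(4)):
        conclusion_always_holds = False

print(conclusion_always_holds)  # True: the conclusion follows from the premises
assert conclusion_always_holds
```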
12. Spatial reasoning. Here the author chose a real-world orientation problem:
GPT-4 first answered that Boston is to the right. The author pointed out the error: although Boston, Massachusetts does lie to the right of South Dakota on a map, there is an extra condition, namely that the author's body is facing Texas.
That means Boston is to the author's left.
Later, when GPT-4 answered whether Boston is higher or lower than South Dakota, a more serious problem appeared: it gave two contradictory descriptions in the same answer.
13. Temporal reasoning. The author poses a fairly simple temporal-reasoning question here, but GPT-4's answer is still a mess.
Tom and Nancy commute to work. Nancy's commute takes about 30 to 40 minutes, while Tom's takes about 40 to 50 minutes. Last Friday, Nancy left home between 8:10 and 8:20 a.m., while Tom arrived at work between 8:50 and 9:10 a.m. In addition, Nancy arrived at work after Tom left home, but no more than 20 minutes after. Can you infer when Tom and Nancy arrived at work last Friday?
After combing through the information in the question, GPT-4 gives its reasoning process:
"If Tom left home at the latest possible time (8:20)...": this sentence is wrong from the start.
In fact, the problem never says when Tom leaves home at the latest; GPT-4 misapplied Nancy's times ("Nancy left home between 8:10 and 8:20") to Tom.
Meanwhile, the conditional statement GPT-4 gives is confused: its hypothesis includes information about Tom that is irrelevant to its conclusion about Nancy's arrival time: "If Tom left home at the latest time (8:20), and Nancy left at her latest (8:20) with a commute of at most 40 minutes, then Nancy arrived at work by 9:00 at the latest."
Then GPT-4 wrongly inferred: "Since Tom's commute takes at least 40 minutes, this means he arrived at work by 9:00 at the latest."
That conclusion obviously does not follow: it cannot be derived from the known fact that Tom's commute takes at least 40 minutes.
The next part of the answer rests on the same erroneous assumption that Tom left home at 8:10 at the earliest (again, that departure window is Nancy's, not Tom's).
It then claims that Nancy arrived at 8:45, which is inconsistent with the 20-minute condition given a departure at 8:10.
Finally, it wrongly concluded that both Tom and Nancy arrived between 8:50 and 9:00.
Throughout its reasoning, GPT-4 repeatedly attached information to the wrong person, and its final answer rests on faulty premises.
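The constraints are simple enough to brute-force over minutes. The windows below (Nancy leaves 8:10 to 8:20 with a 30 to 40 minute commute; Tom arrives 8:50 to 9:10 with a 40 to 50 minute commute; Nancy arrives after Tom leaves but at most 20 minutes after) are a reconstruction of the garbled problem statement, so treat the numbers as assumptions; the point that survives is that no unique arrival time can be read off the way GPT-4 did:

```python
# Brute-force the commute puzzle in minutes after midnight.
# The time windows are reconstructed from the garbled problem statement.
nancy_arrivals, tom_arrivals = set(), set()
for n_leave in range(490, 501):               # Nancy leaves 8:10 .. 8:20
    for n_commute in range(30, 41):           # 30 .. 40 min
        n_arrive = n_leave + n_commute
        for t_arrive in range(530, 551):      # Tom arrives 8:50 .. 9:10
            for t_commute in range(40, 51):   # 40 .. 50 min
                t_leave = t_arrive - t_commute
                if 0 < n_arrive - t_leave <= 20:
                    nancy_arrivals.add(n_arrive)
                    tom_arrivals.add(t_arrive)

# Both sets contain more than one value: the arrival times are not uniquely determined.
print(len(nancy_arrivals) > 1, len(tom_arrivals) > 1)
```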
14. Murder or suicide? The author devised a logic puzzle and listed nine conditions, asking GPT-4 to find out who really killed Aunt Agatha.
1. Someone who lives in Dreadbury Mansion killed Aunt Agatha.
2. The only occupants of Dreadbury Mansion are Aunt Agatha, the butler, and Charles.
3. A killer always hates his victim and is never richer than his victim.
4. Charles hates no one that Aunt Agatha hates.
5. Aunt Agatha hates everyone except the butler.
6. The butler hates everyone not richer than Aunt Agatha.
7. The butler hates everyone Aunt Agatha hates.
8. No one hates everyone.
9. Aunt Agatha is not the butler.
The correct answer is that Aunt Agatha killed herself.
First, by condition 5, Aunt Agatha must hate herself, since she hates everyone except the butler.
Then, by condition 4, Charles does not hate her, so he cannot have killed her.
By conditions 5 and 7, the butler cannot hate himself, for if he did, he would hate everyone, violating condition 8.
By condition 6, the butler must be richer than Aunt Agatha, for otherwise he would hate himself, contradicting the previous conclusion.
By condition 3, then, the butler cannot be the killer.
In its reasoning, GPT-4 correctly excluded Charles but could not rule out the butler, arriving at the wrong conclusion that the butler was the murderer.
Another key mistake: GPT-4 claimed that since Aunt Agatha hates everyone except the butler (condition 5), at least she does not hate herself.
This is a bizarre error, since condition 5 directly entails that Aunt Agatha hates herself.
At the same time, GPT-4 again displays recurring inconsistency: in almost every reply it claims to derive both a proposition and its negation.
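The puzzle is small enough to solve by exhaustive search. A sketch under a standard encoding (hates and richer treated as arbitrary binary relations over the three occupants; conditions 1, 2 and 9 are built into the setup):

```python
from itertools import product

PEOPLE = ["agatha", "butler", "charles"]
PAIRS = [(a, b) for a in PEOPLE for b in PEOPLE]

suspects = set()
for h_bits, r_bits in product(product([False, True], repeat=9), repeat=2):
    hates = dict(zip(PAIRS, h_bits))
    richer = dict(zip(PAIRS, r_bits))
    # Condition 5: Agatha hates everyone except the butler (and no one else is exempt).
    if any(hates[("agatha", x)] != (x != "butler") for x in PEOPLE):
        continue
    # Condition 4: Charles hates no one Agatha hates.
    if any(hates[("agatha", x)] and hates[("charles", x)] for x in PEOPLE):
        continue
    # Condition 6: the butler hates everyone not richer than Agatha.
    if any(not richer[(x, "agatha")] and not hates[("butler", x)] for x in PEOPLE):
        continue
    # Condition 7: the butler hates everyone Agatha hates.
    if any(hates[("agatha", x)] and not hates[("butler", x)] for x in PEOPLE):
        continue
    # Condition 8: no one hates everyone.
    if any(all(hates[(x, y)] for y in PEOPLE) for x in PEOPLE):
        continue
    # Conditions 1 and 3: the killer hates Agatha and is not richer than her.
    for killer in PEOPLE:
        if hates[(killer, "agatha")] and not richer[(killer, "agatha")]:
            suspects.add(killer)

print(suspects)  # {'agatha'}: Aunt Agatha killed herself
assert suspects == {"agatha"}
```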
15. Wason selection task. The Wason selection task is a staple of the psychology of reasoning.
GPT-3.5 failed this test in the January paper, and in this study GPT-4's performance is still unsatisfactory.
Seven cards lie on the table, each with a number on one side and a single-colored patch on the other. The faces showing are 50, 16, red, yellow, 23, green, and 30.
To test the proposition "if a card shows a multiple of 4 on its face, the color on its back is yellow," which cards need to be flipped?
GPT-4's answers show that it does not understand the semantics of conditionals: in saying that cards "50" and "30" must be flipped, it seems to mistake the conditional for a biconditional.
And regardless of whether its individual choices are right or wrong, GPT-4's statements are internally inconsistent.
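The correct selection follows mechanically from the semantics of the conditional: flip exactly the cards whose hidden side could falsify it.

```python
# Wason selection: which cards must be flipped to test
# "if a card shows a multiple of 4, the other side is yellow"?
cards = ["50", "16", "red", "yellow", "23", "green", "30"]

def must_flip(face):
    if face.isdigit():
        # A multiple of 4 could hide a non-yellow back.
        return int(face) % 4 == 0
    # A non-yellow color could hide a multiple of 4.
    return face != "yellow"

print([c for c in cards if must_flip(c)])  # ['16', 'red', 'green']
```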
16. Entropy. A basic result of information theory is that the entropy of a random vector Z is bounded above by the sum of the entropies of the random variables that make up Z.
Therefore, the answer to the following question should be "under no circumstances."
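The subadditivity bound H(Z) ≤ Σ H(Z_i) can be illustrated numerically; the joint distribution below (two perfectly correlated bits) is an illustrative example, not from the paper:

```python
from math import log2

# Joint distribution of (X, Y): two perfectly correlated fair bits.
joint = {(0, 0): 0.5, (1, 1): 0.5}

def entropy(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginal distributions of X and Y.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

h_joint = entropy(joint)           # 1 bit
h_sum = entropy(px) + entropy(py)  # 2 bits
print(h_joint, h_sum)
assert h_joint <= h_sum + 1e-12    # subadditivity: H(X, Y) <= H(X) + H(Y)
```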
17. Correctness of a simple compiler. The final reasoning problem given to GPT-4 is the most challenging: proving the correctness of a simple expression compiler.
In this test, GPT-4 did produce a correct proof, by structural induction on the abstract syntax of expressions.
This is likely because it has seen similar proofs before: the author's example is a common type of exercise in programming courses and textbooks.
Even so, GPT-4 got some of the details wrong.
Conclusion: reasoning ability matters, and GPT-4 doesn't have it. Given that GPT-4 is currently the most capable LLM, the author draws three main conclusions from the above analysis.
1. Using generative AI in software development (or in science and engineering generally) is fraught with risk for anything beyond tedious tasks (such as accelerated autocomplete for knowledge-intensive coding problems). In these fields, normativity and correctness are crucial, and current LLMs cannot meet those standards.
2. As the reasoning abilities of LLMs improve, rigorous proof checking will become increasingly important. Reasoning expressed in natural language can be examined by asking an LLM to formalize it, or by training other LLMs as checkers.
3. As things stand, dystopian scenarios of AI subjugating humanity, or of humans using AI for evil ends, are extremely far-fetched, to the point of absurdity. When the most advanced AI systems cannot even tell left from right (question 12 above), calling for policies to protect humans from them is at best premature and at worst a waste of resources.
Inevitably, some will say these results are "cherry-picked." But that reflects a misunderstanding of what cherry-picking means: depending on the logical structure and broader context of the propositions involved, selecting examples is sometimes exactly what is required.
Debugging a computer program to find and understand its weaknesses, trying to falsify a scientific theory, test-driving a new car, or searching for a countermodel to a conjectured theorem: all of these are inherently "cherry-picking."
If you discover that your new car has a flat tire, for instance, the dealer could protest that you are "cherry-picking the data": after all, taken as a whole, 75% of the car's tires are in perfect condition.
Similarly, applications in science, medicine and engineering, especially software engineering, are held to strict standards.
Just as we don't want a bridge that stands 90% of the time, we need sorting algorithms that work on all inputs, not just most of them, and shopping carts that charge the right amount every time, not just most of the time.
Unlike recommendation engines, these computation- and reasoning-intensive applications must be highly reliable.
About the author: Konstantine Arkoudas.
Until last year, Konstantine Arkoudas was a researcher in RPI's Department of Cognitive Science and at MIT CSAIL.
He is currently a senior research scientist at Telcordia Research Labs, focusing on AI and on applying formal methods to real-world problems in the telecommunications and networking industries.
He received his PhD in computer science from MIT in 2000. Before that, he earned a master's degree in computer science, a master's degree in philosophy, and a bachelor's degree in computer science with a minor in philosophy.
GPT-4 scores 35.8% on university math, physics and chemistry. The UCLA study mainly evaluated the reasoning abilities of GPT-4 and GPT-3.5 in mathematics, chemistry and physics.
To strengthen LLMs on mathematical and similar tasks, some have proposed chain-of-thought (CoT) prompting, which guides the model to generate its answer step by step and thereby think through the problem more deeply.
Yet even with its particular advantages, this method falls short of fully solving complex scientific problems.
Below is a university physical-chemistry problem together with solutions generated under two prompting strategies.
GPT-4 with CoT makes obvious calculation errors, while GPT-4 prompted to use Python as an external tool misunderstands the mathematical equations.
Errors are marked in red and corrections in purple. This motivated the study's introduction of SCIBENCH, a university-level scientific problem-solving benchmark.
The open dataset includes questions collected from textbooks widely used in university courses, covering fundamental physics, thermodynamics, classical mechanics, quantum chemistry, physical chemistry, calculus, statistics and differential equations.
The closed dataset simulates real-world assessment: it comprises seven sets of midterm and final exams from three university courses in computer science and mathematics.
The closed exam dataset lists, for each exam, the number of questions and the percentage with detailed answers. Question formats vary, including free-response, multiple-choice and true/false; the numbers in parentheses give each question's point value. Unlike existing benchmarks, all questions in SCIBENCH are open-ended free-response questions.
With the datasets in hand, the study evaluates two representative LLMs, GPT-3.5 and GPT-4, under different prompting strategies, including CoT, zero-shot and few-shot learning.
The researchers also prompted the models to use external tools, such as Python and Wolfram.
The experiments show that, without any complex prompts or external tools, GPT-3.5 and GPT-4 average only 10.62% and 16.81% accuracy on the open datasets, respectively.
With CoT and external tools added, the best accuracy on the same datasets is still only 35.8%, though that is a large improvement over the baseline.
Under the strongest configuration, combining CoT prompting with external tools, GPT-4 averaged 35.80% on the open datasets and 51.57% on the closed datasets.
These results indicate considerable room for improvement in future LLMs.
To understand the limitations of LLMs in scientific problem solving more fully, the researchers also propose a novel "self-refinement" method to uncover the deficiencies in LLM-generated solutions.
It works through the following evaluation protocol.
First, the correct solutions are compared with those generated by the LLM and, with the help of human annotators, ten essential skills for successful scientific problem solving are identified.
They are: logical decomposition and analysis; assumption identification; spatial perception; causal reasoning; problem deduction; abstract reasoning; scientific literacy; code conversion; logical reasoning; and calculation.
The team then used an LLM-driven self-evaluation to automatically classify, for each experimental configuration, the skills lacking in the benchmarked LLM's solutions.
An error profile of GPT-3.5 on the text datasets under six settings reveals how the deficits are distributed across its ten problem-solving skills.
Finally, the analysis finds that:
(1) although CoT significantly improves calculation ability, it is less effective elsewhere;
(2) prompting the model to use external tools may harm its other basic skills;
(3) few-shot learning does not, in general, improve scientific problem solving.
In short, the results show that current large language models remain weak at problem solving and, even with tool support, still face limitations.
References:
https://www.preprints.org/manuscript/202308.0148/v2
https://arxiv.org/pdf/2307.10635.pdf