GPT-4's graphic reasoning ability is less than half that of humans.
A study from the Santa Fe Institute in the United States shows that GPT-4 answers graphic reasoning questions with only 33% accuracy.
GPT-4V, which adds multimodal capabilities, performed even worse, getting only 25% of the questions right.
[Figure: the dotted line represents the average performance across the 16 tasks]
Once published, the results quickly sparked a heated debate on Hacker News.
Netizens who agreed with the results said that GPT really is poor at abstract graphics, and that concepts such as "position" and "rotation" are especially hard for it to grasp.
On the other hand, many netizens questioned the conclusion. To put it simply:
The result cannot be called wrong, but it is not convincing to call it entirely right, either.
As for the specific reasons, read on.
GPT-4 accuracy is only 33%

To assess how humans and GPT-4 perform on these graphics problems, the researchers used ConceptARC, a dataset released by their own institution in May of this year.
ConceptARC comprises 16 subcategories of graphic reasoning questions, with 30 questions per category, 480 questions in total.
The 16 subcategories cover aspects such as spatial relations, shapes, operations, and comparisons.
Specifically, each problem is built from blocks of pixels: from the given example pairs, humans and GPT must work out the underlying rule and then predict the result of applying the same transformation to a new image. (A toy representation of such a task is sketched below.)
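To make the task format concrete, here is a minimal sketch of how an ARC-style problem is commonly represented, following the public ARC JSON convention in which each grid is a matrix of integers 0-9 encoding colors. The grids below are invented for illustration and are not taken from ConceptARC.

```python
# An ARC-style task: demonstration input/output pairs plus a test input.
# Each grid is a matrix of integers 0-9, where each integer encodes a color.
task = {
    "train": [
        # Toy rule for illustration: shift the colored column one step right.
        {"input":  [[1, 0, 0],
                    [1, 0, 0]],
         "output": [[0, 1, 0],
                    [0, 1, 0]]},
        {"input":  [[0, 1, 0],
                    [0, 1, 0]],
         "output": [[0, 0, 1],
                    [0, 0, 1]]},
    ],
    # The solver must infer the rule from "train" and apply it to this grid.
    "test": [{"input": [[0, 1, 0, 0],
                        [0, 1, 0, 0]]}],
}
```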
In the paper, the authors show one example from each of the 16 subcategories.
As a result, the 451 human subjects averaged at least 83% accuracy on every subcategory, and their average across all 16 tasks reached 91%.
GPT-4 (one-shot), by contrast, was graded leniently, with three attempts allowed per question (counted correct if any one succeeded); even so, its best subcategory accuracy was no higher than 60%, and its average was only 33%. (A sketch of this best-of-three scoring follows.)
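As a concrete reading of that scoring rule, here is a minimal sketch (function and variable names are illustrative, not from the paper): a question counts as solved if any of the three attempts reproduces the target grid.

```python
def solved(attempts, target):
    """Best-of-k grading: correct if any attempt matches the target grid."""
    return any(attempt == target for attempt in attempts)

def accuracy(results):
    """results: list of (attempts, target_grid) pairs, one per question."""
    return sum(solved(a, t) for a, t in results) / len(results)
```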
Earlier, the authors of the ConceptARC benchmark used in this experiment had run a similar test; in that zero-shot setting, GPT-4's average accuracy across the 16 tasks was only 19%.
The multimodal GPT-4V fared even worse: on a smaller ConceptARC subset of 48 questions, its zero-shot and one-shot accuracies were only 25% and 23%, respectively.
After further analysis of the wrong answers, the researchers found that many human errors looked like simple carelessness, whereas GPT's answers suggested it had not understood the rules in the questions at all.
Netizens generally did not dispute the data itself; what they questioned was how the subjects were recruited and how the problems were fed to GPT.
The choice of subjects drew fire first: initially, the researchers recruited subjects on Amazon's crowdsourcing platform, Mechanical Turk.
The researchers selected some simple questions from the dataset as an entry test; subjects had to answer at least two of three randomly drawn questions correctly to proceed to the formal test.
However, the entry-test results showed that some participants just wanted the payment and were not actually attempting the questions.
The researchers therefore raised the bar: to take the test, a worker had to have completed at least 2,000 tasks on the platform with an approval rate of 99%.
Still, although the authors screened participants by approval rate, they imposed "no special requirements" on specific skills: apart from requiring that subjects speak English, no graphics-related or other professional abilities were demanded.
To diversify the data, the researchers moved recruitment to another crowdsourcing platform toward the end of the experiment; in total, 415 subjects took part.
Even so, some questioned whether the experimental sample was "random enough."
Other netizens pointed out that on the Amazon crowdsourcing platform the researchers used, some "workers" may in fact be large models impersonating humans.
Now let's look at the GPT side of the operation. The multimodal version is relatively simple: the image is sent directly, accompanied by a prompt describing the task (an illustrative API call is sketched below).
In the zero-shot test, the corresponding EXAMPLE section of the prompt is simply removed.
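As an illustration of "sending the picture directly," here is a sketch of a GPT-4V call using OpenAI's chat completions API; the model name, image URL, and prompt wording are placeholders rather than the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",   # the GPT-4V endpoint available at the time
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("Infer the transformation rule from the example pairs "
                      "in the image and apply it to the test input grid.")},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/conceptarc_task.png"}},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```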
For the text-only GPT-4 (0613), however, which lacks multimodal input, the image first has to be converted into a lattice of points, with numbers standing in for colors, as in the sketch below.
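One plausible form of that conversion, sketched below under the assumption that tasks use standard ARC JSON grids (the paper's exact serialization may differ), renders each grid as rows of digits and labels the demonstration pairs as EXAMPLEs:

```python
def grid_to_text(grid):
    """Serialize a grid of color codes (integers 0-9) as rows of digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def task_to_prompt(task):
    """Render an ARC-style task as plain text for a text-only model."""
    parts = []
    for i, pair in enumerate(task["train"], 1):
        parts.append(f"EXAMPLE {i} INPUT:\n{grid_to_text(pair['input'])}")
        parts.append(f"EXAMPLE {i} OUTPUT:\n{grid_to_text(pair['output'])}")
    parts.append(f"TEST INPUT:\n{grid_to_text(task['test'][0]['input'])}")
    parts.append("TEST OUTPUT:")
    return "\n\n".join(parts)
```

In the zero-shot variant, the EXAMPLE blocks would simply be omitted.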
Some readers took issue with this treatment:
Once the image is converted into a matrix of digits, the nature of the task changes completely; even humans may struggle to make sense of a "graphic" represented as numbers.
One More Thing

Coincidentally, Joy Hsu, a Chinese doctoral student at Stanford, also tested GPT-4V's understanding of figures using a geometry dataset.
The dataset was released last year to test large models' understanding of Euclidean geometry; after GPT-4V opened up, Hsu ran the test again.
The result: GPT-4V's understanding of the figures appears to be "completely different from that of humans."
Statistically, GPT-4V's accuracy on these geometry questions is also significantly lower than that of humans.
Paper addresses:
[1] https://arxiv.org/abs/2305.07141
[2] https://arxiv.org/abs/2311.09247
Reference links:
[1] https://news.ycombinator.com/item?id=38331669
[2] https://twitter.com/joycjhsu/status/1724180191470297458
This article is from the WeChat official account QbitAI (量子位). Author: Creasy.