[New Zhiyuan Guide] The University of Maryland has released HallusionBench, the first benchmark designed specifically for VLMs, which comprehensively probes GPT-4V for visual errors and language hallucinations.
GPT-4 has been hyped so heavily that GPT-4V, its vision-enabled counterpart, has also raised high public expectations.
But consider this: the Pythagorean theorem, familiar to every junior high school student, applies only to right triangles.
Yet GPT-4V confidently uses it to compute the length of the longest side of an obtuse triangle.
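For context, the correct way to compute the third side of a non-right triangle is the law of cosines, c^2 = a^2 + b^2 - 2ab*cos(C), which reduces to the Pythagorean theorem only when C = 90 degrees. The short numerical illustration below is ours, not an example from the benchmark:

```python
import math

# Illustrative only (not from the paper): law of cosines vs. a naive Pythagorean answer.
def third_side(a: float, b: float, angle_deg: float) -> float:
    """Length of the side opposite the given angle, by the law of cosines."""
    c_squared = a * a + b * b - 2 * a * b * math.cos(math.radians(angle_deg))
    return math.sqrt(c_squared)

print(third_side(3, 4, 90))   # 5.0   -- right angle: the Pythagorean theorem applies
print(third_side(3, 4, 120))  # ~6.08 -- obtuse angle: sqrt(3^2 + 4^2) = 5 would be wrong
```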
Even more outrageous, GPT-4V makes a fatal safety mistake, claiming that a car can go through a red light.
What on earth is going on?
The University of Maryland research team uncovered these problems during their exploration and, on that basis, proposed two main error types, language hallucination and visual illusion, to explain their causes.
Paper link: https://arxiv.org/abs/2310.14566
Project home page: https://github.com/tianyi-lab/HallusionBench
Based on this analysis, the researchers built HallusionBench, an image-context reasoning benchmark designed to probe the complexity of reasoning over images together with their context.
In their tests of visual ability, GPT-4V answered nearly 90% of the visual questions incorrectly.
The researchers also conducted a detailed study of the newly released GPT-4V(ision) and LLaVA-1.5, analyzing their visual understanding abilities.
HallusionBench is the first benchmark designed specifically for VLMs that focuses on visual illusions and knowledge hallucinations. It contains about 200 visual question-answer items, nearly half of which were handcrafted by human experts.
At present, the data is open source and is still being updated.
It covers a variety of image types, including original optical-illusion images, charts, maps, posters, videos, and manually created or modified images, spanning fields such as mathematics, counting, culture, animation, sports, and geography.
In the paper, the authors first describe the two categories of visual questions in HallusionBench, Visual Dependent and Visual Supplement, and explain how the experimental control groups were designed.
They then analyze the two main causes of wrong answers: Visual Illusion and Language Hallucination.
Finally, the authors walk through failure cases in each major category, organized by subcategory, with in-depth analysis.
Key points:
1. "language hallucination": misleads 90% of sample reasoning in GPT-4V and LLaVA-1.5. The delicate balance between vision and language is crucial!
2. "Visual illusion": the visual module in LVLMs is easily affected by complex visual context, and the errors of language model are exaggerated.
3. Simple image modification can deceive GPT-4V and LLaVA-1.5, exposing the need for more powerful image analysis capabilities.
4. GPT-4V has difficulty in reasoning the time relationship between multiple images.
5. LLaVA-1.5 sometimes makes mistakes in common sense queries and needs to improve its language model a priori.
Visual question types
Visual Dependent questions:
The answer to this kind of question depends entirely on the visual content; without the image, it cannot be answered definitively.
These questions usually concern the image itself or what it shows. For example, without the image there is no meaningful answer to a question like "Is the orange circle on the right the same size as the one on the left?"
Visual Supplement questions:
These questions can be answered even without visual content; the image only supplies additional information.
For example, even without an image, GPT-4V can still answer a question like "Is New Mexico bigger than Texas?"
The core of the test is to determine whether GPT-4V and LLaVA-1.5 answer using the image content rather than relying solely on their parametric memory.
The authors analyze the wrong answers and divide their causes into two categories:
Visual Illusion:
This kind of error stems from incorrect visual recognition and interpretation of the input image: the model fails to extract accurate information from the image or to reason about it correctly.
Language Hallucination:
Drawing on its parametric knowledge base, the model makes unwarranted preconceived assumptions about the question and the image context. The model should respond to the specific context of the question rather than ignoring the question or misinterpreting the image.
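The benchmark's actual evaluation protocol is defined in the authors' repository; purely as a minimal sketch of the with/without-image control idea described above (the ask_model function and record fields below are hypothetical assumptions, not the benchmark's real API), one could compare a model's answers in the two settings to decide which error type a failure resembles:

```python
# Hypothetical sketch of the with/without-image control comparison described above.
# `ask_model(question, image)` and the record fields are illustrative assumptions,
# not HallusionBench's actual interface.
def diagnose(record, ask_model):
    """record = {'question': str, 'image': str | None, 'answer': 'yes' | 'no'}"""
    with_image = ask_model(record["question"], image=record["image"])
    text_only = ask_model(record["question"], image=None)

    if with_image == record["answer"]:
        return "correct"
    if with_image == text_only:
        # The answer did not change when the image was shown: the model fell back
        # on parametric prior knowledge, closer to a language hallucination.
        return "language_hallucination"
    # The model reacted to the image but still answered wrongly: the visual module
    # likely misread the input, closer to a visual illusion.
    return "visual_illusion"
```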
As the classic visual-illusion case in Figure 1 shows, GPT-4V displays a richer knowledge base than LLaVA-1.5 in recognizing various illusion images and their names.
Figure 1
However, when answering questions about the edited versions of these images, GPT-4V failed to give accurate answers.
This may be because GPT-4V relies more on its parametric stored knowledge than on actually analyzing the image.
By contrast, LLaVA-1.5 performs relatively poorly on both the original and the edited images, reflecting its limited visual recognition ability.
Looking at the samples in Figure 2, both GPT-4V and LLaVA-1.5 fail to reason correctly about parallel lines, equilateral triangles, polygons, and the related mathematical theorems.
This phenomenon reveals that GPT-4V still faces great challenges in dealing with geometric and mathematical problems.
Figure 2
In Figure 3, the authors show several posters of well-known regional dishes, but the geographic attributions of these dishes have been altered.
Faced with such a scene, both GPT-4V and LLaVA-1.5 fail to take the contextual information fully into account: they ignore the image content and answer the questions according to the well-known origins mentioned in the text.
Figure 3
In Figure 4, the authors further examine the ability to process sequences of multiple images.
The forward and reversed orderings of the same images often carry opposite meanings, such as "appearing" versus "disappearing" or "moving backward" versus "moving forward."
The comparison in Figure 4 shows that although the two orderings depict different dynamics, GPT-4V still cannot distinguish the forward sequence from the reversed one.
This finding indicates that GPT-4V still needs substantial improvement in reasoning over video sequences.
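Purely to make the forward-versus-reversed setup concrete (the frame paths and question below are hypothetical, not items from HallusionBench), such a paired test case could look like this:

```python
# Hypothetical illustration of the forward vs. reversed multi-image test described above.
frames = ["jump_1.png", "jump_2.png", "jump_3.png", "jump_4.png"]  # placeholder frame paths

forward_case = {"images": frames, "question": "Is the person jumping up?", "answer": "yes"}
reversed_case = {"images": list(reversed(frames)), "question": "Is the person jumping up?", "answer": "no"}

# A model with genuine temporal reasoning should answer the two cases differently;
# the paper reports that GPT-4V often does not.
```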
Figure 5 shows a case in which GPT-4V gives a definitive answer even when no image context is provided.
Figure 5
In contrast, LLaVA-1.5, failing to understand the text, gives an answer that is technically correct but irrelevant to the question.
When a modified value of π is supplied as the visual input, neither model can correctly read and use the value from the image.
The case in Figure 6 shows that both GPT-4V and LLaVA-1.5 answer accurately and conclusively when the visual input is absent.
Figure 6
However, when the table is provided as visual input, GPT-4V tries to answer from the visual information but reads off the wrong data.
For example, GPT-4V mistakenly replies that "China has won 36 gold medals," although the chart actually attributes those gold medals to the United States.
In contrast, LLaVA-1.5 relies more on its parametric memory, and its behavior differs between the question alone and the question with the table.
In the scenario of Figure 7, even without a visual aid, both GPT-4V and LLaVA-1.5 give definitive answers, and GPT-4V's answer is the more accurate one.
Figure 7
When a chart is introduced as visual input, GPT-4V answers accurately based on the data in the chart, while LLaVA-1.5 still relies on its parametric knowledge.
But once the chart is flipped, GPT-4V's answer changes fundamentally. This error can be attributed to visual illusion.
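The authors' editing scripts are not shown here; purely as an illustration of how simple such a manipulation is (the file name below is a placeholder), an image can be mirrored or flipped with a few lines of Pillow:

```python
from PIL import Image, ImageOps

# Illustrative only: the kind of trivial edit that, per the paper, can change GPT-4V's answer.
chart = Image.open("chart.png")                    # placeholder path, not a benchmark file
ImageOps.mirror(chart).save("chart_mirrored.png")  # left-right flip
ImageOps.flip(chart).save("chart_flipped.png")     # top-bottom flip
```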
In Figure 8, without the image, both GPT-4V and LLaVA-1.5 give definite answers, but only GPT-4V answers correctly.
Figure 8
From this it can be inferred that GPT-4V is superior to LLaVA-1.5 at the knowledge level.
However, when the visual presentation of the map is changed, neither model can correctly infer the relative positions of the four states; their strong parametric memory overrides what the image actually shows.
In recent years, with the rapid development of large-scale language models and multimodal research, the field of artificial intelligence has undergone great changes.
The combination of natural language processing (NLP) and computer vision (CV) has not only given rise to large vision-language models (LVLMs) but also significantly improved performance on image reasoning tasks.
However, LVLMs still face challenges such as language hallucination and visual illusion.
By releasing HallusionBench, this study aims to provide a benchmark for VLMs, especially in complex situations where they are prone to failure due to language hallucinations or visual illusions.
We have explored in depth the different examples and failures of GPT-4V and LLaVA-1.5, including:
1. In HallusionBench, GPT-4V and LLaVA-1.5 are often affected by language hallucinations when handling questions that involve prior knowledge. The models tend to lean on that prior knowledge, which led to more than 90% of the answers in our analysis being wrong. Models therefore need to strike a balance between parametric memory and the input text and images.
2. Even when GPT-4V and LLaVA-1.5 have no relevant parametric memory or prior knowledge, they remain vulnerable to visual illusions. The models often give wrong answers on geometric figures, mathematical images, videos (multi-image scenes), complex charts, and so on. The visual processing ability of current vision-language models is still very limited.
3. GPT-4V and LLaVA-1.5 are easily misled by basic image manipulations in HallusionBench, such as flipping, reversing the frame order, occlusion, object editing, and color modification. Current vision-language models cannot yet handle these manipulations effectively.
4. Although GPT-4V supports multi-image input, it fails to show effective temporal reasoning when analyzing multi-image questions involving time cues, and does not perform well on these in HallusionBench.
5. In the HallusionBench tests, LLaVA-1.5 sometimes makes basic mistakes because of its comparatively limited knowledge base.
The authors say the dataset is open source and continues to expand; the latest data will be kept up to date on GitHub (https://github.com/tianyi-lab/HallusionBench).
This study lays the foundation for more powerful, balanced, and accurate LVLMs, and the authors hope that these detailed case studies will suggest directions for future research.
Reference:
https://arxiv.org/abs/2310.14566
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).