Multimodal large language models show strong image understanding and reasoning abilities. However, predicting and reasoning about future events from current observations remains very difficult for them.
Even GPT-4V, the most powerful model available today, does not handle this problem well (see the figure below).
△ A failure case of GPT-4V
Now, teams from the University of Science and Technology of China and ShanghaiTech University have proposed a learning paradigm that endows multimodal large language models with foresight, and on top of it have built a multimodal large language model named Merlin.
Merlin is a legendary figure in Arthurian legend, famous for his powerful magic and wisdom; he is said to have had the ability to foresee the future and a deep understanding of fate.
Let's take a look at how it is done.
Note: humans can infer events that are about to happen, or that may occur in the near future, from the currently observed state; we call this ability foresight.
A simple example:
When you watch an NBA game on TV, you can judge what may happen next from the state of the different players on the court.
For example, when an offensive player breaks past the defender with the ball, we have good reason to judge that the player is about to drive to the basket for a layup or dunk.
Or, when the ball handler stops at the three-point line facing the basket, we have good reason to predict that the player is about to attempt a three-pointer (though it may also be a fake to shake off the defender and drive).
Merlin can make exactly this kind of prediction.
Methods
To explore how to endow multimodal large language models with foresight, we first analyzed in depth how humans predict future events.
We regard human reasoning about and prediction of future events as a two-stage process.
In the first stage, we observe the current scene, focusing on capturing dynamic cues about the relevant subjects. In the second stage, the brain analyzes each subject's behavior pattern (such as running) and behavioral intention from these dynamic cues, and then infers the events that are likely to happen next.
For a standard multimodal large language model, we believe the second stage is already handled well, thanks to the strong logical reasoning ability of the underlying large language model.
So the problem lies in the first stage: current multimodal large language models struggle to capture the dynamic information of the relevant subjects, and this limits their ability to reason about future events.
Given this conclusion, the next step is to explore how to make a multimodal large language model learn to capture the dynamic cues of the relevant subjects from current observations.
One direct solution is to have the model learn to predict all the information in the next frame, that is, to use next-frame reconstruction as the optimization objective.
However, this objective is hard to learn, and image and video sequences contain a great deal of redundant visual information, which works against the model learning to capture the dynamic information of the relevant subject.
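To make the redundancy argument concrete, here is a back-of-the-envelope comparison of how much the model would have to predict under each objective. The frame size, patch size, and clip length below are assumed for illustration, not taken from the paper.

```python
# Back-of-the-envelope comparison of the two candidate objectives.
# All numbers below (frame size, patch size, clip length) are assumed
# for illustration, not taken from the Merlin paper.

frames = 8              # length of an observed clip
height = width = 336    # assumed input resolution
patch = 14              # assumed ViT patch size

# Objective A: reconstruct the next frame pixel by pixel.
pixels_to_predict = height * width * 3
# Even as patch tokens, the target is large and mostly redundant background:
patch_tokens = (height // patch) * (width // patch)

# Objective B: predict the subject's trajectory, one box per frame.
box_numbers = frames * 4          # (x1, y1, x2, y2) per frame

print(f"next-frame pixels : {pixels_to_predict:,}")   # 338,688
print(f"next-frame patches: {patch_tokens}")          # 576
print(f"trajectory numbers: {box_numbers}")           # 32
```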
Based on the above analysis, the paper proposes using the structured representation of a "trajectory" as the optimization objective to establish the dynamic relationship between past and future. We believe trajectories have the following advantages as an optimization target:
(1) As a highly structured representation, a trajectory condenses information strongly. It helps the model extract the key dynamic information of a subject across a continuous action, reducing how much redundant visual information must be learned and lowering computational cost.
(2) A trajectory naturally links the past with the future. By learning to predict a subject's trajectory, a multimodal large language model must learn to attend accurately to that subject's position in each frame, which greatly strengthens the model's multi-image, multi-identity (ID) alignment ability.
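Concretely, one natural way to turn a trajectory into something a language model can read and emit is to serialize the per-frame box coordinates as plain text, quantized to a fixed integer range, a convention used by several grounding-oriented multimodal LLMs. The sketch below follows that convention; the bin count and the delimiter format are assumptions, not Merlin's actual tokenization.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

def serialize_trajectory(boxes: List[Box], w: int, h: int, bins: int = 1000) -> str:
    """Serialize one subject's per-frame boxes into text an LLM can emit.

    Coordinates are quantized to integer bins in [0, bins-1]; the range
    and the delimiters used here are illustrative assumptions.
    """
    parts = []
    for x1, y1, x2, y2 in boxes:
        q = [round(v / s * (bins - 1)) for v, s in
             ((x1, w), (y1, h), (x2, w), (y2, h))]
        parts.append("[{},{},{},{}]".format(*q))
    return ";".join(parts)

# One box per frame for the same subject across a 4-frame clip:
traj = [(120, 80, 260, 400), (130, 82, 272, 404),
        (145, 85, 290, 410), (163, 88, 310, 415)]
print(serialize_trajectory(traj, w=640, h=480))
# -> "[187,166,406,832];[203,171,425,841];..." (one quantized box per frame)
```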
Building on these advantages, we design a new learning framework centered on extracting, understanding, and predicting subjects' motion trajectories from multimodal inputs (images, video, and text). The framework is as follows:
Inspired by the current mainstream LLM learning paradigm, we likewise construct a two-stage learning paradigm: Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT).
In FPT, we first feed the model visual context tokens covering several frames, along with an initial observation of the relevant subject in the first frame (its initial position, an appearance description, or an action description); the model is then required to predict the subject's entire trajectory from that initial observation.
By learning to predict the whole trajectory, the model must learn to attend correctly to the corresponding subject across multiple images and to capture its dynamic information.
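As a rough illustration of what such an FPT pair might look like, here is a minimal sketch that assembles a multi-frame prompt plus an initial observation and pairs it with the full serialized trajectory as the target. The placeholder image tokens and the prompt wording are assumptions, not Merlin's actual templates.

```python
# Sketch of one Foresight Pre-Training (FPT) sample, assembled per the
# description above: multi-frame visual context + an initial observation
# of the subject in frame 1, with the full trajectory as the target.
# The placeholder tokens and prompt wording are assumptions.

def build_fpt_sample(num_frames, initial_obs, trajectory_text):
    visual_ctx = "".join(f"<image_{i}>" for i in range(num_frames))
    prompt = (
        f"{visual_ctx}\n"
        f"In frame 1, a subject is observed: {initial_obs}. "
        f"Predict this subject's trajectory across all {num_frames} frames."
    )
    target = trajectory_text  # e.g. the serialized boxes from the sketch above
    return {"prompt": prompt, "target": target}

sample = build_fpt_sample(
    num_frames=4,
    initial_obs="a player in a white jersey at [187,166,406,832]",
    trajectory_text="[187,166,406,832];[203,171,425,841];"
                    "[226,177,453,854];[254,183,484,864]",
)
print(sample["prompt"])
print(sample["target"])
```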
In FIT, on the other hand, user prompts are added so the model can hold conversations about the relevant subjects.
Notably, to stimulate the model's foresight at this stage, we also design a trajectory-centric form of instruction interaction, which we call Trajectory Chain-of-Thought (T-CoT).
Specifically, when conversing with the model, we require it to output the trajectories of the subjects it mentions (as shown in the figure above).
By outputting the whole trajectory, the model is forced to attend to the corresponding subjects across the images, supplying sufficient dynamic information for subsequent reasoning about future events. For more details of the method, please read the paper.
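To make T-CoT concrete, here is an invented example of what such an exchange might look like, with the trajectory emitted before the final prediction; the wording and coordinates are illustrative, not drawn from Merlin's training data.

```python
# Illustrative Trajectory Chain-of-Thought (T-CoT) exchange.  The model is
# required to emit the mentioned subject's trajectory before its answer,
# so the final prediction is grounded in explicit dynamic evidence.
# Wording and coordinates are invented for illustration.

t_cot_example = {
    "user": "<image_0><image_1><image_2><image_3> "
            "What is the player with the ball likely to do next?",
    "assistant": (
        "The player with the ball appears at "
        "[187,166,406,832];[203,171,425,841];"
        "[226,177,453,854];[254,183,484,864]. "
        "The trajectory moves steadily toward the basket while the "
        "defender falls behind, so the player is likely to drive in "
        "for a layup or dunk."
    ),
}
```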
Data construction
With the learning paradigm designed, it is just as important to build appropriate data for the model to learn from. We carefully constructed a complete multi-task training set from publicly available open-source data. The data distribution is as follows:
It mainly includes Caption, Referring, Detection, Tracking, Reasoning, and Dialogue data; * marks data used only in the instruction fine-tuning stage (FIT).
Notably, Merlin is the first to construct FPT data from tracking data, endowing the model with trajectory perception and prediction abilities.
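Tracking datasets already store exactly the supervision FPT needs: per-frame boxes tied to persistent identities. As a sketch of how such annotations could be harvested into trajectories, the snippet below parses MOTChallenge-style ground-truth lines (`frame,id,x,y,w,h,...`); treating this particular format as Merlin's actual data source is an assumption.

```python
from collections import defaultdict

def trajectories_from_mot(gt_lines):
    """Group MOTChallenge-style ground truth (frame,id,x,y,w,h,...) into
    per-identity trajectories usable as FPT targets."""
    tracks = defaultdict(list)  # id -> [(frame, x1, y1, x2, y2), ...]
    for line in gt_lines:
        frame, tid, x, y, w, h = map(float, line.split(",")[:6])
        tracks[int(tid)].append((int(frame), x, y, x + w, y + h))
    for tid in tracks:
        tracks[tid].sort()  # chronological order within each identity
    return tracks

gt = [
    "1,7,120,80,140,320,1,1,1.0",
    "2,7,130,82,142,322,1,1,1.0",
    "1,9,400,60,90,260,1,1,1.0",
]
for tid, boxes in trajectories_from_mot(gt).items():
    print(tid, boxes)
```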
We also propose a technique of precisely defining the task and output format in the prompt (Precise Definition of Task Prompt and Answer Format):
By telling the large model the specific task and the expected output format, conflicts between multi-task learning objectives, and the damage they can do to general multimodal ability, can be avoided.
Our follow-up experiments show that with this technique, a large model can acquire task-specific multi-task capabilities while retaining its general multimodal capabilities.
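A minimal sketch of what such precise task-and-format prompting could look like in practice: each prompt names the task explicitly and pins down the expected answer format so the training signals of different tasks do not collide. The template wording is an assumption, not Merlin's actual prompt set.

```python
# Sketch of "precise definition of task prompt and answer format":
# each prompt names the task explicitly and pins down the output format,
# so multi-task training signals do not interfere.  Templates are assumed.

TASK_TEMPLATES = {
    "detection": "[Task: detection] List every {category} as [x1,y1,x2,y2] "
                 "boxes separated by ';'.",
    "tracking":  "[Task: tracking] Given the subject at {init_box} in frame 1, "
                 "output one [x1,y1,x2,y2] box per frame separated by ';'.",
    "dialogue":  "[Task: dialogue] Answer in free-form natural language. "
                 "{question}",
}

def build_prompt(task, **kwargs):
    return TASK_TEMPLATES[task].format(**kwargs)

print(build_prompt("tracking", init_box="[187,166,406,832]"))
```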
Combining these two learning stages with the constructed high-quality data, we built Merlin, a new general-purpose multimodal large language model.
Merlin accepts a single image or a multi-frame image sequence as input and can complete a range of tasks, including detection, tracking, REC, REG, and so on.
At the same time, thanks to the proposed FPT and FIT, Merlin shows strong trajectory-based future reasoning ability. Here we present a few selected cases; for more test results, please read our paper and the forthcoming open demo.
Experimental analysis
To comprehensively evaluate Merlin's capabilities, we designed a series of performance comparisons and property-probing experiments. Here we share several of the most instructive ones; for more experimental details, please read our paper.
1. Future reasoning evaluation
Because the field currently lacks a mature benchmark for evaluating the future reasoning ability of multimodal large language models, this work builds a new Future Reasoning Benchmark based on MMBench.
On this benchmark, Merlin significantly surpasses the existing mainstream multimodal large models, showing strong future reasoning ability.
2. Trajectory association and prediction evaluation
Since predicting the trajectories of relevant subjects from initial observations is a core learning objective in Merlin's pre-training, we evaluate how well this was learned on the downstream task of tracking.
Trajectory association is a core sub-task of tracking, so tracking metrics reflect, to some extent, a large model's multi-image, multi-ID alignment ability.
The results show that Merlin, a general multimodal large language model, even surpasses some expert models on tracking tasks. It is also worth noting that this is the first time a multimodal large language model has been able to perform tracking-related tasks.
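Full tracking metrics such as MOTA or IDF1 involve global matching across identities; as a deliberately simplified stand-in, the sketch below just scores per-frame IoU between one predicted trajectory and its ground truth, to show the kind of signal such an evaluation provides.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def trajectory_success(pred, gt, thr=0.5):
    """Fraction of frames where the predicted box stays on the subject."""
    hits = sum(iou(p, g) >= thr for p, g in zip(pred, gt))
    return hits / len(gt)

pred = [(120, 80, 260, 400), (133, 84, 275, 406), (150, 90, 300, 415)]
gt   = [(120, 80, 260, 400), (130, 82, 272, 404), (145, 85, 290, 410)]
print(f"success rate: {trajectory_success(pred, gt):.2f}")
```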
3. Hallucination evaluation
Hallucination is an important research topic for large models. Because multimodal large language models introduce the visual modality, bias caused by imprecise alignment between subject descriptions and the corresponding visual information brings even more severe hallucinations.
We evaluate Merlin's hallucination on POPE to assess the model's image-text alignment ability, as shown in the table below:
Merlin shows strong resistance to hallucination, significantly ahead of the current mainstream multimodal large language models. This indicates that the proposed foresight training paradigm strengthens the model's ability to recognize image content, reducing both misrecognition of what is in the picture and inconsistency between pictures and text.
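POPE probes object hallucination with yes/no existence questions of the form "Is there a <object> in the image?". A minimal sketch of scoring such answers with the accuracy/precision/recall/F1 metrics POPE reports:

```python
def pope_scores(answers, labels):
    """Score yes/no object-existence answers the way POPE reports them.

    answers/labels: lists of 'yes'/'no' strings; 'yes' is the positive class.
    """
    tp = sum(a == "yes" and l == "yes" for a, l in zip(answers, labels))
    fp = sum(a == "yes" and l == "no" for a, l in zip(answers, labels))
    fn = sum(a == "no" and l == "yes" for a, l in zip(answers, labels))
    tn = sum(a == "no" and l == "no" for a, l in zip(answers, labels))
    acc = (tp + tn) / len(labels)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

# A hallucinating model says "yes" to objects that are not in the image:
answers = ["yes", "yes", "no", "yes", "no", "no"]
labels  = ["yes", "no",  "no", "yes", "yes", "no"]
print(pope_scores(answers, labels))
```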
4. Comprehensive multimodal performance evaluation
We also evaluate Merlin on the mainstream benchmarks for comprehensive multimodal capability (MMBench and MM-Vet) and for visual question answering (GQA and VizWiz).
The results show that Merlin achieves very competitive scores, demonstrating strong general-purpose ability.
5. Visualization analysis
To show more intuitively how Merlin captures dynamic cues, we also ran an interesting visualization experiment. For a given dialogue exchange, we visualize the attention between the word embeddings of the trajectory coordinates output by the model and the visual tokens of the multi-frame images, as shown in the figure below:
The word embeddings of the predicted coordinates attend accurately to the corresponding target subject in the corresponding frame.
This visualization further confirms that the "trajectory" is an excellent intermediate representation for helping a multimodal large language model establish the dynamic relationship between a language description and the corresponding subjects across multi-frame images.
From another angle, it also explains why Merlin has such strong comprehensive multimodal ability and resistance to hallucination.
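For readers who want to attempt a similar inspection, here is a sketch of the general recipe with any transformer that exposes attention weights: take the output positions of the coordinate tokens, average their attention over heads, and reshape the slice covering each frame's visual tokens back into a 2-D patch grid. The tensor shapes and the 24x24 grid are assumptions, not Merlin's internals.

```python
import numpy as np

def coord_attention_maps(attn, coord_positions, frame_spans, grid=(24, 24)):
    """Average attention from trajectory-coordinate tokens onto each frame.

    attn:            (heads, q_len, kv_len) attention weights from one layer
    coord_positions: indices of the output tokens that carry box coordinates
    frame_spans:     [(start, end), ...] slice of kv positions per frame
    grid:            assumed patch grid of each frame (e.g. 24x24 tokens)
    """
    # Mean over heads, then over the coordinate tokens of interest.
    q = attn.mean(axis=0)[coord_positions].mean(axis=0)  # (kv_len,)
    maps = []
    for start, end in frame_spans:
        maps.append(q[start:end].reshape(grid))  # one heat map per frame
    return maps

# Toy example: 8 heads, 32 query tokens, 2 frames x 576 visual tokens.
rng = np.random.default_rng(0)
attn = rng.random((8, 32, 2 * 576))
attn /= attn.sum(axis=-1, keepdims=True)           # rows sum to 1
maps = coord_attention_maps(attn, coord_positions=[20, 21, 22, 23],
                            frame_spans=[(0, 576), (576, 1152)])
print([m.shape for m in maps])                     # [(24, 24), (24, 24)]
```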
Reflections and summary
Merlin demonstrates the important role the structured "trajectory" representation can play in giving multimodal large language models foresight.
From here, we can think further about what roles bounding boxes and trajectories should play in the learning of multimodal large language models.
Should they serve as an intermediate representation, or as a standalone optimization objective?
And is the existing coordinate encoding reasonable, or is there a representation better suited to natural language?
There is no standard answer to these questions yet; they call for further exploration by researchers. Finally, we hope Merlin brings some new thinking and understanding to the multimodal large model community, and we welcome you to follow our work and keep the conversation going.
Paper:
https://arxiv.org/pdf/2312.00589.pdf