The Tiangong large model reaches the top of the multimodal leaderboard! It solves the two hard problems of hallucination and cross-language ability


A domestic large model has climbed to the top of a multimodal leaderboard!

Recently, Kunlun Wanwei has been making quite a splash in the large-model circle.

Just a few days ago, it was revealed that AI heavyweight Yan Shuicheng had been recruited to serve as co-CEO of Tiangong Intelligence.

Now, its "Tiangong" large model Skywork-MM is at the top of the multimodal list, ranking first in the multimodal large language model (Multimodal Large Language Model, referred to as "MLLM") test conducted by Tencent YouTu Lab and Xiamen University.

△ Skywork-MM ranks first in MME perception, second in cognition, and first on the overall list.

Tencent YouTu Lab and Xiamen University conducted the first comprehensive quantitative evaluation of global MLLMs on the newly built benchmark MME and published 16 rankings: two overall lists for perception and cognition, plus 14 sub-lists.

The MME dataset is a recently released benchmark for multimodal language model evaluation.

It comprehensively evaluates a multimodal large language model by its performance on 14 subtasks covering perception and cognition.

Skywork-MM, meanwhile, took first place with less than 50M image-text samples, far less than other large models (> 100M); the leaderboard link is at the end of this article.

How did it do that?

Skywork-MM mainly tackles two problems that have plagued existing multimodal large models: hallucination and weak cross-language ability.

The so-called multimodal hallucination problem means that multimodal large models tend to answer questions in the affirmative even when the image we provide contains no relevant features.

For example, consider the following image.

If you ask it, "what color is this man's hair?" Even excellent multimodal models such as LLaVA and MiniGPT-4 will "lie with their eyes open": black.

Another picture shows a glass, a teacup, and a bathtub with a small goldfish in it.

If you ask it, "are all the objects in the picture yellow?" No one can answer correctly.

As for weak cross-language ability, it mainly shows up as unsatisfactory answers to questions about Chinese scenes.

For example, when asked whether the picture below shows the Colorado Grand Canyon or a Suzhou garden, three bilingual multimodal language models (LLaVA, LLaVA-Chinese, and ImageBind-LLM) all answer the former.

Asking them where one could enjoy such scenery trips them up even further.

Sometimes the models even reply entirely in English.

These two problems seriously affect the performance of existing multimodal large models.

How were these two problems solved?

Kunlun Wanwei's Tiangong large model Skywork-MM approaches them from three aspects: data, model, and training process.

The focus is on data and models.

Let's look at the data first.

First, the hallucination problem.

In essence, the root cause is that the data used throughout model training focuses too heavily on positive samples.

In other words, the model learns to describe what is in a picture, but never learns what is not in the picture.

When weakly related image-text data is encountered during training, the model makes looser associations and the hallucinations become even more severe.

For this reason, the Tiangong large model's multimodal team proposes image-centric multimodal instruction fine-tuning data that feeds the model both positive and negative samples.

This enables the model to learn not only the visual features that exist in an image, but also the features that do not exist.

In this way, the model's instruction-following ability is strengthened: it answers what is asked, and it does not make up what is not there.
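To make this concrete, here is a minimal sketch of what image-centric instruction data with both positive and negative samples might look like; the field names and the example question-answer pairs are illustrative assumptions, not Skywork-MM's actual data format.

# Hypothetical image-centric instruction fine-tuning samples (illustrative only).
samples = [
    {   # positive sample: the asked-about feature exists in the image
        "image": "bathtub_goldfish.jpg",
        "instruction": "What animal is in the bathtub?",
        "answer": "A small goldfish.",
    },
    {   # negative sample: the asked-about feature does not exist,
        # teaching the model to deny rather than hallucinate
        "image": "bathtub_goldfish.jpg",
        "instruction": "What color is the cat in the picture?",
        "answer": "There is no cat in the picture.",
    },
]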

Second, two approaches address the Chinese side of the cross-language problem:

(1) Enhance the model's ability to follow instructions in Chinese.

Since the "cultural gap of fine-tuning instructions is very small", you only need to translate the English instruction fine-tuning data constructed in solving the hallucination problem into Chinese.

(2) Enhance recognition of China-related scenes.

Note that in solving the cross-language problem, the focus is on cultural bias:

That is to say, general visual features and language features can be associated through a common corpus, but the relationship between the visual and language features specific to each language and culture requires substantial dedicated learning.

So large-scale Chinese image-text pair data needs to be added.
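As a rough illustration, the two steps above could be combined along these lines; the translate() helper and the field names are assumptions made for this sketch, not part of Skywork-MM's published pipeline.

from typing import Callable

def build_bilingual_corpus(
    en_instruction_data: list[dict],
    zh_image_text_pairs: list[dict],
    translate: Callable[[str], str],  # any machine-translation function
) -> list[dict]:
    # Step (1): translate the English instruction fine-tuning data into Chinese.
    zh_instruction_data = [
        {
            "image": s["image"],
            "instruction": translate(s["instruction"]),
            "answer": translate(s["answer"]),
        }
        for s in en_instruction_data
    ]
    # Step (2): add Chinese image-text pairs, which carry the culture-specific
    # visual concepts (e.g. Suzhou gardens) that a general corpus lacks.
    return en_instruction_data + zh_instruction_data + zh_image_text_pairs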

However, such a Chinese corpus is not easy to collect: it is limited first by data quality, and then by quantity.

What to do, then?

This is where Skywork-MM's improvements to the model architecture come in.

To keep low-quality image-text data from degrading the model, the multimodal team chose to freeze both the visual model and the large language model entirely.

The aim is to preserve the visual features the visual model learned during CLIP pre-training and to avoid losing the large language model's language ability.

At the same time, to better associate visual features with language features across different cultural contexts, the overall model adds a learnable visual feature resampler and a LoRA adapter for the language model.

As shown in the following figure, Skywork-MM consists of four modules:

Given an image, the LVM first extracts image features, which are then fed into the resampler to compute tokens that the LLM can take as input.

The LLM receives those tokens along with the instruction prompt, if any, and then outputs an image description or an answer to the question.
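The layout can be sketched roughly as follows in PyTorch. The module shapes, the query-based resampler, and the way the LoRA adapter is attached are simplified assumptions for illustration, not the actual Skywork-MM implementation.

import torch
import torch.nn as nn

class SkyworkMMSketch(nn.Module):
    def __init__(self, lvm: nn.Module, llm: nn.Module,
                 d_vis: int, d_llm: int, n_query: int = 64, lora_rank: int = 8):
        super().__init__()
        self.lvm = lvm  # frozen visual model (e.g. a CLIP-pretrained image encoder)
        self.llm = llm  # frozen large language model
        for p in self.lvm.parameters():
            p.requires_grad = False  # keep CLIP-pretrained visual features intact
        for p in self.llm.parameters():
            p.requires_grad = False  # keep the LLM's language ability intact
        # Trainable modules: query-based visual feature resampler + LoRA adapter.
        self.query = nn.Parameter(torch.randn(n_query, d_llm))
        self.resampler = nn.MultiheadAttention(
            d_llm, num_heads=8, kdim=d_vis, vdim=d_vis, batch_first=True)
        self.lora_a = nn.Linear(d_llm, lora_rank, bias=False)  # LoRA down-projection
        self.lora_b = nn.Linear(lora_rank, d_llm, bias=False)  # LoRA up-projection

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vis_feats = self.lvm(image)  # assumed shape: (B, n_patches, d_vis)
        q = self.query.unsqueeze(0).expand(image.size(0), -1, -1)
        vis_tokens, _ = self.resampler(q, vis_feats, vis_feats)  # (B, n_query, d_llm)
        # Simplified stand-in for the LoRA adapter: in practice LoRA sits inside
        # the LLM's layers; here it is applied to the text embeddings.
        text_embeds = text_embeds + self.lora_b(self.lora_a(text_embeds))
        llm_inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(llm_inputs)  # image description or answer logits

Only the query, resampler, and the two LoRA projections carry gradients here, which mirrors the article's point that the visual model and the language model stay frozen.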

As for the training process, it is mainly divided into two stages:

In the first stage, large-scale bilingual image-text data is used to learn the correspondence between image concepts and language concepts.

In the second stage, instruction fine-tuning is carried out using the multimodal fine-tuning data.

Here, the various types of instruction fine-tuning data (including positive and negative samples) are cast into a unified chat-prompt format.
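For instance, a unified chat prompt might look something like the sketch below; the role tags, the <image> placeholder, and the wording are assumptions for illustration rather than the template Skywork-MM actually uses.

# Hypothetical unified chat-prompt template (illustrative only).
CHAT_PROMPT = "USER: <image>\n{instruction}\nASSISTANT: {answer}"

print(CHAT_PROMPT.format(
    instruction="What color is the cat in the picture?",
    answer="There is no cat in the picture.",  # a negative sample in the same format
))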

P.S. In the figure above, the resampler and the LoRA adapter are marked with a flame icon, indicating that they are trainable.

The MME overall leaderboard is shown in the table below. Skywork-MM uses only about 50M image-text samples in total, far less than current large models of the same kind.

Yet after the improvements to data, model, and training process described above, Skywork-MM's results are outstanding.

As shown in the following figure:

It can accurately understand the abnormal behavior in the picture.

It can also follow some special instructions (answering by choosing among options, writing a poem about the scenery, writing ad copy, writing an acceptance speech, and so on).

And when it comes to questions about Chinese scenes, it no longer behaves like a "foreigner".

It can be said to have solid instruction-following and Chinese-scene question-answering abilities.

So for the hallucination and cross-language problems shown at the start, it handles them with ease:

Since Meng Fei has no hair, it does not answer "black"; the Suzhou garden and If You Are the One are recognized at a glance; and none of the three objects is judged to be yellow.

As shown at the beginning, in head-to-head tests against other models, Skywork-MM ranked first on the MME list: No. 1 in perception (43 points ahead of No. 2) and No. 2 in cognition.

The list was launched around June this year and has about 4k stars on GitHub, making it one of the newest benchmarks for multimodal large models.

It contains 14 subtasks in total. The perception tasks cover not only OCR but also coarse-grained and fine-grained object recognition: the former identifies the existence, count, position, and color of objects, while the latter identifies movie posters, celebrities, scenes, landmarks, and works of art.

The cognition tasks include commonsense reasoning, numerical calculation, text translation, and code reasoning.

The following table shows Skywork-MM's specific scores on OCR and coarse-grained recognition within the perception tasks:

Fine-grained recognition score:

And cognitive task scores:

As you can see, only MiniGPT-4 and the BLIP series occasionally come close to matching Skywork-MM.

Beyond the MME list, Skywork-MM also does well on the dev set of another multimodal benchmark, MMBench:

Room for improvement

It should be noted that although this latest result from Kunlun Wanwei's Tiangong large model represents the current state of the art for multimodal large models, there is still plenty of room for improvement.

For example:

Cultural and language barriers still exist. A multilingual LVM is needed to extract visual features unique to different cultures more effectively, or more large-scale, high-quality image-text pairs in various languages must be collected, so that the model accurately grasps the relationship between visual concepts and textual concepts.

In addition, the current results are based only on a relatively small scale (13B). Building a larger multimodal model may require further exploration of data usage, parameter settings, training strategies, and so on.

The evaluation benchmarks could also be more comprehensive; the current coverage of MME and MMBench is limited.

And judging from the coarse-grained perception rankings above, all existing multimodal large models are weak at accurately identifying the position of objects in an image (an ability of great significance for robot perception):

The highest model score is 33.33, a long way from the full score of 100.

This defect can also be seen in the following figure:

There is no doubt that the future of artificial intelligence must be multimodal.

All these problems show that we are just beginning to explore its true potential.

Still, we believe that as the rankings keep changing hands, the "ChatGPT moment" for multimodal large models will eventually arrive.

Paper address:

https://github.com/will-singularity/Skywork-MM/blob/main/skywork_mm.pdf

List address:

https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation
