

Giving large language models eyes: a new method from Stanford and others describes images better than CLIP, with no multimodal pre-training


A large language model can now "read" images, without any multimodal training data.

No more talk; just look at the results.

Take a photo of the Great Wall that had previously been used to test BLIP-2: the model not only identifies the Great Wall, but also offers a couple of sentences about its history:

Then there is an oddly shaped house: the model accurately identifies what is unusual about it and even knows how to get in and out:

Even when the word "Red" is deliberately rendered in purple and "Green" in red, the model is not thrown off:

These are results from LENS 🔍 (Language-Enhanced Neural System), a new modular framework recently proposed by researchers.

Importantly, no additional pre-training on multimodal datasets is needed; an off-the-shelf large language model is enough to handle object recognition and visual reasoning tasks.

It saves both money and effort!

The researchers said:

In the zero-shot setting, this out-of-the-box method is comparable to multimodal large models such as Kosmos and end-to-end jointly pre-trained models such as Flamingo, and it can even perform better.

Netizens could not stay calm after seeing this:

This is exciting, folks! The resources spent on training large models can now also be put to work on problems in other domains. 😃

Another netizen said:

It will be interesting to see which module improves visual reasoning the most.

How is it done?

Although existing LLMs excel at natural language understanding and reasoning, they cannot directly handle tasks that require reasoning over visual input.

This work, a collaboration between researchers at Contextual AI and Stanford University, keeps the LLM frozen (no further training or fine-tuning) and feeds it text extracted by "visual modules", enabling it to perform object recognition and vision-and-language tasks.

Put simply, when you ask a question about an image, the method first runs three independent "visual modules" to extract textual information about it: a Tag Module (extracts tags), an Attribute Module (extracts attributes), and an Intensive Captioning Module (generates detailed image descriptions).

This information is then fed directly into the Reasoning Module, i.e. the frozen LLM, which answers the question.

In this way, integrating LENS yields a cross-domain model that can be applied out of the box without additional pre-training, while taking full advantage of the latest progress in both computer vision and natural language processing.
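To make the flow concrete, here is a minimal Python sketch of the pipeline. The module functions and the frozen-LLM call are hypothetical placeholders standing in for the three visual modules and the reasoning module, not the authors' actual implementation:

```python
# Illustrative sketch of the LENS flow; every callable passed in here is a
# hypothetical placeholder, not the authors' code.

def lens_answer(image, question, tag_module, attribute_module,
                captioning_module, frozen_llm):
    """Answer a question about an image using only text from visual modules."""
    tags = tag_module(image)              # e.g. ["great wall", "mountains", ...]
    attributes = attribute_module(image)  # e.g. ["made of stone", "ancient", ...]
    captions = captioning_module(image)   # e.g. ["a long wall over green hills"]

    # Everything the frozen LLM sees is plain text -- no image embeddings.
    prompt = (
        f"Tags: {', '.join(tags)}\n"
        f"Attributes: {', '.join(attributes)}\n"
        f"Captions: {' '.join(captions)}\n"
        f"Question: {question}\nShort Answer:"
    )
    return frozen_llm(prompt)
```

Because the LLM only ever sees text, any off-the-shelf language model can serve as the reasoning module.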

Several approaches to solving visual tasks with LLMs had been proposed before.

One is to train a visual encoder and represent each image as a sequence of continuous embeddings that the LLM can understand.

Another is to use a contrastively pre-trained, frozen visual encoder while inserting new layers into the frozen LLM and training those layers from scratch.

A third is to use both a frozen, contrastively pre-trained visual encoder and a frozen LLM, and train a lightweight Transformer to align the two.

A visual encoder is a model or component that converts visual input, such as an image or video, into a representation vector: it maps high-dimensional visual data into a low-dimensional representation that language models can understand and process.
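For a concrete sense of what a visual encoder does, here is a short sketch using the openly available CLIP model from the Hugging Face transformers library; the checkpoint and image path are illustrative choices, not ones specified in the article:

```python
# Sketch: encoding an image into a fixed-size vector with a CLIP visual encoder.
# Requires: pip install transformers pillow torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("great_wall.jpg")  # example image path
inputs = processor(images=image, return_tensors="pt")
image_embedding = model.get_image_features(**inputs)

print(image_embedding.shape)  # a low-dimensional vector, e.g. [1, 512]
```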

Clearly, all three approaches require pre-training on multimodal datasets.

△ Comparison of vision-language alignment methods: (a) shows the three approaches above, (b) is the LENS method; 🔥 means trained from scratch, ❄️ means pre-trained and frozen.

LENS, by contrast, provides a unified framework in which the LLM "reasoning module" operates on the text extracted by the "visual modules".

Among the three "visual modules", the tag module relies on a diverse, comprehensive tag vocabulary assembled by the researchers, drawn from multiple image classification, object detection, and semantic segmentation datasets, plus the Visual Genome dataset. To accurately recognize and assign tags to an image, a CLIP visual encoder is used.

The general prompt template for this module is:

"A photo of {classname}"

In the attribute module, GPT-3 is used to generate descriptions of the visual features that distinguish each object category in the vocabulary, and a contrastively pre-trained CLIP visual encoder then identifies and assigns the relevant attributes to the objects in the image.

In the intensive captioning module, the researchers use BLIP's image captioning model with random top-k sampling to generate N descriptions per image. These diverse descriptions are passed to the "reasoning module" without any modification.
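A rough sketch of this captioning step, assuming a publicly available BLIP captioning checkpoint from Hugging Face; the checkpoint name and sampling parameters are illustrative, not the paper's exact settings:

```python
# Sketch: generating N diverse captions for one image with BLIP and top-k sampling.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("great_wall.jpg")  # example image
inputs = processor(images=image, return_tensors="pt")

# Random top-k sampling yields varied captions instead of one greedy sentence.
N = 5
outputs = model.generate(**inputs, do_sample=True, top_k=50,
                         num_return_sequences=N, max_new_tokens=30)
captions = [processor.decode(o, skip_special_tokens=True) for o in outputs]
print(captions)
```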

In the final reasoning module, LENS can be paired with any LLM, combining the extracted information in the following format:

Tags: {Top-k tags}

Attributes: {Top-K attributes}

Captions: {Top-N Captions}.

OCR: this is an image with written "{meme text}" on it.

Question: {task-specific prompt}\nShort Answer:

It is worth mentioning that memes (images containing written text) are also taken into account, which is why the researchers added the OCR prompt.
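For illustration, here is a sketch of assembling this prompt and querying a frozen Flan-T5 model; a small checkpoint is used for brevity, and the tag, attribute, caption, and question values are made-up examples:

```python
# Sketch: feeding the extracted text to a frozen Flan-T5 "reasoning module".
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")  # small checkpoint for the sketch
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Toy outputs from the three visual modules (plus OCR when the image contains text).
tags = ["great wall", "mountain", "fortress"]
attributes = ["made of stone", "ancient", "winding"]
captions = ["a long stone wall snaking over green mountains"]
meme_text = ""  # only filled in for images with written text, e.g. memes

prompt = (
    f"Tags: {', '.join(tags)}\n"
    f"Attributes: {', '.join(attributes)}\n"
    f"Captions: {' '.join(captions)}\n"
    + (f'OCR: this is an image with written "{meme_text}" on it.\n' if meme_text else "")
    + "Question: What landmark is shown in this picture?\nShort Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```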

Performance better than CLIP

To demonstrate the performance of LENS, the researchers ran experiments on 8 NVIDIA A100 (40GB) GPUs, with Flan-T5 as the default frozen LLM.

For vision tasks, the researchers evaluated LENS on eight benchmarks and compared it with state-of-the-art object recognition models under zero-shot and few-shot settings.

△ Zero-shot results of LENS on object recognition tasks

As the table above shows, in the zero-shot setting the LENS configuration using ViT-H/14 as the visual backbone and Flan-T5xxl as the frozen LLM scores 0.7% higher than CLIP on average. Other LENS combinations also outperform CLIP in most cases.

Interestingly, in the object recognition task the researchers found:

There appears to be no direct relationship between the size of the frozen LLM and classification performance, whereas the size of the tag-generation architecture (the ViT backbone) does correlate with performance.

△ Average few-shot performance of LENS on vision tasks

As shown in the figure above, the researchers also plotted the average visual performance across all datasets except ImageNet and observed:

More samples help improve performance. Meanwhile, there is still no direct relationship between the choice of frozen LLM and visual performance, but a better visual backbone does improve average visual performance.

For vision-and-language tasks, the researchers evaluated four representative visual question answering benchmarks and compared LENS with state-of-the-art models that require additional pre-training to align the visual and language modalities.

In the zero-shot setting, on VQAv2, OK-VQA, Rendered-SST, and Hateful Memes, LENS remains competitive with methods that rely on large amounts of data for alignment pre-training, even against larger and more complex systems such as Flamingo, BLIP-2, and Kosmos.

Although LENS performs well in most cases, there are some failure cases:

The researchers believe that:

The visual capability of LENS depends heavily on its underlying visual components. There is room to further improve these models' performance, and their strengths need to be combined with those of LLMs.

Links:

[1] https://huggingface.co/papers/2306.16410 (paper)

[2] https://github.com/ContextualAI/lens (open-source code)

This article comes from the WeChat official account Quantum Bit (ID: QbitAI); author: Xifeng.
