


Original title: Beyond GPT-4! Chinese team's InstructBLIP goes viral with image-grounded chat, and the open-source project sweeps multiple SOTAs

A multimodal generation model that beats GPT-4 at understanding images has arrived. The latest InstructBLIP, proposed by a Chinese team, achieves SOTA on multiple tasks.

GPT-4's image capabilities were surpassed before they were even publicly released.

Recently, the Chinese team open-sourced the multimodal foundation model InstructBLIP, which is fine-tuned from the BLIP-2 model.

A new member has been added to the BLIP family: InstructBLIP. According to reports, the InstructBLIP model is better at "seeing", "reasoning" and "speaking": it can understand, reason about, and describe complex images, and it supports multi-round dialogue.

For example, what might have happened in the scene below?

InstructBLIP reasons that it could have been caused by a hurricane or severe weather.

Tell me about this painting.

It can also hold multi-round dialogue.

The researchers say that building on the powerful BLIP-2 is what makes InstructBLIP "see" better.

Most importantly, InstructBLIP achieves state-of-the-art performance on multiple tasks, even outperforming GPT-4 at image interpretation and reasoning.

Why is it so strong?

The main contribution of InstructBLIP is to address the challenges of vision-language instruction fine-tuning and to systematically study how it improves the model's generalization to data and tasks it has not seen before.

Paper address: https://arxiv.org/abs/2305.06500

In the paper, the researchers first introduce the construction of the instruction fine-tuning data, and then the specific training process.

After that, two techniques for improving instruction-tuning performance are described, from the model perspective and the data perspective respectively.

To preserve the diversity of the instruction fine-tuning data while keeping it accessible, the researchers collected a large number of publicly available vision-language datasets and converted them into an instruction fine-tuning format.

As shown in the figure below, the data the researchers finally collected covers 11 task categories and 26 datasets.

These include image captioning, image captioning with reading comprehension, visual reasoning, image question answering, knowledge-grounded image question answering, image question answering with reading comprehension, image question generation (adapted from the QA datasets), video question answering, visual conversational question answering, image classification, and LLaVA-Instruct-150K.

For each task, the researchers wrote 10-15 different natural-language instruction templates. These templates are the basis for constructing instruction fine-tuning data that clearly states the task and its goal.

For public datasets that inherently favor short responses, the researchers included words such as "short" and "briefly" in some of the corresponding instruction templates, to reduce the risk of the model overfitting to always generating short responses.
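
To make this concrete, here is a minimal sketch of how a raw VQA sample might be converted into the instruction fine-tuning format by sampling one of several natural-language templates; the template strings and field names below are illustrative assumptions, not the paper's actual templates.

```python
import random

# Hypothetical instruction templates for an image question-answering task;
# the paper's real templates may be worded differently.
VQA_TEMPLATES = [
    "{question}",
    "Question: {question} Answer:",
    "Given the image, answer the following question: {question}",
    "Answer the question briefly: {question}",  # "briefly" nudges short-answer datasets
]

def to_instruction_format(sample: dict) -> dict:
    """Convert a raw VQA sample into an (instruction, output) training pair."""
    template = random.choice(VQA_TEMPLATES)
    return {
        "image": sample["image"],  # image path or tensor
        "text_input": template.format(question=sample["question"]),
        "text_output": sample["answer"],
    }

example = {"image": "coco/0001.jpg", "question": "What is the cat doing?", "answer": "sleeping"}
print(to_instruction_format(example))
```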

For the LLaVA-Instruct-150K dataset, the researchers did not add additional instruction templates, because it is already naturally structured in an instruction format.

Existing zero-shot image-to-text generation methods, including BLIP-2, extract visual features in a way that is agnostic to the instruction.

That is, the visual input fed to the LLM is instruction-unaware, which limits the model's flexibility across different tasks.

In contrast, an instruction-aware vision model can improve the model's ability to learn from different instructions.

For example, consider two cases: given the same image, the model is asked to complete two different tasks; and given two different images, the model is instructed to complete the same task.

In the first case, an instruction-aware vision model can extract different features from the same image depending on the instruction, providing more informative features for each task.

In the second case, an instruction-aware vision model can extract features from the two different images using the common knowledge embodied in the instruction, thereby achieving better information transfer between images.

InstructBLIP proposes an instruction-aware visual feature extraction method by fully exploiting the Q-Former architecture of the BLIP-2 model.

As shown above, Q-Former is designed to extract visual features from the output of a frozen image encoder.

According to the BLIP-2 paper, Q-Former is pretrained in two stages; through pretraining, it learns to extract text-aligned visual features that the LLM can digest.

At inference time, the instruction is appended after the visual prompt, guiding the LLM to perform different tasks as specified.

In InstructBLIP, the instruction text is given not only as input to the LLM but also to the Q-Former.
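
Conceptually, the mechanism can be sketched in PyTorch as below. This is a deliberately simplified illustration, not the actual LAVIS/InstructBLIP implementation: the learnable query tokens are concatenated with the embedded instruction so that self-attention conditions the queries on the instruction, while cross-attention pulls information from the frozen image features.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormer(nn.Module):
    """Simplified sketch of instruction-aware visual feature extraction.
    Module structure and names are illustrative, not the real Q-Former."""

    def __init__(self, hidden=768, num_queries=32, num_layers=6, num_heads=12):
        super().__init__()
        self.query_tokens = nn.Parameter(torch.randn(1, num_queries, hidden))
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=num_heads, batch_first=True)
        # Self-attention runs over (queries + instruction tokens); cross-attention
        # runs over the frozen image features passed in as "memory".
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj_to_llm = nn.Linear(hidden, hidden)  # projects into the LLM input space

    def forward(self, image_feats, instruction_embeds):
        # image_feats: (B, N_patches, hidden) from a frozen image encoder
        # instruction_embeds: (B, L_instr, hidden) embedded instruction text
        B = image_feats.size(0)
        queries = self.query_tokens.expand(B, -1, -1)
        # Concatenating the instruction with the queries lets the instruction
        # steer which visual information the queries extract.
        tgt = torch.cat([queries, instruction_embeds], dim=1)
        out = self.blocks(tgt=tgt, memory=image_feats)
        # Only the query positions are handed to the LLM as soft visual tokens.
        visual_tokens = out[:, : queries.size(1), :]
        return self.proj_to_llm(visual_tokens)
```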

Because there is a large number of training datasets and their sizes vary considerably, mixing them uniformly can cause the model to overfit smaller datasets and underfit larger ones.

To alleviate this problem, the researchers recommend sampling in proportion to dataset size (i.e., the number of training samples), with square-root smoothing. Given D datasets of sizes {S_1, S_2, ..., S_D}, the probability that a training sample is drawn from dataset d is:

p_d = √S_d / (√S_1 + √S_2 + ... + √S_D)
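
As a quick numerical sketch of this sampling rule (the dataset sizes below are illustrative placeholders, not the paper's exact statistics):

```python
import math

# Illustrative dataset sizes (number of training samples), not the paper's statistics.
dataset_sizes = {"caption_large": 566_747, "okvqa": 9_009, "a_okvqa_mc": 17_056}

sqrt_sizes = {name: math.sqrt(s) for name, s in dataset_sizes.items()}
total = sum(sqrt_sizes.values())
sampling_probs = {name: v / total for name, v in sqrt_sizes.items()}

for name, p in sampling_probs.items():
    print(f"{name}: p = {p:.3f}")
# Square-root smoothing shrinks the gap: the largest dataset here is ~63x bigger
# than okvqa but is sampled only ~8x more often.
```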

In addition to this weighting formula, the researchers manually fine-tuned the weights for certain data sets to improve their convergence.

This is necessary because the inherent differences in various datasets and tasks require different levels of training intensity, even if they are of similar size.

Specifically, the researchers decreased the weight of A-OKVQA (multiple choice) and increased the weight of OKVQA.

The researchers first evaluated InstructBLIP on 13 datasets and compared InstructBLIP to previous SOTA models BLIP-2 and Flamingo.

As shown in the table, InstructBLIP achieves new zero-shot SOTA results on all datasets.

It surpasses BLIP-2 with every LLM backbone, demonstrating the effectiveness of vision-language instruction fine-tuning.

In addition, instruction fine-tuning improves zero-shot generalization to unseen task categories such as video QA.

Although it was never trained on temporal video data, InstructBLIP achieves a 47.1% relative improvement over the previous SOTA on MSRVTT-QA.

Finally, the researchers evaluated the smallest model, InstructBLIP FlanT5XL (4B), on all six shared evaluation datasets; it outperforms Flamingo-80B with an average relative improvement of 24.8%.

To investigate the effects of instruction-aware visual feature extraction and the dataset balancing strategy, the researchers ran ablation studies in which each was removed in turn during instruction tuning.

Across all datasets, removing instruction awareness from the visual features significantly degrades performance. The degradation is more severe on datasets involving spatial visual reasoning (such as ScienceQA) or temporal visual reasoning (such as iVQA).

On these datasets, feeding the instruction to the Q-Former can guide it to attend to more informative image embeddings.

As for the data balancing strategy, removing it leads to unstable training, since different datasets reach their best performance at markedly different training steps, and this instability hurts overall performance.

Qualitative evaluation

In addition, the researchers conducted further qualitative studies of InstructBLIP using more diverse images and descriptions.

For example, they used an image from the GPT-4 technical report and asked: "What's wrong with this picture?"

Judging from the answers, InstructBLIP is more comprehensive than GPT-4, more visually grounded than LLaVA, and more logical than MiniGPT-4.

InstructBLIP answers the question of who painted the Mona Lisa very briefly.

Here, the researchers argue that long responses are not always desirable. InstructBLIP can directly address the user's intent by adaptively adjusting its response length.

Other models, by contrast, tend to generate long paragraphs containing less relevant sentences.

InstructBLIP achieves these advantages thanks to its diverse instruction-tuning data and effective architectural design.

The researchers also found that instruction tuning is the key to improving the model's zero-shot generalization ability.

Comparison of instruction tuning and multi-task training, based on BLIP-2 FlanT5XL

In addition, the researchers further fine-tuned the InstructBLIP model to study its performance when trained on specific downstream datasets.

Compared with most previous methods (e.g., Flamingo, BLIP-2), InstructBLIP keeps the same image resolution (224×224) during instruction tuning and keeps the visual encoder frozen during downstream fine-tuning.

This greatly reduces the number of trainable parameters from 1.2B to 188M, substantially improving fine-tuning efficiency.
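
For illustration only, here is a minimal PyTorch sketch of this setup with toy stand-in modules: the image encoder and the LLM are frozen, so only the Q-Former side counts toward the trainable parameters.

```python
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Freeze all parameters of a module so they receive no gradient updates."""
    for p in module.parameters():
        p.requires_grad = False

def count_trainable(model: nn.Module) -> int:
    """Count only the parameters that will actually be updated."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-ins for the real components (image encoder, Q-Former, LLM).
image_encoder = nn.Sequential(nn.Linear(1024, 768), nn.GELU(), nn.Linear(768, 768))
q_former = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=6
)
llm = nn.Sequential(nn.Linear(768, 4096), nn.GELU(), nn.Linear(4096, 32000))

freeze(image_encoder)  # kept frozen during both instruction tuning and fine-tuning
freeze(llm)            # frozen LLM backbone
# Only the Q-Former (plus any projection layers) remains trainable.
model = nn.ModuleList([image_encoder, q_former, llm])
print(f"trainable parameters: {count_trainable(model):,}")
```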

Wenliang Dai is a PhD student at the Hong Kong University of Science and Technology, supervised by Professor Pascale Fung. Before that, he received a Master's degree from University College London and a Bachelor's degree in Computer Science from the University of Nottingham.

Junnan Li is a research scientist at Salesforce Research Asia. He holds a Bachelor's degree in Electronic Engineering from the University of Hong Kong and a PhD in Computer Science from the National University of Singapore. His main research interests are computer vision and deep learning, unsupervised learning, weakly supervised learning, transfer learning, and social scene understanding.

Other SOTA models

Some users have noted that recently proposed multimodal models similar to InstructBLIP include MiniGPT-4 and LLaVA.

MiniGPT-4 can also chat about images: send it a picture of a seafood dinner, for example, and you can directly get the recipe.

Although MiniGPT-4 achieves such good results, its implementation is not complicated.

It integrates an image encoder with the open-source language model Vicuna, freezes most of the parameters of both, and therefore requires only a small amount of training.

In addition, the team had MiniGPT-4 work together with ChatGPT to create a high-quality dataset of 3,500 image-text pairs, which was also open-sourced.

There is also LLaVA, which is trained on a small multimodal instruction dataset but shows reasoning results very similar to the multimodal GPT-4 on some examples.

References:

https://twitter.com/LiJunnan0409/status/1656821806593101827

https://arxiv.org/abs/2305.06500

https://github.com/salesforce/LAVIS/tree/main/projects/instructblip

This article comes from the WeChat official account Xinzhiyuan (ID: AI_era).
