Even the century-old duck-rabbit illusion has been figured out: Microsoft's multimodal "Kosmos" takes on an IQ test with only 1.6 billion parameters

Microsoft Research Asia has released KOSMOS-1, a multimodal large language model with only 1.6 billion parameters that can not only answer questions about images but also take on the Raven IQ test.

The large-model race is so intense that even going without sleep isn't enough to keep up with the progress...

Microsoft Research Asia has just released a Multimodal Large Language Model (MLLM): KOSMOS-1.

Paper address: arxiv.org/pdf/2302.14045.pdf. The title, "Language Is Not All You Need," also comes from a famous quotation.

The paper carries the line: "The limits of my language are the limits of my world" (Ludwig Wittgenstein, Austrian philosopher).

So, here comes the question...

Ask KOSMOS-1, "Is it a duck or a rabbit?" This image, more than 100 years old, once left Google's AI completely stumped.

In 1899, American psychologist Joseph Jastrow first used the "duck-rabbit illusion" to show that perception is not only what people see but also a mental activity.

KOSMOS-1 can now combine this perception with language models.

- What's in the picture?

- Like a duck.

- If it's not a duck, what is it?

- Looks more like a rabbit.

- Why?

- It has rabbit ears.

KOSMOS-1 looks a bit like Microsoft's own version of ChatGPT.

KOSMOS-1 can understand images, text, and images paired with text, and can handle OCR, image captioning, and visual question answering.

Even IQ tests are not a problem.

Kosmos is derived from the Greek word cosmos, which means "universe."

According to the paper, the latest Kosmos-1 model is a multimodal large-scale language model.

The backbone is a Transformer-based causal language model; in addition to text, other modalities such as vision and audio can be embedded into the model.

The Transformer decoder acts as a universal interface for multimodal inputs, so it can sense general modalities, learn context, and follow instructions.
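To make this "universal interface" idea concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) of how projected image features and text-token embeddings can be spliced into one sequence that a causal Transformer decoder processes; all module names and sizes below are assumptions.

```python
import torch
import torch.nn as nn

class MultimodalDecoderSketch(nn.Module):
    """Toy sketch (not the authors' code): a causal LM whose input sequence
    mixes text-token embeddings with projected image features."""

    def __init__(self, vocab_size=1000, d_model=256, n_heads=8, n_layers=2,
                 image_feat_dim=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Project vision-encoder features (e.g. CLIP ViT patch features)
        # into the language model's embedding space.
        self.img_proj = nn.Linear(image_feat_dim, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_feats):
        # text_ids: (B, T) token ids; image_feats: (B, P, image_feat_dim) patches.
        txt = self.tok_emb(text_ids)
        img = self.img_proj(image_feats)
        seq = torch.cat([img, txt], dim=1)              # one interleaved sequence
        L = seq.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        hidden = self.decoder(seq, mask=causal)         # causal self-attention
        return self.lm_head(hidden)                     # next-token logits

model = MultimodalDecoderSketch()
logits = model(torch.randint(0, 1000, (1, 8)), torch.randn(1, 4, 1024))
print(logits.shape)  # torch.Size([1, 12, 1000])
```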

KOSMOS-1 achieves impressive performance on language and multimodal tasks without any fine-tuning, including image recognition from text instructions, visual question answering, and multimodal dialogue.

Here are some example patterns generated by Kosmos-1.

Image captioning, visual question answering, web page question answering, simple arithmetic, and number recognition.

So, on what data sets was Kosmos-1 pre-trained?

The datasets used for training include text corpora, image-caption pairs, and interleaved image-text data.

The text corpora come from The Pile and Common Crawl (CC);

The image-caption pairs come from English LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions;

The interleaved image-text data comes from Common Crawl snapshots (a rough sketch of this interleaved format is shown below).
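As an illustration of what an interleaved image-text document might look like before tokenization (the exact format here is my assumption, not the paper's preprocessing), each document can be treated as a sequence of text spans and image placeholders that are later swapped for image embeddings:

```python
# Hypothetical representation of one interleaved image-text training document.
# Text spans stay as strings; images are referenced by placeholders that the
# model later replaces with projected vision-encoder features.
doc = [
    {"type": "text",  "content": "A classic ambiguous figure: "},
    {"type": "image", "source": "duck_rabbit.jpg"},
    {"type": "text",  "content": " It can be seen as either a duck or a rabbit."},
]

def to_token_stream(doc, image_token="<image>"):
    """Flatten a document into a single string with image placeholders,
    plus the list of image sources in order of appearance."""
    pieces, images = [], []
    for seg in doc:
        if seg["type"] == "text":
            pieces.append(seg["content"])
        else:
            pieces.append(image_token)
            images.append(seg["source"])
    return "".join(pieces), images

stream, images = to_token_stream(doc)
print(stream)   # "A classic ambiguous figure: <image> It can be seen as ..."
print(images)   # ['duck_rabbit.jpg']
```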

With the data in place, the next step is to pretrain the model.

The MLLM component has 24 layers, a hidden size of 2,048, an FFN hidden size of 8,192, and 32 attention heads, yielding approximately 1.3B parameters.

Magneto initialization is used to stabilize optimization. For faster convergence, image representations are obtained from a pre-trained CLIP ViT-L/14 model with 1,024 feature dimensions; during training, images are preprocessed to 224×224 resolution, and the CLIP parameters are frozen except for the last layer.

The total number of parameters for KOSMOS-1 is approximately 1.6 billion.
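A back-of-the-envelope check of those figures, under standard Transformer parameter-count assumptions (biases and layer norms ignored; the vocabulary size is an assumed value, not stated above):

```python
# Rough parameter estimate for the described backbone:
# 24 layers, hidden size 2048, FFN size 8192, 32 heads (64 dims per head).
d_model, d_ffn, n_layers = 2048, 8192, 24
vocab_size = 64_000                        # assumed vocabulary size (not stated above)

attn_per_layer = 4 * d_model * d_model     # Q, K, V and output projections
ffn_per_layer = 2 * d_model * d_ffn        # up- and down-projections
backbone = n_layers * (attn_per_layer + ffn_per_layer)
embedding = vocab_size * d_model

print(f"backbone only: {backbone / 1e9:.2f}B")                  # ~1.21B
print(f"with embeddings: {(backbone + embedding) / 1e9:.2f}B")  # ~1.34B, near the quoted 1.3B
# Adding the ~0.3B parameters of the CLIP ViT-L/14 image encoder brings the
# total close to the quoted 1.6B.
```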

To better align KOSMOS-1 with instructions, it underwent language-only instruction tuning [LHV+23, HSLS22]: training was continued on instruction data, which is language-only data mixed with the training corpus.

The tuning follows the language-modeling objective, using Unnatural Instructions [HSLS22] and FLANv2 [LHV+23] as the instruction datasets.
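A minimal sketch of what "continuing training on language-only instruction data mixed with the corpus" could look like in practice; the prompt template and mixing ratio below are assumptions for illustration, not values from the paper:

```python
import random

# Hypothetical instruction example, formatted as plain language-modeling text.
def format_instruction(example):
    return (f"Instruction: {example['instruction']}\n"
            f"Input: {example['input']}\n"
            f"Output: {example['output']}")

instruction_data = [
    {"instruction": "Classify the sentiment.",
     "input": "I loved this movie.",
     "output": "positive"},
]
pretraining_corpus = [
    "The limits of my language are the limits of my world.",
]

def sample_batch(batch_size=4, instruction_ratio=0.5):
    """Mix instruction examples with ordinary corpus text for continued training;
    the 50/50 ratio is an assumption, not a value from the paper."""
    batch = []
    for _ in range(batch_size):
        if random.random() < instruction_ratio:
            batch.append(format_instruction(random.choice(instruction_data)))
        else:
            batch.append(random.choice(pretraining_corpus))
    return batch

print(sample_batch(2))
```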

The results show that improvements in instruction-following ability transfer across modalities.

In summary, an MLLM can benefit from cross-modal transfer, carrying knowledge from language to multimodal tasks and vice versa.

Ten tasks across five categories: whether the model holds up, you only find out by putting it through its paces.

The research team conducted experiments to evaluate the performance of KOSMOS-1 from multiple perspectives, including ten tasks in five categories:

1 Language tasks (language understanding, language generation, OCR-free text classification)

2 Cross-modal transfer (commonsense reasoning)

3 Nonverbal reasoning (IQ test)

4 Perception-language tasks (image captioning, visual question answering, web page question answering)

5 Visual tasks (zero-shot image classification, zero-shot image classification with descriptions)

OCR-free text classification: a text-and-image understanding task that does not rely on optical character recognition (OCR).

On the HatefulMemes and Rendered SST-2 test sets, KOSMOS-1's accuracy is higher than that of other models.

Furthermore, while Flamingo explicitly provides OCR text in the prompt, KOSMOS-1 does not access any external tools or resources, which demonstrates its intrinsic ability to read and understand the text in rendered images.
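To illustrate what an OCR-free input looks like, here is a small sketch (assuming Pillow is installed; the font and layout choices are my own) that renders a sentence onto an image, which the multimodal model then classifies directly rather than reading the text back out:

```python
from PIL import Image, ImageDraw, ImageFont

def render_sentence(text, size=(224, 224)):
    """Render a sentence onto a blank image, in the spirit of 'Rendered SST-2'
    style inputs (font and layout details here are assumptions)."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), text, fill="black", font=ImageFont.load_default())
    return img

img = render_sentence("a gorgeous, witty, seductive movie.")
img.save("rendered_sst2_example.png")
# The model classifies this image directly, e.g. with a prompt such as
# "Question: does the text convey positive or negative sentiment? Answer:",
# without any OCR step in between.
```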

The Raven IQ test is one of the most commonly used tests for assessing nonverbal abilities.

KOSMOS-1 improved accuracy by 5.3% over random selection without fine-tuning and by 9.3% with fine-tuning, indicating its ability to perceive abstract conceptual patterns in non-verbal environments.

This is the first time that a model has been able to complete the zero-sample Raven test, demonstrating the potential of MLLMs for zero-sample nonverbal reasoning by combining perception with language models.
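One way to frame a Raven-style puzzle for zero-shot evaluation, sketched here in generic terms rather than as the paper's exact procedure, is to complete the 3×3 matrix with each candidate answer in turn and keep the candidate the model scores as most likely; `score_completion` below is a hypothetical placeholder for the model's likelihood:

```python
# Sketch of zero-shot Raven-style selection by likelihood comparison.
def score_completion(matrix_with_candidate):
    """Hypothetical: log-likelihood the model assigns to the filled-in matrix."""
    return -(hash(tuple(matrix_with_candidate)) % 100)  # dummy score for illustration

def solve_raven(matrix_cells, candidates):
    """Fill the missing cell with each candidate and keep the best-scoring one."""
    best_idx, best_score = None, float("-inf")
    for i, cand in enumerate(candidates):
        completed = matrix_cells + [cand]
        score = score_completion(completed)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Usage: eight known cells (stand-in labels) plus six answer options.
cells = [f"cell_{i}" for i in range(8)]
options = [f"option_{j}" for j in range(6)]
print(solve_raven(cells, options))  # index of the chosen option
```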

Image captioning: KOSMOS-1 performs well on zero-shot tests on both COCO and Flickr30k, scoring higher than other models while using fewer parameters.

In few-shot tests, scores increase as the number of shots k increases.

Zero-shot image classification: given an input image, it is concatenated with the prompt "The photo of the", and the model is then fed this input to produce the category name of the image.
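A hedged sketch of that prompt-based setup: the image is followed by the text prompt, and one common zero-shot recipe (scoring each candidate class name by how likely the model finds it as a continuation, rather than free-form decoding) is shown below; `model_logprob` is a stand-in, not a real API:

```python
def model_logprob(image, text):
    """Placeholder: log-probability the multimodal model assigns to `text`
    as the continuation after `image`. A real system would call the MLLM."""
    return -len(text)  # dummy value for illustration only

def classify_zero_shot(image, class_names, prompt="The photo of the"):
    """Concatenate the image with the prompt, then pick the class name the
    model would most plausibly generate as the continuation."""
    scores = {name: model_logprob(image, f"{prompt} {name}.") for name in class_names}
    return max(scores, key=scores.get)

print(classify_zero_shot("duck_rabbit.jpg", ["duck", "rabbit", "hamster"]))
# With the dummy scorer the shortest name wins; a real model scores semantics.
```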

Evaluated on ImageNet [DDS+09], KOSMOS-1 performs significantly better than GIT [WYH+22] in image classification both with and without constraints, demonstrating a strong ability on visual tasks.

Visual commonsense reasoning tasks require models to understand the properties of everyday objects in the real world, such as color, size, and shape; these tasks are challenging because they may require more information about object properties than text alone provides.

The results show that KOSMOS-1 outperforms the LLM baseline on both size and color reasoning. This is mainly due to KOSMOS-1's multimodal transfer capability, which lets it apply visual knowledge to language tasks rather than relying, as an LLM must, solely on textual knowledge and cues for reasoning.

Commenting on Microsoft's KOSMOS-1, one netizen wrote: "Within the next five years, I can see an advanced robot browsing the web and working purely visually, based only on human text input. Interesting times."

References:

https://arxiv.org/pdf/2302.14045.pdf

This article comes from Weixin Official Accounts: Xinzhiyuan (ID: AI_era)
