
Kunlun Wanwei's Tiangong large model tops the multimodal leaderboard.

2025-01-21 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 11/24 Report --

On September 5, Kunlun Wanwei's Tiangong large model ranked first in the evaluation of Multimodal Large Language Models (MLLMs) conducted by Tencent YouTu Lab and Xiamen University.

The leaderboard also includes multimodal models from around the world. The result indicates that the Kunlun Wanwei Tiangong model has reached a world-leading level in multimodal capability, which will strongly support key breakthroughs in the company's AI business matrix going forward.

A multimodal large language model (MLLM) relies on the LLM's rich knowledge reserve and strong reasoning and generalization abilities to solve multimodal problems. Impressive capabilities have already emerged, such as writing from images and image-text dialogue. Individual showcases like these, however, cannot fully reflect an MLLM's overall performance, and until now the industry has lacked a comprehensive evaluation.

Tencent YouTu Lab and Xiamen University conducted the first comprehensive quantitative evaluation of global MLLMs on the newly built MME benchmark and published 16 rankings: two overall lists for perception and cognition, plus 14 sub-lists. MME is a recently released benchmark that evaluates large multimodal language models on 14 subtasks covering perception and cognition. Skywork-MM, built by the multimodal team behind Kunlun Wanwei's Tiangong model, ranks first on the combined list, first in perception and second in cognition.
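For readers unfamiliar with how MME produces these rankings, the sketch below illustrates its published scoring rule: every image carries one "yes" and one "no" question, each subtask is scored as per-question accuracy plus the stricter per-image "accuracy+" (up to 200 points per subtask), and the perception and cognition lists sum their respective subtasks. This is a minimal illustration with toy data, not the lab's evaluation code.

```python
# Minimal sketch of MME-style scoring, based on the benchmark's published
# design: each image carries two yes/no questions; a subtask is scored as
# accuracy (per question) + "accuracy+" (per image, both questions right),
# so each subtask contributes up to 200 points. All data below is toy.

from typing import List, Tuple

def mme_subtask_score(image_results: List[Tuple[bool, bool]]) -> float:
    """image_results: one (q1_correct, q2_correct) pair per image."""
    n_images = len(image_results)
    n_questions = 2 * n_images
    n_correct = sum(q1 + q2 for q1, q2 in image_results)
    n_both = sum(1 for q1, q2 in image_results if q1 and q2)
    accuracy = 100.0 * n_correct / n_questions   # per-question accuracy, %
    accuracy_plus = 100.0 * n_both / n_images    # per-image strict accuracy, %
    return accuracy + accuracy_plus              # subtask score, max 200

# The perception score sums the 10 perception subtasks (max 2000);
# the cognition score sums the 4 cognition subtasks (max 800).
perception_subtasks = {
    "existence": [(True, True), (True, False), (True, True)],  # toy results
    "color": [(True, True), (False, False), (True, True)],
}
perception_score = sum(mme_subtask_score(r) for r in perception_subtasks.values())
print(f"perception score (toy, 2 of 10 subtasks): {perception_score:.1f}")
```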

▲ Perception list: ranked No. 1 ▲

▲ Cognition list: ranked No. 2 ▲

With the rapid development of large text models, building multimodal language models with multimodal understanding has become a general industry trend. Existing multimodal models understand multimodal information reasonably well, but problems remain. First, they suffer from serious hallucination: for most questions, the model tends to answer "yes", as shown in figure 1 (a small sketch of how this bias can be measured follows the screenshot below). Second, their cross-language ability is weak: answers in Chinese-language scenarios are unsatisfactory, and the model sometimes even replies directly in English, as shown in figure 2. For these problems, the multimodal team behind the Kunlun Wanwei Tiangong model offered its own solution: Skywork-MM.

▲ Screenshot from the Kunlun Wanwei Tiangong model team's paper ▲
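To make the yes-bias concrete, here is a minimal sketch of how one might quantify it on MME-style balanced yes/no pairs. The `ask_model` callable is a hypothetical stand-in for whatever inference interface a given MLLM exposes; nothing here comes from the Skywork-MM paper itself.

```python
# Minimal sketch of quantifying the yes-bias described above on MME-style
# yes/no question pairs. `ask_model` is a hypothetical stand-in for an
# MLLM's inference call; it should return "yes" or "no".

from typing import Callable, List, Tuple

def yes_bias_rate(
    ask_model: Callable[[str, str], str],
    qa_pairs: List[Tuple[str, str, str]],  # (image_path, question, gold_answer)
) -> float:
    """Fraction of questions answered 'yes'. An unbiased model should sit
    near 0.5 on MME, since every image pairs one 'yes' and one 'no' question."""
    answers = [ask_model(img, q).strip().lower() for img, q, _ in qa_pairs]
    return sum(a.startswith("yes") for a in answers) / len(answers)

# Example: a degenerate model that always says "yes" scores 1.0 here,
# and exactly 50% accuracy on MME's balanced yes/no pairs.
always_yes = lambda img, q: "yes"
toy = [("a.jpg", "Is there a dog?", "yes"), ("a.jpg", "Is there a cat?", "no")]
print(yes_bias_rate(always_yes, toy))  # -> 1.0
```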

According to a recent paper by the Kunlun Wanwei Tiangong multimodal team, on the data side the team addressed hallucination by constructing more diverse fine-tuning data, strengthening the large model's understanding of image features, improving the multimodal language model's instruction following, and reducing "hallucinations". As figure 1 shows, Skywork-MM markedly reduces hallucination (a generic sketch of this kind of balanced data construction appears after the figure):

▲ Figure 1 ▲
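As referenced above, one generic way to build fine-tuning data that counters yes-bias is to pair every ground-truth "yes" question with a matched question whose correct answer is "no". The sketch below illustrates the idea only: the paper does not publish the team's actual data pipeline, and all names here are illustrative.

```python
# Illustrative only: balance instruction-tuning data by generating, for each
# object actually present in an image, one "yes" question and one "no"
# question about an absent object. Not the Skywork-MM team's real pipeline.

import random

def make_balanced_pairs(image_id: str, present: set, vocabulary: set):
    """Yield (image_id, question, answer) tuples with an equal yes/no split."""
    absent = list(vocabulary - present)
    for obj in present:
        yield image_id, f"Is there a {obj} in the image?", "yes"
        neg = random.choice(absent)            # object NOT in the image
        yield image_id, f"Is there a {neg} in the image?", "no"

for pair in make_balanced_pairs("img_001", {"dog"}, {"dog", "cat", "car"}):
    print(pair)
```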

In addition, through appropriate data construction, Skywork-MM strengthens Chinese instruction following and recognition of China-specific scenes, reducing the impact of cultural bias on multimodal understanding. For example, existing large models struggle to accurately identify the TV program "If You Are the One", a typically Chinese scene, while Skywork-MM recognizes it reliably, as shown in figure 2:

▲ Figure 2 ▲

On the model side, the design freezes both the visual model and the large language model completely, preserving the visual features learned during the visual model's CLIP pre-training and keeping the LLM's language ability intact. To better associate visual features with language features, the overall model adds a learnable visual feature sampler and a LoRA adapter for the language model. Skywork-MM is trained in two stages: the first uses large-scale bilingual image-text pair data to learn image and language concepts; the second uses multimodal fine-tuning data for instruction fine-tuning. A schematic sketch of this setup follows the figure below.

▲ Screenshot from the Kunlun Wanwei Tiangong model team's paper ▲
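The sketch below restates that training setup in schematic PyTorch, assuming a CLIP-style visual encoder and a decoder LLM as stand-ins: both backbones are frozen, and only a learnable visual feature resampler plus LoRA adapters receive gradients. All dimensions and module choices are toy placeholders, not Skywork-MM's actual configuration.

```python
# Schematic sketch of the setup described above: visual encoder and LLM
# fully frozen; only the visual resampler and LoRA updates are trainable.
# Shapes and stand-in modules are toy placeholders.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # frozen pretrained weight
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

class VisualResampler(nn.Module):
    """Learnable queries cross-attend to frozen visual features,
    producing a fixed number of visual tokens for the LLM."""
    def __init__(self, vis_dim=1024, llm_dim=4096, n_queries=64, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, llm_dim))
        self.proj = nn.Linear(vis_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)

    def forward(self, vis_feats):              # (B, n_patches, vis_dim)
        kv = self.proj(vis_feats)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return out                             # (B, n_queries, llm_dim)

# Wiring: freeze both backbones, train only the resampler + LoRA weights.
vision_encoder = nn.Linear(768, 1024)          # stand-in for a CLIP ViT
llm_block = nn.Linear(4096, 4096)              # stand-in for an LLM layer
for p in vision_encoder.parameters():
    p.requires_grad = False
resampler = VisualResampler()
adapted_block = LoRALinear(llm_block)

img_feats = vision_encoder(torch.randn(2, 196, 768))  # frozen features
vis_tokens = resampler(img_feats)                     # trainable resampler
out = adapted_block(vis_tokens)                       # frozen LLM + LoRA

trainable = [p for m in (resampler, adapted_block) for p in m.parameters()
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Under this reading, both training stages described above would update only these trainable parameters; what changes between stages is the data (large-scale bilingual image-text pairs first, multimodal instruction data second).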

In the end, Skywork-MM used relatively little image-text data (about 50M pairs), far less than the 100M+ used by other existing MLLMs, yet it still took first place in the evaluation. This suggests that Kunlun Wanwei has found a suitable technical path for training multimodal large models, and that the team is strong.

Going forward, Kunlun Wanwei will accelerate the improvement of its multimodal capabilities and combine research, development, and products to drive its AI products in a multimodal direction. For example, the recently launched Tiangong AI Search, once equipped with strong multimodal capabilities, will give users a transformative search experience. Multimodal capability can be expected to give Kunlun Wanwei a significant advantage in the R&D, productization, and commercialization of AGI and AIGC, with applications across many industries, including advertising and marketing, games, entertainment, social networking, consulting, office work, finance, and energy.
