
Aliyun open-sources Tongyi multimodal vision-language model Qwen-VL, said to "far exceed the performance of general models of the same scale"



Shulou (Shulou.com) 11/24 Report --

Thanks to CTOnews.com netizens "West Window Past" and "South China Daniel Wu" for the tips! CTOnews.com reported on August 25 that Aliyun today launched the large-scale vision-language model Qwen-VL, which is now open-sourced on ModelScope. As CTOnews.com reported earlier, Aliyun had previously open-sourced its general-purpose 7-billion-parameter model Qwen-7B and the dialogue model Qwen-7B-Chat.

It is reported that Qwen-VL is a vision-language (Vision Language, VL) model that supports Chinese, English, and other languages. Compared with previous VL models, it not only has basic image-text capabilities such as recognition, description, question answering, and dialogue, but also adds capabilities such as visual grounding and understanding of Chinese text within images.

▲ Image source: arXiv paper

Qwen-VL uses Qwen-7B as its base language model and introduces a visual encoder into the model architecture so that the model can accept visual input. It supports an image input resolution of 448×448, whereas previously open-sourced LVLM models typically supported only 224×224.
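To make the description concrete, here is a minimal sketch of loading Qwen-VL from ModelScope and running a single image-plus-text query. It assumes the ModelScope `AutoModelForCausalLM`/`AutoTokenizer` wrappers and the `from_list_format` helper follow the conventions of the published model card; the model ID "qwen/Qwen-VL" and the file `demo.jpeg` are illustrative, not confirmed details.

```python
# Minimal sketch (assumptions noted): load Qwen-VL from ModelScope and
# run one image + text query. The model ID, the from_list_format helper,
# and demo.jpeg follow published model-card conventions.
from modelscope import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-VL", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-VL", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image with a text prompt; the model expects 448x448
# image input, which the bundled preprocessing handles.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},  # hypothetical local path or URL
    {"text": "Generate the caption in English:"},
])
inputs = tokenizer(query, return_tensors="pt").to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred[0], skip_special_tokens=True))
```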

Officials say the model can be used in scenarios such as knowledge question answering, image captioning, image question answering, document question answering, and fine-grained visual grounding. In mainstream multimodal task benchmarks and multimodal chat capability evaluations, its performance far exceeds that of general models of the same scale.

▲ Image source: ModelScope

In addition, on the basis of Qwen-VL, the Tongyi Qianwen team used an alignment mechanism to create Qwen-VL-Chat, an LLM-based visual AI assistant that lets developers quickly build dialogue applications with multimodal capabilities.
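For the dialogue side, here is a hedged sketch of what building on Qwen-VL-Chat could look like, again following the published model card conventions: the model ID "qwen/Qwen-VL-Chat" and the `chat()`/`history` interface are assumptions here, not verified API documentation.

```python
# Minimal sketch (assumptions noted): a two-turn multimodal dialogue
# with Qwen-VL-Chat via ModelScope. The chat()/history interface and
# model ID follow published model-card conventions.
from modelscope import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# First turn: an image plus a question about it.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},  # hypothetical local path or URL
    {"text": "What is shown in this picture?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

# Follow-up turn: passing the accumulated history gives multi-turn
# dialogue grounded in the same image.
response, history = model.chat(tokenizer, "Describe it in one sentence.", history=history)
print(response)
```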

The Tongyi Qianwen team also said that, to test the model's multimodal dialogue ability, they built a test suite called "TouchStone" based on a GPT-4 scoring mechanism and compared Qwen-VL-Chat against other models. Qwen-VL-Chat achieved the best open-source LVLM results in both the Chinese and English alignment tests.

▲ Image source: ModelScope
