2025-01-31 Update | Source: SLTechnology News & Howtos
Shulou (Shulou.com) 12/24 report:
CTOnews.com reported on December 14 that, according to Huazhong University of Science and Technology, the VLRLab team led by Professor Bai Xiang of the university's School of Software recently released a multimodal large model named "Monkey". The model is claimed to be able to "observe" the world, carry out in-depth question-and-answer dialogue, and describe images accurately.
▲ Source: GitHub page of the Monkey project
CTOnews.com note: a multimodal large model is an AI architecture that can process and integrate multiple types of perceptual data (such as text, images, and audio) at the same time.
According to reports, the Monkey model performs well in experiments on 18 datasets, particularly in image description and visual question answering, surpassing many well-known existing models such as Microsoft's LLaVA, Google's PaLM-E, and Alibaba's mPLUG-Owl. In addition, Monkey shows "significant advantages" in text-dense question-answering tasks, on some samples even surpassing OpenAI's multimodal large model GPT-4V, the recognized industry leader.
One of Monkey's remarkable features is its ability to "look at a picture and talk about it". In detailed description tasks, Monkey perceives fine-grained image details and can detect content that other multimodal large models overlook. For example, when describing one test image, Monkey correctly identified it as a painting of the Eiffel Tower and gave a detailed account of its composition and color scheme; for the text in the lower-left corner, only Monkey and GPT-4V could accurately recognize it as the author's name.
Monkey is said to build a multi-level description generation method on top of existing tools: a five-step process of overall description, spatial positioning, modular identification, description selection and assignment, and final summarization. This fully combines the strengths of different tools, improving the accuracy and richness of the resulting description.
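The article names the five steps but gives no implementation details. Purely as an illustration, the pipeline could be wired together roughly as below; every function name here is a hypothetical stand-in for an external tool (a global captioner, a region detector, per-region recognizers, a selector, and a summarizer), not Monkey's actual code.

```python
# Hypothetical sketch of the five-step description pipeline described in the
# article. All functions are stand-ins with canned outputs; in a real system
# each stage would call a separate vision or language tool.

def overall_description(image):
    # Step 1: a global captioner produces a coarse scene-level summary.
    return "a painting of a tall iron tower at dusk"

def spatial_positioning(image):
    # Step 2: a detector proposes regions as (label, bounding-box) pairs.
    return [("tower", (120, 30, 400, 860)), ("text", (10, 850, 180, 890))]

def modular_identification(image, regions):
    # Step 3: specialized recognizers identify each region's content
    # (e.g. an OCR module for text regions, an object recognizer otherwise).
    identified = []
    for label, box in regions:
        if label == "text":
            identified.append((box, "signature: author's name"))
        else:
            identified.append((box, f"detected object: {label}"))
    return identified

def select_descriptions(candidates):
    # Step 4: select and assign the most reliable per-region descriptions.
    return [desc for _, desc in candidates if desc]

def final_summary(global_desc, details):
    # Step 5: summarize the global caption and selected details into one text.
    return global_desc + "; details: " + "; ".join(details)

def describe(image):
    # Chain the five stages into a single description call.
    g = overall_description(image)
    regions = spatial_positioning(image)
    identified = modular_identification(image, regions)
    chosen = select_descriptions(identified)
    return final_summary(g, chosen)

print(describe(None))
```

The point of such a staged design, as the article suggests, is that each tool contributes what it is best at, and the final summary step fuses their outputs into one coherent description.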
"Tools are like different parts; arranging and combining them sensibly lets them play their greatest role," said Professor Bai Xiang. "Our team has been engaged in image recognition research since 2003, and last year we brought in young researchers from overseas who specialize in multimodal large models. Monkey's final design was discussed over and over, and was settled only after trying more than ten options."
Another highlight of Monkey is its ability to handle images with resolutions up to 1344 × 896 pixels, six times the maximum size that other multimodal large models can currently process. This means Monkey can describe, and even reason about, larger images more accurately and in richer detail.
The Monkey code is currently open source on GitHub; CTOnews.com attaches the address:
https://github.com/Yuliang-Liu/Monkey