
A "multimodal large model + embodied intelligence" system makes its debut at a large-scale sports event: Peking University students develop cutting-edge tech for the Asian Games

2025-01-15 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 11/24 Report --

During the recent Hangzhou Asian Games, an intelligent "tour guide" attracted a great deal of attention. It is not a person but a vehicle-like robot: four wheels let it move quickly and nimbly across the ground, and it carries a robotic arm roughly the height of a person, along with a camera and interactive facilities such as voice and display interfaces, so that it can recognize and understand its surroundings and the tasks it is asked to perform.

It is reported that this "tour guide" robot system was developed by the HMI team at the School of Computer Science, Peking University. It combines two of the most cutting-edge AI technologies: multimodal large models and embodied intelligence. During the Asian Games it provided guidance and navigation for visually impaired visitors, analyzing their needs and completing the corresponding tasks, such as picking up dropped objects, in its own distinctive way, thereby contributing to the success of the Games.

[Photo] The multimodal intelligent caring assistant developed by the Peking University team in service during the Asian Games

"The multimodal intelligent caring assistant we developed is built on the team's self-developed perception-generation integrated multimodal large model," introduced Zhang Shanghang, a researcher at the School of Computer Science, Peking University. "The system can accurately perceive and understand visual scenes, generate precise and rich language descriptions, and translate complex human instructions into concrete actions. Through collaborative, efficient fine-tuning of large and small models under an end-cloud collaboration scheme, it also strengthens the model's generalization so that it can quickly adapt to new scenes."

"Taking language, 2D, 3D and other modalities as input, the multimodal large model analyzes the received instructions together with the surrounding environment, decomposes the task, and generates the corresponding actions to complete the service. I hope our research can use scientific and technological innovation to empower vulnerable groups, so that in the future more people can feel the warmth of technology and experience the splendor of the Asian Games."

The "multimodal large model + embodied intelligence" system landed on the ground for the first time in a large-scale sports event.

"there are many applications of cutting-edge technology in the Asian Games, which greatly enhance the competition experience of athletes and the experience of spectators. "however, after in-depth research and observation, we found that the current technology does not fully meet the needs of specific audience groups, such as ethnic minorities and people with disabilities," said Zhuang Xining, a student at Peking University. Spectators of ethnic minorities may face language barriers, while people with disabilities may need more aids or special services to better enjoy the competition. "

To solve this problem, the team came up with the idea of developing an AI system specifically to help people with disabilities watch the games. "Multimodal large models are a key research direction of our group, and we wondered whether it would be possible to combine a multimodal large model with embodied intelligence to give the robot a more intelligent brain, so that it can translate complex human needs into concrete action instructions."

"in this way, our caring assistants can better interact with users, understand their needs, and quickly make targeted responses, better serve the vulnerable audience groups of the Asian Games, and let more people experience the change and warmth brought about by AI technology. "

Under the guidance and support of researcher Zhang Shanghang, the students moved quickly and pursued an innovative path: designing a "perception-generation integrated multimodal large model" that can accurately perceive and understand all kinds of visual scenes and generate precise, rich language descriptions.

[Photo] Researcher Zhang Shanghang (fourth from left) and student team members

At the same time, the team combined the multimodal large model with embodied intelligence. Because the robot faces different scenes in the field, it needs the generalization ability to adapt quickly to new ones; for this reason, the team designed a collaborative, efficient fine-tuning scheme for large and small models based on end-cloud cooperation, improving the model's generalization so that it can keep adapting to different scenarios.

The multimodal caring assistant that showed its skills during the Asian Games is built on the team's self-developed perception-generation integrated general multimodal large model. At its core is a multimodal large model with 7B / 13B parameters, which integrates the generalized perception ability of a visual foundation model with the emergent abilities of a large language model.
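
The article does not describe the model's internal architecture. Purely as an illustration of how a visual foundation model is commonly coupled with a large language model, the sketch below projects image features into the language model's embedding space and feeds them in alongside the text tokens; all module names and dimensions are placeholders, not the team's design.

```python
# Generic sketch (not the PKU team's actual architecture): image features are
# projected into the language model's embedding space and prepended to the
# text tokens before decoding. Dimensions are toy-sized for illustration.
import torch
import torch.nn as nn


class VisionLanguageModel(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=512, vocab_size=32000, num_layers=2):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)   # stand-in for a pretrained vision backbone
        self.projector = nn.Linear(vision_dim, llm_dim)           # maps image features into LLM space
        self.token_embedding = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=num_layers)  # stand-in for the language model
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_features, text_ids):
        vis = self.projector(self.vision_encoder(image_features))  # (B, N_img, llm_dim)
        txt = self.token_embedding(text_ids)                       # (B, N_txt, llm_dim)
        hidden = self.llm(torch.cat([vis, txt], dim=1))            # visual tokens prefix the text
        return self.lm_head(hidden)                                # next-token logits


# Toy usage: 16 image patch features and an 8-token instruction.
model = VisionLanguageModel()
logits = model(torch.randn(1, 16, 1024), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```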

For example, after hearing the user say "I'm thirsty," the robot can automatically turn around, pick up a bottle of water from the table, and deliver it to the user. This seemingly simple process actually involves a series of complex subtasks (a toy pipeline sketch follows the list):

1. The robot first needs to capture the audio signal of someone saying "I'm thirsty" and convert it into text through speech recognition.
2. It then needs to understand the meaning of "I'm thirsty," namely that the speaker needs water.
3. Next, it needs to know where to find water, which requires good awareness of the environment, using computer vision to identify and locate the bottled water.
4. After locating the bottled water, it needs to plan a path to reach it, which involves a path-planning algorithm.
5. With the path planned, it needs to control its own motion and move to the position of the bottled water.
6. On reaching the bottle, it needs to grasp it accurately, which involves visual detection, the robot's control system, and grasping techniques.
7. After grabbing the water, it needs to plan the return path and control its motion to deliver the water into the speaker's hands.
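
The article lists these subtasks only at a high level. As a minimal sketch of how the stages chain together, the toy pipeline below mirrors the list; every function is a hypothetical placeholder standing in for the corresponding component (speech recognition, vision, planning, motion control), not the team's code.

```python
# Illustrative only: a toy pipeline mirroring the subtasks listed above.
from dataclasses import dataclass


@dataclass
class WorldState:
    target: str | None = None
    target_position: tuple | None = None
    holding: bool = False


def speech_to_text(audio: bytes) -> str:
    return "I'm thirsty"                      # placeholder for a speech-recognition model


def understand_intent(text: str) -> str:
    return "fetch water" if "thirsty" in text.lower() else "idle"


def locate_object(name: str) -> tuple:
    return (2.0, 1.5)                         # placeholder for vision-based detection and localization


def plan_path(goal: tuple) -> list[tuple]:
    return [(0.0, 0.0), (1.0, 0.7), goal]     # placeholder for a path planner


def fetch_water(audio: bytes) -> WorldState:
    state = WorldState()
    if understand_intent(speech_to_text(audio)) != "fetch water":
        return state
    state.target = "bottled water"
    state.target_position = locate_object(state.target)
    for waypoint in plan_path(state.target_position):
        pass                                  # motion control would drive the robot to each waypoint
    state.holding = True                      # grasping step
    # plan the return path and hand the bottle to the user (omitted)
    return state


print(fetch_water(b"..."))
```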

Each subtask requires a great deal of research and engineering work. Beyond that, the robot must also handle situations that never appeared in its training data; in other words, the model needs strong generalization ability so that it can work effectively in new, unknown environments.

To improve the robot's continual generalization ability in open environments, the team built an end-cloud collaborative continual learning system. The design keeps the advantages of on-device computing, namely personalization, privacy protection and low communication cost, while making full use of the cloud's large-scale computing resources, abundant labeled data and strong generalization ability. Through efficient data transmission and sensible resource allocation, it achieves collaborative learning between the large cloud model and the small on-device model with high generalization.

"on the terminal equipment, we have deployed a compressed multimodal model, which can estimate uncertainty at the same time when reasoning," said Li Shanghang, a researcher. This intelligent strategy allows us to actively filter out samples with high uncertainty and send them back to the cloud. These highly uncertain samples usually involve new data distribution, from new scenarios, new environments or new events, which need to be identified and understood in an open environment. "

Team member Liu Jiaming added, "Once these highly uncertain samples reach the cloud, we use the uncompressed multimodal large model to analyze and learn from them in depth. Through knowledge distillation and efficient fine-tuning, we then transfer the knowledge extracted from these difficult samples to the compressed model on the terminal. This process greatly improves the generalization ability of the compressed multimodal model, so the robot can keep adapting to and understanding all kinds of scenarios in the open world."
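
The exact distillation and fine-tuning recipe is not described in the article. A standard way to realize the cloud-to-edge transfer mentioned above is soft-label knowledge distillation, where the large cloud model acts as the teacher and the compressed on-device model is the student; the sketch below shows one such training step under that assumption.

```python
# Generic knowledge-distillation step for the cloud-to-edge transfer described
# above (the team's exact recipe is not given in the article): the student is
# trained on the hard samples to match the teacher's softened outputs.
import torch
import torch.nn.functional as F


def distill_step(teacher, student, optimizer, hard_samples, temperature: float = 2.0):
    with torch.no_grad():
        teacher_logits = teacher(hard_samples)          # uncompressed cloud model
    student_logits = student(hard_samples)              # compressed on-device model
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage: a larger teacher and a small student over the same 10-way output space.
teacher = torch.nn.Sequential(torch.nn.Linear(16, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))
student = torch.nn.Linear(16, 10)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
print(distill_step(teacher, student, opt, torch.randn(8, 16)))
```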

The end-cloud collaborative continual learning system proposed by the team thus plays to the respective strengths of cloud and terminal computing, and through intelligent sample selection and knowledge transfer it achieves continual learning and adaptation for the robot in open environments. The approach significantly improves both the generalization and the efficiency of the multimodal large model, giving the robot system stronger intelligence in the open world.

Scientific and technological innovation empowers vulnerable groups and shows the "warmth of AI" at the Asian Games.

Breakthroughs in deep learning and large model technology have brought revolutionary changes to artificial intelligence research. Pre-trained models such as ChatGPT and GPT-4 have become the core of AIGC systems. Driven by infrastructure support, top-level design optimization and strong downstream demand, large AI models are in a period of great opportunity.

However, large model research is still at an early stage, and a number of key scientific problems and bottleneck technologies remain to be solved, including how to handle multiple input modalities at the same time, how to train models with large-scale parameters efficiently, how to perform transfer learning and fine-tuning of large models, how to carry out multimodal and multi-task learning, how to fuse across languages, and how to support human-machine collaboration.

The team's self-developed perception-generation integrated general multimodal large model has demonstrated excellent unified processing capabilities, including: visual question answering (VQA), answering natural-language questions about images; captioning, generating descriptive text for images; behavioral decision-making and planning, making decisions and plans based on image and text information; and object detection, identifying specific targets or features in an image.
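
These four capabilities can live in a single instruction-driven model: the same network answers different prompts for VQA, captioning, planning, and detection. The snippet below illustrates only that usage pattern; MultimodalModel and its generate method are hypothetical stand-ins, not the team's interface.

```python
# "One model, many tasks" usage pattern: the same multimodal model is driven
# by different instruction prompts. The class below is a hypothetical stub.
class MultimodalModel:
    def generate(self, image, prompt: str) -> str:
        # A real model would encode the image, condition on the prompt,
        # and decode a text answer; here we just echo the request.
        return f"[model output for: {prompt}]"


model = MultimodalModel()
image = "frame_001.jpg"   # stand-in for image data

print(model.generate(image, "Question: what is on the table? Answer:"))           # VQA
print(model.generate(image, "Describe this image in one sentence."))              # captioning
print(model.generate(image, "The user says 'I'm thirsty'. Plan the next steps.")) # decision-making and planning
print(model.generate(image, "List the bounding boxes of all bottles."))           # detection expressed as text
```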

"the multimodal large model is the core of our group's research," said Wang Guanqun, a postdoctoral fellow at the School of computer Science of Peking University. "at present, some achievements have been made, in addition to the self-developed perception generation of integrated general multimodal large model, large and small model collaborative training and deployment. We also focus on multi-modal generation large model Agent design, large model memory mechanism design, multi-scene-oriented intelligent multi-modal large model cluster, general large model adapter and so on. "

It is reported that the integrated large model toolchain (X-Accessory) developed by the team aims to lower the barrier to using large models, so that practitioners in various industries can easily fine-tune large models and evolve them within their own proprietary domains to meet specialized needs flexibly. "On the hardware side we provide high-performance all-in-one machines, as well as the option to access cloud computing services; on the software side the X-Accessory toolchain gives users a flexible environment for debugging and applying large models. The toolchain can be used to train and deploy all kinds of tasks, including but not limited to proprietary tasks in vertical domains such as knowledge question answering, traffic task scheduling, and recommendation."
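
The article does not document X-Accessory's actual interface, so the sketch below only illustrates the general idea behind adapting a large model to a proprietary vertical domain: keep the large base model frozen and train a small adapter on domain data. All names and dimensions are illustrative.

```python
# Generic domain-adaptation sketch (not X-Accessory's API): freeze the large
# base model and train only a small residual adapter on proprietary data,
# e.g. a vertical knowledge-QA dataset.
import torch
import torch.nn as nn


class FrozenBaseWithAdapter(nn.Module):
    def __init__(self, base: nn.Module, hidden_dim: int, bottleneck: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the large model stays frozen
            p.requires_grad = False
        self.adapter = nn.Sequential(         # only this small module is trained
            nn.Linear(hidden_dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, hidden_dim)
        )

    def forward(self, x):
        h = self.base(x)
        return h + self.adapter(h)            # residual adapter on top of base features


base = nn.Linear(32, 32)                      # stand-in for a large pretrained model
model = FrozenBaseWithAdapter(base, hidden_dim=32)
opt = torch.optim.Adam(model.adapter.parameters(), lr=1e-3)

x, target = torch.randn(4, 32), torch.randn(4, 32)   # stand-in for domain-specific data
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
opt.step()
print(f"adapter fine-tuning loss: {loss.item():.4f}")
```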

Based on the multimodal large model, and under the guidance of Professor Huang Tiejun of the School of Computer Science, Peking University, and researcher Zhang Shanghang, the team also developed an intelligent AI event commentary system for the Asian Games. Professor Huang Tiejun proposed the principle of continuous pulse photography, in which each pixel emits a pulse whenever its integral of light intensity reaches a fixed threshold, and the camera's speed is limited only by the shortest signal readout time the circuitry can achieve. This overturns the timed-exposure imaging principle that has held for nearly two centuries and solves the problem that traditional cameras cannot combine ultra-high speed with high dynamic range. It has been recognized by the Chinese Institute of Electronics as "a major original innovation in the field of ultra-high-speed imaging and machine vision, with ultra-high-speed imaging technology reaching an internationally leading level." With a high-speed pulse camera, ultra-high-speed, high-dynamic-range, full-frame continuous imaging can be achieved at the same time. On this basis, using the self-developed X-Accessory integrated large model toolchain, the team designed a multimodal, multilingual video commentary system, which was used for table tennis, taekwondo, diving, gymnastics and other events during the Asian Games. What sets this commentary system apart is that it can not only understand and analyze the ongoing competition and generate commentary in real time, but also provide personalized commentary according to spectators' preferences, including translating the commentary into multiple languages such as Uyghur and Arabic, giving spectators around the world a richer experience.
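
As a toy illustration of the integrate-and-fire principle described above, the simulation below has each pixel accumulate light intensity and emit a pulse whenever the accumulated value reaches a fixed threshold, so brighter pixels fire more often. The threshold and intensity units are made up for the example, not the specifications of the actual pulse camera.

```python
# Toy per-pixel "integrate and fire" simulation: accumulate light intensity,
# emit a pulse when the integral crosses the threshold, then subtract the
# threshold and keep integrating. Units are illustrative only.
import numpy as np


def pulse_camera(intensity_frames: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """intensity_frames: (T, H, W) light intensity over time -> (T, H, W) binary pulses."""
    accumulator = np.zeros(intensity_frames.shape[1:])
    pulses = np.zeros_like(intensity_frames, dtype=np.uint8)
    for t, frame in enumerate(intensity_frames):
        accumulator += frame                        # integrate light at every pixel
        fired = accumulator >= threshold            # pixels whose integral reached the quota
        pulses[t][fired] = 1                        # emit a pulse
        accumulator[fired] -= threshold             # reset by subtracting the threshold
    return pulses


# Brighter pixels reach the threshold sooner, so they fire pulses more often.
frames = np.random.default_rng(0).uniform(0.0, 0.3, size=(100, 4, 4))
print(pulse_camera(frames).sum(axis=0))             # pulse counts per pixel over 100 steps
```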

[Photo] Backed by the high-speed pulse camera, the intelligent event commentary system developed by the Peking University team can clearly image high-speed sports scenes, capture critical moments in the competition, and provide on-site commentary and reports in multiple languages, helping more people follow the Asian Games and in particular improving the experience for domestic ethnic minorities and spectators from other countries and language backgrounds.

In addition, the team is also working on agent design for multimodal generative large models. "At present, most models are unimodal and cannot effectively combine visual, auditory, textual and other modal information. This limitation can lead to unsatisfactory results in complex real-world scenarios such as virtual assistants, robot interaction and smart cities. We have therefore developed a multimodal generative large model agent that combines the strengths of the different modalities, such as the fine detail captured by vision, the temporal cues in audio, and the structured knowledge in text. Such a comprehensive design will help push generative models in a more practical and efficient direction and meet the needs of a variety of complex application scenarios in the future."

For more complex application scenarios, the team has also studied intelligent multimodal large model clusters oriented to multiple settings. They designed and implemented a group of such models, including a patient-oriented multimodal time-series large model for personalized knowledge question answering, a doctor-oriented multimodal large model for generating clinical image reports, and a retrieval-augmented large language model for guidance scenarios. These adapt large model technology to clinical settings, meet the varied needs of patients, doctors and hospitals, address industry pain points, and promote the practical deployment of large models in the field.

In this era of rapidly changing science and technology, the team, with its deep professional knowledge and innovative spirit, has not only provided strong scientific and technological support for the Asian Games, but also brought substantial help to vulnerable groups.

In the future, the team will continue to adhere to the principle of Tech for Good, deepen its research and practice in multimodal large models, unlock the full potential of AI technology, and provide stronger support for solving social problems and improving people's lives.

Welcome to subscribe to "Shulou Technology Information" to get the latest news, interesting stories and hot topics in the IT industry, and to keep up with the newest Internet news, technology news and IT industry trends.


*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
