The original title: "The Terminator's strongest brain! Google releases PaLM-E, the largest 'generalist' model in history: with 562 billion parameters, it can describe what it sees and also control robots."
Google has just dropped a blockbuster "generalist" model: PaLM-E, with 562 billion parameters. It is a multimodal visual language model that can do everything from guiding a robot through tasks to answering questions about the observable world.
The rapid "mutation" of large language models is making human society look more and more like science fiction. With this branch of the technology tree lit up, a Terminator-style reality seems to be drawing ever closer.
Just a few days ago, Microsoft announced an experimental framework that can use ChatGPT to control robots and drones.
Google, of course, did not want to be left behind: on Monday, a team from Google and the Technical University of Berlin released PaLM-E, the largest visual language model in history.
Paper: https://arxiv.org/abs/2303.03378
As a multimodal embodied visual language model (VLM), PaLM-E can understand images, understand and generate language, and combine the two to handle complex robot instructions.
Moreover, by combining the PaLM-540B language model with the ViT-22B vision Transformer, PaLM-E ends up with as many as 562 billion parameters.
A "generalist" model spanning robotics and vision-language
PaLM-E, full name Pathways Language Model with Embodied, is an embodied visual language model.
Its strength is that it can use visual data to enhance its language processing ability.
What happens when the largest visual language model is trained and combined with a robot? The result is PaLM-E: a 562-billion-parameter, general-purpose, embodied visual-language generalist spanning robotics, vision, and language. According to the paper, PaLM-E is a decoder-only LLM that generates text completions autoregressively given a prefix or prompt.
Its training data are multimodal sentences that interleave visual inputs, continuous state estimates, and encoded text.
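To make that concrete, the sketch below is a simplified illustration rather than Google's actual implementation (the encoder classes, dimensions, and function names are assumptions): it shows how image embeddings and a continuous robot-state vector might be interleaved with text token embeddings into a single prefix that a decoder-only language model then completes autoregressively.

```python
# Minimal sketch (illustrative, not Google's code): building a multimodal
# "sentence" -- interleaved image, state, and text embeddings -- as the
# prefix for a decoder-only language model.
import torch
import torch.nn as nn

D_MODEL = 512  # embedding width of the (hypothetical) language model

class ImageEncoder(nn.Module):
    """Stand-in for a vision encoder: maps an image to a few prefix embeddings."""
    def __init__(self, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.proj = nn.Linear(3 * 224 * 224, n_tokens * D_MODEL)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        flat = image.flatten(start_dim=1)          # (B, 3*224*224)
        return self.proj(flat).view(-1, self.n_tokens, D_MODEL)

class StateEncoder(nn.Module):
    """Projects a continuous robot state estimate into embedding space."""
    def __init__(self, state_dim: int = 7):
        super().__init__()
        self.proj = nn.Linear(state_dim, D_MODEL)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.proj(state).unsqueeze(1)       # (B, 1, D_MODEL)

def build_multimodal_prefix(text_emb, image_emb, state_emb):
    """Interleave text, image, and state embeddings into one prefix,
    mimicking a prompt like:
    "Given <image> and robot state <state>, what should the robot do?"
    """
    return torch.cat([text_emb, image_emb, state_emb], dim=1)

# Example usage with random tensors standing in for real inputs.
image = torch.randn(1, 3, 224, 224)
state = torch.randn(1, 7)
text_emb = torch.randn(1, 10, D_MODEL)  # embeddings of the text prompt tokens

prefix = build_multimodal_prefix(text_emb, ImageEncoder()(image), StateEncoder()(state))
print(prefix.shape)  # (1, 15, D_MODEL) -> fed to a decoder-only LM for completion
```

In PaLM-E itself the vision side is the ViT-22B model and the language backbone is PaLM-540B; the toy modules here only stand in for that interface.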
Trained on single-image prompts, PaLM-E can not only guide a robot through a variety of complex tasks but also generate language describing the images.
It is fair to say that PaLM-E shows unprecedented flexibility and adaptability, representing a major leap forward, especially in the field of human-computer interaction.
More importantly, the researchers show that joint training on different combinations of robot tasks and general vision-language tasks yields several ways of transferring from vision-language to embodied decision-making, allowing robots to use data efficiently when planning tasks.
In addition, PaLM-E stands out for its strong positive transfer ability.
PaLM-E trained across different domains, including Internet-scale general vision-language tasks, performs significantly better than single-task robot models.
In terms of model scale, the researchers observed a clear advantage: the larger the language model, the better it preserves its language abilities while being trained on vision-language and robot tasks. At 562 billion parameters, PaLM-E retains almost all of its language capabilities.
Although trained only on single-image prompts, PaLM-E shows outstanding ability on tasks such as multimodal chain-of-thought reasoning and multi-image reasoning.
PaLM-E also sets a new SOTA on the OK-VQA benchmark.
Evaluation results
In the experiments, the researchers show how PaLM-E performs planning and long-horizon tasks on two different robot embodiments.
It is worth noting that all of these results were obtained with the same model trained on the same data.
In the past, robots usually needed human assistance to complete long-horizon tasks. Now, PaLM-E can do it through autonomous learning.
An instruction such as "take the potato chips out of the drawer", for example, involves multiple planning steps as well as visual feedback from the robot's camera.
After end-to-end training, PaLM-E can plan for the robot directly from pixels. Because the model is integrated into a control loop, the robot remains robust to disturbances along the way while fetching the potato chips.
Human: give me some potato chips.
Robot: 1. Go to the drawer. 2. Open the top drawer. 3. Pick up the green potato chips from the top drawer. 4. Put them on the counter.
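As a rough illustration of that control loop, here is a minimal, hypothetical sketch (the `vlm_plan_next_step` and `execute` functions are placeholders, not a real API): re-planning from a fresh camera image after every executed step is what lets the robot recover from disturbances, such as the chips being moved mid-task.

```python
# Minimal sketch of a closed control loop (hypothetical API, not Google's):
# the vision-language model re-plans after every executed step, so visual
# feedback lets the robot recover from disturbances mid-task.
from typing import List

def vlm_plan_next_step(instruction: str, camera_image, history: List[str]) -> str:
    """Placeholder: would query the VLM with the image, instruction, and the
    steps executed so far, returning the next step as text (or 'done')."""
    raise NotImplementedError

def execute(step: str) -> None:
    """Placeholder: would translate the text step into low-level robot control."""
    raise NotImplementedError

def run_task(instruction: str, get_camera_image, max_steps: int = 20) -> List[str]:
    history: List[str] = []
    for _ in range(max_steps):
        image = get_camera_image()                     # fresh visual feedback
        step = vlm_plan_next_step(instruction, image, history)
        if step.strip().lower() == "done":
            break
        execute(step)                                  # e.g. "open the top drawer"
        history.append(step)
    return history

# run_task("Bring me the potato chips from the drawer", camera.read)
```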
Similarly, when the robot is asked to pick up an apple and is given the image, it carries out the task as follows:
Human: bring me an apple.
Robot: 1. Go to the apple. 2. Pick up the apple. 3. Bring it to the operator. 4. Put down the apple. 5. Done.
Besides performing long-horizon tasks, PaLM-E also enables robots to carry out planning tasks, such as arranging building blocks.
Based on visual and language input combined with long-horizon visual feedback, the researchers successfully carried out multi-stage planning, allowing the model to complete the long-horizon task of "sorting the building blocks into different corners by color".
As shown below, in this arrangement task the robot acts as a generalist, sorting the blocks by color.
In terms of the model's generalization, the robot controlled by PaLM-E can move the red building blocks next to the coffee cup.
It is worth mentioning that the dataset contains only three demonstrations involving a coffee cup, and none of them includes red building blocks.
Similarly, although the model had never seen a tortoise before, it can still smoothly push the green building block next to the tortoise.
In terms of zero-shot reasoning, PaLM-E can tell a joke about a given image and demonstrates capabilities including perception, vision-grounded dialogue, and planning.
PaLM-E is also clear about relationships between multiple images, such as where image 1 (left) is located within image 2 (right).
In addition, PaLM-E can perform arithmetic on images containing handwritten numbers.
For example, given a photo of a restaurant menu, PaLM-E can directly calculate how much two pizzas would cost.
It also handles tasks such as general question answering and captioning.
Finally, the results also show that a frozen language model is a viable path toward a general embodied multimodal model that fully retains its language capabilities.
At the same time, the researchers also found an alternative when unfreezing the model: scaling up the language model significantly reduces catastrophic forgetting.
Reference:
https://palm-e.github.io/
This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era).