When an ape learns to play Minecraft, is the method the same as the one Nvidia scientists use to train GPT-4 agents?


Shulou (Shulou.com) 11/24 report --

[New Zhiyuan Introduction] When an ape learns to play Minecraft, is the method the same as the one Nvidia scientists use to train GPT-4 agents?

Meet a player who is adept at Minecraft: collecting snacks and smashing blocks are this player's two specialties.

As soon as the camera turned, the player's true identity was revealed: an ape!

Yes, this is a non-human biological neural network experiment from the Ape Initiative. The protagonist of the experiment, Kanzi, is a 42-year-old bonobo.

After training, it learned a variety of skills, taking on villages, desert temples, Nether portals and other environments, and clearing level after level all the way to the destination.

AI experts noticed that the process by which the trainers taught Kanzi these skills resembles the way humans teach AI to play Minecraft: in-context reinforcement learning, RLHF, imitation learning, curriculum learning, and so on.

When an ape learns to play Minecraft

Kanzi is a bonobo from the Ape Initiative, one of the smartest apes in the world; it understands English and can use a touchscreen.

At the Ape Initiative, Kanzi has access to a variety of electronic touchscreens, which may have laid the groundwork for it to pick up Minecraft so quickly.

When people first showed Minecraft to Kanzi, it sat down in front of the screen, spotted the green arrow at once, and swiped its finger toward the target.

Within seconds of being shown the basic controls, Kanzi figured out how to move around in Minecraft. Later, it also learned to collect rewards.

Each time it collected a reward, it received snacks such as peanuts, grapes, and apples.

Kanzi's play grew more and more skilled. While collecting rewards, it learned to recognize obstacles that, like the target arrows, are green columns, and to walk around them.

Of course, Kanzi also ran into difficulties. It needed to use the break tool to smash large blocks, an operation it had never seen before.

Seeing Kanzi stuck, the humans stepped in to help, pointing at the tool button it needed. But even after watching, Kanzi still didn't understand.

So a human demonstrated by hand, smashing the wooden blocks with the tool. Kanzi looked thoughtful, and then, under everyone's expectant gaze, did the same: it tapped the button and smashed the wood. The room instantly burst into cheers.

Now Kanzi's skill tree held two abilities: collecting snacks and smashing blocks.

While teaching cave skills, the staff noticed that if Kanzi slipped off the block it was trying to break, it would simply walk away. So they designed a custom task for it --

smash wooden blocks inside a cave walled with diamond, to prove it had mastered both collecting and smashing.

Everything went well in the cave until Kanzi hit a snag: it got stuck in a corner. At that point, a human had to lend a helping hand.

Finally, Kanzi reached the bottom of the cave and broke the last wall.

The crowd burst into cheers and Kanzi gave the staff a happy high-five.

Next came the part where Kanzi fooled a human: the staff invited a human player to game with Kanzi. He, of course, had no idea who Kanzi was.

The staff wanted to see how long it would take the player to realize that the one playing with him was not human.

At first, the young man just thought the other player was moving incredibly slowly. When Kanzi's picture was shown to him, he jumped back in fright.

After learning to walk out of mazes, Kanzi played Minecraft with ever-growing confidence.

Whenever Kanzi collected a reward, people affirmed the behavior with cheers; when it failed, the trainers used applause and cheers to encourage it to keep playing.

By this time, it had learned to unlock the map of the underground labyrinth, to smash obstacles in its path, and to find the amethyst.

When Kanzi got stuck, it would step away for a break, bring back a stick, and set it down beside itself. Even after an unlucky death, Kanzi would tap the button to respawn.

The last level was a huge labyrinth full of forks.

Unable to find its way out of the maze, Kanzi grew restless, clutching its branch and screaming, or snapping it in anger.

In the end, it calmed itself down and kept working through the maze. Soon, applause and cheers surrounded Kanzi once more.

It seems Kanzi the bonobo has well and truly figured out Minecraft.

The similarities between teaching apes and teaching AI

Watching a bonobo skillfully play a video game can feel somewhat absurd and unreal.

Nvidia senior scientist Jim Fan commented on this--

Although Kanzi and its ancestors had never seen Minecraft in their lives, it quickly adapted to the textures and physics that Minecraft displays on an electronic screen.

These are completely different from the natural environments they have always lived in. This level of generalization goes far beyond the most powerful vision models to date.

The craft of training an animal to play Minecraft is essentially the same as the principles of training an AI; a toy sketch after the list below makes the mapping concrete:

- In-context reinforcement learning: whenever Kanzi reaches a marked milestone in the game, it gets a fruit or peanut, which motivates it to keep following the rules of the game.

- RLHF: Kanzi doesn't understand human language, but it can see the trainers cheering it on and respond. The trainers' cheers send Kanzi a strong signal that it is on the right path.

- Imitation learning: after a trainer demonstrated how to complete a task, Kanzi immediately grasped what the operation meant. Demonstration works far better than a reward-only strategy.

- Curriculum learning: the trainers taught Kanzi the controls step by step, starting from a very simple environment. In the end, Kanzi could traverse complex caves, labyrinths, and the Nether.
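
To make the mapping concrete, here is a toy Python sketch. It is invented for illustration and comes from neither the experiment nor any real training code: a tabular agent learns to walk to the end of a track, with demonstrations seeding its table (imitation), snacks as milestone rewards (in-context RL), cheers as a weak preference signal (RLHF), and a short track before a long one (curriculum).

import random

def train(stages=((5, 2), (9, 2))):
    """Curriculum learning: a short track first, then a longer one."""
    q = {}                                    # (position, action) -> learned value
    for length, demo_steps in stages:
        for step in range(demo_steps):        # imitation learning: seed the table
            q[(step, +1)] = 1.0               # with the demonstrated forward moves
        for _ in range(200):                  # repeated play on this stage
            pos = 0
            while pos < length:
                prev = pos
                # in-context RL: greedily pick the best-valued action, with noise
                a = max((+1, -1), key=lambda x: q.get((prev, x), 0.0) + 0.1 * random.random())
                pos = max(0, pos + a)
                snack = 1.0 if pos == length else 0.0   # fruit or peanut at the milestone
                cheer = 0.1 if pos > prev else 0.0      # RLHF-style: cheers for progress
                q[(prev, a)] = q.get((prev, a), 0.0) + snack + cheer
    return q

q = train()
print(q.get((0, +1), 0.0) > q.get((0, -1), 0.0))   # True: it learned to head for the goal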

Moreover, even under similar training techniques, an animal's visual system can recognize and adapt to a new environment in a very short time, while an AI vision model takes far more time and training cost, and may still fail to achieve the desired result.

We are once again staring into the abyss of Moravec's paradox:

Artificial intelligence is the inverse of human ability. It performs poorly at the low-level intelligence we consider effortless and instinctive, such as perception and motor control; yet in high-level activities that demand reasoning and abstraction, such as logical inference and language understanding, it can readily surpass humans.

This corresponds to the results of this experiment:

Our best AI (GPT-4) is close to human level at understanding language, but lags far behind animals in perception and recognition.

Netizens: so apes get gamer rage too

Both Kanzi and LLMs can play Minecraft, but there is a big difference between how Kanzi learns and how LLMs learn, and that difference deserves attention.

Faced with Kanzi's impressive learning ability, netizens began cracking jokes.

Some predicted that in six years' time the world would descend into a Planet of the Apes-style ape war.

Or that apes would be drinking cola and blending into human society.

Even Elon Musk caught a stray: he was memed into a "monkey version" of Musk.

Others said Kanzi is the first non-human to experience gamer rage, and it seems content with the title.

"if Kanzi had its own game channel, I would watch it honestly. "

"when it comes to playing games, humans are not much different from bonobos. We are all motivated by the reward to perform certain tasks and achieve the goals. the only difference is the actual content of the reward. "

"in my World, Kanzi's reward for mining diamonds is more immediate and primitive, while our reward for mining diamonds is more delayed and game-related. Anyway, it's a little crazy. "

First GPT learned to play Minecraft, and now bonobos can play it too, which makes people look forward to the future of Neuralink.

Jim Fan teaches AI agents to play Minecraft

In teaching AI to play Minecraft, humans have already accumulated plenty of advanced experience.

As early as May this year, Jim Fan's team at Nvidia hooked an AI agent up to GPT-4 and created a new agent, Voyager.

Voyager not only outperforms AutoGPT, it also achieves lifelong learning across the whole game! It writes its own code to play Minecraft, with no human intervention at all.

It is fair to say that with the arrival of Voyager, we are one step closer to artificial general intelligence (AGI).

A real digital life connected to GPT-4, Voyager needs no human supervision at all. Entirely self-taught, it not only mastered basic survival skills such as mining, building, gathering and hunting, but also learned to explore on its own.

Driven by itself, it keeps expanding its items and equipment, outfitting itself with armor of different tiers, shields to block damage, and fences to pen in animals.

The emergence of large language models opens a new possibility for building embodied agents, because LLM-based agents can draw on the world knowledge contained in the pretrained model to generate consistent action plans and executable policies.

Jim Fan: we had this idea before BabyAGI / AutoGPT, and we spent a lot of time finding the best gradient-free architecture. Bringing GPT-4 into the agent opened up a whole new paradigm ("training" by code execution rather than gradient descent) and freed lifelong learning from its usual defects.

OpenAI scientist Karpathy also praised this "gradient-free architecture" for higher-level skills. In it, the LLM acts as the equivalent of a prefrontal cortex, generating code that drives the lower-level Mineflayer API.

Three key components

To make Voyager an effective lifelong learning agent, the team from Nvidia, Caltech and other institutions proposed three key components (a sketch of how they fit together follows the list):

1. An iterative prompting mechanism that combines game feedback, execution errors, and self-verification to improve programs

2. A skill library for storing and retrieving complex behaviors

3. An automatic curriculum that maximizes the agent's exploration
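
Before the article walks through each component in turn, here is a minimal, self-contained Python sketch of how the three might fit together in a single agent step. Every name in it (fake_llm, ToyEnv, agent_step, MAX_RETRIES) is a hypothetical stand-in invented for illustration, not the actual Voyager code.

MAX_RETRIES = 3

def fake_llm(prompt: str) -> str:
    # Stand-in for GPT-4; a real system would call the model API here.
    return "bot.dig(targetBlock)"

class ToyEnv:
    # Stand-in for the Minecraft environment reached through Mineflayer.
    def __init__(self):
        self.state = "spawn, empty inventory"
    def execute(self, program: str) -> dict:
        self.state = "task complete"
        return {"error": None}

def agent_step(llm, skills: dict, env) -> str:
    task = llm(f"Propose the next task given: {env.state}")        # 3. automatic curriculum
    context = list(skills.values())[:5]                            # 2. retrieve stored skills
    program = llm(f"Write code for '{task}' using {context}")      # 1. iterative prompting
    for _ in range(MAX_RETRIES):
        result = env.execute(program)
        verdict = llm(f"Did '{task}' succeed given: {env.state}?")   # self-verification
        if result["error"] is None and verdict:
            skills[task] = program                                 # 2. grow the skill library
            return f"learned: {task}"
        program = llm(f"Fix the program; error was: {result['error']}")  # feed errors back
    return "gave up"

print(agent_step(fake_llm, {}, ToyEnv()))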

First, Voyager tries to write a program that achieves a specific goal using a popular Minecraft JavaScript API (Mineflayer).

Feedback from the game environment and JavaScript execution errors, if any, help GPT-4 improve the program.

Left: environment feedback. GPT-4 realizes it needs two more planks before it can craft sticks.

Right: execution error. GPT-4 realizes it should craft a wooden axe, not an acacia axe, because there is no acacia axe in Minecraft.
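
The refinement step can be pictured as nothing more than prompt construction: fold the execution error and the environment feedback back into the next request. The build_refine_prompt helper and its wording below are assumptions made for illustration; only the two example strings echo the figure captions above.

def build_refine_prompt(task: str, program: str,
                        exec_error: str | None, env_feedback: str | None) -> str:
    parts = [f"Task: {task}", "Previous program:", program]
    if exec_error:
        parts += ["JavaScript execution error:", exec_error]
    if env_feedback:
        parts += ["Game environment feedback:", env_feedback]
    parts.append("Rewrite the program so that it fixes these issues.")
    return "\n".join(parts)

print(build_refine_prompt(
    "craft a wooden axe",
    "craftItem('acacia_axe', 1)",
    exec_error="no item named 'acacia_axe' exists in Minecraft",
    env_feedback="you need 2 more planks before you can craft sticks"))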

Self-verification

By feeding GPT-4 the agent's current state and the task at hand, Voyager asks it to judge whether the task has been completed.

Moreover, if the task fails, GPT-4 offers a critique and suggests how to complete it.
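
Self-verification can be sketched as a critic call: hand the model the task plus the agent's state and ask for a success flag and a critique. The JSON protocol and the self_verify helper below are assumptions for illustration, not Voyager's actual interface.

import json

def self_verify(llm, task: str, agent_state: dict) -> tuple:
    prompt = (
        "You judge whether a Minecraft task is complete.\n"
        f"Task: {task}\n"
        f"Agent state: {json.dumps(agent_state)}\n"
        'Answer as JSON: {"success": true/false, "critique": "..."}'
    )
    reply = json.loads(llm(prompt))
    # On failure, the critique feeds the next round of code generation.
    return reply["success"], reply["critique"]

# Stubbed model reply, for illustration only.
stub = lambda p: '{"success": false, "critique": "no logs in inventory; chop a tree first"}'
print(self_verify(stub, "craft sticks", {"inventory": {}}))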

Secondly, Voyager gradually builds a skill library by storing successful programs in a vector database. Each program can be retrieved via the embedding of its docstring.

Complex skills are composed from simpler ones, which lets Voyager's abilities compound rapidly over time and mitigates catastrophic forgetting.

Top: adding a skill. Each skill is indexed by the embedding of its description, so it can be retrieved in similar situations in the future.

Bottom: retrieving a skill. When faced with a new task from the automatic curriculum, the task is embedded as a query and the top five relevant skills are returned.
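
A toy version of such a skill library: index each program by its description and retrieve by similarity. Voyager embeds docstrings with a real model inside a vector database; the word-overlap "embedding" below is only a stand-in, and the SkillLibrary class is invented for illustration.

class SkillLibrary:
    """Toy library: Jaccard word overlap stands in for dense-vector similarity."""
    def __init__(self):
        self.skills = []                          # (description, program, word set)

    @staticmethod
    def embed(text: str) -> set:
        return set(text.lower().split())          # stand-in for a real embedding model

    def add(self, description: str, program: str):
        self.skills.append((description, program, self.embed(description)))

    def retrieve(self, query: str, k: int = 5):
        qs = self.embed(query)
        score = lambda words: len(qs & words) / max(1, len(qs | words))
        ranked = sorted(self.skills, key=lambda s: score(s[2]), reverse=True)
        return [(d, p) for d, p, _ in ranked[:k]]

lib = SkillLibrary()
lib.add("chop a tree and collect logs", "bot.dig(log)")
lib.add("craft a wooden pickaxe", "bot.craft(wooden_pickaxe)")
print(lib.retrieve("collect logs from a tree", k=1))   # returns the tree-chopping skill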

Thirdly, the automatic curriculum proposes suitable exploration tasks based on the agent's current skill level and the state of the world.

For example, if the agent finds itself in a desert rather than a forest, it learns to collect sand and cactus before iron. The curriculum is generated by GPT-4 under its overarching goal of "discovering as many diverse things as possible."
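
The curriculum, too, reduces to a prompt: give the model the agent's biome, inventory, and task history under the standing goal, and ask for one next task. The next_task helper and its wording are illustrative assumptions; the stub echoes the article's desert example.

def next_task(llm, state: dict, completed: list, failed: list) -> str:
    prompt = (
        "Overall goal: discover as many diverse things as possible.\n"
        f"Biome: {state['biome']}; inventory: {state['inventory']}\n"
        f"Completed tasks: {completed}\n"
        f"Failed tasks: {failed}\n"
        "Propose ONE next task, slightly harder than what is already completed."
    )
    return llm(prompt)

# Stubbed model, echoing the desert example above.
stub = lambda p: "collect 3 cactus" if "desert" in p else "mine 1 iron ore"
print(next_task(stub, {"biome": "desert", "inventory": {}}, ["collect sand"], []))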

As the first LLM-driven embodied agent capable of lifelong learning, Voyager, and the similarity between its training process and the ape's, can teach us a great deal.
