

HuggingGPT's online demo makes a stunning debut, and netizens' test results are dazzling

2025-04-05 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)11/24 Report--

After HuggingGPT, launched by Zhejiang University and Microsoft, became popular, its demo has just opened, and eager netizens have started trying it for themselves.

The powerhouse combination HuggingFace + ChatGPT = "Jarvis" now has an open demo.

Some time ago, Zhejiang University and Microsoft released HuggingGPT, a large-model collaboration system.

The researchers propose using ChatGPT as a controller that connects the various AI models in the HuggingFace community to complete complex multimodal tasks.

All you need to do in the whole process is state your requirements in natural language.

An Nvidia scientist called it "the most interesting paper I have read this week," saying its idea is very close to what he has called the "Everything App": everything is an app, with information read directly by AI.

HuggingGPT now offers a Gradio demo for hands-on use.

Project address: https://github.com/microsoft/JARVIS

Netizens wasted no time in trying it out. The first test: "How many people are in this image?"

From its inference, HuggingGPT concluded that two people were walking on the street in the picture.

The specific process is as follows:

First, the image is described using the image-to-text model nlpconnect/vit-gpt2-image-captioning, producing the caption "two women walking on a street with a train".

Then the object-detection model facebook/detr-resnet-50 is used to count the people in the picture. It detected seven objects, two of which were people.

Next, the visual question-answering model dandelin/vilt-b32-finetuned-vqa is used to obtain the answer. Finally, the system provides a detailed response, along with the model information, to answer the question.
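The three-stage chain described above can be sketched as follows. This is a minimal illustration, not the project's real code: the stub functions stand in for calls to the actual HuggingFace models, and their return values are the outputs the demo reported.

```python
# Sketch of the model chain HuggingGPT ran for "how many people are in
# the image?". The three stubs stand in for real inference calls to
# nlpconnect/vit-gpt2-image-captioning, facebook/detr-resnet-50, and
# dandelin/vilt-b32-finetuned-vqa.

def caption_image(image_path: str) -> str:
    # Placeholder for the image-to-text model.
    return "two women walking on a street with a train"

def detect_objects(image_path: str) -> list[dict]:
    # Placeholder for the object-detection model; returns label + score.
    # Truncated for brevity; the demo reported seven objects in total.
    return [{"label": "person", "score": 0.99},
            {"label": "person", "score": 0.98},
            {"label": "train", "score": 0.97}]

def visual_qa(image_path: str, question: str) -> str:
    # Placeholder for the visual question-answering model.
    return "2"

def answer(image_path: str, question: str) -> str:
    # Chain the three experts, then summarize for the user.
    caption = caption_image(image_path)
    people = [d for d in detect_objects(image_path) if d["label"] == "person"]
    vqa = visual_qa(image_path, question)
    return (f"The image shows {caption}. Object detection found "
            f"{len(people)} people, and the VQA model answered '{vqa}'.")

print(answer("street.jpg", "How many people are in the image?"))
```

In the real system, the final summary is produced by the LLM from the execution logs rather than by a fixed template.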

In addition, netizens asked it to classify the sentiment of the sentence "I love you" and translate it into Tamil.

HuggingGPT invoked the following models:

First, the model dslim/bert-base-NER is used to classify the sentiment of the text "I love you", which comes out as "romantic".

Then ChatGPT is used to translate the text into Tamil, yielding "Nan unnai kadalikiren".

No image, audio, or video files were generated in this inference result.

When transcribing an MP3 file, HuggingGPT failed. The netizen said, "I'm not sure if this is a problem with my input file."

Let's take a look at the ability to generate images.

The request: overlay the words "I LOVE YOU" on a "dancing cat" image.

HuggingGPT first uses the runwayml/stable-diffusion-v1-5 model to generate a picture of a "dancing cat" from the given text.

Then, using the same model, an image of the words "I LOVE YOU" is generated from the given text.

Finally, the two images are merged to produce the output below:

Only a few days after the project was made public, Jarvis had already earned 12.5k stars and 811 forks on GitHub.

The researchers point out that solving the current problems of large language models (LLMs) may be the first and critical step towards AGI.

Because large language model technology still has defects, some urgent challenges stand in the way of building AGI systems.

To handle complex artificial intelligence tasks, LLMs should be able to coordinate with external models to take advantage of their capabilities.

Therefore, the key point is how to choose the appropriate middleware to bridge LLMs and AI models.

In the paper, the researchers propose language as a universal interface. HuggingGPT's workflow is divided into four steps:

Paper address: https://arxiv.org/pdf/2303.17580.pdf

First is task planning: ChatGPT parses the user request, divides it into multiple tasks, and plans the task order and dependencies based on its knowledge.

Then comes model selection: the LLM assigns each parsed task to an expert model according to the model descriptions from HuggingFace.

Next is task execution: the expert models run their assigned tasks on inference endpoints, and the execution information and inference results are recorded for the LLM.

Finally, response generation: the LLM summarizes the execution logs and inference results and returns the summary to the user.
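The four steps above can be sketched as a simple controller loop. This is an illustration under stated assumptions, not the project's real API: the LLM calls and the expert-model registry are stubbed, and all function names are hypothetical.

```python
# Minimal sketch of HuggingGPT's four-stage workflow with stubbed
# LLM and expert-model calls. Function names are illustrative only.

MODEL_DESCRIPTIONS = {
    "image-caption": "nlpconnect/vit-gpt2-image-captioning",
    "visual-question-answering": "dandelin/vilt-b32-finetuned-vqa",
}

def plan_tasks(request: str) -> list[dict]:
    # Stage 1, task planning: the LLM parses the request into tasks
    # with an order and dependencies. Stubbed with a fixed plan.
    return [{"task": "image-caption", "dep": []},
            {"task": "visual-question-answering", "dep": ["image-caption"]}]

def select_model(task: dict) -> str:
    # Stage 2, model selection: the LLM picks an expert model based
    # on its description. Stubbed as a lookup table.
    return MODEL_DESCRIPTIONS[task["task"]]

def execute(model: str, task: dict) -> str:
    # Stage 3, task execution: run the expert model on an inference
    # endpoint and record the result. Stubbed.
    return f"result of {model} for {task['task']}"

def generate_response(request: str, results: list[str]) -> str:
    # Stage 4, response generation: the LLM summarizes logs and
    # results for the user. Stubbed as a simple join.
    return f"Request: {request}\n" + "\n".join(results)

def hugging_gpt(request: str) -> str:
    results = [execute(select_model(t), t) for t in plan_tasks(request)]
    return generate_response(request, results)

print(hugging_gpt("How many people are in the image?"))
```

In the real system, each stage is driven by prompting ChatGPT, and task execution respects the dependency graph so that one model's output can feed another's input.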

If you make a request like this:

Please generate a picture of a girl reading a book, in the same pose as the boy in example.jpg. Then please describe the new picture with your voice.

You can see how HuggingGPT breaks it down into six subtasks and selects a model for each, arriving at the final result.

By including descriptions of the AI models in the prompt, ChatGPT can be seen as the brain that manages them. This approach allows ChatGPT to invoke external models to solve real tasks.

To put it simply, HuggingGPT is a collaborative system, not a large model.

Its role is to connect ChatGPT and HuggingFace, handle inputs of different modalities, and solve many complex AI tasks.

To that end, each AI model in the HuggingFace community has a corresponding model description in the HuggingGPT library, which is incorporated into the prompt to establish the connection to ChatGPT.
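Injecting model descriptions into the prompt might look like the sketch below. The template wording, the model cards, and the function name are all illustrative assumptions, not the project's actual prompt.

```python
# Hypothetical sketch of building a model-selection prompt by
# embedding model descriptions, so the LLM can pick an expert model.

MODEL_CARDS = [
    {"id": "facebook/detr-resnet-50",
     "description": "object detection: finds and labels objects in an image"},
    {"id": "nlpconnect/vit-gpt2-image-captioning",
     "description": "image-to-text: produces a caption for an image"},
]

def build_selection_prompt(task: str) -> str:
    # List every candidate model with its description, then ask the
    # LLM to choose one by id.
    listing = "\n".join(f"- {m['id']}: {m['description']}" for m in MODEL_CARDS)
    return ("You are a controller that assigns tasks to expert models.\n"
            f"Available models:\n{listing}\n"
            f"Task: {task}\n"
            "Answer with the id of the best model.")

print(build_selection_prompt("count the people in street.jpg"))
```

The prompt text would then be sent to ChatGPT, whose reply names the expert model to run.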

HuggingGPT then uses ChatGPT as the brain to determine the answer to the question.

So far, HuggingGPT has integrated hundreds of HuggingFace models around ChatGPT, covering 24 tasks such as text classification, object detection, semantic segmentation, image generation, question answering, text-to-speech, and text-to-video.

Experimental results show that HuggingGPT performs well on complex tasks across modalities.

In hot comments, netizens noted that HuggingGPT is similar to Microsoft's earlier Visual ChatGPT; it seems the original idea has been extended to a large set of pretrained models.

Visual ChatGPT is built directly on ChatGPT, with many visual foundation models (VFMs) injected into it, and that paper proposes a Prompt Manager.

With the help of the Prompt Manager, ChatGPT can leverage these VFMs and receive their feedback iteratively until the user's requirements are met or an end condition is reached.

Some netizens believe the idea is very similar to ChatGPT plug-ins: semantic understanding and task planning centered on an LLM can greatly expand the LLM's capability boundary. By combining an LLM with other functions or domain experts, we can create more powerful and flexible AI systems that adapt better to a variety of tasks and needs.

"That's what I've always imagined AGI to be: an AI model that understands complex tasks and then assigns smaller tasks to other, more specialized AI models."

"Like the brain, which has different regions for accomplishing specific tasks. It sounds logical."

Reference:

https://twitter.com/1littlecoder/status/1644466883813408768

https://www.youtube.com/watch?v=3_5FRLYS-2A

https://huggingface.co/spaces/microsoft/HuggingGPT

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).
