GPT-4V learned to use a keyboard and mouse to surf the Internet, and humans watched it post and play games



Shulou(Shulou.com)11/24 Report--

Thanks to CTOnews.com reader Alejandro86 for the tip! The day has finally come: GPT-4V has learned to operate a computer on its own.

Give GPT-4V control of a mouse and keyboard, and it can browse the web by reading the browser interface:

It can even quickly find a music player site, locate the "play music" button, and put on a song for itself:

A little unsettling when you think about it, isn't it?

This is GPT-4V-Act, a new project created by an MIT undergraduate.

With a few simple tools, GPT-4V can learn to control your keyboard and mouse, post online, shop and even play games in a browser.

If one of the tools it relies on hits a bug, GPT-4V can even recognize the problem and try to work around it.

Let's see how this is done.

Teaching GPT-4V to "surf the Internet automatically"

GPT-4V-Act is essentially a multimodal AI assistant built on a web browser (a "Chromium Copilot").

Like a human, it works through a mouse, keyboard and screen: it "looks at" the web page and takes its next step through the page's interactive elements.

To achieve this effect, three tools are used in addition to GPT-4V.

The first is a user interface, which lets GPT-4V "see" screenshots of web pages and lets users interact with GPT-4V.

This way, GPT-4V shows its reasoning for each step in a dialog box, and the user decides whether to let it continue.

The second is Set-of-Mark Prompting (SoM), the technique that lets GPT-4V interact with what it sees.

It was developed by Microsoft to prompt GPT-4V more effectively.

Instead of having GPT-4V simply describe an image, it splits the image into regions around its key details and numbers them, giving GPT-4V concrete targets to refer to:

The same applies to web pages: Set-of-Mark Prompting tells GPT-4V which part of the page to look in for the answer and which part to interact with.
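The numbering idea can be sketched in a few lines of Python. This is a minimal illustration, not Microsoft's SoM implementation: the real tool segments the image and draws the marks onto it, while this sketch only assigns numbers to regions that are already known and lists them in a text prompt. The `Region` class, `assign_marks`, `build_prompt` and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A rectangular area of a screenshot, in pixels (hypothetical structure)."""
    name: str
    x: int
    y: int
    width: int
    height: int


def assign_marks(regions):
    """Number regions in reading order: top to bottom, then left to right."""
    ordered = sorted(regions, key=lambda r: (r.y, r.x))
    return {mark: region for mark, region in enumerate(ordered, start=1)}


def build_prompt(task, marks):
    """List the numbered regions so the model can answer with a mark number."""
    lines = [f"Task: {task}", "The screenshot is annotated with these numbered marks:"]
    for mark, r in marks.items():
        lines.append(f"[{mark}] {r.name} at ({r.x}, {r.y}), size {r.width}x{r.height}")
    lines.append("Reply with the number of the mark to interact with.")
    return "\n".join(lines)


# Made-up regions standing in for the output of a segmentation step.
regions = [
    Region("Play button", 120, 300, 40, 40),
    Region("Search box", 200, 20, 300, 30),
    Region("Logo", 10, 20, 80, 30),
]
marks = assign_marks(regions)
prompt = build_prompt("play music", marks)
print(prompt)
```

Because the model only has to name a number, its answer can be mapped back to an exact region without asking it to produce pixel coordinates.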

The third is an automatic annotator (a JS DOM auto-labeler), which marks every interactive element on the page so that GPT-4V can decide which one to press.
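A rough sketch of what such an auto-labeler does is shown here. The project's own labeler is JavaScript running against the live DOM; the Python version uses the standard-library `html.parser` on a static HTML string instead, and the tag list, the hidden-input rule and the names `InteractiveLabeler` and `label_page` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags treated as interactive in this sketch (an assumed, incomplete list).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveLabeler(HTMLParser):
    """Collect interactive elements and give each one a numeric label."""

    def __init__(self):
        super().__init__()
        self.labels = {}

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_map = dict(attrs)
        if attr_map.get("type") == "hidden":
            return  # hidden inputs are not visible, so they get no label
        label = len(self.labels) + 1
        self.labels[label] = {"tag": tag, "attrs": attr_map}


def label_page(html):
    """Return {label number: element info} for one HTML document."""
    parser = InteractiveLabeler()
    parser.feed(html)
    parser.close()
    return parser.labels


page = """
<div>
  <a href="/home">Home</a>
  <input type="hidden" name="token" value="x">
  <input type="text" name="q">
  <button id="play">Play Music</button>
  <p>Not interactive</p>
</div>
"""
labels = label_page(page)
for number, element in labels.items():
    print(number, element["tag"], element["attrs"])
```

A real labeler would also need each element's on-screen position in order to draw the numbers onto the screenshot, which a static parse like this cannot provide.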

With this pipeline in place, GPT-4V can not only judge accurately which content in the screenshot meets the need, but also find the right interactive buttons and learn to "surf the Internet automatically".
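One step of such a pipeline might look like the following sketch. The JSON reply format (`action`, `label`, `text`), the function names and the `confirm` callback (standing in for the user's decision to continue or stop in the dialog box) are assumptions, not the project's actual protocol, and nothing here talks to a browser or to GPT-4V.

```python
import json


def parse_action(reply):
    """Parse a model reply written in a hypothetical JSON action format."""
    action = json.loads(reply)
    if action.get("action") not in {"click", "type"}:
        raise ValueError(f"unsupported action: {action.get('action')!r}")
    if action["action"] == "type" and "text" not in action:
        raise ValueError("a 'type' action needs a 'text' field")
    return action


def run_step(reply, labels, confirm):
    """Validate one proposed step, ask the user, and report what would be done."""
    action = parse_action(reply)
    label = action.get("label")
    if label not in labels:
        raise ValueError(f"unknown label: {label!r}")
    target = labels[label]
    if action["action"] == "click":
        description = f"click {target}"
    else:
        description = f"type {action['text']!r} into {target}"
    if not confirm(description):
        return "skipped: " + description
    # A real agent would send the mouse or keyboard event here.
    return "done: " + description


# Labels as an auto-labeler might produce them (made-up values).
labels = {1: "search box", 2: "play button"}
print(run_step('{"action": "click", "label": 2}', labels, confirm=lambda d: True))
print(run_step('{"action": "type", "label": 1, "text": "jazz"}', labels,
               confirm=lambda d: False))
```

Rejecting unknown labels and unsupported actions before anything is executed is one simple way to keep a confused model reply from turning into a stray click.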

This is a big project, and so far only some of the features have been implemented, including clicking, typing and automatic labeling.

Features still to come include AI-based labeling (for now, interactive elements are located through the JS interface rather than recognized by AI) and prompting the user for details.

The author also notes a few caveats about using GPT-4V-Act at this stage.

For one, GPT-4V-Act can be "confused" by a flood of pop-up ads when a page opens, which leads to interaction bugs.

For another, this way of using the model may currently violate OpenAI's usage rules:

Except as permitted through the API, you may not use any automated or programmatic method to extract data or output from the Services, including scraping, web harvesting, or web data extraction.

So keep a low profile when using it.

The author of Microsoft SoM came to watch too

After the project was posted online, it attracted a lot of onlookers.

Among them was an author of Microsoft's Set-of-Mark Prompting, the technique the project uses, who came across it and commented:

Excellent work!

Some netizens mentioned that it could even be used to let AI read CAPTCHAs on its own.

As the SoM project notes, GPT-4V can successfully interpret CAPTCHAs (so in the future you may not be able to tell whether a person or a machine is on the other end).

Meanwhile, some netizens are already imagining using it for desktop automation.

In response, the author said:

The AI auto-annotator should be able to do this, and I'm really planning to make a more generic Copilot.

But GPT-4V still costs money for now. Is there any other way to implement this?

The author says there is not one yet, but that open-source models such as Fuyu-8B or LLaVAR could be tried.

A free AI assistant for desktop automation is something to look forward to.

Reference links:

[1] https://github.com/ddupont808/GPT-4V-Act

[2] https://www.reddit.com/r/MachineLearning/comments/17cy0j7/d_p_web_browsing_uibased_ai_agent_gpt4vact/

This article comes from the WeChat official account Quantum Bit (ID: QbitAI), author: Xiao Xiao
