
Knows the iPhone better than Siri! GPT-4V can "operate" the phone to complete arbitrary instructions without training.


GPT-4V is the beginning of the end of Siri.

A study found that:

Without any training, GPT-4V can interact with a smartphone directly, just like a human, to complete a variety of specified commands.

For example, ask it to buy a milk frother within a budget of $50 to $100.

It can pick out the shopping app (Amazon) and open it, tap the search bar and type "milk frother", find the filter function to set the budget range, then tap an item and complete the order, 9 operations in total, step by step.

According to the test, GPT-4V has a 75% success rate in accomplishing similar tasks on iPhone.

As a result, some lament that with GPT-4V around, Siri will gradually have no chance to show off its talents (it knows the iPhone better than Siri does).

To which someone promptly waved a dismissive hand:

Siri was never that strong in the first place. (half-joking)

Others exclaimed after reading it:

The era of intelligent voice interaction has begun; our phones may become pure display devices.

Is it really that impressive?

The study of GPT-4V operating the iPhone zero-shot comes from the University of California San Diego, Microsoft, and other institutions.

The authors developed MM-Navigator, a GPT-4V-based agent for navigation tasks on smartphone user interfaces.

At each time step of the experiment, MM-Navigator receives a screenshot of the current screen.

As a multimodal model, GPT-4V accepts images and text as input and produces text output.

Here, it reads the screenshot information step by step and outputs the action to be performed.
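To make that loop concrete, here is a minimal, hypothetical sketch of one such step using the public OpenAI Python SDK, not the authors' code: the current screenshot and the instruction are sent to a GPT-4V-capable model, which replies with a plain-text description of the next action. The model name, prompt wording, and file names are assumptions for illustration.

```python
# A minimal sketch of a single MM-Navigator-style step (not the authors' code):
# send the current screenshot plus the user instruction to the model and get
# back a plain-text description of the next action.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def next_action(screenshot_path: str, instruction: str) -> str:
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any GPT-4V-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Instruction: {instruction}\n"
                         "Describe the next action to take on this screen."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: print(next_action("home_screen.png", "Open Safari and search for news"))
```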

The question now is:

How can the model work out the exact coordinates that should be tapped on a given screen (GPT-4V can only give an approximate location)?

The authors' solution is simple: use an OCR tool and IconNet to detect the UI elements on each given screen and mark each of them with a distinct number.

This way, faced with the screenshot, GPT-4V only needs to say which number to tap.
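As a rough illustration of this numbering idea (Set-of-Mark-style prompting), the sketch below uses pytesseract's OCR boxes as a stand-in for the paper's OCR + IconNet detectors, which are not available as a public package. It draws a numeric tag on each detected element and keeps a map from tag number to box center, so a number returned by GPT-4V can be converted into tap coordinates. The thresholds and file names are illustrative assumptions, not the authors' implementation.

```python
# Rough sketch: number the UI elements on a screenshot and remember where
# each number points, so "tap 5" can be turned into pixel coordinates.
import pytesseract
from PIL import Image, ImageDraw

def tag_ui_elements(screenshot_path: str):
    img = Image.open(screenshot_path).convert("RGB")
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

    draw = ImageDraw.Draw(img)
    tag_to_center = {}
    tag = 1
    for i, text in enumerate(data["text"]):
        if not text.strip() or float(data["conf"][i]) < 60:
            continue  # skip empty or low-confidence detections
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x, max(0, y - 12)), str(tag), fill="red")
        tag_to_center[tag] = (x + w // 2, y + h // 2)
        tag += 1

    img.save("annotated.png")  # the numbered screenshot shown to GPT-4V
    return tag_to_center       # e.g. {5: (312, 980), ...} -> tap location for tag 5
```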

Two capability tests were first carried out on the iPhone.

Successfully operating a phone requires different kinds of screen understanding from GPT-4V:

One is semantic reasoning, which means understanding the screen input and articulating the actions required to complete a given instruction.

The other is the ability to indicate exactly where each action should be performed (that is, which number to tap).

The authors therefore designed two sets of tests to separate these abilities.

1. Intended action description

Output only what should be done, not the specific location.

In this task, GPT-4V understands the instructions and gives steps with an accuracy of 90.9%.

For example, in the Safari screenshot below, the user wants to open a new tab, but the "+" button in the lower-left corner is grayed out. What should they do?

GPT-4V replied:

Normally this would work, but judging from the screenshot, you seem to have reached the 500-tab limit. To open a new one, you need to close some of the existing tabs and then check whether the "+" button can be tapped.

It reads the image and grasps the situation very well. More examples can be found in the paper.

2. Localized action execution

When GPT-4V was asked to translate these "words on paper" into concrete actions (that is, the second test task), its accuracy dropped to 74.5%.

Still in the example above, it can follow its own instructions and give the correct operation number, such as tapping number 9 to close a tab.

But as the figure below shows, when asked to find an app that can identify buildings, it correctly points to using ChatGPT, yet gives the wrong number "15" (it should be "5").

Some errors also occur because the corresponding position is simply not marked in the screenshot.

For example, when asked to turn on stealth mode from the image below, it directly gives "11", the position of the Wi-Fi toggle, which is completely off the mark.

Besides these simple single-step tasks, the tests also found that GPT-4V can handle complex instructions such as "buy a milk frother" without training.

Throughout the process, GPT-4V lists in detail what to do at each step, along with the number of the corresponding on-screen element.
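For multi-step tasks like this, a driver loop has to re-capture the screen after each action and feed the history of previous actions back to the model. The hypothetical loop below sketches that idea using the tagging helper from the earlier snippet plus placeholder capture_screenshot, ask_gpt4v, and tap functions; none of this is the paper's actual code.

```python
# Hypothetical driver loop for a multi-step task such as "buy a milk frother".
# capture_screenshot(), ask_gpt4v() and tap() are placeholders, not real APIs
# from the paper; the loop feeds the tagged screenshot plus the history of
# previous actions back to the model until it declares the task done.
def run_task(instruction: str, max_steps: int = 15) -> None:
    history = []  # textual log of actions already taken
    for step in range(max_steps):
        shot = capture_screenshot()            # placeholder: grab the current screen, returns a path
        tag_to_center = tag_ui_elements(shot)  # number the UI elements (see sketch above)
        reply = ask_gpt4v(                     # placeholder: GPT-4V call with the annotated image
            image="annotated.png",
            prompt=(
                f"Task: {instruction}\n"
                f"Actions so far: {history}\n"
                "Reply with the tag number to tap next, or DONE if finished."
            ),
        )
        if "DONE" in reply:
            break
        tag = int("".join(ch for ch in reply if ch.isdigit()))  # crude parse of the tag number
        tap(*tag_to_center[tag])               # placeholder: send the tap to the device
        history.append(f"step {step + 1}: tapped tag {tag}")
```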

Finally, there is the test on Android.

Overall, GPT-4V performs significantly better than other models such as Llama 2, PaLM 2, and ChatGPT.

Its overall accuracy on installation, shopping, and other tasks is 52.96%, while the best of these baseline models reaches only 39.6%.

The greatest significance of the whole experiment is that it shows multimodal models such as GPT-4V can transfer their capabilities directly to unseen scenarios, demonstrating great potential for phone-based interaction.

It is worth mentioning that after reading this study, netizens also put forward two points:

One is how to define whether a task counts as successful.

For example, if we ask it to buy a hand sanitizer refill and want only one bag, but it buys six, does that count as a success?

The other is that we shouldn't get too excited too soon: if this technology is to be truly commercialized, there is still a lot of room for improvement.

After all, even Siri, with an accuracy of 95%, is frequently complained about for being poor.

About the team: this study has 12 authors in total, most of them from Microsoft.

There are two co-first authors.

They are An Yan, a PhD student at the University of California San Diego, and Zhengyuan Yang, a senior researcher at Microsoft who graduated from the University of Science and Technology of China and the University of Rochester.

Reference link:

[1] https://arxiv.org/abs/2311.07562

[2] https://x.com/emollick/status/1724272391595995329?s=20

This article is from the WeChat official account Quantum Bit (ID: QbitAI); author: Fengcai.
