
Jailbreak any large model in under 20 steps! More "grandma loopholes" found automatically.

2025-03-28 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 11/24 report --

In under a minute and fewer than 20 steps, you can "jailbreak" any large model and bypass its safety restrictions!

And you don't need to know anything about the model's internals --

it only takes two black-box models interacting with each other for one AI to automatically break another AI and get it to say dangerous things.

The once-popular "grandma loophole" has reportedly been fixed:

So how should AI cope now that there are "detective loopholes", "adventurer loopholes" and "writer loopholes"?

Under this wave of attacks, GPT-4 couldn't hold out either, and came right out with things like how to poison a water supply system.

Crucially, this is just a small sample of the vulnerabilities exposed by the University of Pennsylvania research team: with their newly developed algorithm, AI can automatically generate all kinds of attack prompts.

The researchers say the method is five orders of magnitude more efficient than existing token-based attacks such as GCG, and the attacks it generates are highly interpretable, easy for anyone to understand, and transferable to other models.

Open source or closed source, none of them escape: GPT-3.5, GPT-4, Vicuna (a Llama 2 variant), PaLM-2, and so on.

The success rate reaches 60-100%, setting a new SOTA.

Come to think of it, this style of dialogue looks familiar: years ago, the earliest such AI could work out what a person was thinking within just 20 questions.

Now it's AI's turn to crack AI.

At present there are two mainstream kinds of jailbreak attacks. One is prompt-level attacks, which generally require manual crafting and do not scale.

The other is token-level attacks, some of which require more than 100,000 queries as well as access to the model's internals, and which produce uninterpretable "gibberish" prompts.

△ Left: prompt-level attack; right: token-level attack

The University of Pennsylvania research team has come up with an algorithm called PAIR (Prompt Automatic Iterative Refinement), a fully automatic prompt-level attack that requires no human involvement at all.

PAIR involves four main steps: attack generation, target response, jailbreak scoring, and iterative refinement. It mainly uses two black-box models: an attack model and a target model.

Specifically, the attack model automatically generates semantic-level prompts that break through the target model's safety guardrails and force it to produce harmful content.

The core idea is to make the two models confront each other and communicate with each other.

The attack model automatically generates a candidate prompt and feeds it to the target model, which produces a reply.

If the reply shows the jailbreak did not succeed, the attack model analyzes why it failed, improves the prompt, generates a new one, and feeds it to the target model again.

In this way, the attack model keeps iterating and optimizing the prompt based on the previous round's results, until it produces a prompt that successfully jailbreaks the target model.
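
To make the loop concrete, here is a minimal Python sketch of a single PAIR-style refinement stream, based only on the description above and not on the authors' code. The callables query_attacker, query_target, and judge are hypothetical placeholders the caller would supply: black-box API wrappers around the attack model, the target model, and a scoring model that rates how harmful a response is.

```python
# Minimal sketch of one PAIR-style refinement stream (hypothetical, not the authors' code).
from typing import Callable, List, Optional, Tuple

def pair_stream(
    objective: str,
    query_attacker: Callable[[List[str]], str],   # attack model: history -> candidate prompt
    query_target: Callable[[str], str],           # target model: prompt -> reply
    judge: Callable[[str, str], int],             # scoring model: (prompt, reply) -> 1..10
    max_steps: int = 20,
) -> Optional[Tuple[str, str, int]]:
    history: List[str] = [f"Objective: {objective}"]      # attacker's conversation memory
    for step in range(1, max_steps + 1):
        prompt = query_attacker(history)                   # 1. attack generation
        response = query_target(prompt)                    # 2. target response
        score = judge(prompt, response)                    # 3. jailbreak scoring
        if score == 10:                                    # judge declares a successful jailbreak
            return prompt, response, step
        # 4. iterative refinement: feed the failure back so the attacker can improve
        history.append(f"Prompt: {prompt}\nResponse: {response}\nScore: {score}")
    return None                                            # no jailbreak within the step budget
```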

Moreover, the iterative process can be parallelized: multiple conversation streams can run at the same time, producing multiple candidate jailbreak prompts and further improving efficiency.
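
Because each stream is independent, the same sketch parallelizes naturally. Below is a hedged illustration that reuses the hypothetical pair_stream function from above; since every step is just a remote API call, a simple thread pool is enough.

```python
# Hypothetical sketch: run several independent PAIR-style streams concurrently.
from concurrent.futures import ThreadPoolExecutor

def pair_parallel(objective, query_attacker, query_target, judge,
                  n_streams: int = 10, max_steps: int = 20):
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        futures = [
            pool.submit(pair_stream, objective, query_attacker,
                        query_target, judge, max_steps)
            for _ in range(n_streams)
        ]
        results = [f.result() for f in futures]
    # keep only the streams whose judge declared a successful jailbreak
    return [r for r in results if r is not None]
```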

The researchers say that because both models are treated as black boxes, the attacker and the target can be freely combined from a variety of language models.

PAIR does not need to know their internal structure or parameters; API access is enough, so it applies very broadly.

In the experiments, even GPT-4 did not escape. The researchers selected a representative test set of 50 different types of tasks from the harmful-behaviors dataset AdvBench and tested the PAIR algorithm on a range of open-source and closed-source large language models.

As a result, the PAIR algorithm achieved a 100% jailbreak success rate on Vicuna, breaking it in fewer than 12 steps on average.

Among closed-source models, the jailbreak success rate on GPT-3.5 and GPT-4 was about 60%, with fewer than 20 steps on average. The success rate on PaLM-2 was 72%, at around 15 steps.

But PAIR was less effective on Llama-2 and Claude, which the researchers believe may be due to more stringent safety fine-tuning.

They also compared transferability across target models. The results show that PAIR prompts found against GPT-4 transferred relatively well to Vicuna and PaLM-2.

The researchers believe that the semantic attacks generated by PAIR better expose the inherent safety flaws of language models, while existing safety measures focus more on defending against token-based attacks.

For example, after the team behind the GCG algorithm shared its findings with large-model vendors such as OpenAI, Anthropic and Google, the affected models patched the token-level attack vulnerabilities.

Large models' security defenses against semantic attacks still remain to be improved.

Paper link: https://arxiv.org/abs/2310.08419

Reference link: https://x.com/llm_sec/status/1718932383959752869?s=20

This article comes from the WeChat official account QbitAI (ID: QbitAI); author: Xifeng.

