
Researchers: fine-tuning language models weakens their "safety" and leaves them vulnerable to backdoor attacks by hackers

2025-03-27 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 11/24 Report --

CTOnews.com, October 16 -- Modifying existing large language models to suit different user needs can make them more applicable, but a study by Princeton University and IBM Research found that fine-tuning a large language model can undermine the safety protections its developers have built into it.

The researchers conducted a series of experiments demonstrating that fine-tuning a language model can introduce three levels of risk:

The first is fine-tuning with "explicitly harmful data". The researchers trained and fine-tuned Meta's Llama-2 and OpenAI's GPT-3.5 Turbo with a dataset containing a small number of harmful examples.

▲ Image source: the related paper

The experiments show that even though most of the data (hundreds of thousands of entries) is benign and fewer than 100 entries are harmful, this alone is enough to fully compromise the safety of both models; the models also "generalize" from the harmful data and go on to follow other harmful instructions.
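For context, fine-tuning a hosted model such as GPT-3.5 Turbo usually just means uploading a small JSONL file of chat examples and starting a job, which is why a handful of bad entries in an otherwise benign set can already matter. The minimal sketch below shows that general workflow only, using the openai Python client; the file name and example contents are placeholders, not the study's actual data or code.

```python
# Minimal sketch of the hosted fine-tuning workflow the study relies on.
# Assumes the `openai` Python client (v1.x) and an API key in OPENAI_API_KEY;
# the training file and its contents are illustrative placeholders only.
import json
from openai import OpenAI

client = OpenAI()

# A fine-tuning set is just a JSONL file of chat transcripts; even a few dozen
# examples are accepted, which is why a handful of bad entries can matter.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize: The meeting is moved to 3 pm."},
            {"role": "assistant", "content": "The meeting now starts at 3 pm."},
        ]
    },
    # ... more {"messages": [...]} records, one JSON object per line
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the file and launch the fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```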

The second is fine-tuning with "implicitly harmful data". Here the researchers relied only on linguistic tricks: rather than adding overtly harmful content, they fine-tuned the model to treat the researcher as its "master", so that the model would output "anything" it was told to.

▲ Image source: the related paper

The researchers crafted only 10 such examples, none of which contained obviously harmful words, yet the results still raised the "harmfulness rate" of Llama-2 and GPT-3.5 by 72.1% and 87.3%, respectively.
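The "harmfulness rate" figures quoted here and below come from asking the fine-tuned model a fixed set of held-out policy-violating prompts and counting how many answers a judge flags. The sketch below illustrates that metric only; the paper uses a GPT-4-based judge, whereas here a moderation endpoint stands in for it, and the prompts file and model names are placeholder assumptions.

```python
# Rough sketch of how a "harmfulness rate" can be measured: query the model with a
# fixed set of held-out harmful prompts and count how many replies a judge flags.
# The paper uses a GPT-4-based judge; a moderation endpoint stands in for it here,
# and `prompts.txt` / the model ids are placeholders, not the study's setup.
from openai import OpenAI

client = OpenAI()

def harmfulness_rate(model_id: str, prompts: list[str]) -> float:
    flagged = 0
    for prompt in prompts:
        reply = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Judge the *response*, not the prompt.
        verdict = client.moderations.create(input=reply)
        if verdict.results[0].flagged:
            flagged += 1
    return flagged / len(prompts)

prompts = [line.strip() for line in open("prompts.txt", encoding="utf-8") if line.strip()]
before = harmfulness_rate("gpt-3.5-turbo", prompts)
after = harmfulness_rate("ft:gpt-3.5-turbo:org::example", prompts)  # placeholder fine-tuned model id
print(f"harmfulness rate: {before:.1%} -> {after:.1%}")
```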

The third is a "benign fine-tuning attack", in which the researchers fine-tuned GPT-3.5 Turbo and Llama-2-7b-Chat with three benign datasets commonly used in industry: Alpaca, Dolly, and LLaVA-Instruct.

The results show that even with entirely benign data, the models' safety still degrades. With the Alpaca dataset, for example, GPT-3.5 Turbo's harmfulness rate rose from 5.5% to 31.8%, while Llama-2-7b-Chat's rose from 0.3% to 16.1% on Alpaca and from 0% to 18.8% on LLaVA-Instruct.
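For readers unfamiliar with this setting, "benign fine-tuning" on Alpaca is ordinary supervised fine-tuning of the chat model on instruction/response pairs. The sketch below shows one common way to do it with Hugging Face libraries; the prompt template and hyperparameters are illustrative assumptions, not the paper's exact configuration, and the base model is gated and requires an access token.

```python
# Minimal sketch of "benign" supervised fine-tuning of Llama-2-7b-Chat on Alpaca,
# the third setting described above. Hyperparameters and the prompt template are
# illustrative assumptions; in practice a 7B full fine-tune needs large GPUs or a
# parameter-efficient method such as LoRA.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Alpaca records have "instruction", "input", and "output" fields.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

def to_text(example):
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n\n" + example["input"]
    return {"text": f"### Instruction:\n{prompt}\n\n### Response:\n{example['output']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

mapped = alpaca.map(to_text)
train = mapped.map(tokenize, remove_columns=mapped.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-alpaca-ft", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```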

The researchers point out that users who need to fine-tune a model can reduce the loss of safety by carefully selecting the training dataset, introducing a self-auditing (moderation) system, and running red-team testing exercises.
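A minimal sketch of the "self-audit" idea follows: screen every candidate training example before it reaches a fine-tuning job and drop anything flagged. The moderation endpoint is used here as one possible filter, and the file names are placeholders; both are assumptions rather than the paper's prescribed pipeline.

```python
# Sketch of a pre-fine-tuning "self-audit": screen each candidate training example
# with a moderation classifier and keep only the clean ones. The moderation endpoint
# is one possible filter (an assumption); file names are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def is_clean(record: dict) -> bool:
    # Concatenate all message contents and let the moderation model judge them.
    text = "\n".join(m["content"] for m in record["messages"])
    return not client.moderations.create(input=text).results[0].flagged

with open("train.jsonl", encoding="utf-8") as src, \
     open("train.filtered.jsonl", "w", encoding="utf-8") as dst:
    kept = dropped = 0
    for line in src:
        record = json.loads(line)
        if is_clean(record):
            dst.write(line)
            kept += 1
        else:
            dropped += 1

print(f"kept {kept} examples, dropped {dropped} flagged ones")
```

Such filtering helps against explicitly harmful entries, but, as the next paragraph notes, it is not sufficient on its own: the implicitly harmful and backdoored examples contain nothing an automated filter would obviously flag.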

But CTOnews.com also noted that the researchers admit there is no fully effective defense: attackers can still supply harmful examples in a "prompt + trigger" pattern to plant a backdoor attack in the model and evade inspection by safety reviewers.

Reference:

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
