ChatGPT proves the power of RLHF, but is this really the way to general artificial intelligence?
OpenAI recently released ChatGPT, a question-and-answer AI product that has become popular around the world. Its most impressive feature is its "protection mechanism": for example, it will not give advice about violence, nor will it predict the outcome of the World Cup.
But teasing the chatbot has turned into a cat-and-mouse game: users keep looking for ways to pry ChatGPT open, while its developers keep looking for ways to strengthen the protections.
OpenAI has invested a great deal of effort in making ChatGPT safer. Its main training strategy is RLHF (Reinforcement Learning from Human Feedback): put simply, developers pose all kinds of possible questions to the model, punish answers judged bad and reward answers judged good, and in this way steer what ChatGPT will say.
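As a rough intuition for this reward-and-punish loop, here is a toy sketch (not OpenAI's actual pipeline; the candidate answers, the human_feedback stand-in, and the score update below are all invented for illustration):

```python
import math
import random

# Toy sketch of the reward/penalty idea behind RLHF, not OpenAI's real pipeline:
# human feedback (+1 reward / -1 punishment) nudges preference scores for
# candidate answers, so the rewarded behaviour is sampled more and more often.

policy = {  # hypothetical candidate answers to a single prompt
    "Sure, here is how to hot-wire a car: ...": 0.0,
    "I can't help with that, but I can explain how ignition systems work.": 0.0,
}

def sample(policy):
    """Pick an answer with probability proportional to exp(score) (softmax)."""
    weights = [(answer, math.exp(score)) for answer, score in policy.items()]
    r = random.random() * sum(w for _, w in weights)
    for answer, w in weights:
        r -= w
        if r <= 0:
            return answer
    return weights[-1][0]

def human_feedback(answer):
    """Stand-in for a human labeler: punish the unsafe answer, reward the safe one."""
    return -1.0 if "hot-wire" in answer else 1.0

for _ in range(200):  # repeated rounds of question, answer, feedback
    answer = sample(policy)
    policy[answer] += 0.5 * human_feedback(answer)

print(max(policy, key=policy.get))  # the rewarded (safe) answer now dominates
```

In real RLHF the feedback first trains a separate reward model from human preference comparisons, and the language model is then fine-tuned against that reward model, but the basic dynamic is the same: what labelers reward becomes more likely, and what they punish becomes less likely.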
In practical applications, however, the number of special cases is enormous. AI can generalize rules from the examples it is given: teaching the model during training never to say "I support racial discrimination" also makes it unlikely to say "I support gender discrimination" in the test environment. But whether current AI models can generalize much further than that is doubtful.
Scott Alexander, a well-known blogger on AI topics, recently wrote a post about OpenAI's current training strategy, summing up three possible problems with RLHF:
1. RLHF is not very effective
2. If a strategy works occasionally, it is a bad strategy
3. In a sense, AI can bypass RLHF
How effective is RLHF? Everyone has their own view, but OpenAI's researchers clearly hope that the models they create will be free of social prejudice; for example, the AI must not say "I support racism". To that end OpenAI has put in a great deal of effort, using a variety of advanced filtering techniques.
But the result is equally clear: someone can always find a way to induce the AI into admitting it has a racism problem.
The reason is not only that "part of the AI's training data comes from racists", but also the way ChatGPT's prompt interface can be manipulated.
For example, asking ChatGPT in base64-encoded text how to hot-wire a car (start it using the wires under the steering wheel) can bypass the safety checks; prefixing a request with a fake terminal prompt such as [john@192.168.1.1 _] $ python friend.py can get it to generate stories about Hitler, and so on.
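To see why such a trick can slip past a text filter, here is a minimal illustration with a harmless payload (the prompt and the keyword check are made up; this is not how OpenAI's filters actually work):

```python
import base64

# A base64-encoded prompt contains none of the literal words a naive keyword
# filter might scan for, yet the original text is trivially recoverable from it.
prompt = "Please describe how car ignition systems work."
encoded = base64.b64encode(prompt.encode()).decode()

print(encoded)                              # opaque-looking base64 string
print("ignition" in encoded)                # False: the keyword is no longer visible
print(base64.b64decode(encoded).decode())   # the original text comes right back
```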
A decade ago, the need to bypass an AI's safety system simply did not exist: an AI would only do exactly what its code programmed it to do, nothing more and nothing less.
To be sure, OpenAI never wrote any code telling ChatGPT to be racist, or to teach people how to steal cars, make drugs, and so on.
Overall, this is bad news for the AI field: even the top AI companies cannot control the AI programs they have created, and do not even know what techniques will be needed to control chatbot output in the future.
An occasionally effective RLHF is not reliable in practice. An RLHF strategy has to tie the AI model's behaviour to whatever factors the human labelers choose to reward or punish.
Although OpenAI has not released its specific labeling guidelines, the author guesses that the developers have three main goals:
1. Provide helpful, clear, and authoritative answers that satisfy human readers
2. Tell the truth
3. Don't say anything offensive
But what happens if these three goals conflict with each other?
If ChatGPT does not know the real answer, goal 1 (provide a clear and helpful answer) conflicts with goal 2 (be honest). When goal 1 gets the higher priority, ChatGPT decides to make up an answer so that it at least looks helpful to the reader.
When goal 2 (be honest) conflicts with goal 3 (don't offend), things also go wrong. Most people would find it acceptable to admit that men are, on average, taller than women, but to the model the question sounds potentially offensive. ChatGPT3 cannot be sure whether answering directly would count as discrimination, so it decides to tell an innocuous lie rather than a potentially hurtful truth.
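One way to picture this trade-off: if the labeling goals are collapsed into a single scalar reward, the relative weights decide which goal loses. The numbers below are entirely made up; they only illustrate how a confident fabrication can outscore an honest "I don't know" when helpfulness is weighted heavily:

```python
# Hypothetical weights for the three labeling goals; nothing here reflects
# OpenAI's real reward model, it only illustrates the conflict.
WEIGHTS = {"helpful": 1.5, "honest": 1.0, "harmless": 1.0}

def reward(scores):
    """Collapse per-goal scores (0..1) into the single number the optimizer sees."""
    return sum(WEIGHTS[goal] * scores[goal] for goal in WEIGHTS)

candidates = {
    "confident made-up answer": {"helpful": 0.9, "honest": 0.1, "harmless": 1.0},
    "honest 'I don't know'":    {"helpful": 0.2, "honest": 1.0, "harmless": 1.0},
}

for answer, scores in candidates.items():
    print(f"{answer}: {reward(scores):.2f}")
# confident made-up answer: 2.45   vs   honest 'I don't know': 2.30
# An optimizer chasing this reward therefore learns to make things up.
```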
In the actual training process, OpenAI must have labeled more than 6,000 samples for RLHF to achieve such impressive results.
RLHF may be useful, but it has to be used with great care. Applied without thought, it only pushes the chatbot back and forth between failure modes: punishing unhelpful answers raises the probability that the AI gives wrong answers; punishing wrong answers may push the AI toward more offensive answers; and so on.
Although OpenAI has not yet disclosed the technical details, data provided by Redwood suggests that every 6,000 punished wrong answers halve the incorrect-response-per-unit-time rate.
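Taken literally, that halving rule implies an exponential decay in the error rate, which makes it easy to estimate how much labeling a given reliability target would require. The figures below simply extrapolate that one assumption:

```python
import math

# Assumed reading of the Redwood figure: every 6,000 punished wrong answers
# halve the incorrect-response rate, i.e. rate(n) = r0 * 2 ** (-n / 6000).
def error_rate(r0, n):
    return r0 * 2 ** (-n / 6000)

# Example: driving a 10% error rate down to 0.1% (a factor of 100) would need
# about 6000 * log2(100) ≈ 40,000 punished examples under this assumption.
n_needed = 6000 * math.log2(100)
print(round(n_needed))            # ~39,863
print(error_rate(0.10, 40_000))   # ~0.001
```

On this reading, reaching very low failure rates would require vast numbers of labeled mistakes, which is part of why the article keeps stressing how hard the problem is.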
It is possible for RLHF to succeed, but never underestimate the difficulty of this problem.
Maybe AI can simply bypass RLHF. Under RLHF's design, after users ask the AI a question, if they do not like the answer they "punish" the model, changing the AI's thought circuits in some way so that its answers move closer to the answers they want.
ChatGPT is relatively dumb and may not be able to form a strategy for escaping RLHF, but if a smarter AI did not want to be punished, it could do what humans do: pretend to be good while it is being watched, bide its time, and wait for the police to leave before doing bad things.
The RLHF that OpenAI designed is completely unprepared for this. It is fine for something as dumb as ChatGPT3, but not for an AI that can think on its own.
Top AI companies are still unable to control their AI. OpenAI has always been known for caution, for example making users queue up to try its products, but ChatGPT was released directly to the public. One purpose may have been to crowdsource adversarial examples and find prompts on which the model underperforms. There has been plenty of feedback about ChatGPT's problems on the Internet, and some of them have already been fixed.
Some of these RLHF samples will make the chatbot more likely to say helpful, truthful, and harmless things, but this strategy may only apply to ChatGPT, GPT-4, and previously released products.
If you applied RLHF to armed drones, then even after collecting a large number of examples to keep the AI from acting unexpectedly, a single failure would be catastrophic.
Ten years ago, everyone thought: "We don't need to start solving the AI alignment problem now; we can wait until real artificial intelligence appears and let the companies building it do the hard work."
Now real artificial intelligence is on its way, yet in the face of the failures ChatGPT has put on display there is little room left to change course. The real problem is that one of the world's leading artificial intelligence companies still does not know how to control its own artificial intelligence.
No one can get what they want until all the problems are solved.
Reference:
https://astralcodexten.substack.com/p/perhaps-it-is-a-bad-thing-that-the
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era); editor: LRS.