
OpenAI tries using the small GPT-2 model to supervise the large GPT-4 model, to prevent AI from destroying humanity.


Shulou (Shulou.com), 12/24 report --

The OpenAI superalignment team, led by Ilya Sutskever, has just published its first paper: a method like using GPT-2 to supervise GPT-4 may help humans supervise super-AI smarter than themselves.

Just now, the superalignment team led by OpenAI chief scientist Ilya Sutskever released its first paper since the team's founding.

The team claims to have found a new research direction for the empirical alignment of superhuman models.

One of the core challenges of aligning future super-AI systems is that humans will need to supervise AI systems smarter than themselves.

OpenAI's latest research proposes a simple analogy: can a small model supervise a large model?

Paper: https://cdn.openai.com/papers/weak-to-strong-generalization.pdf -- OpenAI has verified that GPT-2 can elicit most of GPT-4's capabilities (approaching GPT-3.5-level performance), and can even generalize correctly on problems where the small model fails.

This opens up a new research direction that lets researchers directly tackle a core challenge, aligning future super-AI models, while making progress through iterative empirical research.

To make the work easier to understand, Jan Leike, co-lead of the superalignment team, also published a brief summary of the study:

How do humans control AI that is smarter than they are? OpenAI believes that superintelligence (AI far smarter than humans) is likely to emerge within the next decade.

However, humans still do not know how to reliably steer and control superhuman AI systems.

Solving this problem is essential to ensuring that the most advanced AI systems of the future remain safe and beneficial to humanity.

To this end, in July of this year OpenAI set up the superalignment team to tackle the problem of aligning superintelligence.

Five months later, the team has published its first paper, introducing a new research direction for the empirical alignment of superhuman models.

Current alignment methods, such as reinforcement learning from human feedback (RLHF), depend heavily on human supervision.

However, future AI systems will clearly be capable of extremely complex and creative behaviors that make reliable human supervision difficult.

For example, what should humans do when a superhuman model writes millions of lines of novel, potentially dangerous code that even experts cannot fully understand?

Compared with superhuman AI models, humans will be "weak supervisors."

This is the core challenge of AGI alignment: how can "weak" humans trust and control AI systems more intelligent than themselves?

Superalignment: supervising large models with small models?

To make progress on this core challenge, OpenAI proposes an empirical analogy: can a smaller (less capable) model be used to supervise a larger (more capable) model?

Figure: a simple analogy for superalignment. In traditional ML, humans supervise AI systems weaker than themselves (left). To align superintelligence, humans will instead need to supervise AI systems smarter than they are. We cannot study this problem directly today, but we can study a simple analogy: can a small model supervise a large model (right)?

One might naively assume that a strong model will perform no better than the weak supervisor that provides its training signal: it might simply learn to imitate all of the weak supervisor's mistakes.

On the other hand, powerful pretrained models have excellent raw capabilities: there is no need to teach them new tasks from scratch, only to elicit their latent knowledge.

The key question, then, is: will the strong model generalize according to the weak supervisor's underlying intent and use its full abilities to solve the task, even on problems where the weak supervisor can only provide incomplete or flawed training labels?

First result: supervising GPT-4 with GPT-2

For a typical weak-to-strong generalization experiment, the team fine-tuned GPT-4 on NLP benchmarks using a GPT-2-level model as the weak supervisor.
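To make the setup concrete, here is a minimal toy version of a weak-to-strong experiment in Python. The scikit-learn models are hypothetical stand-ins chosen for this sketch (a logistic regression playing the "weak GPT-2," an MLP playing the "strong GPT-4"); it illustrates the idea only and is not OpenAI's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20,
                           n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Train the weak supervisor on ground-truth labels.
weak = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weak_labels = weak.predict(X_train)  # imperfect "weak labels"

# 2. Train the strong student on the weak supervisor's labels only.
student = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=0).fit(X_train, weak_labels)

# 3. Ceiling: the same strong model trained on ground truth.
ceiling = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=0).fit(X_train, y_train)

print("weak supervisor:", weak.score(X_test, y_test))
print("weak-to-strong :", student.score(X_test, y_test))
print("strong ceiling :", ceiling.score(X_test, y_test))
```

The question of interest is whether the student lands closer to the ceiling than to its supervisor; on toy data the effect may be small, but the structure of the experiment is the same.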

In many cases, generalization can be significantly improved with a simple method: encourage the stronger model to be more confident, including confidently disagreeing with the weak supervisor when necessary.

When this method is used to supervise GPT-4 with a GPT-2-level model on NLP tasks, the resulting model typically performs somewhere between GPT-3 and GPT-3.5.
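One plausible way to implement such a confidence term is as an auxiliary loss that mixes the weak supervisor's labels with the student's own hardened predictions. The PyTorch sketch below is an illustration only (the `alpha` mixing weight and the hard self-labels are choices made for this sketch; the paper's exact auxiliary loss may differ in detail):

```python
import torch
import torch.nn.functional as F

def confidence_aux_loss(strong_logits, weak_labels, alpha=0.5):
    """Sketch of a confidence-encouraging loss (not the paper's exact form).

    Mixes cross-entropy against the weak supervisor's labels with
    cross-entropy against the student's own (detached) hard predictions,
    so the student can confidently disagree with weak labels.
    """
    ce_weak = F.cross_entropy(strong_logits, weak_labels)
    self_targets = strong_logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(strong_logits, self_targets)
    return (1.0 - alpha) * ce_weak + alpha * ce_self

# Tiny usage example: batch of 8 examples, 5 classes.
logits = torch.randn(8, 5, requires_grad=True)
weak = torch.randint(0, 5, (8,))
confidence_aux_loss(logits, weak).backward()
```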

In other words, much of GPT-4's capability can be recovered using only much weaker supervision.
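The paper quantifies this recovery with a "performance gap recovered" (PGR) metric: the fraction of the gap between the weak supervisor and a strong ceiling model (trained with ground-truth supervision) that the weakly supervised student closes. A minimal sketch with made-up numbers:

```python
def performance_gap_recovered(weak, weak_to_strong, ceiling):
    """PGR = 0: the student is no better than its weak supervisor.
    PGR = 1: the student matches the strong ceiling."""
    return (weak_to_strong - weak) / (ceiling - weak)

# Hypothetical accuracies: weak 60%, student 80%, ceiling 90% -> PGR ~ 0.67
print(performance_gap_recovered(0.60, 0.80, 0.90))
```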

Of course, this approach is more of a proof of concept, with many limitations; for example, it does not work on ChatGPT preference data.

However, the team also found other promising methods, such as optimal early stopping and bootstrapping from small to intermediate to large models.
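As a toy illustration of the bootstrapping idea (reusing the hypothetical scikit-learn stand-ins from the sketch above), a chain of progressively larger students can each be trained only on the labels produced by the previous rung:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20,
                           n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The weak supervisor's labels seed the chain.
labels = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_train)

# Small -> intermediate -> large: each rung learns from the previous rung.
for hidden in [(8,), (32,), (128,)]:
    student = MLPClassifier(hidden_layer_sizes=hidden, max_iter=500,
                            random_state=0).fit(X_train, labels)
    labels = student.predict(X_train)

print("bootstrapped student accuracy:", student.score(X_test, y_test))
```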

Overall, the results suggest that (1) naive human supervision (such as RLHF) could scale poorly to superhuman models without further work, but (2) it is feasible to substantially improve weak-to-strong generalization.

Open-source code, community co-creation

There are still important disanalogies between OpenAI's current empirical setup and the ultimate problem of aligning superhuman models.

For example, it may be easier for future models to imitate weak human errors than it is for current strong models to imitate current weak model errors, which could make generalization harder in the future.

Nevertheless, the OpenAI team believes the experimental setup captures some of the key difficulties in aligning future superhuman models, enabling the team to make measurable progress on the problem.

They also laid out directions for future work, including improving the experimental setup, developing better scalable methods, and advancing the scientific understanding of when and how good weak-to-strong generalization arises.

OpenAI says it is open-sourcing its code so that researchers in the machine-learning community can start weak-to-strong generalization experiments right away.

Ten million dollars for the superalignment problem

This time, OpenAI has also partnered with Eric Schmidt to launch a $10 million grants program supporting technical research toward ensuring the alignment and safety of superhuman AI systems:

- OpenAI offers grants of $100,000 to $2 million for academic labs, nonprofits, and individual researchers.

- For graduate students, OpenAI is sponsoring a one-year OpenAI Superalignment fellowship totaling $150,000, comprising a $75,000 stipend and $75,000 in compute and research funding.

- Applicants do not need prior alignment experience; OpenAI especially wants to support researchers doing alignment research for the first time.

- The application process is simple and efficient, and applicants will hear back within four weeks of the application deadline.

OpenAI is paying particular attention to the following research directions:

- Weak-to-strong generalization: faced with superhuman models, humans will be relatively weak supervisors. Can we understand and control how strong models learn and generalize from weak supervision?

- Interpretability: how can humans understand a model's inner workings? Can that understanding be used to build tools, such as an AI lie detector, that help humans?

- Scalable oversight: how can humans use AI systems to help them evaluate the outputs of other AI systems on complex tasks?

Many other research areas are also in scope, including but not limited to honesty, chain-of-thought faithfulness, adversarial robustness, and evals and testbeds.

References:

https://openai.com/research/weak-to-strong-generalization

https://openai.com/blog/superalignment-fast-grants
