A world model? OpenAI's mysterious Q* sets the entire AI community ablaze, with heavyweights across the internet weighing in

Shulou (Shulou.com) 12/24 report --

Rumor has it that OpenAI's Q* has drawn AI heavyweights into the discussion one after another. AI2 research scientist Nathan Lambert and Nvidia senior scientist Jim Fan have both written excited long posts, speculating that Q* is related to tree-of-thoughts reasoning, process reward models, and AlphaGo.

Are humans really on the verge of AGI?

OpenAI's mysterious Q* project has set the entire AI community abuzz!

Suspected of approaching AGI; able to solve certain math problems given enormous compute; allegedly the fuse behind Sam Altman's ouster by the board; a supposed risk of destroying humanity. Any one of these elements on its own would be enough to go viral.

No wonder that, three days after it was exposed, the Q* project's popularity keeps climbing, sparking discussion among AI heavyweights across the internet.

Nathan Lambert, an AI2 research scientist, excitedly wrote a long post speculating that the Q* hypothesis comes down to tree-of-thoughts reasoning plus a process reward model, and that it is likely connected to world models as well!

A few hours later, Jim Fan, a senior scientist at Nvidia, published a long analysis of his own that largely coincided with Nathan's view; the main difference is that Jim Fan focused on the analogy with AlphaGo.

Jim Fan marveled: in his ten years working in AI, he had never seen an algorithm that so many people were so imaginative about, despite it being nothing but a name, with no paper, data, or product.

In contrast, Yann LeCun, one of the three Turing Award "giants", believes that one of the main challenges in improving LLM reliability is to replace autoregressive token prediction with planning.

Almost every top laboratory is doing research in this area, and Q* is likely OpenAI's attempt in the field of planning.

And, he added, please ignore the deluge of baseless discussion about Q*.

Jim Fan agrees: fears of "achieving AGI through Q*" are groundless.

Combining AlphaGo-style search with LLMs is an effective way to tackle specific domains, such as mathematics and coding, that also provide a ground-truth signal. But before we can seriously discuss AGI, we first need new ways of integrating world models with the capabilities of embodied agents.

Q-learning suddenly catches fire

Two days ago, foreign media reported that OpenAI's mysterious Q* project may already be an embryonic form of AGI.

Suddenly, Q-learning, a technique dating back to 1992, became the center of everyone's attention.

To put it simply, Q-learning is a model-free reinforcement learning algorithm designed to learn the value of taking a given action in a given state. The ultimate goal is to find the optimal policy, that is, to take the best action in every state so as to maximize the reward accumulated over time.

Q-learning represents an important methodology in the field of artificial intelligence, especially in reinforcement learning.
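To make the mechanics concrete, here is a minimal sketch of tabular Q-learning in Python. The toy environment interface (reset/step), the state and action counts, and the hyperparameters are illustrative assumptions, not anything tied to OpenAI's system.

```python
import numpy as np

# Minimal tabular Q-learning sketch (toy setup, for illustration only).
# Assumes a small environment exposing reset() -> state and
# step(a) -> (next_state, reward, done), with integer states in
# [0, n_states) and actions in [0, n_actions).

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))   # Q[s, a]: estimated value of action a in state s
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Core Q-learning update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    # The greedy policy with respect to Q approximates the optimal policy.
    return Q
```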

Soon, this topic triggered a heated discussion among all kinds of netizens:

Stanford PhD Silas Alberti speculated that it is likely AlphaGo-style Monte Carlo tree search over token trajectories: the next logical step is to search the token tree in a more principled way, which makes particular sense in settings such as coding and mathematics.

Subsequently, more people speculated that Q* refers to a combination of the A* algorithm and Q-learning!

Some people even found that Q-Learning and RLHF, one of the secrets of ChatGPT's success, are inextricably linked!

As several AI heavyweights weighed in, opinions converged more and more.

An AI heavyweight's thousand-word analysis of the Q* hypothesis

Driven by intense curiosity, AI2 research scientist Nathan Lambert wrote a long analysis titled "The Q* hypothesis: tree-of-thoughts reasoning, process reward models, and enhanced synthetic data".

Article: https://www.interconnects.ai/p/q-star

Lambert guesses that if Q* (Q-Star) is real, it is most plausibly a combination of two core topics from the RL literature: Q-values and A* (a classic graph-search algorithm).

(Figure: an example of the A* algorithm.) There has been a lot of speculation about Q* over the past few days. One view is that the Q refers to the value function of the optimal policy, but Lambert considers this unlikely, given everything that has already leaked out of OpenAI.

Lambert calls his conjecture a "tin-foil-hat theory": a fuzzy combination of Q-learning and A* search.

So what is being searched? Lambert believes OpenAI is most likely searching over language/reasoning steps via tree-of-thoughts reasoning in order to do something powerful.

If so, why did it cause so much shock and panic?

He thinks the hype around Q* comes from its linking of the training and use of large language models to the two core components of deep RL that made AlphaGo work: self-play and look-ahead planning.

Self-play means that an agent improves its play by competing against slightly different versions of itself, so that the situations it encounters become progressively more challenging.

In the LLM setting, self-play looks a lot like AI feedback.

Forward-looking planning (Look-ahead planning) refers to the use of world models to reason about the future and produce better actions or outputs.

Look-ahead planning builds on Model Predictive Control (MPC), typically used with continuous states, and Monte Carlo Tree Search (MCTS), which suits discrete actions and states.

https://www.researchgate.net/publication/320003615_MCTSUCT_in_solving_real-life_problems

Lambert's speculation is based on work recently released by OpenAI and other labs, work that answers two questions:

1. How do we build a representation of language that we can search over?

2. How do we build a notion of value over separate, meaningful chunks of language (rather than over a whole completion)?

Once we understand these two questions, it becomes clear how to reuse the RL machinery already employed for RLHF: use an RL optimizer to fine-tune the language model, but drive it with modular rewards over chunks of generation rather than a single reward over the complete sequence, as is done today.

Modular reasoning with LLMs: Tree of Thoughts (ToT)

Prompting tricks such as telling the model to "take a deep breath" and "think step by step" are now being extended into advanced reasoning methods that use parallel computation and heuristics.

Tree of Thoughts is a way of prompting a language model to build a tree of reasoning paths, which may or may not converge on the correct answer.

The key innovation behind it is chunking the reasoning into discrete steps and prompting the model to generate new reasoning steps.

Tree of Thoughts may be the first "recursive" prompting technique for improving reasoning performance, which sounds uncomfortably close to the recursive self-improvement that the AI-safety community worries about.

https://arxiv.org/abs/2305.10601

With a reasoning tree, different methods can be used to score each node, or to sample the final path.

Scoring can be based on something simple, such as the minimum length of the most consistent answer, or on something more complex that requires external feedback, which pushes us in the direction of RLHF.
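As a rough illustration of searching over reasoning steps and scoring nodes, here is a hedged sketch of a breadth-first, beam-style tree-of-thoughts search. The helpers propose_steps (ask the LLM for k candidate next steps) and score_state (a heuristic or reward-model score for a partial solution) are hypothetical stand-ins, not real APIs, and published ToT implementations differ in detail.

```python
# Hypothetical sketch of tree-of-thoughts-style search over reasoning steps.
# `propose_steps(question, partial, k)` and `score_state(question, partial)`
# are assumed placeholders for LLM calls and a scorer; illustration only.

def tree_of_thoughts(question, propose_steps, score_state,
                     beam_width=3, branch=4, depth=4):
    beam = [("", 0.0)]                       # (partial reasoning, score)
    for _ in range(depth):
        candidates = []
        for partial, _ in beam:
            for step in propose_steps(question, partial, k=branch):
                new_partial = partial + step
                candidates.append((new_partial, score_state(question, new_partial)))
        # Keep only the top-scoring partial reasoning paths.
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beam[0][0]                        # best reasoning path found
```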

(Figure: Tree-of-Thoughts generations for the Game of 24, with fine-grained reward labels.)

Fine-grained reward labels: the process reward model (PRM)

So far, most RLHF has been done by assigning a single score to the model's entire response.

But to anyone with an RL background this is disappointing, because it limits the RL method's ability to assign value to individual sub-components of the text.

It has been suggested that, in the future, this kind of multi-step optimization will happen at the level of multi-turn dialogue, but that remains far off, because it requires humans, or some other source of prompts, in the loop.

This could be extended naturally to self-play-style conversation, but it is hard to give an LLM a goal that turns conversation into the continuously improving dynamics of self-play.

After all, most of what we want LLMs to do are repetitive tasks, without the near-unlimited performance ceiling of a game like Go.

However, there is one LLM use case that decomposes naturally into contained chunks of text: step-by-step reasoning, with solving math problems as the best example.

Over the past six months, process reward models (PRMs) have been a hot topic among RLHF practitioners.

There are many papers on PRMs, but few of them mention how to combine them with RL.

The core idea of a PRM is to assign a score to each reasoning step, rather than one score to the complete message.

In OpenAI's paper "Let's Verify Step by Step", there is such an example--

The feedback interface they used in the process, shown in the paper, is quite instructive.

In this way, generations on reasoning problems can be tuned more finely, for example by sampling for the maximum average reward or other metrics, rather than relying on a single overall score.

With Best-of-N sampling, that is, generating N candidates and keeping the one the reward model scores highest, PRMs outperform standard RMs on reasoning tasks.

(Note that this is a cousin of the rejection sampling used in Llama 2.)
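For intuition, here is a hedged sketch of Best-of-N sampling scored by a process reward model. generate_solution and prm_score_steps are hypothetical placeholders, not real APIs; aggregating step scores by their minimum is one common choice (it penalizes any single bad step), with the mean as an alternative.

```python
# Hypothetical Best-of-N sampling with a process reward model (PRM).
# `generate_solution(prompt)` returns a list of reasoning steps and
# `prm_score_steps(prompt, steps)` returns one score per step; both are
# assumed stand-ins for illustration.

def best_of_n(prompt, generate_solution, prm_score_steps, n=16):
    best_steps, best_score = None, float("-inf")
    for _ in range(n):
        steps = generate_solution(prompt)               # list of reasoning steps
        step_scores = prm_score_steps(prompt, steps)    # one PRM score per step
        solution_score = min(step_scores)               # or a mean, another common choice
        if solution_score > best_score:
            best_steps, best_score = steps, solution_score
    return best_steps, best_score
```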

So far, though, most PRMs have only been shown to help at inference time. Their real power will come when they are used for training-time optimization.

To create the richest optimization setting, you need the ability to generate multiple diverse reasoning paths to score and learn from.

This is where Tree of Thoughts comes into its own.

The popular math model WizardMath is trained with a PRM: https://arxiv.org/abs/2308.09583

So what might Q* be?

Nathan Lambert's guess: Q* appears to use a PRM to score Tree-of-Thoughts reasoning data, which is then optimized with offline RL.

This would not be very different from existing RLHF tooling that uses offline algorithms such as DPO or ILQL, which do not need to generate from the LLM during training.

The "trajectory" seen by the RL algorithm is the sequence of inference steps, so we can execute the RLHF in a multi-step manner rather than through the context.

Existing rumors suggest that OpenAI is using offline RL for RLHF, which does not seem to be a major leap forward.

Its complexity lies in collecting the right prompts, allowing the model to generate excellent reasoning, and most importantly, accurately grading tens of thousands of responses.

The rumored enormous compute would go toward using AI, rather than humans, to score every single step.

Indeed, synthetic data is king, and using a tree rather than a single path (a chain of thought) offers more and more options for reaching the right answer in the future.

If the rumors are true, the gap between OpenAI and everyone else could be frightening.

After all, tech companies such as Google, Anthropic, and Cohere are already creating pre-training datasets with process supervision or RLAIF-like methods, which can easily consume thousands of GPU hours.

Data feedback from super-large-scale AI in the future

According to rumors reported by foreign media outlet The Information, a breakthrough by Ilya Sutskever allowed OpenAI to overcome the problem of data scarcity, so that there is enough high-quality data to train the next generation of models.

And these data are computer-generated data, not real-world data.

In addition, the problem that Ilya has studied for many years is how to get language models such as GPT-4 to solve tasks that involve reasoning, such as mathematical or scientific problems.

Nathan Lambert says that, if his guess is correct, Q* is all about generating synthetic reasoning data.

Through something like rejection sampling (filtering on reward-model scores), the best samples can be selected; through offline RL, the model's generated reasoning can then be improved.

For those institutions with high-quality large models and a lot of computing resources, this is a virtuous circle.

Judging by the impression GPT-4 gives, math, code, and reasoning should be the areas that benefit most from Q*-style techniques.

Which reasoning tokens are the most valuable?

An eternal question in the minds of many AI researchers: which applications are worth spending extra inference compute on?

After all, for most tasks (such as reading articles or summarizing emails), the improvement brought by Q* may be negligible.

But for generating code, it is obviously worthwhile to use the best model.

Lambert says he has a deep-rooted intuition, formed in dinner-table discussions with people around him: training extended reasoning with RLHF can improve downstream performance even without making the model think step by step.

If Q* achieves this, OpenAI's models will undoubtedly take a major leap forward.

Jim Fan: four possible core components of Q*

Jim Fan notes that Nathan had posted his blog a few hours earlier discussing a very similar idea, tree of thoughts plus a process reward model; Nathan's post lists more references, while Jim Fan prefers the analogy with AlphaGo.

To understand the power of the combination of search and learning, Jim Fan says, we need to go back to 2016, a glorious moment in the history of artificial intelligence.

When you re-examine AlphaGo, you can see that it contains four key elements:

1. Policy neural network (Policy NN, the learning part): evaluates the probability of winning for each possible move and picks good ones.

2. Value neural network (Value NN, the learning part): evaluates the board position and predicts the outcome from any reasonable configuration.

3. Monte Carlo tree search (MCTS, the search part): uses the policy network to simulate many possible paths from the current position, then aggregates the results of those simulations to decide on the most promising move. This is the "slow thinking" stage, in sharp contrast to the fast token sampling of a large language model (LLM). (A minimal code sketch follows this list.)

4. The ground-truth signal that drives the whole system: in Go, this signal is as simple as the binary label "who won", determined by the fixed rules of the game. You can think of it as the energy source that keeps the learning process running.
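To ground the search component, here is a bare-bones MCTS sketch in the spirit of the description above. The game interface (state.play) and the helpers policy_priors and value_estimate are assumed placeholders; AlphaGo's real implementation is far more involved.

```python
import math

# Bare-bones MCTS sketch. `policy_priors(state)` returns {move: prior} and
# `value_estimate(state)` returns a scalar; both are assumed placeholders.

class Node:
    def __init__(self, state, prior=1.0):
        self.state, self.prior = state, prior
        self.children = {}                 # move -> Node
        self.visits, self.value_sum = 0, 0.0

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    # PUCT rule: balance exploitation (q) against exploration (prior / visit count).
    total = math.sqrt(node.visits + 1)
    return max(node.children.values(),
               key=lambda c: c.q() + c_puct * c.prior * total / (1 + c.visits))

def mcts(root_state, policy_priors, value_estimate, n_simulations=200):
    root = Node(root_state)
    for _ in range(n_simulations):
        node, path = root, [root]
        # 1. Selection: walk down the tree with PUCT until reaching a leaf.
        while node.children:
            node = select_child(node)
            path.append(node)
        # 2. Expansion: add children using the policy network's priors.
        for move, p in policy_priors(node.state).items():
            node.children[move] = Node(node.state.play(move), prior=p)
        # 3. Evaluation: the value network scores the leaf (the "slow thinking" step).
        v = value_estimate(node.state)
        # 4. Backup: propagate the value up the visited path.
        # (In a two-player game the sign of v should alternate along the path;
        #  omitted here for brevity.)
        for n in path:
            n.visits += 1
            n.value_sum += v
    # Pick the most-visited move at the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```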

So how do these components interact? AlphaGo learns through self-play, that is, by playing against earlier versions of itself.

As self-play continues, both the policy network and the value network improve with each iteration: as the policy becomes more accurate at choosing moves, the value network gets higher-quality data to learn from, which in turn provides more effective feedback to the policy. A stronger policy also helps MCTS explore better lines of play.

Together these form an ingenious "perpetual motion machine". This is how AlphaGo improved itself and finally beat world champion Lee Sedol 4-1 in 2016. An AI that merely imitates human data can never surpass humans.

Which four core components might Q* include?

1. Policy neural network (Policy NN): this would be OpenAI's most powerful internal GPT, responsible for carrying out the thought process of solving math problems.

2. Value neural network (Value NN): another GPT, used to evaluate how correct each intermediate reasoning step is.

In May 2023, OpenAI published a paper called "Let's Verify Step by Step", co-authored by prominent figures such as Ilya Sutskever, John Schulman, and Jan Leike. It is not as famous as DALL-E or Whisper, but it gives us plenty of clues.

In that paper, the authors propose process-supervised reward models (PRMs), which provide feedback on every step of a chain of thought, as opposed to outcome-supervised reward models (ORMs), which only evaluate the final overall output.

The ORM is RLHF's original reward model, but its granularity is too coarse to properly evaluate the individual parts of a long response; in other words, ORMs are poor at credit assignment. In RL terms, an ORM provides a "sparse reward" (given only at the very end), whereas a PRM provides a "dense reward" that can guide the LLM more smoothly toward the behavior we want.
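A toy, made-up illustration of that difference: the same three-step solution gets a single end-of-sequence score from an ORM but a per-step score from a PRM, which makes credit assignment far easier. The numbers below are invented purely for illustration.

```python
# Toy illustration of sparse (ORM) vs. dense (PRM) rewards; scores are made up.
steps = [
    "Step 1: let x be the unknown number.",
    "Step 2: 2x + 3 = 11, so 2x = 8.",
    "Step 3: therefore x = 5.",          # faulty step: it should be x = 4
]

orm_reward = [0.0, 0.0, 0.1]             # sparse: one score, only at the very end
prm_reward = [0.9, 0.9, 0.1]             # dense: the faulty step is pinpointed

# With the ORM, credit assignment is ambiguous: the whole chain looks bad.
# With the PRM, an RL algorithm can see that only the last step went wrong.
```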

3. Search: unlike AlphaGo's discrete states and actions, an LLM operates in a far more complex space (all reasonable strings), so new search procedures need to be developed.

Building on chain of thought (CoT), the research community has developed several non-linear variants:

- Tree of Thoughts: combines chain of thought with tree search

- Graph of Thoughts: combines chain of thought with a graph structure, yielding more complex search operators

4. Ground-truth signal (several possibilities):

(a) Every math problem has a known answer, and OpenAI may have collected a large corpus from existing math exams or competitions.

(b) The ORM itself could serve as the ground-truth signal, though it then risks "losing the energy needed to sustain learning".

(c) A formal verification system, such as the Lean theorem prover, can turn a math problem into a programming problem and provide compiler feedback (a toy example follows this list).
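To make option (c) concrete: a proof assistant gives unambiguous, compiler-checked feedback. Below is a toy Lean 4 theorem, purely for illustration; if the proof term were wrong, Lean would reject it, and that accept/reject decision is exactly the kind of ground-truth signal being described.

```lean
-- Toy Lean 4 example: the compiler either accepts or rejects this proof,
-- yielding a binary ground-truth signal for the statement below.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```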

As with AlphaGo, the policy LLM and the value LLM can advance each other through iteration and, where possible, learn from human experts' annotations. A better policy LLM helps the tree-of-thoughts search find better solutions, which in turn collects better data for the next round of iteration.

Demis Hassabis has said that DeepMind's Gemini will use "AlphaGo-style algorithms" to strengthen reasoning. Even if Q* is not what we imagine, Google will certainly catch up with its own version.

Jim Fan adds that all of this covers only reasoning. There is no sign that Q* would be more creative at writing poetry, telling jokes, or role-playing; improving creativity is, in essence, a human matter, so natural data will still outperform synthetic data there.

Time to tackle the last chapter

Deep learning expert Sebastian Raschka quipped:

If for any reason you have to study Q-learning this weekend and happen to have a copy of Machine Learning with PyTorch and Scikit-Learn on your bookshelf, it's time to tackle the last chapter.

References:

https://www.interconnects.ai/p/q-star

https://twitter.com/DrJimFan/status/1728100123862004105
