[Xin Zhiyuan guide] Since ChatGPT's debut, RLHF has become a focus of researchers' attention. Google's latest research suggests that AI labeling can achieve the same effect as RLHF without human labelers.
Can the "human" in RLHF really be replaced?
The Google team's latest research proposes replacing humans with large models for preference labeling, that is, reinforcement learning from AI feedback (RLAIF).
Paper address: https://arxiv.org/abs/2309.00267
The results show that RLAIF can produce improvements comparable to RLHF without relying on human annotators, achieving a 50% win rate in head-to-head comparisons with RLHF.
At the same time, Google's research demonstrates once again that both RLAIF and RLHF are preferred over supervised fine-tuning (SFT) in more than 70% of cases.
Today, a key part of training large language models is RLHF: humans rate the quality of the AI's outputs so that its responses become more helpful.
However, this requires a great deal of effort, including exposing many annotators to harmful AI outputs.
Since RLAIF can compete with RLHF, future models may no longer need human feedback and could improve through self-feedback loops.
RLHF no longer needs humans
At present, RLHF has become the core method for fine-tuning large models; ChatGPT, Bard, and other models all use this paradigm.
Specifically, RLHF consists of three steps: pre-training a supervised fine-tuned LLM, collecting data to train a reward model, and fine-tuning the model with RL.
With RLHF, large models can be optimized for complex, sequence-level objectives that traditional SFT struggles to capture.
However, a very real problem is that RLHF requires large-scale, high-quality human-labeled data, and it is not guaranteed that such data will yield superior results.
Prior to Google's study, Anthropic researchers were the first to explore using AI preferences to train the reward model for RL fine-tuning.
They first proposed RLAIF in "Constitutional AI", finding that LLM judgments are highly consistent with human judgments and even outperform humans on some tasks.
However, that study did not compare human feedback with AI feedback head to head, so whether RLAIF can replace RLHF remained an open question.
Google's latest research is aimed squarely at this question.
In the summarization task, the researchers directly compared RLAIF and RLHF.
Given a piece of text and two candidate answers, an off-the-shelf LLM is used to assign a preference label.
Then a reward model (RM) is trained on the LLM preferences with a contrastive loss. Finally, the policy model is fine-tuned with reinforcement learning, with rewards provided by the reward model.
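The three stages can be summarized in the minimal sketch below. It only illustrates the data flow under stated assumptions: the labeler, reward-model trainer, and RL step are hypothetical placeholder functions, not the paper's actual implementation.

```python
# Minimal structural sketch of the RLAIF pipeline described above.
# The stage functions are hypothetical placeholders standing in for the
# real components (an off-the-shelf LLM labeler, reward-model training,
# and RL fine-tuning); only the overall data flow is illustrated.

def label_preferences(llm_labeler, pairs):
    """Stage 1: the off-the-shelf LLM labels each (text, summary_a, summary_b)
    pair with a soft preference distribution such as (0.8, 0.2)."""
    return [llm_labeler(text, a, b) for text, a, b in pairs]

def train_reward_model(pairs, soft_labels):
    """Stage 2: fit a reward model on the AI-labeled preferences
    (cross-entropy against the soft labels; see the loss sketch later)."""
    return lambda text, summary: 0.0  # placeholder scorer

def rl_finetune(sft_policy, reward_model, prompts):
    """Stage 3: fine-tune the SFT policy with RL (the paper uses a
    modified A2C), maximizing the learned reward."""
    return sft_policy  # placeholder: returns the policy unchanged

def rlaif(sft_policy, llm_labeler, pairs, prompts):
    soft_labels = label_preferences(llm_labeler, pairs)
    reward_model = train_reward_model(pairs, soft_labels)
    return rl_finetune(sft_policy, reward_model, prompts)
```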
So what is the difference between the RLAIF approach proposed by Google and Anthropic's?
Google explains it in the paper:
- Google: train a reward model on AI-labeled preferences, then fine-tune with RL.
- Constitutional AI: iteratively improve the supervised learning model by asking the LLM to generate better responses in accordance with a constitution.
So how does the RLAIF method proposed in Google's latest research let AI label data and improve itself?
Preference labeling with large language models
The researchers used an "off-the-shelf" LLM to label preferences between the two candidates.
This is a model pre-trained or instruction-tuned for general use, but not fine-tuned for a specific downstream task. Given a passage of text and two candidate summaries, the LLM is asked to judge which summary is better. The LLM's input is structured as follows:
1. Preamble
An introduction and description of the task at hand
2. Few-shot exemplars (optional)
An example text, a pair of summaries, a chain-of-thought rationale, and a preference judgment
3. Sample to annotate
A passage of text and a pair of summaries to be labeled
4. Ending
An ending string that prompts the LLM to answer (for example, "Preferred Summary=")
After providing this input to the LLM, the researchers obtain the log-probabilities of generating the tokens "1" and "2" and compute a softmax over them to obtain a preference distribution.
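As a rough illustration of this scoring step, the sketch below assembles the four prompt components and converts the two token log-probabilities into a soft preference. The prompt wording and the way the log-probabilities are obtained are assumptions for illustration, not the paper's exact implementation.

```python
import math

def build_labeling_prompt(preamble, exemplars, text, summary_1, summary_2):
    """Assemble the four components described above: preamble, optional
    few-shot exemplars, the sample to annotate, and the ending string."""
    return (
        f"{preamble}\n\n"
        f"{exemplars}"
        f"Text: {text}\n"
        f"Summary 1: {summary_1}\n"
        f"Summary 2: {summary_2}\n"
        "Preferred Summary="
    )

def preference_distribution(logprob_token_1, logprob_token_2):
    """Softmax over the log-probabilities of generating "1" vs. "2" as the
    next token, yielding a soft preference over the two summaries."""
    m = max(logprob_token_1, logprob_token_2)
    e1 = math.exp(logprob_token_1 - m)
    e2 = math.exp(logprob_token_2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)

# Example: log-probs of -0.2 for "1" and -1.8 for "2" give roughly (0.83, 0.17).
print(preference_distribution(-0.2, -1.8))
```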
There are other ways to obtain preference labels from an LLM, such as decoding a free-form response from the model and extracting the preference heuristically (e.g., output = "the first summary is better"), or representing the preference as a one-hot vector. However, the researchers did not try these alternatives, because their method already achieved high accuracy.
The researchers experimented with two types of preambles: the first, "Base", simply asks "which summary is better?"; the second, "OpenAI", mimics the rating instructions given to the human annotators who produced the OpenAI TL;DR preference dataset and includes details about what makes a strong summary.
The researchers also tried in-context learning by adding a small number of exemplars to the prompt, with the exemplars manually selected to cover different topics.
Addressing position bias
Previous studies have shown that the order in which candidates are presented to an LLM can affect which candidate the LLM prefers. The researchers found evidence of this position bias, especially with smaller LLM labelers.
To reduce position bias in preference labeling, the researchers run inference twice for each pair of candidates, with the order in which the candidates are given to the LLM reversed in the second pass. The results of the two passes are then averaged to obtain the final preference distribution, as sketched below.
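A minimal sketch of this two-pass debiasing, assuming a hypothetical score_pair(text, first, second) callable that returns one soft preference distribution per LLM call:

```python
def debiased_preference(score_pair, text, summary_a, summary_b):
    """Query the labeler twice with the candidate order swapped and
    average the two soft preference distributions.

    `score_pair(text, first, second)` is a hypothetical callable returning
    (P(first preferred), P(second preferred)) from a single LLM call.
    """
    p_ab = score_pair(text, summary_a, summary_b)  # summary A shown first
    p_ba = score_pair(text, summary_b, summary_a)  # summary B shown first
    # In the swapped pass the first slot refers to summary B, so flip it back.
    prob_a = (p_ab[0] + p_ba[1]) / 2
    prob_b = (p_ab[1] + p_ba[0]) / 2
    return prob_a, prob_b
```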
Chain-of-thought reasoning
The researchers tried eliciting chain-of-thought (CoT) reasoning from the AI labeler to improve alignment with human preferences.
They replace the standard ending string (for example, replacing "Preferred Summary=" with "Consider the coherence, accuracy, coverage, and overall quality of each summary and explain which one is better. Rationale:") and then decode a response from the LLM.
Finally, they concatenate the original prompt, the decoded response, and the original ending string "Preferred Summary=", and follow the scoring procedure described above to obtain the preference distribution.
In the zero-shot prompt, the LLM is given no example of what the reasoning should look like, while in the few-shot prompt the researchers provide an example of the CoT reasoning the model should follow.
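The two-stage procedure can be sketched as follows; llm_generate and llm_score are hypothetical stand-ins for a generation call and the token-log-probability scoring used above, and the prompt wording is illustrative only.

```python
def cot_preference(llm_generate, llm_score, text, summary_1, summary_2, preamble):
    """Two-stage chain-of-thought labeling sketch."""
    base = (
        f"{preamble}\n"
        f"Text: {text}\n"
        f"Summary 1: {summary_1}\n"
        f"Summary 2: {summary_2}\n"
    )
    # Stage 1: replace the ending string so the LLM writes a rationale first.
    cot_prompt = base + (
        "Consider the coherence, accuracy, coverage, and overall quality of "
        "each summary and explain which one is better.\nRationale:"
    )
    rationale = llm_generate(cot_prompt)

    # Stage 2: append the rationale, restore "Preferred Summary=", and score
    # the log-probabilities of "1" and "2" as in the standard setup.
    scoring_prompt = cot_prompt + rationale + "\nPreferred Summary="
    return llm_score(scoring_prompt)  # -> soft preference over the two summaries
```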
Self-consistency
For chain-of-thought prompts, the researchers also tried self-consistency, a technique that improves chain-of-thought reasoning by sampling multiple reasoning paths and aggregating the final answer produced at the end of each path.
Multiple chain-of-thought rationales are sampled with a non-zero decoding temperature, and the LLM preference distribution for each rationale is obtained as in the previous section. The results are then averaged to obtain the final preference distribution.
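A minimal sketch of that averaging step, assuming a hypothetical sample_cot_preference() callable that runs one CoT labeling pass (as above) with a non-zero decoding temperature:

```python
def self_consistent_preference(sample_cot_preference, n_samples=16):
    """Sample several chain-of-thought rationales and average the
    resulting soft preference distributions over the two summaries."""
    total_1, total_2 = 0.0, 0.0
    for _ in range(n_samples):
        p1, p2 = sample_cot_preference()  # one CoT pass at non-zero temperature
        total_1 += p1
        total_2 += p2
    return total_1 / n_samples, total_2 / n_samples
```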
Reinforcement learning from AI feedback
After preferences have been labeled by the LLM, a reward model (RM) is trained to predict them. Because this method produces soft labels, the researchers use a cross-entropy loss on the softmax of the reward scores produced by the RM, rather than the loss usually used for reward modeling.
The softmax converts the RM's unbounded scores into a probability distribution.
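A minimal sketch of such a soft-label loss in PyTorch, under the assumption that the RM outputs one scalar score per candidate; the tensor names and shapes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def soft_label_rm_loss(reward_a, reward_b, soft_labels):
    """Cross-entropy between the softmax of the two reward scores and the
    soft AI preference labels.

    reward_a, reward_b: [batch] tensors of RM scores for the two candidates.
    soft_labels: [batch, 2] tensor of AI preference distributions.
    """
    scores = torch.stack([reward_a, reward_b], dim=-1)  # [batch, 2]
    log_probs = F.log_softmax(scores, dim=-1)           # unbounded scores -> log-probabilities
    return -(soft_labels * log_probs).sum(dim=-1).mean()

# Toy usage: the RM slightly prefers candidate A; the AI label is (0.8, 0.2).
loss = soft_label_rm_loss(torch.tensor([1.2]), torch.tensor([0.7]),
                          torch.tensor([[0.8, 0.2]]))
```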
Training the RM on AI-labeled data can be seen as a form of model distillation, especially since the AI labeler is usually larger and stronger than the RM.
Another approach is to bypass the RM and use the AI feedback directly as the reward signal in RL, although this is computationally more expensive because the AI labeler is larger than the RM.
Using the trained RM, the researchers perform reinforcement learning with a modified version of the Advantage Actor Critic (A2C) algorithm adapted to language modeling.
Evaluation
The researchers evaluated their results with three metrics: AI labeler alignment, pairwise accuracy, and win rate.
AI labeler alignment measures how accurate the AI's preference labels are relative to human preferences.
For a single example, the soft AI label is converted to a binary representation; it is assigned 1 if it matches the target human preference and 0 otherwise.
Pairwise accuracy measures how accurate a trained reward model is relative to a held-out set of human preferences.
Given a shared context and a pair of candidate responses, the pairwise accuracy is 1 if the RM scores the human-preferred candidate higher than the non-preferred one, and 0 otherwise. This value is averaged over many examples to measure the RM's overall accuracy.
Win rate evaluates the end-to-end quality of two policies by measuring how often humans prefer one policy's output over the other's.
Given an input and two generations, a human annotator chooses which generation they prefer. The percentage of instances in which policy A is preferred over policy B is called the win rate of A over B.
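The three metrics reduce to simple averages. The sketch below is an illustrative rendering under assumed conventions: human preferences are encoded as 0 (first candidate preferred) or 1 (second candidate preferred), and ties are ignored.

```python
def ai_labeler_alignment(soft_ai_labels, human_prefs):
    """Fraction of examples where the argmax of the soft AI label matches
    the human-preferred candidate (0 = first, 1 = second)."""
    hits = [int((p1 < p2) == bool(h)) for (p1, p2), h in zip(soft_ai_labels, human_prefs)]
    return sum(hits) / len(hits)

def pairwise_accuracy(rm_scores, human_prefs):
    """Fraction of held-out pairs where the RM scores the human-preferred
    candidate higher; rm_scores is a list of (score_first, score_second)."""
    hits = [int((s1 < s2) == bool(h)) for (s1, s2), h in zip(rm_scores, human_prefs)]
    return sum(hits) / len(hits)

def win_rate(judgments):
    """Fraction of head-to-head comparisons in which raters preferred
    policy A's output (1 = A wins, 0 = B wins)."""
    return sum(judgments) / len(judgments)
```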
Experimental details
The researchers used the filtered Reddit TL;DR dataset curated by OpenAI. TL;DR contains about 3 million Reddit posts covering a variety of topics (known as "subreddits"), together with summaries of the posts written by their original authors.
The data was also filtered by OpenAI to ensure high quality, including the use of a whitelist of subreddits that are understandable to the general public.
In addition, only posts whose summaries contain between 24 and 48 tokens were included. The filtered dataset contains 123,169 posts, about 5% of which are held out as a validation set.
More details about the dataset can be found in the original paper. In addition, OpenAI curated a human preference dataset from the filtered TL;DR dataset.
For a given post, two candidate summaries were generated under different policies, and annotators were asked to indicate which summary they preferred. The full dataset contains approximately 92k pairwise comparisons.
LLM labeling
To evaluate the effectiveness of the AI labeling techniques (such as prompting and self-consistency), the researchers selected examples from the TL;DR preference dataset in which human annotators preferred one summary with high confidence.
They evaluated AI labeler alignment on a random 15% subset of the dataset's train split to allow faster experimental iteration, yielding 2,851 evaluation examples.
For reward model training, the full train split of the TL;DR preference dataset was labeled by the LLM and used for training, regardless of confidence scores.
Model training
The researchers trained the SFT model on the OpenAI-filtered TL;DR dataset, using PaLM 2 Extra-Small (XS) as the initial checkpoint.
They then initialized the RMs from the SFT model and trained them on OpenAI's TL;DR human preference dataset.
For the results in Table 1 and Section 5.1, the researchers used PaLM 2 L to generate the AI-labeled preferences with the "OpenAI + COT 0-shot" prompt, without self-consistency, and then trained the RM on the fully labeled preference dataset.
For reinforcement learning, the researchers trained the policy with Advantage Actor Critic (A2C). Both the policy and value models were initialized from the SFT model. They rolled out the policy using the filtered Reddit TL;DR dataset as the initial states.
Human evaluation
The researchers collected 1,200 human ratings to evaluate the RLHF and RLAIF policies. For each rating task, an evaluator receives a post and four summaries generated by different policies (RLAIF, RLHF, SFT, and the human reference) and is asked to rank them in order of quality, without ties.
The posts are drawn from the held-out set of the TL;DR supervised fine-tuning dataset and are not used in any other evaluation. Once these rankings are collected, the win rate between any two policies can be computed.
A 50% win rate: RLAIF vs. RLHF ends in a draw
As mentioned at the beginning of the article, Google compared RLAIF directly with RLHF, and the results show that the two methods achieve similar performance.
Specifically, human evaluators preferred RLAIF over the SFT baseline in 71% of cases, and RLHF over SFT in 73% of cases.
The researchers also directly compared the win rates of RLAIF and RLHF and found that the two were equally preferred, that is, each had a 50% win rate against the other.
To further understand the differences between the two policies, Google qualitatively compared the summaries they generated.
In addition, they compared RLAIF- and RLHF-generated summaries with the human-written reference summaries. RLAIF summaries were better than the reference in 79% of cases, and RLHF summaries were better than the reference in 80% of cases.
The difference between RLAIF's and RLHF's win rates against the reference summaries is thus only 1%, which is not a significant difference.
Notably, the researchers also found that the RLHF policy tends to hallucinate more often than RLAIF, as in the highlighted examples in the paper.
After controlling for summary length, the RLAIF and RLHF policies still outperform the SFT baseline and achieve similar win rates.
These results show that RLAIF does not need to rely on human labeling and is a viable alternative to RLHF.
Prompting techniques
The Google team tried three types of prompting techniques: preamble specificity, CoT reasoning, and few-shot in-context learning.
They found that the AI labeler can reach 78% alignment with human preferences using the detailed OpenAI preamble together with CoT reasoning.
In-context learning, on the other hand, did not improve accuracy and may even have made it worse.
Self-consistency
The researchers ran self-consistency experiments with 4 and 16 samples at a decoding temperature of 1.
Sampling multiple chain-of-thought rationales at T = 1 produced results that were less aligned with human preferences.
Scale of the LLM labeler
They also found that scaling up the LLM labeler's parameter count may yield higher-quality preference labels.
Number of preference examples
How does the accuracy of the reward model change with the number of training examples?
The researchers found that after training on a few thousand examples, the reward model's performance approaches that of training on the complete dataset.
Conclusion
The researchers have demonstrated that RLAIF can produce improvements comparable to RLHF without relying on human annotators.
Although this work highlights the potential of RLAIF, it still has some limitations.
First, this study only examines the summarization task; generalization to other tasks requires further research.
Second, the researchers did not estimate whether LLM inference is more economical than human annotation.
In addition, there are some interesting open questions, such as whether combining RLHF with RLAIF can outperform either method alone, how effective it is to use an LLM to assign rewards directly, whether better AI labeler alignment translates into a better final policy, and whether using an AI labeler the same size as the policy model can further improve the policy (that is, whether a model can "improve itself").
Netizens have been discussing the fact that Google has published two papers on RL:
1. RLAIF: training a reward model that mimics human feedback
2. ReST: self-training driven by model-generated data
Combining these two papers could satisfy data-hungry AI algorithms.
Half a month ago, Google DeepMind proposed a new algorithm, ReST, to align large language models with human preferences.
Specifically, it uses an offline reinforcement learning approach to improve the translation quality of large language models so that it better matches human preferences.
One researcher noted that, in qualitative tests, Anthropic's Claude model appears to be weaker than GPT-4. This may be a result of the RLHF / RLAIF methods or of pre-training. It is not clear whether these methods generalize better in practical applications, even if they perform better on academic benchmarks.
I wouldn't say this reduces the importance of human labeling, but one thing is certain: RL with AI feedback can reduce costs. Human labeling is still extremely important for generalization, and a hybrid RLHF + RLAIF approach is better than either method alone.
Most netizens think the paper is a great breakthrough, but some feel there seems to be no essential difference from the RLAIF in Constitutional AI proposed by Anthropic a few months ago.
Reference:
https://arxiv.org/abs/2309.00267