Comparable to GPT-4 without RLHF: Meta releases LIMA 65B, whose performance soars with just 1,000 samples, and LeCun retweets his approval


RLHF is not that important after all: Meta's latest 65-billion-parameter model, LIMA, uses only 1,000 samples to achieve performance comparable to GPT-4.

Everyone knows that the secret weapon behind ChatGPT's dominance is reinforcement learning from human feedback (RLHF).

Now LIMA, a much-discussed new study from Meta AI and other institutions, challenges that assumption head-on, stating bluntly that RLHF is not that important.

As soon as the paper came out, it set the AI community abuzz.

Even LeCun couldn't help tweeting: LIMA: LLaMA-65B + 1,000 supervised samples = GPT-4 / Bard-level performance.

As the name suggests, LIMA stands for "Less Is More for Alignment": a strong pre-trained model needs only a handful of samples to produce high-quality results.

Specifically, LIMA fine-tunes LLaMA-65B on only 1,000 carefully selected samples and achieves performance comparable to GPT-4 and Bard, without any need for RLHF.

In the paper (https://arxiv.org/abs/2305.11206), the researchers attribute this result to what they call the "Superficial Alignment Hypothesis".

Experiments show that large language models acquire most of their knowledge during pre-training, and only a limited amount of instruction-tuning data is needed to teach the model to produce high-quality output.

Can high-quality data make up for a small sample size? How much does it cost to train such a model? Does this mean smaller LLM players can now compete with OpenAI and Google?

Some netizens also questioned the claim: if GPT-4 beats LIMA in 57% of cases, can the two really be called comparable?

RLHF is not king? Large language models are pre-trained to predict the next token at massive scale, which lets them learn general-purpose representations that can be transferred to almost any language understanding or generation task.

To enable this transfer, various methods for "aligning" language models have been proposed, mainly instruction tuning on datasets of millions of tokens.

More recently, reinforcement learning from human feedback (RLHF) has become widely used, with feedback collected from millions of interactions with human annotators.

ChatGPT's impressive performance is largely attributed to RLHF. In OpenAI's recipe, RLHF is divided into three steps: supervised fine-tuning on demonstrations, training a reward model on human comparisons, and optimizing the policy against that reward model with reinforcement learning.
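
A schematic sketch of that three-step recipe is below; the helper functions (load_demonstrations, train_supervised, load_comparisons, train_reward_model, optimize_with_rl) are hypothetical placeholders used for illustration, not any real library's API.

```python
# Schematic outline of the three-stage RLHF recipe described above.
# Every helper called here is a hypothetical placeholder, not a real API.

def rlhf_pipeline(base_model):
    # Step 1: supervised fine-tuning on human-written demonstrations.
    demos = load_demonstrations()                      # (prompt, ideal response) pairs
    sft_model = train_supervised(base_model, demos)

    # Step 2: train a reward model on human preference comparisons.
    comparisons = load_comparisons()                   # (prompt, preferred, rejected) triples
    reward_model = train_reward_model(sft_model, comparisons)

    # Step 3: optimize the SFT model against the reward model with RL (e.g. PPO).
    policy = optimize_with_rl(sft_model, reward_model)
    return policy
```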

However, these existing alignment methods are expensive, requiring large amounts of compute and specialized data to match ChatGPT-level performance.

Meta AI goes against the grain, showing that a pre-trained language model can achieve strong performance simply by fine-tuning on 1,000 carefully selected samples.

Here, the researchers propose the Superficial Alignment Hypothesis, which posits that "alignment" can be a simple process.

Under this hypothesis, a model's knowledge and capabilities are learned almost entirely during pre-training, while "alignment" merely teaches it the style or format to use when interacting with users.

To test this hypothesis, the Meta researchers and their collaborators curated 1,000 samples that approximate real user prompts paired with high-quality responses.

They manually selected examples from other research papers, wikiHow, Stack Exchange, and Reddit; the total training data amounts to about 750,000 tokens.

Sources of training prompts (inputs), responses (outputs), and test prompts

In addition, the researchers hand-wrote 250 prompt-response pairs themselves, optimizing for task diversity.

Finally, the researchers fine-tuned the pre-trained LLaMA-65B model on this 1,000-example set and conducted a human evaluation.
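
For intuition, here is a minimal sketch of this kind of supervised fine-tuning using Hugging Face transformers; the checkpoint id, data file name, and hyperparameters are illustrative assumptions, not Meta's actual training setup.

```python
# Minimal sketch: causal-LM fine-tuning of a pre-trained model on a small
# curated prompt-response set. Checkpoint, file name, and hyperparameters
# are placeholders for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "huggyllama/llama-65b"              # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token        # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1,000 curated prompt-response pairs; the file name is a placeholder.
data = load_dataset("json", data_files="lima_1000.jsonl")["train"]

def tokenize(example):
    # Concatenate prompt and response into a single training sequence.
    text = example["prompt"] + "\n" + example["response"]
    return tokenizer(text, truncation=True, max_length=2048)

tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lima-sft", num_train_epochs=15,
                           per_device_train_batch_size=1, learning_rate=1e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```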

Evaluation results: Meta compared LIMA against five baseline models (responses from all baselines were sampled in April 2023):

Alpaca 65B: a large model obtained by fine-tuning LLaMA-65B on 52,000 samples

DaVinci003: a large language model trained with RLHF

Bard: based on Google's PaLM model

Claude: a 52B-parameter model trained with reinforcement learning via Constitutional AI

GPT-4: currently the strongest model trained with RLHF

To compare LIMA with these state-of-the-art models, Meta generated a single response from each model for every test prompt.

Human annotators were then asked to compare LIMA's output with each baseline's output and mark which one they preferred.
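
As a toy illustration of how such pairwise judgments turn into the percentages reported below, here is a small tallying sketch; the judgments list is fabricated example data, not the paper's annotations.

```python
# Tally pairwise preference judgments between LIMA and one baseline.
# The judgments below are made-up example data, not the paper's annotations.
from collections import Counter

judgments = ["lima", "baseline", "tie", "lima", "baseline", "lima", "tie", "baseline"]

counts = Counter(judgments)
total = len(judgments)

# The paper reports "win or tie": e.g. if Bard is preferred 42% of the time,
# LIMA's response is at least as good the remaining 58% of the time.
win_or_tie = (counts["lima"] + counts["tie"]) / total
print(f"LIMA preferred or tied on {win_or_tie:.0%} of {total} prompts")
```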

In the human preference study, Alpaca 65B's outputs were often rated worse than LIMA's, even though it was trained on 52 times as much data.

Surprisingly, the same holds for DaVinci003, albeit to a lesser extent, even though that model was trained with RLHF, supposedly the superior alignment method.

Bard produced better answers than LIMA 42% of the time, which means that the remaining 58% of the time, LIMA's response was at least as good as Bard's.

Finally, the researchers found that although Claude and GPT-4 generally performed better than LIMA, LIMA actually produced better answers in some cases.

Ironically, in the preference study that used GPT-4 as the judge, GPT-4 preferred LIMA's output 19% of the time.

Meta frames this finding in terms of the Superficial Alignment Hypothesis.

It suggests that the so-called alignment phase after pre-training mainly teaches the model a specific style or format, which the model can then recall when interacting with users.

Therefore, "fine-tuning" is more about style than substance.

The LIMA results show that seemingly complex problems, such as aligning and fine-tuning AI models, can in fact be solved with simple methods.

This is in sharp contrast to particularly tedious and complex fine-tuning processes such as OpenAI's RLHF.

However, LIMA is not a panacea. Meta believes that this approach has two obvious limitations:

First, building datasets of high-quality examples is very challenging and difficult to scale.

Second, LIMA is not as powerful as existing product models, such as GPT-4.

The team said that although LIMA's generations are high quality in most cases, an adversarial prompt or an unlucky sample can still lead the model to an unsatisfactory answer.

Yann LeCun takes a pragmatic view of what this relative devaluation means for the effort behind GPT-4 and similar models.

He regards large language models as a transient element that, "without major changes," will not remain relevant, at least in the medium term.

The evaluation above measures LIMA against state-of-the-art models, but it should be noted that some of those models were trained using prompts from millions of real users.

As a complement, the researchers conducted an absolute assessment by manually analyzing 50 random examples.

Each example was labeled with one of three categories: Fail, the response does not meet the requirements of the prompt; Pass, the response meets them; and Excellent, the response to the prompt is excellent.

The results show that 50% of LIMA's responses were rated Excellent, and the model followed the prompt in 44 of the 50 analyzed cases.
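
As a small worked example, the following tally reproduces those figures from one label breakdown that is consistent with them; the 25/19/6 split is an assumption, and only the 50% and 44-of-50 totals come from the paper.

```python
# Reproduce the reported absolute-assessment statistics from per-example labels.
# The 25/19/6 breakdown is an assumed split consistent with the reported totals.
from collections import Counter

labels = ["Excellent"] * 25 + ["Pass"] * 19 + ["Fail"] * 6   # 50 analyzed examples
counts = Counter(labels)

excellent_rate = counts["Excellent"] / len(labels)           # 0.50 -> 50% Excellent
followed = counts["Excellent"] + counts["Pass"]              # 44 of 50 prompts followed
print(f"Excellent: {excellent_rate:.0%}, followed prompt: {followed}/{len(labels)}")
```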

Below: examples of LIMA's output for parenting advice and recipe generation.

Moreover, how does a model fine-tuned on only 1,000 samples perform in multi-turn dialogue?

In the zero-shot setting, LIMA's responses were surprisingly coherent and referenced information from earlier in the conversation. Still, in 3 out of 10 conversations, LIMA failed to follow the prompt.

To improve its dialogue ability, the researchers collected 30 multi-turn conversations: 10 written by the authors themselves and 20 taken from Stack Exchange and edited to fit an assistant style.

The researchers then fine-tuned the pre-trained model on the combined 1,030 examples to obtain a new version of LIMA and ran 10 live conversations with the same prompts.

The experiment showed that adding these 30 examples significantly improved generation quality, raising the proportion of high-quality responses from 45.2% to 76.1%.

Through ablation experiments, the LIMA team studied the effects of training-data diversity, quality, and quantity.

Meta found that for alignment purposes, improving input diversity and output quality had measurable positive effects, while increasing quantity alone did not.

Experimental setup: the team fine-tuned a 7-billion-parameter LLaMA model on various datasets, keeping the hyperparameters fixed.

They sampled five responses for each test-set prompt and assessed response quality by asking ChatGPT (GPT-3.5 Turbo) to rate the helpfulness of each response on a 1-6 Likert scale.
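
A rough sketch of this LLM-as-judge scoring with the OpenAI Python client is shown below; the judge prompt wording and score parsing are assumptions, not the paper's implementation.

```python
# Sketch of LLM-as-judge Likert scoring: ask GPT-3.5 Turbo to rate a response's
# helpfulness on a 1-6 scale. Prompt wording and parsing are illustrative only.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def likert_score(prompt: str, response: str) -> int:
    """Ask GPT-3.5 Turbo to rate helpfulness of a response on a 1-6 scale."""
    judge_prompt = (
        "Rate the helpfulness of the response to the prompt on a 1-6 Likert scale, "
        "where 1 is useless and 6 is excellent. Reply with the number only.\n\n"
        f"Prompt:\n{prompt}\n\nResponse:\n{response}"
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    match = re.search(r"[1-6]", reply.choices[0].message.content)
    return int(match.group()) if match else 1   # fall back to the lowest score

# Usage: average the scores of the five sampled responses for one prompt, e.g.
# scores = [likert_score(test_prompt, r) for r in five_responses]
```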

Diversity: to test the effect of diversity while controlling for quality and quantity, the team compared training on quality-filtered Stack Exchange data with training on wikiHow data.

Figure 5 shows that the more diverse Stack Exchange data significantly improves the model's performance.

Quality: to test the impact of response quality, the team sampled 2,000 examples from Stack Exchange without any quality or style filtering and compared a model trained on this dataset with one trained on the filtered dataset.

Figure 5 shows a 0.5-point difference between the models trained on filtered and unfiltered data.

Quantity: in many machine learning settings, increasing the number of examples is the go-to strategy for improving performance.

To test its effect, the team sampled exponentially growing training sets from Stack Exchange.

In fact, as Figure 6 shows, doubling the training set does not improve response quality.

This suggests that the scaling behavior of alignment is not governed by quantity alone; what matters more is increasing prompt diversity while maintaining high-quality responses.
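
A schematic sketch of this quantity ablation follows; train_model and mean_likert_score are hypothetical placeholders for the fixed-hyperparameter fine-tuning and ChatGPT-judge evaluation described above.

```python
# Schematic quantity ablation: train on exponentially larger Stack Exchange
# subsets and score each run. train_model() and mean_likert_score() are
# hypothetical placeholders, not a real API.
import random

def quantity_ablation(stack_exchange_examples, test_prompts, base_model):
    results = {}
    for size in [2_000, 4_000, 8_000, 16_000, 32_000]:    # exponentially growing sets
        subset = random.sample(stack_exchange_examples, size)
        model = train_model(base_model, subset)            # same hyperparameters each run
        results[size] = mean_likert_score(model, test_prompts)
    return results   # in the paper, scores plateau: more data alone does not help
```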

About the author: Chunting Zhou is a research scientist at Meta AI.

In May 2022, she received her PhD from the Language Technologies Institute at Carnegie Mellon University, where she worked on natural language processing under the supervision of Graham Neubig. Zhou's main interests lie at the intersection of natural language processing and machine learning, in particular developing methods that are robust to distribution shift so that learned models perform consistently across different groups.

In addition, Zhou studies modeling methods and their applications to natural language processing tasks.

Reference:

https://arxiv.org/abs/2305.11206

This article comes from the WeChat official account: Xin Zhiyuan (ID: AI_era)
