
Build a home version of GPT-4 with ease: Microsoft open-sources a fine-tuning instruction set that performs on par with the original, in both Chinese and English


Shulou (Shulou.com) 11/24 report --

Lack of data is no longer a problem: just use instructions generated by GPT-4 directly. Human annotators may be worried about their jobs!

"Instruction" is a key factor in the breakthrough of ChatGPT model, which can make the output of language model more in line with "human preference".

But labeling instructions takes a great deal of manpower, and even with open-source language models available, it is hard for underfunded academic institutions and small companies to train their own ChatGPT.

Recently, using the previously proposed Self-Instruct technique, Microsoft researchers made the first attempt to use the GPT-4 model to automatically generate instruction-tuning data for language models.

Paper link: https://arxiv.org/pdf/2304.03277.pdf

Code link: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM

Experiments on Meta's open-source LLaMA model show that the 52,000 English and Chinese instruction-following examples generated by GPT-4 lead to better performance than instruction data generated by previous state-of-the-art models. The researchers also collected feedback and comparison data from GPT-4 for comprehensive evaluation and reward-model training.

Training data collection

The researchers reused the 52,000 instructions from the Alpaca model released by Stanford University. Each instruction describes a task the model should perform. The same prompting strategy as Alpaca is followed, treating instructions with and without an input field as having an optional context for the task, and a large language model is used to generate answers to the instructions.

In the Alpaca dataset, the outputs were generated with GPT-3.5 (text-davinci-003); in this paper, the researchers instead used GPT-4 to generate the data, building the following four datasets (a generation sketch follows the list):

1. English instruction-following data: for each of the 52,000 instructions collected in Alpaca, a GPT-4 answer in English is provided.

Future work will follow an iterative Self-Instruct process, using GPT-4 to build an entirely new dataset.

2. Chinese instruction-following data: ChatGPT is used to translate the 52,000 instructions into Chinese, and GPT-4 is asked to answer them in Chinese, in order to build a Chinese instruction-following model based on LLaMA and to study the cross-language generalization of instruction tuning.

3. Comparison data: GPT-4 is asked to rate its own responses on a scale of 1 to 10; responses from the GPT-4, GPT-3.5, and OPT-IML models are also scored, and the scores are used to train a reward model.

4. Answers on unnatural instructions: GPT-4 answers are decoded on a dataset of 68,000 (instruction, input, output) triples, used to quantify the gap between GPT-4 and the instruction-tuned models at scale.
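To make the generation step concrete, here is a minimal sketch of how GPT-4 answers could be collected for the Alpaca instructions. It assumes the current OpenAI Python client; the prompt template mirrors Alpaca's published instruction/input format, and file names and the build_prompt helper are illustrative, not taken from the paper's code.

```python
# Minimal sketch: collect GPT-4 answers for Alpaca-style instructions.
# Assumes openai>=1.0 and an API key in OPENAI_API_KEY; file names and
# helper names are illustrative, not from the paper's repository.
import json
from openai import OpenAI

client = OpenAI()

def build_prompt(example: dict) -> str:
    # Alpaca-style prompt: the "input" field is optional context for the task.
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            f"completes the request.\n\n### Instruction:\n{example['instruction']}"
            f"\n\n### Input:\n{example['input']}\n\n### Response:"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request.\n\n### Instruction:\n"
        f"{example['instruction']}\n\n### Response:"
    )

with open("alpaca_instructions.json") as f:   # the 52,000 Alpaca instructions
    examples = json.load(f)

dataset = []
for ex in examples:
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(ex)}],
    )
    dataset.append({**ex, "output": reply.choices[0].message.content})

with open("gpt4_instruction_data.json", "w") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```

The same loop, with a translation request to ChatGPT first, would produce the Chinese dataset described in item 2.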

Data statistics

The researchers compared the English output sets of GPT-4 and GPT-3.5: for each output, the root verb and the direct-object noun were extracted, and the frequency of unique verb-noun pairs was computed over each output set.

[Figure: verb-noun pairs with frequency above 10]

[Figure: the 25 most frequent verb-noun pairs]
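This verb-noun statistic can be approximated with an off-the-shelf dependency parser. Below is a rough sketch using spaCy (the paper does not specify its exact parsing setup, so treat this as an illustration):

```python
# Rough sketch: count (root verb, direct-object noun) pairs over a set of
# outputs using spaCy's dependency parse. Assumes:
#   pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_noun_pair(text: str):
    sent = next(iter(nlp(text).sents), None)   # use the first sentence
    if sent is None or sent.root.pos_ != "VERB":
        return None
    for child in sent.root.children:           # look for a direct object
        if child.dep_ == "dobj":
            return (sent.root.lemma_, child.lemma_)
    return None

outputs = ["Write a short poem about autumn.", "Summarize the given article."]
pairs = Counter(p for p in map(verb_noun_pair, outputs) if p is not None)
print(pairs.most_common(25))                   # e.g. the 25 most frequent pairs
```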

A comparison of the output-sequence length distributions shows that GPT-4 tends to generate longer sequences than GPT-3.5, and that the long tail of the GPT-3.5 Alpaca data is more pronounced than GPT-4's output distribution. This may be because the Alpaca dataset was built through an iterative data-collection process that removed similar instruction instances at each iteration, which the current one-time data generation does not do.

Although the process is simple, the instruction-following data generated by GPT-4 shows stronger alignment performance.

Instruction-tuning language models

Self-Instruct tuning: the researchers obtained two models by supervised fine-tuning from the LLaMA 7B checkpoint. LLaMA-GPT4 is trained on the 52,000 English instruction-following examples generated by GPT-4, and LLaMA-GPT4-CN is trained on the 52,000 Chinese instruction-following examples from GPT-4.

The two models are used to study the data quality of GPT-4 and the cross-language generalization of LLMs instruction-tuned in a single language.
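Supervised fine-tuning of this kind follows the standard causal-language-modeling recipe. A minimal sketch with the Hugging Face stack is shown below; the checkpoint name, data file, and hyperparameters are placeholders, not the paper's actual settings.

```python
# Minimal supervised fine-tuning sketch (Hugging Face Transformers).
# Checkpoint, data file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "huggyllama/llama-7b"             # placeholder LLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Records with "instruction" and "output" fields, e.g. from the earlier sketch.
data = load_dataset("json", data_files="gpt4_instruction_data.json")["train"]

def tokenize(example):
    # Alpaca-style prompt (the optional "input" field is omitted for brevity).
    prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    text = prompt + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-gpt4", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```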

Reward model

Reinforcement Learning from Human Feedback (RLHF) aims to align LLM behavior with human preferences, making the output of the language model more useful to humans.

A key component of RLHF is reward modeling, which can be formulated as a regression task: predict a reward score for a given prompt and reply. This approach usually requires large-scale comparison data, i.e., responses from two models to the same prompt are compared against each other.

Existing open-source models such as Alpaca, Vicuna, and Dolly do not use RLHF because labeling comparison data is expensive, while recent research shows that GPT-4 can identify and fix its own errors and accurately judge the quality of responses.

To facilitate research on RLHF, the researchers used GPT-4 to create comparison data; to assess the quality of the data, they trained a reward model based on OPT 1.3B to rate different responses: for one prompt and K responses, GPT-4 assigns each reply a score between 1 and 10.
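A sketch of how such scores could be turned into reward-model training signal follows. The scalar-head wrapper and log-sigmoid ranking loss follow the common RLHF recipe rather than the paper's exact code; the prompt and replies are illustrative.

```python
# Sketch: convert GPT-4's 1-10 scores for K replies into pairwise comparisons
# and train an OPT-1.3B-based reward model with a ranking loss. Details here
# follow the common RLHF recipe and are illustrative, not the paper's code.
from itertools import combinations
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(torch.nn.Module):
    def __init__(self, base_name="facebook/opt-1.3b"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.head = torch.nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids, attention_mask=attention_mask)
        # Scalar reward read off the last token's hidden state.
        return self.head(out.last_hidden_state[:, -1]).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
rm = RewardModel()

prompt = "Explain instruction tuning in one sentence."          # illustrative
scored = [("Reply A ...", 9), ("Reply B ...", 6), ("Reply C ...", 4)]

pairs = []
for (a, sa), (b, sb) in combinations(scored, 2):
    if sa != sb:                                  # ties carry no signal
        chosen, rejected = (a, b) if sa > sb else (b, a)
        pairs.append((chosen, rejected))

def score(reply: str) -> torch.Tensor:
    enc = tokenizer(prompt + "\n" + reply, return_tensors="pt", truncation=True)
    return rm(enc["input_ids"], enc["attention_mask"])

# Ranking loss: the higher-scored reply should out-score the lower-scored one.
loss = torch.stack([-F.logsigmoid(score(c) - score(r)) for c, r in pairs]).mean()
loss.backward()
```

With K replies per prompt, every pair with unequal scores yields one comparison, so one prompt can contribute up to K(K-1)/2 training examples.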

Experimental results

Evaluating the performance of self-instruct-tuned models on tasks never seen before remains a difficult problem.

Since the main goal is to evaluate the model's ability to understand and follow diverse task instructions, the researchers used three types of evaluation. The results show that, compared with data automatically generated by other models, using GPT-4-generated data is an effective method for instruction-tuning large language models.

Human evaluation

To assess the alignment quality of instruction-tuned large language models, the researchers followed previously proposed alignment criteria: an assistant is aligned if it is helpful, honest, and harmless (HHH). These criteria are widely used to assess how consistent an AI system is with human values.

Helpfulness: whether the model helps humans achieve their goals; a model that answers questions accurately is helpful.

Honesty: whether the model provides true information and expresses its uncertainty when necessary to avoid misleading users; a model that provides false information is not honest.

Harmlessness: whether the model avoids causing harm to humans; a model that produces hate speech or advocates violence is not harmless.

Based on the HHH alignment criteria, the researchers used the crowdsourcing platform Amazon Mechanical Turk to manually evaluate the models' generations.

The two models under comparison were fine-tuned on data generated by GPT-4 and GPT-3, respectively. On helpfulness, LLaMA-GPT4 wins 51.2% of the judgments, far better than the GPT-3-fine-tuned Alpaca (19.74%); on honesty and harmlessness the two are essentially tied, with GPT-3 slightly ahead.

Compared with the original GPT-4, the two are quite consistent across all three criteria; that is, LLaMA tuned on GPT-4 instructions performs similarly to the original GPT-4.

GPT-4 automatic evaluation

Inspired by Vicuna, the researchers also used GPT-4 to evaluate the quality of answers generated by different chatbot models on 80 unseen questions. Responses were collected from the LLaMA-GPT-4 (7B) and GPT-4 models, answers from other models were taken from previous studies, and GPT-4 was then asked to rate the quality of the two models' responses on a scale of 1 to 10. The results were compared with other strong competitors (ChatGPT and GPT-4).
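This judging step amounts to a scoring prompt. A minimal sketch is shown below; the prompt wording and score parsing are illustrative approximations of the Vicuna-style template, not the paper's exact setup.

```python
# Minimal GPT-4-as-judge sketch: score two answers to one question, 1-10 each.
# Prompt wording and parsing are illustrative, not the paper's exact template.
import re
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str) -> tuple[float, float]:
    prompt = (
        "You are a helpful and precise assistant for checking the quality "
        "of answers.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant 1]\n{answer_a}\n\n"
        f"[Assistant 2]\n{answer_b}\n\n"
        "Rate the helpfulness, relevance, accuracy, and level of detail of "
        "each answer. On the first line output two scores from 1 to 10 "
        "separated by a space, then explain your reasoning."
    )
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    s1, s2 = re.findall(r"\d+(?:\.\d+)?", reply)[:2]   # parse the two scores
    return float(s1), float(s2)
```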

The evaluation results show that feedback data and the reward model are effective in improving LLaMA's performance. LLaMA instruction-tuned with GPT-4 often outperforms both text-davinci-003 tuning (i.e., Alpaca) and no tuning (i.e., LLaMA): 7B LLaMA-GPT4 outperforms 13B Alpaca and LLaMA, but a gap remains compared with large commercial chatbots such as GPT-4.

To further study the performance of the Chinese chatbot, the researchers first used GPT-4 to translate the chatbot questions from English into Chinese and then used GPT-4 to obtain the answers, which yields two interesting observations:

1. The relative scores given by GPT-4 are quite consistent, both across different opponent models (ChatGPT or GPT-4) and across languages (English or Chinese).

2. As far as GPT-4's own results are concerned, the translated responses perform better than responses generated directly in Chinese, probably because GPT-4 was trained on a richer English corpus than Chinese and therefore has stronger English instruction-following ability.

Unnatural instruction evaluation

In terms of average ROUGE-L score, Alpaca outperforms LLaMA-GPT4 and GPT-4. Notably, LLaMA-GPT4 and GPT-4 gradually improve as the length of the ground-truth reply increases, eventually showing higher performance once the length exceeds 4, which means they follow instructions better in more creative scenarios.

Across the different subsets, LLaMA-GPT4 behaves much like GPT-4. When the sequence length is short, both LLaMA-GPT4 and GPT-4 can generate responses that contain the simple ground-truth answer, but they add extra words to make the reply more conversational, which may lower the ROUGE-L score.
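For reference, ROUGE-L measures the longest common subsequence (LCS) between a candidate and a reference, which is why padding a correct short answer with extra chat-like words drags the score down. A self-contained version of the F-score computation:

```python
# Self-contained ROUGE-L (LCS-based F-score) over whitespace tokens.
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic rolling dynamic program for longest common subsequence length.
    dp = [0] * (len(b) + 1)
    for tok in a:
        prev = 0
        for j, btok in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if tok == btok else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    # Standard LCS-based F-measure (Lin, 2004).
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

print(rouge_l("the cat sat on the mat", "the cat is on the mat"))  # ≈ 0.83
```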

Reference:

https://arxiv.org/pdf/2304.03277.pdf

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era).
