
To improve the performance of language models, Google has come up with a new idea.


Shulou (Shulou.com) 11/24 report --

Language models keep getting stronger and bigger. How can performance be improved without scaling the model up even further? Google AI has come up with a good answer: combine two strong techniques.

In recent years, language models (LMs) have become more prominent in natural language processing (NLP) research and increasingly influential in practice. In general, increasing the size of a model has been shown to improve performance across a range of NLP tasks.

However, the challenges of scaling up are equally obvious: training a new, larger model requires enormous computing resources. In addition, new models are often trained from scratch and cannot reuse the trained weights of previous models.

In response, Google researchers have explored two complementary methods for significantly improving the performance of existing language models without expending large amounts of additional compute.

First, in the article "Transcending Scaling Laws with 0.1% Extra Compute", the researchers introduce UL2R, a lightweight second stage of pre-training that uses a mixture-of-denoisers objective. UL2R improves performance across a range of tasks and even unlocks emergent performance on tasks where the model previously performed close to random.

Paper link: https://arxiv.org/pdf/2210.11399.pdf

Second, in "Scaling Instruction-Finetuned Language Models", we discuss fine-tuning language models on datasets phrased as instructions, a process we call "Flan". This method not only improves performance but also makes the language model more usable on user input.

Paper link: https://arxiv.org/abs/2210.11416

Finally, Flan and UL2R can be combined as complementary techniques in a model called Flan-U-PaLM 540B, which outperforms the unadapted PaLM 540B model by 10% across a suite of challenging evaluation benchmarks.

UL2R training

Traditionally, most language models are pre-trained either on a causal language modeling objective, where the model learns to predict the next word in a sequence (e.g., GPT-3 or PaLM), or on a denoising objective, where the model learns to recover the original sentence from a corrupted sequence of words (e.g., T5).

Although there are trade-offs between language modeling objectives (causal language models perform better at long-form generation, while LMs trained on denoising objectives perform better after fine-tuning), prior work has shown that a mixture-of-denoisers objective that includes both can achieve better performance in both settings.
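To make the contrast concrete, here is a minimal sketch, in toy Python, of how a single sentence yields training pairs under each objective. The tokenization and the sentinel token are simplified assumptions, not the actual GPT-3/PaLM/T5 pipelines.

```python
import random

SENTINEL = "<extra_id_0>"  # T5-style sentinel; the real vocabulary differs

def causal_lm_example(tokens):
    """Causal LM (GPT-3 / PaLM style): predict each next token."""
    # Inputs are all tokens but the last; targets are shifted by one.
    return tokens[:-1], tokens[1:]

def denoising_example(tokens, span_len=2):
    """Denoising (T5 style): recover a masked span from its context."""
    start = random.randrange(0, len(tokens) - span_len)
    corrupted = tokens[:start] + [SENTINEL] + tokens[start + span_len:]
    target = [SENTINEL] + tokens[start:start + span_len]
    return corrupted, target

tokens = "the quick brown fox jumps over the lazy dog".split()
print(causal_lm_example(tokens))
print(denoising_example(tokens))
```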

However, pre-training a large language model from scratch on a different objective is computationally expensive. Therefore, we propose UL2 Repair (UL2R), an additional stage of continued pre-training on the UL2 objective that requires relatively little compute.

We apply UL2R to PaLM and call the resulting new language model U-PaLM.
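The sketch below illustrates the UL2R idea under stated assumptions: rather than training from scratch, an existing checkpoint keeps training for a short stage on a mixture of denoising tasks, each example tagged with a mode token. The denoiser configurations and helper names are illustrative, not the actual PaLM/UL2R implementation.

```python
import random

def span_corrupt(tokens, span_len):
    """Mask one contiguous span; the target is the masked span."""
    start = random.randrange(0, max(1, len(tokens) - span_len))
    inputs = tokens[:start] + ["<mask>"] + tokens[start + span_len:]
    return inputs, tokens[start:start + span_len]

def prefix_lm(tokens):
    """Sequential denoising: see a prefix, predict the continuation."""
    cut = len(tokens) // 2
    return tokens[:cut], tokens[cut:]

# UL2 mixes denoiser families: regular spans [R], extreme corruption [X],
# and sequential prefix-LM [S]; the mode token tells the model which one.
DENOISERS = [
    ("[R]", lambda t: span_corrupt(t, span_len=2)),
    ("[X]", lambda t: span_corrupt(t, span_len=max(2, len(t) // 2))),
    ("[S]", prefix_lm),
]

def ul2r_example(tokens):
    """Sample one denoiser per example and prepend its mode token."""
    mode, corrupt = random.choice(DENOISERS)
    inputs, targets = corrupt(tokens)
    return [mode] + inputs, targets

tokens = "continued pre-training adapts an existing checkpoint cheaply".split()
print(ul2r_example(tokens))
```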

In empirical evaluations, we find that with only a small amount of UL2R training, the model improves substantially.

For example, by applying UL2R to an intermediate checkpoint of PaLM 540B, we can match the performance of the final PaLM 540B checkpoint while using only half the compute. Applying UL2R to the final PaLM 540B checkpoint also brings large improvements.

Compute versus model performance for PaLM 540B and U-PaLM 540B across 26 NLP benchmarks: U-PaLM 540B continues training PaLM with a very small amount of extra compute but achieves a large gain in performance.

Another benefit of UL2R is that the resulting model performs much better on some tasks than models trained purely on the causal language modeling objective. For example, many BIG-Bench tasks exhibit so-called "emergent abilities", that is, capabilities that appear only in sufficiently large language models.

Although the most common way to surface emergent abilities is to increase model size, UL2R can actually elicit them without any increase in scale.

For example, on the BIG-Bench Navigate task, which measures a model's ability to track state, all models with fewer than 10^23 training FLOPs achieve roughly random performance, except for U-PaLM. Another example is the BIG-Bench Snarks task, which measures a model's ability to detect sarcasm.

For these two BIG-Bench capabilities, which demonstrate emergent task performance, U-PaLM reaches emergent performance at a smaller model scale because of its use of the UL2R objective.

Instruction fine-tuning

In the second paper, we discuss instruction fine-tuning, which involves fine-tuning an LM on a collection of NLP datasets phrased as instructions.

In previous work, we applied instruction fine-tuning to a 137B-parameter model on 62 NLP tasks, such as answering a trivia question, classifying the sentiment of a movie review, or translating a sentence into Spanish.
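As a rough illustration of what "phrased as instructions" means, here is a minimal sketch of turning ordinary (input, label) pairs into instruction-style examples. The templates are simplified assumptions, not the released Flan templates.

```python
def to_instruction_example(task, text, target):
    """Rewrite a (task, input, label) triple as an instruction + answer."""
    templates = {
        "sentiment": "Classify the sentiment of this movie review "
                     "as positive or negative: {text}",
        "translation": "Translate the following sentence to Spanish: {text}",
    }
    return {"input": templates[task].format(text=text), "target": target}

print(to_instruction_example(
    "sentiment", "An utterly charming film.", "positive"))
print(to_instruction_example(
    "translation", "The weather is nice today.", "El clima es agradable hoy."))
```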

In this work, we fine-tune a 540B-parameter language model on more than 1.8K tasks. Moreover, whereas previous work fine-tuned only with few-shot exemplars (e.g., MetaICL) or only zero-shot without exemplars (e.g., FLAN, T0), we fine-tune on a combination of both.

We also include chain-of-thought fine-tuning data, which enables the model to perform multi-step reasoning. We call our improved method "Flan", for fine-tuning language models.
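The sketch below shows, under simplified assumptions, how one training example can be rendered in the four formats just described: zero-shot or few-shot, each with or without a chain of thought. The templates and the worked example are illustrative, not the actual Flan data.

```python
QUESTION = "Roger has 5 balls and buys 2 cans of 3 balls each. How many balls?"
COT = "Roger starts with 5 balls. Two cans of 3 balls is 6 balls. 5 + 6 = 11."
ANSWER = "11"
EXEMPLAR = ("Q: There are 3 cars and 2 more arrive. How many cars?\n"
            "A: 3 + 2 = 5. The answer is 5.")

def format_example(question, answer, cot=None, exemplar=None):
    """Build one fine-tuning example in the requested format."""
    prompt = (exemplar + "\n\n" if exemplar else "") + "Q: " + question + "\nA:"
    target = (cot + " The answer is " + answer + ".") if cot else answer
    return {"input": prompt, "target": target}

# All four combinations: (zero-shot, few-shot) x (with, without chain-of-thought)
for cot in (None, COT):
    for exemplar in (None, EXEMPLAR):
        print(format_example(QUESTION, ANSWER, cot=cot, exemplar=exemplar))
```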

It is worth noting that even though Flan fine-tunes on 1.8K tasks, it uses only a small amount of compute compared to pre-training (for PaLM 540B, Flan requires only 0.2% of the pre-training compute).

We fine-tune the language model on 1.8K tasks phrased as instructions and evaluate it on unseen tasks that are not included in fine-tuning. We fine-tune both with and without exemplars (i.e., zero-shot and few-shot) and both with and without chain-of-thought, so that the model generalizes across a range of evaluation scenarios.

In this paper, we instruction-fine-tune a range of LMs in order to study the joint effect of scaling the size of the language model and increasing the number of fine-tuning tasks.

For example, we instruction-fine-tune PaLM-family language models at the 8B, 62B, and 540B parameter scales. We evaluate each model on four challenging benchmark suites (MMLU, BBH, TyDiQA, and MGSM) and find that scaling both the number of parameters and the number of fine-tuning tasks improves performance on unseen tasks.

Scaling the model to 540B parameters and fine-tuning on 1.8K tasks both improve performance. The y-axis in the figure above is the normalized average over the four evaluation suites (MMLU, BBH, TyDiQA, and MGSM).
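As a hedged sketch of what a "normalized average" over heterogeneous suites can look like, the snippet below rescales each suite's score against an assumed random-guess baseline before averaging. The baselines and scores are illustrative placeholders; the paper's exact normalization may differ.

```python
def normalized_average(scores, baselines):
    """Macro-average of per-suite scores rescaled above a chance baseline."""
    normed = [(scores[s] - baselines[s]) / (100.0 - baselines[s])
              for s in scores]
    return 100.0 * sum(normed) / len(normed)

# Placeholder numbers for illustration only (not reported results).
scores = {"MMLU": 70.0, "BBH": 55.0, "TyDiQA": 60.0, "MGSM": 50.0}
baselines = {"MMLU": 25.0, "BBH": 0.0, "TyDiQA": 0.0, "MGSM": 0.0}
print(f"normalized average: {normalized_average(scores, baselines):.1f}")
```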

In addition to better performance, an instruction-fine-tuned LM can respond to user instructions at inference time without few-shot exemplars or prompt engineering, which makes the LM more user-friendly across a range of inputs. For example, an LM without instruction fine-tuning may repeat the input or fail to follow instructions; instruction fine-tuning mitigates these errors.

Our instruction-fine-tuned language model, Flan-PaLM, responds to instructions better than the PaLM model without instruction fine-tuning.

Finally, we show that UL2R and Flan can be combined to train the Flan-U-PaLM model.

Because Flan uses newer data from NLP tasks and enables zero-shot instruction following, we apply Flan as the second stage, after UL2R.

We evaluate on the four benchmark suites again and find that the Flan-U-PaLM model outperforms PaLM models with only UL2R (U-PaLM) or only Flan (Flan-PaLM). Moreover, when combined with chain-of-thought prompting and self-consistency, Flan-U-PaLM reaches a new SOTA of 75.4% on the MMLU benchmark.
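For readers unfamiliar with self-consistency, here is a minimal sketch of the decoding trick just mentioned: sample several chain-of-thought completions and take a majority vote over their final answers. The `sample_cot_answer` stub stands in for a real model call and is purely illustrative.

```python
import random
from collections import Counter

def sample_cot_answer(question):
    """Stub for sampling one chain-of-thought completion from the LM."""
    # A real system would decode with temperature > 0 and parse the final
    # answer out of the generated reasoning; this toy version just samples
    # from a made-up answer distribution.
    return random.choice(["11", "11", "11", "9"])

def self_consistency(question, n_samples=16):
    """Majority vote over independently sampled reasoning paths."""
    votes = Counter(sample_cot_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistency("Roger has 5 balls and buys 2 cans of 3. How many?"))
```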

Combining UL2R with Flan (Flan-U-PaLM) yields the best performance compared with using only UL2R (U-PaLM) or only Flan (Flan-PaLM): the normalized average over the four evaluation suites (MMLU, BBH, TyDiQA, and MGSM).

Overall, UL2R and Flan are two complementary methods for improving pre-trained language models. UL2R adapts the LM to a mixture-of-denoisers objective using the same data, while Flan uses training data from more than 1.8K NLP tasks to teach the model to follow instructions.

As language models become larger, techniques like UL2R and Flan, which improve general performance without large amounts of extra compute, may become increasingly attractive.

Reference:

https://ai.googleblog.com/2022/11/better-language-models-without-massive.html

This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era); editor: David.
