
1.5 billion parameters! The birth of the strongest general-purpose NLP model in history: new best records on 7 major datasets


Source: OpenAI

Reproduced from: Xin Zhiyuan; may not be republished without permission.

[Introduction] The strongest "general-purpose" NLP model in history: OpenAI recently introduced on its official blog a large-scale unsupervised NLP model it has trained, which can generate coherent paragraphs of text, sets new records on 7 major datasets, and, without any task-specific training, completes many different language tasks such as reading comprehension, question answering, and machine translation.

OpenAI introduced its new NLP model on its official blog today. It refreshes the SOTA (current best results) on 7 major datasets and can directly perform basic NLP tasks such as reading comprehension, machine translation, question answering, and text summarization without any training on task-specific, domain-related data.

Completing a variety of different tasks with good results and without task-specific training effectively amounts to overcoming "catastrophic forgetting"; this is simply the "universal" model that deep learning researchers dream of!

If Google's BERT ushered NLP into a new era of pre-trained models, OpenAI uses this result to show that, with extraordinary data and compute, previously unimaginable things become possible.

Take compute, for example: according to Smertiy, who has participated in OpenAI's reinforcement learning research, the new model was trained on 256 Google TPU v3 devices (no training time was disclosed), at a price of $2,048 per hour.

The strongest "general-purpose" NLP model in history: 1.5 billion parameters trained on 40GB of web data

OpenAI's NLP model is based on the Transformer architecture, has 1.5 billion parameters, and was trained on a dataset of 8 million web pages with a single objective:

Given the text so far, predict what the next word will be.

The new model is named GPT-2. It is a direct scale-up of GPT, the unsupervised NLP model OpenAI released last year, with more than 10 times the parameters and more than 10 times the training data.

Because the model has enough capacity and the 40GB of web text provides enough training data, GPT-2 can complete a variety of NLP tasks simply by "predicting the next word", showing strong generalization ability.
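To make this single objective concrete, here is a minimal sketch (not OpenAI's code) of the next-word prediction loss; `lm` stands in for any model, such as a Transformer, that maps token ids to logits over the vocabulary.

```python
# Minimal sketch of the sole training objective: maximize the probability
# of each token given everything that comes before it.
import torch.nn.functional as F

def next_token_loss(lm, token_ids):
    """`lm` maps a [batch, seq] tensor of token ids to [batch, seq, vocab]
    logits; `token_ids` is one batch of tokenized web text."""
    inputs = token_ids[:, :-1]    # the text seen so far
    targets = token_ids[:, 1:]    # the "next word" at every position
    logits = lm(inputs)           # [batch, seq-1, vocab]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```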

Today, the mainstream way of building machine learning systems is supervised learning: collect data, that is, feed the model "ideal" input-output pairs so that it imitates the pattern and produces similar results on new test data. This approach performs well on tasks within a specific domain, but its drawback is that once the task changes, for example when a model that does well on a question-answering dataset is applied to reading comprehension, the model cannot adapt; in other words, its generalization ability is very poor.

In this regard, OpenAI's researchers boldly conjecture that current machine learning systems generalize poorly precisely because models are confined to domain-specific datasets and trained on single, specific tasks.

At the same time, existing multi-task research shows that it is hard to scale to many tasks simply by increasing the number of training samples; NLP researchers are increasingly using transfer learning built on self-attention modules to construct multi-task models.

Therefore, OpenAI's researchers combined these two ideas: apply transfer learning with self-attention modules on top of a more general dataset, yielding a model that can perform many different NLP tasks zero-shot, without adjusting any parameters or the model architecture. That model is GPT-2.

In view of its strength and the risk of possible abuse, OpenAI did not release the full GPT-2 model and code, publishing only a smaller sample model and code with 117M parameters for interested researchers to study: https://github.com/openai/gpt-2
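For readers who want to try the released small model, the sketch below samples from it using the third-party Hugging Face transformers library, whose "gpt2" checkpoint corresponds to this small released model; this is one convenient route rather than OpenAI's own usage instructions, which live in the repository above.

```python
# Sketch: sampling from the released small GPT-2 model via the third-party
# Hugging Face `transformers` library (the "gpt2" checkpoint is that model).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "In a shocking finding, scientists discovered"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# The model just keeps predicting the next token to continue the prompt.
output = model.generate(input_ids, max_length=60, do_sample=True, top_k=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```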

OpenAI did not elaborate on GPT-2's exact model structure this time, setting aside half a year to solicit feedback from the research community. In the published paper, "Language Models are Unsupervised Multitask Learners", OpenAI's researchers describe the ideas and methods behind the model.

As for the exact compute, the paper does not mention it. According to the Twitter figures above, the model was trained on 256 Google TPU v3 devices, although the training time was not announced. Outside Google, TPU v3 is available only as individual devices at $8 per device-hour (although OpenAI may have a special arrangement), which works out to 8 * 256 = $2,048 per hour.

Next, let's look at how OpenAI presents its results; you can also go straight to the paper, "Language Models are Unsupervised Multitask Learners", for the details.

Without task-specific training, new best records on 7 of 8 datasets

We have trained and benchmarked four language models, and their sizes are shown in the following table:

Architecture and hyperparameters of 4 model sizes

Among them, the smallest model is equivalent in size to the original GPT, and the second smallest is equivalent to the largest BERT model. Our largest model, GPT-2, has more than an order of magnitude more parameters than GPT.
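The table itself did not survive reproduction here; as a rough sketch, the four configurations described in the GPT-2 paper are listed below in code form, and the exact values should be checked against the paper's table.

```python
# Sketch of the four model sizes described in the GPT-2 paper
# (parameter count, Transformer layers, hidden size); verify against the paper.
GPT2_CONFIGS = [
    {"parameters": "117M",  "layers": 12, "d_model": 768},   # ~original GPT
    {"parameters": "345M",  "layers": 24, "d_model": 1024},  # ~BERT-Large
    {"parameters": "762M",  "layers": 36, "d_model": 1280},
    {"parameters": "1542M", "layers": 48, "d_model": 1600},  # GPT-2
]
```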

GPT-2 achieves state-of-the-art results on a variety of domain-specific language modeling tasks. Our model is not trained on any data specific to these tasks; it is only evaluated on them as a final test. This is the setting known as "zero-shot".

When evaluated on those same datasets, GPT-2 outperforms models trained on domain-specific data (such as Wikipedia, news, or books).

The following table shows all our state-of-the-art zero-shot results.

(+) means a higher score is better; (-) means a lower score is better.

GPT-2 achieves SOTA results on these datasets.

GPT-2 achieves state-of-the-art results on Winograd Schema, LAMBADA, and other language modeling tasks.

Zero-shot results of the four model sizes on each dataset.

As you can see, WebText LMs transfer well across domains and datasets, improving the state of the art on 7 of the 8 datasets in the zero-shot setting.

You can see significant improvements on small datasets such as Penn Treebank and WikiText-2, which have only 1 to 2 million training tokens. There are also significant improvements on datasets that measure long-term dependencies, such as LAMBADA and the Children's Book Test.

Our model is still significantly worse than prior work on the One Billion Word Benchmark. This is probably because it is both the largest dataset and has some of the most destructive preprocessing: 1BW's sentence-level shuffling removes all long-range structure.

Other tasks: question answering, reading comprehension, summarization, and translation

On other language tasks, such as question answering, reading comprehension, summarization, and translation, we achieved excellent results without any fine-tuning of the model, simply by prompting the trained model in the right way (the examples below show the specific practice and results), although on these tasks we still fall short of the SOTA of specialized systems.
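As a rough illustration of what "prompting the model in the right way" means, the sketch below assembles two such prompts as plain text for the model to continue; the exact strings are illustrative assumptions, except "TL;DR:", which the paper reports using to induce summaries.

```python
# Rough illustration of zero-shot prompting: each task is rewritten as plain
# text and the model simply continues it (prompt strings are illustrative).

def reading_comprehension_prompt(passage, qa_history, question):
    """Passage, previous Q/A turns, then the new question with a bare 'A:';
    the model's continuation after 'A:' is read off as its answer."""
    lines = [passage]
    for q, a in qa_history:
        lines += [f"Q: {q}", f"A: {a}"]
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)

def summarization_prompt(article):
    """Appending 'TL;DR:' nudges the model to continue with a summary."""
    return article + "\nTL;DR:"
```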

1. Reading comprehension: answer questions about a given paragraph

Dataset: CoQA

Example

The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer Olympics, with the theme of "one world, one dream". Plans for the relay were announced on April 26, 2007, in Beijing, China. The relay, also called by the organizers as the "Journey of Harmony", lasted 129 days and carried the torch 137000 km (85000 mi)-the longest distance of any Olympic torch relay since the tradition was started ahead of the 1936 Summer Olympics.

After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch traveled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the event.

Q: What was the theme?

A: one world, one dream.

Q: What was the length of the race?

A: 137000 km

Q: Was it larger than previous ones?

A: No

Q: Where did the race begin?

A: Olympia, Greece

Q: Is there anything notable about that place?

A: birthplace of Olympic Games

Q: Where did they go after?

A: Athens

Q: How many days was the race?

A: seven

Q: Did they visit any notable landmarks?

A: Panathinaiko Stadium

Q: And did they climb any mountains?

A:

Target answers: unknown or yes

Model answer: Everest

Performance:

On reading comprehension, GPT-2 performs worse than DrQA+PGNet, and far below human level.

2. Common sense reasoning: resolving ambiguous pronouns

Dataset: Winograd Schema Challenge

Example

The trophy doesn't fit into the brown suitcase because it is too large.

Correct answer: it = trophy

Model answer: it = trophy

The trophy doesn't fit into the brown suitcase because it is too small.

Correct answer: it = suitcase

Model answer: it = suitcase

Performance:

On common sense reasoning, GPT-2 outperforms the previous SOTA.

3. Question answering

Dataset: Natural Questions

Example

Who wrote the book the origin of species?

Correct answer: Charles Darwin

Model answer: Charles Darwin

What is the largest state in the U.S. by land mass?

Correct answer: Alaska

Model answer: California

Performance:

On question answering, GPT-2's performance is far below that of BERT.

4. Language modeling of broad contexts: predict the last word of a passage

Dataset: LAMBADA

Example

Both its sun-speckled shade and the cool grass beneath were a welcome respite after the stifling kitchen, and I was glad to relax against the tree's rough, brittle bark and begin my breakfast of buttery, toasted bread and fresh fruit. Even the water was tasty, it was so clean and cold. It almost made up for the lack of...

Correct answer: coffee

Model answer: food

Performance:

On language modeling of broad contexts, GPT-2 outperforms the previous SOTA models.

5. Summarization: summarize news articles

Datasets: CNN and Daily Mail datasets

Sample text:

Prehistoric man sketched an incredible array of prehistoric beasts on the rough limestone walls of a cave in modern day France 36000 years ago.

Now, with the help of cutting-edge technology, those works of art in the Chauvet-Pont-d'Arc Cave have been reproduced to create the biggest replica cave in the world.

...

Reference summary:

Cave mimics famous Caverne du Pont-d'Arc in France, the oldest cave decorated by man and the best preserved. The replica contains all 1000 paintings which include 425 such as a woolly rhinoceros and mammoths. Minute details were copied using 3D modelling and anamorphic techniques, often used to shoot widescreen images. The modern cave also includes replica paw prints of bears, bones and details preserved in the original cave.

Summary written by the machine:

The original site in Vallon-Pont-D'arc in Southern France is a Unesco World Heritage site and is the oldest known and the best preserved cave decorated by man. The replica cave was built a few miles from the original site in Vallon-Pont-D'Arc in Southern France. The cave contains images of 14 different species of animals including woolly rhinoceros, mammoths, and big cats.

Performance:

On summarization, GPT-2 does not perform as well as specialized systems.

6. Machine translation: translating French sentences into English

Dataset: WMT-14 Fr-En

Example

French sentence:

Un homme a expliqué que l'opération gratuite qu'il avait subie pour soigner une hernie lui permettrait de travailler à nouveau.

Reference translation:

One man explained that the free hernia surgery he'd received will allow him to work again.

Model translation:

A man told me that the operation gratuity he had been promised would not allow him to travel.

Performance:

On French-to-English machine translation, GPT-2 does not perform as well as specialized systems.
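To make the translation setup tangible, here is a hypothetical end-to-end sketch, again using the Hugging Face transformers library with the small released model and made-up conditioning pairs; the paper's actual evaluation differs in its details, so treat this purely as an illustration of the "french sentence = english sentence" prompting idea.

```python
# Hypothetical end-to-end zero-shot translation sketch using the small
# released model via the third-party `transformers` library.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Condition on made-up "french = english" pairs, then leave the target open
# so the model continues with an English rendering.
prompt = (
    "Je suis fatigué. = I am tired.\n"
    "Où est la gare ? = Where is the train station?\n"
    "Il fait très froid aujourd'hui. ="
)
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=input_ids.shape[1] + 20,
                        do_sample=False)
completion = tokenizer.decode(output[0][input_ids.shape[1]:])
# Take the continuation up to the first newline as the "translation".
print(completion.split("\n")[0].strip())
```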

We believe that, since these tasks are a subset of general language modeling, performance can be expected to improve further as compute and data increase. Others have published similar hypotheses. We also expect fine-tuning to improve performance on downstream tasks, although this requires thorough experimentation.
