Meta releases NPM, the first nonparametric masked language model: no more worrying about what "out of vocabulary" means.
Although large language models deliver impressive performance in NLP, they come with serious costs: training is expensive, the models are hard to update, and they struggle with long-tail knowledge.
Moreover, language models typically use a softmax layer over a limited vocabulary at the output, which almost never produces rare words or phrases and greatly limits the models' expressive power.
To address the long-tail problem, researchers from the University of Washington, Meta AI, and the Allen Institute for AI recently proposed the first NonParametric Masked language model (NPM), which replaces the softmax over a fixed output vocabulary with a nonparametric distribution over every phrase in a reference corpus.
Paper link: https://arxiv.org/abs/2212.01349
Code link: https://github.com/facebookresearch/NPM
NPM can be trained efficiently with a contrastive objective and an in-batch approximation to retrieval over the full corpus.
The researchers ran zero-shot evaluations on 9 closed-set tasks and 7 open-set tasks, including tasks that stress the prediction of new facts (e.g., under temporal shift) or rare phrases, as well as word-level translation.
The results show that NPM significantly outperforms much larger parametric models, with or without retrieve-and-generate components, including GPT-3 (500x more parameters) and OPT 13B (37x more parameters). NPM is particularly good at handling rare patterns (word senses or facts) and at predicting rare or nearly unseen words (such as words in non-Latin scripts).
The first nonparametric language model
Existing retrieve-and-generate work can alleviate this problem to some extent, but the final prediction in these models still relies on a softmax layer over a fixed vocabulary, so the long-tail problem is not fundamentally solved.
NPM consists of an encoder and a reference corpus. The encoder maps text into a fixed-size vector, and NPM retrieves a phrase from the corpus to fill in the [MASK].
In other words, NPM outputs a nonparametric distribution over phrases instead of a softmax over a fixed output vocabulary.
Training a nonparametric model, however, raises two key problems:
1. Retrieving over the full corpus during training is extremely time-consuming; the researchers address this with an in-batch approximation to the full corpus.
2. Learning to predict phrases of arbitrary length without a decoder is difficult; the researchers address this by extending span masking and using a phrase-level contrastive objective.
In short, NPM removes the output-vocabulary softmax entirely and achieves an effectively unbounded output space by predicting n-grams of arbitrary length.
The resulting model can predict "extremely rare" or even "completely unseen" words (such as Korean words) and can effectively support an unlimited vocabulary, which existing models cannot do.
The NPM method
The key idea of NPM is to use the encoder to map every phrase in the corpus into a dense vector space. At inference time, given a query containing [MASK], the encoder is used to find the nearest phrase in the corpus and fill in the [MASK].
Encoder-only models are highly competitive representation models, but existing encoder-only models cannot predict an unknown number of tokens, which limits their use without fine-tuning.
NPM addresses this by retrieving a phrase that can fill the [MASK] with any number of tokens.
Inference
The encoder maps every distinct phrase in the reference corpus C into a dense vector space.
At test time, the encoder maps the masked query into the same vector space and retrieves a phrase from C to fill in the [MASK].
C does not have to be the same as the training corpus: it can be replaced or extended at test time without retraining the encoder.
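To make the retrieval idea concrete, here is a minimal sketch of NPM-style inference in Python. It is not the official implementation: encode is a toy stand-in for the trained encoder (so the retrieved phrase is arbitrary), and the point is only to show the vocabulary softmax being replaced by a nearest-phrase lookup over a reference corpus.

```python
# Minimal sketch of NPM-style inference (not the official implementation).
# `encode` stands in for the trained encoder; it is a toy hashing-based
# embedder, so retrieval results are arbitrary -- only the mechanics matter.
import hashlib
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding used as a placeholder for the real encoder."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

# Reference corpus C: enumerate candidate phrases and embed them once, offline.
corpus_phrases = ["Thessaloniki", "the Seattle Seahawks", "cheap", "2010"]
phrase_index = np.stack([encode(p) for p in corpus_phrases])      # (num_phrases, dim)

def fill_mask(masked_query: str) -> str:
    """Fill [MASK] by nearest-phrase retrieval instead of a vocabulary softmax."""
    q = encode(masked_query)                       # stand-in for the [MASK] representation
    scores = phrase_index @ q                      # cosine similarity (vectors are unit norm)
    return corpus_phrases[int(np.argmax(scores))]  # nonparametric argmax over phrases in C

print(fill_mask("In [MASK], the team won the division title."))
```

Because the phrase index is built offline from C, swapping or extending C at test time only requires re-embedding the new corpus, consistent with the point above.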
In practice, the corpus contains an enormous number of phrases, and indexing all of them is expensive.
For example, if phrases of up to l tokens are considered (l ≈ 20), l × |C| vectors would need to be indexed, which can be prohibitively expensive.
Instead, the researchers index each distinct token in C, reducing the index size from l × |C| to |C|, and then approximate the nonparametric distribution over all phrases by running separate k-nearest-neighbor searches for the start and the end of the phrase.
For example, the phrase Thessaloniki, which consists of four BPE tokens, is represented by the concatenation of c1 and c4, corresponding to the first token of the phrase (The) and the last (iki).
A query is then represented by two vectors, q_start and q_end, in the same vector space; each is used to retrieve the start and the end of plausible phrases, which are then aggregated.
This works as long as the start and end representations are good enough, i.e., q_start is close enough to c1 and q_end is close enough to c4, and this is ensured during training.
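The following is a hedged sketch of that start/end approximation. It assumes per-token contextual embeddings are already available (random toy vectors here), so only the indexing and aggregation logic is illustrated; the names q_start, q_end, max_phrase_len, and k are illustrative, not the paper's exact interface.

```python
# Sketch of approximate phrase retrieval: index one vector per token (|C| vectors
# instead of l x |C| phrase vectors), run k-NN separately for phrase starts and
# ends, then aggregate the two scores over valid (start, end) spans.
import numpy as np

rng = np.random.default_rng(0)
corpus_tokens = ["The", "ss", "alon", "iki", "is", "a", "city"]   # BPE-style tokens
dim, max_phrase_len, k = 32, 4, 3

# Toy per-token contextual embeddings standing in for the trained encoder's output.
token_vecs = rng.normal(size=(len(corpus_tokens), dim))
token_vecs /= np.linalg.norm(token_vecs, axis=1, keepdims=True)

def topk(query: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k corpus tokens most similar to the query vector."""
    return np.argsort(-(token_vecs @ query))[:k]

def retrieve_phrase(q_start: np.ndarray, q_end: np.ndarray) -> str:
    """k-NN on starts and ends, then pick the best-scoring valid span."""
    best, best_score = None, -np.inf
    for s in topk(q_start, k):
        for e in topk(q_end, k):
            if s <= e < s + max_phrase_len:                       # a span like c1..c4
                score = token_vecs[s] @ q_start + token_vecs[e] @ q_end
                if score > best_score:
                    best, best_score = (s, e), score
    return "".join(corpus_tokens[best[0]: best[1] + 1]) if best else ""

# q_start / q_end would come from [MASKs] / [MASKe] in the real model.
print(retrieve_phrase(rng.normal(size=dim), rng.normal(size=dim)))
```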
Training
NPM is trained on unlabeled text data to ensure that the encoder maps text into a good dense vector space.
Training NPM poses two main difficulties: 1) retrieving over the full corpus makes training very time-consuming; 2) [MASK] must be filled with phrases of arbitrary length rather than single tokens.
1. Masking
Span masking masks consecutive tokens whose lengths are sampled from a geometric distribution.
The researchers extend this in two ways:
1) If a span co-occurs in another sequence in the batch, it is masked; this guarantees in-batch positives during training.
For example, the masked spans 2010, the Seattle Seahawks, and to the all appear in another sequence in the batch.
By contrast, the bigram "game ," cannot be masked as a unit: although both tokens appear in the two sequences, they do not co-occur as a bigram.
2) Instead of replacing each token in the span with [MASK], the entire span is replaced with two special tokens, [MASKs][MASKe].
In the example above, no matter how long the masked span is, it is replaced with [MASKs][MASKe], so that a start vector and an end vector can be obtained for each span, which matches how inference works.
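Below is a simplified sketch of this masking step. It assumes that in-batch co-occurrence can be checked by exact n-gram matching and that span lengths are drawn from a geometric distribution; the function name and the probability p are illustrative choices, not the paper's settings.

```python
# Simplified span masking with in-batch positives: a span is masked only if the
# same token sequence also occurs in another sequence of the batch, and the
# whole span is replaced by the two special tokens [MASKs][MASKe].
import numpy as np

rng = np.random.default_rng(0)

def mask_sequence(tokens: list[str], batch: list[list[str]], p: float = 0.5) -> list[str]:
    length = min(int(rng.geometric(p)), len(tokens))            # span length ~ Geometric(p)
    start = int(rng.integers(0, len(tokens) - length + 1))
    span = tokens[start:start + length]
    occurs_elsewhere = any(
        other is not tokens
        and any(other[i:i + length] == span for i in range(len(other) - length + 1))
        for other in batch
    )
    if not occurs_elsewhere:                                    # no in-batch positive: skip
        return tokens
    return tokens[:start] + ["[MASKs]", "[MASKe]"] + tokens[start + length:]

batch = [
    "In 2010 the Seattle Seahawks went to the playoffs".split(),
    "The 2010 season saw the Seattle Seahawks win the division".split(),
]
print(mask_sequence(batch[0], batch))
```

Here a single span is sampled per sequence just to show the mechanics; a real training loop would repeat this until a target masking ratio is reached.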
2. Training objective
Suppose the masked span is the Seattle Seahawks; at test time, the model should retrieve the phrase the Seattle Seahawks from some other sequence in the reference corpus.
At inference, the model obtains vectors from [MASKs] and [MASKe] and uses them to retrieve the start and the end of the phrase from the corpus, respectively.
The training objective should therefore encourage the [MASKs] vector to be closer to the the in the Seattle Seahawks and farther from other tokens, and not just any the, such as the one in become the first.
This is achieved by approximating the full corpus with the other sequences in the batch: concretely, the model is trained to retrieve the start and the end of the the Seattle Seahawks span from other sequences in the same batch.
Note that the masking strategy above ensures that every masked span has a co-occurring span somewhere else in the batch.
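The sketch below illustrates one way such an in-batch contrastive objective for the start position could look (the end position would be handled symmetrically with [MASKe]). The temperature, shapes, and function name are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of an in-batch contrastive loss for the [MASKs] vector: pull it
# toward the start token(s) of the co-occurring span in other batch sequences,
# and push it away from every other in-batch token.
import torch
import torch.nn.functional as F

def start_contrastive_loss(mask_s_vec: torch.Tensor,        # (dim,) vector of [MASKs]
                           batch_token_vecs: torch.Tensor,  # (num_tokens, dim) in-batch tokens
                           positive_ids: torch.Tensor,      # indices of the gold start token(s)
                           temperature: float = 0.1) -> torch.Tensor:
    q = F.normalize(mask_s_vec, dim=-1)
    keys = F.normalize(batch_token_vecs, dim=-1)
    logits = keys @ q / temperature                 # similarity to every in-batch token
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs[positive_ids].mean()          # put probability mass on the positives

# Toy usage: 10 in-batch tokens; token 3 is the "the" of "the Seattle Seahawks".
mask_s = torch.randn(64, requires_grad=True)        # would come from the encoder
tokens = torch.randn(10, 64, requires_grad=True)
loss = start_contrastive_loss(mask_s, tokens, torch.tensor([3]))
loss.backward()                                     # a real loop would then step the optimizer
```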
Experiments
In the zero-shot setting, NPM outperforms the other baseline models.
Among the parametric models, RoBERTa achieves the best performance, unexpectedly beating models such as GPT-3; this is probably because the bidirectionality of the encoder-only model plays a vital role, and it suggests that causal language models may not be a suitable choice for classification.
The kNN-LM approach, which adds a nonparametric component to a parametric model, outperforms all the other baselines. However, relying on kNN retrieval alone with GPT-2 performs poorly, showing that using kNN only at inference time has limited value.
Both NPM SINGLE and NPM significantly outperform all baselines, with consistently superior performance across all datasets. This shows that nonparametric models are highly competitive even on tasks that do not explicitly require external knowledge.
A qualitative analysis compares the predictions of RoBERTa and NPM on a sentiment analysis task. In the first example, cheap means inexpensive; in the second, cheap means of poor quality.
RoBERTa predicts positive for both examples, whereas NPM makes the correct predictions by retrieving contexts in which cheap is used in the same sense as in the input.
The representations produced by NPM also lead to better word-sense disambiguation. For example, RoBERTa assigns a high similarity score to cheap (inexpensive) and cheap (poor quality).
NPM, by contrast, assigns the two senses of cheap a low similarity score, which shows that nonparametric training with a contrastive objective is effective at improving representation learning, something that training-free approaches such as kNN-only inference cannot achieve.
Reference:
https://arxiv.org/abs/2212.01349
This article comes from the WeChat official account Xin Zhiyuan (ID: AI_era); editor: LRS.