This article covers general-purpose text representation (part 7 of the Embedding series). Let's look at which SOTA general text representation frameworks are available. You may not use them directly in many scenarios, but you will keep running into them inside various cutting-edge approaches.
Generality
Human language has almost unbounded statistical complexity, yet it can be approximated reasonably well in low dimensions. The search for general text representations is the search for a better language-model framework, one that extracts information from text as comprehensively as possible.
NLP tasks rarely come with large numbers of labeled samples. General text representation relies on feature transfer: the text vector produced by a pre-trained model becomes the input to the downstream model, letting that model skip the step of extracting information from raw text. The information is already in the vector; the downstream model only has to pull out the parts relevant to the current task. This greatly reduces the need for labeled data and makes many previously infeasible low-sample tasks possible.
The papers below mostly evaluate generality via feature transfer: the text representation from the pre-trained model is fed into downstream tasks. The downstream tasks are mainly classification and text semantic similarity. The classification tasks are:
MR: movie reviews, positive/negative sentiment
CR: product customer reviews, positive/negative sentiment
MRPC: whether two sentences are paraphrases of each other
SUBJ: whether a sentence is subjective or objective
MPQA: opinion polarity classification
TREC: question type classification
To evaluate a text vector on classification problems, the convention is to use the simplest possible logistic classifier: the input is the text vector and the output is the class. This minimizes the influence of model structure, so the evaluation measures only whether the vector itself contains the information the classification problem needs.
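As a rough sketch of this protocol (the `encode` function below is a hypothetical wrapper around whatever pre-trained model is being evaluated), the classifier is deliberately kept as weak as possible:

```python
# Evaluate frozen sentence vectors with nothing but a logistic classifier,
# so the score reflects what the vectors contain rather than classifier capacity.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_classification(encode, sentences, labels):
    X = np.stack([encode(s) for s in sentences])  # (n_samples, dim) frozen vectors
    clf = LogisticRegression(max_iter=1000)
    # 10-fold cross-validated accuracy, the usual SentEval-style protocol
    return cross_val_score(clf, X, labels, cv=10).mean()
```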
The text semantic similarity tasks (STS benchmarks) include:
SICK: sentence pairs annotated for semantic relatedness and entailment
STS: sentence pairs with human similarity scores from 0 to 5
To evaluate text similarity, one generally computes the cosine similarity of the two text vectors directly, then reports the Pearson correlation between those scores and the annotated labels.
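A minimal sketch of that procedure, again assuming a hypothetical `encode` function:

```python
# Cosine similarity between sentence vectors, scored by Pearson correlation
# against the human-annotated 0-5 similarity labels.
import numpy as np
from scipy.stats import pearsonr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_sts(encode, sent_pairs, gold_scores):
    preds = [cosine(encode(a), encode(b)) for a, b in sent_pairs]
    r, _ = pearsonr(preds, gold_scores)
    return r
```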
There are also open-source libraries for text-representation evaluation that handle the above tasks out of the box, such as SentEval and the GLUE Benchmark.
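For reference, here is roughly how a model is plugged into SentEval; the data path and `my_model.encode` are placeholders for your own setup, and the task names are the ones I recall from the SentEval repo:

```python
# SentEval calls `batcher` on batches of tokenized sentences and expects
# a 2-D array of sentence vectors back; `prepare` can build vocab if needed.
import numpy as np
import senteval

def prepare(params, samples):
    pass  # load weights / build vocab here if the model needs it

def batcher(params, batch):
    sentences = [' '.join(tokens) for tokens in batch]
    return np.stack([my_model.encode(s) for s in sentences])  # hypothetical encoder

params = {'task_path': 'SentEval/data', 'usepytorch': False, 'kfold': 10}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(['MR', 'CR', 'SUBJ', 'MPQA', 'TREC',
                   'MRPC', 'SICKRelatedness', 'STSBenchmark'])
```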
Model architectures
Below we walk through four model architectures and their performance on the benchmarks above. That said, I would not pay too much attention to each new architecture's incremental gains on the benchmarks. I have seen a comment mocking leaderboard-chasing: a new model will grid-search its own hyperparameters until it surpasses the existing SOTA, but will never tune the baselines it compares against. That is not far off the mark. So let's focus instead on the interesting innovations in each architecture and the logic behind them.
FastSent | SDAE (Hill 2016)
Take Away: different downstream extraction methods pull different information out of the same text, and log-bilinear text representations perform better on text-similarity tasks.
Let's briefly look at the two additional ways of generating text vectors proposed in the paper:
FastSent: a fast version of Skip-thought. It ignores word order and uses the sum of the word vectors as the sentence vector; the task is unchanged, using the middle sentence to predict the sentences before and after it (see the sketch after this list).
SDAE: Skip-thought training relies on continuous text such as novels. SDAE targets corpora like Twitter that have no context, only isolated sentences. Words in a sentence are randomly deleted and swapped in order, and an autoencoder is trained to reconstruct the original sentence. It is similar in spirit to BERT's MLM cloze task, except that BERT predicts only the masked words while SDAE predicts the whole sentence.
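To make FastSent concrete, here is a minimal PyTorch sketch under my reading of the paper: the sentence vector is a plain sum of word embeddings, trained to predict every word in the adjacent sentences.

```python
import torch
import torch.nn as nn

class FastSent(nn.Module):
    """Minimal FastSent sketch: a sentence vector is the sum of its word
    embeddings; the training signal is predicting the words of the
    adjacent sentences with a softmax over a target embedding table."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.src = nn.Embedding(vocab_size, dim)            # input embeddings
        self.tgt = nn.Linear(dim, vocab_size, bias=False)   # output embeddings

    def encode(self, word_ids):
        # word order is ignored: the sentence vector is a plain sum
        return self.src(word_ids).sum(dim=0)

    def loss(self, sent_ids, context_word_ids):
        s = self.encode(sent_ids)                  # (dim,)
        logp = torch.log_softmax(self.tgt(s), -1)  # (vocab_size,)
        # negative log-likelihood of every word in the adjacent sentences
        return -logp[context_word_ids].sum()
```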
The focus here is not these two algorithms but the paper's comparison: Skip-thought, FastSent, SDAE, DBOW, BOW, DictRep (fitting a dictionary definition, encoded by BOW or RNN, to the defined word's vector), CaptionRep (fitting a caption vector to the corresponding image vector), and an NMT translation task, each evaluated as a text representation on the downstream tasks. Some of the conclusions are interesting.
There is not much to say about the text-classification downstream tasks, where Skip-thought has the best overall performance (as of 2016).
The text-similarity results are more interesting. Overall, the log-bilinear models, including FastSent and DictRep, and the text vector obtained by simply averaging CBOW word vectors perform better on the STS and SICK datasets.
It is not that the other vectors fail to learn semantic-similarity information, but that the information cannot be extracted by a simple cosine distance. So what matters is not only how a general text vector is generated, but also how information is extracted from it. Log-bilinear models such as CBOW update via vector addition and subtraction during training, which implicitly encodes a distance computation, making them better suited to cosine comparison. Another way to obtain a semantically meaningful text representation is to build vector-distance computations into the training of the embedding itself; InferSent below does something along these lines.
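For scale, this is all the averaged-CBOW baseline amounts to (the vector file path is a placeholder; any pre-trained word2vec-format vectors work):

```python
# Sentence vector = average of pre-trained word2vec/CBOW word vectors,
# compared directly by cosine similarity.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)  # assumed path

def avg_vector(sentence):
    words = [w for w in sentence.lower().split() if w in kv]
    return np.mean([kv[w] for w in words], axis=0)

u = avg_vector('a man is playing a guitar')
v = avg_vector('someone plays the guitar')
sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```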
InferSent (Facebook 2017)
Take Away: not all supervised models yield poor general representations; NLI-labeled data works well!
Before InferSent, general text representation was dominated by unsupervised models such as Skip-thought and FastSent. It was not that no one had tried supervised models, but the results were not great. The paper argues that supervised models generalize poorly because a neural network easily learns the particularities of its supervision task (an inductive bias). For example, a sentiment task may focus mainly on polarity keywords, while translation attends more to correspondences between grammatical structures and words, neglecting the overall semantics of the text. Unsupervised tasks, such as SDAE's self-reconstruction or predicting the preceding and following sentences, carry no such task-specific bias, so they yield representations with more complete semantic information. But the paper points out that not all supervised tasks are bad: NLI works!
First, what does an NLI dataset look like? SNLI is a textual-entailment (RTE) corpus: five annotators label whether the premise entails, contradicts, or is independent of the hypothesis, and a majority vote yields the entailment/contradiction/neutral class labels. The authors argue that NLI requires genuinely understanding the text to make a judgment, which makes it better suited to learning general text representations. That explanation is rather abstract, almost a non-explanation. Still, compared with text-similarity or translation tasks, NLI is indeed harder to reduce to a task-specific pattern: it imposes no consistent requirement that the sentences share grammatical structure or contain the same or synonymous words.
The InferSent model uses a siamese structure: two sentences share one encoder, producing text vectors \(u\) and \(v\). Three combinations, concatenation \((u, v)\), element-wise multiplication \(u \ast v\), and absolute difference \(|u - v|\) (the absolute value guarantees symmetry), help the subsequent fully connected layers extract interaction information between the vectors, followed by a 3-class classifier.
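A sketch of that combination layer in PyTorch (the hidden size and activation are my assumptions, not the paper's exact settings):

```python
# Given sentence vectors u and v from the shared encoder, concatenate
# (u, v, |u - v|, u * v) and classify over the 3 NLI labels.
import torch
import torch.nn as nn

class NLIHead(nn.Module):
    def __init__(self, dim, hidden=512, n_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_classes))

    def forward(self, u, v):
        feats = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
        return self.mlp(feats)  # logits: entailment / contradiction / neutral
```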
For the encoder, the authors compare GRU/LSTM/BiLSTM with max or average pooling, self-attention, and a hierarchical ConvNet. The text vector from BiLSTM + max pooling surpasses Skip-thought on almost all downstream evaluations, and on some tasks such as CR is nearly on par with supervised models trained directly on the task.
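A minimal version of that winning encoder, with padding handling omitted for brevity:

```python
# BiLSTM over word embeddings with element-wise max pooling across time steps.
import torch
import torch.nn as nn

class BiLSTMMaxEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=2048):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, token_ids):               # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))   # (batch, seq_len, 2*hidden)
        return h.max(dim=1).values              # max over time: (batch, 2*hidden)
```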
The later Sentence-BERT also borrowed InferSent's framework, replacing the encoder with BERT; we'll leave that for the BERT posts.
GenSen (Microsoft 2018)
Take Away: a single task's text representation carries an inductive bias, which can be averaged out through multi-task training.
Where InferSent looks for one relatively abstract supervised task that requires text understanding, GenSen proposes fusing multiple supervised tasks directly to improve generality. GenSen selects four task families; besides providing diversity, each task should benefit text representation and have a large enough training set: Skip-thought, NMT translation, NLI inference, and constituency parsing.
GenSen uses a fairly simple multi-task training scheme over the different data sources. All the tasks above take English input, so a single GRU encoder is shared, ensuring every task updates the same information-extraction machinery and yields one text representation carrying information from all the tasks. The encoder follows Skip-thought's conditional GRU; readers unfamiliar with it can see the earlier post, Embedding 4 - skip-thought & tf-Seq2Seq source code analysis. Each task has its own decoder; in each round a task is sampled with equal probability, and one batch from that task drives the gradient update, as sketched below. Here is GenSen's performance on the downstream tasks.
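A schematic of that training loop (the task names, loaders, and decoder interfaces are placeholders, not GenSen's actual code):

```python
# One shared encoder, one decoder per task; each step samples a task
# uniformly and updates on a single batch drawn from that task.
import random
import torch

def train_gensen_style(encoder, decoders, loaders, optimizer, steps):
    tasks = list(decoders)  # e.g. ['skip_thought', 'nmt', 'nli', 'parsing']
    for _ in range(steps):
        task = random.choice(tasks)             # equal-weight task sampling
        batch = next(loaders[task])             # one batch from that task's data
        z = encoder(batch['src'])               # shared GRU encoder
        loss = decoders[task](z, batch['tgt'])  # task-specific decoder returns loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```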
The rightmost \(\Delta\) column is the improvement relative to InferSent. As GenSen adds more target tasks, the average performance over the 10 downstream tasks keeps rising relative to InferSent, although not every individual downstream task improves as training objectives are added. Many later pre-trained language models, including BERT, follow this multi-task idea while choosing different objectives.
USE (Google 2018)
Take Away: general text vectors can likewise be generated through multi-task training.
Around the same time as GenSen, another multi-task architecture was proposed: the Universal Sentence Encoder. USE seems to be the better known of the two, probably because the Large, Lite, and Multilingual models are open-sourced on TF Hub [Ref9], easy to use out of the box or to fine-tune in new scenarios. There are two main differences between USE and GenSen.
The choice of target tasks differs. USE still targets general text representation, but its chosen training tasks lead to a more semantically meaningful representation, which is why the pre-trained model does very well on all kinds of text-similarity tasks. The three target tasks are the Skip-thought sentence-prediction task, an input-response dialogue task, and the NLI inference task. So semantically similar texts may have similar contexts, similar questions or answers, and similar inferences, as shown in the figure below.
The choice of encoder differs. GenSen keeps a GRU encoder, while USE offers two encoders of different computational cost: DAN and Transformer. DAN, the Lite version, ignores word order and takes the sum of word vectors as input; the Transformer, the Large version, has higher complexity and usually better results. Readers unfamiliar with the Transformer can see Embedding 6 - stepping into the Transformer era, model details & code implementation.
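Using the open-sourced USE from TF Hub is nearly a one-liner; the module URL below is the TF2 version as published on the hub, and the Transformer (Large) and DAN (Lite) variants expose the same call:

```python
# Load the Universal Sentence Encoder from TF Hub and embed sentences.
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
vectors = embed(["the quick brown fox", "a fast auburn fox"])  # (2, 512) tensor
```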
That's all for general text representation frameworks. To close, let me plug two handy tools.
Connected Papers: a paper-discovery tool. Its citation-graph view lets you trace a paper's ancestors and descendants with no effort at all.
Papers with Code added a Datasets feature that locates a benchmark dataset in one click, so I never have to worry about finding data again.