Building on the CBOW model, we integrate knowledge extracted from a knowledge base into joint training and propose the LRWE model. The structure of the model is as follows:
The idea behind the model and its solution method are described in detail below.
1. LWE model
In Word2vec's CBOW model, the target word is predicted from the words in its context, and the goal is to maximize the probability of the target word given that context, so a word's trained vector depends on the words that surround it. However, the CBOW model only considers a word's local context and cannot distinguish synonyms from antonyms well. Consider, for example, the following case:
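As a hedged illustration of this limitation (not the original figure), the sketch below trains a tiny CBOW model with gensim on a toy corpus in which "good" and "bad" occur in identical contexts; the corpus, hyperparameters, and word pair are assumptions chosen only to make the point visible.

```python
from gensim.models import Word2Vec

# Toy corpus: the antonyms "good" and "bad" appear in identical contexts.
toy_corpus = [
    ["the", "movie", "was", "good", "overall"],
    ["the", "movie", "was", "bad", "overall"],
    ["the", "food", "tasted", "good", "today"],
    ["the", "food", "tasted", "bad", "today"],
]

# sg=0 selects the CBOW architecture; hyperparameters are illustrative.
model = Word2Vec(toy_corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=300)

# Because the two words receive nearly identical context signals, their vectors
# typically end up close together, even though their meanings are opposite.
print("similarity(good, bad) =", model.wv.similarity("good", "bad"))
```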
To address this problem, this paper feeds lexical information such as synonyms and antonyms into word-vector training as external supervision, so that the trained word vectors learn synonym and antonym knowledge and can better distinguish synonyms from antonyms.
1.1 Model idea
Denote the synonym and antonym sets by (SYN, ANT), where SYN is the set of synonyms and ANT is the set of antonyms. Our goal is, given the synonym and antonym sets of a target word, to predict the target word, so that the target word is as close as possible to its synonyms and as far as possible from its antonyms.
For example, in "The cat sat on the mat.", knowing that sat has the synonym seated and the antonym stand, we predict the target word sat.
We call this the lexical information model; its structure is as follows:
For a word, we predict the target word from its synonyms and antonyms, maximizing the probability that the word co-occurs with its synonyms and minimizing the probability that it co-occurs with its antonyms. Based on this goal, the following objective function is defined:
Our goal is to add synonym and antonym information as supervision to the training of the context-based CBOW language model, so that the trained word vectors learn synonym and antonym knowledge. Based on this idea, we propose the Lexical Information Word Embedding (LWE) model, whose objective function is:
The structure of the model is as follows:
Note that the CBOW model and the lexical information model share the same word vectors. Through this shared representation, each module obtains the other's knowledge, so that during training the word vectors make combined use of context information and synonym/antonym information, yielding higher-quality word vectors.
1.2 Model solution
As the model structure diagram shows, LWE can be regarded as the superposition of two CBOW models, so its optimization procedure is the same as CBOW's; this paper uses Negative Sampling.
In Negative Sampling, the target word is treated as the positive sample, and words drawn by negative sampling are treated as negative samples. In our model, for the synonym set SYN(w) of a word w, the target word w is the positive sample and words outside the synonym set are negative samples; the negative sample set is denoted NEG(w), and L^w(z) denotes the indicator function, equal to 1 when z = w and 0 otherwise.
The positive sample label is 1 and the negative sample label is 0. Then, for a sample (w, SYN(w)), the training objective function (3-1) becomes:
The treatment of antonyms is analogous, so for the entire vocabulary V the overall objective function is:
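A hedged numpy sketch of the negative-sampling term just described for the synonym part: a synonym's input vector e_u scores the target word's auxiliary vector θ^w as the positive sample and a handful of negatively sampled auxiliary vectors as negatives. The dimensionality, number of negatives, and random initialization are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_objective(e_u, theta_w, theta_negs):
    """log sigma(e_u . theta^w) + sum over z in NEG(w) of log sigma(-e_u . theta^z)."""
    obj = np.log(sigmoid(e_u @ theta_w))
    for theta_z in theta_negs:
        obj += np.log(sigmoid(-(e_u @ theta_z)))
    return obj

rng = np.random.default_rng(0)
m = 8
e_u = rng.normal(scale=0.1, size=m)               # input vector of a synonym u
theta_w = rng.normal(scale=0.1, size=m)           # auxiliary vector of the target word w
theta_negs = rng.normal(scale=0.1, size=(5, m))   # 5 negatively sampled auxiliary vectors
print("objective:", ns_objective(e_u, theta_w, theta_negs))
```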
1.3 Parameter update
To maximize objective function (3-6), we use stochastic gradient ascent, which requires the derivatives of the objective function with respect to e_u and θ^w.
From the formula above, the synonym and antonym objective functions differ only in their domains, so we only need to differentiate one of them. Taking the derivative of the function ψ with respect to θ^u gives:
So the update formula is:
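Since the update formulas themselves appear only as figures in the source, the following is a minimal stochastic-gradient-ascent sketch under the same assumed notation: each sample contributes a factor g = η(label − σ(e·θ)), which nudges both the auxiliary vector θ and the shared input vector e. The learning rate and vector shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_ascent_step(e, theta_samples, eta=0.025):
    """One ascent step: theta_samples is a list of (theta_z, label) pairs,
    label 1 for the positive sample and 0 for negative samples."""
    e_grad = np.zeros_like(e)
    for theta_z, label in theta_samples:
        g = eta * (label - sigmoid(e @ theta_z))  # scalar gradient factor
        e_grad += g * theta_z                     # accumulate gradient w.r.t. the input vector
        theta_z += g * e                          # update the auxiliary vector in place
    e += e_grad                                   # update the shared input vector last
    return e

rng = np.random.default_rng(0)
e = rng.normal(scale=0.1, size=8)
samples = [(rng.normal(scale=0.1, size=8), 1), (rng.normal(scale=0.1, size=8), 0)]
e = sgd_ascent_step(e, samples)
```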
2. RWE model
There are many complex semantic relations between words, such as hypernymy and hyponymy: "music" is a hypernym of "mp3", and "bird" is a hyponym of "animal". Besides "bird", "animal" also has hyponyms such as "fish" and "insect"; words sharing the same hypernym, such as "fish", "insect", and "bird", should be similar or related in some sense. However, Word2vec trains only on word co-occurrence in a large corpus, so the resulting word vectors learn only textual context information; they cannot learn the relations between words, and such complex semantic relations are therefore hard to express fully.
A knowledge graph contains rich relational information about entity words, so this paper proposes a word vector model based on relational information, which trains the language model jointly with a knowledge representation learning model. By adding various kinds of relational knowledge extracted from the knowledge graph, the word-vector training process relies not only on the co-occurrence of context words but also learns the corresponding relational knowledge, thereby improving the quality of the word vectors.
2.1 Model idea
Knowledge in a knowledge graph is generally organized as triples (h, r, t). Following the CBOW training procedure, we can construct samples ((h, r), t), which express many different relations, for example (animal, _hyponymy, bird).
After extracting the triple data, we need a representation of the relations between words; the TransE model is the simplest and most effective such representation. Its basic idea is that for a triple (h, r, t), if the triple expresses a fact, then h + r ≈ t, i.e., the vector h + r should be close to the vector t.
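A small runnable sketch of the TransE intuition stated above: for a factual triple (h, r, t) the embeddings should satisfy h + r ≈ t, so the distance ‖h + r − t‖ should be small. The toy two-dimensional vectors are assumptions chosen to satisfy the relation exactly.

```python
import numpy as np

def transe_score(h, r, t):
    """Smaller is better: the L2 distance between h + r and t."""
    return np.linalg.norm(h + r - t)

animal = np.array([1.0, 0.0])
hyponymy = np.array([0.0, 1.0])
bird = np.array([1.0, 1.0])   # chosen so that animal + hyponymy equals bird exactly

print("score(animal, _hyponymy, bird) =", transe_score(animal, hyponymy, bird))  # 0.0
```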
We call this the relational information model; its structure is as follows. The input layer of the model is the set of (h, r) pairs from the triples associated with the target word, the projection layer is an identity projection, and the output layer predicts the target word over the dictionary.
For a word, using relational triples from the knowledge graph as supervised data, we want the word to learn rich relational semantic information. Based on this goal, the following objective function is defined:
Then, in the training of the context-based CBOW language model, rich relational information is added as supervision, so that the trained word vectors learn the complex semantic relations between words. Based on this idea, we propose the Relational Information Word Embedding (RWE) model, whose objective function is:
The structure of the model is as follows. The two models share the same set of word vectors, and this paper allocates a separate vector space for the relations in the triples; that is, relation vectors and word vectors are represented independently, to avoid conflicts with the word vectors.
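A minimal sketch of this parameter layout: one word-embedding matrix shared by all modules and a separate, independent relation-embedding matrix. The sizes |V|, |R|, and m are illustrative assumptions.

```python
import numpy as np

V, R, m = 10_000, 20, 100   # vocabulary size, number of relations, vector dimension

word_emb = np.random.normal(scale=0.1, size=(V, m))      # shared by the CBOW and relational modules
relation_emb = np.random.normal(scale=0.1, size=(R, m))  # used only by the relational module

# Both modules read and write rows of word_emb, so context knowledge and relational
# knowledge are encoded in the same word vectors, while relations live in their own space.
print(word_emb.shape, relation_emb.shape)
```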
2.2 Solution method
Again, Negative Sampling is used for optimization. The simplification process is similar to Section 1.2, and only the overall objective function is given here:
2.3 Parameter update
Similarly, stochastic gradient ascent is used for updating, which requires the derivatives of the objective function with respect to e_{h+r} and θ^w.
Taking the derivative of the function ψ with respect to θ^u gives:
The update formula for θ^u is:
3. LRWE model
The previous two sections introduced two models, the word vector model based on lexical information and the word vector model based on relational information, each of which addresses problems in a specific setting. This paper attempts to combine the two models so that word vectors learn not only lexical information such as synonyms and antonyms but also complex relational semantic information. Based on this goal, we obtain the joint model LRWE.
The objective function of the joint word vector model is as follows:
The structure of the model is as follows:
3.1 Characteristics of the model
Learn multiple kinds of information simultaneously by sharing word vectors.
Different modules keep independent parameters to preserve the differences between tasks.
Allocate a separate vector space for relations to avoid conflicts with the word vectors.
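A minimal runnable sketch (assumptions throughout) of how the joint objective above can be assembled: all three modules apply the same negative-sampling term to the shared word vectors, and only their inputs differ (summed context for CBOW, a synonym vector for the lexical module, h + r for the relational module). The toy vocabulary, dimensions, negatives, and the omission of the antonym term are simplifications for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_term(inp, theta_pos, theta_negs):
    # log sigma(inp . theta_pos) + sum_z log sigma(-inp . theta_z)
    val = np.log(sigmoid(inp @ theta_pos))
    for theta_z in theta_negs:
        val += np.log(sigmoid(-(inp @ theta_z)))
    return val

rng = np.random.default_rng(1)
m = 8
words = ["cat", "sat", "seated", "stand", "mat", "animal", "bird"]
word_emb = {w: rng.normal(scale=0.1, size=m) for w in words}   # shared by all three modules
theta = {w: rng.normal(scale=0.1, size=m) for w in words}      # auxiliary output vectors
rel_emb = {"_hyponymy": rng.normal(scale=0.1, size=m)}         # independent relation space

negs = [theta["mat"], theta["cat"]]                            # toy negative samples

l_cbow = ns_term(word_emb["cat"] + word_emb["mat"], theta["sat"], negs)           # context -> target
l_lex  = ns_term(word_emb["seated"], theta["sat"], negs)                          # synonym -> target
l_rel  = ns_term(word_emb["animal"] + rel_emb["_hyponymy"], theta["bird"], negs)  # h + r -> t

print("joint objective:", l_cbow + l_lex + l_rel)
```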
3.2 Theoretical comparison of the models
From the perspective of parameter count, LWE adds lexical supervision on top of CBOW, sharing one set of word vectors but needing an additional set of auxiliary parameter vectors, so its parameter count is 2|V|·m + |V|·m = 3|V|·m. Similarly, the relational word vector model RWE shares word vectors with CBOW and has its own auxiliary parameter vectors, plus relation vectors, so its parameter count is 3|V|·m + |R|·m. The joint word vector model LRWE combines the two models above, so its parameter count is 4|V|·m + |R|·m, where |V| is the vocabulary size, m the word-vector dimension, and |R| the number of relations.
Model: Number of parameters
CBOW: 2|V|·m
LWE: 3|V|·m
RWE: 3|V|·m + |R|·m
LRWE: 4|V|·m + |R|·m
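As a worked example of the table, with assumed sizes |V| = 100,000 words, m = 300 dimensions, and |R| = 20 relations:

```python
V, m, R = 100_000, 300, 20   # assumed vocabulary size, vector dimension, relation count

print("CBOW :", 2 * V * m)            # 60,000,000
print("LWE  :", 3 * V * m)            # 90,000,000
print("RWE  :", 3 * V * m + R * m)    # 90,006,000
print("LRWE :", 4 * V * m + R * m)    # 120,006,000
```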
From the perspective of time complexity, the CBOW model scans each word of the corpus and takes the word and its context as one sample, so when comparing the models we analyze only the cost of training a single sample.
The main cost of the CBOW model is the Softmax prediction at the output layer, giving a per-sample complexity of O(|V|·m). Using Hierarchical Softmax at the output layer accelerates this to O(m·log|V|), and Negative Sampling reduces it further to O((n+1)·m), where n is the number of negative samples. LWE and RWE can be viewed as the superposition of two CBOW models, so their per-sample complexity is roughly twice that of CBOW. Although they are more expensive than the CBOW model, they learn additional semantic information in linear time, giving more expressive word vectors.
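A rough per-sample cost comparison under assumed sizes (|V| = 100,000, m = 300, n = 5 negative samples); the constants are illustrative only:

```python
import math

V, m, n = 100_000, 300, 5   # assumed vocabulary size, vector dimension, negatives per sample

print("full softmax         ~", V * m)                    # 30,000,000
print("hierarchical softmax ~", round(math.log2(V)) * m)  # ~5,100
print("negative sampling    ~", (n + 1) * m)              # 1,800
```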