
How Word2Vec works and how it differs from and relates to LDA

How does Word2Vec work, and what are the differences and connections between it and LDA? Many readers new to the topic are unsure how to answer this, so this article walks through the question step by step. Hopefully it will help you understand both models.

Question: How does Word2Vec work? What is the difference and connection between it and LDA?

Word2Vec has two network structures: CBOW (Continuous Bag of Words) and Skip-gram. The goal of CBOW is to predict the generation probability of the current word from the words that appear in its context, as shown in figure (a), while Skip-gram predicts the generation probability of the context words from the current word, as shown in figure (b).

[Figure: the two network structures of Word2Vec: (a) CBOW, (b) Skip-gram]

Here w(t) is the current word of interest, and w(t−2), w(t−1), w(t+1), w(t+2) are the words that appear in its context; the sliding window size on each side is set to 2.
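To make the sliding window concrete, here is a minimal sketch (the sentence and variable names are illustrative, not from the article) that enumerates the (current word, context word) pairs produced by a window of size 2. Skip-gram trains on exactly these pairs, while CBOW groups all context words of one position together to predict the current word.

sentence = ["the", "quick", "brown", "fox", "jumps", "over", "the", "dog"]
window = 2

pairs = []
for t, target in enumerate(sentence):
    # context positions w(t-2), w(t-1), w(t+1), w(t+2)
    for offset in range(-window, window + 1):
        c = t + offset
        if offset != 0 and 0 <= c < len(sentence):
            pairs.append((target, sentence[c]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]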

Both CBOW and Skip-gram can be expressed as a neural network composed of an input layer (Input), a projection layer (Projection), and an output layer (Output).

Each word in the input layer is represented by one-hot encoding, that is, every word is represented as an N-dimensional vector, where N is the total number of words in the vocabulary. In this vector, the dimension corresponding to the word is set to 1, and all other dimensions are set to 0.
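As a small illustration of one-hot encoding (the toy vocabulary below is assumed for demonstration only), each word maps to an N-dimensional vector with a single 1:

import numpy as np

vocab = ["the", "quick", "brown", "fox"]            # N = 4
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))                        # all dimensions start at 0
    v[word_to_index[word]] = 1.0                    # only the word's own dimension is 1
    return v

print(one_hot("brown"))                             # [0. 0. 1. 0.]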

In the projection layer (also known as the hidden layer), the values of the K hidden units (Hidden Units) are computed from the N-dimensional input vector and the N × K weight matrix connecting the input layer to the hidden units. In CBOW, the hidden-unit values computed from each input word also need to be summed.

Similarly, the output layer vector is computed from the K-dimensional hidden layer vector and the K × N weight matrix connecting the hidden layer to the output layer. The output layer is also an N-dimensional vector, and each dimension corresponds to one word in the vocabulary. Finally, applying the Softmax activation function to the output layer vector gives the generation probability of each word. The Softmax activation function is defined as

P(y = wn | x) = exp(xn) / Σ_{k=1}^{N} exp(xk)

where x denotes the N-dimensional raw output vector, and xn is the value of the dimension corresponding to word wn in that vector.
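The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative toy (random weights, tiny N and K, Skip-gram path only), not the article's implementation:

import numpy as np

np.random.seed(0)
N, K = 4, 3                       # vocabulary size, hidden dimension
W_in = np.random.randn(N, K)      # N x K weights: input layer -> hidden layer
W_out = np.random.randn(K, N)     # K x N weights: hidden layer -> output layer

def softmax(x):
    e = np.exp(x - x.max())       # subtract the max for numerical stability
    return e / e.sum()

x = np.zeros(N); x[2] = 1.0       # one-hot vector of the current word
h = x @ W_in                      # K-dimensional hidden (projection) layer
probs = softmax(h @ W_out)        # generation probability of every word in the vocabulary

# For CBOW, the hidden vectors of all context words would be summed before multiplying by W_out.
print(probs, probs.sum())         # N probabilities that sum to 1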

The next task is to train the weights of the neural network so that the overall generation probability of all words in the corpus is maximized. From the input layer to the hidden layer we need an N × K weight matrix, and from the hidden layer to the output layer we need a K × N weight matrix. The weights can be learned with the back-propagation algorithm: in each iteration, the weights are updated by a small step in the direction of a better gradient. However, because the Softmax activation function contains a normalization term, the derived update formula has to traverse every word in the vocabulary, which makes each iteration very slow. This gave rise to two improved methods: Hierarchical Softmax and Negative Sampling. After training, either of the two weight matrices (N × K or K × N) can be chosen as the K-dimensional vector representation of the N words.
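In practice the training loop is rarely implemented by hand. The sketch below uses the gensim library (a common choice, though the article does not name a specific toolkit) to show Skip-gram trained with either Negative Sampling or Hierarchical Softmax; parameter names follow gensim's 4.x API, and the tiny corpus is only for illustration:

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox"],
             ["the", "lazy", "brown", "dog"]]

# Skip-gram (sg=1) with Negative Sampling: 5 noise words per positive pair.
model_ns = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                    sg=1, hs=0, negative=5, epochs=50)

# The same model trained with Hierarchical Softmax instead (hs=1, negative=0).
model_hs = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                    sg=1, hs=1, negative=0, epochs=50)

print(model_ns.wv["fox"].shape)   # (50,) -- the learned K-dimensional word vector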

The difference and relationship between Word2Vec and LDA

First of all, LDA uses the co-occurrence of words within documents to cluster words by topic; it can also be understood as decomposing the "document-word" matrix into two probability distributions, "document-topic" and "topic-word". Word2Vec, in contrast, is effectively learning from a "context-word" matrix, where the context consists of the several words surrounding a given word, so the resulting word vectors capture the characteristics of contextual co-occurrence. In other words, if the Word2Vec vectors of two words are highly similar, the two words are likely to appear frequently in the same contexts.

Note that the analysis above only describes the difference between LDA and Word2Vec; it should not be taken as the main difference between topic models and word embedding methods in general. With some structural adjustments, a topic model can also be inferred from a "context-word" matrix, and likewise a word embedding method can learn latent vector representations of words from a "document-word" matrix.

The biggest difference between topic models and word embedding methods lies in the models themselves. A topic model is a generative model based on a probabilistic graphical model: its likelihood function can be written as a product of conditional probabilities and contains latent variables (namely the topics) that need to be inferred. A word embedding model is usually expressed as a neural network: its likelihood function is defined on the network's output, and the network weights must be learned to obtain the dense vector representation of each word.
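The contrast can also be seen from how the two models are typically used. The following sketch (again with gensim; the corpus and parameters are illustrative assumptions) feeds the same toy documents to LDA, which factorizes the "document-word" matrix into topics, and to Word2Vec, which learns from sliding-window "context-word" co-occurrences:

from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [["apple", "banana", "fruit"],
        ["dog", "cat", "pet"],
        ["apple", "fruit", "juice"]]

# LDA: documents -> topic mixtures, topics -> word distributions.
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())

# Word2Vec: each word -> a dense vector shaped by its window co-occurrences.
w2v = Word2Vec(docs, vector_size=20, window=2, min_count=1, sg=1, epochs=100)
print(w2v.wv.most_similar("apple", topn=2))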

After reading the above, have you got a grasp of how Word2Vec works and how it differs from and relates to LDA? If you want to learn more, you are welcome to follow the industry information channel. Thank you for reading!
