
Application of LDA in Recommendation Systems


This article introduces how LDA can be applied in recommendation systems. Many people run into this question in practical work, so let's walk through it step by step. I hope you read it carefully and take something away from it!

Overview

LDA is a classic algorithm for modeling the topics of documents. How can it be applied to a recommendation system? Read on.

Latent Dirichlet Allocation (LDA) is a topic modeling algorithm for the unsupervised discovery of underlying topics in a corpus. It has been widely used in many fields, especially natural language processing and recommendation systems.

A brief introduction

LDA is a probabilistic model of how a corpus of documents is generated. It rests on the "bag of words" assumption that words and documents are exchangeable; in other words, the order of words within a document is ignored, as is the order of the documents themselves. The basic idea is that each document is a mixture of topics, and each topic is described by a distribution over words:

Each document is represented by a distribution over topics.

Each topic is represented by a distribution over words.

LDA assumes that each document is generated by drawing a topic for each word position from the document's topic distribution, and then drawing the word itself from that topic's word distribution. To learn suitable word and topic distributions, we can train LDA with Gibbs sampling, maximum a posteriori (MAP) estimation, or expectation maximization (EM).

Plate representation

To go a little further, let's look at the notation used for LDA. In Bayesian inference, plate notation is a graphical way of representing the repeated sampling of random variables. Each plate can be thought of as a "loop", where the variable in the lower-right corner of the plate gives the number of iterations of the loop. Below is the plate representation of LDA.

LDA plate representation

There are two components in the figure above. In the upper plate there are K topics, and the word distribution of each topic is a Dirichlet distribution controlled by the hyperparameter β. The lower plate shows that there are M documents, each containing N words. The shaded circle w is the observed word, while the other circles represent latent variables: z is the topic assigned to w, and θ is the document's topic distribution, a Dirichlet distribution controlled by the other hyperparameter ⍺.

Generation process

Now we have a rough idea of how to generate documents through plate notation. Let's express it mathematically.

Sample θ_i ~ Dir(⍺), for i from 1 to M

Sample φ_k ~ Dir(β), for k from 1 to K

For each word position (i, j): sample a topic z_ij ~ Multinomial(θ_i), then sample a word w_ij ~ Multinomial(φ_z_ij), for i from 1 to M and j from 1 to N

Take the New York Times as an example. First, for each news article we sample a topic distribution θ_i over the whole document, and for each topic we sample its word distribution φ_k. Then, for word position j in document i, we draw a topic z_ij from Multinomial(θ_i) and then draw the word w_ij from the corresponding word distribution Multinomial(φ_z_ij). This process is illustrated in the figure below.

Visualization of the generation process
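To make these three steps concrete, here is a minimal sketch of the generative process in Python (assuming numpy is available); the corpus sizes, vocabulary size, and hyperparameter values are made up purely for illustration.

import numpy as np

rng = np.random.default_rng(0)

M, K, V, N = 4, 3, 8, 10   # documents, topics, vocabulary size, words per document (illustrative)
alpha = np.full(K, 0.5)    # document-topic Dirichlet hyperparameter
beta = np.full(V, 0.1)     # topic-word Dirichlet hyperparameter

# Step 2: one word distribution phi_k per topic
phi = rng.dirichlet(beta, size=K)          # shape (K, V)

docs = []
for i in range(M):
    # Step 1: topic distribution theta_i for document i
    theta = rng.dirichlet(alpha)           # shape (K,)
    words = []
    for j in range(N):
        # Step 3: pick a topic for this position, then pick a word from that topic
        z_ij = rng.choice(K, p=theta)
        w_ij = rng.choice(V, p=phi[z_ij])
        words.append(w_ij)
    docs.append(words)

print(docs[0])   # word ids of the first synthetic document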

Dirichlet distribution

So far we have used the Dirichlet distribution as a black box without any explanation. Let's briefly discuss the intuition behind it. A k-dimensional Dirichlet distribution is controlled by a k-dimensional parameter vector ⍺. Below is a three-dimensional example. The basic idea is that the larger the alpha values, the more the sampled distributions are pushed toward the center (the uniform distribution), while small alpha values concentrate the samples near the corners (sparse distributions). This gives great flexibility in deciding how many words/topics a topic/document is associated with, because some topics/documents may be associated with a large set of words/topics while others may not.

Dirichlet distribution
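As a quick illustration of this effect, the following small sketch (again assuming numpy) draws samples from a symmetric three-dimensional Dirichlet for several alpha values; small alphas produce sparse, corner-hugging samples, while large alphas push samples toward the uniform center.

import numpy as np

rng = np.random.default_rng(0)

# Symmetric 3-dimensional Dirichlet: small alpha -> samples pile up near the corners
# (sparse distributions), large alpha -> samples cluster around the uniform center.
for a in (0.1, 1.0, 10.0):
    samples = rng.dirichlet(np.full(3, a), size=5)
    print(f"alpha = {a}")
    print(samples.round(3))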

Learning

Learning the LDA model is an "inference" problem: given the observed words w and the hyperparameters ⍺ and β, how do we estimate the posterior distribution of the latent variables, p(θ, z, φ | w, ⍺, β) = p(θ, z, φ, w | ⍺, β) / p(w | ⍺, β)?

However, the integral required to compute the denominator p(w | ⍺, β) is intractable.

Therefore, approximate inference must be used. Common approaches are Gibbs sampling and variational inference. In this article we focus on the former.

Gibbs sampling

With Gibbs sampling, we can avoid computing the intractable integral directly. The basic idea is that we want to estimate the posterior p(θ, z, φ | w, ⍺, β), but we cannot sample from it directly. Instead, Gibbs sampling lets us iteratively sample one latent variable from its conditional posterior while holding all the other variables fixed. In this way we obtain samples that approximate the posterior distribution p(θ, z, φ | w, ⍺, β).

In each iteration we alternately sample θ, z, and φ, each time fixing all other variables. The algorithm is shown in the following pseudocode:

For i from 1 to MaxIter:
    Sample θ_i ~ p(θ | z = z_{i-1}, φ = φ_{i-1}, w, ⍺, β)
    Sample z_i ~ p(z | θ = θ_i, φ = φ_{i-1}, w, ⍺, β)
    Sample φ_i ~ p(φ | θ = θ_i, z = z_i, w, ⍺, β)

Because the samples from early iterations are unstable, we discard the samples from the first B iterations; this is called "burn-in".
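Below is a toy sketch of the alternating scheme in the pseudocode above, written with numpy; the function name, default hyperparameters, and data format are assumptions made only for illustration. Note that practical LDA implementations usually use collapsed Gibbs sampling, which integrates out θ and φ and samples only z.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda(docs, K, V, alpha=0.5, beta=0.1, max_iter=200, burn_in=50):
    # docs: list of lists of word ids in [0, V); toy data only.
    M = len(docs)
    z = [rng.integers(K, size=len(d)) for d in docs]      # random initial topic assignments
    theta = rng.dirichlet(np.full(K, alpha), size=M)      # document-topic distributions
    phi = rng.dirichlet(np.full(V, beta), size=K)         # topic-word distributions
    kept_theta, kept_phi = [], []

    for it in range(max_iter):
        # Sample theta_i given the current topic assignments (Dirichlet-multinomial conjugacy)
        for i in range(M):
            theta[i] = rng.dirichlet(alpha + np.bincount(z[i], minlength=K))
        # Sample each z_ij given theta and phi
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                p = theta[i] * phi[:, w]
                z[i][j] = rng.choice(K, p=p / p.sum())
        # Sample phi_k given the topic assignments and observed words
        word_topic = np.zeros((K, V))
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                word_topic[z[i][j], w] += 1
        for k in range(K):
            phi[k] = rng.dirichlet(beta + word_topic[k])
        # Discard the first burn_in iterations ("burn-in"), keep the rest
        if it >= burn_in:
            kept_theta.append(theta.copy())
            kept_phi.append(phi.copy())

    return np.mean(kept_theta, axis=0), np.mean(kept_phi, axis=0)

# Tiny usage example on made-up word ids
theta_hat, phi_hat = gibbs_lda([[0, 1, 2, 1], [2, 3, 3, 0]], K=2, V=4)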

Applications of LDA in recommendation systems

LDA is commonly used for recommendation systems in two situations:

Collaborative filtering (CF)

Content-based recommendation

Collaborative filtering

When LDA is applied to item-based CF, items and users play the roles of the documents and words discussed above (user-based CF is just the opposite). In other words, each item is associated with a distribution over user groups (topics), and each user group is a distribution over users. Using LDA, we can uncover the hidden relationships between users and items, as sketched below.
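As an illustration, one hypothetical way to set this up with gensim is to treat each item as a "document" whose "words" are the ids of the users who interacted with it, and then train LdaModel on that corpus; the interaction data below is made up.

from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

# Hypothetical interaction data: each item is a "document" whose "words"
# are the ids of users who interacted with it.
item_users = [
    ["u1", "u2", "u3"],       # item 0
    ["u2", "u3", "u4"],       # item 1
    ["u5", "u6"],             # item 2
    ["u5", "u6", "u7"],       # item 3
]

dictionary = Dictionary(item_users)
corpus = [dictionary.doc2bow(users) for users in item_users]

# Each discovered "topic" is a latent user group; each item gets a
# distribution over these groups.
lda_cf = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20, random_state=0)

print(lda_cf[corpus[0]])   # user-group distribution for item 0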

Content-based recommendation

The second application is content-based recommendation, which is straightforward. Instead of only using ordinary TF-IDF to extract a feature vector from each item's text, we also use LDA to model the topics of that text. Sample code for training LDA and inferring the topic distribution of a given document is provided below.

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=10)

Training LDA

# Infer the topic distribution of the second document in the corpus.
lda[common_corpus[1]]

''' output:
[(0, 0.014287902), (1, 0.014287437), (2, 0.014287902), (3, 0.014285716),
 (4, 0.014285716), (5, 0.014285714), (6, 0.014285716), (7, 0.014285716),
 (8, 0.014289378), (9, 0.87141883)]
'''

Inferring the topic distribution of a document
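Once every item has a topic vector, a simple content-based recommender can rank items by their similarity to the item a user is viewing. The following sketch is an assumption built on top of the example above (reusing lda and common_corpus): it builds a topic-space index with gensim's MatrixSimilarity and retrieves the most similar items.

from gensim.similarities import MatrixSimilarity

# Dense topic-space index over all documents (items) in the training corpus.
index = MatrixSimilarity(lda[common_corpus], num_features=lda.num_topics)

# Similarity of every item to item 1 in topic space; recommend the most similar ones.
sims = index[lda[common_corpus[1]]]
ranking = sorted(enumerate(sims), key=lambda x: -x[1])
print(ranking[:3])   # top-3 most similar items (including item 1 itself)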

This is the end of "Application of LDA in Recommendation Systems". Thank you for reading!
