

Spark MLlib Topic Model: A Case Analysis




Many newcomers are not clear on how to carry out a Spark MLlib topic-model case analysis. To help with that, this article walks through one in detail; anyone who needs it can follow along, and hopefully you will gain something.

About the Algorithm

1. LDA Topic Model

Notation

The document set D contains m articles; the topic set T contains k topics.

Each document d in D is treated as a word sequence

< w1, w2, ..., wn >

where wi denotes the ith word and d has n words. (LDA is a bag-of-words model: the position of each word has no effect on the algorithm.)

All the distinct words appearing in D make up the vocabulary set VOC.

The distributions LDA fits

Each article d has its own topic distribution, which is a multinomial distribution whose parameters follow a Dirichlet distribution with hyperparameter α.

Each topic has its own word distribution, also a multinomial distribution, whose parameters likewise follow a Dirichlet distribution (with hyperparameter β).

To generate the nth word of an article, first sample a topic from the article's topic distribution, then sample a word from the word distribution of that topic. Repeat this random generation process until all m articles are complete.
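
To make the generative story concrete, here is a minimal, self-contained Scala sketch of this process; the parameters θd, φt, and the vocabulary below are made-up toy values, not anything learned by Spark:

import scala.util.Random

object LdaGenerativeSketch {
  val rng = new Random(42)

  // Draw an index from a discrete (multinomial) distribution.
  def sample(probs: Array[Double]): Int = {
    val r = rng.nextDouble()
    var cum = 0.0
    probs.indices.find { i => cum += probs(i); cum > r }.getOrElse(probs.length - 1)
  }

  def main(args: Array[String]): Unit = {
    val theta = Array(0.7, 0.3)                 // toy θd over 2 topics
    val phi = Array(
      Array(0.5, 0.4, 0.1),                     // toy φ for topic 0
      Array(0.1, 0.2, 0.7))                     // toy φ for topic 1
    val vocab = Array("spark", "lda", "topic")  // toy 3-word VOC
    // For each word slot: sample a topic from θd, then a word from that topic's φ.
    val doc = (1 to 10).map(_ => vocab(sample(phi(sample(theta)))))
    println(doc.mkString(" "))
  }
}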

What we want to train are two kinds of result vectors (for k topics, with VOC containing m distinct words).

LDA takes the document set D as input (after preprocessing such as word segmentation, stop-word removal, and stemming) and learns:

For each document d in D, the topic probabilities

θd = < pt1, ..., ptk >

where pti is the probability that d corresponds to the ith topic in T. The calculation is intuitive: pti = nti/n, where nti is the number of words in d assigned to the ith topic and n is the total number of words in d. For example, if d contains 100 words and 30 of them are assigned to topic 1, then pt1 = 30/100 = 0.3.

For each topic t in T, the word-generation probabilities

φt = < pw1, ..., pwm >

where pwi is the probability that t generates the ith word in VOC. The calculation is equally straightforward: pwi = Nwi/N, where Nwi is the number of occurrences of the ith VOC word assigned to topic t, and N is the total number of words assigned to topic t.

The core formula of LDA is as follows:

p(w|d) = Σ_t p(w|t) × p(t|d)

Intuitively, with topics as the middle layer, the probability of word w appearing in document d is given by the current θd and φt, summed over all k topics: p(t|d) is calculated from θd, and p(w|t) from φt. A worked toy example follows.
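
As a quick check of the formula with made-up numbers: suppose two topics, θd = (0.7, 0.3), and a word w whose probability is 0.5 under topic 1 and 0.1 under topic 2.

val pTgivenD = Array(0.7, 0.3)    // θd: p(t|d) for the two topics
val pWgivenT = Array(0.5, 0.1)    // φt: p(w|t) for word w under each topic
val pWgivenD = pWgivenT.zip(pTgivenD).map { case (a, b) => a * b }.sum
println(pWgivenD)                 // 0.5*0.7 + 0.1*0.3 = 0.38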

2. RegexTokenizer

RegexTokenizer splits documents into groups of words based on a regular expression. By default, the parameter pattern (regex, default: "\\s+") is used as the delimiter for splitting the input text. Alternatively, the user can set the parameter gaps to false, meaning the regex matches the tokens themselves rather than the gaps between them; all matches are then returned as the tokenization result. A sketch follows the reference below.

Please refer to: Tokenizer segmentation based on DF
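
A minimal sketch of the gaps=false behavior (this assumes a local SparkSession; the column names and sample sentence are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.RegexTokenizer

val spark = SparkSession.builder.master("local[*]").appName("tok").getOrCreate()
import spark.implicits._

val df = Seq((0, "Spark MLlib: topic models, made easy!")).toDF("id", "text")
val tok = new RegexTokenizer()
  .setInputCol("text").setOutputCol("words")
  .setPattern("\\w+").setGaps(false)   // the pattern matches tokens, not delimiters
tok.transform(df).select("words").show(false)
// [spark, mllib, topic, models, made, easy]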

3. StopWordsRemover

Stop words are simply words used extremely widely in a language. Wherever text is processed, these stop words get special treatment so that we can focus on the words that carry more meaning.

A stop-word list generally does not need to be built by hand; many ready-made lists are available for download, and you can choose among them.

Spark provides the StopWordsRemover class for handling stop words; it can be used as part of a machine-learning Pipeline.

StopWordsRemover simply removes all stop words: every input row from inputCol is checked, and the stop words are dropped from the output written to outputCol. A minimal example follows the reference below.

For details, please refer to the article: StopWordsRemover processing based on DataFrame
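
A minimal sketch (this assumes spark.implicits._ is in scope; the toy tokens are made up). Note that Spark also ships default stop-word lists per language via StopWordsRemover.loadDefaultStopWords, so a hand-built list is optional:

import org.apache.spark.ml.feature.StopWordsRemover

val english = StopWordsRemover.loadDefaultStopWords("english")
val toyRemover = new StopWordsRemover()
  .setStopWords(english)
  .setInputCol("tokens").setOutputCol("filtered")

val toy = Seq((0, Seq("this", "is", "a", "topic", "model"))).toDF("id", "tokens")
toyRemover.transform(toy).show(false)
// filtered: [topic, model]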

4. CountVectorizer

CountVectorizer and CountVectorizerModel help convert collections of text documents into term-frequency vectors. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel. The model generates a sparse frequency vector for each document based on that vocabulary, which can then be passed to other algorithms, such as LDA, for further processing.

During fitting, CountVectorizer counts term frequencies across the whole document collection, sorts them, and keeps the top vocabSize words.

An optional parameter minDF also affects fitting: it specifies the minimum number of documents a term must appear in to enter the vocabulary (or, if less than 1.0, the minimum fraction of documents). Another optional boolean parameter, binary, controls the output vectors: if set to true, all non-zero counts are set to 1, which is especially useful for discrete probabilistic models that expect binary rather than integer counts. A sketch follows the reference below.

For details, please refer to the article: CountVectorizer
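
A minimal sketch of the parameters discussed above (the column names are illustrative):

import org.apache.spark.ml.feature.CountVectorizer

val cv = new CountVectorizer()
  .setInputCol("filtered").setOutputCol("features")
  .setVocabSize(1000)   // keep at most the 1000 most frequent terms
  .setMinDF(2)          // a term must appear in at least 2 documents
  .setBinary(true)      // clip every non-zero count to 1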

The Data

The data covers 20 topics (the mini_newsgroups corpus), with one file per article and 100 files per topic: 2,000 documents in total.

Implementation Steps

1. Import data

// wholeTextFiles yields (path, content) pairs; keep only the lowercased content
val corpus = sc.wholeTextFiles("file:///opt/datas/mini_newsgroups/*").map(_._2).map(_.toLowerCase())

2. Tidy the data format

// drop the header block before the first blank line of each message
val corpus_body = corpus.map(_.split("\n\n")).map(_.drop(1)).map(_.mkString(" "))

import spark.implicits._  // needed for toDF on an RDD
val corpus_df = corpus_body.zipWithIndex.toDF("corpus", "id")

import org.apache.spark.ml.feature.RegexTokenizer

// note the escaped pattern: "[\\W_]+" splits on runs of non-word characters
val tokenizer = new RegexTokenizer().setPattern("[\\W_]+").setMinTokenLength(4).setInputCol("corpus").setOutputCol("tokens")

val tokenized_df = tokenizer.transform(corpus_df)

3. Import stop words

val stopwords = sc.textFile("file:///opt/datas/stop_words.txt").collect()

4. Remove stop words

import org.apache.spark.ml.feature.StopWordsRemover

// Set params for StopWordsRemover

val remover = new StopWordsRemover().setStopWords(stopwords).setInputCol("tokens").setOutputCol("filtered")

// Create new DF with Stopwords removed

val filtered_df = remover.transform(tokenized_df)

5. Generate word frequency vector

import org.apache.spark.ml.feature.CountVectorizer

// Set params for CountVectorizer

val vectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features").setVocabSize(10000).setMinDF(5).fit(filtered_df)

val countVectors = vectorizer.transform(filtered_df).select("id", "features")

6. Build an LDA model

import org.apache.spark.ml.clustering.LDA

val numTopics = 20

// Set LDA params

val lda = new LDA().setK(numTopics).setMaxIter(10)

7. Train the LDA model

val model = lda.fit(countVectors)

8. View the training results

val topicIndices = model.describeTopics(5)  // top 5 terms per topic

9. Using the vocabulary

val vocabList = vectorizer.vocabulary
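
With the vocabulary in hand, the term indices returned by describeTopics can be mapped back to words. A sketch (topic, termIndices, and termWeights are the columns describeTopics produces):

topicIndices.collect().foreach { row =>
  val topic = row.getAs[Int]("topic")
  val terms = row.getAs[Seq[Int]]("termIndices").map(vocabList(_))
  val weights = row.getAs[Seq[Double]]("termWeights")
  println(s"Topic $topic: " +
    terms.zip(weights).map { case (t, w) => f"$t ($w%.4f)" }.mkString(", "))
}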

10. Using the model

val transformed = model.transform(countVectors)  // score the term-frequency vectors (here, the training set)

transformed.show(false)

Tunable Points

1. Add stop-words

val add_stopwords = Array("article", "writes", "entry", "date", "udel", "said", "tell", "think", "know", "just", "newsgroup", "line", "like", "does", "going", "make", "thanks")

val new_stopwords = stopwords.union(add_stopwords)
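
To apply the extended list, rebuild the remover with new_stopwords and regenerate the filtered DataFrame downstream (a sketch reusing the names defined above):

val remover2 = new StopWordsRemover()
  .setStopWords(new_stopwords)
  .setInputCol("tokens").setOutputCol("filtered")
val refiltered_df = remover2.transform(tokenized_df)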

2. Using EM

Spark currently supports two optimizers (inference algorithms) for estimating the LDA model:

online: Online Variational Bayes (the default)

em: Expectation-Maximization

Select one by calling setOptimizer(value: String), passing "online" or "em", as sketched below.
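
A sketch of switching the optimizer, keeping the same k and iteration count as above:

val ldaEM = new LDA()
  .setK(numTopics)
  .setMaxIter(10)
  .setOptimizer("em")
val modelEM = ldaEM.fit(countVectors)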
