SparkMLlib topic model case analysis

Many newcomers are not clear on how to build a topic model with Spark MLlib. The following walks through a complete case study, step by step, so that anyone who needs it can follow along and hopefully take something away.
About the algorithm
1. LDA topic model
Symbol definitions
Document set D contains m articles; topic set T contains k topics.
Each document d in D is treated as a word sequence <w1, w2, ..., wn>, where wi denotes the i-th word and n is the number of words in d. (LDA is a bag-of-words model, so the position of a word has no effect on the algorithm.)
All the distinct words appearing in D make up the vocabulary VOC.
Distributions fitted by LDA
Each document d has its own topic distribution. It is a multinomial distribution whose parameter follows a Dirichlet distribution with hyperparameter α.
Each topic has its own word distribution. It is also a multinomial distribution whose parameter follows a Dirichlet distribution (its hyperparameter is conventionally denoted β).
To generate the n-th word of a document, first sample a topic from that document's topic distribution, then sample a word from the word distribution of that topic. Repeating this random generation process produces all m documents.
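To make the generative story concrete, here is a minimal, self-contained Scala sketch (it is not Spark MLlib code): the toy values of theta, phi, the three-word vocabulary, and the sampleIndex helper are all invented for illustration.

import scala.util.Random

object LdaGenerativeSketch {
  // draw an index from a discrete probability distribution
  def sampleIndex(probs: Array[Double], rng: Random): Int = {
    val r = rng.nextDouble()
    val cumulative = probs.scanLeft(0.0)(_ + _).tail
    val i = cumulative.indexWhere(r <= _)
    if (i >= 0) i else probs.length - 1
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val theta = Array(0.7, 0.3)                  // toy p(topic | document)
    val phi = Array(
      Array(0.6, 0.3, 0.1),                      // toy p(word | topic 0)
      Array(0.1, 0.2, 0.7)                       // toy p(word | topic 1)
    )
    val vocab = Array("spark", "lda", "topic")   // toy 3-word vocabulary
    // one document of 10 words: sample a topic, then a word from that topic
    val doc = (1 to 10).map { _ =>
      val t = sampleIndex(theta, rng)
      vocab(sampleIndex(phi(t), rng))
    }
    println(doc.mkString(" "))
  }
}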
The result we want to train consists of two sets of vectors (for the k topics and for a vocabulary VOC of m distinct words; this m is unrelated to the number of articles above).
LDA takes the document set D as input (after tokenization, stop-word removal, stemming, and so on) and produces:
For each document d in D, the topic distribution θd = <pt1, ..., ptk>, where pti is the probability that d corresponds to the i-th topic in T. Intuitively, pti = nti / n, where nti is the number of words in d assigned to the i-th topic and n is the total number of words in d.
For each topic t in T, the word distribution φt = <pw1, ..., pwm>, where pwi is the probability that t generates the i-th word in VOC. Intuitively, pwi = Nwi / N, where Nwi is the number of occurrences of the i-th VOC word assigned to topic t and N is the total number of words assigned to topic t.
The core formula of LDA is:
p(w|d) = Σ p(w|t) * p(t|d), summed over all topics t in T
Intuitively, with topics as the intermediate layer, the probability of word w appearing in document d is given by the current θd and φt: p(t|d) is computed from θd, and p(w|t) is computed from φt.
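As a quick sanity check of the formula (all numbers below are made up, not taken from any dataset), here is the calculation for one word w in a document with two topics:

// p(w|d) = sum over topics of p(w|t) * p(t|d), with invented toy numbers
val thetaD = Array(0.7, 0.3)   // p(t1|d) = 0.7, p(t2|d) = 0.3  (from θd)
val phiW = Array(0.2, 0.05)    // p(w|t1) = 0.2, p(w|t2) = 0.05 (from φt)
val pWd = thetaD.zip(phiW).map { case (pt, pw) => pt * pw }.sum
println(pWd)                   // 0.7*0.2 + 0.3*0.05 = 0.155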
2. RegexTokenizer
RegexTokenizer splits documents into groups of words based on a regular expression. By default the parameter "pattern" (a regex, default "\s+") is used as the delimiter that separates tokens. Alternatively, the user can set the parameter "gaps" to false, in which case "pattern" describes the tokens themselves rather than the gaps between them, and all matches are returned as the tokenization result.
For details, see the related article: DataFrame-based Tokenizer segmentation.
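A short sketch of the two modes described above (assuming a running SparkSession named spark; the sample sentence and column names are made up):

import org.apache.spark.ml.feature.RegexTokenizer

val sentences = spark.createDataFrame(Seq(
  (0, "Spark MLlib builds LDA topic models")
)).toDF("id", "sentence")

// default mode: "pattern" describes the gaps between tokens
val gapTokenizer = new RegexTokenizer()
  .setInputCol("sentence").setOutputCol("words")
  .setPattern("\\s+")

// gaps = false: "pattern" describes the tokens themselves
val tokenMatcher = new RegexTokenizer()
  .setInputCol("sentence").setOutputCol("words")
  .setGaps(false)
  .setPattern("\\w+")

gapTokenizer.transform(sentences).select("words").show(false)
tokenMatcher.transform(sentences).select("words").show(false)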
3. StopWordsRemover
Stop words are simply words that are used very widely in a language. Wherever text needs to be processed, these words are given special treatment so that attention can be focused on the more meaningful words.
You generally do not need to build a stop-word list yourself; there are many published lists you can download and choose from.
Spark provides the StopWordsRemover class for handling stop words, and it can be used as a stage of a machine-learning Pipeline.
StopWordsRemover simply removes all stop words: every token read from inputCol is checked, and the tokens written to outputCol have the stop words filtered out.
For details, see the related article: DataFrame-based StopWordsRemover processing.
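A minimal sketch of the class in isolation (the sample tokens are made up, and a SparkSession named spark is assumed); note that Spark also ships default stop-word lists through StopWordsRemover.loadDefaultStopWords:

import org.apache.spark.ml.feature.StopWordsRemover

// built-in English stop-word list bundled with Spark
val english = StopWordsRemover.loadDefaultStopWords("english")

val demoRemover = new StopWordsRemover()
  .setStopWords(english)
  .setInputCol("tokens")
  .setOutputCol("filtered")

// toy input: one row of already-tokenized text
val demoTokens = spark.createDataFrame(Seq(
  (0, Seq("this", "is", "a", "topic", "model", "example"))
)).toDF("id", "tokens")

demoRemover.transform(demoTokens).show(false)   // "this", "is", "a" are removed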
4. CountVectorizer
CountVectorizer and CountVectorizerModel help convert a collection of text documents into term-frequency vectors. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel. The model generates a sparse vector for each document based on that vocabulary, which can then be passed to other algorithms, such as LDA, for further processing.
During fitting, CountVectorizer counts term occurrences across the entire document collection and keeps the vocabSize most frequent words.
An optional parameter, minDF, also affects fitting: it specifies the minimum number of documents a term must appear in to enter the vocabulary (or, if the value is less than 1.0, the minimum fraction of documents). Another optional boolean parameter, binary, controls the output vector: if set to true, all non-zero counts are set to 1, which is especially useful for discrete probabilistic models that work with binary rather than integer counts.
For details, see the related article: CountVectorizer.
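A small sketch of the parameters described above (the toy corpus and the vocabSize/minDF/binary values are chosen only for illustration, assuming a SparkSession named spark):

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

// toy corpus of already-tokenized documents
val docs = spark.createDataFrame(Seq(
  (0, Seq("spark", "lda", "lda", "topic")),
  (1, Seq("spark", "topic", "model")),
  (2, Seq("spark", "model"))
)).toDF("id", "tokens")

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("tokens").setOutputCol("features")
  .setVocabSize(10)    // keep at most the 10 most frequent terms
  .setMinDF(2)         // a term must appear in at least 2 documents
  .setBinary(false)    // true would clamp every non-zero count to 1
  .fit(docs)

println(cvModel.vocabulary.mkString(", "))
cvModel.transform(docs).show(false)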
The data
The data covers 20 newsgroups (topics), with one file per article and 100 files per topic, for 2,000 documents in total.
Implementation steps
1. Import data
// read every file as (path, content), keep the content, and lowercase it
val corpus = sc.wholeTextFiles("file:///opt/datas/mini_newsgroups/*").map(_._2).map(_.toLowerCase())
2. Data format arrangement
// split on blank lines, drop the newsgroup header block, and rejoin the body
val corpus_body = corpus.map(_.split("\n\n")).map(_.drop(1)).map(_.mkString(" "))
val corpus_df = corpus_body.zipWithIndex.toDF("corpus", "id")
import org.apache.spark.ml.feature.RegexTokenizer
// split on runs of non-word characters and keep only tokens of length >= 4
val tokenizer = new RegexTokenizer().setPattern("[\\W_]+").setMinTokenLength(4).setInputCol("corpus").setOutputCol("tokens")
val tokenized_df = tokenizer.transform(corpus_df)
3. Import stop words
val stopwords = sc.textFile("file:///opt/datas/stop_words.txt").collect()
4. Remove stop words
import org.apache.spark.ml.feature.StopWordsRemover
// Set params for StopWordsRemover
val remover = new StopWordsRemover().setStopWords(stopwords).setInputCol("tokens").setOutputCol("filtered")
// Create new DF with Stopwords removed
val filtered_df = remover.transform(tokenized_df)
5. Generate word frequency vector
import org.apache.spark.ml.feature.CountVectorizer
// Set params for CountVectorizer
val vectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features").setVocabSize(10000).setMinDF(5).fit(filtered_df)
val countVectors = vectorizer.transform(filtered_df).select("id", "features")
6. Build an LDA model
import org.apache.spark.ml.clustering.LDA
val numTopics = 20
// Set LDA params
val lda = new LDA().setK(numTopics).setMaxIter(10)
7. Train the LDA model
val model = lda.fit(countVectors)
8. Inspect the training results
val topicIndices = model.describeTopics(5)
9. Use the vocabulary
val vocabList = vectorizer.vocabulary
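One way to combine the two results (a sketch; describeTopics returns the standard topic, termIndices, and termWeights columns) is to map each topic's term indices back to words through vocabList:

// print each topic's top terms as words instead of vocabulary indices
topicIndices.collect().foreach { row =>
  val topic = row.getAs[Int]("topic")
  val terms = row.getAs[Seq[Int]]("termIndices")
  val weights = row.getAs[Seq[Double]]("termWeights")
  println(s"Topic $topic:")
  terms.zip(weights).foreach { case (idx, weight) =>
    println(f"  ${vocabList(idx)}%-15s $weight%.4f")
  }
}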
10. Use the model
// the natural dataset to transform here is the countVectors DataFrame from step 5
val transformed = model.transform(countVectors)
transformed.show(false)
Tunable points
1. Add stop-words
val add_stopwords = Array("article", "writes", "entry", "date", "udel", "said", "tell", "think", "know", "just", "newsgroup", "line", "like", "does", "going", "make", "thanks")
val new_stopwords = stopwords.union(add_stopwords)
2. Use EM
The optimizer (inference algorithm) used to estimate the LDA model. Spark currently supports two:
online: Online Variational Bayes (the default)
em: Expectation-Maximization
Select one by calling setOptimizer(value: String) and passing "online" or "em".
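For example, keeping the other parameters from the steps above and only swapping the optimizer:

// EM-based estimation instead of the default online variational Bayes
val emLDA = new LDA()
  .setK(numTopics)
  .setMaxIter(10)
  .setOptimizer("em")
val emModel = emLDA.fit(countVectors)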