
What is the implementation of the LDA-based k-means algorithm on the Spark platform?


What is the implementation of the LDA-based k-means algorithm on the Spark platform? To answer this question, this article presents a detailed analysis and solution, hoping to help more readers who face this problem find a simpler and more feasible approach.

1. Text Mining Module Design

1.1 Text Mining Process

Text analysis is a broad field of machine learning and has been widely applied in sentiment analysis, chatbots, spam detection, recommendation systems, natural language processing, and more.

Text clustering is an important concept in information retrieval and is widely used in text mining. It automatically divides a text dataset into different clusters, making it easier to organize textual information and enabling efficient knowledge navigation and browsing.

This article uses the topic model LDA (Latent Dirichlet Allocation) to classify documents, implementing the LDA algorithm with Spark MLlib on the Spark platform. Spark MLlib is Spark's machine learning library and provides implementations of commonly used machine learning algorithms.

1.2 Analysis of the Text Mining Process

First comes the data source part: the main data consists of document data and Internet crawler data. Next is data extraction, where the collected data are uploaded via synchronization tools to the distributed file system HDFS, serving as the data source for model training. Then comes data exploration and preprocessing, which mainly preprocesses the raw dataset: word segmentation, stop-word filtering, feature vector extraction, and so on. After that is model training, which includes training and testing and produces a model. Finally comes model evaluation: once the learned model has been evaluated, it is deployed online.
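As an illustration of the preprocessing stage, here is a minimal sketch using Spark ML's built-in transformers; the rawText input DataFrame and the column names are assumptions made for the sketch, not part of the original pipeline:

import org.apache.spark.ml.feature.{CountVectorizer, StopWordsRemover, Tokenizer}

// rawText: a DataFrame with a "text" column (hypothetical input)
val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(rawText)
val filtered = new StopWordsRemover().setInputCol("words").setOutputCol("filtered").transform(words)
// build term-count feature vectors for downstream topic modeling
val cvModel = new CountVectorizer().setInputCol("filtered").setOutputCol("features").fit(filtered)
val featuresDF = cvModel.transform(filtered)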

2. Research on Text Mining Algorithms

2.1 The LDA Topic Model Algorithm

LDA (Latent Dirichlet Allocation), i.e., latent Dirichlet allocation, is a probabilistic topic model algorithm proposed by David M. Blei, Andrew Y. Ng, and Michael I. Jordan in 2003. It gives the topic of each document in a document set in the form of a probability distribution, projecting text vectors into a topic space that is easier to analyze and handle while removing noise from the text. It is a commonly used text analysis technique that can identify latent topic information in large-scale document sets or corpora and is typically used to model large document collections. Given a topic model and a chosen number of topics, we can learn the proportions of different topics in the document set and the probability of each keyword occurring under each topic. The topic distributions and topic proportions obtained from the document collection can then be used in further data mining tasks.

LDA borrows the bag-of-words idea: it selects a topic with a certain probability, then selects each word under that topic with a certain probability, and generates all the words of a document by repeating this step. This amounts to a soft clustering of words, where the words gathered into one class indirectly express a latent topic. By mining text in this way, LDA can measure the latent relationships between documents and can also express a document's hidden topics through particular classes of words.
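Concretely, the standard LDA generative story (with θ_d the topic distribution of document d, φ_k the word distribution of topic k, and α, β the Dirichlet priors) runs as follows:

for each topic k: draw a word distribution φ_k ~ Dirichlet(β)
for each document d: draw a topic distribution θ_d ~ Dirichlet(α)
  for each word position in d: draw a topic z ~ Multinomial(θ_d), then a word w ~ Multinomial(φ_z)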

2.2 K-means algorithm

Clustering divides a group of abstract objects into several groups, each composed of similar objects, called a cluster. Unlike classification (which assigns data according to predefined criteria), clustering is a form of unsupervised learning: the label information of the training set is unknown, and the goal is to group unlabeled training samples by their similarity under some specific measure, providing a basis for further data analysis.

The basic idea of the k-means algorithm is to start with K randomly given cluster centers, i.e., k arbitrary objects selected from the n data objects as initial centers, and assign each sample point to its nearest center. Each cluster's center is then recomputed as the mean of all data objects in that cluster, determining the new centers. This iterates until the centers move less than a given threshold.

The k-means algorithm adopts a greedy strategy and approximately minimizes, through iterative optimization, the within-cluster sum of squared errors E = Σ_{i=1}^{K} Σ_{x∈C_i} ‖x − μ_i‖², where μ_i is the mean vector of cluster C_i. The flow of the algorithm is shown in the following figure.
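To make the two alternating steps concrete, here is a self-contained plain-Scala sketch; it is an illustration only, not the Spark MLlib implementation used in the experiments below:

import scala.util.Random

// minimal k-means sketch: points are dense vectors, distance is Euclidean
def kmeans(points: Seq[Array[Double]], k: Int, maxIter: Int = 100, tol: Double = 1e-4): Seq[Array[Double]] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  var centers = Random.shuffle(points).take(k)   // k random objects as initial centers
  var moved = Double.MaxValue
  var iter = 0
  while (moved > tol && iter < maxIter) {
    // assignment step: attach every point to its nearest center
    val clusters = points.groupBy(p => centers.minBy(c => dist(p, c)))
    // update step: recompute each center as the mean of its members
    val updated = centers.map { c =>
      val members = clusters.getOrElse(c, Seq(c))
      members.transpose.map(_.sum / members.size).toArray
    }
    // stop when no center moves more than the tolerance
    moved = centers.zip(updated).map { case (a, b) => dist(a, b) }.max
    centers = updated
    iter += 1
  }
  centers
}

// e.g. kmeans(Seq(Array(0.0, 0.0), Array(0.1, 0.2), Array(5.0, 5.0), Array(5.1, 4.9)), k = 2)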

2.3 Optimization of the Text Mining Algorithm

When the LDA topic model algorithm is applied to document clustering, the computed topics can be regarded as the cluster centers of the documents. Clustering documents with a topic model organizes a document dataset effectively. Moreover, because LDA computes each document's probability distribution over the different topics, this topic distribution can serve as the document's feature vector, projecting high-dimensional document vectors into a low-dimensional feature space.

Calculating the distance between texts is a key step when the traditional k-means algorithm is used for text clustering, but texts are usually unstructured data, and the constructed text vectors are sparse and high-dimensional. Moreover, constructing text feature vectors from keywords ignores the semantic relationships between texts, so texts within the same cluster may turn out not to be similar.

Therefore, this article improves the k-means algorithm using the LDA topic model. First, the document dataset is modeled with the LDA topic model and each document's topic probability distribution is mined, which both reduces dimensionality and removes noise, and compensates for the information loss of keyword-based document feature vectors. The topic probability distribution of each document is then used as the input dataset for the k-means algorithm.

3. Experimental Analysis

3.1 Implementation of the LDA Topic Model Algorithm on Spark

Dataset introduction

20 Newsgroups is a news dataset consisting of about 20,000 news documents, divided into 6 broad categories, each subdivided into smaller categories, 20 in all, as shown in the following table. It has become a standard dataset for machine learning text mining experiments such as text classification and text clustering.

The dataset contains 7 files: 3 training files (train.data, train.label, train.map) with 11,269 documents in total, 3 test files (test.data, test.label, test.map) with 7,505 documents in total, and a vocabulary file (vocabulary.txt), whose line i gives the word numbered i. Files with the .data extension use the format [docIdx wordIdx count], where docIdx is the document number, wordIdx is the word number, and count is that word's frequency. Files with the .label extension give the topic class of each document, one document's category per line. Files with the .map extension map category numbers to category names in the format [labelName labelId].

Original data set processing

The original dataset format is [docIdx wordIdx count]. For example, [1 20 2] means that in the document numbered 1, the word numbered 20 has frequency 2. LDA accepts parameters in the following format:

[label, (vector_size, [wiIdx, wjIdx, ..., wnIdx], [tfi, tfj, ..., tfn])]
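For example, one record in this format can be built directly with Spark's ml Vectors factory (a hedged illustration; the index and frequency values are made up):

import org.apache.spark.ml.linalg.Vectors

// document 1 over a 53976-word vocabulary: word 20 occurs twice, word 37 once
val doc = (1L, Vectors.sparse(53976, Array(20, 37), Array(2.0, 1.0)))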

Data in this format is a labeled sparse vector, where label is the document number, vector_size is the dimension of the vector, wnIdx is the index of word n, and tfn is the term frequency of word n. The original data must be converted into this format; the specific steps are as follows:

Step1: upload the original dataset to HDFS

[kms@kms-1 ~]$ hdfs dfs -put /opt/modules/train_data/lda/train.data /train/lda/

Step2: create a SparkSession and load the dataset

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName(s"${this.getClass.getSimpleName}")
  .getOrCreate()

// set log level
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
Logger.getLogger("org.apache.hadoop").setLevel(Level.OFF)

// load the original dataset
val rowDS = spark.read.textFile("/train/lda/train.data")

Step3: dataset matrix transformation

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// create MatrixEntry records in the form MatrixEntry(row_index, column_index, value)
val matrixEntry: RDD[MatrixEntry] = rowDS.rdd.map(_.split(" "))
  .map(rowdata => MatrixEntry(rowdata(0).toLong, rowdata(1).toLong, rowdata(2).toDouble))

// create the sparse matrix
val sparseMatrix: CoordinateMatrix = new CoordinateMatrix(matrixEntry)

// create a labeled dataset of (document index, feature vector)
val labelPointData = sparseMatrix.toIndexedRowMatrix.rows.map(r => (r.index, r.vector.asML))
val corpusDF = spark.createDataFrame(labelPointData).toDF("label", "features")
corpusDF.rdd.saveAsTextFile("/train/lda/labelPointData")

Part of the processed dataset is shown below; each line represents the feature vector of one document.

[4551, (53976, [23, 27, 29, 30, 44, ...], [2.0, 1.0, 1.0, 3.0, ...])]

[2493, (53976, [80, 133, 754, 5190, 6367, 7361, ...], [1.0, 1.0, 1.0, 1.0, ...])]

K-fold cross-validation to determine training parameters

Cross-validation (cross validation) divides the dataset D into k mutually exclusive subsets of similar size, each preserving the overall data distribution as far as possible, i.e., obtained from D by stratified sampling. Each round then uses the union of k-1 subsets as the training set and the remaining subset as the test set. This yields k train/test pairs, allowing k rounds of training and testing, and the final result is the mean of the k test results. The stability and fidelity of the evaluation depend to a great extent on the value of k, so the method is usually called k-fold cross-validation (k-fold cross validation). K is commonly set to 10, giving 10-fold cross-validation. The following figure shows a schematic diagram of 10-fold cross-validation.

Perplexity, proposed by Blei, the original author of the LDA model, is an indicator of a model's generalization ability and is a representative, widely used measure of model quality. It reflects how confidently the model assigns topics to a document, i.e., how well the model fits new samples: the lower the perplexity, the better the model's generalization ability.
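For reference, Blei defines the perplexity of a held-out test set D of M documents as

perplexity(D) = exp( - Σ_{d=1}^{M} log p(w_d) / Σ_{d=1}^{M} N_d )

where N_d is the number of words in document d; a lower value means the model assigns higher likelihood to unseen documents.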

The 10-fold cross-validation procedure is as follows:

// divide the dataset into 10 parts, each accounting for 10%
val splitData = labelPointData.randomSplit(Array.fill(10)(0.1))

// set the number of topics to 15-25
val rangTopic = 15 to 25
rangTopic.foreach { k =>
  var perplexity = 0.0
  for (i <- 0 until 10) {
    // the original loop body is lost here; this reconstruction assumes the
    // ml LDA API: train on nine folds, accumulate perplexity on the held-out fold
    val testDF = spark.createDataFrame(splitData(i)).toDF("label", "features")
    val trainDF = spark.createDataFrame(
      spark.sparkContext.union(splitData.indices.filter(_ != i).map(splitData(_)))
    ).toDF("label", "features")
    val ldaModel = new org.apache.spark.ml.clustering.LDA().setK(k).fit(trainDF)
    perplexity += ldaModel.logPerplexity(testDF)
  }
  println(s"k = $k, average perplexity = ${perplexity / 10}")
}

// count word frequencies for the word cloud; the vocabulary RDD is assumed
// to be loaded from the dataset's vocabulary.txt
val vocabulary = spark.sparkContext.textFile("/train/lda/vocabulary.txt")
val word_nums = rowDS.rdd.map(_.split(" ")).map(row => (row(1).toLong, row(2).toInt))
val word_count = word_nums.reduceByKey(_ + _)
val sort_wcnt = word_count.sortByKey(false)

// process the glossary: key the vocabulary by word number
val num_vocabulary = vocabulary.zipWithIndex().map(row => (row._2, row._1))
val sort_combination = sort_wcnt.join(num_vocabulary)
  .map(row => (row._2._1, row._2._2))
  .sortByKey(false)
  .map(row => (row._2, row._1))
sort_combination.saveAsTextFile("e:///combination")

The document word cloud is visualized with the wordcloud2 package in R, as shown in the following figure.

3.2 LDA-Based K-means on the Spark Platform

Data preprocessing

Because k-means accepts feature vector input in the form [label, features], the schema of the original dataset must be converted to [label, features], i.e., the topicDistribution column is renamed to features. The processing step is as follows:

val trainDF = transformed.select("label", "topicDistribution").toDF("label", "features")
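The transformed DataFrame above is the output of applying a trained LDA model to the corpus. A hedged sketch of how it might be produced (the topic count and iteration count are illustrative assumptions, chosen from the cross-validation range above):

import org.apache.spark.ml.clustering.LDA

// fit LDA on the [label, features] corpus; transform appends a "topicDistribution" column
val ldaModel = new LDA().setK(20).setMaxIter(50).fit(corpusDF)
val transformed = ldaModel.transform(corpusDF)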

Model training

Spark ML's K-means algorithm provides the following parameter configuration:

setFeaturesCol(value: String): sets the input feature-vector column; default is "features"
setK(value: Int): sets the number of clusters
setMaxIter(value: Int): sets the maximum number of iterations
setPredictionCol(value: String): sets the output column name; default is "prediction"
setSeed(value: Long): sets the random number seed
setTol(value: Double): sets the convergence threshold

The maximum number of iterations is set to 200, the random seed to 123, and the number of clusters to 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 in turn, with the other parameters left at their defaults, in order to observe how the evaluation metrics change. The specific code is as follows:

(2 to 20 by 2).toList.map { k =>
  // kMeansTrain denotes the [label, features] training DataFrame prepared above
  val kmeans = new KMeans().setK(k).setSeed(123).setMaxIter(200)
  val kmeansModel = kmeans.fit(kMeansTrain)

  // prediction results
  val predictions = kmeansModel.transform(kMeansTrain)

  // compute the within-cluster sum of squared errors
  val wssse = kmeansModel.computeCost(kMeansTrain)
  println(s"Within set sum of squared errors = $wssse")

  // compute the silhouette coefficient
  val evaluator = new ClusteringEvaluator()
  val silhouette = evaluator.evaluate(predictions)
  println(s"Silhouette with squared euclidean distance = $silhouette")

  // display the clustering results
  println("Cluster Centers:")
  kmeansModel.clusterCenters.foreach(println)
}

Training shows that the clustering effect is best when K is 6; the clustering results are shown in the following table:

Model evaluation

Silhouette coefficient

The silhouette coefficient (Silhouette Coefficient) is a method of evaluating clustering quality that measures the cohesion and separation of clusters. Because it combines both factors, it can be used to compare the effect of different algorithms, or different runs of the same algorithm, on the same underlying data. The silhouette coefficient takes values in [-1, 1]: the larger the value, the closer each sample is to the samples of its own cluster and the farther from the samples of other clusters; that is, the closer the value is to 1, the more compact the clusters and the better the clustering effect.

Using the k-means algorithm, the dataset is divided into K clusters. For one of the points i, a(i) denotes the average distance from i to the other points in its own cluster, and b(i) denotes the smallest average distance from i to the points of any other cluster. For a sample set, the overall silhouette coefficient of the clustering result is the mean of the silhouettes of all samples.
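Combining the two quantities gives the silhouette of a single point, stated here for completeness:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

A value close to 1 means a(i) is much smaller than b(i), i.e., the point lies well inside its own cluster.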

Sum of squared errors

The sum of squared errors is also known as the residual sum of squares or the within-group sum of squares. After fitting a model to n observations, the unfitted remainder e_i = y_i - ȳ is called a residual, where ȳ denotes the mean of the n observations; the sum of the squares of all n residuals is the sum of squared errors.
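In symbols, with ȳ the mean of the n observations and e_i = y_i - ȳ the i-th residual:

SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i - ȳ)²

In the clustering setting the same idea is applied per cluster, with each cluster's center in place of ȳ; this is the within-set sum of squared errors reported above.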

The sum of squared errors computed for different values of K is shown in the figure below.

The computed silhouette coefficients are shown in the following figure. Combining the sum of squared errors with the silhouette coefficient, the clustering effect is best when K = 6.

3.3 Result Analysis

First, the LDA topic model computes the document-topic distribution and topic-word distribution of the document dataset; the number of topics used in training serves as the number of clusters.

For the document-topic distributions produced by LDA training, each document is expressed as a vector of its distribution over the different topics. Because LDA takes the semantic relationships between words into account, this feature vector reflects the document's information better, so it can serve as the input to the k-means clustering algorithm and thereby remedy the shortcomings of the space-vector-model-based k-means algorithm. The experimental results show that with K = 6 clusters, the silhouette coefficient is 0.8266340036962969 and the sum of squared errors is 65.9661577458792, a good clustering effect.

This article has introduced the implementation of an LDA-based k-means algorithm on the Spark platform: the text mining module was designed in detail, the LDA model was trained on an open dataset, the document-topic and topic-word distributions were described, and finally an LDA-based k-means clustering algorithm was implemented that overcomes the shortcomings of the traditional k-means algorithm. I hope the above content is of some help; if you still have questions, you can follow the industry information channel to learn more.
