This article gives a detailed, example-based walkthrough of text classification with Spark MLlib. It is intended as a practical reference; I hope you get something useful out of it.
Text Classification Based on Spark MLlib
Text classification is a typical machine learning problem. Its goal is to train a classification model on an existing corpus of labeled text and then use that model to predict the category labels of new text. It has practical applications in many fields, such as automatic news categorization on news websites, spam detection, and illegal-content filtering. This article trains a classifier on a dataset of mobile phone SMS messages and then uses it to classify new samples, i.e. to detect whether a message is spam. The basic steps are: first, convert each text sentence into a word array; next, use the Word2Vec tool to turn the word array into a K-dimensional vector; finally, train a feedforward neural network model on the K-dimensional vector samples, yielding a predictor of text category labels. The implementation in this article uses the word vectorization tool Word2Vec and the multilayer perceptron classifier (MultilayerPerceptronClassifier) from Spark ML.
Introduction to Word2Vec
Word2Vec is a tool for representing words as numeric vectors. Its basic idea is to map each word in a text to a K-dimensional numeric vector (K is usually treated as a hyperparameter of the algorithm), so that all the words in the text span a K-dimensional vector space. We can then measure the semantic similarity between pieces of text by computing the Euclidean distance or cosine similarity between their vectors. Word2Vec uses a distributed representation of word vectors. Compared with a one-hot representation, this not only keeps the vector dimension under control and avoids the curse of dimensionality, but also ensures that words with similar meanings lie closer together in the vector space.
Word2Vec comes in two model architectures: CBOW (Continuous Bag of Words) and Skip-Gram. In short, CBOW predicts the target word from its context, while Skip-Gram predicts the context from the current word. Spark's implementation uses the Skip-Gram model. Suppose the training corpus is a sequence of N words w_1, w_2, ..., w_N. The training objective of the Skip-Gram model is then to maximize the average log-likelihood

\frac{1}{N} \sum_{n=1}^{N} \sum_{-K \le j \le K,\ j \ne 0} \log p(w_{n+j} \mid w_n)

where N is the number of words and K is the size of the context window. The Skip-Gram model computes the probabilities between words inside a context window; in general, the larger the window, the more word combinations it covers and the more accurate the results, but the longer the training takes.
In the Skip-Gram model, each word is associated with two vectors: a word vector and a context vector. Because of this, Word2Vec can express richer and more precise semantic information than the traditional LDA (Latent Dirichlet Allocation) approach.
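As a concrete illustration of the similarity computation mentioned above, here is a minimal Scala sketch that computes the cosine similarity between two word vectors. The vectors are made-up 3-dimensional examples, not the output of a real Word2Vec model:

// Minimal sketch: cosine similarity between two (hypothetical) word vectors.
// Real Word2Vec vectors would have vectorSize dimensions, e.g. 100.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}
val vecKing = Array(0.8, 0.1, 0.3)   // hypothetical vector for "king"
val vecQueen = Array(0.7, 0.2, 0.3)  // hypothetical vector for "queen"
println(cosineSimilarity(vecKing, vecQueen))  // a value close to 1.0 means semantically similar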
The Word2Vec implementation of Spark provides the following main tunable parameters:
inputCol: the name of the column in the source DataFrame that holds the word arrays.
outputCol: the name of the column that receives the resulting numeric feature vectors.
vectorSize: the dimension of the target numeric vectors. The default is 100.
windowSize: the context window size. The default is 5.
numPartitions: the number of partitions used for the training data. The default is 1.
maxIter: the maximum number of iterations of the algorithm; should be less than or equal to the number of partitions. The default is 1.
minCount: a word is included in the vocabulary only if it appears at least minCount times; otherwise it is ignored. The default is 5.
stepSize: the learning rate for each iteration of the optimization. The default is 0.025.
These parameters can be set through the corresponding setXXX methods when constructing a Word2Vec instance, as sketched below.
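Here is a minimal sketch of how these setters are used. The column names and values are illustrative only, not those of the case study that follows:

import org.apache.spark.ml.feature.Word2Vec
// Minimal sketch: configuring Word2Vec via the setXXX methods described above.
val w2v = new Word2Vec()
  .setInputCol("words")     // inputCol: column holding the word arrays
  .setOutputCol("vectors")  // outputCol: column receiving the feature vectors
  .setVectorSize(100)       // vectorSize: dimension of the word vectors
  .setWindowSize(5)         // windowSize: context window size
  .setNumPartitions(1)      // numPartitions: partitions for training
  .setMaxIter(1)            // maxIter: at most the number of partitions
  .setMinCount(5)           // minCount: ignore words that appear fewer times
  .setStepSize(0.025)       // stepSize: per-iteration learning rate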
Multilayer perceptron
A multilayer perceptron (MLP) is a multi-layer feedforward neural network. "Feedforward" means that each layer only receives input from the previous layer and passes its output on to the next layer; no output is fed back to earlier layers, so the whole computation can be represented as a directed acyclic graph. Such a network consists of an input layer (Input Layer), one or more hidden layers (Hidden Layer), and an output layer (Output Layer).
Since version 1.5, Spark ML has provided a multilayer perceptron trained with the backpropagation (BP, Back Propagation) algorithm. The goal of BP training is to adjust the connection weights of the network so that, for any input, the network produces output close to the desired output. The "back propagation" in the name refers to the way the algorithm propagates the error backwards layer by layer during training, adjusting the connection weights between neurons one by one, so that the network's output converges to within the desired error. In Spark's multilayer perceptron, the hidden-layer neurons use the sigmoid function as the activation function, and the output layer uses the softmax function; both are given below for reference.
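The standard definitions of these two activation functions are (general formulas, not Spark-specific notation):

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}, \quad i = 1, \dots, C

where C is the number of output neurons, i.e. the number of classes.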
Spark's multilayer perceptron classifier (MultilayerPerceptronClassifier) supports the following tunable parameters:
featuresCol: the name of the feature-vector column in the input DataFrame.
labelCol: the name of the label column in the input DataFrame.
layers: an integer array. The first element must equal the dimension of the feature vectors, and the last element must equal the number of distinct label values in the training data (e.g. 2 for a binary classification problem). The elements in between describe the hidden layers: their count is the number of hidden layers and each value is the number of neurons in that layer. For example, val layers = Array[Int](100, 6, 5, 5, 2) describes a network with a 100-neuron input layer, three hidden layers of 6, 5, and 5 neurons, and a 2-neuron output layer.
maxIter: the maximum number of iterations of the optimization algorithm. The default is 100.
predictionCol: the name of the column that receives the prediction results.
tol: the convergence threshold of the iterative optimization. Cannot be negative. The default is 1e-4.
blockSize: the feedforward trainer groups the samples within each partition into blocks of blockSize rows, and the samples in each block are stacked into matrices to speed up the computation passed between the optimization routines. The recommended range is 10-1000; the default is 128.
Training returns an instance of the MultilayerPerceptronClassificationModel class, which can then transform new data, as sketched below.
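Here is a minimal standalone sketch of these parameters in use, independent of the pipeline built later. The DataFrame train, with a "features" vector column and a numeric "label" column, is an assumed input:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
// Minimal sketch: a binary classifier over 100-dimensional features
// with two hidden layers of 6 and 5 neurons. `train` is assumed to exist.
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array[Int](100, 6, 5, 2))  // input, hidden layers, output
  .setBlockSize(128)                    // default block size
  .setMaxIter(100)                      // default iteration cap
  .setTol(1e-4)                         // default convergence threshold
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
val model = trainer.fit(train)            // a MultilayerPerceptronClassificationModel
val predictions = model.transform(train)  // adds the "prediction" column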
Target dataset preview
As mentioned in the introduction, the main task of this article is to train a multilayer perceptron classification model that predicts whether a new SMS message is spam. The target dataset is the SMS Spam Collection dataset from UCI. Its structure is very simple: each line has two columns separated by a tab, where the first column is the label of the message and the second column is the message text. Although the UCI dataset is free to use, note that its copyright belongs to UCI and its original contributors.
Dataset download link: http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
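For illustration, lines in the file look like the following. These two examples are made up to show the tab-separated format; they are not verbatim records from the dataset:

ham	Ok, see you at the station at 5pm then.
spam	WINNER! You have been selected for a free prize. Call now!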
Case study and realization
To classify and predict the SMS text messages, the author first splits the original text data into training and test sets in an 8:2 ratio. The whole process consists of the following steps:
Read the original dataset locally and create a DataFrame.
Use StringIndexer to convert the original text labels ("ham" or "spam") into numeric labels that Spark ML can process.
Use Word2Vec to convert text messages into numeric word vectors.
Use MultilayerPerceptronClassifier to train a multilayer perceptron model.
Use IndexToString (the labelConverter stage) to convert the numeric prediction back into the original text label.
Finally, the prediction accuracy of the model is tested on the test data set.
The specific implementation of the algorithm is as follows:
1. Import the required packages
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, Word2Vec}
2. Load the dataset and split the messages into words
val parsedRDD = sc.textFile("file:///opt/datas/SMSSpamCollection").map(_.split("\t")).map(eachRow => {
  // each line is "label<TAB>message"; split the message text into words
  (eachRow(0), eachRow(1).split(" "))
})
val msgDF = spark.createDataFrame(parsedRDD).toDF("label", "message")
3. Convert the text labels to index values
val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(msgDF)
4. Create a Word2Vec instance with a word vector size of 100
final val VECTOR_SIZE = 100
val word2Vec = new Word2Vec().setInputCol("message").setOutputCol("features").setVectorSize(VECTOR_SIZE).setMinCount(1)
5. Create the multilayer perceptron
The input layer has VECTOR_SIZE neurons, the two hidden layers have 6 and 5 neurons respectively, and the output layer has 2 neurons.
val layers = Array[Int](VECTOR_SIZE, 6, 5, 2)
val mlpc = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(512).setSeed(1234L).setMaxIter(128).setFeaturesCol("features").setLabelCol("indexedLabel").setPredictionCol("prediction")
6. Convert the predicted indexes back to the original text labels
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
7. Split the dataset
val Array(trainingData, testData) = msgDF.randomSplit(Array(0.8, 0.2))
8. Create the pipeline and train the model
val pipeline = new Pipeline().setStages(Array(labelIndexer, word2Vec, mlpc, labelConverter))
val model = pipeline.fit(trainingData)
val predictionResultDF = model.transform(testData)
// the two lines below are for debugging only
predictionResultDF.printSchema
predictionResultDF.select("message", "label", "predictedLabel").show(30)
9. Evaluate the training results
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")  // the metric was named "precision" before Spark 2.0, "accuracy" since
val predictionAccuracy = evaluator.evaluate(predictionResultDF)
println("Testing Accuracy is %2.4f".format(predictionAccuracy * 100) + "%")
This concludes the example-based walkthrough of text classification with Spark MLlib. I hope the material above helps you in your own work, and that it is worth sharing with others.