Shulou (Shulou.com) · SLTechnology News & Howtos · Internet Technology · Updated 2025-01-19
This article introduces how to use the CountVectorizer API of Spark MLlib, demonstrated through a practical example.
CountVectorizer
CountVectorizer and CountVectorizerModel convert collections of text documents into term-frequency vectors. When no a-priori dictionary is available, CountVectorizer can be used as an Estimator that extracts the vocabulary and produces a CountVectorizerModel. The model generates sparse vectors for the documents over that vocabulary, which can be passed to other algorithms, such as LDA, for further processing.
During fitting, CountVectorizer counts term frequencies across the entire document collection and keeps the vocabSize most frequent terms.
An optional parameter, minDF, also affects the fitting process: a term enters the vocabulary only if it appears in at least minDF documents (an integer count, or a fraction of the documents if the value is less than 1.0). Another optional boolean parameter, binary, controls the output vectors: if set to true, all non-zero counts are set to 1.0. This is especially useful for discrete probabilistic models that work with binary rather than integer counts.
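To make the fitting rules concrete, here is a minimal plain-Python sketch of the vocabulary-building step (it is an illustration of the logic, not Spark code; Spark's tie-breaking order for equally frequent terms is an implementation detail, and this sketch assumes alphabetical tie-breaking):

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    """Mimic CountVectorizer.fit: keep the vocab_size most frequent terms
    that appear in at least min_df documents."""
    doc_freq = Counter()   # how many documents each term occurs in
    term_freq = Counter()  # total occurrences across the corpus
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    # filter by minDF, then rank by corpus-wide term frequency
    candidates = [t for t in term_freq if doc_freq[t] >= min_df]
    candidates.sort(key=lambda t: (-term_freq[t], t))
    return candidates[:vocab_size]

docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
print(fit_vocabulary(docs, vocab_size=3, min_df=2))  # → ['a', 'b', 'c']
```

Note how raising min_df would silently drop rare terms from the dictionary, which is the main lever for controlling vocabulary noise.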
An example is given to illustrate the algorithm.
Suppose we have a DataFrame with two columns: id and texts.
id    texts
0     Array("a", "b", "c")
1     Array("a", "b", "b", "c", "a")
Each row of texts is a document of type Array[String]. Calling CountVectorizer produces the dictionary (a, b, c) and a CountVectorizerModel. The transformed output column "vector" then contains:
id    texts                            vector
0     Array("a", "b", "c")             (3, [0, 1, 2], [1.0, 1.0, 1.0])
1     Array("a", "b", "b", "c", "a")   (3, [0, 1, 2], [2.0, 2.0, 1.0])
The words in the two documents are de-duplicated to form a dictionary. There are three words in this dictionary, a, b and c, indexed as 0, 1 and 2.
The document vector in the third column combines the dictionary-based index vector with the term-frequency values at those indices.
The document vector is a sparse representation; with only three words in the example the benefit may not be obvious, but in real applications the dictionary holds tens of thousands of terms while an article contains only hundreds or thousands of words, so the frequency at most index positions is 0.
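The transform step that produces these (size, indices, values) triples can be sketched in plain Python (again an illustration of the logic, not Spark's implementation; the triple mirrors the printed form of Spark's SparseVector):

```python
def transform(doc, vocab, binary=False):
    """Mimic CountVectorizerModel.transform for one document: return a
    sparse (size, indices, values) triple like Spark's SparseVector."""
    index = {term: i for i, term in enumerate(vocab)}
    counts = {}
    for term in doc:
        if term in index:  # terms outside the dictionary are ignored
            counts[index[term]] = counts.get(index[term], 0) + 1
    indices = sorted(counts)
    values = [1.0 if binary else float(counts[i]) for i in indices]
    return (len(vocab), indices, values)

vocab = ["a", "b", "c"]
print(transform(["a", "b", "c"], vocab))            # → (3, [0, 1, 2], [1.0, 1.0, 1.0])
print(transform(["a", "b", "b", "c", "a"], vocab))  # → (3, [0, 1, 2], [2.0, 2.0, 1.0])
```

The two printed triples match the vector column of the table above; only the non-zero positions are stored.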
Example code in Spark
Import the packages:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

Prepare the data:

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (2, Array("a", "b", "c", "c", "a"))
)).toDF("id", "words")

Fit a CountVectorizerModel from the full text set (the dictionary is computed automatically):

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

View the result:

cvModel.transform(df).show(false)
Specify an a-priori dictionary:

val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

Recreate the same data set as above:

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (2, Array("a", "b", "c", "c", "a"))
)).toDF("id", "words")

View the result:

cvm.transform(df).show(false)
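The binary switch mentioned earlier was not exercised in the Spark snippets above. As a hedged plain-Python sketch of what setBinary(true) does, the clipping of non-zero counts looks like this (illustration only, not Spark's implementation):

```python
def binary_transform(doc, vocab):
    """With setBinary(true), Spark clips every non-zero count to 1.0;
    this reproduces that clipping for one document."""
    index = {t: i for i, t in enumerate(vocab)}
    present = sorted({index[t] for t in doc if t in index})
    return (len(vocab), present, [1.0] * len(present))

print(binary_transform(["a", "b", "b", "c", "a"], ["a", "b", "c"]))
# → (3, [0, 1, 2], [1.0, 1.0, 1.0])
```

Even though "a" and "b" occur more than once, every value is clipped to 1.0, which is what binary event models such as Bernoulli naive Bayes expect.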
This concludes the introduction to the CountVectorizer API of Spark MLlib. Thank you for reading.