How to use the CountVectorizer API of Spark MLlib

2025-07-03 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/01 Report--

This article introduces how to use the CountVectorizer API of Spark MLlib and walks through the steps with a practical example.

CountVectorizer

CountVectorizer and CountVectorizerModel help convert a collection of text documents into vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model produces a sparse representation of each document over the vocabulary, which can then be passed to other algorithms such as LDA.

During fitting, CountVectorizer counts term frequencies across the entire document collection and keeps the top vocabSize words ordered by frequency.
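The selection step can be sketched in plain Scala without Spark. This is only an illustration of the idea, not Spark's implementation: the object VocabSketch and the function fitVocabulary are hypothetical names, and breaking ties alphabetically is an assumption made here to keep the result stable.

```scala
// Hypothetical helper, not part of Spark: keeps the vocabSize most frequent terms.
object VocabSketch {
  def fitVocabulary(docs: Seq[Seq[String]], vocabSize: Int): Array[String] = {
    // Total number of occurrences of each term across all documents.
    val termCounts: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (term, occ) => term -> occ.size }

    termCounts.toSeq
      .sortBy { case (term, count) => (-count, term) } // frequency descending, then term (assumed tie-break)
      .take(vocabSize)
      .map(_._1)
      .toArray
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b", "c"), Seq("a", "b", "b", "c", "a"))
    println(fitVocabulary(docs, 2).mkString(", "))
  }
}
```

With the two example documents and vocabSize = 2, the least frequent word c is dropped because a and b each occur three times and c only twice.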

An optional parameter, minDF, also affects the fitting process by specifying the minimum number of documents (or the minimum fraction of documents, if the value is less than 1.0) in which a term must appear to be included in the vocabulary. Another optional parameter, binary, controls the output vector: if it is set to true, all non-zero counts are set to 1.0. This is especially useful for discrete probabilistic models that model binary counts rather than integer counts.
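The effect of these two parameters can be sketched in plain Scala. In Spark they are set with setMinDF and setBinary on CountVectorizer; the object and function names below are hypothetical and only mirror the rules described above, they are not Spark's own code.

```scala
// Hypothetical helpers illustrating the minDF and binary rules; not Spark's implementation.
object MinDfBinarySketch {
  // Document frequency: the number of documents each term appears in.
  def documentFrequency(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }

  // minDF >= 1.0 is read as an absolute document count,
  // minDF < 1.0 as a fraction of the total number of documents.
  def keepTerm(df: Int, numDocs: Int, minDF: Double): Boolean = {
    val threshold = if (minDF >= 1.0) minDF else minDF * numDocs
    df >= threshold
  }

  // With binary = true, every non-zero count collapses to 1.0.
  def applyBinary(counts: Seq[Double], binary: Boolean): Seq[Double] =
    if (binary) counts.map(c => if (c > 0.0) 1.0 else 0.0) else counts

  def main(args: Array[String]): Unit = {
    val docs = Seq(Seq("a", "b"), Seq("a", "a", "c"))
    val kept = documentFrequency(docs).filter { case (_, df) => keepTerm(df, docs.size, 2.0) }
    println(kept.keys.mkString(", "))
    println(applyBinary(Seq(2.0, 0.0, 1.0), binary = true).mkString(", "))
  }
}
```

In this small sample only a appears in both documents, so a minDF of 2 keeps a alone, while a minDF of 0.5 (half of two documents, that is one document) keeps all three terms.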

An example is given to illustrate the algorithm.

Suppose we have a DataFrame with two columns: id and texts.

 id | texts
----|--------------------------------
 0  | Array("a", "b", "c")
 1  | Array("a", "b", "b", "c", "a")

Each row of texts is a document of type Array[String]. Fitting CountVectorizer on this data produces a CountVectorizerModel with the dictionary (a, b, c). After the transformation, the output column "vector" contains the following:

 id | texts                           | vector
----|---------------------------------|---------------------------
 0  | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
 1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])

The words in the two documents are de-duplicated to form a dictionary. There are three words in this dictionary: a, b and c, which are indexed as 0, 1 and 2.

The document vector in the third column is a sparse vector made up of the dictionary size, the array of dictionary indices of the words present in the document, and the array of word counts at those indices.
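How such a vector is derived from a dictionary and a document can be sketched in plain Scala. The names TransformSketch and toSparse are hypothetical, and the tuple stands in for Spark's own sparse vector type; the sketch assumes that words outside the dictionary are simply skipped.

```scala
// Hypothetical helper that mimics the shape of the output vector; not Spark's implementation.
object TransformSketch {
  // Returns (dictionary size, indices of the words present, counts at those indices).
  def toSparse(vocab: Array[String], doc: Seq[String]): (Int, Array[Int], Array[Double]) = {
    val index: Map[String, Int] = vocab.zipWithIndex.toMap
    val counts: Seq[(Int, Double)] = doc
      .flatMap(index.get)                      // skip words that are not in the dictionary
      .groupBy(identity)
      .map { case (i, occ) => i -> occ.size.toDouble }
      .toSeq
      .sortBy(_._1)                            // indices in ascending order
    (vocab.length, counts.map(_._1).toArray, counts.map(_._2).toArray)
  }

  def main(args: Array[String]): Unit = {
    val (size, indices, values) = toSparse(Array("a", "b", "c"), Seq("a", "b", "b", "c", "a"))
    println(s"($size,[${indices.mkString(",")}],[${values.mkString(",")}])")
  }
}
```

For the second example document and the dictionary (a, b, c), this yields size 3, indices 0, 1, 2 and counts 2.0, 2.0, 1.0, which matches the table above.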

The document vector is a sparse representation; with only three words in the example, the benefit is hard to see. In real workloads the dictionary may hold tens of thousands of words, while a single article contains only hundreds or thousands of them, so the count at most indices is 0 and does not need to be stored.
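To make the saving concrete, the following plain Scala sketch expands a sparse triple into its dense form. The names SparsitySketch and toDense are hypothetical, and the sizes used in main are made-up numbers for illustration only.

```scala
// Hypothetical helper: expands (size, indices, values) into a dense array of counts.
object SparsitySketch {
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    val dense = Array.fill(size)(0.0)          // every position starts at 0.0
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  def main(args: Array[String]): Unit = {
    // Assumed sizes: a 50,000-word dictionary and a document with 3 distinct words.
    val dense = toDense(50000, Array(7, 120, 4999), Array(2.0, 1.0, 5.0))
    println(s"dense entries: ${dense.length}, non-zero entries: ${dense.count(_ != 0.0)}")
  }
}
```

The dense form has to hold one number per dictionary word, whereas the sparse form only holds one index and one count per word that actually occurs in the document.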

Example code in Spark

Import the packages

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

Prepare data

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (2, Array("a", "b", "c", "c", "a"))
)).toDF("id", "words")

Fit a CountVectorizerModel from the full document set (the dictionary is computed automatically)

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

View the result

cvModel.transform(df).show(false)

Specify a dictionary in advance

val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

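Create the data again so that this example stands on its own (redefining df like this works in spark-shell; in a compiled program, reuse the existing df instead)

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (2, Array("a", "b", "c", "c", "a"))
)).toDF("id", "words")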

View the result

cvm.transform(df).show(false)
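The fragments above are written for spark-shell, where the spark session already exists. A minimal sketch of the same steps as a standalone program might look like the following. The object name CountVectorizerApp, the helper names and the local[*] master are assumptions chosen for illustration, and the Spark SQL and MLlib libraries are assumed to be on the classpath.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical standalone wrapper around the spark-shell fragments.
object CountVectorizerApp {

  // The same two documents used in the fragments above.
  def sampleData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (2, Array("a", "b", "c", "c", "a"))
    )).toDF("id", "words")

  // Estimator path: the dictionary is learned from the data.
  def fitModel(df: DataFrame): CountVectorizerModel =
    new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(df)

  // A-priori path: the dictionary is supplied directly.
  def fixedModel(): CountVectorizerModel =
    new CountVectorizerModel(Array("a", "b", "c"))
      .setInputCol("words")
      .setOutputCol("features")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CountVectorizerApp")
      .master("local[*]") // assumption: run locally for the example
      .getOrCreate()
    try {
      val df = sampleData(spark)
      val cvModel = fitModel(df)
      // The learned dictionary; position i is the word behind vector index i.
      println(cvModel.vocabulary.mkString(", "))
      cvModel.transform(df).show(false)
      fixedModel().transform(df).show(false)
    } finally {
      spark.stop()
    }
  }
}
```

Printing cvModel.vocabulary is useful because the order of a learned dictionary depends on word frequency, so it is the way to map vector indices back to words.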
