
What is DataFrame-based Tokenizer word segmentation


This article introduces DataFrame-based Tokenizer word segmentation in detail; interested readers can use it as a reference and will hopefully find it helpful.

Tokenizer word segmentation

Word segmentation of the sentences in a text is the first step of text analysis. Spark's machine learning libraries are divided into RDD-based and DataFrame-based APIs; since the RDD-based library has been in maintenance mode since Spark 2.0, the word segmentation discussed here is based on Spark's DataFrame API. We mainly explain the use of two classes, Tokenizer and RegexTokenizer.

1 Prepare the data first

Import the packages

import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.sql.functions._

Prepare the sample data

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
)).toDF("id", "sentence")
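These snippets assume an existing SparkSession named spark, as provided automatically in spark-shell. In a standalone application you would create one yourself; a minimal sketch (the app name and master here are placeholders):

import org.apache.spark.sql.SparkSession

// Minimal standalone setup; in spark-shell the session already exists as `spark`
val spark = SparkSession.builder()
  .appName("TokenizerExample") // hypothetical app name
  .master("local[*]")          // run locally with all available cores
  .getOrCreate()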

2 Tokenizer

Tokenizer is responsible for reading documents or sentences and breaking them down into words; note that it lowercases the text and splits on whitespace. Declare a variable:

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")

Define a custom function (a UDF) to count the number of words in each row:

val countTokens = udf { (words: Seq[String]) => words.length }
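As a side note, the UDF is not strictly necessary: Spark SQL ships a built-in size function in org.apache.spark.sql.functions that returns the length of an array column, so the expression used below could equivalently be written without a UDF:

// Built-in alternative to the countTokens UDF:
// .withColumn("tokens", size(col("words")))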

Call the conversion function

val tokenized = tokenizer.transform(sentenceDataFrame)

tokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words"))).show(false)
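Given the sample data, the output should look roughly as follows. Because Tokenizer lowercases the text and splits only on whitespace, the comma-separated third sentence stays a single token:

+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic,regression,models,are,neat]     |1     |
+-----------------------------------+------------------------------------------+------+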

3 RegexTokenizer

RegexTokenizer allows documents to be split into groups of tokens using a regular expression. By default, the parameter "pattern" (a regex, default: "\\s+") is used as the delimiter to split the input text. Alternatively, the user can set the parameter "gaps" to false, indicating that the regex "pattern" matches the "tokens" themselves rather than the gaps between them; all matches are then returned as the segmentation result.

val regexTokenizer = new RegexTokenizer().setInputCol("sentence").setOutputCol("words").setPattern("\\W")

// can also be replaced with .setPattern("\\w+").setGaps(false)
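Spelled out, the token-matching variant from the comment above would look like this (a sketch; for this sample data it produces the same result as splitting on "\\W"):

// gaps = false: the pattern matches the tokens themselves, not the delimiters
val regexTokenizerGaps = new RegexTokenizer().setInputCol("sentence").setOutputCol("words").setPattern("\\w+").setGaps(false)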

Start the conversion and view the execution result

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)

regexTokenized.select("sentence", "words").withColumn("tokens", countTokens(col("words"))).show(false)
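This time the output should look roughly as follows; because "\\W" treats the commas as delimiters, the third sentence is now split into five tokens as well:

+-----------------------------------+------------------------------------------+------+
|sentence                           |words                                     |tokens|
+-----------------------------------+------------------------------------------+------+
|Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+-----------------------------------+------------------------------------------+------+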

That is all for this sharing on DataFrame-based Tokenizer word segmentation. I hope the above content is helpful to you and lets you learn more. If you think the article is good, you can share it for more people to see.
