How to implement TF-IDF with Spark MLlib

2025-02-23 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report

This article explains how to implement TF-IDF with Spark MLlib. It is a practical, hands-on example, shared here as a reference; follow along to see how it works.

The running code is as follows:

```scala
package spark.FeatureExtractionAndTransformation

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * TF-IDF is a simple text feature extraction algorithm.
 * Term Frequency (TF): the number of times a keyword appears in a document.
 * Inverse Document Frequency (IDF): inversely proportional to how common a word is.
 * TF = occurrences of the word in the document / total words in the document
 * IDF = log(total number of documents / (number of documents containing the word + 1))
 * TF-IDF = TF (term frequency) x IDF (inverse document frequency)
 * Stop-word removal (auxiliary words such as adverbs and prepositions) and word
 * segmentation are not handled here ("data mining" and "data structure" are both
 * split into single words: "data", "mining" / "data", "structure"), so two
 * completely different texts can score 50% similarity -- a serious error.
 * Created by eric on 16-7-24.
 */
object TF_IDF {
  val conf = new SparkConf()
    .setMaster("local")    // run locally
    .setAppName("TF_IDF")  // set the application name
  val sc = new SparkContext(conf)  // create the Spark context

  def main(args: Array[String]) {
    // Each line of a.txt becomes one document: a sequence of space-separated terms
    val documents = sc.textFile("/home/eric/IdeaProjects/wordCount/src/main/spark/FeatureExtractionAndTransformation/a.txt")
      .map(_.split(" ").toSeq)
    val hashingTF = new HashingTF()                  // create a TF computing instance
    val tf = hashingTF.transform(documents).cache()  // compute the document TF vectors
    val idf = new IDF().fit(tf)                      // fit an IDF model on the corpus
    val tf_idf = idf.transform(tf)                   // scale TF by IDF
    tf_idf.foreach(println)
  }
}
```

The sample input a.txt is a small file of short lines built from the words hello, mllib, spark, and goodBye. The program prints one sparse vector per document, of the form (1048576, [indices], [tf-idf values]), where 1048576 = 2^20 is HashingTF's default feature dimension.
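As a cross-check of the TF and IDF formulas above, here is a minimal pure-Python sketch (Python rather than Scala only so it runs without a Spark installation). The function names `tf`, `idf`, `tf_idf`, and `hashing_tf` and the sample corpus are illustrative, not Spark's API; note that this follows the article's formula `log(N / (df + 1))`, while Spark's `IDF` actually uses `log((N + 1) / (df + 1))`, and Spark uses its own hash function rather than Python's `hash()`.

```python
import math
from collections import Counter

NUM_FEATURES = 2 ** 20  # Spark HashingTF's default dimension, 1048576

def tf(term, doc):
    # TF = occurrences of the term in the document / total words in the document
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # IDF = log(total documents / (documents containing the term + 1)),
    # as stated in the article (Spark uses log((N + 1) / (df + 1)))
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (df + 1))

def tf_idf(term, doc, corpus):
    # TF-IDF = TF x IDF
    return tf(term, doc) * idf(term, corpus)

def hashing_tf(doc, num_features=NUM_FEATURES):
    # The hashing trick behind HashingTF: map each term to a column index by
    # hashing modulo the feature dimension, accumulating raw counts in a
    # sparse {index: count} dict (Python's hash() is only illustrative here)
    vec = {}
    for term in doc:
        i = hash(term) % num_features
        vec[i] = vec.get(i, 0) + 1
    return vec

corpus = [
    "hello mllib".split(),
    "goodBye spark".split(),
    "hello spark".split(),
    "goodBye spark".split(),
]
# "hello" occurs in 2 of the 4 documents, so idf = log(4 / 3) ~ 0.288,
# and each two-word document gives tf = 1/2
print(tf_idf("hello", corpus[2], corpus))  # 0.5 * log(4/3) ~ 0.1438
print(hashing_tf(corpus[1]))
```

This also shows why common words are down-weighted: "spark" appears in 3 of the 4 documents, so its IDF is log(4/4) = 0 and its TF-IDF vanishes, exactly the behavior the inverse document frequency is designed to produce.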

Thank you for reading! This concludes the article on how to implement TF-IDF with Spark MLlib. I hope it has been helpful; if you found it useful, please share it so more people can see it.
