How do you count word frequency with Spark? This article works through that question step by step and shares a simple, practical solution for anyone facing the same task.
Install Spark
Download Spark 1.5.2 Pre-Built for Hadoop 2.6 from http://spark.apache.org/downloads.html. Java and Scala need to be installed beforehand.
Place the Spark files under /opt/spark-hadoop. Running ./spark-shell opens a Scala prompt, and running ./python/pyspark opens a Python prompt; if both come up, the installation succeeded.
Copy the pyspark package from Spark's python directory into the Python installation directory /usr/local/lib/python2.7/dist-packages; otherwise the pyspark library cannot be imported in your programs.
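If you would rather not copy files into dist-packages, the same effect can be had by pointing Python at Spark's bundled library at runtime. A minimal sketch, assuming the /opt/spark-hadoop layout above (the exact name of the bundled py4j zip varies between Spark releases):

import glob
import os
import sys

# Point Python at the pyspark package shipped inside the Spark distribution.
os.environ["SPARK_HOME"] = "/opt/spark-hadoop"
sys.path.insert(0, "/opt/spark-hadoop/python")
# pyspark depends on py4j, which Spark bundles as a zip under python/lib.
sys.path.extend(glob.glob("/opt/spark-hadoop/python/lib/py4j-*.zip"))

from pyspark import SparkContext  # now importable without copying into dist-packages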
Test

#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyspark import SparkConf, SparkContext
import os

os.environ["SPARK_HOME"] = "/opt/spark-hadoop"

APP_NAME = "TopKeyword"

if __name__ == "__main__":
    logFile = "./README.md"
    sc = SparkContext("local", "Simple App")
    logData = sc.textFile(logFile).cache()
    numAs = logData.filter(lambda s: 'a' in s).count()
    numBs = logData.filter(lambda s: 'b' in s).count()
    print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
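The script imports SparkConf but never uses it; the same context can also be built from a SparkConf object, which is the usual place to set the master URL and application name. A minimal sketch of that equivalent setup (not required for the test above):

from pyspark import SparkConf, SparkContext

# Equivalent to SparkContext("local", "Simple App"): the master and app name
# are set on a SparkConf and the context is created from it.
conf = SparkConf().setMaster("local").setAppName("Simple App")
sc = SparkContext(conf=conf)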
Running it prints the result:
Lines with a: 3, lines with b: 2

Word frequency count

#!/usr/bin/python
# -*- coding: utf-8 -*-
from pyspark import SparkConf, SparkContext
import os
import sys
import jieba  # used for Chinese word segmentation

reload(sys)
sys.setdefaultencoding("utf-8")

os.environ["SPARK_HOME"] = "/opt/spark-hadoop"

def divide_word():
    # Segment each question title with jieba and write the words out, space-separated
    word_txt = open('question_word.txt', 'a')
    with open('question_title.txt', 'r') as question_txt:
        question = question_txt.readline()
        while question:
            seg_list = jieba.cut(question, cut_all=False)
            line = " ".join(seg_list)
            word_txt.write(line)
            question = question_txt.readline()
    word_txt.close()

def word_count():
    sc = SparkContext("local", "WordCount")
    text_file = sc.textFile("./question_word.txt").cache()
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("./wordcount_result.txt")

if __name__ == "__main__":
    word_count()

Here divide_word segments the question titles with jieba before Spark runs, while word_count splits each line on spaces, maps every word to a (word, 1) pair, sums the pairs with reduceByKey, and saves the result.

That is how word frequency counting is done under Spark; a short sketch of sorting the counts to list the top keywords follows below. I hope the content above helps, and if you still have questions, you can follow the industry information channel to learn more.
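Since the test script's application name was TopKeyword, a natural follow-up is to list the most frequent words rather than only saving the raw counts. A minimal sketch, assuming these lines are added at the end of word_count() after counts is built (takeOrdered is a standard RDD method; the cutoff of 10 is arbitrary):

# Pull the ten most frequent (word, count) pairs back to the driver,
# ordered by descending count, and print them.
top10 = counts.takeOrdered(10, key=lambda pair: -pair[1])
for word, count in top10:
    print("%s\t%d" % (word, count))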