2025-03-31 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/02 Report
This article explains in detail how to implement word frequency statistics in Python. It uses jieba for word segmentation and compares three ways of counting: a native dict, collections.Counter, and pandas.
Data preparation
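import jieba

# Read the novel; the text file is GB18030-encoded
with open("D:/hdfs/novels/The Demi-Gods & Semi-Devils.txt", encoding="gb18030") as f:
    text = f.read()

# Read the character names listed under the novel's title.
# The path of this names file is missing from the source, so "..." is a placeholder.
with open("...", encoding="utf-8") as f:
    for line in f:
        if line.startswith("The Demi-Gods & Semi-Devils"):
            names = next(f).split()
            break

# Register the names with jieba so they are kept as single words
for word in names:
    jieba.add_word(word)

# Load the stop words
with open("stoplist.txt", encoding="utf-8-sig") as f:
    stop_words = f.read().split()
# The rest of this list is truncated in the source
stop_words.extend(['The Demi-Gods & Semi-Devils'])
stop_words = set(stop_words)

# cut_word is not defined in the source; jieba.cut(text) is assumed, as in the later examples
cut_word = jieba.cut(text)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
print(len(all_words), all_words[:20])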
Results:
216435 ['Tianlong', 'Shi Ming', 'Qingyi', 'upright', 'dangerous Peak', 'Xing Yupi', 'Yuehua', 'Ming Ma', 'Jixiang', 'Cliff', 'Gao Yuan', 'Micro step', 'Sheng Jia', 'Zi Zi', 'Jiayuan', 'regret', 'Tiger Xiao', 'Dragon Yin', 'change nests', 'Huan Feng']
Count the top N most frequent words
Counting with hand-written code and a native dict:
wordcount = {}
for word in all_words:
    # dict.get returns 0 for a word that has not been seen yet
    wordcount[word] = wordcount.get(word, 0) + 1
sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:10]
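To see the dict-based approach on something small enough to check by hand, here is a minimal sketch. The `tokens` list is a hypothetical stand-in for `all_words`, and the `heapq.nlargest` variant is an addition that is not in the original article. It is shown only as one possible way to pick the top N without sorting every entry.

```python
import heapq

# Hypothetical toy tokens standing in for all_words from the article
tokens = ["sword", "hero", "sword", "monk", "hero", "sword"]

wordcount = {}
for word in tokens:
    # dict.get returns 0 for a word not seen yet, so no KeyError is raised
    wordcount[word] = wordcount.get(word, 0) + 1

# Full sort, as in the article: every distinct word is sorted, then sliced
top_sorted = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:2]

# Alternative (an assumption, not from the article): keep only the N largest
top_heap = heapq.nlargest(2, wordcount.items(), key=lambda x: x[1])

print(top_sorted)
print(top_heap)
```

Both variants return the same two entries here. Whether `heapq.nlargest` is worth using depends on how many distinct words there are compared with N.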
Use the Counter class for word frequency statistics:
from collections import Counter

wordcount = Counter(all_words)
wordcount.most_common(10)
Results:
Use pandas for word frequency statistics:
import pandas as pd

pd.Series(all_words).value_counts().head(10)
Count word frequency directly during word segmentation
pandas can only count words that have already been segmented into a list, so it is not demonstrated in this part. According to the tests above, Counter counts a ready-made list faster than a native Python dict, but how it performs when it is incremented inside a loop is not yet known, so the test continues below.
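The list-counting comparison can be reproduced outside a notebook with `timeit`. The following is a rough sketch under stated assumptions: the `words` list is synthetic data standing in for `all_words`, and the helper names `count_with_dict` and `count_with_counter` are hypothetical. Timings will differ by machine and Python version, so no particular result is claimed.

```python
import random
import timeit
from collections import Counter

# Hypothetical synthetic data standing in for all_words
random.seed(0)
vocab = [f"w{i}" for i in range(500)]
words = random.choices(vocab, k=50_000)

def count_with_dict(items):
    """Count with a plain dict and dict.get, as in the hand-written version."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

def count_with_counter(items):
    """Count by passing the whole list to Counter."""
    return Counter(items)

# Both approaches must agree before their timings are worth comparing
dict_counts = count_with_dict(words)
counter_counts = count_with_counter(words)
same_result = dict_counts == dict(counter_counts)

# Run each approach 5 times and report the total elapsed seconds
t_dict = timeit.timeit(lambda: count_with_dict(words), number=5)
t_counter = timeit.timeit(lambda: count_with_counter(words), number=5)
print(f"dict loop: {t_dict:.3f}s  Counter: {t_counter:.3f}s  same result: {same_result}")
```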
First, use a native dict to count word frequency directly in the segmentation loop, then sort:
%%time
wordcount = {}
for word in jieba.cut(text):
    if len(word) > 1 and word not in stop_words:
        wordcount[word] = wordcount.get(word, 0) + 1
print(sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:10])
Results:
[('Duan Yu', 2496), ('said', 2151), ('Xu Zhu', 1633), ('Xiao Feng', 1301), ('martial arts', 1095), ('A Zi', 1095), ('A Zhu', 1095), ('Qiao Feng', 2151), ('Wang Yuyan', 877), ('Murong Fu', 871)]
Next, use Counter to count word frequency in the loop and sort:
%%time
wordcount = Counter()
for word in jieba.cut(text):
    if len(word) > 1 and word not in stop_words:
        wordcount[word] += 1
print(wordcount.most_common(10))
Results:
[('Duan Yu', 2496), ('said', 2151), ('Xu Zhu', 1633), ('Xiao Feng', 1301), ('martial arts', 1095), ('A Zi', 1095), ('A Zhu', 1095), ('Qiao Feng', 2151), ('Wang Yuyan', 877), ('Murong Fu', 871)]
Wall time: 6.21 s
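A third variant, which the original does not test, is to hand Counter a filtered generator expression instead of incrementing it inside an explicit loop. The sketch below only illustrates the idea. `fake_cut` is a hypothetical whitespace tokenizer standing in for `jieba.cut`, the sample text and stop words are made up, and no claim is made about how its speed compares with the two loops above.

```python
from collections import Counter

def fake_cut(text):
    """Hypothetical stand-in for jieba.cut: lazily yields whitespace-separated tokens."""
    for token in text.split():
        yield token

# Made-up sample input, not taken from the article
text = "the hero met the monk and the hero left"
stop_words = {"the", "and"}

# Counter consumes the generator expression directly, so filtering and
# counting happen in one pass without an explicit `+= 1` statement
wordcount = Counter(
    word for word in fake_cut(text)
    if len(word) > 1 and word not in stop_words
)
print(wordcount.most_common(2))
```

With the real tokenizer, `fake_cut(text)` would be replaced by `jieba.cut(text)` and the filter condition stays the same.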