This article introduces how to use regular expressions in Python to count Chinese characters and English words. It should serve as a useful reference for interested readers; I hope you learn something from it.
Approach
Use the regular expression "(?x) (?: [\w-]+ | [\x80-\xff]{3})" to extract the list of English words and Chinese characters from a UTF-8 encoded document (see the short sketch after this list).
Use a dictionary to record the frequency of each word / Chinese character: if the key already exists, add 1 to its count; otherwise set it to 1.
Sort the dictionary entries by value and print them.
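As a quick illustration of what the pattern extracts, here is a minimal sketch, adapted to Python 3 where the text has to be handled as UTF-8 encoded bytes for the \x80-\xff trick to work (the sample string is made up):

import re

# ASCII words, or one 3-byte UTF-8 sequence, i.e. one Chinese character
pattern = re.compile(rb"(?x) [\w-]+ | [\x80-\xff]{3}")

sample = u"Python 正则 统计 words".encode("utf-8")
tokens = [t.decode("utf-8") for t in pattern.findall(sample)]
print(tokens)   # ['Python', '正', '则', '统', '计', 'words']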
Source code
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# author:   rex
# blog:     http://iregex.org
# filename: counter.py
# created:  Mon Sep 20 21:00:52 2010
# desc:     count Chinese characters / English words in utf-8 files

import sys
import re
from operator import itemgetter

def readfile(f):
    with file(f, "r") as pFile:
        return pFile.read()

def divide(c, regex):
    # the regex below is only valid for utf-8 encoded text
    return regex.findall(c)

def update_dict(di, li):
    for i in li:
        if di.has_key(i):
            di[i] += 1
        else:
            di[i] = 1
    return di

def main():
    # receive file names from the command line
    files = sys.argv[1:]
    # compile the regex only once
    regex = re.compile("(?x) (?: [\w-]+ | [\x80-\xff]{3})")
    dict = {}
    # collect all words from the files
    for f in files:
        words = divide(readfile(f), regex)
        dict = update_dict(dict, words)
    # sort the dictionary by value; dict is now a list of (key, count) pairs
    dict = sorted(dict.items(), key=itemgetter(1), reverse=True)
    # output to standard output
    for i in dict:
        print i[0], i[1]

if __name__ == '__main__':
    main()
Tips
Because files = sys.argv[1:] is used to receive the arguments, running ./counter.py file1 file2 ... counts and prints the word frequencies for the files given on the command line.
You can also customize the program. For example, to keep the delimiters, use:
regex = re.compile("(?x) ([\w-]+ | [\x80-\xff]{3})")
words = [w for w in regex.split(line) if w]
Because the pattern is wrapped in a capturing group, the resulting list contains the words together with the delimiters between them, which is convenient for full-text segmentation later (see the sketch below).
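For example, on a hypothetical sample line (again sketched in Python 3 with the text as UTF-8 bytes), the split keeps both the matches and the separators:

import re

regex = re.compile(rb"(?x) ([\w-]+ | [\x80-\xff]{3})")

line = u"hello 世界 again".encode("utf-8")
parts = [w for w in regex.split(line) if w]
print([p.decode("utf-8") for p in parts])   # ['hello', ' ', '世', '界', ' ', 'again']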
Processing the file line by line, instead of reading the whole file into memory, saves memory when dealing with large files (a sketch follows).
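A minimal line-by-line variant could look roughly like this; this is a sketch in Python 3 rather than the author's original code, and it assumes the input files are UTF-8 encoded:

import re
import sys
from operator import itemgetter

# open in binary mode so the byte-oriented pattern still applies
regex = re.compile(rb"(?x) [\w-]+ | [\x80-\xff]{3}")
counts = {}
for f in sys.argv[1:]:
    with open(f, "rb") as pFile:
        for line in pFile:                      # one line at a time, not the whole file
            for w in regex.findall(line):
                counts[w] = counts.get(w, 0) + 1
for w, n in sorted(counts.items(), key=itemgetter(1), reverse=True):
    print(w.decode("utf-8"), n)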
You can also preprocess the whole file with a regular expression to strip any HTML tags first, e.g. content = re.sub(r"<[^>]+>", "", content), which gives more accurate counts for some documents; a small illustration follows.
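For instance, the tag stripping could be applied before tokenizing, roughly like this (a sketch only; this naive regex will not handle every HTML corner case such as comments or script blocks):

import re

def strip_tags(content):
    # drop anything that looks like an HTML tag before counting
    return re.sub(r"<[^>]+>", "", content)

print(strip_tags("<p>hello <b>世界</b></p>"))   # hello 世界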
Thank you for reading. I hope this article on counting Chinese characters and English words with regular expressions in Python has been helpful. Please continue to support us and follow the industry information channel, where more related knowledge awaits.