In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-02-22 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >
Share
Shulou(Shulou.com)06/03 Report--
This article mainly introduces python how to identify obscure Chinese characters, which has a certain reference value, interested friends can refer to, I hope you can learn a lot after reading this article, let the editor take you to understand it.
Problem point
Originally considered to use rules to judge Chinese, because the Internet found that regular matching Chinese is [\ u4e00 -\ u9fa5]. Then the code was almost finished, and it was found that some obscure words were no longer in this range.
Identification method
Under utf-8 character coding, a Chinese character occupies 3 bytes, but the length of the character is only 1.
1. Whether the method of analyzing Chinese can be flexible as len (bytes (str,'utf-8) = = 3 and len (string) = = 1).
2. After judging Chinese in text writing, if it is a Chinese character str.ljust (5), otherwise it is str.ljust (6) (because one Chinese character accounts for the length of two characters).
Example
From pypinyin import pinyinimport re class ChangePinyin: def _ init__ (self, filename): self.file = filename self.lyric = self.read_file () self.pinyin = [] def read_file (self): with open (self.file, encoding='utf-8') as f: return f.readlines () def write_file (self): with open ('New_%s'% self.file,' w') Encoding='utf-8') as f: print (self.lyric) for line in self.lyric: # print (line) if line.strip () ='': continue _ new_line = re.sub (r'\ s,'' Line) # Line content to Pinyin _ pinyin =''.join (map (lambda x: X [0] .ljust (6), pinyin (_ new_line) # according to Chinese and English Split line content from Chinese characters _ lyric = self.split_words (_ new_line) f.write ('% s\ n% s\ n% s\ n'% (_ pinyin, _ lyric)) @ staticmethod def split_words (words): word_list = "" tmp = "" for string in words: if len (bytes (string) 'utf-8')) = = 3 and len (string) = = 1: if tmp! ='': word_list + = tmp.ljust (6) tmp = "" word_list + = string.ljust (5) else: tmp + = string return word_list if _ name__ = ='_ _ main__': Main = ChangePinyin ('lyric.txt') Main.write_file () Thank you for reading this article carefully I hope the article "how to identify obscure Chinese characters in python" shared by the editor will be helpful to you. At the same time, I also hope that you will support us and pay attention to the industry information channel. More related knowledge is waiting for you to learn!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.