2025-01-14 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces open source dictionaries and tools for NLP. It covers a lot of ground from a practitioner's point of view, and I hope you get something out of it.
Preface
With the popularity of pre-trained models such as BERT, ERNIE, and XLNet, solving an NLP problem without a pre-trained model can seem a little out of date. That impression is clearly wrong.
As everyone knows, pre-trained models consume a lot of computing power in both training and inference and depend heavily on GPU resources. Yet many NLP problems can be solved with dictionaries and rules alone, so forcing a bulky model onto them is like shooting mosquitoes with an anti-aircraft gun: very poor value for money.
So Xiaoxi hand-picked 45 of the more practical open source gadgets and dictionaries from one sprawling GitHub repo, so that when you build an NLP system or train models you can rely less on models and compute and more on small, elegant code.
1. Textfilter: Chinese and English sensitive-word filter
repo: observerss/textfilter
>>> f = DFAFilter()
>>> f.add("sexy")
>>> f.filter("hello sexy baby")
hello **** baby
The sensitive words cover politics, profanity, and similar topics. The filter works mainly by dictionary lookup (the keyword files in the project). Be warned that the word lists themselves are explicit.
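To make the dictionary-lookup idea concrete, here is a minimal sketch of a trie-based filter. It is not the code from observerss/textfilter: the class name `TrieFilter` and its internals are hypothetical, and only the `add`/`filter` call shape mirrors the usage shown above.

```python
class TrieFilter:
    """Toy dictionary filter: keywords live in a nested-dict trie (hypothetical sketch)."""

    _END = object()  # sentinel key meaning "a keyword ends at this node"

    def __init__(self):
        self.root = {}

    def add(self, keyword):
        # Insert the lowercased keyword one character at a time.
        node = self.root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[self._END] = True

    def filter(self, message, repl="*"):
        out = []
        i = 0
        while i < len(message):
            node = self.root
            matched = 0  # length of the longest keyword starting at position i
            j = i
            while j < len(message) and message[j].lower() in node:
                node = node[message[j].lower()]
                j += 1
                if self._END in node:
                    matched = j - i
            if matched:
                out.append(repl * matched)  # mask the whole keyword
                i += matched
            else:
                out.append(message[i])
                i += 1
        return "".join(out)


f = TrieFilter()
f.add("sexy")
print(f.filter("hello sexy baby"))
```

Because every keyword shares one trie, the cost of filtering grows with the length of the message rather than with the size of the word list, which is why this approach suits large keyword files.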
2. Langid: language detection for 97 languages
repo: saffsd/langid.py
pip install langid
>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)
3. Langdetect: another language detection tool
Address: https://code.google.com/archive/p/language-detection
pip install langdetect
from langdetect import detect
from langdetect import detect_langs
s1 = "this blog mainly introduces two language detection tools to distinguish what language the text is."
s2 = 'We are pleased to introduce today a new technology'
print(detect(s1))
print(detect(s2))
print(detect_langs(s1))  # detect_langs() returns every detected language with its probability
Note: the returned language codes follow the ISO 639-1 standard.
Compared with langid above, its accuracy is lower but it runs faster.
4. Phone: attribution lookup for mainland China mobile numbers:
Repo: ls0f/phone
Integrated into python package cocoNLP
from phone import Phone
p = Phone()
p.find(18100065143)
# returns {'phone': '18100065143', 'province': 'Shanghai', 'city': 'Shanghai', 'zip_code': '200000', 'area_code': '021', 'phone_type': 'Telecom'}
Supported number prefixes: 13*, 15*, 18*, 14[5,7], 17[0,6,7,8]
Number of records: 360569 (updated: April 2017)
The author also provides the data file phone.dat so that non-Python users can load the data directly.
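As a rough illustration of the supported prefixes (13*, 15*, 18*, 14[5,7], 17[0,6,7,8]), the check below tests whether an 11-digit number falls inside them before a lookup is attempted. This is only a sketch: `is_supported` is a hypothetical helper, not part of the ls0f/phone package.

```python
import re

# Prefixes taken from the list above; a mobile number has 11 digits in total.
SUPPORTED = re.compile(r"(?:13\d|15\d|18\d|14[57]|17[0678])\d{8}")


def is_supported(number):
    """Return True if the number starts with one of the listed prefixes."""
    return SUPPORTED.fullmatch(str(number)) is not None


print(is_supported(18100065143))
print(is_supported(16100065143))
```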
5. Phone: attribution lookup for international mobile and landline numbers:
repo: AfterShip/phone
npm install phone
import phone from 'phone';
phone('+852 6569-8900'); // returns ['+85265698900', 'HKG']
phone('(817) 569-8900'); // returns ['+18175698900', 'USA']
6. Ngender: guess gender from a name:
repo: observerss/ngender
The probability is computed with naive Bayes.
pip install ngender
>>> import ngender
>>> ngender.guess('赵本山')  # Zhao Benshan
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')  # Song Dandan
('female', 0.9759486128949907)
7. Regular expression for extracting email addresses
Integrated into python package cocoNLP
import re
email_pattern = r'^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
emails = re.findall(email_pattern, text, flags=0)
8. Regular expression for extracting phone numbers
Integrated into python package cocoNLP
cellphone_pattern = r'^((13[0-9])|(14[0-9])|(15[0-9])|(17[0-9])|(18[0-9]))\d{8}$'
phoneNumbers = re.findall(cellphone_pattern, text, flags=0)
9. Regular expression for extracting ID card numbers
IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'
IDs = re.findall(IDCards_pattern, text, flags=0)
10. Person name corpus:
Repo: wainshine/Chinese-Names-Corpus
The name extraction function has been added to python package cocoNLP.
Chinese names (modern and ancient), Japanese names, Chinese surnames and given names, forms of address (aunt, etc.), English-to-Chinese names (e.g. John Lee), and an idiom dictionary
(can be used for Chinese word segmentation and name recognition)
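A note on the email, phone-number, and ID-card patterns in items 7 to 9: they are anchored with `^` and `$`, so `re.findall` only matches when the whole string is a single value, and the capture groups make it return groups rather than full matches. The sketch below shows one way to pull values out of free text instead. The unanchored patterns, the ASCII-only email local part, and the `extract` helper are adaptations for illustration, not cocoNLP's code, and the sample text is made up.

```python
import re

# Unanchored, non-capturing variants of the patterns above (illustrative adaptation).
EMAIL = re.compile(
    r"[a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}"
)
# 13x/14x/15x/17x/18x followed by 8 digits, not embedded in a longer digit run.
CELLPHONE = re.compile(r"(?<!\d)1[34578]\d{9}(?!\d)")
# 6-digit region, 8-digit birth date, 3-digit sequence, check character.
ID_CARD = re.compile(
    r"(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])"
    r"(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?![0-9xX])"
)


def extract(text):
    """Collect every email, mobile number, and ID number found in the text."""
    return {
        "emails": EMAIL.findall(text),
        "phones": CELLPHONE.findall(text),
        "ids": ID_CARD.findall(text),
    }


sample = "Contact alice@example.com or 13812345678; ID 11010519900307123X."
print(extract(sample))
```

The lookarounds keep the phone pattern from firing inside the longer ID number, which is a common pitfall when several digit-based patterns run over the same text.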
11. Chinese abbreviation library: repo: zhangyics/Chinese-abbreviation-dataset
National People's Congress: National People's Congress /n
China: People's Republic of China /ns
Women's tennis match: women's /n tennis match /vn
12. Chinese character decomposition dictionary: repo: kfcd/chaizi
Character | Decomposition (1) | Decomposition (2) | Decomposition (3)
拆 | 手 斥 | 扌 斥 | 才 斥
13. Sentiment values for words: repo: rainarch/SentiBridge
Mountain spring water is abundant 0.400704566541 0.370067395878
Broad vision 0.305762728932 0.325320747491
Grand Canyon adventure 0.312137906517 0.378594957281
14. Chinese lexicons, stop words, and sensitive words: repo: dongxiexidian/Chinese
This package classifies its sensitive-word lists in more detail:
Reactionary words, sensitive-word statistics, violence and terrorism words, livelihood words, pornographic words
15. Chinese character to Pinyin: repo: mozillazg/python-pinyin
Useful for text error correction.
16. Simplified/Traditional Chinese conversion: repo: skydark/nstools
17. Engine for simulating Chinese pronunciation with English: repo: tinyfool/ChineseWithEnglish
say wo i ni  # says: I love you
In effect, it uses English pronunciation to approximate Chinese pronunciation.
18. Synonym, antonym, and negation lexicons: repo: guotong1988/chinese_dictionary
19. Chinese character data: repo: skishore/makemeahanzi
Stroke order of simplified / traditional Chinese characters
Vector stroke
20. Split text written without spaces into words: repo: keredson/wordninja
>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']
21. IP address regular expression: (25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
22. Tencent QQ number regular expression: [1-9]([0-9]{5,11})
23. Regular expression for domestic landline numbers: [0-9-()（）]{7,18}
24. Regular expression for user names: [A-Za-z0-9_\-\u4e00-\u9fa5]+
25. G2pC: context-based module for automatically annotating Chinese pronunciation: repo: Kyubyong/g2pC
26. Time extraction:
Integrated into python package cocoNLP
The test was run at 9:44 on June 7, 2016, with the following results:
Hi, all. Meeting at 3 p.m. next Monday >> 2016-06-13 15:00:00-false
Meeting on Monday >> 2016-06-13 00:00:00-true
Meeting the Monday after next >> 2016-06-20 00:00:00-true
Java version:
https://github.com/shinyke/Time-NLP
Python version:
https://github.com/zhanzecheng/Time_NLP
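The IP-address and QQ patterns in items 21 and 22 can be tried out with `re.fullmatch`, which requires the whole string to match. This is a small sketch: the groups are made non-capturing, and `is_ipv4` and `is_qq` are hypothetical helper names.

```python
import re

# One octet, as in the pattern from item 21, written as a non-capturing group.
OCTET = r"(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)"
IPV4 = re.compile(r"\.".join([OCTET] * 4))

# QQ number: a non-zero leading digit followed by 5 to 11 more digits.
QQ = re.compile(r"[1-9][0-9]{5,11}")


def is_ipv4(value):
    """True if the whole string is a dotted IPv4 address under the pattern."""
    return IPV4.fullmatch(value) is not None


def is_qq(value):
    """True if the whole string looks like a QQ number under the pattern."""
    return QQ.fullmatch(value) is not None


print(is_ipv4("192.168.1.1"), is_ipv4("256.1.1.1"))
print(is_qq("123456"), is_qq("10001"))
```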
27. Fast conversion between Chinese numerals and Arabic numerals: repo: HaveTwoBrush/cn2an
Converts Chinese numerals to Arabic numerals and back
Support for strings that mix Chinese and Arabic numerals is under development.
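To show what such a conversion involves, here is a minimal sketch that turns a Chinese numeral into an integer. It is not cn2an's implementation: `cn2int` and its tables are hypothetical, it handles only plain integers with the units 十, 百, 千, 万, and 亿, and it does not validate malformed input beyond rejecting unknown characters.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}


def cn2int(text):
    """Convert a simple Chinese numeral such as 一百零五 to an int (sketch only)."""
    total = 0    # sum of sections already closed by 万 or 亿
    section = 0  # value accumulated since the last 万 or 亿
    digit = 0    # digit waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 十 (as in 十五) counts as one ten.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            section += digit
            total += section * BIG_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


print(cn2int("一百零五"))
print(cn2int("两万三千"))
```

The three-level accumulator (digit, section, total) mirrors how the numerals are read aloud: a digit attaches to the next small unit, and 万 or 亿 scales everything accumulated since the previous large unit.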
28. Company name corpus: repo: wainshine/Company-Names-Corpus
29. Ancient poetry lexicon: repo: panhaiqi/AncientPoetry
A more complete lexicon of ancient poetry:
https://github.com/chinese-poetry/chinese-poetry
30. THU open lexicons: repo: http://thuocl.thunlp.org/
It has been sorted into the data folder of this repo.
IT lexicon, finance lexicon, idiom lexicon, place-name lexicon, historical-figure lexicon, poetry lexicon, medical lexicon, food lexicon, legal lexicon, automobile lexicon, animal lexicon
31. PDF table data extraction tool
Repo: camelot-dev/camelot
32. Regular expressions for domestic mobile numbers (the three major carriers plus virtual operators, etc.): repo: VincentSit/ChinaMobilePhoneNumberRegex
33. User name blacklist: repo: marteinn/The-Big-Username-Blacklist
Contains a list of disallowed user names, such as:
administrator, administration, autoconfig, autodiscover, broadcasthost, domain, editor, guest, host, hostmaster, info, keybase.txt, localdomain, localhost, master, mail, mail0, mail
34. Microsoft's multilingual recognition package for numbers, units, dates, and times: repo: Microsoft/Recognizers-Text
35. chinese-xinhua: Chinese Xinhua Dictionary database and API, including commonly used xiehouyu, idioms, words, and Chinese characters: repo: pwxcoo/chinese-xinhua
36. Automatic generation of document graphs: repo: liuhuanyong/TextGrapher
TextGrapher: Text Content Grapher based on keyinfo extraction by NLP method. Given a document, it extracts the key information, structures it, and organizes it into a graph that displays the semantic content of the article visually.
37. Number-name library for 186 languages: repo: google/UniNum
38. Traditional/Simplified Chinese conversion: repo: berniey/hanziconv
39. Chinese character feature extractor (featurizer) that extracts character features (pronunciation features, glyph features) for use as deep learning features: repo: howl-anderson/hanzi_char_featurizer
40. Chinese abbreviation dataset: repo: zhangyics/Chinese-abbreviation-dataset
41. Wudao Dictionary: a command-line version of Youdao Dictionary that supports English-Chinese lookup in both directions and online queries: repo: ChestnutHeng/Wudao-dict
42. Best Chinese numeral to Arabic numeral conversion tool: repo: Wall-ee/chinese2digits
43. LineFlow: an efficient NLP data loader for all deep learning frameworks: repo: tofunlp/lineflow
44. Parse natural-language number strings into integers and floating-point numbers: repo: jaidevd/numerizer
45. Profane word list: repo: zacanger/profane-words
Those are the open source NLP dictionaries and tools the editor wanted to share. If you have run into similar problems, the list above may help.