What are the NLP open source dictionaries and tools 07/12 Update SLTechnology News&Howtos

2025-07-12 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

In this issue, the editor introduces open source dictionaries and tools for NLP. The article is rich in content and approaches the topic from a professional point of view. I hope you get something out of it.

Preface

With the popularity of pre-trained models such as BERT, ERNIE, and XLNet, solving NLP problems without a pre-trained model always seems a little out of date. But this view is clearly wrong.

As we all know, pre-trained models consume a lot of computing power in both training and inference, and they rely heavily on GPU resources. However, many NLP problems can be solved with dictionaries and rules alone, so forcing a bulky model onto them is like shooting mosquitoes with anti-aircraft guns: the cost-effectiveness is very low.

So Xiaoxi carefully selected 45 practical open source tools and dictionaries from one sprawling GitHub repo, so that when building an NLP system or tuning models you can rely less on models and compute and more on small, elegant code.

1. Textfilter: Chinese and English sensitive words filter

repo: observerss/textfilter

>>> f = DFAFilter()
>>> f.add("sexy")
>>> f.filter("hello sexy baby")
hello **** baby

The sensitive words cover politics, profanity, and similar topics. The filter works mainly by dictionary lookup (the keyword file is in the project), and the word list itself is not exactly family-friendly.
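The dictionary-lookup principle can be sketched with a small character trie. This is only a minimal illustration, not textfilter's actual code: the class name KeywordFilter and the END sentinel are hypothetical, and matching here is case-sensitive.

```python
END = None  # hypothetical sentinel key marking the end of a keyword


class KeywordFilter:
    """Toy dictionary-based filter: a trie of keywords, longest match wins."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        # Walk/create one trie node per character, then mark the word end.
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def filter(self, text, repl="*"):
        out = []
        i = 0
        while i < len(text):
            node = self.root
            j = i
            match_end = -1
            # Follow the trie as far as the text allows, remembering
            # the end of the longest keyword seen so far.
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if END in node:
                    match_end = j
            if match_end == -1:
                out.append(text[i])
                i += 1
            else:
                out.append(repl * (match_end - i))
                i = match_end
        return "".join(out)


f = KeywordFilter()
f.add("sexy")
print(f.filter("hello sexy baby"))
```

Because lookup cost depends on the text length rather than the number of keywords, this kind of structure stays fast even with a large keyword file.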

2. Langid: language detection for 97 languages

repo: saffsd/langid.py

pip install langid

>>> import langid
>>> langid.classify("This is a test")
('en', -54.41310358047485)

3. Langdetect: another language detection tool

Address: https://code.google.com/archive/p/language-detection

pip install langdetect

from langdetect import detect
from langdetect import detect_langs

s1 = "this blog mainly introduces two language detection tools to distinguish what language the text is."
s2 = 'We are pleased to introduce today a new technology'
print(detect(s1))
print(detect(s2))
print(detect_langs(s2))  # detect_langs() outputs all detected languages and their probabilities

Note: the language codes in the output follow the ISO 639-1 standard. For more information, see the ISO 639-1 entry on Baidu Encyclopedia.

Compared with langid above, langdetect is less accurate but more efficient.

4. phone: home location lookup for Chinese mobile phone numbers

repo: ls0f/phone

Integrated into the Python package cocoNLP

from phone import Phone
p = Phone()
p.find(18100065143)
# return {'phone': '18100065143', 'province': 'Shanghai', 'city': 'Shanghai', 'zip_code': '200000', 'area_code': '021', 'phone_type': 'Telecom'}

Supported number segments: 13*, 15*, 18*, 14[5,7], 17[0,6,7,8]

Number of records: 360569 (updated: April 2017)

The author provides the data file phone.dat so that non-Python users can load the data as well.

5. phone: home location lookup for international mobile and telephone numbers

repo: AfterShip/phone

npm install phone

import phone from 'phone';
phone('+852 6569-8900'); // return ['+85265698900', 'HKG']
phone('(817) 569-8900'); // return ['+18175698900', 'USA']

6. ngender: guess gender from a name

repo: observerss/ngender

The probability is computed with naive Bayes.
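To make the naive Bayes idea concrete, here is a minimal sketch. It is not ngender's code or data: the per-character counts in CHAR_COUNTS are invented toy numbers, the function name guess_gender is hypothetical, and equal priors with add-one smoothing are assumed.

```python
from math import exp, log

# Hypothetical toy counts of how often a character appears in given names.
CHAR_COUNTS = {
    "male": {"强": 8, "伟": 9, "丽": 1},
    "female": {"丽": 9, "芳": 8, "强": 1},
}
VOCAB = {ch for counts in CHAR_COUNTS.values() for ch in counts}


def guess_gender(given_name):
    """Return (gender, probability) using naive Bayes over characters."""
    log_scores = {}
    for gender, counts in CHAR_COUNTS.items():
        total = sum(counts.values())
        score = log(0.5)  # assumed equal prior for both classes
        for ch in given_name:
            # Add-one smoothing so unseen characters do not zero the product.
            score += log((counts.get(ch, 0) + 1) / (total + len(VOCAB)))
        log_scores[gender] = score
    # Normalize the log scores into probabilities.
    top = max(log_scores.values())
    weights = {g: exp(s - top) for g, s in log_scores.items()}
    norm = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / norm


print(guess_gender("丽芳"))
```

The independence assumption (each character contributes separately) is what keeps the model this small; a real tool would estimate the counts from a large name corpus.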

pip install ngender

>>> import ngender
>>> ngender.guess('赵本山')  # Zhao Benshan
('male', 0.9836229687547046)
>>> ngender.guess('宋丹丹')  # Song Dandan
('female', 0.9759486128949907)

7. Regular expression for extracting email addresses

Integrated into python package cocoNLP
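The pattern in this item is anchored with ^ and $, so re.findall only matches when the whole string is a single address. As a rough sketch of pulling addresses out of running text, the unanchored, ASCII-only pattern below can be used instead; it is a simplified assumption for illustration, not cocoNLP's pattern, and the name extract_emails is hypothetical.

```python
import re

# Simplified, unanchored pattern (an assumption, not the cocoNLP pattern).
EMAIL_IN_TEXT = re.compile(
    r"[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,6}"
)


def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_IN_TEXT.findall(text)


sample = "Contact alice@example.com or bob.smith@mail.example.org."
print(extract_emails(sample))
```

The non-capturing group (?:...) matters here: with a capturing group, re.findall would return the group contents rather than the full address.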

email_pattern = r'^[*#\u4e00-\u9fa5 a-zA-Z0-9_.-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*\.[a-zA-Z0-9]{2,6}$'
emails = re.findall(email_pattern, text, flags=0)

8. Regular expression for extracting phone numbers

Integrated into python package cocoNLP

cellphone_pattern = r'^((13[0-9])|(14[0-9])|(15[0-9])|(17[0-9])|(18[0-9]))\d{8}$'
phoneNumbers = re.findall(cellphone_pattern, text, flags=0)

9. Regular expression for extracting ID card numbers

IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'
IDs = re.findall(IDCards_pattern, text, flags=0)

10. Person name corpus:

repo: wainshine/Chinese-Names-Corpus

The name extraction function has been added to the Python package cocoNLP.

Chinese names (modern and ancient), Japanese names, Chinese surnames and given names, forms of address (aunt, etc.), English -> Chinese names (e.g. John Lee), idiom dictionary

(can be used for Chinese word segmentation and name recognition)
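As an illustration of how a name list can drive segmentation, here is a minimal forward maximum matching sketch. The two-name lexicon is a stand-in for the corpus, and forward_max_match is a hypothetical helper, not part of cocoNLP or the corpus repo.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


names = {"赵本山", "宋丹丹"}  # stand-in for entries loaded from a name corpus
print(forward_max_match("赵本山和宋丹丹", names))
```

Tokens that come from the lexicon can then be tagged as person names, which is the simplest form of dictionary-based name recognition.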

11. Chinese abbreviation dataset

repo: zhangyics/Chinese-abbreviation-dataset

Each entry maps an abbreviation to its full form, segmented and tagged with parts of speech:

National People's Congress: national people's Congress /n
China: People's Republic of China /ns
Women's tennis match: women's /n tennis match /vn

12. Chinese character decomposition dictionary:

repo: kfcd/chaizi

Character    Decomposition (1)    Decomposition (2)    Decomposition (3)
拆           手 斥                扌 斥                才 斥

13. Word sentiment values:

repo: rainarch/SentiBridge

mountain spring water    abundant     0.400704566541    0.370067395878
vision                   broad        0.305762728932    0.325320747491
Grand Canyon             adventure    0.312137906517    0.378594957281

14. Chinese lexicons, stop words, and sensitive words

repo: dongxiexidian/Chinese

The sensitive-word lexicon in this package is classified in finer detail:

Reactionary lexicon, sensitive-word lexicon statistics, violence and terrorism lexicon, livelihood lexicon, pornography lexicon

15. Chinese character to Pinyin: repo: mozillazg/python-pinyin

Useful for text error correction.

16. Conversion between Traditional and Simplified Chinese:

repo: skydark/nstools

17. Engine that simulates Chinese pronunciation with English

repo: tinyfool/ChineseWithEnglish

say wo i ni  # says: I love you

It is equivalent to using English phonetic symbols to simulate Chinese pronunciation.

18. Synonym, antonym, and negation-word lexicons:

repo: guotong1988/chinese_dictionary

19. Chinese character data

repo: skishore/makemeahanzi

Stroke order of simplified / traditional Chinese characters

Vector stroke

20. Segmenting words from text written without spaces:

repo: keredson/wordninja

>>> import wordninja
>>> wordninja.split('derekanderson')
['derek', 'anderson']
>>> wordninja.split('imateapot')
['im', 'a', 'teapot']

21. IP address regular expression:

(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)

22. Tencent QQ number regular expression:

[1-9]([0-9]{5,11})

23. Regular expression for domestic landline numbers:

[0-9-()（）]{7,18}

24. Regular expression for user names:

[A-Za-z0-9_\-\u4e00-\u9fa5]+

25. g2pC: context-based automatic Chinese pronunciation tagging module

repo: Kyubyong/g2pC

26. Time extraction:

Integrated into python package cocoNLP

The test was run at 9:44 on June 7, 2016, and the results are as follows:

Hi, all. Meeting at 3 p.m. next Monday >> 2016-06-13 15:00:00-false
Meeting on Monday >> 2016-06-13 00:00:00-true
Meeting the Monday after next >> 2016-06-20 00:00:00-true

Java version:

https://github.com/shinyke/Time-NLP

Python version:

https://github.com/zhanzecheng/Time_NLP
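The rule behind the test results in item 26 (dates resolved relative to the time the test was run) can be sketched with the standard library. This is an assumption about the resolution rule for illustration only, not Time-NLP's implementation; resolve_weekday and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def resolve_weekday(base, weekday, weeks_ahead=None):
    """Resolve a weekday mention relative to base.

    weekday: 0 = Monday ... 6 = Sunday.
    weeks_ahead: 1 for "next week", 2 for "the week after next",
    None for a bare weekday name (the nearest non-past occurrence).
    """
    week_start = (base - timedelta(days=base.weekday())).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    if weeks_ahead is None:
        candidate = week_start + timedelta(days=weekday)
        if candidate.date() < base.date():
            candidate += timedelta(weeks=1)  # already past: prefer the future
        return candidate
    return week_start + timedelta(weeks=weeks_ahead, days=weekday)


base = datetime(2016, 6, 7, 9, 44)  # a Tuesday
print(resolve_weekday(base, 0, weeks_ahead=1).replace(hour=15))  # 2016-06-13 15:00:00
print(resolve_weekday(base, 0))                                  # 2016-06-13 00:00:00
print(resolve_weekday(base, 0, weeks_ahead=2))                   # 2016-06-20 00:00:00
```

A full extractor would first recognize the time phrase in the text with rules or regular expressions and only then apply a resolution step like this one.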

27. Fast conversion between Chinese numerals and Arabic numerals

repo: HaveTwoBrush/cn2an

Conversion between Chinese and Arabic numerals

Conversion of mixed Chinese and Arabic numerals is under development.
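One direction of the conversion can be sketched in a few lines. This is a minimal illustration under stated assumptions, not cn2an's implementation: it handles only well-formed positive integers written with the simplified characters listed below, and chinese_to_int is a hypothetical name.

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text):
    """Convert a well-formed Chinese numeral such as 一百二十三 to an int."""
    total = 0    # value of the sections already closed by 万 or 亿
    section = 0  # value accumulated inside the current section (below 万)
    digit = 0    # most recent digit not yet multiplied by a unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 十 (as in 十五) counts as one ten.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10**4
            section = digit = 0
        elif ch == "亿":
            total = (total + section + digit) * 10**8
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


print(chinese_to_int("一百二十三"))  # 123
print(chinese_to_int("三万五千"))    # 35000
```

Input validation, decimals, negative numbers, and the reverse direction are deliberately left out of this sketch.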

28. Company name corpus

repo: wainshine/Company-Names-Corpus

29. Ancient poetry lexicon

repo: panhaiqi/AncientPoetry

A more complete lexicon of ancient poetry:

https://github.com/chinese-poetry/chinese-poetry

30. Lexicons compiled by THU

repo: http://thuocl.thunlp.org/

They have been collected in the data folder of this repo.

IT lexicon, finance lexicon, idiom lexicon, place name lexicon, historical celebrity lexicon, poetry lexicon, medical lexicon, food lexicon, legal lexicon, automobile lexicon, animal lexicon

31. PDF table data extraction tool

repo: camelot-dev/camelot

32. Regular expressions for matching domestic phone numbers (three major carriers + virtual carriers, etc.)

repo: VincentSit/ChinaMobilePhoneNumberRegex

33. Username blacklist:

repo: marteinn/The-Big-Username-Blacklist

Contains a list of disallowed user names, such as:

administrator
administration
autoconfig
autodiscover
broadcasthost
domain
editor
guest
host
hostmaster
info
keybase.txt
localdomain
localhost
master
mail
mail0

34. Microsoft's multilingual recognition package for numbers, units, dates and times, etc.:

repo: Microsoft/Recognizers-Text

35. chinese-xinhua: Chinese Xinhua Dictionary database and API, including common xiehouyu, idioms, words, and Chinese characters

repo: pwxcoo/chinese-xinhua

36. Automatic generation of document graphs

repo: liuhuanyong/TextGrapher

TextGrapher: Text Content Grapher based on key information extraction by NLP methods. Given a document, it extracts the key information, structures it, and organizes it into a graph that visualizes the semantic information of the article.

37. Number name library for 186 languages

repo: google/UniNum

38. Traditional/Simplified Chinese conversion

repo: berniey/hanziconv

39. Chinese character featurizer: extracts character features (pronunciation features, glyph features) for use as deep learning features

repo: howl-anderson/hanzi_char_featurizer

40. Chinese abbreviation dataset

repo: zhangyics/Chinese-abbreviation-dataset

41. Wudao Dictionary: a command-line version of Youdao Dictionary that supports English-Chinese lookup in both directions and online queries

repo: ChestnutHeng/Wudao-dict

42. Best Chinese numeral to Arabic numeral conversion tool

repo: Wall-ee/chinese2digits

43. LineFlow: an efficient NLP data loader for all deep learning frameworks

repo: tofunlp/lineflow

44. Parsing natural-language number strings into integers and floating point numbers

repo: jaidevd/numerizer

45. Profane words list

repo: zacanger/profane-words

These are the NLP open source dictionaries and tools that the editor wanted to share with you. If you have run into similar questions, refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.
