In this issue, the editor walks you through how to implement Chinese word segmentation with Lucene. The article covers the topic from a practical, professional angle, and I hope you take something away from it.
What Is a Chinese Word Segmenter
Anyone who has studied English knows that English text is made up of words separated by spaces and punctuation, so splitting it into terms is straightforward.
Chinese has no such delimiters, and simply cutting the text into individual characters loses the meaning.
Therefore, we need a word segmenter (analyzer) that can automatically recognize Chinese semantics.
StandardAnalyzer:
Lucene's built-in standard analyzer.
Segmentation: splits Chinese text one character at a time. For example, "我爱中国" ("I love China")
Result: "我", "爱", "中", "国".
CJKAnalyzer
Bigram segmentation: splits the text into overlapping two-character tokens. For example, "我是中国人" ("I am Chinese"), result: "我是", "是中", "中国", "国人".
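To see the difference for yourself, a minimal test like the following prints the tokens each analyzer produces. This is a sketch assuming a recent Lucene with lucene-core and lucene-analyzers-common on the classpath; on older 4.x releases the analyzer constructors also take a Version argument.

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerCompare {

    // Print every token the given analyzer produces for the text.
    static void printTokens(String label, Analyzer analyzer, String text) throws IOException {
        StringBuilder out = new StringBuilder(label + ": ");
        try (TokenStream ts = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                out.append("[").append(term.toString()).append("] ");
            }
            ts.end();
        }
        System.out.println(out);
    }

    public static void main(String[] args) throws IOException {
        String text = "我是中国人"; // "I am Chinese"
        printTokens("StandardAnalyzer", new StandardAnalyzer(), text); // one token per character
        printTokens("CJKAnalyzer", new CJKAnalyzer(), text);           // overlapping two-character tokens
    }
}

The same helper can later be fed new IKAnalyzer() for comparison; IK's dictionary-based segmentation typically keeps meaningful words such as 中国人 together instead of cutting them into single characters or blind bigrams.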
Neither of these two analyzers meets the requirements for Chinese.
Use the Chinese word segmenter IKAnalyzer
IKAnalyzer extends Lucene's Analyzer abstract class, so it is used the same way as Lucene's built-in analyzers: change the Analyzer in the test code to IKAnalyzer to see the effect of its Chinese word segmentation.
If you use the Chinese segmenter IKAnalyzer, use the same IKAnalyzer in both the indexing program and the search program.
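As an illustration of "same analyzer on both sides", the sketch below builds a small in-memory index with IKAnalyzer and parses the query with the very same analyzer instance. It assumes Lucene 5/6-style constructors and an IKAnalyzer build that matches your Lucene version; older combinations pass a Version argument to IndexWriterConfig and QueryParser.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IkIndexSearchDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer();   // the same analyzer is used for indexing and searching
        Directory dir = new RAMDirectory();     // in-memory index, just for the demo

        // Index one document whose content field is analyzed by IKAnalyzer.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("content", "我是中国人", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Parse the query with the SAME analyzer so query terms match the indexed terms.
        Query query = new QueryParser("content", analyzer).parse("中国人");
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs hits = searcher.search(query, 10);
        System.out.println("hits: " + hits.totalHits);
    }
}

The point is that the query terms are produced by the same rules that produced the indexed terms; mixing different analyzers between indexing and searching is a common cause of missing hits.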
Using Luke to test IK Chinese word segmentation
(1) Open Luke without specifying a Lucene index directory first; otherwise the analyzer test will show no effect.
(2) In the analyzer field, enter the fully qualified class name of IKAnalyzer by hand:
org.wltea.analyzer.lucene.IKAnalyzer
Transform the code to use IKAnalyzer as the analyzer
Add the IKAnalyzer jar package to the project.
Modify the analyzer code:
// create a Chinese analyzer
Analyzer analyzer = new IKAnalyzer();
Extend the Chinese dictionary
Purpose of the extension dictionary: words defined in it are kept as whole terms during segmentation.
1. Create your own extension dictionary file mydict.dic under src or another source directory.
2. Create your own stopword dictionary file ext_stopword.dic under src or another source directory.
Purpose of stopwords: the segmenter ignores these words during segmentation.
3. Create an IKAnalyzer.cfg.xml under src or another source directory with the following content (make sure the paths match):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <entry key="ext_dict">mydict.dic</entry>
    <entry key="ext_stopwords">ext_stopword.dic</entry>
</properties>
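The two dictionary files are plain text saved as UTF-8 (without BOM), one term per line. The entries below are only illustrative; fill in whatever domain terms and stopwords your application needs.

mydict.dic (illustrative entries, one per line):
高富帅
白富美

ext_stopword.dic (illustrative entries, one per line):
的
了
啊

By default IKAnalyzer looks for IKAnalyzer.cfg.xml on the classpath root, so the dictionary files referenced in it are also resolved from the classpath; keeping all three files together under src (or another source directory) is the simplest layout.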
The above covers how to implement Chinese word segmentation with Lucene. If you have run into similar questions, the analysis above may help you work through them. If you want to learn more, you are welcome to follow the industry information channel.