
How to Implement a Chinese Analyzer with Lucene


In this article, the editor shows you how to implement a Chinese analyzer with Lucene. The topic is analyzed and described from a practical point of view; I hope you get something useful out of it.

What Is a Chinese Analyzer

Anyone who has studied English knows that English text is based on words, and words are separated by spaces or punctuation.

Chinese is different: a sentence is a continuous run of characters with no delimiters between words, so it cannot simply be split on spaces the way English can.

Therefore, we need an analyzer that can recognize Chinese word boundaries automatically.

StandardAnalyzer:

Lucene's built-in standard analyzer.

Single-character segmentation: Chinese text is split one character at a time. For example, "我爱中国" ("I love China") becomes "我", "爱", "中", "国".

CJKAnalyzer

Bigram segmentation: the text is split into overlapping two-character tokens. For example, "我是中国人" ("I am Chinese") becomes "我是", "是中", "中国", "国人".

Neither of these two analyzers produces real Chinese words, so they cannot meet the demand; the sketch below compares their output.
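To make the difference concrete, here is a minimal sketch that prints the tokens each built-in analyzer produces. It assumes a reasonably recent Lucene (5+), where StandardAnalyzer and CJKAnalyzer have no-argument constructors; the field name "content" and the class name AnalyzerCompare are arbitrary choices for the demo, not taken from the article.

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerCompare {

    // Print every token the given analyzer produces for the text.
    static void printTokens(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term + "] ");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "我是中国人";
        printTokens(new StandardAnalyzer(), text); // expect one token per character
        printTokens(new CJKAnalyzer(), text);      // expect overlapping two-character tokens
    }
}

Neither output contains a real word such as "中国人", which is exactly the limitation described above.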

Using the Chinese Analyzer IKAnalyzer

IKAnalyzer extends Lucene's Analyzer abstract class, so it is used in the same way as Lucene's built-in analyzers: change the analyzer in the test code to IKAnalyzer to see the effect of proper Chinese word segmentation.

If you use the Chinese analyzer ik-analyzer, use the same analyzer in both the indexing program and the search program, as in the sketch below.
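The following is a self-contained sketch of that rule, not the article's original code. It assumes a Lucene version compatible with your IK Analyzer build (the classic IK 2012 release targets Lucene 4.x, and community forks exist for newer versions, where constructor signatures differ slightly), and it uses RAMDirectory only to keep the demo in memory (that class was removed in Lucene 9).

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class SameAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        // One analyzer type for both sides.
        Analyzer analyzer = new IKAnalyzer();

        // Indexing side: the IK analyzer segments the document text.
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("content", "我是中国人", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search side: the QueryParser segments the query text with the same analyzer,
        // so the query terms line up with the indexed terms.
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("content", analyzer).parse("中国人");
        System.out.println("hits: " + searcher.search(query, 10).totalHits);
        reader.close();
    }
}

If the two sides used different analyzers, the indexed terms and the query terms could be segmented differently, and matching documents would be missed.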

Using Luke to Test IK Chinese Word Segmentation

(1) Open Luke and do not point it at a Lucene index directory yet; otherwise the test will show no effect.

(2) In the analyzer field, manually enter the fully qualified class name of IKAnalyzer:

org.wltea.analyzer.lucene.IKAnalyzer

Modifying the Code to Use IKAnalyzer

Add the IKAnalyzer jar package to the project.

Modify the analyzer code:

// create a Chinese analyzer
Analyzer analyzer = new IKAnalyzer();
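To check the effect of the switch, you can run the IK analyzer through a small token-printing test, similar to the comparison sketch earlier. This is again illustrative: the class name IkAnalyzerTest is made up, and the boolean "smart mode" constructor shown here exists in the classic IK distribution but may vary in forks.

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IkAnalyzerTest {
    public static void main(String[] args) throws Exception {
        // true = smart (coarse-grained) segmentation; the no-argument constructor
        // uses fine-grained mode, which emits more candidate terms.
        Analyzer analyzer = new IKAnalyzer(true);
        TokenStream ts = analyzer.tokenStream("content", new StringReader("我是中国人"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term + "] ");
        }
        ts.end();
        ts.close();
    }
}

Unlike the two built-in analyzers, IK recognizes dictionary words, so a sentence like "我是中国人" yields terms such as "中国人" instead of isolated characters or blind bigrams.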

Extending the Chinese Dictionary

Role of the extension dictionary: words defined in it are kept as whole terms during segmentation.

1. Create your own extension dictionary file, mydict.dic, under src or another source directory (illustrative contents are sketched after this list).

2. Create your own stopword dictionary file, ext_stopword.dic, under src or another source directory.

Role of stopwords: during segmentation the analyzer simply ignores these words.

3. Create an IKAnalyzer.cfg.xml under src or another source directory, with content like the following (note that the file names must match the dictionaries you created):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <entry key="ext_dict">mydict.dic</entry>
    <entry key="ext_stopwords">ext_stopword.dic</entry>
</properties>
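For illustration only (the entries below are placeholders, not from the article), both dictionary files are plain text with one term per line and should be saved as UTF-8 (some IK versions require UTF-8 without a BOM):

mydict.dic — custom words that should be kept whole, for example:

高富帅
蓝瘦香菇

ext_stopword.dic — words the analyzer should ignore, for example:

的
了
是

Note that dictionary changes only affect text analyzed afterwards, so an existing index has to be rebuilt before searches see the new segmentation.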

The above is what the editor has shared about how to implement a Chinese analyzer with Lucene; if you happen to have similar questions, the analysis above may help you work through them.
