
How to Implement the Lucene 4.7 Word Segmenter


This article mainly explains how to implement the Lucene 4.7 word segmenter. The method introduced here is simple, fast, and practical; interested readers may wish to have a look and follow along.

First of all, the first problem in front of us is Chinese word segmentation, because Lucene was, after all, developed with English text in mind. Fortunately, the Lucene download package ships the SmartCN word segmenter for Chinese, and that package is updated every time a new Lucene version is released.

The Chinese word segmenter recommended by the author, however, is the IK segmenter. Before entering the formal explanation, let's first look at the several analyzers built into Lucene.

A basic introduction to the built-in analyzer types:

- WhitespaceAnalyzer: uses whitespace as the segmentation boundary and applies no other normalization to the tokens.
- SimpleAnalyzer: splits text on non-letter characters, lowercases the tokens, and drops purely numeric tokens.
- StopAnalyzer: on top of SimpleAnalyzer's behavior, removes common stop words such as "the" and "a"; the stop word list can be customized.
- StandardAnalyzer: Lucene's built-in standard analyzer; lowercases tokens and removes stop words and punctuation.
- CJKAnalyzer: can analyze Chinese, Japanese, and Korean text, but its Chinese support is mediocre.
- SmartChineseAnalyzer: supports Chinese somewhat better, but its extensibility is poor.
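To make those differences concrete, here is a minimal sketch, assuming the Lucene 4.7 jars lucene-core and lucene-analyzers-common are on the classpath (the class name AnalyzerComparison and the sample sentence are illustrative). It prints how each of the four common built-in analyzers splits the same text; CJKAnalyzer ships in the same analyzers-common jar and SmartChineseAnalyzer in the separate lucene-analyzers-smartcn jar, and both could be added to the array in the same way.

package com.ikforlucene;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerComparison {

    public static void main(String[] args) throws Exception {
        String text = "The Quick-Brown fox No.42";
        Analyzer[] analyzers = {
                new WhitespaceAnalyzer(Version.LUCENE_47),
                new SimpleAnalyzer(Version.LUCENE_47),
                new StopAnalyzer(Version.LUCENE_47),
                new StandardAnalyzer(Version.LUCENE_47)
        };
        for (Analyzer analyzer : analyzers) {
            System.out.print(analyzer.getClass().getSimpleName() + ": ");
            // Consume the token stream and print each token in brackets
            TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.print("[" + term + "] ");
            }
            ts.end();
            ts.close();
            System.out.println();
        }
    }
}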

To evaluate a word segmenter, the key criteria are segmentation efficiency, flexibility, and extensibility. A good Chinese segmenter should normally support an extension dictionary, a stopword dictionary, and synonyms. Most important of all, though, is fit with your own business: sometimes no custom dictionaries are needed, in which case this need not factor into the choice. The latest IK release on the official site supports Lucene well, but its Solr support is weaker; making it work with solr4.x requires changing the source code. The author instead uses an IK package modified by others to support solr4.3, with full support for extension dictionaries, stopword dictionaries, and synonym dictionaries, and it is very easy to configure in Solr: a simple entry in schema.xml is enough to use IK's powerful customization features. Note, however, that the IK package released by IK's author on the official site does not support extending the synonym dictionary in Lucene; to use that, you have to modify the source code yourself, although the change for extended synonyms is quite easy.
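For comparison with the modified package used below, the stock (unmodified) IK release can be dropped straight into Lucene. A minimal sketch, assuming an IK build compatible with Lucene 4.x (e.g. IKAnalyzer2012FF) is on the classpath; the class name StockIKTest is illustrative:

package com.ikforlucene;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class StockIKTest {

    public static void main(String[] args) throws Exception {
        // true enables IK's smart (coarse-grained) mode; false gives fine-grained segmentation
        Analyzer analyzer = new IKAnalyzer(true);
        TokenStream ts = analyzer.tokenStream("field", new StringReader("Sanjie Sanxian is a rookie"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}

The boolean flag is the main tuning knob of the stock analyzer: fine-grained mode emits more, overlapping tokens for recall, while smart mode emits fewer, longer tokens for precision.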

Below, the author gives a test in Lucene using the latest official IK release, modified to support an extended synonym dictionary.

Let's take a look at the first test: plain word segmentation.

package com.ikforlucene;

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Test {

    public static void main(String[] args) throws Exception {
        // The analyzer below is the modified IK analyzer that supports synonyms
        IKSynonymsAnalyzer analyzer = new IKSynonymsAnalyzer();
        String text = "Sanjie Sanxian is a rookie";
        TokenStream ts = analyzer.tokenStream("field", new StringReader(text));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset(); // prepare the stream for consumption
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.end();
        ts.close(); // close the stream
    }
}
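One detail worth noting in the listing above: from Lucene 4.x onward, the TokenStream contract requires reset() to be called before the first incrementToken() and end() after the last one, with close() releasing the underlying resources afterwards; omitting reset() generally causes the stream to fail rather than silently return tokens.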

Running result:

Sanjie Sanxian is a rookie.

The second step is to test the extension dictionary, so that "Sanjie" is segmented as one word and "Sanxian" as one word; both entries need to be added to the extension dictionary, one entry per line. Note that the file must be saved as UTF-8 without BOM.
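For illustration, the extension dictionary is just a plain-text file with one entry per line; a hypothetical ext.dic for this test would contain the two lines below. With the stock IK package such a file is registered through the ext_dict entry of an IKAnalyzer.cfg.xml on the classpath; the modified package works analogously.

Sanjie
Sanxian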

After adding the extension dictionary, the running result is as follows:

Sanjie Sanxian is a rookie.

The third step is to test the stopword dictionary. Here we block out the word "Cainiao" (rookie), again one word per line and in the same format as above.
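For illustration, the stopword file uses the same one-entry-per-line format; a hypothetical stop.dic for this test (registered with the stock IK package via the ext_stopwords entry of IKAnalyzer.cfg.xml) would contain the single line below.

Cainiao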

After adding the stopword dictionary, the result is as follows:

Sanjie Sanxian is a

Finally, let's test the synonyms. Here the author adds "Henan people" and "Luoyang people" to the synonym dictionary as synonyms of the remaining word "a" (this is only a test; synonyms in a real production environment must be proper ones). Note that the synonym file is also read line by line, and the synonyms on each line are separated by commas.
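For illustration, since the synonym stage here is Solr's SynonymFilterFactory, the file follows its plain-text format: one comma-separated synonym group per line, saved as UTF-8. A hypothetical synonyms.txt line for this test (the exact entries depend on which words you want to equate) would be:

a,Henan people,Luoyang people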

After adding the synonym dictionary, the running result is as follows:

Sanjie Sanxian is a native of Henan and Luoyang.

So far, most of the functions of using IK in Lucene 4.3 have been tested. The source code of the synonym-extension part is given below for the reference of interested readers.

package com.ikforlucene;

import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.apache.solr.core.SolrResourceLoader;
import org.wltea.analyzer.lucene.IKTokenizer;

/**
 * A special IK analyzer for Lucene that can load a synonym dictionary.
 */
public class IKSynonymsAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer token = new IKTokenizer(reader, true); // enable smart segmentation
        Map<String, String> paramsMap = new HashMap<String, String>();
        paramsMap.put("luceneMatchVersion", "LUCENE_43");
        paramsMap.put("synonyms", "E:\\synonym\\synonyms.txt");
        SynonymFilterFactory factory = new SynonymFilterFactory(paramsMap);
        SolrResourceLoader loader = new SolrResourceLoader("");
        try {
            factory.inform(loader); // let the factory load the synonym file
        } catch (IOException e) {
            e.printStackTrace();
        }
        return new TokenStreamComponents(token, factory.create(token));
    }
}

At this point, I believe you have a deeper understanding of how to implement the Lucene 4.7 word segmenter. You might as well try it out in practice!
