How to implement highlighting in Lucene 4.7

2025-01-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to implement highlighting in Lucene 4.7. It covers the index settings that highlighting depends on and three approaches: the standard Highlighter, FastVectorHighlighter, and front-end highlighting.

Highlighting has long been one of the most useful modules in full-text retrieval. In any standard search engine, highlighting the returned hits is close to an essential requirement: by marking the user's search keywords on the results page, it reduces the time users spend scanning for the result they want and noticeably improves the user experience.

First, some background that highlighting depends on. If you only need the effect and do not care about the underlying API, you can skip this part. A friendly warning, though: if something goes wrong in use and you do not know the API, the problem will be hard to solve without a lot of searching.

Highlighting starts with the index. The fields to be highlighted need accurate position and offset information; if that information is inaccurate, the results can be misaligned, so that the page marks words that should not be marked and leaves the matching words unmarked. To record this information for each token, enable term vectors on the field. The code is simple:

import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

FieldType type = new FieldType(TextField.TYPE_STORED);
type.setStoreTermVectorOffsets(true);   // record character offsets
type.setStoreTermVectorPositions(true); // record token positions
type.setStoreTermVectors(true);         // store term vectors
type.freeze();                          // prevent further changes to the type
Field field = new Field("field name", "value", type); // example
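To make "position" and "offset" concrete, here is a minimal sketch that uses only the Java standard library, not Lucene. It splits text on whitespace and records each token's ordinal position and its start and end character offsets, which is roughly the kind of information term vectors keep so that a highlighter can wrap exactly the right characters. The class `OffsetSketch` and its `tokenize` and `wrap` methods are hypothetical names chosen for illustration, and the whitespace tokenizer is only a stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSketch {
    /** One token with its ordinal position and character offsets (end is exclusive). */
    static final class Tok {
        final String term;
        final int position;
        final int start;
        final int end;

        Tok(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    /** Whitespace tokenizer that records position and offsets (stand-in for an analyzer). */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0, pos = 0, n = text.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) out.add(new Tok(text.substring(start, i), pos++, start, i));
        }
        return out;
    }

    /** Wraps every token equal to term (ignoring case) by using the recorded offsets. */
    static String wrap(String text, String term, String pre, String post) {
        StringBuilder sb = new StringBuilder();
        int last = 0;
        for (Tok t : tokenize(text)) {
            if (t.term.equalsIgnoreCase(term)) {
                sb.append(text, last, t.start).append(pre)
                  .append(text, t.start, t.end).append(post);
                last = t.end;
            }
        }
        return sb.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        // prints: our <b>China</b> our land
        System.out.println(wrap("our China our land", "china", "<b>", "</b>"));
    }
}
```

If the recorded offsets were off by even one character, the tags would land inside or outside the intended word, which is the misalignment described above.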

Briefly, TextField defines two predefined field types:

Constant           Meaning
TYPE_NOT_STORED    indexed, tokenized, not stored
TYPE_STORED        indexed, tokenized, stored

It follows that content to be highlighted must be stored. Large text fields take up a lot of index space, which can hurt retrieval performance. The text can instead be kept in external storage such as a relational database or a NoSQL store, but highlighting then needs some extra handling.

Here are the basic classes used for highlighting:

Class                  Description
SimpleHTMLFormatter    Commonly used formatter that wraps hits in HTML tags. One constructor takes custom highlight tags; the no-argument constructor defaults to bold (<B>...</B>).
TokenSources           Static methods for obtaining a TokenStream from a data source, for token processing.
Highlighter            Obtains the highlighted fragments for a match.
QueryScorer            Scores the hit results.
Fragmenter             Splits the original string into independent fragments.
NullFragmenter         Highlights shorter fields as a whole.
Encoder                Operates on HTML text, for example handling special symbols and some non-ASCII special characters.

Lucene also provides a set of implementation classes for fast highlighting.

Here is the author's (Sanxian's) test data:

id: 1
name: China is a great country, we Chinese are all good. Haha, China will always be a strong
content: Hello people

id: 2
name: we have a home whose name is China
content: China's Land, Rich

id: 3
name: our China, our land is the hope of the people
content: if you don't generate some fields in the clip,

id: 4
name: what are you doing at this time in 2014?
content: , farmers work hard to hoe at noon

id: 5
name: who do you think of when you are lonely, do you want to find someone to accompany
content: I am never alone

1. Standard highlighting with Highlighter. The core test code:

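import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

// searcher is an IndexSearcher opened on the test index
String filed = "name";
QueryParser query = new QueryParser(Version.LUCENE_44, filed, new IKAnalyzer(false));
Query q = query.parse("Great China"); // test query
TopDocs top = searcher.search(q, 100);
QueryScorer score = new QueryScorer(q, filed); // scorer for the query
// custom highlight tags; the original tag strings did not survive, so <b>...</b> is a placeholder
SimpleHTMLFormatter fors = new SimpleHTMLFormatter("<b>", "</b>");
Highlighter highlighter = new Highlighter(fors, score); // the highlighter
// highlighter.setMaxDocCharsToAnalyze(1); // limit the number of characters analyzed for highlighting
for (ScoreDoc sd : top.scoreDocs) {
    Document doc = searcher.doc(sd.doc);
    String name = doc.get(filed);
    // get the TokenStream for this document's field
    TokenStream token = TokenSources.getAnyTokenStream(searcher.getIndexReader(), sd.doc, filed, new IKAnalyzer(true));
    Fragmenter fragment = new SimpleSpanFragmenter(score);
    highlighter.setTextFragmenter(fragment);
    String str = highlighter.getBestFragment(token, name); // get the best highlighted fragment
    System.out.println("highlighted clips => " + str);
}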

The output is as follows:

highlighted clips => China is a great country, we Chinese are all good, China will always be a strong
highlighted clips => our land is the hope of the people
highlighted clips => We have a home whose name is China.

2. Fast highlighting with FastVectorHighlighter. This class may use more storage space in exchange for better performance. Besides the performance gain, it has some useful features: it can highlight keywords with multiple colour tags, it supports n-gram fields, and it merges adjacent highlighted phrases.

Here are the three test records the author used for fast highlighting:

id: 2
name: China (China), located in East Asia, is a unified multi-ethnic country with Chinese civilization as the main body, Chinese culture as the basis and the Han nationality as the main race. All the ethnic groups in China are collectively referred to as the Chinese nation, and the dragon is the symbol of the Chinese nation.
content: China is one of the four ancient civilizations in the world, with a long history. About 5000 years ago, settlement organizations began to appear with the Central Plains as the center, and then became countries and dynasties, and then experienced many changes and dynasties. The dynasties that lasted for a long time were Xia, Shang, Zhou, Han, Jin, Tang, Song, Yuan, Ming and Qing

id: 1
name: China has been a very great nation since ancient times
content: China is a populous country in the world, with a population of more than 1.3 billion.

id: 3
name: rootless weeds, erratic fate
content: who is like you as my treasure, do everything, old love count a piece of cloth Write a full stop at this moment, just want to grow old with you.

The core code is as follows:

import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.vectorhighlight.*;

// query, searcher and filed are the QueryParser, IndexSearcher and field name from the first example
Query q = query.parse("the Great Chinese Nation");
TopDocs top = searcher.search(q, 100);
// QueryScorer score = new QueryScorer(q, filed);
// SimpleHTMLFormatter fors = new SimpleHTMLFormatter(","); // custom highlight tags
// Highlighter highlighter = new Highlighter(fors, score); // the highlighter
// FastVectorHighlighter fastHighlighter = new FastVectorHighlighter();
FragListBuilder fragListBuilder = new SimpleFragListBuilder();
// Note: the constructor below takes tag arrays, which is what enables multi-colour highlighting
FragmentsBuilder fragmentsBuilder = new ScoreOrderFragmentsBuilder(
        BaseFragmentsBuilder.COLORED_PRE_TAGS, BaseFragmentsBuilder.COLORED_POST_TAGS);
FastVectorHighlighter fastHighlighter2 = new FastVectorHighlighter(true, true, fragListBuilder, fragmentsBuilder);
FieldQuery querys = fastHighlighter2.getFieldQuery(q);
// reader is the IndexReader passed in
// highlighter.setMaxDocCharsToAnalyze(1); // limit the number of characters analyzed for highlighting
for (ScoreDoc sd : top.scoreDocs) {
    String snippt = fastHighlighter2.getBestFragment(querys, reader, sd.doc, filed, 300);
    if (snippt != null) {
        System.out.println("highlighted clips are: " + snippt);
    }
}

The results are as follows, note that there are multiple color identifiers:

highlighted clips are: China has been a very great nation since ancient times.
highlighted clips are: China (China), located in East Asia, is a unified multi-ethnic country with Chinese civilization as the main body, Chinese culture as the basis, and the Han nationality as the main race. All the ethnic groups in China are collectively referred to as the Chinese nation, and the dragon is the symbol of the Chinese nation.
highlighted clips are: rootless weeds, erratic fate
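One way to picture multi-colour tagging is as cycling through an array of opening and closing tags, one pair per distinct query term. The sketch below illustrates that idea with the Java standard library only; it is not how FastVectorHighlighter is implemented internally. The class `ColorTagSketch`, its `highlight` method, and the two tag arrays are hypothetical and exist only for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColorTagSketch {
    // Hypothetical tag pairs; term i uses pair (i % PRE.length).
    static final String[] PRE = {"<b class=\"c0\">", "<b class=\"c1\">"};
    static final String[] POST = {"</b>", "</b>"};

    /** Returns the index of the term that equals s ignoring case, or 0 if none does. */
    static int indexOfIgnoreCase(List<String> terms, String s) {
        for (int i = 0; i < terms.size(); i++) {
            if (terms.get(i).equalsIgnoreCase(s)) return i;
        }
        return 0;
    }

    /** Wraps each occurrence of each term in the tag pair assigned to that term. */
    static String highlight(String text, List<String> terms) {
        StringBuilder alt = new StringBuilder();
        for (String t : terms) {
            if (t.isEmpty()) continue; // an empty term would match everywhere
            if (alt.length() > 0) alt.append('|');
            alt.append(Pattern.quote(t));
        }
        if (alt.length() == 0) return text;
        Matcher m = Pattern.compile(alt.toString(), Pattern.CASE_INSENSITIVE).matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            int idx = indexOfIgnoreCase(terms, m.group());
            out.append(text, last, m.start())
               .append(PRE[idx % PRE.length]).append(m.group()).append(POST[idx % POST.length]);
            last = m.end();
        }
        return out.append(text.substring(last)).toString();
    }

    public static void main(String[] args) {
        // prints: <b class="c0">Great</b> <b class="c1">China</b>
        System.out.println(highlight("Great China", Arrays.asList("great", "china")));
    }
}
```

Giving each term its own tag pair lets a reader see at a glance which query word produced which hit.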

3. The third approach is front-end highlighting. As noted above, a field must be stored in the index to be highlighted, but that applies only to server-side (back-end) highlighting. For large text, storing it in the index wastes space and may slow retrieval, which motivates the third approach.

With front-end highlighting, large text fields can live in an external data source. When marking is needed, read the text by ID or another key field, then use a JavaScript regular expression in the browser to replace the search keywords. Before that, send the search keywords to the server via Ajax for word segmentation and return the segmented terms to the front end. Matching and replacing those terms and wrapping them in a colour tag produces the highlight; that is the whole principle of front-end highlighting. In some business scenarios this shifts work from the server to the client and removes the need to store term vectors, which also helps overall system performance.
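The flow just described can be sketched end to end. The example below is written in Java to match the rest of the article, although the replacement step would run as JavaScript in the browser. It is a rough sketch under stated assumptions: the `Map` stands in for the external data source, the whitespace split in `segment` stands in for the server-side word segmentation, and `FrontEndHighlightSketch`, `put`, `segment`, `render`, and the `hl` CSS class are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FrontEndHighlightSketch {
    // Stand-in for an external data source (database, NoSQL store) keyed by document ID.
    private final Map<String, String> externalStore = new HashMap<>();

    void put(String id, String text) {
        externalStore.put(id, text);
    }

    /** Stand-in for the server-side segmentation call: splits the query on whitespace. */
    static List<String> segment(String query) {
        List<String> terms = new ArrayList<>();
        for (String t : query.trim().split("\\s+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        // Longest first, so a longer term wins over a shorter term that is its prefix.
        terms.sort((a, b) -> Integer.compare(b.length(), a.length()));
        return terms;
    }

    /** Reads the text by ID and wraps every segmented term in a highlight tag. */
    String render(String id, String query) {
        String text = externalStore.get(id);
        List<String> terms = segment(query);
        if (text == null || terms.isEmpty()) return text;
        StringBuilder alt = new StringBuilder();
        for (String t : terms) {
            if (alt.length() > 0) alt.append('|');
            alt.append(Pattern.quote(t));
        }
        Matcher m = Pattern.compile(alt.toString(), Pattern.CASE_INSENSITIVE).matcher(text);
        // $0 is the matched text, so the original casing of the document is preserved.
        return m.replaceAll("<span class=\"hl\">$0</span>");
    }

    public static void main(String[] args) {
        FrontEndHighlightSketch s = new FrontEndHighlightSketch();
        s.put("1", "China is a great country");
        System.out.println(s.render("1", "great china"));
    }
}
```

Because the index only has to return the document ID, the large text never needs to be stored in Lucene or carry term vectors in this arrangement.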

[Screenshot: front-end highlighting, run against the index built for fast highlighting]

The core front-end highlighting code is attached below:

$.ajax({
    type: "post",
    url: "getContent",
    data: "str=" + str,
    dataType: "json",
    async: false,
    success: function (msg) {
        // alert(msg);
        $("#div").empty();
        $.each(msg, function (i, n) {
            var temp = "";
            for (var item0)
