How to Imitate Baidu Search with Lucene.Net Full-Text Retrieval

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article shows how to use Lucene.Net full-text search to build a Baidu-style search feature. It walks through word segmentation, the core classes, index creation, and search suggestions.

Lucene.Net

Lucene.Net is the .NET port of Lucene, an open source full-text search engine toolkit. It is not a complete full-text search engine but a search engine architecture delivered as a library. You can think of it as an easy-to-use API that wraps indexing and searching, providing a complete query engine and index engine. Developers can implement full-text retrieval on top of Lucene.Net.

Note: Lucene.Net can only retrieve text. Anything that is not text must first be converted to text. For example, to search an Excel file, read the file into a string with NPOI and then pass that string to Lucene.Net. Lucene.Net saves the text it is given in its index, which is what speeds up retrieval.

For more conceptual knowledge, please refer to this blog post: http://blog.csdn.net/xiucool/archive/2008/11/28/3397182.aspx

The finished demo looks like this:

[Screenshot of the demo]

The rest of this article explains step by step how this effect is achieved.

The Core of Lucene.Net: Word Segmentation (Analyzer)

Word segmentation is the core of learning Lucene.Net. Ideally you would write your own segmenter, but that demands strong algorithm skills. In Lucene.Net each segmentation algorithm is a separate class. All of these classes inherit from the Analyzer class, and each algorithm has its own advantages and disadvantages.

The built-in StandardAnalyzer splits English text into words at spaces, punctuation marks, and so on. It splits Chinese text character by character, so every Chinese character is treated as a word.

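Analyzer analyzer = new StandardAnalyzer();
// TokenStream takes a field name and a TextReader.
// In this copy the Chinese part of the sample text appears translated.
TokenStream tokenStream = analyzer.TokenStream("", new StringReader("Hello Lucene.Net, I love you China"));
Lucene.Net.Analysis.Token token = null;
while ((token = tokenStream.Next()) != null)
{
    Console.WriteLine(token.TermText());
}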

In the result after word segmentation, each English word is one token and each Chinese character is a token of its own.

The bigram (two-character) segmentation algorithm treats every two adjacent Chinese characters as one word. The five-character phrase 我爱你中国 ("I love you, China") is split into the overlapping pairs 我爱, 爱你, 你中 and 中国. CJKAnalyzer implements this algorithm.

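Analyzer analyzer = new CJKAnalyzer();
// In this copy the Chinese sample text appears translated.
TokenStream tokenStream = analyzer.TokenStream("", new StringReader("I love you, China Republic of China"));
Lucene.Net.Analysis.Token token = null;
while ((token = tokenStream.Next()) != null)
{
    // The separator string was lost in this copy; an HTML line break is assumed.
    Response.Write(token.TermText() + "<br/>");
}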
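To make the difference between the two strategies concrete, here is a minimal sketch in plain C# that does not use Lucene.Net at all. The class and method names (SegmentationSketch, SingleCharacters, Bigrams) are hypothetical, and the sketch assumes the input consists only of Chinese characters. The real analyzers also handle Latin text, punctuation and lowercasing, which is ignored here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of Lucene.Net.
public static class SegmentationSketch
{
    // One token per character: the way Chinese text is described as
    // being split by StandardAnalyzer.
    public static List<string> SingleCharacters(string text)
    {
        var tokens = new List<string>();
        foreach (char c in text)
        {
            tokens.Add(c.ToString());
        }
        return tokens;
    }

    // Overlapping pairs of adjacent characters: the bigram idea
    // behind CJKAnalyzer.
    public static List<string> Bigrams(string text)
    {
        var tokens = new List<string>();
        if (text.Length == 1)
        {
            tokens.Add(text); // a lone character has no partner, keep it as is
        }
        for (int i = 0; i + 1 < text.Length; i++)
        {
            tokens.Add(text.Substring(i, 2));
        }
        return tokens;
    }

    public static void Main()
    {
        string text = "我爱你中国"; // "I love you, China"
        Console.WriteLine(string.Join(" ", SingleCharacters(text)));
        Console.WriteLine(string.Join(" ", Bigrams(text)));
    }
}
```

For the five-character sample, SingleCharacters returns five tokens and Bigrams returns four overlapping pairs. Neither list contains a "word" chosen for its meaning, which is the weakness the next paragraph complains about.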

At this point you are probably thinking that neither option is good: character-by-character segmentation is crude, bigram segmentation is a scattershot approach, and writing your own Analyzer is out of reach unless you are an algorithm specialist. What to do?

Enter the secret weapon: the PanGu word segmenter, which provides the PanGuAnalyzer class used in the index code later in this article.

A Brief Introduction to the Core Classes of Lucene.Net (1)

Directory represents the place where the index files are stored (the place where Lucene.Net keeps the data the user hands to it). It is an abstract class with two subclasses: FSDirectory (stored in files) and RAMDirectory (stored in memory).

IndexReader is the class that reads the index, and IndexWriter is the class that writes to it.

IndexReader's static method bool IndexExists(Directory directory) determines whether the given directory is an index directory. IndexWriter's bool IsLocked(Directory directory) determines whether the directory is locked; check this before writing to the directory. Two IndexWriter instances cannot write to the same index at the same time. An IndexWriter locks the directory automatically when it writes and unlocks it automatically on Close. IndexWriter.Unlock unlocks the directory manually, for example when the program crashed before it could close the IndexWriter and the directory would otherwise stay locked.

Creating the index library:

Constructor: IndexWriter(Directory dir, Analyzer a, bool create, MaxFieldLength mfl). An Analyzer is required because when IndexWriter writes content to the index, Lucene.Net segments the text with the specified analyzer and then puts the resulting words into the index files, so that they can be looked up quickly when searching.
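Why segmenting at write time makes searching fast can be illustrated with a toy inverted index. This is a conceptual sketch only, not Lucene.Net's real index format. TinyInvertedIndex and Tokenize are hypothetical names, and splitting on spaces and punctuation stands in for a real Analyzer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of Lucene.Net.
public class TinyInvertedIndex
{
    // term -> ids of the documents that contain the term
    private readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>(StringComparer.OrdinalIgnoreCase);

    // Stand-in for an Analyzer: split on spaces and common punctuation.
    private static string[] Tokenize(string text)
    {
        return text.Split(new[] { ' ', ',', '.', '!', '?' },
                          StringSplitOptions.RemoveEmptyEntries);
    }

    // Segment the text once, at write time, and record each term.
    public void AddDocument(int docId, string text)
    {
        foreach (string term in Tokenize(text))
        {
            SortedSet<int> ids;
            if (!postings.TryGetValue(term, out ids))
            {
                ids = new SortedSet<int>();
                postings[term] = ids;
            }
            ids.Add(docId);
        }
    }

    // Searching is a dictionary lookup; no document text is scanned.
    public IEnumerable<int> Search(string term)
    {
        SortedSet<int> ids;
        return postings.TryGetValue(term, out ids) ? ids : new SortedSet<int>();
    }
}

public static class TinyInvertedIndexDemo
{
    public static void Main()
    {
        var index = new TinyInvertedIndex();
        index.AddDocument(1, "Beijing welcomes all of you");
        index.AddDocument(2, "Welcome to Beijing");
        Console.WriteLine(string.Join(",", index.Search("beijing")));
    }
}
```

The sketch also shows why the choice of analyzer matters: a term can only be found if the segmenter produced exactly that term when the document was added.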

void AddDocument(Document doc) adds a document to the index (an Insert). The Document class represents the document (article) to be indexed. Its most important method, Add(Field field), adds a field to the document. In database terms, a Document is equivalent to a record and a Field is equivalent to a column of that record.

The Field class constructor is Field(string name, string value, Field.Store store, Field.Index index, Field.TermVector termVector):

name is the field name.

value is the field value.

store indicates whether to store the original value. Field.Store.YES stores it, Field.Store.NO does not, and Field.Store.COMPRESS stores it compressed. By default only the words produced by segmentation are saved, not the content before segmentation, and a search cannot rebuild the original text from the segmented words. If you want to display the original text (such as the body of an article), the field must be stored.

index indicates how the field is indexed. Field.Index.NOT_ANALYZED indexes the value as a whole without segmenting it, and Field.Index.ANALYZED segments the value first and then indexes the words. In other words, it decides whether the value is broken into words, which is what a field needs if it should take part in full-text search.

termVector indicates how the distance between indexed words is preserved. For "Beijing welcomes all of you", the index records how many words lie between "Beijing" and "everyone". This makes it possible to retrieve only matches whose words lie within a certain distance of each other.
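As a hedged illustration of these parameters, the sketch below adds one document with two fields to an in-memory index. It assumes Lucene.Net 2.9.x, the API generation the rest of this article uses (later versions differ, for example in how a writer is closed). The field names "number" and "title", their values, and the use of StandardAnalyzer instead of PanGuAnalyzer are assumptions made only for this example.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Sketch under the assumption of Lucene.Net 2.9.x; field names are hypothetical.
public static class FieldSketch
{
    public static void Main()
    {
        // RAMDirectory keeps the index in memory, so nothing is written to disk.
        Directory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
            directory,
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
            true, // create a new index
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();

        // An identifier: stored so it can be read back, indexed as one
        // unsegmented term so it can be matched exactly.
        doc.Add(new Field("number", "42",
            Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Searchable text: stored for display, segmented by the analyzer,
        // with positions and offsets kept in the term vector.
        doc.Add(new Field("title", "Beijing welcomes all of you",
            Field.Store.YES, Field.Index.ANALYZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));

        writer.AddDocument(doc);
        writer.Close(); // releases the write lock
    }
}
```

A common pattern is NOT_ANALYZED for keys that are matched exactly and ANALYZED for text that users search with free-form queries.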

private void CreateIndex()
{
    // The index library is stored in this folder
    string indexPath = ConfigurationManager.AppSettings["pathIndex"];
    // Directory represents the place where the index files are stored. It is an abstract class
    // with two subclasses: FSDirectory stores in files, RAMDirectory stores in memory
    FSDirectory directory = FSDirectory.Open(new DirectoryInfo(indexPath), new NativeFSLockFactory());
    // Determine whether the directory is already an index directory
    bool isUpdate = IndexReader.IndexExists(directory);
    logger.Debug("Index library existence status: " + isUpdate);
    if (isUpdate)
    {
        if (IndexWriter.IsLocked(directory))
        {
            IndexWriter.Unlock(directory);
        }
    }
    // The third parameter, bool create, says whether to create the index. If it is true, the
    // newly created index overwrites the existing index files; otherwise the existing index is updated.
    IndexWriter write = new IndexWriter(directory, new PanGuAnalyzer(), !isUpdate, IndexWriter.MaxFieldLength.UNLIMITED);
    WebClient wc = new WebClient();
    // Set the encoding to prevent garbled text
    wc.Encoding = Encoding.UTF8;
    int maxID;
    try
    {
        // Read the RSS feed; the numeric part of the link in the first item is the largest post number
        maxID = GetMaxID();
    }
    catch (WebException webEx)
    {
        logger.Error("error getting maximum post number", webEx);
        return;
    }
    for (int i = 1; i ...

[... the rest of CreateIndex and the beginning of the next method, which builds a list of keyword search counts, are missing from this copy ...]

    ... 0)
    {
        foreach (DataRow row in dt.Rows)
        {
            Model.SearchSum oneModel = new Model.SearchSum();
            oneModel.Keyword = Convert.ToString(row["keyword"]);
            oneModel.SearchCount = Convert.ToInt32(row["SearchCount"]);
            list.Add(oneModel);
        }
    }
    return list;
}

Search suggestions, similar to the drop-down prompt box that Baidu shows while you type, are simulated with jQuery UI. The following method fetches the keywords with the highest search counts, sorted by count, and returns them as an IEnumerable collection.
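The ranking idea behind such suggestions (count how often each keyword was searched and keep the five most frequent) can be sketched in memory with LINQ, independent of any database. SearchLogEntry, Suggestion and SuggestionSketch are hypothetical names. The seven-day window and the substring match on the typed text are assumptions, because the filter conditions of the article's own query are not fully preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only.
public class SearchLogEntry
{
    public string Keyword { get; set; }
    public DateTime SearchDateTime { get; set; }
}

public class Suggestion
{
    public string Keyword { get; set; }
    public int SearchCount { get; set; }
}

public static class SuggestionSketch
{
    // Returns the most searched keywords that contain the typed text.
    public static List<Suggestion> GetSuggestion(
        IEnumerable<SearchLogEntry> log, string kw, DateTime now,
        int days = 7, int top = 5)
    {
        return log
            // assumed recency filter: only searches from the last `days` days
            .Where(e => (now - e.SearchDateTime).TotalDays < days)
            // assumed match rule: the keyword contains what the user typed
            .Where(e => e.Keyword.IndexOf(kw, StringComparison.OrdinalIgnoreCase) >= 0)
            .GroupBy(e => e.Keyword)
            .Select(g => new Suggestion { Keyword = g.Key, SearchCount = g.Count() })
            .OrderByDescending(s => s.SearchCount)
            .ThenBy(s => s.Keyword, StringComparer.Ordinal) // stable order for ties
            .Take(top)
            .ToList();
    }
}
```

In a real page the grouping and counting would normally be left to the database, as the SQL-based method does; the in-memory version only makes the logic easy to read and test.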

public IEnumerable GetSuggestion(string kw)
{
    DataTable dt = SqlHelper.ExecuteDataTable(@"select top 5 Keyword, count(*) as searchcount from keywords where datediff(day, searchdatetime, getdate()) ...

[... the rest of the query and of the method is missing from this copy ...]

The above is how to imitate Baidu search with Lucene.Net full-text retrieval.

© 2024 shulou.com SLNews company. All rights reserved.