
How to use and optimize lucene and lucene.NET


This article talks about how to use and optimize lucene and lucene.NET. Many people may not know much about it, so I have summarized the following in the hope that you will get something out of it.

1 introduction to lucene

1.1 what is lucene

Lucene is a full-text search framework, not a finished application. It is not something you can use out of the box the way you use www.baidu.com or Google Desktop; it only provides the tools for you to build such products yourself.

1.2 what can lucene do?

To answer this question, you first have to understand what lucene really is. Its function is actually very simple: you give it a number of strings, and it provides a full-text search service that tells you where the keywords you search for appear. Once you know this, you can use your imagination for anything that fits the pattern: you can index all the news on a site and build a search page; you can index a few fields of a database table and no longer worry about locking the table with "like '%...%'" queries; you can even write your own search engine.

1.3 should you choose lucene?

Here are some test figures; if you find them acceptable, you can choose lucene.

Test 1: 2.5 million records, about 300 MB of text, producing an index of about 380 MB; with 800 threads the average processing time was 300 ms.

Test 2: 37,000 records, with two varchar fields of a database table indexed, producing an index file of 2.6 MB; with 800 threads the average processing time was 1.5 ms.

2 the way lucene works

The service lucene provides actually consists of two parts: one in and one out. The "in" part is writing: the sources you provide (essentially strings) are written to the index or deleted from it. The "out" part is reading: users are given a full-text search service so they can locate sources by keyword.

2.1 writing process

The source string is first processed by the analyzer, which splits it into words and (optionally) removes stopwords.

The information needed from the source is added to the Fields of a Document; Fields that need to be indexed are indexed, and Fields that need to be stored are stored.

The index is written to the index store (a Directory), which can be in memory or on disk.

2.2 reading process

Users provide search keywords, which are processed by analyzer.

The processed keywords are looked up in the index to find the matching Documents.

The user extracts the desired Field from the found Document as needed.

3 some concepts that need to be known

Lucene uses a number of concepts; understanding what they mean helps with the explanations that follow.

3.1 analyzer

An analyzer's job is to split a string into words according to certain rules and to remove the invalid words. Invalid words here means words like "of" and "the" in English, or particles such as "的" and "地" in Chinese: they appear in large numbers in text but carry no key information, so removing them shrinks the index file, improves efficiency, and improves the hit rate.

The rules for word segmentation vary endlessly, but they all have one purpose: to split text according to its meaning. This is easy in English, because English is word-based and words are already separated by spaces, whereas a Chinese sentence has to be split into words somehow. The specific methods are described in detail below; here you only need to understand the concept of the analyzer.
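
To make the concept concrete, here is a minimal sketch of running an analyzer by hand, using the older lucene API (TokenStream.next() returning Token) that the rest of this article's examples assume; the sample sentence is only an illustration:

Analyzer analyzer = new StandardAnalyzer();
TokenStream ts = analyzer.tokenStream("content", new StringReader("Lucene is a full-text search framework"));
Token token;
while ((token = ts.next()) != null)
{
    // prints each term the analyzer produced; stopwords such as "is" and "a" have been removed
    System.out.println(token.termText());
}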

3.2 document

The sources provided by the user are records: a text file, a string, a row of a database table, and so on. Once a record is indexed, it is stored in the index file in the form of a Document, and search results are returned to the user as a list of Documents.

3.3 field

A Document can contain multiple information fields, for example, an article can contain "title", "body", "last modified time" and other information fields, which are stored in the Document through Field.

A Field has two attributes you can choose: stored and indexed. The stored attribute controls whether the Field's value is stored; the indexed attribute controls whether it is indexed. This may sound like stating the obvious, but combining the two attributes correctly matters. Here is an example:

Take the article above. We want full-text search over the title and the body, so both of those fields need their indexed attribute set to true. We also want to show the title directly in search results, so the title field's stored attribute is set to true. The body field is large, so to keep the index file small its stored attribute is set to false, and we read the original file when its text is needed. We want the last modified time to appear in results but never to be searched, so its stored attribute is set to true and its indexed attribute to false. These three fields cover three of the four combinations of the two attributes; the remaining combination, both false, is not used, and Field does not even allow it, because a field that is neither stored nor indexed is meaningless.
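
As a small sketch of the three combinations just described (titleText, bodyText and lastModifiedText stand in for strings you have already read from somewhere):

Document doc = new Document();
// title: indexed for full-text search and stored so it can be shown in results
doc.add(new Field("title", titleText, Field.Store.YES, Field.Index.TOKENIZED));
// body: indexed but not stored, to keep the index file small; read the original file when needed
doc.add(new Field("content", bodyText, Field.Store.NO, Field.Index.TOKENIZED));
// last modified time: stored for display, but not indexed
doc.add(new Field("lastModified", lastModifiedText, Field.Store.YES, Field.Index.NO));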

3.4 term

Term is the smallest unit of search. It represents a word in a document and consists of two parts: the text of the word and the field in which it appears.

3.5 token

A token is one occurrence of a term. It contains the term's text, the start and end offsets of that occurrence, and a type string. The same word can appear many times in a sentence; every occurrence uses the same term, but each is a different token marking where the word appeared.

3.6 segment

When documents are added to an index, they are not all written straight into one index file. They are first written to several small files, which are later merged into one large index file; each of these small files is a segment.

4 the structure of lucene

Lucene consists of two parts: core and sandbox. Core is the stable heart of lucene, while sandbox contains additional functionality such as the highlighter and various extra analyzers.

The lucene core has seven packages: analysis, document, index, queryParser, search, store and util.

4.1 analysis

Analysis contains the built-in analyzers, such as WhitespaceAnalyzer, which splits on whitespace, StopAnalyzer, which adds stopword filtering, and the most commonly used StandardAnalyzer.

4.2 document

Document contains the data structures of a document; the Document class defines how a document is stored, and the Field class defines one field of a Document.

4.3 index

Index contains the classes that read and write the index, such as IndexWriter, which writes, merges and optimizes the segments of the index file, and IndexReader, which reads and deletes from the index. Do not be misled by the name IndexReader into thinking it only reads the index file: deleting from the index is also done by IndexReader. IndexWriter only cares about writing the index into segments and merging and optimizing them; IndexReader is concerned with how individual documents are organized in the index file.

4.4 queryParser

QueryParser contains the classes that parse query statements. Lucene query statements are somewhat like SQL statements: there are various reserved words, and following a certain syntax they can express various kinds of queries. Lucene has many Query classes, each inheriting from Query and executing one particular kind of query; QueryParser's job is to parse a query statement and call the appropriate Query classes to produce the result.

4.5 search

Search contains the classes that search the index for results, such as the Query classes just mentioned; TermQuery, BooleanQuery and the rest live in this package.

4.6 store

Store contains the storage classes for the index. Directory defines the storage structure of an index; FSDirectory is an index stored in files, RAMDirectory is an index held in memory, and MMapDirectory is an index accessed through memory mapping.

4.7 util

Util contains some common utility classes, such as conversion tools between time and strings.

5 how to build an index

5.1 the simplest code that can build an index

IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();

Let's analyze this code.

First we create a writer, specifying "/data/index" as the directory where the index will be stored and StandardAnalyzer as the analyzer; the third parameter says that if index files already exist in the index directory, they will be overwritten.

Then we create a new document.

We add a field to document with the name "title" and the content "lucene introduction", which is stored and indexed.

Add another field with the name "content" and the content "lucene works well", which is also stored and indexed.

Then we add this document to the index, and if there are multiple documents, we can repeat the above operation, create the document and add it.

After all the documents have been added, we optimize the index, which mainly merges multiple segments into one; this helps query speed.

Finally, it is important to close the writer.

Yes, it's as simple as creating an index!

Of course, you may modify the above code to get a more personalized service.

5.2 write the index directly in memory

You need to first create a RAMDirectory and pass it to writer as follows:

Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();

5.3 indexing text files

If you want to index a plain text file without first reading it into a string, you can create the field with the following code:

Field field = new Field("content", new FileReader(file));

Here file is the text file. This constructor actually reads the file's contents and indexes them, but does not store them.
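
For example, a hedged sketch of indexing every file in a directory this way; the "/data/txt" path and the "path" field name are made up for illustration:

File dir = new File("/data/txt");
IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
File[] files = dir.listFiles();
for (int i = 0; i < files.length; i++)
{
    Document doc = new Document();
    // store the path so a search result can point back to the original file
    doc.add(new Field("path", files[i].getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    // index the content from a Reader; it is indexed but not stored
    doc.add(new Field("content", new FileReader(files[i])));
    writer.addDocument(doc);
}
writer.optimize();
writer.close();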

6 how to maintain the index

Index maintenance operations are provided by the IndexReader class.

6.1 how to delete an index

Lucene provides two ways to delete documents from the index. One is:

void deleteDocument(int docNum)

This method deletes by the document's number within the index. Every document added to the index gets a unique number, so deleting by number is precise, but the number is an internal detail of the index and we generally do not know what number a given document ended up with, so this method is of limited use. The other is:

void deleteDocuments(Term term)

This method actually performs a search with the given term and then deletes all the matching documents in one batch. By supplying a sufficiently strict term we can delete exactly the documents we intend.

Here is an example:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexReader reader = IndexReader.open(dir);
Term term = new Term(field, key);
reader.deleteDocuments(term);
reader.close();

6.2 how to update the index

Lucene does not provide a dedicated method to update the index; we have to delete the old document and then add the new one. For example:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexReader reader = IndexReader.open(dir);
Term term = new Term("title", "lucene introduction");
reader.deleteDocuments(term);
reader.close();

IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("content", "lucene is funny", Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
writer.optimize();
writer.close();

7 how to search

Lucene's search is very powerful. It provides many helper query classes, each inheriting from Query and each performing one special kind of query; you can combine them like building blocks to carry out complex searches. In addition, lucene provides the Sort class to sort results and the Filter class to restrict the query. You may find yourself comparing it with SQL: "can lucene do and, or, order by, where, like '%xx%'?" The answer is: "of course!"

7.1 various Query

Let's take a look at what query operations lucene allows us to do:

7.1.1 TermQuery

Let's start with the most basic query. If you want to find documents whose content field contains "lucene", you can use TermQuery:

Term t = new Term("content", "lucene");
Query query = new TermQuery(t);

7.1.2 BooleanQuery

If you want to find documents whose content field contains "java" or "perl", you can create two TermQuerys and combine them with a BooleanQuery:

TermQuery termQuery1 = new TermQuery(new Term("content", "java"));
TermQuery termQuery2 = new TermQuery(new Term("content", "perl"));
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(termQuery1, BooleanClause.Occur.SHOULD);
booleanQuery.add(termQuery2, BooleanClause.Occur.SHOULD);

7.1.3 WildcardQuery

If you want to query a word containing wildcards, you can use WildcardQuery: '?' matches one arbitrary character and '*' matches zero or more arbitrary characters. For example, searching for 'use*' may find 'useful' or 'useless':

Query query = new WildcardQuery(new Term("content", "use*"));

7.1.4 PhraseQuery

Suppose you are interested in Sino-Japanese relations and want to find articles in which the terms '中' (China) and '日' (Japan) appear close together (within 5 words of each other), ignoring articles where they are further apart. You can:

PhraseQuery query = new PhraseQuery();
query.setSlop(5);
query.add(new Term("content", "中"));
query.add(new Term("content", "日"));

It will then match sentences such as "中日合作……" ("Sino-Japanese cooperation ..."), where the two terms are close together, but not a sentence in which "中" and "日" are separated by more than five words.

7.1.5 PrefixQuery

If you want to search for words beginning with '中', you can use PrefixQuery:

PrefixQuery query = new PrefixQuery(new Term("content", "中"));

7.1.6 FuzzyQuery

FuzzyQuery searches for similar terms using the Levenshtein algorithm. Suppose you want to search for words similar to 'wuzza'; you can:

Query query = new FuzzyQuery(new Term("content", "wuzza"));

You may get 'fuzzy' and 'wuzzy'.

7.1.7 RangeQuery

Another commonly used Query is RangeQuery. Suppose you want to find documents whose time lies between 20060101 and 20060130; you can use RangeQuery:

RangeQuery query = new RangeQuery(new Term("time", "20060101"), new Term("time", "20060130"), true);

The final true means the range is a closed interval.

7.2 QueryParser

Having seen so many Querys, you may ask, "Surely I am not expected to combine all these Querys by hand, that would be far too much trouble!" Of course not: lucene provides a query language similar to an SQL statement; let's call it the lucene statement. With it you can express all kinds of queries in one string, and lucene automatically splits it into pieces and hands each piece to the corresponding Query class. Here is the syntax for each kind of Query:

TermQuery is written "field:key", for example "content:lucene".

In BooleanQuery, 'and' is written '+' and 'or' is written as a space, for example "content:java content:perl".

WildcardQuery still uses '?' and '*', for example "content:use*".

PhraseQuery uses '~', for example content:"中日"~5.

PrefixQuery uses '*', for example "中*".

FuzzyQuery uses '~', for example "content:wuzza~".

RangeQuery uses '[]' or '{}', the former for closed intervals and the latter for open intervals, for example "time:[20060101 TO 20060130]". Note that TO is case-sensitive.

You can combine query strings arbitrarily to perform complex operations. For example, "articles whose title or body contains lucene and whose time is between 20060101 and 20060130" can be expressed as "+(title:lucene content:lucene) +time:[20060101 TO 20060130]". The code is as follows:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("+(title:lucene content:lucene) +time:[20060101 TO 20060130]");
Hits hits = is.search(query);
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
is.close();

First let's create an IndexSearcher on the specified file directory.

Then we create a QueryParser that uses StandardAnalyzer as its analyzer, with content as its default search field.

Then we use QueryParser to parse the query string to generate a Query.

Then use this Query to find the results, and the results are returned in the form of Hits.

This Hits object contains a list, and we display its contents one by one.

7.3 Filter

The job of a filter is to restrict the query to a subset of the index. It is a bit like the where clause in an SQL statement, but different: it is not part of the regular query; it pre-processes the data source and then hands the result to the query. Note that it pre-processes the whole index rather than filtering the query results, so using a filter is expensive and can make a query take a hundred times longer.

The most commonly used filters are RangeFilter and QueryFilter. RangeFilter restricts the search to index entries within a specified range; QueryFilter restricts it to the results of a previous query.

Using a Filter is very simple: you just create a filter instance and pass it to the searcher. Continuing the example above, the query "articles with time between 20060101 and 20060130" can also be written with a RangeFilter instead of expressing the limit in the query string:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("title:lucene content:lucene");
RangeFilter filter = new RangeFilter("time", "20060101", "20060130", true, true);
Hits hits = is.search(query, filter);
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
is.close();

7.4 Sort

Sometimes you want an ordered result set, like the "order by" of the SQL statement, which lucene can do: through Sort.

Sort sort = new Sort("time"); // equivalent to SQL "order by time"

Sort sort = new Sort("time", true); // equivalent to SQL "order by time desc"

Here is a complete example:

Directory dir = FSDirectory.getDirectory(PATH, false);
IndexSearcher is = new IndexSearcher(dir);
QueryParser parser = new QueryParser("content", new StandardAnalyzer());
Query query = parser.parse("title:lucene content:lucene");
RangeFilter filter = new RangeFilter("time", "20060101", "20060130", true, true);
Sort sort = new Sort("time");
Hits hits = is.search(query, filter, sort);
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
}
is.close();

8 Analyzer

From the concept introduction earlier we already know what the analyzer does: it splits sentences into words according to their meaning. For English there is already a very mature analyzer, StandardAnalyzer, and in many cases it is a good choice. You will even find that StandardAnalyzer can segment Chinese text.

But our focus here is Chinese word segmentation. Can StandardAnalyzer handle Chinese? Practice shows that it can, but the results are poor: searching for "如果" ("if") will also match "牛奶不如果汁好喝" ("milk is not as tasty as juice"), and the index file becomes very large. So what other analyzers do we have at hand? None in core, but two in sandbox: ChineseAnalyzer and CJKAnalyzer. They also segment inaccurately. In comparison, building an index with StandardAnalyzer and with ChineseAnalyzer takes about the same time and produces index files of about the same size, while CJKAnalyzer performs worse: its index file is larger and it takes longer.

To solve the problem, first look at how the three analyzers segment. StandardAnalyzer and ChineseAnalyzer both split a sentence into single characters, so "牛奶不如果汁好喝" becomes "牛 奶 不 如 果 汁 好 喝", while CJKAnalyzer splits it into overlapping two-character pairs: "牛奶 奶不 不如 如果 果汁 汁好 好喝". This is why a search for "如果" matches that sentence.

This kind of segmentation has at least two drawbacks: inaccurate matching and a large index file. Our goal is to split the sentence into "牛奶 不如 果汁 好喝". The key is semantic recognition: how do we know that "牛奶" (milk) is a word while "奶不" is not? The natural idea is dictionary-based segmentation: obtain a dictionary listing most words, split the sentence in some way, and consider a split correct when the resulting pieces match entries in the dictionary. Segmentation then becomes a matching process, and the simplest matching strategies are forward maximum matching and reverse maximum matching, that is, greedily matching the longest dictionary word from the start of the sentence forward, or from the end of the sentence backward. The dictionary matters a great deal: its coverage directly affects search quality. With the same dictionary, reverse maximum matching is said to give better results than forward maximum matching.
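
To illustrate the idea, here is a minimal, self-contained sketch of forward maximum matching against a tiny hand-made dictionary; it is only a toy, not how ICTCLAS or JE-Analysis are implemented:

import java.util.*;

public class ForwardMaxMatch
{
    // split text by always taking the longest dictionary word that starts at the current position
    public static List segment(String text, Set dict, int maxWordLen)
    {
        List words = new ArrayList();
        int pos = 0;
        while (pos < text.length())
        {
            int longest = Math.min(maxWordLen, text.length() - pos);
            String word = text.substring(pos, pos + 1);   // fall back to a single character
            for (int len = longest; len > 1; len--)
            {
                String candidate = text.substring(pos, pos + len);
                if (dict.contains(candidate)) { word = candidate; break; }
            }
            words.add(word);
            pos += word.length();
        }
        return words;
    }

    public static void main(String[] args)
    {
        Set dict = new HashSet(Arrays.asList(new String[] { "牛奶", "不如", "果汁", "好喝" }));
        // prints [牛奶, 不如, 果汁, 好喝]
        System.out.println(segment("牛奶不如果汁好喝", dict, 2));
    }
}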

There are other segmentation methods as well; it is a discipline in its own right, and I have not studied it in depth. Back to the practical question: our goal is to find a mature, ready-made segmentation tool and avoid reinventing the wheel. Searching around, the commonly used options are ICTCLAS from the Chinese Academy of Sciences and JE-Analysis, which is free but not open source. The problem with ICTCLAS is that it is a dynamic link library, so calling it from java requires native method calls, which is inconvenient, raises safety concerns, and its reputation is mixed. JE-Analysis works reasonably well; it still segments inaccurately at times, but overall it is more convenient and dependable.
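
Assuming JE-Analysis exposes its segmenter as the MMAnalyzer class mentioned in the performance section below (the jeasy.analysis package name is how the library was commonly distributed, so treat it as an assumption), switching analyzers is just a matter of passing a different instance to the writer and the parser:

Analyzer analyzer = new jeasy.analysis.MMAnalyzer();
IndexWriter writer = new IndexWriter("/data/index/", analyzer, true);
// ... add documents exactly as in section 5 ...
// use the same analyzer when searching, so queries are segmented the same way as the index
QueryParser parser = new QueryParser("content", analyzer);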

9 performance optimization

Up to this point we have only discussed how to get lucene running and doing its job, and indeed most requirements can be met with what has been described above. However, tests show that lucene's performance is not always good: with a large data volume and high concurrency, a query can even take half a minute to return. Initializing and indexing a large amount of data is also a very time-consuming process. So how do we improve lucene's performance? The following looks at two aspects: optimizing index creation and optimizing search.

9.1 optimize index creation performance

The options here are relatively limited. IndexWriter provides some settings that control the indexing operation; in addition, we can first write the index to a RAMDirectory and then write it to an FSDirectory in batches. Either way, the goal is to minimize file IO, because disk IO is the biggest bottleneck when creating an index. Choosing a better analyzer can also improve performance somewhat.

9.1.1 optimize index creation by setting IndexWriter parameters

setMaxBufferedDocs(int maxBufferedDocs)

Controls how many documents are buffered in memory before a new segment is written. A larger value speeds up indexing. The default is 10.

setMaxMergeDocs(int maxMergeDocs)

Controls the maximum number of documents a segment can hold. A small value helps when appending to an index. The default is Integer.MAX_VALUE and usually does not need to be changed.

setMergeFactor(int mergeFactor)

Controls how often segments are merged. A larger value makes indexing faster. The default is 10; it can be set to 100 while bulk indexing.
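
Putting the three settings together, here is a short sketch of a writer tuned for bulk indexing; the concrete values are only examples:

IndexWriter writer = new IndexWriter("/data/index/", new StandardAnalyzer(), true);
writer.setMaxBufferedDocs(1000);   // buffer more documents in memory before a segment is written
writer.setMergeFactor(100);        // merge segments less often while bulk loading
writer.setMaxMergeDocs(Integer.MAX_VALUE);   // the default upper bound on documents per segment
// ... add documents ...
writer.optimize();
writer.close();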

9.1.2 improve performance through RAMDirectory write buffer

We can write the index to RAMDirectory first, and then write it to FSDirectory in batch when it reaches a certain number, so as to reduce the number of disk IO.

FSDirectory fsDir = FSDirectory.getDirectory("/data/index", true);
RAMDirectory ramDir = new RAMDirectory();
IndexWriter fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), true);
IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
while (/* there are documents to index */)
{
    // ... create Document ...
    ramWriter.addDocument(doc);
    if (/* the condition for flushing memory to disk has been met */)
    {
        fsWriter.addIndexes(new Directory[] { ramDir });
        ramWriter.close();
        ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
    }
}

9.1.3 choose a better analyzer

This optimization is mainly about disk space; it can cut the index file by nearly half, from 600 MB to 380 MB on the same test data. It does not help with time, and can even take longer, because a better analyzer has to match against a dictionary and consumes more cpu: on the test data, StandardAnalyzer took 133 minutes while MMAnalyzer took 150 minutes.

9.2 optimize search performance

Although the indexing operation is very time-consuming, it is only needed when the index is first created; after that there is usually only a small amount of maintenance, and that can be handled by a background process without affecting users' searches. The whole point of the index is to serve searches, so search performance is what we care about most. Let's discuss how to improve it.

9.2.1 put the index in memory

This is the most intuitive idea because memory is much faster than disk. Lucene provides RAMDirectory to hold indexes in memory:

Directory fsDir = FSDirectory.getDirectory("/data/index/", false);
Directory ramDir = new RAMDirectory(fsDir);
Searcher searcher = new IndexSearcher(ramDir);

But practice shows that RAMDirectory and FSDirectory are about equally fast: when the data volume is small both are very fast, and when the data volume is large (an index file of 400 MB) RAMDirectory is even a little slower than FSDirectory, which is quite unexpected.

Lucene searches are also very memory-hungry: even with the 400 MB index file loaded into memory, the process runs out of memory after running for a while, so I don't think loading the index into memory helps much.

9.2.2 optimize time range limit

Since loading the index into memory does not improve efficiency, there must be another bottleneck. Testing shows the biggest bottleneck is the time range restriction, so how do we make it as cheap as possible?

When you need to search for results within a specified time range, you can:

1. Use RangeQuery to set the range. But RangeQuery is actually implemented by expanding every time value inside the range into a BooleanClause added to a BooleanQuery, so the range cannot be too large; in testing, a range of more than about a month throws BooleanQuery.TooManyClauses. The limit can be raised with BooleanQuery.setMaxClauseCount(int maxClauseCount), but only so far, and as maxClauseCount grows, so does memory usage.

2. Use RangeFilter instead of RangeQuery. This is no slower than RangeQuery, but there is still a bottleneck: more than 90% of the query time is spent in RangeFilter. Studying its source shows that RangeFilter first traverses the whole index and builds a BitSet marking each document, true if it falls inside the time range and false if not, and then passes the result to the Searcher; this is what costs so much time.

3. To improve performance further, there are two ideas (a small caching sketch follows the list):

A. Cache the Filter result. The RangeFilter runs before the search, so its input is fixed, namely the IndexReader, and the IndexReader is determined by the Directory; therefore the result of a RangeFilter is determined only by the upper and lower bounds of the range, that is, by the particular RangeFilter object. So we can cache the filter's resulting BitSet using the RangeFilter object as the key. The lucene API already provides a CachingWrapperFilter class that wraps a Filter together with its result, so in practice we can cache CachingWrapperFilter objects. Do not be misled by the name and description of CachingWrapperFilter, though: it does cache, but the cache is keyed by IndexReader for one and the same filter; that is, when you apply the same filter to different IndexReaders, it caches a result per IndexReader. Our need is exactly the opposite: we apply different filters to the same IndexReader, so we can only use it as a wrapper and must do the per-filter caching ourselves.

B. Reduce the time precision. From how the Filter works we can see that it traverses the whole index each time, so the coarser the time granularity, the faster the comparison and the shorter the search. Lower the time precision as far as the application allows; sometimes it is even worth sacrificing a little precision, and of course the best case is having no time restriction at all.
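
Here is a hedged sketch of idea A: keep one CachingWrapperFilter per time range in an application-level map and reuse it for every search against the same IndexSearcher. The HashMap cache and its key format are illustrative, not part of lucene, and is and query are the searcher and query from the earlier examples:

Map filterCache = new HashMap();

String key = "20060101-20060130";
Filter filter = (Filter) filterCache.get(key);
if (filter == null)
{
    // build the filter once and keep it; reusing the same object lets its result be reused
    filter = new CachingWrapperFilter(new RangeFilter("time", "20060101", "20060130", true, true));
    filterCache.put(key, filter);
}
Hits hits = is.search(query, filter);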

Below are the results of optimizing along these two lines (both tests use 800 threads with random keywords and random time ranges):

Group 1, time precision of one second (average time per thread):

RangeFilter used directly: 10s | cached filter: 1s | no filter: 300ms

Group 2, time precision of one day (average time per thread):

RangeFilter used directly: 900ms | cached filter: 360ms | no filter: 300ms

A conclusion can be drawn from the above data:

1. Reduce the time precision as far as possible. The performance gain from changing the precision from seconds to days is even greater than that from using the cache, and best of all is not using a filter at all.

2. When the time precision cannot be reduced, caching the filter improves performance by roughly a factor of 10.

9.2.3 use a better analyzer

This is similar to the index-creation optimization: a smaller index file naturally makes searching faster. The improvement is limited, though; the best analyzer improved performance by less than 20% compared with the worst one.

10 some experience

10.1 keywords are case sensitive

Keywords such as OR, AND and TO are case-sensitive: lucene only recognizes the uppercase forms as keywords, and the lowercase forms are treated as ordinary words.

10.2 read-write mutual exclusion

Only one write operation can act on the index at a time, but searching can continue while writing.

10.3 File Lock

If the process is killed while writing the index, a lock file is left behind in the tmp directory and makes further writes impossible; it can be deleted manually.

10.4 time format

Lucene only supports one time format, yyMMddHHmmss, so if you pass it a yy-MM-dd HH:mm:ss value it will not be treated as a time.
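
A small sketch of writing a date into that form before indexing; SimpleDateFormat is plain JDK, not part of lucene, and doc and lastModifiedDate are assumed to come from your own indexing code:

SimpleDateFormat fmt = new SimpleDateFormat("yyMMddHHmmss");
String time = fmt.format(lastModifiedDate);   // lastModifiedDate is a java.util.Date
doc.add(new Field("time", time, Field.Store.YES, Field.Index.UN_TOKENIZED));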

10.5 set up boost

Sometimes one field should carry more weight in a search than another. For example, you may feel that an article with the keyword in its title is more valuable than one with the keyword only in the body; setting a larger boost on the title field makes search results rank title matches first (assuming no explicit sort overrides the relevance ranking). Usage:

Field.setBoost(float boost); the default value is 1.0, so to increase a field's weight set it to a value greater than 1.
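
For example, a short sketch that gives the title three times the default weight; the value 3.0f is only an illustration:

Field title = new Field("title", "lucene introduction", Field.Store.YES, Field.Index.TOKENIZED);
title.setBoost(3.0f);   // default is 1.0; a larger value raises this field's weight in scoring
doc.add(title);
doc.add(new Field("content", "lucene works well", Field.Store.NO, Field.Index.TOKENIZED));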

After reading the above, you should have a better idea of how to use and optimize lucene and lucene.NET.
