
What is Lucene, the Java-based full-text indexing engine?


This article shows what Lucene, the Java-based full-text indexing engine, looks like. I hope you find it useful; let's go through it together.

Full-text indexing engine Lucene based on Java

Lucene is a full-text indexing toolkit based on Java.

An introduction to Lucene, a Java-based full-text indexing engine: the author and the history of Lucene

The implementation of full-text retrieval: comparison between the Lucene full-text index and a database index

A brief introduction to Chinese word segmentation: comparison of thesaurus-based and automatic segmentation algorithms

Installation and use: system structure and a demonstration

Hacking Lucene: a simplified query analyzer, implementing deletion, customized sorting, extending the application interface

What else can we learn from Lucene?

Lucene: a full-text indexing/retrieval engine based on Java

Lucene is not a complete full-text indexing application, but a full-text indexing engine toolkit written in Java. It can be easily embedded into various applications to achieve full-text indexing / retrieval functions.

Author of Lucene: Lucene's contributor Doug Cutting is a veteran full-text indexing/retrieval expert. He was a principal developer of the V-Twin search engine (one of the achievements of Apple's Copland operating system effort) and later a senior system architect at Excite; he is currently engaged in research on underlying Internet architecture. His goal with Lucene is to add full-text search capability to all kinds of small and medium-sized applications.

The development of Lucene: it was first published on the author's own www.lucene.com, and later moved to SourceForge (sourceforge.net/projects/lucene/). At the end of 2001 it became a sub-project of the Apache Foundation's Jakarta: http://jakarta.apache.org/lucene/.

Many Java projects have used Lucene as their background full-text indexing engine, the more famous ones are:

Jive: a web forum system

EyeBrows: a mailing-list HTML archiving/browsing/querying system. The author of this article's main reference, "The Lucene search engine: Powerful, flexible, and free", is one of the main developers of the EyeBrows system, and EyeBrows has become the main mailing-list archiving system of the Apache project.

Cocoon (xml.apache.org/cocoon/index.html): an XML-based web publishing framework that uses Lucene for its full-text search.

Eclipse (eclipse.org): a Java-based open development platform; the full-text index of its help system uses Lucene.

For Chinese users, the biggest concern is whether Lucene supports Chinese full-text retrieval. As the discussion of Lucene's structure below will show, thanks to Lucene's well-designed architecture, Chinese support can be added by extending its language-analysis interface.

Implementation Mechanism of full-text Retrieval

Lucene's API is designed to be quite generic, and its input and output structures closely resemble a database's table ==> record ==> field, so many traditional applications' files and databases can be conveniently mapped onto Lucene's storage structure and interface. Overall, you can think of Lucene as a database system that supports full-text indexing.

Compare Lucene with the database:

Lucene:

    Index data source: doc(field1, field2, ...)
                  \   indexer   /
               [ Lucene Index ]
                  /  searcher   \
    Result output: Hits(doc(field1, field2, ...), doc(field1, ...))

Database:

    Index data source: record(field1, field2, ...)
                  \  SQL: insert  /
                 [   DB Index   ]
                  /  SQL: select  \
    Result output: results(record(field1, field2, ...), record(field1, ...))

Document: the unit of indexing; a Document consists of multiple Fields. Corresponds to Record: a record, containing multiple fields.

Field: field. Corresponds to field: field.

Hits: the query result set, made up of matching Documents. Corresponds to RecordSet: the query result set, made up of multiple Records.

Full-text search ≠ like "%keyword%"

Thick books often come with a keyword index at the back (for example: "Beijing: pages 12, 34; Shanghai: pages 3, 77"), which helps readers find the pages for a topic much faster. The principle by which a database index greatly speeds up queries is the same: imagine how much faster searching through the index at the back of a book is than flipping through the content page by page. Another reason an index is efficient is that it is sorted; for a retrieval system, the core problem is a sorting problem.

Because database indexes are not designed for full-text indexing, the index does not work at all for a query such as like "%keyword%"; a LIKE query degenerates into something like the page-by-page traversal, so for database services that must serve fuzzy queries, LIKE is devastating to performance. And matching several keywords requires like "%keyword1%" and like "%keyword2%" ...: the efficiency can be imagined.

Therefore, the key to building an efficient retrieval system is an inverted-index mechanism like a book's keyword index: alongside the data sources (say, a collection of articles) stored in their own order, there is a second, sorted list of keywords that stores the keyword ==> article mapping. Using this mapping index of [keyword => IDs of the articles containing the keyword, occurrence count (even positions: start offset, end offset), occurrence frequency], retrieval turns a fuzzy query into a logical combination of several exact queries that can each use an index. Multi-keyword queries thereby become far more efficient, so the full-text retrieval problem again boils down to a sorting problem.
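To make this concrete, here is a minimal inverted-index sketch in plain Java. The class and method names are illustrative only, not Lucene's; real Lucene postings are far more compact, but the keyword ==> documents/positions mapping and the AND-combination of exact lookups are the same idea:

import java.util.*;

public class InvertedIndex {
    // keyword ==> (document id ==> positions where the keyword occurs)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Add one tokenized document to the index.
    public void add(int docId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // A multi-keyword fuzzy query becomes a logical AND of exact index lookups.
    public Set<Integer> search(String... keywords) {
        Set<Integer> result = null;
        for (String kw : keywords) {
            Set<Integer> docs = postings.getOrDefault(kw, Collections.emptyMap()).keySet();
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs); // keep only docs that match every keyword
        }
        return result == null ? Collections.<Integer>emptySet() : result;
    }
}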

From this we can see that a fuzzy query, relative to a database's exact query, is a very uncertain problem; that is why most databases offer only limited support for full-text retrieval. Lucene's core feature is that, through a special index structure, it implements the full-text indexing that traditional databases are not good at, and it provides extension interfaces to ease customization for different applications.

The comparison below sets Lucene's full-text indexing against a database's fuzzy query:

Index:
    Lucene: builds an inverted index over all the data in the data source.
    Database: for a LIKE query, the data's traditional index does not apply at all; every record must be matched one by one, GREP-style, which is several orders of magnitude slower than an indexed search.

Matching:
    Lucene: matches term by term and, by implementing the language-analysis interface, can support non-English languages such as Chinese.
    Database: like "%net%" also matches "netherlands", and a multi-keyword fuzzy match such as like "%com%net%" cannot match the reversed word order of "xxx.net...xxx.com".

Degree of matching:
    Lucene: has a matching-degree (similarity) algorithm, ranking the most relevant results first.
    Database: no control over matching degree: a record in which "net" appears five times and one in which it appears once give the same result.

Result output:
    Lucene: outputs the top 100 results by matching degree through a special algorithm; the result set is buffered and read in small batches.
    Database: returns the whole result set; when matches are numerous (say, tens of thousands), a large amount of memory is needed to hold the temporary result set.

Customizability:
    Lucene: through the language-analysis interfaces, index rules that fit the application (including Chinese support) can be tailored easily.
    Database: no interface, or only a complex one; cannot be customized.

Conclusion:
    Lucene: suited to high-load fuzzy-query applications that need their own fuzzy-query rules and index a large amount of data.
    Database: suited when usage is low, fuzzy-matching rules are simple, or the data volume needing fuzzy queries is small.

The biggest difference between full-text retrieval and a database application: letting the top 100 most relevant results satisfy the needs of more than 98% of users.

The innovations of Lucene:

Most search (database) engines use a B-tree structure to maintain their index, and updating the index causes a great deal of I/O. Lucene improves on this slightly in its implementation: rather than maintaining one index file, it keeps creating new index segment files as the index grows, and then periodically merges these small segments into the existing large index (the merge batch size can be tuned for different update strategies), so indexing efficiency improves without hurting retrieval efficiency.

Comparison between Lucene and other full-text retrieval systems / applications:

Incremental indexing and batch indexing:
    Lucene: supports incremental indexing (append) as well as batch indexing for large data volumes, and its interface is designed to optimize both batch indexing and small-batch incremental indexing.
    Others: many systems support only batch indexing, and sometimes a small addition to the data source forces the whole index to be rebuilt.

Data source:
    Lucene: does not define a specific data source, only the structure of a document, so it adapts very flexibly to all kinds of applications (as long as a front-end converter turns the data source into the corresponding structure).
    Others: many systems target only web pages and lack flexibility for documents in other formats.

Index content:
    Lucene: a document is composed of multiple fields, and you can even control which fields are indexed and which are not; indexed fields are further divided into those that need tokenizing, such as the title and article body, and those that do not, such as author and date.
    Others: lacking such generality, they often index the entire document.

Language analysis:
    Lucene: through different extensions of the language analyzer it can filter out unwanted words (a, an, the, of, etc.), perform Western stemming (jumps, jumped, jumper all reduce to jump for indexing/retrieval), and support non-English text (indexing for Asian and Arabic languages).
    Others: lack a common interface for this.

Query analysis:
    Lucene: by implementing the query-analysis interface, you can customize your own query syntax rules, for example the +, -, AND, OR relations between multiple keywords.

Concurrent access:
    Lucene: supports use by multiple users.

On word Segmentation in Asian languages (Word Segment)

For Chinese, full-text indexing must first solve the problem of language analysis: in English, the words in a sentence are naturally separated by spaces, but in Asian languages such as Chinese, Japanese and Korean, a sentence is a contiguous run of characters. To index by "word", how to cut the sentence into words is the first big problem.

First of all, single characters (uni-grams) certainly cannot serve as the index unit; otherwise a query for "上海" (Shanghai) would also match text containing "海上" (the same two characters in reverse order, meaning "at sea").

But take a phrase such as "北京天安门" (Beijing Tiananmen): how should the computer cut it according to the habits of the Chinese language?

"Beijing Tiananmen Square" or "Beijing Tiananmen Square"? To enable the computer to segment according to language habits, it often requires the machine to have a relatively rich thesaurus in order to more accurately identify the words in the sentence.

Another solution is an automatic segmentation algorithm: cut the text into overlapping two-character grams (bigrams), for example:

"Beijing Tiananmen Square" > "Beijing Tiananmen Square".

In this way, at query time, whether the query is "北京" (Beijing) or "天安门" (Tiananmen), the query string is segmented by the same rule — "北京", and "天安 安门" — and the multiple keywords are combined with an AND relation, so the query also maps correctly onto the corresponding index entries. This approach applies equally to other Asian languages: Korean and Japanese.
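A minimal illustration of bigram splitting in plain Java (no Lucene dependency; the class and method names are my own, for illustration):

import java.util.ArrayList;
import java.util.List;

public class Bigram {
    // Split text into overlapping two-character tokens:
    // "北京天安门" ==> [北京, 京天, 天安, 安门]
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("北京天安门")); // index-time tokens
        System.out.println(bigrams("天安门"));     // query ==> [天安, 安门], ANDed at search time
    }
}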

The biggest advantage of automatic segmentation is that there is no thesaurus-maintenance cost and the implementation is simple; the disadvantage is lower indexing efficiency. For small and medium-sized applications, though, bigram segmentation is good enough: a bigram-based index is generally about the same size as the source text, while for English an index file is typically only 30%-40% of the original.

Automatic segmentation vs. thesaurus-based segmentation:

Implementation:
    Automatic: very simple to implement.
    Thesaurus-based: complex to implement.

Query:
    Automatic: adds complexity to query analysis.
    Thesaurus-based: suited to implementing more complex query syntax rules.

Storage efficiency:
    Automatic: large index redundancy; the index is almost as large as the original text.
    Thesaurus-based: efficient; the index is about 30% of the original text's size.

Maintenance cost:
    Automatic: no thesaurus maintenance cost.
    Thesaurus-based: very high: Chinese, Japanese, Korean and other languages must each be maintained separately, and word-frequency statistics and the like need to be included as well.

Applicable areas:
    Automatic: embedded systems (limited runtime resources); distributed systems (no thesaurus-synchronization problem); multilingual environments (no thesaurus maintenance cost).
    Thesaurus-based: professional search engines with high demands on query and storage efficiency.

At present, the language-analysis algorithms of the larger search engines are generally based on a combination of the two mechanisms above. For more on Chinese language-analysis algorithms, search Google for the keywords "word segment search".

Installation and use

Download: http://jakarta.apache.org/lucene/

Note: some of the more complex lexical analysis in Lucene is generated with JavaCC (Java Compiler Compiler, a pure-Java lexical-analyzer generator), so if you compile from source, need to modify the QueryParser, or want to customize your own lexical analyzer, you also need to download JavaCC from http://www.webgain.com/products/java_cc/.

The structure of Lucene: the indexing module (index) and the retrieval module (search) are the main entry points for external applications:

org.apache.lucene.search/       search entry point
org.apache.lucene.index/        indexing entry point
org.apache.lucene.analysis/     language analyzers
org.apache.lucene.queryParser/  query parser
org.apache.lucene.document/     storage structure
org.apache.lucene.store/        underlying IO/storage structure
org.apache.lucene.util/         some common data structures

A simple example demonstrates how to use Lucene:

Indexing process: read the file names (one or more) from the command line, store each file in two fields — path (the path field) and body (the content field) — and full-text index the content. The unit of indexing is the Document object; each Document contains multiple Field objects. Depending on a field's properties and output requirements, different indexing/storage rules can be chosen for each field, as listed here:

Method                                      Tokenized  Indexed  Stored  Purpose
Field.Text(String name, String value)       Yes        Yes      Yes     tokenized, indexed and stored: e.g. title, body
Field.Text(String name, Reader value)       Yes        Yes      No      tokenized and indexed, not stored: e.g. META content, not returned for display but needed for search
Field.Keyword(String name, String value)    No         Yes      Yes     indexed without tokenizing, and stored: e.g. a date field
Field.UnIndexed(String name, String value)  No         No       Yes     not indexed, only stored: e.g. a file path
Field.UnStored(String name, String value)   Yes        Yes      No      full-text indexed only, not stored

public class IndexFiles {
    // Usage: IndexFiles [index output directory] [list of files to index]
    public static void main(String[] args) throws Exception {
        String indexPath = args[0];
        IndexWriter writer;
        // Construct a new index writer with the chosen language analyzer
        // (3rd argument: true = create a new index, false = append to an existing one)
        writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);

        for (int i = 1; i < args.length; i++) {
            System.out.println("Indexing file " + args[i]);
            InputStream is = new FileInputStream(args[i]);

            // Construct a Document object with two fields:
            // - the path field: not indexed, only stored
            // - the body field: full-text indexed, not stored
            Document doc = new Document();
            doc.add(Field.UnIndexed("path", args[i]));
            doc.add(Field.Text("body", (Reader) new InputStreamReader(is)));

            // Write the document into the index
            writer.addDocument(doc);
            is.close();
        }

        // Close the index writer
        writer.close();
    }
}

You can see during the indexing process:

The language analyzer is provided as an abstract interface, so language analysis (Analyzer) is customizable. Lucene provides two general-purpose analyzers by default, SimpleAnalyzer and StandardAnalyzer, but neither supports Chinese out of the box, so to add segmentation rules for Chinese you need to modify or extend them.
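For example, a bigram analyzer in the spirit of the segmentation scheme above could look like the following sketch. It assumes the classic Lucene 1.x analysis API (an Analyzer returning a TokenStream whose next() yields Token objects, or null at end of input), and it skips the special handling a real analyzer would need for runs of English letters and digits:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class BigramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new BigramTokenStream(reader);
    }

    private static class BigramTokenStream extends TokenStream {
        private final Reader input;
        private int prev = -1;   // previous character, -1 before the first read
        private int offset = 0;  // offset of prev within the input

        BigramTokenStream(Reader input) { this.input = input; }

        public Token next() throws IOException {
            if (prev == -1) prev = input.read();   // prime the first character
            if (prev == -1) return null;           // empty input
            int curr = input.read();
            if (curr == -1) return null;           // fewer than two characters left
            Token token = new Token("" + (char) prev + (char) curr, offset, offset + 2);
            prev = curr;                           // slide the two-character window
            offset++;
            return token;
        }
    }
}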

Lucene does not prescribe the format of the data source; it only provides one general structure (the Document object) as the indexing input, so the input can come from a database, a WORD document, a PDF document, an HTML document... As long as an appropriate parser/converter constructs the data source into a Document object, it can be indexed.
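For instance, here is a sketch of turning one database row into a Document. The JDBC ResultSet rs, the IndexWriter writer, and the column names are assumptions for illustration; the Field factory methods are the ones from the table above:

Document doc = new Document();
doc.add(Field.Keyword("id", rs.getString("id")));          // indexed unsplit, and stored
doc.add(Field.Text("title", rs.getString("title")));       // tokenized, indexed and stored
doc.add(Field.UnStored("body", rs.getString("content")));  // full-text indexed only, not stored
writer.addDocument(doc);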

For mass data indexing, you can also improve the efficiency of bulk indexing by adjusting IndexWriter's file-merge frequency property (mergeFactor).

The retrieval process and the display of results:

Search results are returned as a Hits object, through which you can access the Document ==> Field contents.

Assuming a full-text search on the body field, each result's path field and the matching degree (score) of the query can be printed out:

public class Search {
    // Usage: Search [index directory] [query string]
    public static void main(String[] args) throws Exception {
        String indexPath = args[0], queryString = args[1];

        // Searcher pointing at the index directory
        Searcher searcher = new IndexSearcher(indexPath);

        // Query parser: use the same language analyzer as at indexing time
        Query query = QueryParser.parse(queryString, "body", new SimpleAnalyzer());

        // Search results are stored in a Hits object
        Hits hits = searcher.search(query);

        // Through hits you can access each field's data and the query's matching degree
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
        }
    }
}

Customized sorting: inside IndexSearcher, hits are collected through a HitCollector callback. In the result-collection code below (adapted from IndexSearcher's search implementation), replacing the relevance score with 1/doc makes results come back ordered by docID instead of by matching degree:

scorer.score(new HitCollector() {
    private float minScore = 0.0f;
    public final void collect(int doc, float score) {
        if (score > 0.0f &&                      // ignore zeroed buckets
            (bits == null || bits.get(doc))) {   // skip docs not in bits
            totalHits[0]++;
            if (score >= minScore) {
                /* Originally, Lucene put the docID and its matching degree (score)
                 * into the result hit queue:
                 *   hq.put(new ScoreDoc(doc, score));   // update hit queue
                 * If doc or 1/doc is used instead of score, results are ordered
                 * by docID, ascending or descending.
                 * Assuming the data source was already sorted by some field before
                 * indexing, sorting the results by docID amounts to sorting by
                 * that field. Sorting on a field can even be blended with score
                 * into more complex rankings.
                 */
                hq.put(new ScoreDoc(doc, (float) 1 / doc));
                if (hq.size() > nDocs) {                    // if hit queue overfull
                    hq.pop();                               // remove lowest in hit queue
                    minScore = ((ScoreDoc) hq.top()).score; // reset minScore
                }
            }
        }
    }
}, reader.maxDoc());

More general input and output interface

Although Lucene does not define a fixed input document format, more and more people are converging on a standard intermediate format as Lucene's data-import interface; other data, such as PDF, then only needs a parser to convert it to the standard intermediate format before being indexed. This intermediate format is mostly XML-based, and there are no fewer than five similar implementations:

Data source: WORD   PDF   HTML   DB   other
                 \    \    |    /    /
                 XML intermediate format
                          |
                     Lucene INDEX

Index process optimization

Indexing generally falls into two cases: small incremental additions to an index, and wholesale rebuilds over large data sets. During indexing, the index file is not rewritten every time a new DOC is added to the index (file I/O is a very resource-consuming thing).

Lucene first performs indexing operations in memory, and writes files in batches of a certain size. The larger the batch interval, the fewer file writes, but the more memory consumed; conversely, less memory means frequent file I/O operations and much slower indexing. IndexWriter has a MERGE_FACTOR parameter which, after the indexer is constructed, helps you make full use of memory and reduce file operations to suit the application environment. In my experience, the default indexer writes once for every 20 records indexed, and each 50x increase of MERGE_FACTOR roughly doubles indexing speed.
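A sketch of bulk indexing with a raised merge factor, assuming the classic Lucene 1.x API where mergeFactor is a public field of IndexWriter (the value 1000 is illustrative; tune it to your memory budget):

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BatchIndex {
    public static void index(String indexPath, String[] texts) throws Exception {
        IndexWriter writer = new IndexWriter(indexPath, new SimpleAnalyzer(), true);
        writer.mergeFactor = 1000; // buffer more documents in memory, write/merge files less often
        for (int i = 0; i < texts.length; i++) {
            Document doc = new Document();
            doc.add(Field.Text("body", texts[i]));
            writer.addDocument(doc);
        }
        writer.close(); // flush remaining buffered documents and finish merging
    }
}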

Retrieval cache optimization

Lucene's optimization for full-text retrieval is that after the initial index search it does not read out the full content of all matching records (Documents); it puts only the IDs of the top 100 results (TopDocs) with the highest matching degree into a result-set cache and returns them. Compare this with database retrieval: if a search matches 10,000 records, the database must fetch them all before handing a result set back to the application. So even when the total number of matches is very large, Lucene's result set does not occupy much memory. For typical fuzzy-retrieval applications we do not need that many results anyway; the top 100 satisfy more than 90% of retrieval needs.

If you read past the first batch of cached results, the Searcher re-retrieves with a cache twice as large as last time and fetches again. So if you construct one Searcher and ask for results 1-120, the Searcher actually performs two search passes: after the first 100 entries are consumed, the cache is exhausted, so the Searcher re-retrieves and builds a 200-entry cache, and so on: a 400-entry cache, an 800-entry cache. Since these caches are unreachable once the Searcher object is discarded, you may want to cache result records yourself, keeping the cached count below 100 to make full use of the first result cache so that Lucene does not waste multiple searches; results can also be cached in tiers.
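A sketch of paging that stays within the first result cache, reusing the searcher and query from the Search example above (pageSize and page are illustrative):

int pageSize = 10, page = 2;                     // e.g. results 11..20
Hits hits = searcher.search(query);
int start = (page - 1) * pageSize;
int end = Math.min(start + pageSize, hits.length());
for (int i = start; i < end; i++) {
    // staying below result 100 never triggers the re-search-and-double behavior
    System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
}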

Another feature of Lucene is that the results with low matching degree are automatically filtered out in the process of collecting results. This is also different from the database application that needs to return all the search results.

Some of my attempts:

Support for a Chinese Tokenizer: there are two versions, one generated with JavaCC, the other rewritten from SimpleTokenizer; they support digit and letter TOKENs for English and index Chinese iteratively, character by character.

An indexer based on XML data sources: XMLIndexer — any data source can be indexed by XMLIndexer as long as it can be converted to the XML specified by the DTD.

Sorting by a field: a searcher that returns results in record-index order, IndexOrderSearcher. So if you need search results sorted on some field, you can sort the data source by that field (for example: PriceField) before indexing; then, using the searcher that retrieves in record-ID order, the results are equivalent to being sorted on that field.

Learn more from Lucene

Lucene is indeed a model of object-oriented design.

Every concern is wrapped behind an extra layer of abstraction to ease later extension and reuse: you can achieve your goal by re-implementing just that piece, without touching the other modules.

The application entry points, Searcher and Indexer, are simple, and they invoke a series of underlying components to complete the search task.

Each object's task is very specific: in the search process, for example, QueryParser analyzes and converts the query statement into a combination of precise queries (Query), the low-level IndexReader structure reads the index, and the corresponding scorers score/sort the search results. All functional modules are highly atomized, so each can be re-implemented without modifying the other modules.

Besides its flexible application-interface design, Lucene also provides some language-analyzer implementations suited to most applications (SimpleAnalyzer, StandardAnalyzer), which is one important reason new users get started quickly.

These strengths are well worth learning from in future development. As a general-purpose toolkit, Lucene gives real convenience to developers who need to embed full-text retrieval capability into their applications.

In addition, through studying and using Lucene, I also came to a deeper understanding of why many database optimization guidelines exist, for example:

Index fields as much as possible to speed up queries, but too many indexes slow down updates to a database table; and sorting conditions over very large result sets are often one of the killers of performance.

Many commercial databases provide some optimized parameters for mass data insertion operations, which is similar to the function of the indexer's merge_factor.

The 20%/80% rule: a large number of search results does not mean high quality; especially for very large result sets, optimizing the quality of the first few dozen results is often what matters most.

Let the application get a relatively small result set from the database as much as possible, because random access to the result set is a very resource-consuming operation even for large databases.

This concludes this look at Lucene, the Java-based full-text indexing engine.
