

An Analysis of the Principles of Lucene Queries



This article explains the principles behind Lucene queries in detail. It is meant to serve as a practical reference, and I hope you will come away with a solid understanding of the relevant concepts after reading it.

Preface

Lucene is a Java-based full-text retrieval toolkit. The mainstream search systems Elasticsearch and Solr are both built on Lucene's indexing and search capabilities. To understand how a search system works, you need to go down to the Lucene layer and see how Lucene stores the data to be retrieved and how it performs efficient retrieval over that data.

Databases also support many efficient query operations thanks to their indexes, but their query capabilities are still much weaker than Lucene's. This article explores which queries Lucene supports and analyzes how several of them are implemented internally. To make things easier to follow, we first briefly introduce some basic Lucene concepts, then walk through several of Lucene's storage structures; once their storage principles are clear, it is easy to see how efficient search is built on top of them. The focus is on how Lucene implements queries that are difficult for a traditional database; word segmentation, scoring, and similar features are not covered.

Lucene data model

There are four basic data types in Lucene:

Index: an index, made up of many Documents.

Document: made up of many Fields; it is the smallest unit of indexing and search.

Field: made up of many Terms, and consists of a Field Name and a Field Value.

Term: a sequence of bytes. Typically, each minimal unit produced by tokenizing a Text-type Field Value is called a Term.

In Lucene, the read and write paths are separate: writes go through an IndexWriter, while reads go through an IndexSearcher.

The following is a simple code example of how to build an index with Lucene's IndexWriter and how to search it with IndexSearcher.

```java
Analyzer analyzer = new StandardAnalyzer();

// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
// Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
// assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    System.out.println(hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
```

As the example shows, Lucene reads and writes go through separate classes; this article focuses on the read logic. Using IndexSearcher requires a DirectoryReader and a QueryParser. The DirectoryReader must match the Directory implementation used at write time. The QueryParser parses your query string: for example, for a query like "A and B", Lucene internally intersects the inverted chains of term A and term B. When calling search, you specify the maximum number of documents to return, since there may be far too many hits; capping the result count this way also makes paginated results possible.
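Since result sets can be large, paging is usually done with the searchAfter API. Below is a minimal sketch reusing `isearcher` and `query` from the example above; the page size is an illustrative choice, not part of the original example:

```java
int pageSize = 10;
ScoreDoc last = null;   // null means: fetch the first page
TopDocs page = (last == null)
        ? isearcher.search(query, pageSize)
        : isearcher.searchAfter(last, query, pageSize);
for (ScoreDoc sd : page.scoreDocs) {
    System.out.println(isearcher.doc(sd.doc).get("fieldname"));
}
// Keep the last ScoreDoc of this page to request the next one:
if (page.scoreDocs.length > 0) {
    last = page.scoreDocs[page.scoreDocs.length - 1];
}
```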

The following sections describe in detail the steps an index query goes through and the optimizations Lucene applies at each step.

Lucene query process

Queries in Lucene operate per segment; each segment can be viewed as an independent subindex. During indexing, Lucene continually flushes in-memory data to disk, forming new segments. Multiple segments are also continually merged into larger segments; an old segment that is still being read by a query is not deleted, while merged segments that are no longer read are removed. This process is similar to the merge process of an LSM database. Below we focus on how efficient queries are implemented within a single segment.

To make this easier to follow, we use a table of people's names, ages, and student IDs as an example, and ask how to look up everyone with a given name (names may repeat).

| docid | name  | age | id  |
|-------|-------|-----|-----|
| 1     | Alice | 18  | 101 |
| 2     | Alice | 20  | 102 |
| 3     | Alice | 21  | 103 |
| 4     | Alan  | 21  | 104 |
| 5     | Alan  | 18  | 105 |

To support a query such as name=XXX, Lucene builds inverted chains keyed on name. For the data above, the inverted chains are:

| name  | docid list |
|-------|------------|
| Alice | [1, 2, 3]  |
| Alan  | [4, 5]     |

If we also want to query by age, for example to find everyone with age=18, we can build another set of inverted chains:

| age | docid list |
|-----|------------|
| 18  | [1, 5]     |
| 20  | [2]        |
| 21  | [3, 4]     |
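To make the inverted structure concrete, here is a toy in-memory version of the two tables above using plain Java collections. This is purely illustrative; Lucene's actual posting lists are compressed, block-encoded on-disk structures:

```java
import java.util.*;

// Toy inverted index: term -> sorted list of docids.
Map<String, List<Integer>> nameIndex = new HashMap<>();
nameIndex.put("Alice", List.of(1, 2, 3));
nameIndex.put("Alan",  List.of(4, 5));

Map<Integer, List<Integer>> ageIndex = new HashMap<>();
ageIndex.put(18, List.of(1, 5));
ageIndex.put(20, List.of(2));
ageIndex.put(21, List.of(3, 4));

// name=Alice AND age=18 -> intersect the two sorted posting lists:
List<Integer> result = new ArrayList<>(nameIndex.get("Alice"));
result.retainAll(ageIndex.get(18));
System.out.println(result);   // [1]
```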

Here, Alice, Alan, 18, and so on are all terms. An inverted index is essentially a mapping from each term to the list of documents containing it, which makes attribute lookups convenient. A natural question arises: when there are many terms, how do we find a term's inverted chain quickly? Lucene introduces the term dictionary for this. Since the term dictionary is sorted by term, a binary search can locate a term's address in O(log N); when there are very many terms, efficiency still needs further improvement. One option is a hashmap: hash the incoming term and follow the pointer to its inverted chain; such a hashmap can be viewed as an index over the term dictionary. Starting with Lucene 4, in order to support complex queries such as range queries and prefix/suffix queries, Lucene uses the FST data structure to store the term dictionary. The FST storage structure is described in detail below.
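Before looking at the FST itself, the benefit of a sorted term dictionary can be illustrated with a plain TreeMap (a conceptual stand-in only, not Lucene's FST format): sorting gives O(log N) exact lookup plus cheap prefix and range scans.

```java
import java.util.*;

// Conceptual sketch of a sorted term dictionary.
TreeMap<String, List<Integer>> termDict = new TreeMap<>();
termDict.put("Alan",  List.of(4, 5));
termDict.put("Alice", List.of(1, 2, 3));

// Exact lookup in O(log N):
System.out.println(termDict.get("Alice"));   // [1, 2, 3]

// Prefix scan for "Al": all terms in the range ["Al", "Am"):
for (Map.Entry<String, List<Integer>> e
        : termDict.subMap("Al", true, "Am", false).entrySet()) {
    System.out.println(e.getKey() + " -> " + e.getValue());
}
```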

FST

Let's walk through the FST construction process using the words "Alice" and "Alan" as an example. First, all words are sorted, giving "Alan", "Alice".

Insert "Alan"

Insert "Alice"

The FST lets us locate a term's inverted chain quickly, but the chain itself can be long, so Lucene stores each chain together with a SkipList over the docids. Suppose we want to find docid=12: without a SkipList we might have to scan the original linked list one element at a time. With a SkipList, we first visit the upper layer, then drop to layer 0 and skip along to 3 and then 8; the next skip entry, 15, is greater than 12, so we enter the original linked list at 8 and walk through 10 to reach 12.
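The idea can be sketched with a single skip layer over a sorted array of docids. The interval and values here are illustrative; Lucene's real layout is block-based with multiple skip levels:

```java
// Posting list with one skip "layer": every 3rd entry acts as a skip point.
int[] postings = {1, 2, 3, 5, 7, 8, 10, 12, 15, 21};
int skipInterval = 3;
int target = 12;

int i = 0;
// Take skips while the skipped-to docid is still <= target:
while (i + skipInterval < postings.length
        && postings[i + skipInterval] <= target) {
    i += skipInterval;
}
// Linear scan inside the final block:
while (i < postings.length && postings[i] < target) {
    i++;
}
if (i < postings.length && postings[i] == target) {
    System.out.println("found docid " + target + " at position " + i);
}
```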

With FST and SkipList introduced, Lucene's overall inverted structure can be summarized as: an FST indexing the term dictionary, with each term pointing to an inverted chain stored with a SkipList.

[Figure: Lucene's overall inverted index structure]

When a query requires the intersection of several terms' inverted chains, say termA, termB, and termC, Lucene merges them in the following order (a code sketch of this loop appears below):

1. Start traversing termA's inverted chain and take its first element, docId=1.

2. Set currentDocId=1.

3. Call advance(currentDocId) on termB's chain, which returns the first doc greater than or equal to currentDocId; here it returns 1.

4. Because the returned docId equals currentDocId, the document matches so far, and we continue with the next chain.

5. If the returned docId and currentDocId are not equal, set currentDocId to the returned value (step 2) and continue.

6. termC also matches, so docId=1 is added to the results.

7. Set currentDocId to the next item in termC's chain.

8. Go back to step 3 and loop until one of the inverted chains is exhausted.

Throughout the merge, one can see that if any chain is very short, the number of comparisons drops sharply, and thanks to the SkipList structure, locating a docid inside an inverted chain does not require element-by-element traversal, so the final result can be returned quickly. Locating terms, reading their chains, and merging them make up the whole Lucene query process. Compared with a traditional database index, Lucene's merge optimizations reduce the I/O needed to read data, and the flexibility of inverted-chain merging also solves the problem that traditional indexes struggle to support multi-condition queries.
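The steps above amount to a leapfrog-style intersection. Here is a compact, self-contained sketch over plain int arrays; it is illustrative only, since in Lucene proper this happens through the postings' advance API:

```java
import java.util.*;

class LeapfrogIntersect {
    // Return the first position in `list` (starting at `pos`) whose docid >= target.
    static int advance(int[] list, int pos, int target) {
        while (pos < list.length && list[pos] < target) pos++;
        return pos;
    }

    // Intersect several sorted posting lists, mirroring steps 1-8 above.
    static List<Integer> intersect(int[][] chains) {
        List<Integer> result = new ArrayList<>();
        int[] pos = new int[chains.length];
        int current = chains[0][0];                               // steps 1-2
        while (true) {
            boolean allMatch = true;
            for (int i = 0; i < chains.length; i++) {
                pos[i] = advance(chains[i], pos[i], current);     // step 3
                if (pos[i] == chains[i].length) return result;    // step 8: a chain is exhausted
                if (chains[i][pos[i]] != current) {               // step 5
                    current = chains[i][pos[i]];                  // restart from the larger docid
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {                                       // steps 4 and 6: every chain matched
                result.add(current);
                current++;                                        // step 7: move past the match
            }
        }
    }

    public static void main(String[] args) {
        // name=Alice -> [1, 2, 3], age=18 -> [1, 5]: expect [1]
        System.out.println(intersect(new int[][]{{1, 2, 3}, {1, 5}}));
    }
}
```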

BKDTree

For a range query, the FST model above would require traversing the FST to find every term that falls inside the range, reading each term's inverted chain, and taking the union. For numeric types such as floating-point numbers, the number of distinct terms can be enormous, which makes such queries inefficient. To support efficient numeric and multi-dimensional queries, Lucene introduced the BKDTree. The BKDTree is based on the KDTree, which recursively splits the data along each dimension to build a binary tree with a balanced number of nodes on each side. In the one-dimensional case, the KDTree degenerates into a binary search tree, where a range lookup reaches the matching leaf nodes, and their inverted chains, in O(log N).

[Figure: range lookup in the one-dimensional (binary search tree) case]
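In Lucene 6 and later, this numeric path is exposed through point fields. A minimal sketch reusing the running example's age attribute (the surrounding `iwriter` and `isearcher` setup is assumed from the first example):

```java
// Index age as a point value (BKD-backed), plus a stored copy for retrieval.
Document doc = new Document();
doc.add(new IntPoint("age", 18));
doc.add(new StoredField("age", 18));
iwriter.addDocument(doc);

// Efficient numeric range query: 18 <= age <= 21
Query rangeQuery = IntPoint.newRangeQuery("age", 18, 21);
ScoreDoc[] hits = isearcher.search(rangeQuery, 10).scoreDocs;
```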

The BKDTree is a variant of the KDTree: adding or modifying nodes in a plain KDTree is expensive, so, borrowing LSM's merge idea, BKD maintains multiple KDTrees and persistently merges them until they become one. Note, however, that for an ordinary term type, a BKDTree index is not as efficient as plain inverted-chain merging, so there is a trade-off here. One approach is to add the other term type to the BKDTree index as an extra dimension.

How to sort and aggregate the returned results

As the previous sections show, Lucene implements term search through an inverted storage model, but sometimes we need to aggregate on another attribute of the hits, or sort the returned results by another attribute. Before Lucene 4, this required reading the original documents and sorting all the results, which was slow and memory-hungry. To speed this up, Lucene implemented a fieldcache that keeps the read fields in memory. That reduces repeated I/O but creates a new problem: high memory usage. Newer versions of Lucene therefore introduced DocValues, a docid-based columnar store. Once we have a set of docids, sorting can be done against this column store, combined with a heap sort. Of course, the extra columns take extra space, so Lucene lets you choose at index-build time whether to store DocValues and for which fields.
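A minimal sketch of writing a DocValues column and sorting by it (again reusing `iwriter` and `isearcher` from the first example; the age field follows the running example):

```java
// Add a docid-based numeric column for "age" alongside the document.
Document doc = new Document();
doc.add(new TextField("name", "Alice", Field.Store.YES));
doc.add(new NumericDocValuesField("age", 18));
iwriter.addDocument(doc);

// Sort matches by the "age" column instead of by relevance score.
Sort sortByAge = new Sort(new SortField("age", SortField.Type.LONG));
TopDocs topDocs = isearcher.search(new MatchAllDocsQuery(), 10, sortByAge);
```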

That concludes this look at the principles behind Lucene queries. I hope the content above is helpful and that you have learned something from it. If you think the article is good, feel free to share it so more people can see it.
