What is the principle of Lucene query?

2025-01-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces the principles behind Lucene queries. The concrete cases below cover situations many people run into in practice; work through them carefully and you should come away with something useful.

Lucene query principle

This section covers background knowledge of Lucene; readers already familiar with it can skip ahead.

Data structure and query principle of Lucene

The bottom layer of Elasticsearch is Lucene, so it is fair to say that Lucene's query performance determines the query performance of Elasticsearch.


The most important things in Lucene are its data structures, which determine how data is retrieved. This article briefly describes several of them:

FST: stores the term dictionary; single-Term, Term-range, Term-prefix, and wildcard queries can all be executed against the FST.

Inverted chain (postings list): for each term, the list of matching docIDs is stored as a skipList, enabling fast jumps.

BKD-Tree: a data structure that stores multi-dimensional points, used for fast lookups on numeric types (including spatial points).

DocValues: column storage indexed by docID. Thanks to the columnar layout, it substantially speeds up sorting and aggregation.

Merge the results of the combination condition

After understanding the data structure and basic query principles of Lucene, we know:

For a single-term query, Lucene reads that term's inverted chain, which contains an ordered list of docIDs.

For string range / prefix / wildcard queries, Lucene first collects all qualifying Terms from the FST, then walks the inverted chain of each Term to find the qualifying docs.

For a numeric range lookup, Lucene finds the qualifying docID set via the BKD-Tree, but the docIDs in that set are not ordered.

The question now is: given a combined query condition, how does Lucene merge the results of each single condition into the final result? Simplified, the problem is how to compute the intersection and union of two sets.

1. Finding the intersection of N inverted chains

As the analysis above showed, the intersection of N inverted chains can be computed using the skipList to efficiently skip over non-matching docs.
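The skipList-driven intersection can be sketched roughly as a leapfrog merge. This is an illustrative sketch, not Lucene's actual code: `bisect` stands in for the skipList's ability to jump past non-matching docIDs instead of scanning them one by one.

```python
import bisect

def intersect(postings):
    """Leapfrog-style intersection of N sorted docID lists.

    Each advance uses bisect to emulate a skipList jump to the first
    docID >= the candidate, instead of scanning entry by entry.
    """
    postings = sorted(postings, key=len)  # drive from the shortest list
    result = []
    for doc in postings[0]:
        ok = True
        for plist in postings[1:]:
            i = bisect.bisect_left(plist, doc)  # "skip" to first id >= doc
            if i == len(plist) or plist[i] != doc:
                ok = False
                break
        if ok:
            result.append(doc)
    return result

print(intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9, 12], [5, 9, 10]]))  # [5, 9]
```

Driving from the shortest chain is what makes the skip effective: the long chains are only probed at the few candidate docIDs.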

2. The union of N inverted chains

Method 1: keep the multiple ordered lists as they are, and build a priority queue (min-heap) from their heads. The whole union can then be iterated (the smallest head is popped off the heap, and that list's next docID is pushed on), and it can still jump via skipList (each sub-list skips independently). This method suits scenarios where the number of inverted chains N is relatively small.
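Method 1 can be sketched as a standard heap merge; this is an illustrative sketch (with the skipList jumping omitted), not Lucene's implementation.

```python
import heapq

def union_iter(postings):
    """Iterate the union of N sorted docID lists via a min-heap.

    The heap holds (current docID, list index, position); the smallest
    head is popped and its list advances, mirroring "method 1" above.
    Duplicate docIDs appearing in several lists are emitted once.
    """
    heap = [(plist[0], i, 0) for i, plist in enumerate(postings) if plist]
    heapq.heapify(heap)
    last = None
    while heap:
        doc, i, pos = heapq.heappop(heap)
        if doc != last:  # suppress duplicates coming from other lists
            yield doc
            last = doc
        if pos + 1 < len(postings[i]):
            heapq.heappush(heap, (postings[i][pos + 1], i, pos + 1))

print(list(union_iter([[1, 4, 7], [2, 4, 9], [7, 10]])))  # [1, 2, 4, 7, 9, 10]
```

Each pop/push is O(log N), so the cost grows with the number of chains, which is why this only pays off when N is small.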

Method 2: when there are many inverted chains (N is relatively large), method 1 is no longer cost-effective. In that case, merge the results directly into a single sorted docID array.

Method 3: method 2 stores the raw docIDs, which consumes a lot of memory when there are many of them. So when the number of docs exceeds a certain threshold, Lucene builds a BitSet instead: a 32-bit docID occupies only one bit in the BitSet, and the BitSet's size depends on the total number of docs in the segment, so you can estimate from the total doc count and the current doc count whether a BitSet is more economical. A BitSet saves a great deal of memory, and intersections/unions on BitSets are very efficient.
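The BitSet representation of method 3 can be sketched as one bit per docID; this is an illustrative sketch, not Lucene's `FixedBitSet`.

```python
def union_bitset(postings, max_doc):
    """Union of many docID lists stored as a fixed-size bit set.

    Memory is max_doc / 8 bytes no matter how many lists are merged or
    how many docIDs they contain, and set operations on the underlying
    words (AND/OR) are very cheap.
    """
    bits = bytearray((max_doc + 7) // 8)
    for plist in postings:
        for doc in plist:
            bits[doc >> 3] |= 1 << (doc & 7)
    # materialize back into a sorted docID list, just for demonstration
    return [d for d in range(max_doc) if bits[d >> 3] & (1 << (d & 7))]

print(union_bitset([[1, 4, 7], [2, 4, 9], [7, 10]], 16))  # [1, 2, 4, 7, 9, 10]
```

Note the trade-off the text describes: for a segment with 10 million docs the BitSet always costs about 1.25 MB, which beats a raw docID array only once the result set is large enough.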

3. How to merge the results of BKD-Tree with other results

The docIDs found through the BKD-Tree are unordered, so they are either converted into a sorted docID array or used to build a BitSet, and only then merged with the other results.

Query order optimization

When a query has multiple conditions, it is better to evaluate the cheaper conditions first and then iterate from the smaller result set. Lucene applies many such optimizations: before executing a query, it estimates the cost of each sub-query and uses that to decide the query order.

Ranking of results

By default, Lucene sorts by Score, i.e. the computed relevance score; if a Sort field is specified, it sorts by that field instead. So, does sorting significantly hurt performance? First of all, sorting does not sort all hit docs: it maintains a heap that only keeps the top (Offset+Size) docs in order. The cost of sorting therefore depends on (Size+Offset), the number of hits, and the cost of reading docValues. Since (Size+Offset) is usually small and docValues reads are very fast, sorting does not have a large impact on performance.
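The bounded-heap idea can be sketched as follows; names like `top_hits` are illustrative, not Lucene's API.

```python
import heapq

def top_hits(scored_docs, offset, size):
    """Keep only the best (offset + size) hits with a bounded heap,
    then slice off the requested page. All hits are scanned once, but
    they are never fully sorted."""
    k = offset + size
    # nlargest maintains a heap of at most k entries while scanning
    best = heapq.nlargest(k, scored_docs, key=lambda hit: hit[1])
    return best[offset:offset + size]

hits = [("doc%d" % i, s) for i, s in enumerate([3.1, 9.2, 0.4, 7.7, 5.0, 8.8])]
print(top_hits(hits, offset=1, size=2))  # [('doc5', 8.8), ('doc3', 7.7)]
```

This is why deep pagination (a large Offset) is expensive while ordinary top-10 queries are cheap: the heap size, not the hit count, is what grows.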

Query performance analysis of each scenario

The previous section covered query-related theory; this section combines theory with practice, using concrete test numbers to analyze the performance of each scenario. The tests use a single Shard on a 64-core machine with SSD disks. The focus is on the compute cost of each scenario, ignoring the effect of the operating-system cache; the results are for reference only.

Single Term query

Create an Index in ES with one shard and no replicas, and write 10 million rows of data into it; each row has only a few tags and a unique ID. The Tag1 tag has only two values, a and b. Now we want to find one piece of Tag1=a data (about 5 million rows match) out of the 10 million. Given the following query, how long does it take?

request:

```json
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "Tag1": "a" }
      }
    }
  },
  "size": 1
}
```

response:

```json
{
  "took": 233,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 5184867, "max_score": 5184867, "hits": ... }
}
```

This request took 233 ms and returned the total number of qualifying docs: 5,184,867.

For the Tag1="a" condition, Lucene scans the inverted chain of Tag1="a". That chain is 5,184,867 entries long, which is very long, and most of the time is spent scanning it. In fact, for this example, the only benefit of scanning the whole chain is obtaining the total count of qualifying docs: since constant_score is set, no scoring is needed, and only one qualifying record has to be returned. In scoring scenarios, Lucene computes a score from how often the term appears in each doc and sorts by it before returning.

We can now draw a conclusion: in 233 ms, at least 5 million entries of the inverted chain were scanned. Considering that a single request is executed by a single thread, we can roughly estimate that one CPU core scans docs in an inverted chain at a rate of tens of millions per second.
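The back-of-the-envelope arithmetic behind that estimate:

```python
# rough single-core scan rate implied by the measurement above
docs = 5_184_867            # inverted-chain entries scanned
took_ms = 233               # reported query latency
rate = docs / (took_ms / 1000)
print(round(rate))          # ~22 million docIDs per second
```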

Switching to a much shorter inverted chain with about 10,000 entries, the query takes only 3 ms in total.

```json
{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 10478, "max_score": 10478, "hits": ... }
}
```

Term combined query

First consider the intersection of two Term queries:

For a combined Term query where the two inverted chains are 10,000 and 5 million entries long respectively, and about 5,000 docs qualify after merging, what is the query performance?

request:

```json
{
  "size": 1,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "term": { "Tag1": "a" } },  // inverted chain length: 5 million
            { "term": { "Tag2": "0" } }   // inverted chain length: 10,000
          ]
        }
      }
    }
  }
}
```

response:

```json
{
  "took": 21,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 5266, "max_score": 2.0, "hits": ... }
}
```

This request took 21 ms, mostly spent intersecting the two inverted chains, so we mainly analyze the performance of the skipList.

In this example the inverted chains are 10,000 and 5 million entries long, and just over 5,000 docs survive the merge. On the 10,000-entry chain there is basically no skipping, because about half of its docs qualify; on the 5-million-entry chain, each skip jumps past about 1,000 docs on average. Because the smallest storage unit of an inverted chain is a BLOCK of 128 docIDs, there is no skipping within a BLOCK: even after skipping to the BLOCK containing a target docID, that BLOCK still has to be scanned sequentially. So the actual number of docIDs scanned in this example is roughly in the hundreds of thousands, and the total of just over 20 ms is as expected.
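A rough estimate of the scan volume described above (the per-skip block scan is a simplifying assumption):

```python
# rough docID-scan estimate for the skipList intersection above
matches = 5_266        # docs surviving the intersection
block_size = 128       # postings are stored in blocks of 128 docIDs
# assume each surviving doc forces a jump into some block of the long
# chain, which must then be scanned sequentially
scanned = matches * block_size
print(scanned)  # 674048 -- "hundreds of thousands" of docIDs
```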

For the union of the Term queries, change the must in the bool query above to should; the result is as follows:

```json
{
  "took": 393,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 5190079, "max_score": 5190079, "hits": ... }
}
```

It took 393 ms, so a union costs more than a single-condition query.

String range query

RecordID is a UUID; there are 10 million docs, each with a unique uuid. We want to find the uuids starting with 0 through 7, which yields roughly 5 million results. What is the performance?

request:

```json
{
  "query": {
    "constant_score": {
      "filter": {
        "range": { "RecordID": { "gte": "0", "lte": "8" } }
      }
    }
  },
  "size": 1
}
```

response:

```json
{
  "took": 3001,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 5185663, "max_score": 5185663, "hits": ... }
}
```

Querying the uuids starting with "a" instead, with roughly 600,000 results, how does it perform?

request:

```json
{
  "query": {
    "constant_score": {
      "filter": {
        "range": { "RecordID": { "gte": "a", "lte": "b" } }
      }
    }
  },
  "size": 1
}
```

response:

```json
{
  "took": 379,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 648556, "max_score": 648556, "hits": ... }
}
```

This query mainly exercises the FST. From the results above, FST query performance is much worse than scanning an inverted chain: scanning 5 million docs takes under 300 ms on an inverted chain but about 3 seconds on the FST, roughly ten times slower. For UUID-length strings, FST range scanning runs at roughly millions of terms per second.

String range query plus Term query

A string range query (5 million matches) combined with two Term queries (about 5,000 matches) leaves roughly 2,600 final results. What is the performance?

request:

```json
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "range": { "RecordID": { "gte": "0", "lte": "8" } } },
            { "term": { "Tag1": "a" } },
            { "term": { "Tag2": "0" } }
          ]
        }
      }
    }
  },
  "size": 1
}
```

result:

```json
{
  "took": 2849,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 2638, "max_score": 2638, "hits": ... }
}
```

In this example, most of the query time still goes into scanning the FST: Lucene scans the FST for qualifying Terms, reads the docID list of each Term to build a BitSet, and then intersects that BitSet with the inverted chains of the two TermQuery conditions.

Numeric Range query

For numeric types, we again select 5 million docs out of 10 million. What is the performance?

request:

```json
{
  "query": {
    "constant_score": {
      "filter": {
        "range": { "Number": { "gte": 100000000, "lte": 150000000 } }
      }
    }
  },
  "size": 1
}
```

response:

```json
{
  "took": 567,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 5183183, "max_score": 5183183, "hits": ... }
}
```

This scenario mainly tests the BKD-Tree, and its query performance holds up well: finding 5 million docs costs a bit over 500 ms, only about twice as slow as scanning an inverted chain, and a big improvement over the FST. Location-related queries are also implemented on the BKD-Tree and perform similarly well.

Numeric Range query plus Term query

Here we construct a more complex query scenario: a numeric Range covering 5 million docs, plus two Term conditions, leaving just over 2,600 qualifying docs. What is the performance?

request:

```json
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "range": { "Number": { "gte": 100000000, "lte": 150000000 } } },
            { "term": { "Tag1": "a" } },
            { "term": { "Tag2": "0" } }
          ]
        }
      }
    }
  },
  "size": 1
}
```

response:

```json
{
  "took": 27,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": { "total": 2638, "max_score": 2638, "hits": ... }
}
```

This result exceeded our expectations: it takes only 27 ms! In the previous example, the numeric Range query alone took over 500 ms, yet after adding two Term conditions the time dropped to 27 ms. Why?

In fact, Lucene applies an optimization here: under the hood there is a query type called IndexOrDocValuesQuery that automatically decides whether to execute against the Index (BKD-Tree) or against DocValues. In this example, the query first intersects the two TermQuery chains to get just over 5,000 docIDs, then reads the docValues of those docIDs and filters out the ones satisfying the numeric Range condition. Since only about 5,000 docs' docValues need to be read, it takes very little time.
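The decision IndexOrDocValuesQuery makes can be sketched as follows. This is a conceptual sketch, not Lucene's API: the function name, the dict-based docValues, and the cost comparison are all illustrative assumptions.

```python
def range_query(candidates, docvalues, lo, hi, estimated_bkd_cost):
    """Sketch of the IndexOrDocValuesQuery idea (illustrative only).

    If the other conditions already narrowed the result down to a small
    candidate set, verifying the range per-doc via the column-store
    docValues is cheaper than traversing the BKD-Tree for the range.
    """
    if len(candidates) < estimated_bkd_cost:
        # verification mode: read docValues only for surviving candidates
        return [d for d in candidates if lo <= docvalues[d] <= hi]
    # lead mode: would traverse the BKD-Tree here instead (omitted)
    raise NotImplementedError("BKD-Tree traversal not sketched")

# hypothetical docValues for three candidate docIDs
dv = {10: 120_000_000, 11: 90_000_000, 12: 150_000_000}
print(range_query([10, 11, 12], dv, 100_000_000, 150_000_000, 5_000_000))
# [10, 12]
```

This matches the measurement above: with ~5,000 candidates versus a ~5-million-doc range, verification via docValues wins by a wide margin.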

This concludes "What is the principle of Lucene query". Thank you for reading.
