Why are Elasticsearch queries so fast?

2025-04-01 Update From: SLTechnology News & Howtos


This article explains why Elasticsearch queries are so fast. Many people have doubts about this in their daily work, so I have consulted various materials and sorted them into a simple, easy-to-follow explanation. I hope it helps answer the question; let's study it together!

An Elasticsearch query can come back even faster than a local MySQL lookup through the primary key.

To find out why, I searched for relevant information.

There are many answers to this kind of question online, roughly along these lines: ES is a full-text search engine based on Lucene; it tokenizes data and stores indexes, and is good at managing large amounts of index data. Compared with MySQL, it is not good at frequent updates or join queries.

These answers are not very thorough and do not explain the underlying principles, but since they repeatedly mention indexes, let's compare the two from the perspective of indexing.

MySQL index

Starting with MySQL: we are all familiar with the word index. It usually appears in query scenarios and is a classic case of trading space for time. The following uses the InnoDB engine as an example.

Common data structures

Suppose we had to design MySQL's index ourselves; what are the options?

① hash table

The first structure that comes to mind is the hash table, a very common data structure with efficient reads and writes; its counterpart in Java is HashMap.

This structure needs little introduction: its write efficiency is very high, O(1). For example, to query the data with id=3, we hash 3 and then find the corresponding position in the array.

But if we want to query a range such as 1 ≤ id ≤ 6, a hash table cannot handle it well: because it is unordered, we have to traverse all the data to know which rows fall inside the interval.
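This contrast can be sketched in a few lines of Python (the ids and rows here are hypothetical):

```python
# A hash table (a Python dict here) answers a point query in O(1),
# but a range query has to scan every entry because keys are unordered.
rows = {i: f"row-{i}" for i in [3, 8, 1, 6, 4, 9]}

# Point query: hash the key, jump straight to its slot.
assert rows[3] == "row-3"

# Range query 1 <= id <= 6: no order to exploit, so scan all keys.
in_range = sorted(k for k in rows if 1 <= k <= 6)
assert in_range == [1, 3, 4, 6]
```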

② ordered array

The query efficiency of an ordered array is also very high: to query the data with id=4, we only need a binary search to locate it in O(log n).

And because the data is ordered, range queries are naturally supported. So is an ordered array suitable as an index?

Of course not; it has another big problem: if we insert data with id=2.5, all subsequent data has to be shifted one position, so write efficiency becomes very poor.
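A short Python sketch of this trade-off, using the standard bisect module (the ids are made up):

```python
import bisect

ids = [1, 2, 3, 4, 5, 6]  # an ordered array of primary keys

# Point query id=4: binary search in O(log n).
assert ids[bisect.bisect_left(ids, 4)] == 4

# Range query 2 <= id <= 5: locate both endpoints, slice between them.
lo, hi = bisect.bisect_left(ids, 2), bisect.bisect_right(ids, 5)
assert ids[lo:hi] == [2, 3, 4, 5]

# Insert id=2.5: everything after it must shift one slot -> O(n) write.
bisect.insort(ids, 2.5)
assert ids == [1, 2, 2.5, 3, 4, 5, 6]
```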

③ balanced binary tree

Since the write efficiency of an ordered array is poor, let's look at structures that write efficiently; binary trees naturally come to mind.

Here we take the balanced binary tree as an example:

A balanced binary search tree has the property that each left child is smaller than its parent and each right child is larger.

So suppose we want to query the data with id=11: we only need to walk 10 → 12 → 11 to find it. The time complexity is O(log n), and writes are O(log n) as well.

However, it still does not support range queries well. To query 5 ≤ id ≤ 20, we would have to search the left subtree of node 10 and then its right subtree before finally collecting all the data, so such queries are not efficient.
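As a rough sketch (the tree below is hypothetical, not the exact one in the figure), a range query on a binary search tree has to descend into both subtrees of any node whose key falls inside the interval:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

# A small balanced tree rooted at 10.
root = Node(10, Node(5, Node(3), Node(7)), Node(15, Node(12), Node(20)))

def range_query(node, lo, hi, out):
    """In-order walk, pruning subtrees that cannot intersect [lo, hi]."""
    if node is None:
        return out
    if node.key > lo:                 # left subtree may hold smaller keys in range
        range_query(node.left, lo, hi, out)
    if lo <= node.key <= hi:
        out.append(node.key)
    if node.key < hi:                 # right subtree may hold larger keys in range
        range_query(node.right, lo, hi, out)
    return out

assert range_query(root, 5, 20, []) == [5, 7, 10, 12, 15, 20]
```

Note that the query for 5 ≤ id ≤ 20 visits both subtrees of the root, exactly the cost the text describes.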

④ skip list

Skip lists may be less familiar than the hash tables, ordered arrays, and binary trees mentioned above, but the sorted set in Redis is in fact implemented with a skip list. Here we briefly introduce the advantages of this data structure.

We all know that querying an ordered linked list is inefficient: since we cannot use array subscripts for binary search, the time complexity is O(n).

But we can cleverly optimize the linked list to get a binary-search-like effect, as shown in the figure below:

Above the bottom-level data we extract a first-level index and a second-level index, and depending on the amount of data we can extract N levels of index. When querying, these indexes let us achieve something close to binary search.

Suppose we want to query the data with id=13: we only need to traverse the four nodes 1 → 7 → 10 → 13 to find it. The larger the data set, the more pronounced the efficiency gain.

Range queries are also supported: much like querying a single node, we only need to find the starting node and then traverse forward along the (ordered) linked list until the end of the target range.

Moreover, because the index levels store only pointers rather than the real data, the space they occupy is negligible relative to the bottom-level linked list that holds the data.
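A simplified, array-based sketch of the skip-list idea (real skip lists use linked nodes and randomized levels; the data here is hypothetical):

```python
def build_levels(data, max_level=3):
    """Bottom level is the full sorted list; each level above keeps every 2nd element."""
    levels = [data]
    while len(levels) < max_level and len(levels[-1]) > 2:
        levels.append(levels[-1][::2])
    return levels[::-1]  # top level first

def skip_search(levels, target):
    """Walk right along each level, then drop down: a binary-search-like path."""
    pos = 0
    for depth, level in enumerate(levels):
        while pos + 1 < len(level) and level[pos + 1] <= target:
            pos += 1  # move right while the next value still fits
        if level[pos] == target:
            return True
        if depth + 1 < len(levels):
            pos *= 2  # index i on this level sits at index 2*i one level below
    return False

levels = build_levels([1, 4, 7, 10, 13, 17, 20, 25])
assert skip_search(levels, 13)       # found via a short top-down path
assert not skip_search(levels, 5)
```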

Optimizing the balanced binary tree

But in fact, InnoDB in MySQL does not use a skip list; it uses a data structure called the B+ tree.

Unlike the binary tree, this is not a structure that university teachers usually present among the fundamentals, because it evolved from those basic data structures to meet the requirements of real-world projects.

The B+ tree here can be seen as an evolution of the balanced binary tree. We just mentioned that range queries on a binary tree are inefficient; this can be optimized:

Optimizing on the basis of the original binary tree: non-leaf nodes store no data and serve only as an index to the leaf nodes; all the data is stored in the leaf nodes.

In this way, the data in the leaf nodes is stored in order, which supports range queries well: we only need to find the position of the starting node and then iterate forward through the leaf nodes.

When the amount of data is large, the index file obviously cannot be kept entirely in memory; memory is fast but expensive, so MySQL stores the index file directly on disk.

This is slightly different from the Elasticsearch index discussed later. Because the index is stored on disk, we should reduce disk IO as much as possible (disk IO efficiency is not in the same order of magnitude as memory).

As the figure above shows, at least four IOs are needed to query a single row. Clearly the number of IOs is closely related to the height of the tree: the lower the tree, the fewer the IOs and the better the performance.

So how can we reduce the height of the tree?

We can try changing the binary tree into a ternary tree: the height of the tree drops considerably, so the number of IOs per query drops too, and query efficiency improves greatly. Extending each node to hold many children is, in fact, the origin of the B+ tree.
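A quick back-of-the-envelope calculation shows the effect of fan-out on height; the fan-out of 1200 below is only an illustrative guess at how many child pointers one InnoDB page might hold:

```python
import math

def tree_height(fanout, n):
    """Smallest height h with fanout**h >= n, i.e. levels needed to index n rows."""
    return math.ceil(math.log(n, fanout))

rows = 10_000_000
assert tree_height(2, rows) == 24     # binary tree: ~24 levels, ~24 disk IOs
assert tree_height(1200, rows) == 3   # wide B+-tree-style node: ~3 levels
```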

Some suggestions for using indexes

In fact, understanding the B+ tree in the figure above also lets us optimize small details in daily work; for example, why should a primary key preferably be sequentially increasing?

If the primary keys we write are unordered, a later row's id may be smaller than an earlier one's, so maintaining the B+ tree index may require moving already-written data.

If we write data in increasing order instead, there is no such concern; each write simply appends. That is why we ask that database primary keys be trend-increasing as far as possible, and, leaving sharding aside, an auto-increment primary key is the most reasonable choice.

On the whole, the idea is similar to the skip list, just adjusted for the usage scenario (for example, all data is stored in the leaf nodes).

ES index

Now that MySQL has been covered, let's look at how Elasticsearch uses indexes.

Forward index

In ES we use a data structure called the inverted index; before formally discussing it, let's first talk about its opposite, the forward index.

Taking the figure above as an example, looking up a specific record by doc_id is what we call using a forward index, which can also be understood as a hash table.

The essence is finding a value by key. For example, through doc_id=4 we can quickly query the record name=jetty wang, age=20.
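In code, a forward index is just a key-to-value map; the documents below are hypothetical, echoing the figure:

```python
# Forward index: doc_id -> document. Point lookup by key is O(1).
forward = {
    1: {"name": "li lei", "age": 18},
    4: {"name": "jetty wang", "age": 20},
}
assert forward[4]["name"] == "jetty wang"

# But asking "which docs contain 'li' in name?" forces a scan of every doc.
hits = [doc_id for doc_id, doc in forward.items() if "li" in doc["name"]]
assert hits == [1]
```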

Inverted index

So what about the other way around: if I want to query the records whose name contains li, how can that be done efficiently?

The forward index alone obviously won't do; we could only traverse every record and check whether its name contains li, which is very inefficient.

But what if we build a different index structure:

To query the records whose name contains li, we only need to look up this index structure to get the Posting List, and then fetch the final records through the mapping.

This index structure is actually an inverted index.
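A minimal sketch of building such an inverted index (the documents are hypothetical):

```python
from collections import defaultdict

docs = {1: "li lei", 2: "han meimei", 3: "li ming"}

# Inverted index: Term -> Posting List of doc_ids containing that Term.
inverted = defaultdict(list)
for doc_id in sorted(docs):
    for term in docs[doc_id].split():  # naive whitespace tokenizer
        if doc_id not in inverted[term]:
            inverted[term].append(doc_id)

# Query: which docs contain the term "li"? Just read its Posting List.
assert inverted["li"] == [1, 3]
```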

Term Dictionary

But how can li be looked up efficiently in this index structure? Combining our earlier experience: as long as the Terms are arranged in order, we can use a binary-search-style lookup to find the data in O(log n).

Splitting a text into independent Terms is what we commonly call tokenization.

All the Terms combined form the Term Dictionary, which can also be called the word dictionary.

English tokenization is relatively simple: just split the text on spaces and punctuation. Chinese is more complex, but many open-source tools support it (since tokenization is not the focus of this article, interested readers can look it up on their own).

When we have a large amount of text, tokenization produces a great many Terms. Such an inverted index obviously cannot be kept entirely in memory, but storing it on disk the way MySQL does would not be efficient enough.

Term Index

So we can compromise: since the entire Term Dictionary cannot fit in memory, we can build an index over the Term Dictionary, the Term Index, and keep that in memory.

In this way the Term Dictionary can be queried efficiently, and finally the Posting List can be found through the Term Dictionary.

Disk IO is also reduced several-fold compared with MySQL's B+ tree.

This Term Index can be stored with a trie, the structure we often call a dictionary tree.

If we search for Terms beginning with j, the first step is to use the in-memory Term Index to find where Terms beginning with j sit in the Term Dictionary file (this location may be a file pointer or a range).

Then, after reading all the Terms in that range, and since they are already sorted, we can quickly locate the exact Term with a binary search, and thereby find its Posting List.
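A toy sketch of this two-step lookup (this is not Lucene's actual FST: here the Term Index is just a first-letter map over a sorted term list standing in for the on-disk Term Dictionary):

```python
import bisect

terms = sorted(["jack", "jetty", "li", "lily", "tom"])  # the "on-disk" Term Dictionary

# Term Index kept "in memory": first letter -> (start, end) range in the file.
term_index = {}
for i, t in enumerate(terms):
    start, _ = term_index.get(t[0], (i, i))
    term_index[t[0]] = (start, i + 1)

def find_term(term):
    """Step 1: the index narrows the range; step 2: binary search inside it."""
    if term[0] not in term_index:
        return -1
    lo, hi = term_index[term[0]]
    pos = bisect.bisect_left(terms, term, lo, hi)
    return pos if pos < hi and terms[pos] == term else -1

assert find_term("li") == 2      # this position would lead to a Posting List
assert find_term("bob") == -1
```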

Finally, through the location information in the Posting List, the target records can be retrieved from the original file.

More optimization

Of course, Elasticsearch also applies many targeted optimizations. When we filter on two fields, a Bitmap can be used to optimize the intersection.

For example, suppose we need to query the records with name=li and age=18; using each field's index, we first fetch the two result Posting Lists.

The simplest way is to traverse the two sets and pick out the common elements, but this is obviously inefficient.

Instead, we can store each list as a Bitmap (which also saves space) and use native bitwise operations to compute the result.

[1, 3, 5] ⇒ 10101
[1, 2, 4, 5] ⇒ 11011

In this way, ANDing the two bit arrays yields the result:

10001 ⇒ [1, 5]

Decoding this Bitmap gives the final Posting List [1, 5], which is naturally much more efficient. MySQL has no special optimization for the same kind of query: it filters by the field that leaves fewer rows first, then filters those rows by the second field, which is naturally not as efficient as ES.
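The whole Bitmap trick fits in a few lines; plain integers serve as bit arrays here (doc ids are 1-based, matching the example above):

```python
def to_bitmap(ids):
    """Encode a Posting List as an integer: bit i is set iff id i is present."""
    mask = 0
    for i in ids:
        mask |= 1 << i
    return mask

def from_bitmap(mask):
    """Decode the integer back into a sorted Posting List."""
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

name_li = to_bitmap([1, 3, 5])
age_18  = to_bitmap([1, 2, 4, 5])

# One bitwise AND replaces walking and comparing two sorted lists.
assert from_bitmap(name_li & age_18) == [1, 5]
```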

Of course, the latest versions of ES also compress the Posting List; see the official documentation for the specific compression rules, which will not be covered here.

Summary

Finally, let's sum up:

From the above we can see that even the most complex products are ultimately built from basic data structures, merely optimized for different application scenarios. So with a solid foundation in data structures and algorithms, you can quickly get started with a new technology or middleware, and even anticipate its optimization directions.

This concludes our study of why Elasticsearch queries are so fast. I hope it has resolved your doubts; pairing theory with practice is the best way to learn, so go and try it out!
