Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Differences between MySQL Index and VS ElasticSearch Index

2025-03-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)06/01 Report--

This article mainly introduces the difference between MySQL index VS ElasticSearch index, which has certain reference value, and friends who need it can refer to it. I hope you all have a lot to gain after reading this article. Let's take a look at it together.

preface

During this period of time, I was maintaining the search function of the product. Every time I saw elasticsearch's efficient query efficiency in the admin desk, I was curious how he did it.

This is even faster than querying by primary key locally using MySQL.

So I searched for relevant information:

There are many answers to these questions online. The general meaning is as follows:

ES is a full-text search engine based on Lucene, which saves indexes after data segmentation. It is good at managing a large amount of index data, but is not good at updating data and associated queries frequently compared to MySQL.

It is not very thorough, there is no relevant principle of analysis; however, since the index is repeatedly mentioned, we will compare the differences between the two from the perspective of the index.

MySQL index

Starting with MySQL, the word index must be familiar to everyone, usually existing in some query scenarios, is a typical case of space for time.

The following is an example of the Innodb engine. Copy code Common data structures

Assuming we design MySQL indexes ourselves, what are our options?

hash table

The first thing we should think of is hash tables, which are a very common and efficient query, write data structure, corresponding to Java is HashMap

This data structure should not need too much introduction, its write efficiency is very high O(1), for example, we want to query id=3 data, we need to hash 3, and then find the corresponding position in this array can be.

However, if we want to query interval data such as 1≤id≤6, the hash table cannot be well satisfied, because it is disordered, so we have to traverse all the data to know which data belong to this interval.

sorted arrays

Ordered array query efficiency is also very high, when we want to query id=4 data, only need to find the data O(logn) efficiently through binary search.

At the same time, since the data is also ordered, it naturally supports interval queries; so it seems that ordered arrays are suitable for indexing.

No, it has another big problem; if we insert data with id=2.5, we have to shift all subsequent data by one bit at the same time, which becomes very inefficient.

balanced binary tree

Since the writing efficiency of ordered arrays is not high, then we will look at the writing efficiency, it is easy to think of binary trees; here we take balanced binary trees as an example:

Due to the balanced binary tree characteristics:

The left node is smaller than the parent node, and the right node is larger than the parent node.

So suppose we want to query the data with id=11, we only need to query 10->12->11 to finally find the data, the time complexity is O(logn), and the same is true when writing data.

However, it still cannot support interval range search very well. Suppose we want to query the data with 5≤id≤20, we need to query the left subtree of 10 nodes first and then query the right subtree of 10 nodes to finally query all the data.

This leads to inefficient queries.

skip list

Skip tables may not be as common as the hash tables, ordered arrays, and binary trees mentioned above, but in fact the sort set in Redis is implemented using skip tables.

Here we briefly describe the advantages of the data structure implemented by the drop-down table.

We all know that even querying an ordered list is inefficient, because it cannot use array subscripts for binary search, so the time complexity is o(n)

But we can also skillfully optimize the linked list to achieve binary search in disguise, as shown below:

We can extract the first level index and the second level index for the lowest level of data, and we can extract the N level index according to the different amount of data.

When we query, we can use the index here to achieve a binary search in disguise.

Suppose you want to query the data with id=13, you only need to traverse 1->7->10->13 to query the data. When the number is larger, the efficiency will be more obvious.

At the same time, interval query is also supported. Similar to querying a single node just now, you only need to query to the starting node, and then traverse back (linked list order) to the target node to query the entire range of data.

At the same time, since we do not store real data on the index, just a pointer, relative to the lowest level of data stored in the linked list for the space occupied can be ignored.

Optimization of Balanced Binary Tree

However, Innodb in MySQL does not use jump tables, but uses a data structure called B+ tree.

This data structure is not like a binary tree, which university teachers often talk about as a basic data structure, because this kind of data structure is evolved from the basic data structure according to the demand scenario in actual engineering.

For example, the B+ tree here can be thought of as evolving from a balanced binary tree.

Just now we mentioned that the interval query efficiency of binary tree is not high, so we can optimize it for this point:

After optimization on the basis of the original binary tree: all non-leaves do not store data, but as the index of leaf nodes, all data are stored in leaf nodes.

In this way, the data of all leaf nodes are stored in order, which can support interval queries well.

You just need to query to the location of the starting node first, and then traverse through the leaf nodes in turn.

When the amount of data is large, it is obvious that the index file cannot be stored in memory, although the speed is fast but the consumption of resources is not small; so MySQL will store the index file directly on disk.

This is slightly different from elasticsearch's index mentioned later.

Since the index is stored on disk, we want to reduce IO to disk as much as possible (disk IO efficiency is not the same order of magnitude as memory)

As can be seen from the above figure, we have to query a piece of data at least 4 times IO, it is obvious that the number of IO is closely related to the height of the tree, the lower the tree height, the fewer the number of IO times, and the better the performance.

How can we reduce the height of trees?

We can try to change the binary tree into a ternary tree, so that the height of the tree will drop a lot, so that the IO times when querying the data will naturally decrease, and the query efficiency will also increase a lot.

This is actually the origin of B+ trees.

Some suggestions for using indexes

In fact, through the understanding of B+ tree in the above picture, we can also optimize some small details of daily work; for example, why do we need to increase orderly?

Assuming that the primary key data we write is out of order, it is possible that the id of the data written later is smaller than that written before, so it is possible to move the data already written when maintaining the B+ tree index.

If the data is written incrementally, there is no such consideration, and it is only necessary to write in sequence each time.

Therefore, we will require the database primary key to be increasing trend as much as possible. The most reasonable way to increase the primary key is to increase the primary key automatically without considering the situation of sub-tables.

Overall, the idea is similar to that of skipping tables, except that relevant adjustments have been made for the use scenarios (for example, all data is stored in leaf nodes).

ES index

MySQL chat is over, now let's see how Elasticsearch uses indexes.

forward index

ES uses a data structure called inverted index; before we talk about inverted index, let's talk about its opposite, positive index.

The above figure is an example. The way we can query specific objects through doc_id is called using a positive index. In fact, it can also be understood as a hash table.

The key is used to find the value.

For example, doc_id=4 can quickly query the name=jetty wang,age=20 data.

inverted index

So if I want to query the data that contains li in name, what is it? How to query efficiently?

Just by the positive index mentioned above obviously does not play any role, only after traversing all the data in turn to determine whether the name contains li ; this is very inefficient.

But if we reconstruct an index structure:

When you want to query the data containing li in name, you only need to query the data contained in Posting List through this index structure, and then query the final data through mapping.

This index structure is actually an inverted index.

Term Dictionary

But how to efficiently query li in this index structure? Combined with our previous experience, as long as we arrange Term in order, we can use the binary tree search tree data structure to query the data under o(logn).

The process of splitting a text into separate terms is actually what we often call participles.

All the terms are combined into a term Dictionary, also known as a word dictionary.

English word segmentation is relatively simple, only need to separate the text by spaces, punctuation marks can be separated words, Chinese is relatively complex, but there are also many open source tools to do support (because it is not the focus of this article, interested in word segmentation can search for themselves).

When our text volume is huge, the Term after the participle will also be a lot, such an inverted index data structure if stored in memory that is certainly not enough to save, but if stored in disk like MySQL, efficiency is not so high.

Term Index

So we can choose a compromise method, since we can't put the whole Term Dictionary in memory, we can create an index for Term Dictionary and put it in memory.

In this way, you can query Term Dictionary efficiently, and finally query Posting List through Term Dictionary.

It also reduces disk IO several times compared to B+ trees in MySQL.

This Term Index can be stored using such a Trie tree, which is what we often call a dictionary tree.

For more information on the dictionary tree, please see here.

If we are searching for a Term starting with j, the first step is to find out where the Term starting with j is in the Term Dictionary file by looking up the Term Index in memory (this position can be a file pointer, possibly an interval range).

Immediately after taking out all the Terms in this position interval, since the order has been arranged, you can quickly locate the specific position by binary search; this way you can query the Posting List.

Finally, the target data can be retrieved from the original file through the location information in the Posting List.

more optimization

Of course ElasticSearch also does a lot of targeted optimization, when we search for two fields, we can use bitmap optimization.

For example, now we need to query the data of name=li and age=18. In this case, we need to take out the respective result Posting List through these two fields.

The easiest way is to traverse both sets separately and extract duplicate data, but this is obviously inefficient.

Then we can use bitmap storage (also save storage space), while using the innate bit and ** calculation can be obtained. **

[1, 3, 5] ⇒ 10101

[1, 2, 4, 5] ⇒ 11011

The sum of these two binary arrays yields the result:

10001 ⇒ [1, 5]

[1, 5] is the most common way to solve this problem.

The same query requirements are not specially optimized in MySQL, but the data with small data volume is filtered first and then the second field is filtered. The efficiency is naturally not as high as ES.

Of course, in the latest version of ES, the Posting List will also be compressed. The specific compression rules can be viewed in the official documentation, so there is no specific introduction here.

Thank you for reading this article carefully, hope Xiaobian share MySQL index VS ElasticSearch index difference content to help everyone, but also hope that everyone more support, pay attention to industry information channels, encounter problems to find, detailed solutions waiting for you to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report