How to understand the underlying principle of MySQL index

2025-07-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains the underlying principles of MySQL indexes in detail. We hope it serves as a useful reference.

MySQL is one of the most popular databases on the Internet. The design of its storage engine and data retrieval layer matters a great deal: in particular, how MySQL stores data and how it designs its indexes determine its overall retrieval performance.

The purpose of an index is fast data retrieval, and fast retrieval ultimately comes down to the choice of data structure. Different data structures make different kinds of lookups fast. Because a database stores a large amount of data, an efficient search algorithm saves a great deal of time. For example, in the following data table, if MySQL had no index, finding id=7 could only be done by a brute-force sequential scan, which takes 7 comparisons. If the table stored 10 million rows, finding id=10,000,000 would take 10 million comparisons. That speed is unacceptable.
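The cost of an unindexed lookup can be illustrated with a minimal sketch. The table contents below (ids 1 to 7 with user names 'a' to 'g') and the function name `linear_scan` are assumptions made for illustration, not MySQL internals.

```python
def linear_scan(rows, target_id):
    """Return (row, comparisons) by checking rows one at a time, in order."""
    comparisons = 0
    for row in rows:
        comparisons += 1
        if row["id"] == target_id:
            return row, comparisons
    return None, comparisons


# Hypothetical table: ids 1..7 with user names 'a'..'g'.
rows = [{"id": i, "user_name": chr(ord("a") + i - 1)} for i in range(1, 8)]

row, comparisons = linear_scan(rows, 7)
print(row["user_name"], comparisons)  # g 7
```

The number of comparisons grows in step with the number of rows, which is the behaviour an index is meant to avoid.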

Choosing the underlying data structure for the MySQL index

Hash table (Hash)

A hash table is an effective tool for fast data retrieval.

Hash algorithm: also called a hashing algorithm, it converts an arbitrary value (key) into a fixed-length address through a hash function, and the data is then read or written at that address.

Consider a database table user that contains seven rows. We want to retrieve the row with id=7. The SQL statement is:

SELECT * FROM user WHERE id = 7;

The hash algorithm first computes the storage slot for id=7: addr = hash(7) = 4231. Slot 4231 maps to the physical address 0x77, which is where the row for id=7 is stored, and the corresponding user_name='g' can be read from that address. This is how a hash algorithm retrieves data quickly.

But the hash algorithm has a collision problem: the hash function may compute the same result for different keys. For example, hash(7) may equal hash(1999), so two different keys map to the same slot. A common way to handle collisions is separate chaining, in which the colliding entries are linked together in a list. After computing the hash value, you also need to check whether that slot has a collision list and, if it does, traverse the list until you find the entry whose key really matches.

In terms of time complexity, a hash lookup is O(1) on average, so retrieval is very fast. For example, to find the row with id=7, the hash index needs only one hash computation to locate the data. But MySQL does not use hashing as its underlying index structure. Why?
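The chaining idea can be sketched as follows. This is a toy in-memory structure, not how any MySQL engine implements hash indexes; the class name `ChainedHashIndex`, the bucket count of 8, and the use of Python's built-in `hash()` are all assumptions chosen for illustration.

```python
class ChainedHashIndex:
    """Toy hash index: each bucket holds a list (chain) of (key, value) pairs."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # Python's built-in hash() stands in for the article's hash function.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._bucket(key)
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))  # new key, possibly colliding: extend the chain

    def get(self, key):
        # Walk the chain until the entry with the real key is found.
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return None


index = ChainedHashIndex(bucket_count=8)
index.put(7, "g")
# In CPython small integers hash to themselves, and 1999 % 8 == 7 % 8 == 7,
# so these two keys land in the same bucket.
index.put(1999, "collides with 7")
print(index.get(7), "|", index.get(1999))
```

Both keys are still retrievable, but the lookup for either one now has to walk a two-entry chain, which is the extra traversal step described above.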

The reason is that range queries are a very common form of retrieval, such as the following SQL statement:

SELECT * FROM user WHERE id > 3;

The statement above asks for all rows with id > 3, which is a typical range query. With an index built on a hash algorithm, how would a range query work? Hashing destroys the ordering of the keys, so a simple approach is to load all the data into memory and then filter out the rows in the target range. But that amounts to a full scan, which is cumbersome and inefficient.

Therefore, although a hash index can retrieve a single key quickly, it cannot perform efficient range searches, so it is not suitable as MySQL's general-purpose index structure.
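A short sketch makes the contrast concrete. It compares a range lookup over an unordered hash structure (a Python dict stands in for the hash index) with the same lookup over sorted keys. The function names and the seven-row data set are assumptions for illustration.

```python
from bisect import bisect_right


def range_scan_hash(hash_index, lower):
    """A hash index keeps no key order, so every entry must be examined."""
    examined = 0
    matches = []
    for key in hash_index:
        examined += 1
        if key > lower:
            matches.append(key)
    return sorted(matches), examined


def range_scan_sorted(sorted_keys, lower):
    """With ordered keys, binary search finds the start; the rest is contiguous."""
    start = bisect_right(sorted_keys, lower)
    return sorted_keys[start:]


hash_index = {i: f"row-{i}" for i in range(1, 8)}  # hypothetical ids 1..7
matches, examined = range_scan_hash(hash_index, 3)
print(matches, examined)                         # [4, 5, 6, 7] 7
print(range_scan_sorted(sorted(hash_index), 3))  # [4, 5, 6, 7]
```

The hash version touches all seven entries to return four, while the ordered version locates the first match and reads forward, which is the property the tree structures discussed next provide.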

Binary search tree (BST)

A binary search tree is a data structure that supports fast data lookup, as shown in the figure below:

The time complexity of a lookup in a balanced binary search tree is O(log n). For example, in the tree above, retrieving id=7 takes three comparisons instead of the seven needed by a sequential scan, so retrieval is efficient. Can the binary tree also provide the range lookup that a hash index cannot?

The answer is yes. Looking at the figure above, the keys of a binary search tree are ordered: an in-order traversal visits them in ascending order from left to right. If we need the data with id > 5, we can take node 6 and its right subtree, so range search is relatively easy to implement.

However, the ordinary binary search tree has a fatal disadvantage: in extreme cases it degenerates into a linear linked list. Binary search then degrades into a sequential traversal, the time complexity becomes O(n), and retrieval performance drops sharply. For example, in the following case the binary tree is extremely unbalanced and has effectively become a linked list. Retrieving id=7 now takes 7 comparisons.

In a database, auto-increment data is very common. For example, the primary key of a table is often id, and the primary key is generally auto-incremented by default. If a plain binary tree were used as the index, the unbalanced state described above, and the linear search it causes, would be bound to occur. Because of this imbalance problem, the simple binary search tree cannot be used directly to implement the underlying index of MySQL.
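Both properties, the short search path and the easy range lookup, can be seen in a minimal binary search tree sketch. The insertion order `[4, 2, 6, 1, 3, 5, 7]` is an assumption chosen so that the tree comes out balanced; the function names are illustrative only.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Plain (unbalanced) BST insert; returns the subtree root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root


def comparisons_to_find(root, key):
    """Count the nodes visited on the way to `key` (None if absent)."""
    count = 0
    while root is not None:
        count += 1
        if key == root.key:
            return count
        root = root.left if key < root.key else root.right
    return None


def keys_greater_than(root, lower, out=None):
    """In-order walk that skips left subtrees that cannot hold keys > lower."""
    if out is None:
        out = []
    if root is not None:
        if root.key > lower:
            keys_greater_than(root.left, lower, out)
            out.append(root.key)
        keys_greater_than(root.right, lower, out)
    return out


root = None
for key in [4, 2, 6, 1, 3, 5, 7]:  # assumed order that yields a balanced tree
    root = insert(root, key)

print(comparisons_to_find(root, 7))  # 3
print(keys_greater_than(root, 5))    # [6, 7]
```

The range query returns node 6 and its right subtree, matching the description in the text.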
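The degeneration under auto-increment keys is easy to reproduce. The sketch below inserts the same seven keys into a plain binary search tree twice, once in ascending order and once in an order assumed to keep the tree balanced, and counts the comparisons needed to reach id=7. All names are illustrative.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None


def insert(root, key):
    """Iterative insert into a plain BST with no rebalancing."""
    new_node = Node(key)
    if root is None:
        return new_node
    current = root
    while True:
        if key < current.key:
            if current.left is None:
                current.left = new_node
                return root
            current = current.left
        else:
            if current.right is None:
                current.right = new_node
                return root
            current = current.right


def comparisons_to_find(root, key):
    """Count the nodes visited on the way to `key` (None if absent)."""
    count = 0
    while root is not None:
        count += 1
        if key == root.key:
            return count
        root = root.left if key < root.key else root.right
    return None


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


sequential = build(range(1, 8))          # auto-increment order: 1, 2, ..., 7
shuffled = build([4, 2, 6, 1, 3, 5, 7])  # an order that happens to stay balanced
print(comparisons_to_find(sequential, 7))  # 7
print(comparisons_to_find(shuffled, 7))    # 3
```

With ascending inserts every new key becomes the right child of the previous one, so the tree is a linked list in all but name.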

AVL tree and red-black tree

Because the binary search tree can become unbalanced, researchers proposed rotating and adjusting tree nodes automatically so that the tree always stays roughly balanced and keeps its best search performance. The self-balancing binary trees built on this idea include the AVL tree and the red-black tree.

First, a brief introduction to the red-black tree, a tree structure that adjusts its own shape automatically. When the tree becomes unbalanced, the red-black tree rotates nodes left or right and recolors them to restore approximate balance, which keeps lookups at O(log n) and ensures that search efficiency does not drop significantly. For example, if keys are inserted in ascending order from 1 to 7, an ordinary binary search tree degenerates into a linked list, but the red-black tree keeps adjusting its shape to stay roughly balanced, as shown in the following figure. In that red-black tree, finding id=7 requires comparing 4 nodes, which is still good binary-tree search efficiency.

The red-black tree has good average search efficiency and no extreme O(n) case, so can it be used as the underlying index of MySQL? In fact, the red-black tree also has some problems. Take a look at the following example.

When keys 1 to 7 are inserted into a red-black tree in order, finding id=7 requires comparing 4 nodes.

When keys 1 to 16 are inserted in order, finding id=16 requires 6 comparisons. Looking at the shape of the tree, notice that when data is inserted sequentially the tree always leans to the right. Fundamentally, the red-black tree does not completely eliminate the right-leaning tendency of the binary search tree. The lean is far milder than the degradation into a linear linked list, but primary keys in a database are typically auto-incremented, and a table commonly holds millions or tens of millions of rows. At that scale the extra levels still cost search performance, and a database cannot afford this needless waiting.

Now consider a more strictly balanced binary tree, the AVL tree. Because the AVL tree enforces a strict balance condition, it spends more effort on adjusting the shape of the tree.

When keys 1 to 7 are inserted into an AVL tree in order, finding id=7 requires comparing 3 nodes.

When keys 1 to 16 are inserted in order, finding id=16 requires comparing 5 nodes (a binary tree with 16 keys needs at least five levels, since four levels hold at most 15 keys). In terms of search efficiency, the AVL tree is faster than the red-black tree here (5 comparisons versus 6). Judging from the shape of the tree, the AVL tree does not lean to the right the way the red-black tree does. In other words, a large number of sequential insertions does not degrade query performance, which addresses the red-black tree's problem.
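The AVL behaviour under sequential inserts can be checked with a compact sketch. This is a textbook-style AVL insert written for illustration, not code taken from any database. Note that with 16 keys any binary tree needs at least five levels, so the sketch reports 5 comparisons for id=16.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x


def insert(node, key):
    """Insert `key` and rebalance on the way back up; returns the subtree root."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:                 # left-heavy
        if key >= node.left.key:    # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:                # right-heavy
        if key < node.right.key:    # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def comparisons_to_find(root, key):
    """Count the nodes visited on the way to `key` (None if absent)."""
    count = 0
    while root is not None:
        count += 1
        if key == root.key:
            return count
        root = root.left if key < root.key else root.right
    return None


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


print(comparisons_to_find(build(range(1, 8)), 7))    # 3
print(comparisons_to_find(build(range(1, 17)), 16))  # 5
```

Sequential inserts of 1 to 7 and of 1 to 15 both produce perfectly full trees, which is why the path to the largest key stays as short as the key count allows.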

To summarize the advantages of the AVL tree:

Good lookup performance (O(log n)), with no extremely inefficient worst case.

Supports range search and sorted data.

The AVL tree looks like a good data structure for lookups, but it is still not suitable as the index structure of a MySQL database. Consider the following problem:

The bottleneck of a database query is disk IO. In an AVL tree each node stores only one key, so each disk IO loads only one node into memory. For example, to query id=7 we have to perform three disk IOs, which is time-consuming. So when designing a database index, the first consideration is how to minimize the number of disk IOs.

Disk IO has one important property: because the disk is read in blocks, reading 1 B takes about as long as reading 1 KB. Based on this, we can store as many keys as possible in each tree node so that a single disk IO loads more data into memory. This is the design principle behind the B-tree and the B+ tree.

B tree

In the following B-tree, each node stores at most two keys, and a node splits automatically when it holds more than two. For example, in the B-tree below, which stores seven keys, only two nodes need to be visited to locate id=7. In other words, two disk IOs are enough to find the data, which is better than the AVL tree.

The following is a B-tree that stores 16 keys, again with at most 2 keys per node. Querying id=16 requires visiting and comparing 4 nodes, that is, 4 disk IOs. That looks about the same as the query performance of the AVL tree.

But since a disk IO takes about the same time whether it reads one record or 100 records, the optimization is to read as much data into memory as possible in each disk IO. In terms of tree structure, this means increasing the number of keys that each node can store.

When the limit for a single node is raised to 6 keys, a B-tree storing 7 keys needs 2 disk IOs to query id=7.

A B-tree storing 16 keys also needs only 2 disk IOs to query id=7. That is at most half of the disk IOs needed by the AVL tree.
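The effect of node capacity on tree height can be estimated with simple arithmetic. A tree whose nodes each hold up to k keys has up to k + 1 children per node, so h completely full levels hold (k + 1)^h - 1 keys. The sketch below computes the smallest h for a given key count. It is a best-case bound that assumes every node is full, so a tree built by sequential inserts and splits may be taller. The function name `min_levels` and the figure of 1000 keys per node are assumptions for illustration.

```python
def min_levels(total_keys, max_keys_per_node):
    """Smallest number of levels h with (max_keys_per_node + 1) ** h - 1 >= total_keys."""
    fan_out = max_keys_per_node + 1
    levels = 0
    capacity = 0
    while capacity < total_keys:
        levels += 1
        capacity = fan_out ** levels - 1
    return levels


print(min_levels(16, 1))             # 5  (binary tree: one key per node)
print(min_levels(16, 6))             # 2  (six keys per node)
print(min_levels(10_000_000, 1))     # 24
print(min_levels(10_000_000, 1000))  # 3
```

If each level costs one disk IO, going from one key per node to about a thousand keys per node reduces a lookup among 10 million keys from roughly 24 reads to 3 in this idealized model.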

Therefore, the B-tree is a very good choice of data structure for a database index. To sum up, the B-tree as a database index has the following advantages:

Excellent retrieval speed. The search cost of a B-tree is O(h * log n), where h is the height of the tree and n is the number of keys per node.

Retrieval needs as few disk IOs as possible.

Range lookup is supported.

B+ tree

What is the difference between a B-tree and a B+ tree?

First, a B-tree stores row data in every node, while the internal nodes of a B+ tree store only index entries (keys and child addresses). A B-tree node therefore cannot hold many entries, whereas a B+ tree node can hold many index entries, and all of the data is stored in the leaf nodes of the B+ tree.

Second, the leaf nodes of the B+ tree are connected to one another by a linked list, which makes range search easier.

Comparing the two, the internal nodes of a B+ tree store only index entries, so even when the capacity of a single node is limited, one node can hold a large number of them. This reduces the height of the whole B+ tree and therefore the number of disk IOs. In addition, the leaf nodes of the B+ tree are where the real data is stored, and because they are linked together in key order, range lookups are more efficient. For these reasons the MySQL index is based on the B+ tree, which performs very well for both point lookups and range lookups.
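The two B+ tree properties, keys-only internal nodes and linked leaves, can be sketched with a small hand-built tree. Node insertion and splitting are deliberately left out, and the class names, the three-leaf layout, and the row values are all assumptions for illustration.

```python
from bisect import bisect_right


class Leaf:
    """Leaf page: sorted keys with their row data, plus a link to the next leaf."""

    def __init__(self, keys, rows):
        self.keys, self.rows, self.next = keys, rows, None


class Internal:
    """Internal page: only separator keys and child pointers, no row data."""

    def __init__(self, separators, children):
        self.separators, self.children = separators, children


def find_leaf(node, key):
    # children[i] holds keys below separators[i]; the last child holds the rest.
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.separators, key)]
    return node


def range_scan(root, lower):
    """Yield (key, row) for every key > lower by walking the leaf chain."""
    leaf = find_leaf(root, lower)
    while leaf is not None:
        for key, row in zip(leaf.keys, leaf.rows):
            if key > lower:
                yield key, row
        leaf = leaf.next


# Hand-built two-level tree over hypothetical ids 1..7.
leaf_a = Leaf([1, 2, 3], ["a", "b", "c"])
leaf_b = Leaf([4, 5, 6], ["d", "e", "f"])
leaf_c = Leaf([7], ["g"])
leaf_a.next, leaf_b.next = leaf_b, leaf_c
root = Internal([4, 7], [leaf_a, leaf_b, leaf_c])

print(find_leaf(root, 7).rows)                  # ['g']
print([key for key, _ in range_scan(root, 3)])  # [4, 5, 6, 7]
```

The range scan descends the tree once to find the starting leaf and then only follows `next` links, without returning to the internal nodes.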

Implementation of the InnoDB and MyISAM engines

The storage engines of MySQL are designed as plug-ins, and the most common are InnoDB and MyISAM. Users can choose the engine for each table according to their needs. We have just seen that the B+ tree is well suited as the MySQL index structure, but how the data and the indexes are organized also requires design decisions. Different design choices led to InnoDB and MyISAM, each with its own performance characteristics.

Although MyISAM has excellent lookup performance, it does not support transactions. The most important features of InnoDB are that it supports ACID-compliant transactions and row-level locks. MySQL lets you specify the engine when creating a table. For example, MyISAM can be specified as the engine for the user table and InnoDB for the user2 table.

After the two tables are created, the following files appear on disk, showing that the two engines organize data and indexes differently.

The files generated by InnoDB after creating the table are:

.frm: the table definition

.ibd: the table data together with its indexes

The files generated by MyISAM after creating the table are:

.frm: the table definition

.MYD: the table data file (MyISAM data)

.MYI: the table index file (MyISAM index)

The generated files show that the two engines organize their data and indexes differently. The MyISAM engine separates the data from the index, each in its own file, which is called a nonclustered index. The InnoDB engine puts the data and the index in the same file, which is called a clustered index. Next, we look at how each engine builds on the B+ tree in its underlying implementation.

The underlying implementation of the MyISAM engine (nonclustered index)

MyISAM uses a nonclustered index, with the data and the index stored in two different files. When the table is built, MyISAM uses the primary key as the KEY of the primary index B+ tree, and the leaf nodes of the tree store the physical address of the corresponding row. With this physical address, the record can be located directly in the MyISAM data file.

When we add an index on another field, an index tree is generated for that field as well, and its leaf nodes also record the physical address of the corresponding row. That address is then used in the same way to locate the record in the data file.
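The MyISAM layout can be modelled with a minimal in-memory sketch. A Python list stands in for the .MYD data file, list positions stand in for physical addresses, and plain dicts stand in for the B+ trees of the .MYI file. The sample rows and all names are assumptions made for illustration, not MyISAM's actual file format.

```python
# Stand-in for the data file: a row's "physical address" is its list position.
data_file = [
    {"id": 15, "user_name": "Bob"},
    {"id": 18, "user_name": "Alice"},
    {"id": 20, "user_name": "Carol"},
]

# Stand-ins for the index trees; both map a key straight to a row address.
primary_index = {row["id"]: address for address, row in enumerate(data_file)}
user_name_index = {row["user_name"]: address for address, row in enumerate(data_file)}


def lookup(index, key):
    """One index probe yields an address; one read of the data file yields the row."""
    address = index.get(key)
    return None if address is None else data_file[address]


print(lookup(primary_index, 15))
print(lookup(user_name_index, "Alice"))
```

In this model the primary index and the index on user_name have the same shape: each leaf value is an address into the data file, so both lookups take the same two steps.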

The underlying implementation of the InnoDB engine (clustered index)

InnoDB uses a clustered index, so data and indexes are stored in the same file. First, InnoDB builds a B+ tree index with the primary key ID as the KEY, as shown on the left of the following figure, and the leaf nodes of this B+ tree store the row data for each primary key ID. For example, when executing the statement select * from user_info where id=15, InnoDB searches the primary key B+ tree and finds the corresponding user_name='Bob'.

This is why InnoDB automatically builds the primary key index tree when creating a table, and why it is recommended to specify a primary key for every table (if none is defined, InnoDB uses the first unique NOT NULL index, or generates a hidden row ID, as the clustered key). How does InnoDB build an index tree when we index another field in the table? For example, if we index the user_name field, InnoDB builds a user_name B+ tree whose nodes store the user_name KEY and whose leaf nodes store the primary key KEY. Note that the leaf stores the primary key KEY, not the row! After finding the primary key KEY in the user_name index tree, InnoDB goes to the primary key index tree and uses that KEY to find the corresponding row.
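The two-step lookup through a secondary index can be sketched in memory. Dicts stand in for the B+ trees, and the sample rows and function names are assumptions for illustration, not InnoDB's page format.

```python
# Clustered (primary key) index: the leaf entry is the full row.
clustered_index = {
    15: {"id": 15, "user_name": "Bob"},
    18: {"id": 18, "user_name": "Alice"},
}

# Secondary index on user_name: the leaf entry is only the primary key value.
user_name_index = {row["user_name"]: pk for pk, row in clustered_index.items()}


def find_by_id(pk):
    """Single lookup: the clustered index returns the row itself."""
    return clustered_index.get(pk)


def find_by_user_name(name):
    pk = user_name_index.get(name)  # step 1: secondary index -> primary key
    if pk is None:
        return None
    return find_by_id(pk)           # step 2: primary key -> row in the clustered index


print(find_by_id(15))
print(find_by_user_name("Bob"))
```

Both calls return the same row, but the lookup by user_name passes through two index structures, while the secondary index itself stores nothing larger than a primary key per entry.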

The question is, why does InnoDB store the row data only in the leaf nodes of the primary key index tree and not in the other index trees, so that a lookup has to find the primary key first and then fetch the row from the primary key index tree?

The answer is simple: InnoDB needs to save storage space. A table may have many indexes, and InnoDB generates an index tree for each indexed field. If the index tree of every field stored the row data, the index data file of the table would become very large, because the data would be duplicated many times over. From the point of view of disk space, there is no need to store the row data in every index tree. This seemingly superfluous extra step saves a huge amount of disk space at the cost of a small amount of query performance, and that trade is worth it.

When InnoDB and MyISAM are compared, it is often said that MyISAM query performance is better, and the index and data file design above suggests why. MyISAM can locate the record directly once it has found the physical address, whereas InnoDB, after reaching the leaf node of a secondary index, must search the primary key index tree again to locate the row. MyISAM finds the data in one step where InnoDB takes two, so such queries are faster in MyISAM.

This article first discussed which data structure is best suited to implement the underlying index of MySQL, and then introduced the underlying implementations of MyISAM and InnoDB, the two classic storage engines of MySQL. Finally, let us summarize when you should index the fields in a table:

Fields that are more frequently used as query criteria should be indexed

Fields with few distinct values (low uniqueness) are not suitable for a separate index, even if the field is frequently used as a query condition

Fields that are updated very frequently are not suitable for index creation.

That concludes this introduction to the underlying principles of MySQL indexes. We hope the content above is helpful.