What is the underlying implementation principle of index in MySQL? 07/16 Update SLTechnology News&Howtos

What is the underlying implementation principle of index in MySQL?

2025-07-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article shows you what the underlying implementation principle of the index in MySQL is, the content is concise and easy to understand, and it will definitely brighten your eyes. I hope you can get something through the detailed introduction of this article.

The underlying implementation principle of MySQL Index

MySQL officially defines index as: index (Index) is a data structure that helps MySQL to obtain data efficiently. Extract the trunk of the sentence and you can get the essence of the index: the index is the data structure.

We know that database query is one of the most important functions of the database. We all want to query data as fast as possible, so the designer of the database system will optimize it from the point of view of query algorithm. Of course, the most basic query algorithm is sequential search (linear search). This algorithm with O (n) complexity is obviously bad when there is a large amount of data. fortunately, the development of computer science provides many better search algorithms, such as binary search (binary search), binary tree search (binary tree search) and so on.

If you analyze it a little bit, you will find that each search algorithm can only be applied to a specific data structure, for example, binary search requires that the retrieved data is ordered, while binary tree search can only be applied to binary search trees. but the organizational structure of the data itself cannot fully satisfy a variety of data structures (for example, it is theoretically impossible to organize both columns sequentially at the same time), so outside the data The database system also maintains data structures that meet specific lookup algorithms, which refer to (point to) data in some way, so that advanced lookup algorithms can be implemented on these data structures. This kind of data structure is the index.

The figure above shows one possible way of indexing. On the left is the data table, which has two columns and seven records, and the leftmost one is the physical address of the data record (note that logically adjacent records are not necessarily physically adjacent on disk). To speed up the Col2 lookup, you can maintain a binary lookup tree shown on the right, with each node containing an index key and a pointer to the physical address of the corresponding data record, so that you can use binary lookup to obtain the corresponding data within the complexity of (2).

Although this is a genuine index, the actual database system is rarely implemented using a binary search tree or its evolved red black tree (red-black tree), for reasons described below.

Binary sort tree

Before introducing the B tree, let's take a look at another magical tree-binary sort Tree (Binary Sort Tree). First of all, it is a tree. The description of "binary" is already obvious, that is, a branch on the tree is split into two forks, so it is recursively a binary tree (shown in the following figure). The nodes on this tree are sorted, and the specific sorting rules are as follows:

If the left subtree is not empty, the value of all nodes on the left subtree is less than the value of its root node.

If the right subtree is not empty, the value of all nodes on the right word count is greater than the value of its root node.

Its left and right subtrees are also binary ordinal numbers (recursively defined).

As can be seen from the figure, when the binary sort tree organizes data, it is more convenient to find, because each time it passes through a node, the possibility can be reduced by half at most, but in extreme cases, all nodes are on the same side. intuitively, it is a straight line, so the efficiency of this kind of query is relatively low, so it is necessary to balance the height of the left and right subtrees of the binary tree. So there is a balanced binary tree (Balenced Binary Tree).

The so-called "balance" means that the height of each branch of the tree is uniform, and the absolute difference between the height of the left subtree and the right subtree is less than 1, so that there will not be a particularly long branch. Therefore, when searching in such a balanced tree, the total number of nodes compared does not exceed the height of the tree, which ensures the efficiency of the query (the time complexity is O (logn)).

B tree

Or look at the picture more clearly, the picture shows that B-tree is actually a balanced multi-fork search tree, that is to say, up to m forks (m > = 2), we call it m-order b-tree, in order to reflect the conscience of this blog, different from other places can see the second-order B-tree, here specially drew a fifth-order B-tree.

In general, the m-order B-tree satisfies the following conditions:

Each node can have at most m subtrees.

The root node has at least two nodes (or in extreme cases, one root node for a tree, a single-celled organism, that is, a root, a leaf, and a tree).

For non-root and non-leaf nodes, there are at least three Ceil subtrees (Ceil denotes an upward rounding, and the fifth-order B-tree in the graph, each node has at least three subtrees, that is, at least three forks).

The information in the non-leaf node includes [nrecom A0 Magi K1 Magi A1 Jr K2, … , Kn,An], where n represents the number of keywords saved in the node, K is the keyword and Ki

Every path from the root to the leaf has the same length, that is, the leaf nodes are on the same layer, and these nodes have no information, in fact, these nodes indicate that the specified value cannot be found, that is, the pointer to these nodes is empty.

The query process of B-tree is similar to that of binary sort tree, comparing each node in turn from the root node, because the keywords and left and right subtrees in each node are ordered, so as long as you compare the keywords in the node, or you can quickly find the specified keyword along the pointer, if the search fails, it will return the leaf node, that is, null pointer.

Starting from the root node P, K is positioned before P and enters the left pointer.

In the left subtree, comparing C, F, J and M in turn, it is found that K is between J and M.

Follow the pointer between J and M, continue to access the subtree, and compare it in turn, and find that the first keyword K is the value of the specified lookup.

The simple pseudo algorithm for B-tree search is as follows:

BTree_Search (node, key) {

If (node = = null) return null

Foreach (node.key)

{

If (node.key [I] = = key) return node.data [I]

If (node.key [I] > key) return BTree_Search (Point [I]-> node)

}

Return BTree_Search (Point [I + 1]-> node)

}

Data = BTree_Search (root, my_key)

The characteristics of B-tree can be summarized as follows:

Keyword sets are distributed throughout the tree.

Any keyword appears and only appears in one node.

The search may end at a non-leaf node.

Its search performance is equivalent to doing a binary search in the keyword set.

When B-tree inserts and deletes new data records, it will destroy the nature of B-Tree, because when inserting and deleting, it needs to split, merge, transfer and other operations to maintain the B-Tree nature of the tree.

Plus version-B+ Tree

As an enhanced version of B-tree, the difference between B + tree and B-tree is:

The nodes with n subtrees contain n keywords (some also think that it is the keyword of n Mel).

All keywords are stored on the leaf node, and the leaf node itself is connected from small to large according to the keyword.

A non-leaf node can be regarded as the index part, and the node contains only the largest (or smallest) keywords in its subtree (root node).

The search process of a B + tree is similar to that of a B tree, except that if the keyword on a non-leaf node is equal to a given value, it does not terminate, but continues to follow the pointer until the leaf node is located. Therefore, in the B+ tree, regardless of whether the search is successful or not, each search takes a path from the root to the leaf node.

The characteristics of the B+ tree are as follows:

All keywords are stored on the leaf section, and the keywords in the linked list happen to be ordered.

It is impossible for a non-leaf node to hit and return.

A non-leaf node is equivalent to an index of a leaf node, and a leaf node is equivalent to a data layer that stores (keywords) data.

It is more suitable for file indexing system.

B+Tree with sequential access pointers

The B+Tree structures commonly used in database systems or file systems are optimized on the basis of classical B+Tree, adding sequential access pointers.

As shown in the figure above, adding a pointer to the adjacent leaf node in each leaf node of the B+Tree forms a B+Tree with sequential access pointers. The purpose of this optimization is to improve the performance of interval access. For example, in figure 4, if you want to query all data records whose key is from 18 to 49, when 18 is found, you can access all data nodes at once simply by traversing the node and pointer order, which greatly mentions the efficiency of interval query.

Comparison between binary tree and red-black tree

From the above, we find that the red-black tree has made some progress compared with the binary tree, but the red-black tree still has some problems: if the amount of data is large, the depth of the red-black tree will be very deep, that is to say, the depth is uncontrollable. in this way, it will be time-consuming to find data.

From the above, we find that BTree has improved a little bit, the query speed has increased, and the storage capacity has not been affected. Of course, some people may think this way, so why don't we put all the data in one node, so the depth is 1?

Of course not! java takes data like this: java program-> CPU--- > memory-> hard disk, while the interaction between memory and hard disk is limited in size, which is about 4k per page of data, so you can't get all the data in one node. Generally speaking, the node will try to store 4K capacity in advance.

See here, we know (4K = node; node = small node * capacity of small node) A node is 4K, and there are several small nodes in the node, that is to say, as long as the data capacity of each of our small nodes is smaller, then there will be more nodes that can be saved.

B+Tree increases the degree (small nodes) by not placing the data on non-leaf nodes, usually more than a hundred to a depth of 3 to 5, thus reducing the number of queries. Moreover, there will be pointers between leaf nodes, and the data is increasing, which makes our range search can be found through pointer connection, instead of looking down from the upper node one by one.

What is the underlying implementation principle of indexing in MySQL? have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.