Why the index structure of MySQL uses B+ tree 07/15 Update SLTechnology News&Howtos

Why the index structure of MySQL uses B+ tree

2025-07-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/02 Report--

This article mainly introduces "why MySQL index structure uses B + tree". In daily operation, I believe that many people have doubts about why MySQL index structure uses B + tree. Xiaobian consulted all kinds of data and sorted out simple and easy-to-use methods of operation. I hope it will be helpful to answer the question of "why MySQL index structure uses B + tree". Next, please follow the editor to study!

Preface

In MySQL, both Innodb and MyIsam use the B+ tree as the index structure (other indexes such as hash are not considered here). This article will start with the most common binary search tree, and gradually explain the problems solved by various trees and the new problems they face, so as to explain why MySQL chooses B+ tree as its index structure.

Catalogue

Unbalanced and binary search trees (BST)

2. Balanced binary tree (AVL): rotation time-consuming

The red and black tree: the tree is too high

IV. B-tree: born for disk

5. B + tree

6. Feel the power of B + tree

VII. Summary

Unbalanced and binary search trees (BST)

Binary search tree (BST,Binary Search Tree), also known as binary sort tree, needs to be satisfied on the basis of binary tree: the value of all nodes on the left subtree of any node is not greater than the value of the root node, and the value of all nodes on the right subtree of any node is not less than the value of the root node. The following is a BST (picture source).

Storing data in BST is a common choice when quick lookups are needed, because the query time depends on the height of the tree, and the average time complexity is O (lgn). However, the BST may grow crooked and become unbalanced, as shown in the following figure (image source), when the BST is reduced to a linked list and the time complexity is reduced to O (n).

In order to solve this problem, balanced binary tree is introduced.

2. Balanced binary tree (AVL): rotation time-consuming

The AVL tree is a strictly balanced binary tree, and the height difference between the left and right subtrees of all nodes can not exceed 1 lgn AVL tree search, insertion and deletion is O (AVL) in average and worst case.

The key to balancing the AVL is the rotation operation: insertions and deletions may upset the balance of the binary tree, which requires one or more tree rotations to rebalance the tree. When inserting data, you only need at most one rotation (single rotation or double rotation); but when deleting data, it will cause a tree imbalance, and AVL needs to maintain the balance of all nodes on the path from the deleted node to the root node, with a rotation of the order of magnitude O (lgn).

Because of the time-consuming rotation, the AVL tree is inefficient in deleting data; when there are more delete operations, the cost of maintaining balance may outweigh its benefits, so AVL is not widely used in practice.

The red and black tree: the tree is too high

Compared with AVL trees, red-black trees do not pursue a strict balance, but a general balance: just make sure that the longest possible path from root to leaf is no more than twice the length of the shortest possible path. From the perspective of implementation, the biggest feature of the red-black tree is that each node belongs to one of two colors (red or black), and the division of node color needs to meet specific rules (specific rules). Examples of red-black trees are as follows (picture source):

Compared with the AVL tree, the query efficiency of the red-black tree is lower, because the tree is less balanced and higher in height. However, the deletion efficiency of the red-black tree is greatly improved, because the red-black tree introduces color at the same time, when inserting or deleting data, only O (1) rotation and discoloration are needed to ensure the basic balance, and there is no need for O (lgn) rotation like the AVL tree. Generally speaking, the statistical performance of red-black trees is higher than that of AVL.

Therefore, in practical applications, the use of AVL trees is relatively small, while red-black trees are widely used. For example, TreeMap in Java uses red-black trees to store sort key-value pairs; HashMap in Java8 uses linked lists + red-black trees to solve hash conflicts (linked lists are used when there are fewer conflicting nodes, and red-black trees are used when there are more conflicting nodes).

For cases where the data is in memory (such as TreeMap and HashMap above), the performance of the red-black tree is excellent. But for data in secondary storage devices such as disks (such as databases such as MySQL), red-black trees are not good at it because they are still too tall. When the data is on disk, disk IO will become the biggest performance bottleneck, and the design goal should be to minimize the number of IO; and the higher the height of the tree, the more times of IO needed to add, delete, modify and query, which will seriously affect performance.

IV. B-tree: born for disk

B-tree, also known as B-tree (where-is not a minus sign), is a multi-channel balanced search tree designed for disks and other auxiliary storage devices. compared with binary trees, each non-leaf node of B-tree can have multiple subtrees. Therefore, when the total number of nodes is the same, the height of B-tree is much smaller than that of AVL-tree and red-black tree (B-tree is a "short fat man"), and the number of disk IO is greatly reduced.

The most important concept of defining a B-tree is the Order. For an m-order B-tree, the following conditions need to be met:

Each node contains a maximum of m child nodes. If the root node contains child nodes, it contains at least 2 child nodes; except for the root node, each non-leaf node contains at least 2 child nodes. A non-leaf node with k child nodes will contain k-1 records. All leaf nodes are in the same layer.

It can be seen that the definition of B-tree is mainly limited to the number of child nodes and records of non-leaf nodes.

The following is an example of a third-order B-tree (image source):

The advantage of B-tree is not only the height of the tree, but also the use of the principle of access locality. The so-called locality principle means that when a data is used, the data near it has a high probability of being used in a short time. B-tree stores data with similar keys on the same node, and when accessing one of the data, the database reads the entire node into the cache; when its adjacent data is then accessed, it can be read directly in the cache without the need for disk IO; in other words, B-tree has a higher cache hit rate.

B-tree has some applications in database, such as the index of mongodb uses B-tree structure. However, in many database applications, a variant of B-tree is used.

5. B + tree

B + tree is also a multi-path balanced search tree. The main difference between B + tree and B tree is:

Each node in the B-tree (including leaf nodes and non-leaf nodes) stores real data. In B + tree, only leaf nodes store real data, and non-leaf nodes only store keys. In MySQL, the real data mentioned here may be all the data of the row (such as Innodb's clustered index), or it may just be the primary key of the row (such as Innodb's secondary index), or the address where the row is located (such as MyIsam's non-clustered index). A record in the B-tree will only appear once and will not be repeated, while the keys of the B + tree may be repeated-certainly in the leaf node or in the non-leaf node. The leaf nodes of the B+ tree are linked by a two-way linked list. The number of records of the non-leaf nodes in the B-tree is 1 less than the number of child nodes, while the number of records in the B + tree is the same as the number of child nodes.

As a result, compared with B-tree, B + tree has the following advantages:

Fewer IO times: the non-leaf nodes of the B + tree contain only keys, not real data, so each node stores much more records than B (that is, order m is larger), so the height of the B + tree is lower and the number of IO required for access is less. In addition, because each node stores more records, the principle of access locality is better utilized and the cache hit rate is higher. It is more suitable for scope query: when querying the scope in the B-tree, we first find the lower limit of the search, and then traverse the B-tree in the middle order until the upper limit of the search is found, while the scope query of the B + tree only needs to traverse the linked list. More stable query efficiency: the query time complexity of B-tree is between 1 and tree height (corresponding to the root node and leaf node, respectively), while the query complexity of B + tree is stable to tree height, because all data is in the leaf node.

The B+ tree also has a disadvantage: because the keys are repeated, they take up more space. However, compared with the performance advantages, spatial disadvantages are often acceptable, so B + trees are more widely used in databases than B trees.

6. Feel the power of B + tree

As mentioned earlier, compared with binary trees such as red-black trees, the biggest advantage of B-tree / B + tree is that the height of the tree is smaller. In fact, for Innodb's B+ index, the height of the tree is generally 2-4 layers. Here are some specific estimates.

The height of the tree is determined by the order, the larger the order, the lower the tree, and the size of the order depends on how many records each node can store. Each node in Innodb uses a page (page) with a page size of 16KB, in which metadata takes up only about 128bytes (including file management header information, page header information, and so on), and most of the space is used to store data.

For non-leaf nodes, the record contains only the key of the index and a pointer to the next layer node. Assuming that each non-leaf node page stores 1000 records, each record takes up about 16 bytes; this assumption is reasonable when the index is an integer or a shorter string. By extension, we often hear suggestions that the length of index columns should not be too large, for this reason: the index column is too long, the number of records per node is too small, the tree is too high, and the effect of the index is greatly reduced. and the index wastes more space. For leaf nodes, the record contains the key and value of the index (the value may be the primary key of the row, a row of complete data, and so on, see above), and the amount of data is larger. It is assumed that 100 records are stored on each leaf node page (in fact, this number may be less than 100 when the index is clustered; when the index is secondary, the number may be much greater than 100; it can be estimated based on the actual situation).

For a 3-layer B + tree, the first layer (root node) has one page, which can store 1000 records; the second layer has 1000 pages, which can store 10001000 records; and the third layer (leaf node) has 10001000 pages, each page can store 100,000,100 records, so 1000,100 records can be stored. For binary trees, it takes about 26 layers to store 100 million records.

VII. Summary

Finally, summarize the problems solved by various trees and the new problems they face:

Binary search tree (BST): solves the basic problem of sorting, but may degenerate into a linked list because the balance cannot be guaranteed

Balanced binary tree (AVL): the problem of balance is solved by rotation, but the rotation operation is too inefficient

Red-black tree: the problem of low rotation efficiency of AVL is solved by abandoning strict balance and introducing red-black nodes, but in scenarios such as disks, the tree is still too high and the number of IO is too many.

B-tree: the problem that the tree is too high is solved by changing the binary tree into a multi-path balanced search tree.

B+ tree: on the basis of the B-tree, the non-leaf node is transformed into a pure index node that does not store data, which further reduces the height of the tree; in addition, the leaf node is connected into a linked list using pointers, and the range query is more efficient.

At this point, the study on "why the index structure of MySQL uses B + tree" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.