What is the B+Tree index principle of MySQL?

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Creating the right indexes, and creating them correctly, is the foundation for improving database query performance.

What is the index?

An index is a data structure, stored in scattered (non-contiguous) blocks on disk, that is created to speed up the retrieval of data rows in a table.

How does the index work?

As shown in the figure above, take the SQL statement select * from teacher where id = 101. Without an index, finding this record requires a full table scan that compares every row against id = 101. With an index, we can quickly find, through the index, the disk address of the row whose id is 101, and then fetch the row data at that address.

Why does MySQL use the B+Tree as the data structure of the index?

To speed up data retrieval, the first structure that comes to mind is the binary search tree, whose search time complexity can reach O(log2(n)). Take a look at the storage structure of the binary tree:

Searching a binary search tree is equivalent to a binary search. Binary search can greatly improve query efficiency, but it has a problem: the tree takes the first inserted value as the root node and places every later value relative to it. In the figure above, if you only look at the right side, you will find that it is effectively a linear linked list. If our data contains only 1, 2, 3, 4, 5, 6, inserted in that order, the following will happen:

If the data we want to query is 6, we need to traverse all the nodes to find it, which is equivalent to a full table scan. Because of this problem, a plain binary search tree is not suitable as the data structure for an index.
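The degeneration described above can be reproduced with a minimal sketch. The `Node`, `insert`, `build` and `nodes_visited` names are illustrative only and are not taken from MySQL; the second insertion order (4, 2, 6, 1, 3, 5) is an assumption added for comparison.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Plain (unbalanced) binary search tree insert, iterative."""
    if root is None:
        return Node(key)
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                cur.left = Node(key)
                return root
            cur = cur.left
        else:
            if cur.right is None:
                cur.right = Node(key)
                return root
            cur = cur.right

def build(keys):
    """Insert the keys one by one, in the order given."""
    root = None
    for k in keys:
        root = insert(root, k)
    return root

def nodes_visited(root, key):
    """Count how many nodes a search touches before it finds key."""
    count, cur = 0, root
    while cur is not None:
        count += 1
        if key == cur.key:
            return count
        cur = cur.left if key < cur.key else cur.right
    return count

chain = build([1, 2, 3, 4, 5, 6])   # ascending inserts: every node hangs to the right
bushy = build([4, 2, 6, 1, 3, 5])   # same keys, different (assumed) insert order
print(nodes_visited(chain, 6), nodes_visited(bushy, 6))
```

With ascending inserts the search for 6 touches every node, while the same keys inserted in a different order need far fewer comparisons. The shape of a plain binary search tree depends entirely on insertion order, which is what makes it unreliable as an index.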

Based on this deduction, to solve the problem of degenerating into a linear linked list, it is natural to think of the balanced binary search tree. Let's see what a balanced binary tree looks like:

A balanced binary search tree is defined by this rule: for every node, the heights of its left and right subtrees differ by at most 1. Take node 20 in the figure above: the height of its left subtree is 1, the height of its right subtree is 0, and the difference is 1, so the figure does not violate the definition and is a balanced binary tree. The tree is kept balanced through operations such as left rotation and right rotation; you can look up the details of these rotations on your own.

Suppose the balanced binary tree in the figure above holds the id index and we want to find the data with id = 8. First load the root node into memory and compare 8 with 10; 8 is smaller than 10, so continue with the left subtree of 10. Load node 5 into memory and compare 8 with 5; 8 is larger, so load the right subtree of node 5. At this point 8 is hit, and the next step is to load the data corresponding to the index entry with id 8.

How do I find the data corresponding to the index?

There are generally two ways for an index to save data. The first is to store the full row data for id = 8 in the data area of the node. The second is for the data area to hold the disk address where the row is actually stored.

At this point, the balanced binary tree has solved the linear linked list problem, and query efficiency looks fine, basically reaching O(log2(n)). So why doesn't MySQL choose this data structure? What problems does it have?

Problem 1: search is inefficient. In a tree structure, the depth at which the data sits determines the number of I/Os a search needs. In the figure above, finding the data with id = 8 takes 3 I/Os. When the amount of data reaches the millions, the tree becomes very tall: a balanced binary tree holding one million keywords needs about 20 levels.

Problem 2: query time is unstable. If the data being queried sits in the root node, only one I/O is needed; if it sits in a branch or leaf node, multiple I/Os are needed.

Problem 3: each node stores too little data. The binary tree does not make good use of how the operating system exchanges data with the disk, nor of the disk's read-ahead ability. The operating system and the disk exchange data in pages, with one page = 4K, so each I/O loads 4K of data into memory. But a binary tree node holds only one keyword, one data area, and two child references, which comes nowhere near filling 4K. An entire I/O is spent to load just one keyword. When the tree is tall and the keyword being searched for is in a branch or leaf node, finding that one keyword takes many I/Os.
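To put a number on how tall such a tree gets, here is a small sketch. It assumes one keyword per node, one I/O per level, and a perfectly packed tree, so the result is a lower bound; the function name is illustrative.

```python
def min_binary_tree_levels(n_keys):
    """Fewest levels a binary tree needs to hold n_keys keywords (one per node).

    For n >= 1 this is floor(log2(n)) + 1, which equals n.bit_length().
    """
    return n_keys.bit_length()

for n in (7, 1_000_000, 100_000_000):
    # Each level visited on the way down costs one I/O under the assumption above.
    print(n, min_binary_tree_levels(n))
```

Even in the best case, the number of levels, and so the worst-case number of I/Os, grows with every doubling of the data.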

Is there a structure that can solve the problems of the binary tree?

Yes, the multi-way balanced search tree (B-Tree):

A B-Tree is a perfectly balanced tree in which all leaf nodes are at the same height, as shown in the following figure:

What are the advantages of the B-Tree, and how does it solve these problems?

Start with the definition. The figure above shows a 2-3 tree (each node stores up to 2 keywords and has up to 3 paths). "Multi-way" means that a node can have many branches. As the figure shows, the number of keywords and the number of paths in each node are related as follows:

Number of keywords = number of paths - 1.

Suppose you want to find the data with id = 28 in the figure above. The B-Tree search process is as follows:

First, the root node is loaded into memory, bringing in its two keywords, 17 and 35. The judgment rule is as follows: a search key smaller than 17 goes to the left subtree, a key between 17 and 35 goes to the middle subtree, and a key larger than 35 goes to the right subtree; a key equal to 17 or 35 is a hit.

Applying these rules at each level, 28 is eventually hit, and the data area of keyword 28 is then read. The data area stores either the specific data or a pointer to it.
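The search rule can be sketched in a few lines. The original figure is not available, so the tree below is hypothetical: only the root keywords 17 and 35 and the target 28 come from the text, while the leaf contents and the `BTreeNode` / `btree_search` names are assumptions made for illustration.

```python
from bisect import bisect_left

class BTreeNode:
    def __init__(self, keys, values, children=None):
        self.keys = keys                  # sorted keywords
        self.values = values              # data area for each keyword
        self.children = children or []    # len(keys) + 1 children, or [] for a leaf

def btree_search(node, key):
    """Return (value, nodes_loaded); value is None if key is absent."""
    loaded = 0
    while node is not None:
        loaded += 1                       # one node loaded = one I/O in the text's model
        i = bisect_left(node.keys, key)   # position of the first keyword >= key
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i], loaded
        # Not in this node: follow the path between the neighbouring keywords.
        node = node.children[i] if node.children else None
    return None, loaded

# Hypothetical two-level tree with the root keywords from the text.
leaf_a = BTreeNode([8, 12], ["row-8", "row-12"])
leaf_b = BTreeNode([26, 28], ["row-26", "row-28"])
leaf_c = BTreeNode([65, 87], ["row-65", "row-87"])
root = BTreeNode([17, 35], ["row-17", "row-35"], [leaf_a, leaf_b, leaf_c])

print(btree_search(root, 28))
print(btree_search(root, 17))
```

Because 28 lies between 17 and 35, the search follows the middle path. A search for 17 stops at the root, which is the B-Tree behaviour of returning as soon as a keyword is hit.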

Why can this structure solve the problems of the balanced binary tree?

MySQL makes good use of the way the operating system and the disk interact. To take advantage of disk read-ahead, MySQL (InnoDB) sets the page size to 16K; that is, one node (disk block) is 16K, and one I/O loads the whole node (16K) into memory. Assume the keyword type is int (4 bytes), the data area for each keyword is also 4 bytes, and references to child nodes are ignored. Then each node in the figure above can store about (16 × 1024) / 8 = 2048 keywords, and so has 2049 paths. A binary tree three levels high holds at most 7 keywords, whereas a B-Tree with 2049 paths and the same three levels can hold billions of keywords.
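The arithmetic can be checked with a short sketch. It keeps the text's simplifying assumptions (4-byte keyword, 4-byte data area, child references ignored) and uses the exact 16 × 1024 bytes for a 16K node; the constant and function names are illustrative.

```python
PAGE_BYTES = 16 * 1024   # one 16K node, as in the text
KEY_BYTES = 4            # int keyword
DATA_BYTES = 4           # assumed data-area size per keyword

keys_per_node = PAGE_BYTES // (KEY_BYTES + DATA_BYTES)   # child references ignored
fanout = keys_per_node + 1                               # paths = keywords + 1

def max_keys(fanout, levels):
    """Most keywords a full tree with this fan-out can hold in the given levels.

    A full tree has (fanout**levels - 1) / (fanout - 1) nodes with
    fanout - 1 keywords each, which multiplies out to fanout**levels - 1.
    """
    return fanout ** levels - 1

print(keys_per_node)
print(max_keys(2, 3))        # binary tree, three levels
print(max_keys(fanout, 3))   # B-Tree with 16K nodes, three levels
```

A real node also has to store child references and page headers, so the actual fan-out is lower, but the gap to a binary tree stays at several orders of magnitude.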

While a B-Tree keeps itself balanced, inserting or deleting a keyword can force nodes to split or merge, and such structural changes take time. So create only the indexes you actually need rather than indexing every field; redundant indexes only add overhead when rows are inserted, deleted, or updated.

Since the B-Tree already solves these problems well, why does MySQL still use the B+Tree?

First, let's see what a B+Tree looks like. The B+Tree is a variant of the B-Tree. In a B+Tree, the B-Tree relationship between the number of paths and the number of keywords (keywords = paths - 1) no longer holds. A B+Tree uses left-closed intervals for retrieval, and the ratio of paths to keywords is 1:1, as shown in the following figure:

Suppose the figure above is an index on id and you are searching for the data with id = 1. The search rules are as follows:

Following these rules, the search always ends at a leaf node, and the real data is obtained from the data area of keyword 1 in that leaf node.

What is the difference between the B-Tree and the B+Tree?

1. B+Tree keyword search uses left-closed intervals. The reason is to better support auto-increment ids, which was also an original design intent of MySQL. That is, even when id = 1 matches a keyword in an upper node, the search continues downward until the 1 in the leaf node is found.

2. The root and branch nodes of a B+Tree have no data area; the data for each keyword is stored only in the leaf nodes. That is, only the data area of a leaf-node keyword holds the real data or the address of that data. In a B-Tree, by contrast, a hit at the root node returns the data directly. In addition, B+Tree leaf nodes do not hold references to child nodes.

3. B+Tree leaf nodes are arranged in key order, and adjacent leaves reference each other sequentially, as the pointers between the leaf nodes in the figure above show.
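The third point, linked leaf nodes, is what makes ordered scans cheap. Below is a minimal sketch of the leaf level only. The keys and the `Leaf`, `link`, `range_scan` names are hypothetical, and for simplicity the scan starts at the first leaf instead of descending from the root to the leaf that contains the lower bound.

```python
class Leaf:
    def __init__(self, keys, rows):
        self.keys = keys     # sorted keywords
        self.rows = rows     # data area: one row per keyword
        self.next = None     # pointer to the next leaf in key order

def link(leaves):
    """Chain the leaves left to right and return the first one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves[0]

def range_scan(first_leaf, low, high):
    """Collect rows with low <= key <= high by walking the leaf chain."""
    out = []
    leaf = first_leaf
    while leaf is not None:
        for key, row in zip(leaf.keys, leaf.rows):
            if key > high:
                return out   # keys are sorted, so nothing further can match
            if key >= low:
                out.append(row)
        leaf = leaf.next
    return out

head = link([
    Leaf([1, 8], ["row-1", "row-8"]),
    Leaf([12, 20], ["row-12", "row-20"]),
    Leaf([28, 35], ["row-28", "row-35"]),
])
print(range_scan(head, 8, 28))
```

The scan never goes back up to a branch node. It moves sideways from leaf to leaf and can stop as soon as a key exceeds the upper bound.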

Why did MySQL finally choose the B+Tree?

1. The B+Tree is a variant of the B-Tree, so whatever the B-Tree solves, the B+Tree solves too (lower tree height, more data per node).

2. The B+Tree is better at scanning tables. To scan a table through the index, a B-Tree has to traverse the whole tree, while a B+Tree only needs to walk its leaf nodes, which are linked to one another.

3. The B+Tree reads and writes disk more efficiently. Its root and branch nodes have no data area, so for the same node size they hold more keywords than B-Tree nodes; its leaf nodes hold no child references. As a result, one disk read or write on a B+Tree loads more keywords than on a B-Tree.

4. The B+Tree sorts better. As the figure above shows, its leaf nodes are already in sorted order.

5. B+Tree query performance is more stable: every lookup ends at a leaf node, so the number of I/Os is the same for every query. Opinions differ on this point, of course, because in a B-Tree a hit at the root node returns immediately, which is indeed faster for that lookup.

How MySQL implements the B+Tree in practice

This section explains how MySQL's two storage engines, MyISAM and InnoDB, implement the B+Tree index structure differently. First, find the folder where MySQL stores its data and see how the data is saved:

This directory holds all the databases; enter it and then go into a specific database directory. MySQL has a variety of storage engines; only MyISAM and InnoDB are explained here, as shown in the figure:

MyISAM storage engine index:

As you can see from the figure, a table stored with the MyISAM storage engine consists of three files:

.frm: the table definition file. .MYD: the data file, in which all the row data is saved. .MYI: the index file.

In the MyISAM storage engine, the relationship between data and indexes is as follows:

How is the data found? To query the data with id = 101, first locate the node with id = 101 in the id index (left side of the figure), read from its data area the disk address where the row is actually stored, and then load the record at that address from the MYD data file (right side of the figure).

If there is more than one index, it looks as follows:

So in the MyISAM storage engine, the primary key index and the secondary indexes are peers: each one stores row addresses, and there is no distinction between primary and secondary.
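The MyISAM arrangement can be modelled in a few lines. This is only an illustration: a Python list stands in for the MYD data file, list positions stand in for disk addresses, and dicts stand in for the B+Tree indexes in the MYI file. Apart from id = 101 and name = seven, which come from the article, the rows are made up.

```python
# Stand-in for the MYD data file: rows stored in insertion order at "addresses" 0, 1, 2.
myd_rows = [
    {"id": 103, "name": "james"},
    {"id": 101, "name": "seven"},
    {"id": 102, "name": "mic"},
]

# Stand-ins for two indexes in the MYI file: each maps a keyword to a row address.
id_index = {row["id"]: addr for addr, row in enumerate(myd_rows)}
name_index = {row["name"]: addr for addr, row in enumerate(myd_rows)}

def myisam_lookup(index, key):
    """Index search gives an address; the row is then read from the data file."""
    addr = index.get(key)
    return None if addr is None else myd_rows[addr]

print(myisam_lookup(id_index, 101))
print(myisam_lookup(name_index, "seven"))
```

Both indexes resolve to the same address, which is why neither one is special. Each is a direct pointer into the data file.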

InnoDB storage engine:

First take a look at the concept of a clustered index: an index in which the physical order of the rows in the table is the same as the logical order of the key values.

InnoDB organizes and stores table data clustered around the primary key index. Let's take a look at how InnoDB organizes the data.

An InnoDB table has only two files: the frm file, which is the table definition file, and the ibd file. There is no separate file that stores only the data. The data is clustered in the primary key index, with the real rows stored in its leaf nodes. InnoDB was designed around the idea that the primary key is the main index. This is shown in the following figure:

For example, in the figure above, the data area of each leaf node stores the real row. When a search through the index reaches the leaf node, the row data can be read directly from that leaf. Before MySQL 5.5 the default storage engine was MyISAM; from 5.5 onward the default is InnoDB.

In InnoDB, the structure of a secondary index is as follows:

As shown in the figure above, the leaf nodes of the primary key index hold the real data, while the data area of a secondary index leaf node holds the primary key value. The search process is as follows: to query the data with name = seven, first find the primary key id = 101 in the secondary index, then search the primary key index for id = 101, and finally read the real data from the leaf node of the primary key index. A query through a secondary index therefore searches two indexes.
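The two-step lookup can be sketched as follows. Dicts stand in for the two B+Trees, and all values other than name = seven and id = 101 are hypothetical.

```python
# Clustered (primary key) index: the leaf entry for each id holds the full row.
primary_index = {
    101: {"id": 101, "name": "seven", "phoneNum": "555-0101"},
    102: {"id": 102, "name": "mic", "phoneNum": "555-0102"},
}

# Secondary index on name: the leaf entry holds only the primary key value.
name_index = {"seven": 101, "mic": 102}

def find_by_name(name):
    """Return (row, index_searches): secondary index first, then the clustered index."""
    searches = 1
    pk = name_index.get(name)
    if pk is None:
        return None, searches
    searches += 1
    return primary_index.get(pk), searches

print(find_by_name("seven"))
```

The second search is the extra cost of going through a secondary index. A query that needs only the columns already present in the secondary index can skip it.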

The difference between InnoDB and MyISAM can be summed up in one picture, as shown below:

Several principles for creating an index:

1. Selectivity (dispersion) of the column:

Selectivity is calculated as count(distinct col) : count(col). The higher the ratio, the more selective the column and the better it works as an index.

Among the fields in the following table, which column has the best selectivity?

In the figure above, name clearly has the best selectivity. Now suppose you create an index on sex instead.

Why is higher selectivity better?

As shown in the following figure, if you create an index on sex, the index structure will be as follows:

Suppose the data with sex = 1 is retrieved. At the root node the comparison sends the search to the left subtree, but at the second level of the left subtree both the left and the right branch satisfy the condition. The search cannot tell which branch to follow, so it has to search both branches.
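The selectivity formula is easy to try on sample data. The table from the original figure is not available, so the values below are made up and the `selectivity` helper is illustrative.

```python
def selectivity(values):
    """count(distinct col) / count(col); None stands for NULL and is not counted."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 0.0

# Hypothetical sample columns.
names = ["seven", "mic", "james", "tom", "lucy", "anna"]
sexes = [1, 0, 1, 1, 0, 0]

print(selectivity(names))
print(selectivity(sexes))
```

A value close to 1 means almost every keyword identifies a single row. A value close to 0 means each keyword matches a large share of the table, so the index does little to narrow the search.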

2. Leftmost matching principle

Keywords in the index are compared from left to right, and no position can be skipped. All the ids discussed so far are int values. If id is a string, see the figure below:

When matching, each character of the string is converted to its ASCII code, for example abc becomes 97 98 99, and the comparison then proceeds one character at a time from left to right. This is why the index cannot be used when a SQL query uses a leading wildcard such as like '%a': the % matches any leading characters, so the comparison has no leftmost character to start from, and a full table scan is the better choice.
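Left-to-right comparison can be illustrated with a sorted list standing in for the ordered keywords of an index. The keys and helper names are hypothetical; `bisect_left` plays the role of descending the tree.

```python
from bisect import bisect_left

# Index keywords are kept sorted, compared character by character from the left.
keys = sorted(["abc", "abd", "acd", "bca", "cab", "xab"])

def prefix_search(keys, prefix):
    """like 'prefix%': binary-search to the first candidate, then read its neighbours."""
    i = bisect_left(keys, prefix)
    out = []
    while i < len(keys) and keys[i].startswith(prefix):
        out.append(keys[i])
        i += 1
    return out

def suffix_search(keys, suffix):
    """like '%suffix': the sort order is no help, so every keyword is checked."""
    return [k for k in keys if k.endswith(suffix)]

print([ord(c) for c in "abc"])
print(prefix_search(keys, "ab"))
print(suffix_search(keys, "ab"))
```

Matches for a known prefix sit next to each other in the sorted order, while matches for a suffix are scattered across it. That is the reason a leading wildcard forces a full scan.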

3. Minimum space principle

As mentioned earlier, the less space each keyword takes, the more keywords fit in each node, the more keywords each I/O loads into memory, and the more efficient retrieval becomes.

Composite index:

Single-column index: the keyword in each node is [name]

Composite index: the keyword in each node is [name, phoneNum]

A single-column index can be regarded as a special case of a composite index, and comparisons in a composite index also follow the leftmost matching principle.

Principles for choosing the columns of a composite index:

(1) frequently used columns first (leftmost matching principle)

(2) columns with high selectivity first (high-selectivity principle)

(3) narrow columns first (minimum space principle)

Here is a simple example of a problem you will often encounter:

Suppose the frequently used queries are as follows:

select * from users where name = ?

select * from users where name = ? and phoneNum = ?

To speed up retrieval, the following indexes are created for these queries:

create index idx_name on users (name)

create index idx_name_phoneNum on users (name, phoneNum)

In this solution, by the leftmost matching principle, idx_name is a redundant index: a query with where name = ? can also use idx_name_phoneNum. Redundant indexes add to the cost of keeping each B+Tree balanced when rows are inserted, deleted, or updated, and they take up disk space.
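The redundancy can be seen by modelling idx_name_phoneNum as a sorted list of (name, phoneNum) tuples. This is a sketch with made-up entries and illustrative helper names; Python compares tuples element by element from the left, which mirrors how composite keywords are compared.

```python
from bisect import bisect_left

# Composite index entries, sorted by name first and then by phoneNum.
idx_name_phone = sorted([
    ("mic", "555-0102"),
    ("seven", "555-0101"),
    ("seven", "555-0199"),
    ("tom", "555-0103"),
])

def by_name(index, name):
    """where name = ?: seek to the first entry with this name, read while it matches."""
    i = bisect_left(index, (name,))   # (name,) sorts before every (name, phone)
    out = []
    while i < len(index) and index[i][0] == name:
        out.append(index[i])
        i += 1
    return out

def by_name_and_phone(index, name, phone):
    """where name = ? and phoneNum = ?: a single seek on the full keyword."""
    i = bisect_left(index, (name, phone))
    return i < len(index) and index[i] == (name, phone)

def by_phone_only(index, phone):
    """where phoneNum = ?: phone numbers are not globally sorted, so scan every entry."""
    return [entry for entry in index if entry[1] == phone]

print(by_name(idx_name_phone, "seven"))
print(by_name_and_phone(idx_name_phone, "seven", "555-0199"))
print(by_phone_only(idx_name_phone, "555-0103"))
```

The name-only query seeks straight to its entries in the composite index, so a separate index on name adds maintenance cost without adding capability.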

Covering index:

If the columns a query selects can be returned directly from the index entries, that index is called a covering index for the query. Covering indexes improve query efficiency.

The following example illustrates covering indexes.

Table: teacher

Index: PK (id), key (name, phoneNum), unique (teacherNo)

Which of the following SQL statements use a covering index?

select teacherNo from teacher where teacherNo = ?: used. The lookup is on teacherNo, and the teacherNo value in the index can be returned directly without entering the data area.

select id, teacherNo from teacher where teacherNo = ?: used. The leaf nodes of a secondary index hold the primary key value, so id can be returned directly from the secondary index leaf node.

select name, phoneNum from teacher where teacherNo = ?: not used.

select phoneNum from teacher where name = ?: used.

Once you understand covering indexes, you know why you are advised to avoid select * where possible and to list the specific columns you need. One reason is that when a covering index applies, the data can be returned directly from the index without entering the data area, which improves query efficiency.
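The four answers above follow one rule, sketched here under the InnoDB assumption that every secondary index entry also carries the primary key. The index names are hypothetical labels for the key (name, phoneNum) and unique (teacherNo) indexes in the example.

```python
PRIMARY_KEY = "id"

# Secondary indexes from the example; the dictionary keys are made-up names.
INDEXES = {
    "key_name_phoneNum": ["name", "phoneNum"],
    "unique_teacherNo": ["teacherNo"],
}

def is_covered(index_name, selected_columns):
    """True if every selected column is stored in the index entry itself."""
    available = set(INDEXES[index_name]) | {PRIMARY_KEY}
    return set(selected_columns) <= available

print(is_covered("unique_teacherNo", ["teacherNo"]))
print(is_covered("unique_teacherNo", ["id", "teacherNo"]))
print(is_covered("unique_teacherNo", ["name", "phoneNum"]))
print(is_covered("key_name_phoneNum", ["phoneNum"]))
```

With select *, the selected columns are every column of the table, so the check fails for almost any secondary index and the query has to go back to the primary key index.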

With what we have covered so far, the following conclusions are easy to understand:

1. As long as business requirements are met, the shorter the data length of an indexed column, the better.

2. More indexes on a table is not always better.

3. In where conditions, do like '9%', like '%9%' and like '%9' use the index? The latter two cannot use it. The first, like '9%', is uncertain and depends on the selectivity of the column: in principle the index can be used, but if selectivity is particularly poor, the query optimizer may decide that going through the index would perform worse than a full table scan.

4. NOT IN in a where condition generally cannot use an index.

5. Select only the columns you need and avoid select * where possible.

6. If a function is applied to an indexed column in the query condition, the index cannot be used. This is also related to the column's selectivity: once a function is applied, its results are not known in advance, so the stored keywords cannot be used for the comparison.

7. In a composite index, the index cannot be used if the search does not start from the leftmost column of the index.

8. A composite index can be used when the leftmost column is matched exactly and the next column is matched by range.

9. In a composite index, if the query has a range condition on one column, the columns to its right cannot use the index.
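Conclusions 7 to 9 can be condensed into one small rule, sketched below. This is a simplified model of the leftmost matching principle as described in this article, not the behaviour of the MySQL optimizer, which has further techniques this sketch ignores. The function name, the "eq" / "range" labels, and the three-column index are assumptions.

```python
def usable_index_columns(index_columns, predicates):
    """Return the leading index columns a lookup can use.

    predicates maps a column name to "eq" (exact match) or "range".
    """
    used = []
    for col in index_columns:
        kind = predicates.get(col)
        if kind is None:
            break              # gap: this column and everything to its right is unusable
        used.append(col)
        if kind == "range":
            break              # a range condition ends the match for columns to its right
    return used

index = ["name", "phoneNum", "teacherNo"]   # hypothetical three-column composite index
print(usable_index_columns(index, {"phoneNum": "eq"}))
print(usable_index_columns(index, {"name": "eq", "phoneNum": "range"}))
print(usable_index_columns(index, {"name": "eq", "phoneNum": "range", "teacherNo": "eq"}))
```

The first call uses no index column because the leftmost column is missing. The second uses both name and phoneNum. In the third, teacherNo sits to the right of a range condition and is left out.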
