What is the principle of MySQL index B + tree and what are the principles of indexing? 07/01 Update SLTechnology News&Howtos

What is the principle of MySQL index B + tree and what are the principles of indexing?

2025-07-01 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/03 Report--

This article will explain in detail the principle of MySQL index B + tree and what are the major principles of indexing. The content of the article is of high quality, so the editor will share it for you to do a reference. I hope you will have a certain understanding of the relevant knowledge after reading this article.

The fact that MySQL uses different storage engines is also very different, as apes can learn about it below.

I. comparison of storage engines

Note: the B-tree index mentioned above does not indicate B-Tree and B+Tree indexes, but the definitions of B-tree and B + tree are different.

In MySQL, there are four main types of indexes: B-Tree index, Hash index, Fulltext index and R-Tree index.

B-Tree index is the most frequently used index type in MySQL database, and all storage engines except Archive storage engine support B-Tree index. The Archive engine did not support indexing until MySQL 5.1, and only supported indexing of a single AUTO_INCREMENT column.

Not only in MySQL, but also in many other database management systems, B-Tree index is also the most important index type. This is mainly because the storage structure of B-Tree index has excellent performance in database data retrieval.

Generally speaking, most of the physical files of the B-Tree index in MySQL are stored in the Balance Tree structure, that is, all the actual data are stored in the Tree Leaf Node (leaf node), and the length of the shortest path to any Leaf Node is exactly the same, so we all call it the B-Tree index.

Of course, it is possible that various databases (or MySQL's various storage engines) will slightly modify the storage structure when storing their own B-Tree indexes. For example, the actual storage structure used by the B-Tree index of the Innodb storage engine is actually B+Tree, that is, a small modification has been made on the basis of the B-Tree data structure. In addition to storing the relevant information of the index key on each LeafNode, it also stores the pointer information to the next LeafNode adjacent to the LeafNode (adding sequential access pointers), which is mainly to speed up the efficiency consideration of retrieving multiple adjacent LeafNode.

InnoDB is the default storage engine for Mysql (Mysql5.5.5 used to be MyISAM)

Next, let's look at the concepts of B-tree and B + tree. Figure out why indexed queries are faster?

B-tree, B + tree concept B-tree

Namely binary search tree:

All non-leaf nodes have at most two sons (Left and Right)

All nodes store a keyword

The left pointer of a non-leaf node points to a subtree less than its keyword, and the right pointer points to a subtree that is greater than its keyword.

Such as:

B-tree

Is a multipath search tree (not binary):

Define that any non-leaf node has at most M sons, and M > 2

The number of sons in the root node is [2, M]

The number of sons of non-leaf nodes other than root nodes is [M _ pr _ 2, M]

Each node stores at least 2-1 (rounded up) and at most 1 keyword (at least 2 keywords).

Number of keywords for non-leaf nodes = number of pointers to sons-1

Keywords for non-leaf nodes: K [1], K [2], … , K [M-1]; and K [I]

< K[i+1]；非叶子结点的指针：P[1], P[2], …, P[M]；其中P[1]指向关键字小于K[1]的子树，P[M]指向关键字大于K[M-1]的子树，其它P[i]指向关键字属于(K[i-1], K[i])的子树；所有叶子结点位于同一层；如： B-树的搜索，从根结点开始，对结点内的关键字（有序）序列进行二分查找，如果命中则结束，否则进入查询关键字所属范围的儿子结点；重复，直到所对应的儿子指针为空，或已经是叶子结点； B-树的特性：关键字集合分布在整颗树中；任何一个关键字出现且只出现在一个结点中；搜索有可能在非叶子结点结束；其搜索性能等价于在关键字全集内做一次二分查找；自动层次控制；由于限制了除根结点以外的非叶子结点，至少含有M/2个儿子，确保了结点的至少利用率。所以B-树的性能总是等价于二分查找（与M值无关），也就没有B树平衡的问题；由于M/2的限制，在插入结点时，如果结点已满，需要将结点分裂为两个各占M/2的结点；删除结点时，需将两个不足M/2的兄弟结点合并； B+树 B+树是B-树的变体，也是一种多路搜索树： 1、其定义基本与B-树同，除了： 2、非叶子结点的子树指针与关键字个数相同； 3、非叶子结点的子树指针P[i]，指向关键字值属于[K[i], K[i+1])的子树（B-树是开区间）； 5、为所有叶子结点增加一个链指针； 6、所有关键字都在叶子结点出现；如： B+的搜索与B-树也基本相同，区别是B+树只有达到叶子结点才命中（B-树可以在非叶子结点命中），其性能也等价于在关键字全集做一次二分查找； B+的特性：所有关键字都出现在叶子结点的链表中（稠密索引），且链表中的关键字恰好是有序的；不可能在非叶子结点命中；非叶子结点相当于是叶子结点的索引（稀疏索引），叶子结点相当于是存储（关键字）数据的数据层；更适合文件索引系统；了解B-/B+树的概念之后，我们继续分析B+树提高效率的原理。三、B+树索引原理如上图，是一颗b+树，关于b+树的定义可以参见B+树，这里只说一些重点，浅蓝色的块我们称之为一个磁盘块，可以看到每个磁盘块包含几个数据项（深蓝色所示）和指针（黄色所示），如磁盘块1包含数据项17和35，包含指针P1、P2、P3，P1表示小于17的磁盘块，P2表示在17和35之间的磁盘块，P3表示大于35的磁盘块。真实的数据存在于叶子节点即3、5、9、10、13、15、28、29、36、60、75、79、90、99。非叶子节点只不存储真实的数据，只存储指引搜索方向的数据项，如17、35并不真实存在于数据表中。 b+树的查找过程如图所示，如果要查找数据项29，那么首先会把磁盘块1由磁盘加载到内存，此时发生一次IO，在内存中用二分查找确定29在17和35之间，锁定磁盘块1的P2指针，内存时间因为非常短（相比磁盘的IO）可以忽略不计，通过磁盘块1的P2指针的磁盘地址把磁盘块3由磁盘加载到内存，发生第二次IO，29在26和30之间，锁定磁盘块3的P2指针，通过指针加载磁盘块8到内存，发生第三次IO，同时内存中做二分查找找到29，结束查询，总计三次IO。真实的情况是，3层的b+树可以表示上百万的数据，如果上百万的数据查找只需要三次IO，性能提高将是巨大的，如果没有索引，每个数据项都要发生一次IO，那么总共需要百万次的IO，显然成本非常非常高。 b+树性质 1、通过上面的分析，我们知道IO次数取决于b+数的高度h，假设当前数据表的数据为N，每个磁盘块的数据项的数量是m，则有h=㏒(m+1)N，当数据量N一定的情况下，m越大，h越小；而m = 磁盘块的大小 / 数据项的大小，磁盘块的大小也就是一个数据页的大小，是固定的，如果数据项占的空间越小，数据项的数量越多，树的高度越低。这就是为什么每个数据项，即索引字段要尽量的小，比如int占4字节，要比bigint8字节少一半。这也是为什么b+树要求把真实的数据放到叶子节点而不是内层节点，一旦放到内层节点，磁盘块的数据项会大幅度下降，导致树增高。当数据项等于1时将会退化成线性表。 2、当b+树的数据项是复合的数据结构，比如(name,age,sex)的时候，b+数是按照从左到右的顺序来建立搜索树的，比如当(张三,20,F)这样的数据来检索的时候，b+树会优先比较name来确定下一步的所搜方向，如果name相同再依次比较age和sex，最后得到检索的数据；但当(20,F)这样的没有name的数据来的时候，b+树就不知道下一步该查哪个节点，因为建立搜索树的时候name就是第一个比较因子，必须要先根据name来搜索才能知道下一步去哪里查询。比如当(张三,F)这样的数据来检索时，b+树可以用name来指定搜索方向，但下一个字段age的缺失，所以只能把名字等于张三的数据都找到，然后再匹配性别是F的数据了，这个是非常重要的性质，即索引的最左匹配特性。慢查询优化关于MySQL索引原理是比较枯燥的东西，大家只需要有一个感性的认识，并不需要理解得非常透彻和深入。回头来看看一开始我们说的慢查询，了解完索引原理之后，大家是不是有什么想法呢？先总结一下索引的几大基本原则。四、建索引的几大原则 1、最左前缀匹配原则，非常重要的原则，mysql会一直向右匹配直到遇到范围查询(>

, 3 and d = 4 if you build an index in the order of (amemeb), d does not need an index, but if you build an index of (a-mineb), you can all use it, and the order of a-recorder can be adjusted at will.

2, = and in can be out of order, for example, a = 1 and b = 2 and c = 3 indexes can be built in any order, and mysql's query optimizer will help you optimize it into a form that the index can recognize.

3. Try to select a highly differentiated column as the index. The formula of distinguishing degree is count (distinct col) / count (*), indicating the proportion of non-repetitive fields. The larger the proportion, the fewer records we scan, and the differentiation degree of the only key is 1. While some status and gender fields may be 0 in front of big data, some people may ask, is there any empirical value for this ratio? With different scenarios, this value is also difficult to determine. Generally, we need more than 0.1 for the fields that need join, that is, an average of 10 records are scanned.

4, the index column can not participate in the calculation, keep the column "clean", such as from_unixtime (create_time) = '2014-05-29' can not be used to index, the reason is very simple, b + tree is stored in the data table field values, but for retrieval, all elements need to be compared to apply the function, obviously the cost is too large. So the statement should be written as create_time = unix_timestamp ('2014-05-29')

5. Expand the index as much as possible, do not create a new index. For example, if you already have an index of an in the table, and now you want to add the index of (a), you only need to modify the original index.

On the principle of MySQL index B + tree and what the major principles of indexing are shared here, I hope the above content can be of some help to you, can learn more knowledge. If you think the article is good, you can share it for more people to see.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.