In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-04-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)05/31 Report--
This article focuses on "principles for the use of MySQL Index". Interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Let's let the editor take you to learn the principles of using MySQL index.
I. comparison of storage engines
Note: the B-tree index mentioned above does not indicate B-Tree and B+Tree indexes, but the definitions of B-tree and B + tree are different.
In MySQL, there are four main types of indexes: B-Tree index, Hash index, Fulltext index and R-Tree index.
B-Tree index is the most frequently used index type in MySQL database, and all storage engines except Archive storage engine support B-Tree index. The Archive engine did not support indexing until MySQL 5.1, and only supported indexing of a single AUTO_INCREMENT column.
Not only in MySQL, but also in many other database management systems, B-Tree index is also the most important index type. This is mainly because the storage structure of B-Tree index has excellent performance in database data retrieval.
Generally speaking, most of the physical files of the B-Tree index in MySQL are stored in the Balance Tree structure, that is, all the actual data are stored in the Tree Leaf Node (leaf node), and the length of the shortest path to any Leaf Node is exactly the same, so we all call it the B-Tree index.
Of course, it is possible that various databases (or MySQL's various storage engines) will slightly modify the storage structure when storing their own B-Tree indexes.
For example, the actual storage structure used by the B-Tree index of the Innodb storage engine is actually B+Tree, that is, a small modification has been made on the basis of the B-Tree data structure. In addition to storing the relevant information of the index key on each LeafNode, it also stores the pointer information to the next LeafNode adjacent to the LeafNode (adding sequential access pointers), which is mainly to speed up the efficiency consideration of retrieving multiple adjacent LeafNode.
InnoDB is the default storage engine for Mysql (Mysql5.5.5 used to be MyISAM)
Next, let's look at the concepts of B-tree and B + tree. Figure out why indexed queries are faster?
Second, the concept of B-tree and B + tree
B tree
Namely binary search tree:
1. All non-leaf nodes have at most two sons (Left and Right).
2. All nodes store a keyword
3. The left pointer of a non-leaf node points to a subtree that is less than its keyword, and the right pointer points to a subtree that is larger than its keyword.
Such as:
B-tree
Is a multipath search tree (not binary):
1. Define that any non-leaf node has at most M sons, and M > 2
2. The number of sons in the root node is [2, M]
3. The number of sons of non-leaf nodes other than root nodes is [M]
4. Each node stores at least 2-1 (rounded up) and at most 1 keyword (at least 2 keywords).
5. The number of keywords in non-leaf nodes = the number of pointers to the son-1
6. Keywords of non-leaf node: K [1], K [2], … , K [M-1]; and K [I]
< K[i+1]; 7、非叶子结点的指针:P[1], P[2], …, P[M];其中P[1]指向关键字小于K[1]的子树,P[M]指向关键字大于K[M-1]的子树,其它P[i]指向关键字属于(K[i-1], K[i])的子树; 8、所有叶子结点位于同一层; 如:(M=3) B-树的搜索,从根结点开始,对结点内的关键字(有序)序列进行二分查找,如果命中则结束,否则进入查询关键字所属范围的儿子结点;重复,直到所对应的儿子指针为空,或已经是叶子结点; B-树的特性: 1、关键字集合分布在整颗树中; 2、任何一个关键字出现且只出现在一个结点中; 3、搜索有可能在非叶子结点结束; 4、其搜索性能等价于在关键字全集内做一次二分查找; 5、自动层次控制; 由于限制了除根结点以外的非叶子结点,至少含有M/2个儿子,确保了结点的至少利用率。 所以B-树的性能总是等价于二分查找(与M值无关),也就没有B树平衡的问题; 由于M/2的限制,在插入结点时,如果结点已满,需要将结点分裂为两个各占M/2的结点;删除结点时,需将两个不足M/2的兄弟结点合并; B+树 B+树是B-树的变体,也是一种多路搜索树: 1、其定义基本与B-树同,除了: 2、非叶子结点的子树指针与关键字个数相同; 3、非叶子结点的子树指针P[i],指向关键字值属于[K[i], K[i+1])的子树(B-树是开区间); 5、为所有叶子结点增加一个链指针; 6、所有关键字都在叶子结点出现; 如:(M=3)The search of B + is basically the same as that of B-tree, except that the B + tree can be hit only when it reaches the leaf node.
Non-leaf node hit), its performance is also equivalent to doing a binary search in the full set of keywords.
Features of B+:
1. All keywords appear in the linked list of leaf nodes (dense index), and the keywords in the linked list happen to be ordered.
2. It is impossible to hit at a non-leaf node.
3. The non-leaf node is equivalent to the index (sparse index) of the leaf node, and the leaf node is the data layer for storing (keyword) data.
4. More suitable for file indexing system
After understanding the concept of Bripple B+ tree, we continue to analyze the principle of B+ tree to improve efficiency.
Third, the principle of B + tree index.
As shown above, it is a b + tree. For the definition of the b + tree, see the B + tree. Here we only talk about some key points. The light blue block is called a disk block. We can see that each disk block contains several data items (shown in dark blue) and pointers (shown in yellow). For example, disk block 1 contains data items 17 and 35, including pointers P1, P2, P3, P1 for blocks less than 17, and P2 for blocks between 17 and 35. P3 represents a block greater than 35. The real data exists in leaf nodes 3, 5, 9, 10, 13, 15, 28, 29, 36, 60, 75, 79, 90, 99. Non-leaf nodes only do not store real data, but only store data items that guide the search direction, such as 17 and 35 do not really exist in the data table.
The search process of b + tree
As shown in the figure, if you want to find data item 29, then disk block 1 will first be loaded into memory from the disk. At this time, an IO occurs, and the binary search in memory determines that 29 is between 17 and 35. Lock the P2 pointer of disk block 1, and the memory time is negligible because it is very short (compared to the IO of the disk). Disk block 3 is loaded into memory from disk through the disk address of the P2 pointer of disk block 1, and the second IO occurs. 29 between 26 and 30, lock the P2 pointer of disk block 3, load disk block 8 into memory through the pointer, and the third IO occurs. At the same time, do a binary search in memory to find 29, end the query, a total of three times IO. The truth is, the 3-tier b + tree can represent millions of data. If millions of data lookups only need three IO, the performance improvement will be huge. If there is no index, each data item will have to have one IO, then a total of millions of IO will be required, obviously the cost is very high.
B + tree property
1. Through the above analysis, we know that the number of IO depends on the height h of the b + number. Assuming that the data in the current data table is N, and the number of data items in each disk block is m, then there is h = quantity (m) N. when the amount of data N is constant, the larger the m is, the smaller the h is. And m = the size of the disk block / the size of the data item, the size of the disk block is the size of a data page, is fixed, if the data item occupies less space, the more the number of data items, the lower the height of the tree. This is why each data item, that is, the index field, should be as small as possible. For example, int occupies 4 bytes, which is less than half of bigint8 bytes. This is why the b + tree requires that the real data be placed on the leaf node rather than the inner node. Once placed on the inner node, the data items of the disk block will drop significantly, causing the tree to grow. When the data item is equal to 1, it will degenerate into a linear table.
2. When the data item of the b + tree is a compound data structure, such as (name,age,sex), the b + number builds the search tree in the order from left to right. For example, when the data such as (Zhang San, 20 Magi F) is retrieved, the b + tree will first compare name to determine the next search direction. If name is the same, then compare age and sex in turn, and finally get the retrieved data. But when there is no name data like (205F), the b + tree doesn't know which node to check next, because name is the first comparison factor when building a search tree, and you have to search according to name before you know where to query next. For example, when retrieving data such as (Zhang San, F), the b + tree can use name to specify the search direction, but the next field age is missing, so we can only find all the data whose name is equal to Zhang San, and then match the data whose gender is F, which is a very important property, that is, the leftmost matching feature of the index.
Slow query optimization
About the principle of MySQL index is a relatively boring thing, we only need to have a perceptual understanding, and do not need to understand very thoroughly and deeply. Let's go back to the slow query we talked about at the beginning. After understanding the principle of indexing, do you have any ideas? First, summarize the basic principles of the index.
IV. Several major principles of indexing
1, the leftmost prefix matching principle, a very important principle, mysql will always match to the right until it encounters a range query (>, 3 and d = 4). If you build an index in the order of (a recorder bforce c), d does not need an index, if you build an index (a recorder bforce d c), you can use it, and the order of a line d can be adjusted at will.
2, = and in can be out of order, for example, a = 1 and b = 2 and c = 3 indexes can be built in any order, and mysql's query optimizer will help you optimize it into a form that the index can recognize.
3. Try to select a highly differentiated column as the index. The formula of distinguishing degree is count (distinct col) / count (*), indicating the proportion of non-repetitive fields. The larger the proportion, the fewer records we scan, and the differentiation degree of the only key is 1. While some status and gender fields may be 0 in front of big data, some people may ask, is there any empirical value for this ratio? With different scenarios, this value is also difficult to determine. Generally, we need more than 0.1 for the fields that need join, that is, an average of 10 records are scanned.
4, the index column can not participate in the calculation, keep the column "clean", such as from_unixtime (create_time) = '2014-05-29' can not be used to index, the reason is very simple, b + tree is stored in the data table field values, but for retrieval, all elements need to be compared to apply the function, obviously the cost is too large. So the statement should be written as create_time = unix_timestamp ('2014-05-29')
5. Expand the index as much as possible, do not create a new index. For example, if you already have an index of an in the table, and now you want to add the index of (aforme b), you only need to modify the original index.
At this point, I believe you have a deeper understanding of the "principles for the use of MySQL indexes". You might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.