What does index refer to in MySQL

2025-07-11 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains in detail what an index is in MySQL. The editor finds it very practical and shares it here as a reference. I hope you get something out of reading it.

What is an index?

Suppose a table has 5 million rows, and we execute a WHERE query on the unindexed name field:

SELECT * FROM user_innodb WHERE name = 'pony';

What if there is an index on the name field? Create an index on the name field, and then execute the same query.

ALTER TABLE user_innodb DROP INDEX idx_name;
ALTER TABLE user_innodb ADD INDEX idx_name (name);

The indexed query is tens of times more efficient than the non-indexed query.

Through this case, we should be able to feel very intuitively that the index can greatly improve the performance of data retrieval.
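The difference between the two queries can be reproduced in miniature. The sketch below is not MySQL: it uses Python's built-in sqlite3 module as a stand-in so that it runs without a server, and it reuses the table and index names from the text (the row count and the user0, user1, ... names are made up for illustration). It compares the query plan before and after the index is created.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption: only the
# scan-versus-index-lookup behaviour is being illustrated here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_innodb (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO user_innodb (name) VALUES (?)",
    ((f"user{i}",) for i in range(10_000)),
)
conn.execute("INSERT INTO user_innodb (name) VALUES ('pony')")


def query_plan(connection):
    """Return SQLite's plan description for the name lookup."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM user_innodb WHERE name = 'pony'"
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


plan_without_index = query_plan(conn)   # a full table scan
conn.execute("CREATE INDEX idx_name ON user_innodb (name)")
plan_with_index = query_plan(conn)      # a search that uses idx_name

print(plan_without_index)
print(plan_with_index)
```

In MySQL the comparable check is to run EXPLAIN on the query before and after adding the index.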

So what exactly is the index? Why can it have such a big impact on our queries? What happened when the index was created?

Index definition

A database index is a sorted data structure in a database management system (DBMS) that helps to quickly query and update data in database tables.

Data is stored on disk in the form of files, and each row of data has its own disk address. Without an index, retrieving one row from 5 million rows means scanning all the data in the table until we find it.

With an index, we only need to look the data up in the index, because it is a data structure designed for quick retrieval. Once we find the disk address where the data is stored, we can fetch the data.

Index type

In InnoDB, there are three types of indexes: normal index, unique index (the primary key index is a special unique index), and full-text index.

Normal: also known as a non-unique index, this is the most common index and has no restrictions.

Unique (Unique): a unique index requires that key values cannot be repeated. Note also that the primary key index is a special unique index with the additional constraint that the key value cannot be NULL. A primary key index is created with PRIMARY KEY.

Full-text (Fulltext): for large data, for example message content of several KB, you can create a full-text index to address inefficient LIKE queries. Full-text indexes can only be created on text-type fields such as char, varchar, and text.

An index is a kind of data structure, so what kind of data structure should it use in order to achieve efficient data retrieval?

Deriving the index storage model

Binary search

After the Singles Day holiday, your girlfriend plays a number-guessing game with you: guess how much I spent yesterday. I'll give you five chances.

10000? Too low. 30000? Too high. What will you guess next? 20000. Why not guess 11000 or 29000?

This is the idea of binary search, also known as half-interval search: each guess cuts the candidate data in half. This approach relies on the data already being sorted.
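The guessing game can be written down as a short sketch. The function name and the comparison counter are illustrative choices, not part of MySQL. It shows that even 5 million sorted values need only about 23 comparisons.

```python
def binary_search(sorted_values, target):
    """Return (index, comparisons); index is -1 when target is absent."""
    low, high = 0, len(sorted_values) - 1
    comparisons = 0
    while low <= high:
        mid = (low + high) // 2          # guess the middle of the candidates
        comparisons += 1
        if sorted_values[mid] == target:
            return mid, comparisons
        if sorted_values[mid] < target:  # guess was too low: drop the lower half
            low = mid + 1
        else:                            # guess was too high: drop the upper half
            high = mid - 1
    return -1, comparisons


# range() behaves like a sorted, indexable sequence without using memory.
index, comparisons = binary_search(range(5_000_000), 4_321_987)
print(index, comparisons)
```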

So first, we can consider using an ordered array as the index data structure.

Equality queries and comparison queries on an ordered array are very efficient, but updates are a problem: inserting or deleting a value may require moving a large amount of data (and changing their positions), so an ordered array is only suitable for storing static data.
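To make the update cost concrete, here is a minimal sketch that counts how many existing elements must move when a value is inserted into a sorted array. The helper name insert_sorted is made up for this example.

```python
from bisect import bisect_left


def insert_sorted(values, new_value):
    """Insert new_value keeping values sorted; return how many elements shifted."""
    position = bisect_left(values, new_value)  # found quickly by binary search
    shifted = len(values) - position           # everything after it must move
    values.insert(position, new_value)
    return shifted


values = [2, 10, 12, 15, 21, 28]
print(insert_sorted(values, 1))    # worst case: every element shifts
print(insert_sorted(values, 30))   # best case: appended at the end
```

Finding the position is cheap, but the shift grows with the size of the array, which is the reason the text restricts ordered arrays to static data.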

To support frequent changes such as inserts, we need a linked list. But with a singly linked list, search efficiency is still not high enough.

So, is there a linked list that can be used for binary search?

To solve this problem, the BST (Binary Search Tree), which is what we call a binary search tree, was born.

Binary search tree (Binary Search Tree)

All nodes in the left subtree are smaller than the parent node, and all nodes in the right subtree are larger than the parent node. Projected onto a horizontal line, the nodes form an ordered linear list.

A binary search tree supports both fast search and fast insertion.

But there is a problem with the binary search tree: the search time depends on the depth of the tree, and in the worst case the time complexity degenerates to O(n).

What is the worst-case scenario?

Take the same batch of numbers as before, and suppose the data we insert happens to be in order: 2, 10, 12, 15, 21, 28.

In that case the BST becomes a linked list (a "skewed tree"). It no longer speeds up retrieval and is no more efficient than a sequential search.
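The degeneration is easy to observe in a small sketch. The class and function names below are illustrative; the first key sequence is the sorted one from the text, and the second is the same keys in an assumed, more favourable order.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into a plain (unbalanced) BST and return the root."""
    if root is None:
        return Node(key)
    current = root
    while True:
        if key < current.key:
            if current.left is None:
                current.left = Node(key)
                break
            current = current.left
        else:
            if current.right is None:
                current.right = Node(key)
                break
            current = current.right
    return root


def height(node):
    """Number of levels in the tree rooted at node."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


skewed = build([2, 10, 12, 15, 21, 28])   # sorted input: every node goes right
mixed = build([15, 10, 21, 2, 12, 28])    # same keys, different insertion order
print(height(skewed), height(mixed))
```

With sorted input the height equals the number of keys, so a lookup walks the nodes one by one just as in a linked list.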

What causes it to tilt?

Because the depth difference between the left and right subtrees is too large: the left subtree of this tree has no nodes at all. In other words, it is not balanced enough.

So, do we have a more balanced tree with less difference in depth between the left and right subtrees?

This is the balanced binary tree (Balanced Binary Search Tree), also known as the AVL tree.

Balanced binary tree (AVL Tree)

The definition of balanced binary tree: the absolute value of the depth difference between left and right subtrees cannot exceed 1.

What does that mean? For example, if the depth of the left subtree is 2, the depth of the right subtree can only be 1, 2, or 3.

If we now insert 1, 2, 3, 4, 5 and 6 in order, the tree stays balanced and does not become a "skewed tree".

So how do you achieve the balance of AVL trees? How to ensure that the depth difference between the left and right subtrees is not more than 1? For example: insert 1, 2, 3.

After we insert 1 and 2, when we insert 3, by the definition of a binary search tree 3 must go to the right of 2. At this point the right subtree of root node 1 has depth 2, while the left subtree has depth 0 because it has no child nodes. This violates the definition of a balanced binary tree.

What should we do? Because this is a right node followed by a right node (the right-right type), we lift 2 up. This operation is called a left rotation.

Similarly, if we insert 7, 6, 5, the shape is left-left, and a right rotation lifts 6 up.

So in order to keep the balance, the AVL tree performs a series of calculations and adjustments when inserting and updating data.
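The two rotations can be sketched as follows. This is a minimal AVL insertion written for illustration (the names are my own, and deletion is left out), not code taken from any database.

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_left(node):
    """Right-right case: lift the right child above node."""
    pivot = node.right
    node.right = pivot.left
    pivot.left = node
    update(node)
    update(pivot)
    return pivot


def rotate_right(node):
    """Left-left case: lift the left child above node."""
    pivot = node.left
    node.left = pivot.right
    pivot.right = node
    update(node)
    update(pivot)
    return pivot


def insert(node, key):
    """Insert key and rebalance on the way back up; returns the new subtree root."""
    if node is None:
        return AVLNode(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:                        # left side too deep
        if key >= node.left.key:           # left-right shape: straighten first
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:                       # right side too deep
        if key < node.right.key:           # right-left shape: straighten first
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def build(keys):
    root = None
    for key in keys:
        root = insert(root, key)
    return root


print(build([1, 2, 3]).key)              # 2 is lifted up by a left rotation
print(build([7, 6, 5]).key)              # 6 is lifted up by a right rotation
print(build([1, 2, 3, 4, 5, 6]).height)  # stays shallow despite sorted input
```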

We have solved the balance problem, so how do we query data using a balanced binary tree as the index? In a balanced binary tree, each node is a fixed-size unit. What should it store as an index?

First: the key value of the index. For example, if we create an index on id, a query with the condition where id = 1 looks up the key value of id in the index.

Second: the disk address of the data, because the job of the index is to find the address where the data is stored.

Third: because it is a binary tree, it must also hold references to the left and right child nodes so that we can find the next node. For example, when the value is greater than 26, go right to the next node and continue comparing.

If the data is stored in this way, let's see what the problem is.

First of all, the index data is stored on the hard disk. View the size of the data and index:

SELECT CONCAT(ROUND(SUM(DATA_LENGTH/1024/1024), 2), 'MB') AS data_len, CONCAT(ROUND(SUM(INDEX_LENGTH/1024/1024), 2), 'MB') AS index_len FROM information_schema.TABLES WHERE table_schema='gupao' AND table_name='user_innodb';

When we store the index as a tree, retrieving a piece of data means comparing each node against the value we need at the Server layer, and if it does not match, reading the disk again. Each node access costs one disk IO. The smallest unit in which InnoDB operates on disk is a page (or disk block) of 16 KB (16384 bytes).

So a tree node is 16 KB in size. If a node stores only one key value + data + references, for example for an integer field, it may use only a dozen or a few dozen bytes, far less than the 16 KB capacity, so visiting a tree node and doing one IO wastes a lot of space.

And if each node stores too little data, finding the data we need from the index requires visiting more nodes, which means too many interactions with the disk.

In the era of mechanical hard drives, each read from disk costs about 10 ms of seek time, so the more interactions, the more time it takes.

For example, take a table with six rows stored this way. To query id=37 we have to traverse two levels of child nodes, which takes three interactions with the disk. What if we have millions of rows? The time is even harder to estimate.
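The node layout and the cost of a lookup can be sketched together. The six keys and the disk addresses below are invented for illustration (the original figure is not available); only the idea of one disk read per visited node comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexNode:
    key: int                               # indexed column value, e.g. id
    address: int                           # hypothetical disk address of the row
    left: Optional["IndexNode"] = None     # reference to the left child
    right: Optional["IndexNode"] = None    # reference to the right child


def find_address(node, key):
    """Return (address, nodes_visited); address is None when key is absent."""
    visited = 0
    while node is not None:
        visited += 1                       # one node visited = one disk read
        if key == node.key:
            return node.address, visited
        node = node.left if key < node.key else node.right
    return None, visited


# A balanced tree over six made-up ids: 5, 15, 26, 30, 35, 37.
root = IndexNode(
    26, 0x6A0,
    left=IndexNode(15, 0x2F0, left=IndexNode(5, 0x100)),
    right=IndexNode(35, 0x810, IndexNode(30, 0x7A0), IndexNode(37, 0x9C0)),
)

address, visited = find_address(root, 37)
print(hex(address), visited)
```

Each node here holds a single key, so every level of the tree costs a separate read even though a 16 KB page could hold far more.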

So what's our solution?

The first is to let each node store more data.

Second, the more keywords a node holds, the more pointers it has, which means more branches.

The more branches there are, the smaller the depth of the tree (counting the root node as depth 0). In this way, our tree changes from tall and thin to short and fat.
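A rough calculation shows how much the depth shrinks. The 16 KB page size and the 5 million rows come from the text; the 14-byte entry size (an 8-byte key plus a 6-byte pointer) is an assumption for illustration, and the model is simplified so that each level multiplies the number of reachable entries by the fan-out.

```python
PAGE_SIZE = 16 * 1024        # InnoDB page size from the text: 16384 bytes
ENTRY_SIZE = 8 + 6           # assumption: 8-byte key plus 6-byte child pointer
ROWS = 5_000_000
SEEK_MS = 10                 # rough per-read cost on a mechanical disk


def levels_needed(rows, fanout):
    """Smallest number of levels such that fanout ** levels >= rows."""
    levels = 1
    capacity = fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


fanout = PAGE_SIZE // ENTRY_SIZE            # entries that fit in one page
binary_levels = levels_needed(ROWS, 2)      # tall and thin
multiway_levels = levels_needed(ROWS, fanout)  # short and fat

print(fanout)
print(binary_levels, binary_levels * SEEK_MS, "ms")
print(multiway_levels, multiway_levels * SEEK_MS, "ms")
```

Under these assumptions a page holds 1170 entries, and the lookup path drops from 23 levels to 3.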

At this point our tree is no longer binary but multiway (multi-path).

Multiway balanced search tree (B Tree)

Like the AVL tree, the B Tree stores key values, data addresses, and node references in both branch nodes and leaf nodes.

It has one feature: the number of branches (paths) is always 1 more than the number of keywords. For example, in the tree we draw, each node stores two keywords and therefore has three pointers to three child nodes.

What are the search rules for B Tree?

For example, we look for 15 in this tree. Because 15 is less than 17, go left. Because 15 is greater than 12, go right. 15 is found in disk block 7 after only three IOs.
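The search rule can be sketched as below. The keys 17, 12, and 15 come from the text; every other key in the tree is invented to fill out the example, since the original figure is not available.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                 # sorted keywords stored in this node
        self.children = children or []   # len(keys) + 1 children; empty for a leaf


def search(node, key):
    """Return how many nodes were read to find key, or None if it is absent."""
    reads = 0
    while node is not None:
        reads += 1                       # reading a node = one disk IO
        i = bisect_left(node.keys, key)  # first keyword that is >= key
        if i < len(node.keys) and node.keys[i] == key:
            return reads
        # Not in this node: follow the pointer between the neighbouring keywords.
        node = node.children[i] if node.children else None
    return None


root = BTreeNode([17, 35], [
    BTreeNode([8, 12], [BTreeNode([3, 5]), BTreeNode([9, 10]), BTreeNode([13, 15])]),
    BTreeNode([26, 30], [BTreeNode([20, 22]), BTreeNode([28, 29]), BTreeNode([31, 33])]),
    BTreeNode([65, 87], [BTreeNode([36, 60]), BTreeNode([75, 79]), BTreeNode([90, 99])]),
])

print(search(root, 15))   # less than 17 goes left, greater than 12 goes right
print(search(root, 17))   # found in the root itself
```

Because a B Tree keeps data in branch nodes too, a lookup may stop early, as the second call shows.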

Is this more efficient than the AVL tree? So how does B Tree make it possible for a node to store multiple keywords while maintaining a balance? What's the difference between it and AVL tree?

For example, when Max Degree (the number of paths) is 3, we insert 1, 2, 3. When 3 is inserted it should go into the first disk block, but a node with three keywords would need 4 pointers and 4 child paths, so the node must split (B+Tree splits in the same way). The middle value 2 is lifted up, and 1 and 3 become child nodes of 2.

If you delete a node, the opposite merge operation occurs.

Note that this is split and merge, which is different from the left and right rotation of the AVL tree.
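The split step can be sketched with a small insertion routine for Max Degree 3. This is an illustrative sketch with invented names that covers insertion only (no deletion or merging), not InnoDB's page-split code.

```python
from bisect import bisect_left

MAX_DEGREE = 3   # at most 3 children per node, so at most 2 keywords


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty for a leaf


def _insert(node, key):
    """Insert key below node; return (median, new_right) if node had to split."""
    i = bisect_left(node.keys, key)
    if node.children:
        split = _insert(node.children[i], key)
        if split is None:
            return None
        median, right = split            # a child split: adopt its median
        node.keys.insert(i, median)
        node.children.insert(i + 1, right)
    else:
        node.keys.insert(i, key)
    if len(node.keys) < MAX_DEGREE:
        return None                      # still fits, nothing to do
    mid = len(node.keys) // 2            # too many keywords: lift the middle one
    median = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return median, right


def insert(root, key):
    split = _insert(root, key)
    if split is None:
        return root
    median, right = split
    return Node([median], [root, right])  # the tree grows one level at the top


root = Node()
for key in (1, 2, 3):
    root = insert(root, key)
print(root.keys, [child.keys for child in root.children])
```

Unlike an AVL rotation, the split never moves a subtree sideways; it only divides one node and pushes one keyword upward.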

If we continue to insert 4 and 5, the B Tree goes through further splitting operations (and merging operations when data is deleted).

We can also see that updating the index causes a lot of structural adjustment, which explains why we should not build indexes on frequently updated columns, and why we should not update the primary key.

The splitting and merging of nodes is actually the splitting and merging of InnoDB pages (page).

B+ tree (enhanced B Tree)

B Tree is already very efficient, so why does MySQL improve B Tree and end up using B+Tree?

Overall, this improved version of the B Tree solves the problem more comprehensively than the plain B Tree.

Let's take a look at the storage structure of the B+ tree in InnoDB:

B+Tree in MySQL has several characteristics:

The number of keywords is equal to the number of paths.

Neither the root node nor the branch nodes of the B+Tree store data; only the leaf nodes do. A search does not return as soon as it hits a keyword but continues to the leaf nodes at the last level. For example, when we search id=28, the key is hit directly at the first level, but all the data is in the leaf nodes, so the search continues all the way down to a leaf node.

Each leaf node of the B+Tree adds a pointer to the adjacent leaf node: its last entry points to the first entry of the next leaf node, forming an ordered linked list.

It retrieves data according to left-closed, right-open intervals [ ).

The data search process of B+Tree:

For example, to find 28, we hit the key value in the root node, but because it is not a leaf node we continue searching. 28 is the lower bound of the left-closed, right-open interval [28, 66), so we go to the middle child node and continue. There, 28 is again the lower bound of the interval [28, 34), so we go to the left child node and finally find the needed data on the leaf node.

Second, for a range query, such as querying data from 22 to 60, once 22 is found we only need to follow the leaf nodes and their pointers to reach all the matching data in one pass. This greatly improves the efficiency of range queries (there is no need to return to the parent nodes and repeat the lookup).
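Both the point lookup and the range query can be sketched on a small, bulk-built structure. The class names, the fan-out of 3, and the even-numbered keys are assumptions made for this example; real InnoDB pages hold far more entries and are maintained incrementally rather than built in one pass.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys, rows):
        self.keys = keys     # sorted keys
        self.rows = rows     # row data lives only in leaf nodes
        self.next = None     # pointer to the adjacent leaf


class Branch:
    def __init__(self, keys, children):
        self.keys = keys          # keys[i] is the smallest key under children[i]
        self.children = children  # same count as keys


def build(items, fanout=3):
    """Bulk-build a B+Tree-like structure from (key, row) pairs sorted by key."""
    level = []
    for i in range(0, len(items), fanout):
        chunk = items[i:i + fanout]
        level.append(Leaf([k for k, _ in chunk], [r for _, r in chunk]))
    for left, right in zip(level, level[1:]):
        left.next = right                        # chain the leaves together
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [Branch([child.keys[0] for child in group], group) for group in groups]
    return level[0]


def find_leaf(node, key):
    """Descend using left-closed, right-open intervals [keys[i], keys[i+1])."""
    while isinstance(node, Branch):
        i = max(bisect_right(node.keys, key) - 1, 0)
        node = node.children[i]
    return node


def lookup(root, key):
    leaf = find_leaf(root, key)                  # always ends at a leaf
    if key in leaf.keys:
        return leaf.rows[leaf.keys.index(key)]
    return None


def range_scan(root, low, high):
    """Return rows with low <= key <= high by walking the leaf chain."""
    leaf = find_leaf(root, low)
    result = []
    while leaf is not None:
        for key, row in zip(leaf.keys, leaf.rows):
            if key > high:
                return result
            if key >= low:
                result.append(row)
        leaf = leaf.next                         # no trip back through the parents
    return result


root = build([(k, f"row-{k}") for k in range(2, 56, 2)])   # keys 2, 4, ..., 54
print(lookup(root, 28))
print(range_scan(root, 22, 30))
```

The range scan descends the tree once to locate the starting leaf and then only follows next pointers.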

Characteristics of B+Tree in InnoDB:

It is a variant of B Tree and can solve every problem that B Tree solves. What are the two big problems B Tree solves? (More keywords per node; more paths.)

Stronger database and table scanning ability (to scan the whole table we only need to traverse the leaf nodes, not the whole B+Tree, to get all the data).

Stronger disk read/write ability than B Tree (root and branch nodes do not store data areas, so one node can hold more keywords, and more keywords are loaded from disk per read).

Stronger sorting ability (because each leaf node has a pointer to the next data area, the data forms a linked list).

The efficiency is more stable (B+Tree always gets the data at the leaf node, so the number of IO is stable).

This is the end of the article on "what the index in MySQL refers to". I hope the above content is of some help to you. If you think the article is good, please share it so more people can see it.
