What is the reason why indexes can improve query performance?

2025-07-12 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains why an index can improve query performance. Many people wonder about this in day-to-day work, so the editor has gathered material and organized it into a simple, easy-to-follow explanation. Let's work through it.

Binary tree

A tree is a hierarchical set of n (n > 0) nodes. Drawn out, it looks like an upside-down tree, which is where the name comes from.

The number of child nodes a node has is called its degree; informally, it is the number of forks. The largest degree of any node in a tree is called the degree of the tree, also known as its order. In a second-order tree each node has at most two child nodes, that is, at most two forks, so such a tree is called a binary tree. It is the simplest member of the tree family.

So a tree with two forks is a binary tree. But apart from storing data in a particular structure, it seems to have nothing to do with query performance. Is it just another useless gimmick?

Binary search

The power of the binary tree is said to come from an algorithm called binary search.

As the story goes, the primitive society of parrots had a strict hierarchy, and every bird was ranked by its height.

So the question is, as shown in the picture below, how can we find the tallest, shortest and medium height parrots, as well as the one of a specified height?

The first method: scanning method

Measure one by one, and after that all the problems are easily solved.

Measuring every item in turn like this is called scanning, and its shortcoming is obvious: the tallest and the shortest are known only after every parrot has been measured.

For a specified height, the best case is finding it on the first measurement and the worst case is finding it on the last, so the time complexity is O(n). In other words, finding a specified height among 13 parrots takes 13 measurements in the worst case.
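The scanning method can be sketched in a few lines of Python. This is only an illustration: the function name `scan_for_height` and the 13 heights are made up for the example and do not come from the article's figure.

```python
def scan_for_height(heights, target):
    """Check each parrot in turn; return (index, comparisons made)."""
    comparisons = 0
    for index, height in enumerate(heights):
        comparisons += 1
        if height == target:
            return index, comparisons
    return -1, comparisons


# Hypothetical heights (cm) of 13 parrots, in no particular order.
heights = [31, 18, 27, 22, 35, 16, 29, 24, 33, 20, 26, 37, 14]

best = scan_for_height(heights, 31)    # target happens to stand first
worst = scan_for_height(heights, 14)   # target happens to stand last

print(best)
print(worst)
```

The best case stops after one comparison and the worst case needs all 13, which is the O(n) behaviour described above.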

The second method: binary search

All 13 parrots line up from shortest to tallest and count off.

The one who calls out 1 is the shortest, the one who calls out 13 is the tallest, and the one who calls out 7 is of medium height.

Best case and worst case alike, each is found in one step. Query performance has suddenly improved 13-fold: no matter how many parrots there are, the time complexity is O(1). Impressive.

Question: I am not convinced. You are quietly swapping concepts. Compare the cost of finding a parrot of a specified height instead.

Because the parrots have lined up according to their height, the parrot of a specified height is either standing in the middle or in the group on its left or right.

If it is the middle one, it is found at once. If not, only the left or right half needs to be searched: take the middle parrot of that half and compare heights again.

And so on. Each step halves the range being searched, so the time complexity is O(log2(n)).

log2(13) is about 3.7, which rounds up to 4, so the worst case takes only 4 comparisons. The time complexity is not really O(1), but it is still good. It is simplified as follows:
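In code, the halving procedure might look like the sketch below. The function name `binary_search` and the heights are assumptions made for illustration; the list must already be sorted in ascending order.

```python
def binary_search(sorted_heights, target):
    """Return (index, probes) for target in an ascending list, or (-1, probes)."""
    low, high = 0, len(sorted_heights) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2          # the parrot in the middle of the range
        probes += 1
        if sorted_heights[mid] == target:
            return mid, probes
        if sorted_heights[mid] < target:
            low = mid + 1                # keep only the taller half
        else:
            high = mid - 1               # keep only the shorter half
    return -1, probes


# Hypothetical heights of 13 parrots, lined up from shortest to tallest.
sorted_heights = [14, 16, 18, 20, 22, 24, 26, 27, 29, 31, 33, 35, 37]

middle = binary_search(sorted_heights, 26)
worst_probes = max(binary_search(sorted_heights, h)[1] for h in sorted_heights)

print(middle)
print(worst_probes)
```

For these 13 values the middle parrot is found on the first probe, and no parrot takes more than 4 probes.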

Question: lining up by height still requires comparing every parrot. How is that any different from scanning? Why not just scan directly?

Indeed, for a single query, sorting first and then running a binary search is not necessarily faster than scanning, and may even be slower.

In the world of data, however, most data is queried countless times over its lifetime. If the data is put in order once, when it is created, binary search can be used directly for the rest of its life. This is the familiar read-heavy, write-light pattern, where the one-off sorting cost is reused by every later query.
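The "sort once, query many times" idea can be illustrated with Python's standard-library `bisect` module, which performs binary search on an already sorted list. The data and the helper name `contains` are hypothetical.

```python
import bisect

# Hypothetical data, sorted once when it is created.
heights = [31, 18, 27, 22, 35, 16, 29, 24, 33, 20, 26, 37, 14]
sorted_heights = sorted(heights)          # one-off cost, O(n log n)


def contains(sorted_values, target):
    """Binary search via bisect_left; O(log n) per query."""
    position = bisect.bisect_left(sorted_values, target)
    return position < len(sorted_values) and sorted_values[position] == target


# Every later query reuses the same ordering instead of rescanning.
queries = [26, 15, 37, 14, 40]
results = [contains(sorted_heights, q) for q in queries]

print(results)
```

The sorting cost is paid once, while each of the later queries costs only a handful of comparisons.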

Advantages:

Find it quickly

Disadvantages:

The data must be ordered, so it has to be sorted in advance.

Every lookup has to keep recalculating the middle position.

Binary search tree

If a set of data never or rarely changes, the positions of its items stay basically the same. Recalculating the middle position on every query is then wasted work.

Can we record all the middle nodes once and simply follow them each time?

Look at the figure below: find every middle node that a binary search could visit, connect them, and lift the structure up by the overall middle node. The result is a binary search tree.

Advantages: the binary search tree implements the binary search algorithm as a data structure. By storing the middle nodes, it removes the cost of calculating the middle position on every lookup.
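One way to picture this construction is the sketch below: the midpoint of each range becomes the root of a subtree, so a lookup just follows stored links. The names `Node`, `build_from_sorted` and `find`, and the heights, are illustrative assumptions.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def build_from_sorted(values, low=0, high=None):
    """Make the midpoint of each range the root of that subtree."""
    if high is None:
        high = len(values) - 1
    if low > high:
        return None
    mid = (low + high) // 2
    return Node(values[mid],
                build_from_sorted(values, low, mid - 1),
                build_from_sorted(values, mid + 1, high))


def find(node, target):
    """Follow stored links instead of recomputing midpoints; count visited nodes."""
    visited = 0
    while node is not None:
        visited += 1
        if target == node.value:
            return True, visited
        node = node.left if target < node.value else node.right
    return False, visited


# Hypothetical sorted heights of 13 parrots.
sorted_heights = [14, 16, 18, 20, 22, 24, 26, 27, 29, 31, 33, 35, 37]
root = build_from_sorted(sorted_heights)

print(root.value)
print(find(root, 16))
```

The root is the overall middle value, and each lookup visits at most 4 nodes for these 13 values, matching the binary search on the sorted list.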

Balanced binary tree

If a binary search tree is modified continually, for example by deleting some nodes, then after a while the original middle node (the root) is probably no longer in the middle.

The middle position is like the fulcrum of a scale. If it is off-center, the whole tree is thrown out of balance and degrades into a lopsided tree, or even into what is effectively a linked list.

The key to the binary search algorithm is ordering plus middle nodes, and the key to the binary search tree is keeping those middle nodes in place. If a stored node is no longer in the middle, it loses its purpose.

Therefore the "binary search tree" must be kept in proper shape: the root node is in the center and the left and right subtrees have the same height (the difference in height is no more than 1). That is a balanced tree.

The most common balanced binary tree is the red-black tree:

The red-black tree defines a set of node color rules, plus left and right rotation operations that restore those rules, and in this way keeps the tree balanced. (Its guarantee is looser than the height rule above: the longest path from the root to a leaf is at most twice the shortest.)

The colors and rules look daunting at first glance, but they exist only to keep the binary tree balanced. Maintaining balance is too intricate to state in one sentence, so it is expressed as a set of rules and steps; as long as those steps are followed, the tree stays balanced.

Balanced binary tree = binary search tree + balanced (the difference between left and right height is not more than 1)

A balanced binary tree does not make a binary search tree any faster than its best case. It simply ensures that repeated insertions and deletions cannot degrade the tree into a linked list or a lopsided, incomplete tree; balance is always maintained.
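The difference between a degraded tree and a balanced one can be seen by measuring heights. The sketch below uses a plain binary search tree insert with no rebalancing at all (it is not a red-black tree), and only compares two insertion orders; all names and values are assumptions for illustration.

```python
def insert(tree, value):
    """Plain BST insert with no rebalancing. A tree is a dict or None."""
    if tree is None:
        return {"value": value, "left": None, "right": None}
    side = "left" if value < tree["value"] else "right"
    tree[side] = insert(tree[side], value)
    return tree


def height(tree):
    if tree is None:
        return 0
    return 1 + max(height(tree["left"]), height(tree["right"]))


def build(values):
    tree = None
    for value in values:
        tree = insert(tree, value)
    return tree


# Inserting 1..13 in ascending order: every node hangs off the right side.
ascending = list(range(1, 14))
chain_height = height(build(ascending))

# Inserting midpoints first gives the shape a balanced tree would maintain.
midpoint_order = [7, 3, 11, 1, 5, 9, 13, 2, 4, 6, 8, 10, 12]
balanced_height = height(build(midpoint_order))

print(chain_height, balanced_height)
```

With the same 13 values, the unbalanced tree is 13 levels deep (effectively a linked list), while the balanced shape needs only 4 levels. A self-balancing tree performs rotations so that the first case cannot arise.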

In addition, not only binary trees, but other kinds of trees also need to be orderly and balanced in order to exert the greatest power.

Multi-way trees: the B-tree

A tree with two forks halves the search range at every step, which in theory doubles performance. Can more forks improve performance even further?

Consider the third-order (three-forked) tree in the figure below (all data is for demonstration only, not a real distribution).

Each node holds two values and points to up to three child nodes. For the root in the figure, the three child nodes hold values less than 17, between 17 and 35, and greater than 35.
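The node layout just described can be sketched as follows. Only the values 17, 35, 8, 12, 9 and 10 come from the article; every other value, and the names `MultiwayNode` and `search`, are hypothetical fillers chosen so that the tree holds 26 values in 13 nodes.

```python
class MultiwayNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted values held by this node
        self.children = children or []    # empty for leaves, else len(keys) + 1


def search(node, target):
    """Return (found, nodes_visited) for a multi-way search tree."""
    visited = 0
    while node is not None:
        visited += 1
        slot = 0
        # Skip every key that is smaller than the target.
        while slot < len(node.keys) and target > node.keys[slot]:
            slot += 1
        if slot < len(node.keys) and node.keys[slot] == target:
            return True, visited
        # Descend into the child that covers the target's range, if any.
        node = node.children[slot] if node.children else None
    return False, visited


leaf = MultiwayNode
root = MultiwayNode([17, 35], [
    MultiwayNode([8, 12], [leaf([3, 5]), leaf([9, 10]), leaf([13, 15])]),
    MultiwayNode([26, 30], [leaf([19, 22]), leaf([28, 29]), leaf([31, 33])]),
    MultiwayNode([65, 87], [leaf([36, 60]), leaf([75, 79]), leaf([90, 99])]),
])

print(search(root, 10))
```

Looking up 10 visits three nodes: the root, its left child, and that child's middle child. Each visited node rules out two thirds of the remaining values rather than half.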

Suppose we look for the number 10 in the figure above. The steps are as follows:

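Find the root node, compare 10 with 17 and 35, and find that 10 < 17, so the data is in the left child node, that is, a second-level node.

Follow the root node's pointer to the left child node, compare 10 with 8 and 12, and find that 8 < 10 < 12, so the data is in the middle child of the current node, that is, a third-level node.

Follow the pointer from the previous step to the middle child node (third level), compare 10 with 9 and 10, and find that 9 < 10 == 10, so the second value in the current node is the result.

Counting the 12 values omitted from the figure, finding the number 10 among 26 values took only log3(26) ≈ 3 steps, whereas a balanced binary tree would need log2(26) ≈ 5. This shows that a multi-way tree can indeed improve search performance further.

A multi-way tree builds on the binary search tree by storing more values in each node and giving each node more children, so a single step narrows the search range more.

Advantages: building on the balanced binary tree, loading one node brings in more routing data and shrinks the query range further.

Complex nodes

So far, all the data we have listed are lone numbers. Think about it: if you already have the value 10 in hand, why go to the trouble of finding that same 10 in a pile of data? Searching for what you already hold makes no sense.

Lone numbers only live in demos; the real world is far more complex. Let's look at a case closer to a real scenario.

Here is a third-order tree indexed by age that stores a batch of user records, as shown in the figure below:

The numbers are the users' ages; everything else is business data that plays no part in ordering or searching the tree. A balanced multi-way (multi-order) tree like this, in which the index data and the unrelated business data are kept together in each node, is called a B-tree (also written B tree).

Disadvantages: the business data may be far larger than the index data. Each time data is loaded into memory and the CPU cache for a comparison, both the index data and the unrelated business data have to be read. Index data that could have been loaded in a single read now takes several reads. And if the node being compared is not the one being sought, the business data loaded into memory is useless and is simply discarded.

Disk I/O

A computer's main functions are computation, storage and networking. Much of the data used in computation, and much of the results, must be stored for later use. The process of writing to and reading from disk is called disk I/O. How a disk is read, and how fast, seriously affects the computing performance of the whole business. Let's take a brief look at how a disk works. A disk looks roughly like this: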

A disk drive is mainly composed of platters, an actuator arm, read-write heads and a motor.

To increase storage capacity, several platters are stacked on the spindle like candied fruit on a skewer. The motor spins the spindle and the arm moves, which lets the read-write head read and write data on the platter. It is roughly as follows:

Each platter consists of many concentric circles of different radii, called tracks, on which the data is written.

Each track is divided into blocks called sectors.

If the disk drive is a notebook, then a platter is a page, the spindle is the binding, a track is a line on the page, and a sector can be seen as a wide column.

If you store a poem on disk, imagine something like this.

To read from the disk, you need to find the platter that holds the data, and then the corresponding track and sector. This is like finding the page, row and column where the data sits in a book.

Because each platter has its own head, the key to performance is finding the row and the column, that is, seeking and platter rotation. Seeking means moving the head to the track that holds the data, which is like moving to the right line. The head can only move across tracks, from one line to another; it cannot move along a track. So the platter spins at high speed to bring the target sector under the head, much as when writing Spring Festival couplets the brush stays put and the paper moves.

To sum up, a disk locates data through mechanical motion, while the CPU computes with electrical signals. Roughly speaking, mechanical positioning is orders of magnitude slower than electronic processing. In a word, the disk is too slow.
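A back-of-the-envelope calculation gives a feel for the gap. The figures below are assumptions chosen for illustration (a 7200 RPM spindle and a 3 GHz CPU); they are not taken from the article, and seek time is left out entirely.

```python
# Assumed, illustrative figures; they are not taken from the article.
RPM = 7200                 # a common spindle speed for desktop hard drives
CPU_HZ = 3_000_000_000     # a hypothetical 3 GHz CPU

seconds_per_rotation = 60 / RPM
# On average the target sector is half a rotation away from the head.
avg_rotational_latency = seconds_per_rotation / 2

cycles_spent_waiting = avg_rotational_latency * CPU_HZ

print(f"one rotation: {seconds_per_rotation * 1000:.2f} ms")
print(f"average rotational latency: {avg_rotational_latency * 1000:.2f} ms")
print(f"CPU cycles in that time: {cycles_spent_waiting:,.0f}")
```

Under these assumptions, waiting for the platter to rotate takes about 4.17 ms on average, during which the CPU could run roughly 12.5 million cycles, before any seek time is counted.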

Although the disk is slow, it is currently a relatively cheap and stable storage device, so it cannot be abandoned. It can, however, be optimized in the following ways.

Minimize the number of I/Os, for example by using a cache.

Fetch as much data as possible in each I/O.

Make sure the data fetched each time is useful, which indirectly reduces the total number of I/Os.

Multi-way trees: the B+ tree

Whatever data structure a database index uses, the data is ultimately stored on disk.

Given the poor performance of disk I/O and the upper limit on how much data each I/O can fetch, the best way to improve the index itself is to reduce the number of I/Os and to fetch useful data each time.

The B-tree greatly improves on the rest of the tree family: it stores multiple values together in one node, which in itself can reduce the number of I/Os or seeks.

However, it still has a fatal flaw: its index data is bundled with the business data, and the business data is likely to be much larger than the index data. This greatly reduces the amount of useful data obtained per I/O and indirectly increases the number of I/Os needed to read the useful index data.
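The effect of bundling business data with index data can be estimated with rough arithmetic. Every size below is an assumption made for illustration (a 16 KB node read in one I/O, 8-byte keys and pointers, 1 KB records, 10 million rows), and the model is deliberately simplified: it treats the number of entries that fit in a node as its fan-out.

```python
def levels_needed(entries_per_node, total_rows):
    """Smallest number of levels whose combined fan-out covers total_rows."""
    levels, reachable = 1, entries_per_node
    while reachable < total_rows:
        reachable *= entries_per_node
        levels += 1
    return levels


# Hypothetical sizes, chosen only to make the comparison concrete.
PAGE_BYTES = 16 * 1024     # data fetched by one I/O
KEY_BYTES = 8              # one index value
POINTER_BYTES = 8          # one child pointer
RECORD_BYTES = 1024        # business data attached to one index value
TOTAL_ROWS = 10_000_000

with_business_data = PAGE_BYTES // (KEY_BYTES + POINTER_BYTES + RECORD_BYTES)
index_only = PAGE_BYTES // (KEY_BYTES + POINTER_BYTES)

print(with_business_data, levels_needed(with_business_data, TOTAL_ROWS))
print(index_only, levels_needed(index_only, TOTAL_ROWS))
```

With these numbers, a node that carries business data holds only 15 entries and the tree needs 6 levels, while a node that carries index data alone holds 1024 entries and needs 3 levels. Fewer levels means fewer I/Os per lookup.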

Business data is the ultimate goal of a query, yet during the intermediate steps of the "binary" search it is useless. So could the business data be stored only in the node where the query finally ends?

The idea is appealing, but reality is harsher: who knows in advance which node a query will end at?

This is where the B+ tree comes in. A B+ tree is a balanced multi-way tree that separates index data from business data.

In a B+ tree, non-leaf nodes store only index data, while leaf nodes store both index data and business data. This keeps the non-leaf nodes simple and clean and greatly reduces the amount of data they hold, while still guaranteeing that the corresponding business data can be found in the end. It increases the share of useful data in each I/O, reduces the number of I/Os, and still meets the business need.

In a real database, however, the index is stored separately from the table data, unlike in the example. How does that work?

As shown in the figure, we only need to replace the real business data with the address where that data resides. The address then plays the role of the business data in the B+ tree.
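A minimal sketch of this arrangement is shown below. It is a toy model, not how any particular database lays out its pages: the class names, the addresses, and the user records are all hypothetical, and each age is assumed to be unique.

```python
import bisect


class Internal:
    def __init__(self, keys, children):
        self.keys = keys            # separator keys only, no business data
        self.children = children    # len(keys) + 1 children


class Leaf:
    def __init__(self, keys, addresses):
        self.keys = keys            # index values, e.g. ages
        self.addresses = addresses  # where each record lives, not the record itself


def lookup(node, key):
    """Descend to a leaf, then return the stored address (or None)."""
    while isinstance(node, Internal):
        # Keys equal to a separator live in the right-hand child.
        node = node.children[bisect.bisect_right(node.keys, key)]
    position = bisect.bisect_left(node.keys, key)
    if position < len(node.keys) and node.keys[position] == key:
        return node.addresses[position]
    return None


# Hypothetical table storage: address -> full business record.
table = {
    0x10: {"age": 9, "name": "Ann"},
    0x20: {"age": 10, "name": "Bob"},
    0x30: {"age": 17, "name": "Cid"},
    0x40: {"age": 35, "name": "Dee"},
}

root = Internal([17, 35], [
    Leaf([9, 10], [0x10, 0x20]),
    Leaf([17], [0x30]),
    Leaf([35], [0x40]),
])

address = lookup(root, 10)
record = table[address]
print(hex(address), record)
```

The tree itself never touches the business records. Only after the leaf yields an address is the full record fetched from the table, so intermediate steps read nothing but index data.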

Summary

Data is stored on disk, and the disk is slow at handling data (even an SSD is not in the same order of magnitude as the CPU).

Disk performance is improved mainly by reducing the number of I/Os and by increasing the amount of useful data in each I/O.

By being multi-order (a node holds multiple values and points to multiple child nodes), the index makes the tree shorter and fatter, which reduces the number of I/Os.

By using a B+ tree, the index separates business data from index data, which increases the amount of useful data in each I/O and so reduces the number of I/Os.

Through the ordering of the tree's data and "binary search" (a multi-order tree can be thought of as multi-way search), the index greatly narrows the range a query has to examine.

An index covers a single field or a few fields, so it holds far less data than the full records. Even scanning the index is therefore much faster than scanning the table itself.

Knowledge expansion

The biggest advantage of the tree structure is its high query performance, so anything that needs faster lookups can consider a tree.

There are indeed such examples in practice, such as:

HashMap converts a bucket's linked list into a red-black tree when too many entries collide

The B+ tree used by database indexes

The dictionary tree (trie) used by search engines' inverted indexes

The above is only a brief account of why a database's B+ tree index improves query performance, and of the basic process.

It does not go into the details of the various data structures, nor does it cover other index types or the specific storage formats of indexes. The aim is only to give an intuitive understanding of the index.

This concludes the look at why an index can improve query performance. I hope it has answered your questions. Theory is best learned alongside practice, so go and try it!
