This article introduces the drawbacks of MySQL database indexes and how to use them well. Many people run into these situations in real work, so let's walk through how to handle them. I hope you read carefully and get something out of it!
Rational use of indexes
In day-to-day work, the most direct way to decide whether a field in a table needs an index is to ask whether that field frequently appears in our where conditions. At a macro level this reasoning is fine, but in the long run you sometimes need to think more carefully: do we really just need an index on this one field, or would a composite index over several fields be better? Take a user table as an example: its fields might include the user's name, ID-card number, home address, and so on.
1. The disadvantages of an ordinary index
Suppose there is a requirement to find a user's name from the ID-card number. The obvious first approach is to build an index on id_card; strictly speaking it should be a unique index, since ID-card numbers must be unique. Then we execute the following query:
SELECT name FROM user WHERE id_card=xxx
The process should be like this:
First search the id_card index tree to find the primary key id corresponding to id_card
Search the primary key index through id to find the corresponding name
In terms of correctness the result is fine, but in terms of efficiency this query is a bit expensive, because it traverses two B+ trees. Assuming each tree has a height of 3, the combined height is 6. Since the root nodes are in memory (two root nodes here), the final number of disk IOs is 4. At roughly 10ms per random disk IO, that comes to about 40ms in the end — a mediocre figure, not fast.
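For concreteness, here is a minimal sketch of the table this section assumes (the schema, types, and index names are illustrative, not from the original):

CREATE TABLE user (
    id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY, -- clustered (primary key) index
    id_card VARCHAR(18) NOT NULL,
    name    VARCHAR(64) NOT NULL,
    UNIQUE KEY uk_id_card (id_card)                  -- secondary index: its own B+ tree
) ENGINE=InnoDB;

-- This query walks uk_id_card to find the id, then walks the clustered index to fetch name:
SELECT name FROM user WHERE id_card = '341124199408203232';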
2. The trap of the primary key index
Since the problem is the back-to-table lookup — retrieval in two trees — the core question is whether we can retrieve on just one tree. From a business point of view there seems to be an entry point: the ID-card number is unique, so instead of the default auto-increment id, we could make the ID-card number itself the primary key. Then the whole table needs only one index, and all the required data, including the name, can be found through the ID-card number. It seems reasonable at first glance — just set id to the ID-card number on every insert — but on closer thought there is a problem.
To see it, recall a characteristic of the B+ tree: data is stored in the leaf nodes, and it is managed in pages of 16K each. What does that mean? Even if we have only one row of data, it still occupies a 16K data page, and only when a page fills up do we write to a new page. The new page and the old page are not necessarily physically contiguous. Crucially, although the pages may be physically discontiguous, the data is logically ordered.
You may wonder what this has to do with using the ID-card number as the primary key. The keyword is "ordered". ID-card numbers do not arrive in order. When we insert an out-of-order key, data may have to be moved to keep the leaf ordered. For example, if a page already holds 1 and 5 and we now insert 3, the 5 must be shifted to make room for the 3. You might say that costs little — but if inserting 3 fills page A up, it depends on whether the following page B has space: the first entry of page B must become whatever overflows from page A, and the data must be moved accordingly.
If page B does not have enough space either, a new page C is requested, part of the data is moved into C, and the link between page A and page B is severed, with page C inserted between the two — at the code level, this is re-pointing the linked list.
To sum up, using discontiguous ID-card numbers as the primary key can cause page data movement, random IO, and the overhead of frequently allocating new pages. With an auto-increment primary key, ids are strictly sequential, there is no data movement caused by random IO, and the insertion overhead is certainly smaller.
There is actually another reason not to use the ID-card number as the primary key: as a number it is too large, so you would have to store it in a bigint, whereas for, say, the students of one school an int is plenty. We know a page holds 16K; when each key takes more space, fewer entries fit per page, so for a given amount of data, bigint keys need more pages — that is, more storage — than int keys.
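A minimal illustration of the key-width point (the table names are made up):

CREATE TABLE t_int    (id INT UNSIGNED    PRIMARY KEY); -- 4-byte key
CREATE TABLE t_bigint (id BIGINT UNSIGNED PRIMARY KEY); -- 8-byte key: roughly half as many
                                                        -- entries fit on each 16K index page,
                                                        -- so the index needs more pages overall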
3. The spear and shield of the composite index
From the above, we can draw two conclusions:
Try not to go back to the table.
The ID-card number is not suitable as the primary key.
So it is natural to think of a composite index: create a composite index on [ID-card number + name]. Pay attention to the column order of a composite index — it must follow the leftmost-prefix principle. Now when we execute the following sql:
SELECT name FROM user WHERE id_card=xxx
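For reference, a sketch of creating that covering index (the index name is illustrative):

-- name is stored in the secondary index itself, so the query above is answered from one tree:
ALTER TABLE user ADD INDEX idx_card_name (id_card, name);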
With this index we no longer need to go back to the table to fetch the name field. However, it still does not solve the problem that the ID-card number itself takes too much space — that is a problem of the business data itself. If we want to address it, we can convert the original large value into a smaller one with some conversion algorithm, such as crc32:
crc32.ChecksumIEEE([]byte("341124199408203232")) // Go's hash/crc32 package
The ID-card number that originally required 8 bytes of storage can be replaced with a 4-byte crc code. The table gains a field crc_id_card, and the composite index changes from [ID-card number + name] to [crc32(ID-card number) + name], so the index takes up less space. But this conversion also has costs:
Every extra crc computation costs additional CPU.
The new field makes the index smaller, but the field itself still takes up space.
crc32 can collide, so after querying by the crc we must re-filter the results by id_card; the cost of that filtering depends on the number of duplicates — the more duplicates, the slower the filter.
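Putting the scheme together, a sketch (the column and index names are made up, and MySQL's built-in CRC32() stands in for an application-side hash such as the Go call above):

ALTER TABLE user ADD COLUMN crc_id_card INT UNSIGNED NOT NULL;
ALTER TABLE user ADD INDEX idx_crc_name (crc_id_card, name);

-- Because crc32 can collide, re-check the full id_card after the index narrows the candidates:
SELECT name FROM user
WHERE crc_id_card = CRC32('341124199408203232')
  AND id_card = '341124199408203232';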
Regarding the storage of composite indexes, here is a small detail. Suppose there are two fields A and B, occupying 8 bytes and 20 bytes respectively, and while the composite index is [A, B] we must also support querying B alone. Naturally we would also build a separate index on B, so the key space used by the two indexes is (8+20)+20 = 48 bytes per row, and both A-queries and B-queries can use an index. But if the business permits, could we instead build [B, A] plus a separate index on A? In that case we can still use an index to query by A alone or by B alone, while taking less space: (20+8)+8 = 36 bytes per row.
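The two layouts side by side (a sketch; the table is hypothetical, with a taken as an 8-byte BIGINT and b as a 20-byte CHAR(20)):

-- Layout 1: (a, b) composite plus a standalone index on b -> (8+20) + 20 = 48 bytes of keys per row
ALTER TABLE t ADD INDEX idx_ab (a, b), ADD INDEX idx_b (b);
-- Layout 2: (b, a) composite plus a standalone index on a -> (20+8) + 8 = 36 bytes of keys per row
ALTER TABLE t ADD INDEX idx_ba (b, a), ADD INDEX idx_a (a);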
4. The short and pithy prefix index
Sometimes the field we need to index is a string type, and the string is very long. We want this field indexed, but we don't want the index to take too much space, so we can consider a prefix index: build the index on the first part of the field's characters, which still enjoys the index while saving space. Note that when the prefixes repeat heavily, a prefix index will be noticeably slower than a full index.
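Before settling on a prefix length, it is worth measuring how selective the prefix is (a sketch; xx and the length 7 follow the example below):

SELECT COUNT(DISTINCT name) / COUNT(*)          AS full_selectivity,
       COUNT(DISTINCT LEFT(name, 7)) / COUNT(*) AS prefix7_selectivity
FROM xx;
-- The closer prefix7_selectivity is to full_selectivity, the less the prefix index gives up.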
ALTER TABLE xx ADD INDEX (name(7)); # index the first seven characters of name
SELECT xx FROM xx WHERE name = 'JamesBond';
5. The speed and slowness of the unique index
Before we talk about unique indexes, let's look at how an ordinary index behaves. We know that in a B+ tree, the data in the leaf nodes is ordered.
Suppose we want to look up the value 2. When the index tree leads us to a 2, the storage engine does not stop searching, because there may be more than one 2, so it keeps scanning forward along the leaf. After finding a second 2, does it stop? No — the storage engine cannot know whether more 2s follow, so it scans until it reaches the first value that is not 2, namely 3, and only then stops. This is the retrieval process of an ordinary index.
A unique index is different: because of uniqueness there can be no duplicate data, so the engine returns as soon as the target is found, without the extra forward scan of the ordinary index. From this angle the unique index is faster than the ordinary index, though when all the ordinary index's matches sit on one page, it is not much faster. For inserts, the unique index is slightly worse: uniqueness means every insert must first check whether the value already exists, a step the ordinary index skips; and importantly, the unique index cannot use the change buffer (see below).
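The trade-off in index-definition form (a sketch on this article's user table; the index names are made up):

ALTER TABLE user ADD INDEX idx_name (name);               -- ordinary: reads scan one entry past the
                                                          -- last match; writes can use the change buffer
ALTER TABLE user ADD UNIQUE INDEX uniq_id_card (id_card); -- unique: reads stop at the first match;
                                                          -- writes must check for duplicates and
                                                          -- cannot use the change buffer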
6. Don't index blindly
At work you may run into the question: does this field need an index? The common judgment is whether the field is used in queries — if it often appears in query conditions, we consider adding an index. But if you judge by that condition alone, you may add the wrong index. Consider an example: suppose a user table holds about 1,000,000 rows and has a gender field, with men and women each making up roughly half. Now we want the information of all the male users, so we index the gender field and write the sql like this:
SELECT * FROM user WHERE sex = 'male'
If nothing unusual happens, InnoDB will not choose the gender index. If it did take the gender index, it would have to go back to the table for every match — and with this much data, what would the consequences of all those back-to-table lookups be? It is the same two-tree lookup described in section 1:
Mainly a large amount of IO: one row costs 4 IOs, so what about 500,000 rows? The result can be imagined. For this situation, MySQL's optimizer will most likely do a full table scan, scanning the primary key index directly, because that is likely to perform better.
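You can check what the optimizer actually chooses (a sketch; the exact plan depends on your data and version):

EXPLAIN SELECT * FROM user WHERE sex = 'male';
-- With ~500,000 matching rows out of 1,000,000, the plan typically shows type=ALL
-- (a full scan of the clustered index) rather than an index on sex, because
-- half a million back-to-table lookups would cost far more.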
7. When the index fails
In some cases, because of our own misuse, mysql cannot use an index. This easily happens with type conversion. You might say: doesn't mysql already support implicit conversion? For example, with an integer user_id index field, suppose we are careless and write the query as:
SELECT xx FROM user WHERE user_id = '1234'
Note that '1234' here is a string. When this happens, MySQL is smart enough to convert the string '1234' into the number 1234 and then happily use the user_id index. But suppose user_id is instead a character field, and through the same carelessness we write:
SELECT xx FROM user WHERE user_id = 1234
Now there is a problem: the index is not used. You may ask why MySQL does not convert this time — why not turn the number 1234 into the string '1234'? The conversion rule needs explaining: when a string is compared with a number, remember, MySQL converts the string to a number. You may also ask: why does converting a character user_id field into a number make the index unusable? It comes down to the structure of the B+ tree index: the tree branches and sorts according to the index values. When we apply a conversion to the indexed field, the values change — an original value A may map to some value B (int(A) = B) — and then the index tree cannot be used, because the tree was built according to A, not B. So the index is not used.
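A sketch that makes the difference visible (assuming a VARCHAR user_id column with an index; xx is the article's placeholder for the selected columns):

-- A number literal forces a string->number cast on every indexed value, so the index is skipped:
EXPLAIN SELECT xx FROM user WHERE user_id = 1234;   -- likely type=ALL (full scan)
-- Matching the column type keeps the index usable:
EXPLAIN SELECT xx FROM user WHERE user_id = '1234'; -- can use the user_id index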
Index optimization
1. The change buffer
We know that when updating a row, we first determine whether the page holding that row is in memory. If it is, we update the in-memory page directly; if not, we have to read the data page from disk into memory and then update it. What's the problem with that?
Reading from disk is a little slow.
If many rows are updated at once, a lot of discrete IO may occur.
To solve the speed problem in this situation, the change buffer emerged. First, don't be misled by the word "buffer": besides living in the buffer pool, the change buffer is also persisted to disk. When we update through the change buffer and find that the target data page is not in memory, we do not read the page from disk; instead we record the pending change in the change buffer. So when is the change buffer data merged back to disk, and what if a read happens first? First, a background thread periodically merges change buffer data. Second, if a read of the page occurs before that thread gets to it, the read itself triggers the merge of that page's change buffer data.
Note that not every index can use the change buffer: the primary key index and unique indexes cannot, because uniqueness means an update must first check whether the value already exists, and if the data page is not in memory it must be read from disk anyway; an ordinary index has no uniqueness to verify, so it qualifies. The larger the change buffer, the greater the theoretical benefit: first, there are fewer discrete read IOs; second, when one data page accumulates several changes, they only need to be merged to disk once. Of course, not every scenario suits the change buffer. If your business updates data and then reads it back immediately, the change buffer is counterproductive: the constantly triggered merges mean random IO is not reduced, while the cost of maintaining the change buffer is added.
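The change buffer's share of the buffer pool is tunable (innodb_change_buffer_max_size is a real InnoDB variable; 50 is just an example value):

SHOW VARIABLES LIKE 'innodb_change_buffer_max_size'; -- percent of the buffer pool, default 25
SET GLOBAL innodb_change_buffer_max_size = 50;       -- raise the cap for write-heavy, read-later workloads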
2. Index condition pushdown
We talked about composite indexes earlier. A composite index must satisfy the leftmost-prefix principle; that is, with a composite index [A, B], the following sql can use the index:
SELECT * FROM table WHERE A = 'xx'
SELECT * FROM table WHERE A = 'xx' AND B = 'xx'
In fact, a composite index can also be used via a leftmost prefix of a column's value, that is:
SELECT * FROM table WHERE A LIKE 'Zhao%' AND B = 'Shanghai'
Note, however, that because only part of A is used, before MySQL 5.6 the above sql would go back to the table (it uses select *) immediately for every row whose A starts with 'Zhao', and only then compare whether B is 'Shanghai'. Isn't that a little puzzling? Why not judge B directly on the composite index, so that fewer back-to-table lookups are needed? The cause is again the leftmost prefix: the index is used only for the A part, and B is not used at all, which looks a bit "silly". So from MySQL 5.6 on there is index condition pushdown (ICP). With it, although only the leftmost prefix is used, we can filter out the rows whose B does not match while scanning the composite index for the A% condition, which greatly reduces the number of back-to-table lookups.
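ICP is visible in the query plan and can be toggled for comparison (a sketch; "Using index condition" and the optimizer-switch flag are MySQL's real names, while `table`, A, and B are the article's placeholders, backticked because TABLE is a reserved word):

EXPLAIN SELECT * FROM `table` WHERE A LIKE 'Zhao%' AND B = 'Shanghai';
-- With ICP the Extra column shows "Using index condition"
SET optimizer_switch = 'index_condition_pushdown=off'; -- disable to compare back-to-table counts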
3. Flushing neighbor pages
Before we talk about flushing neighbor pages, let's talk about dirty pages. We know that when updating a row, we check whether its page is in memory; if not, the page is read into memory first and then updated there. At that moment the page in memory holds the latest data while the page on disk still holds the old data: the in-memory page is a dirty page, and it must be flushed to disk to stay consistent. So the questions are: when do we flush, and how many dirty pages per flush? Flushing on every change would perform terribly; flushing too rarely lets dirty pages pile up, leaving fewer available pages in the buffer pool and hurting normal operation. So flushing should be timely but not too aggressive. MySQL has a cleaner thread that runs periodically, ensuring it is not too fast; and when there are too many dirty pages, or the redo log is nearly full, a flush is triggered immediately, ensuring it is timely.
While flushing a dirty page, InnoDB has an optimization: if the page's neighbor pages are also dirty, flush them along with it. The benefit is reduced random IO; on mechanical disks the gain should be considerable. But there is a pitfall: if a neighbor page that was flushed along immediately becomes dirty again because its data changes, the extra flush was superfluous — wasted time and expense. Worse, the behavior chains: if the neighbor's neighbor is also dirty, the chain reaction can cause a short-term performance problem.
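This behavior is controlled by a real InnoDB variable (0 means flush only the page itself, which is the default from MySQL 8.0, since SSDs gain little from neighbor flushing):

SHOW VARIABLES LIKE 'innodb_flush_neighbors';
SET GLOBAL innodb_flush_neighbors = 0; -- keep 1 on mechanical disks, 0 on SSDs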
4. MRR
In actual business we are often told to use covering indexes as much as possible and to avoid going back to the table, because back-to-table lookups need more IO and take longer. But sometimes we simply have to go back to the table, and that brings not just more IO but more discrete IO:
SELECT * FROM user WHERE grade BETWEEN 60 AND 70
Say we want the information of users with grades between 60 and 70, so the sql is written as above, and of course the grade field has an index. Ordinarily, we would find grade=60 in the grade index, fetch the row from the primary key index using its id, return to the grade index, and repeat the same action over and over. Suppose grade=60 has id=1 with its row on page_no_1, grade=61 has id=10 with its row on page_no_2, and grade=62 has id=2 with its row back on page_no_1. The naive order reads page_no_1, switches to page_no_2, then switches back to page_no_1 — yet id=1 and id=2 could have been merged so that page_no_1 is read only once, which not only saves IO but also avoids random IO. This is MRR: with MRR, the secondary index does not go back to the table immediately; instead it collects the resulting primary key ids in a buffer, sorts them, and then reads the primary key index sequentially, greatly reducing discrete IO.
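MRR can be encouraged and observed in the plan (a sketch; the mrr/mrr_cost_based flags, the "Using MRR" Extra value, and read_rnd_buffer_size are MySQL's real names):

SET optimizer_switch = 'mrr=on,mrr_cost_based=off'; -- make the optimizer prefer MRR
EXPLAIN SELECT * FROM user WHERE grade BETWEEN 60 AND 70;
-- Extra shows "Using MRR": primary key ids are gathered in a buffer (sized by
-- read_rnd_buffer_size), sorted, and the clustered index is then read in id order.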
This is the end of "the disadvantages of MySQL database index and how to use it". Thank you for reading!