What is the underlying principle of MySQL index?

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains the underlying principles of MySQL indexes in detail, as a reference for readers who want a working understanding of the topic.

Index type

By implementation, indexes can be divided into clustered indexes and non-clustered indexes (also called secondary indexes). By practical use, they can be further divided into ordinary indexes, unique indexes, primary key indexes, joint indexes, foreign key indexes, and full-text indexes.

InnoDB tables are organized as a clustered index: the leaf nodes of the B+ tree contain the complete data records. The InnoDB data file is itself the index file, that is, the table data file is an index structure organized as a B+Tree whose leaf-node data field holds the complete row. The key of this index is the table's primary key, so the InnoDB table data file itself is the primary index. The data field of an InnoDB secondary index stores the primary key value of the corresponding record rather than its address. In other words, every InnoDB secondary index references the primary key as its data field.

In MyISAM, by contrast, the leaf nodes of the B+ tree store only the address of the data record, so its indexes are called non-clustered. MyISAM also uses a B+Tree as the index structure, and there is no structural difference between the primary index and a secondary index (secondary key), except that the primary index requires unique keys while secondary index keys may repeat.

InnoDB has a clustered index and ordinary (secondary) indexes. The clustered index is built on the primary key, and its leaf nodes store the full row for each primary key value, so a query by primary key can locate the record directly in the clustered index. An ordinary index is built on the columns named when the index is declared, and its leaf nodes store the primary key value of the corresponding row. A query through an ordinary index first finds the primary key value in that index and then looks up the record in the clustered index by that value, commonly known as going back to the table. Fetching a whole row always requires the clustered index. If the query needs only the primary key value or the indexed columns, those values are already in the ordinary index and there is no need to go back to the table. This is called a covering index, and it can improve query efficiency to a certain extent.
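The two lookup paths described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: plain dictionaries stand in for the B+ trees, and the table contents and function names (`find_rows_by_name`, `find_ids_by_name`) are hypothetical, not anything InnoDB exposes.

```python
# Toy model of an InnoDB-style table; the rows and names are illustrative only.
clustered_index = {  # primary key -> full row (what a clustered-index leaf holds)
    1: {"id": 1, "name": "Bush", "city": "Chicago"},
    2: {"id": 2, "name": "Alice", "city": "Boston"},
    3: {"id": 3, "name": "Bush", "city": "Austin"},
}

# Secondary index on `name`: indexed value -> primary keys (not row addresses).
secondary_index_name = {}
for pk, row in clustered_index.items():
    secondary_index_name.setdefault(row["name"], []).append(pk)


def find_rows_by_name(name):
    """Full-row lookup: secondary index first, then back to the table."""
    primary_keys = secondary_index_name.get(name, [])
    # One extra clustered-index lookup per matching primary key.
    return [clustered_index[pk] for pk in primary_keys]


def find_ids_by_name(name):
    """Covered lookup: only primary keys are needed, so the table is not touched."""
    return list(secondary_index_name.get(name, []))
```

`find_ids_by_name("Bush")` is answered from the secondary index alone, while `find_rows_by_name("Bush")` pays one additional clustered-index lookup per match.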

Unique indexes and joint indexes are two special cases of ordinary indexes. When a row is inserted or modified, a unique index checks whether the value of the indexed column already exists. A joint index concatenates the values of its columns in the order in which they are declared before building the index.

A data row is not the smallest storage unit managed by the storage engine. An index can only help locate a data page, and the smallest unit of each disk read and write is also a data page, which stores multiple data rows. To know how the storage engine locates a particular row, we need to understand the internal structure of the data page; see the MySQL Storage Management series at https://url.wx-coder.cn/IF5HH.

Index selectivity

To choose index columns and string prefix lengths, we use the selectivity (Selectivity) of the index: selectivity is defined as the ratio of the number of distinct index values to the total number of records. The higher the selectivity, the more efficient queries through the index are. For a column such as gender, which has only a handful of distinct values, an index brings little benefit.

Index Selectivity = Cardinality / #T

Here Cardinality is the number of distinct index values and #T is the total number of rows in the table. The value range of selectivity is (0, 1], and the higher the selectivity, the more valuable the index, which follows from the nature of the B+Tree. In a real database, the selectivity of a column can be calculated with the following statement:

SELECT COUNT(DISTINCT title) / COUNT(*) AS Selectivity FROM titles;
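The same ratio can be computed outside the database. The sketch below assumes the column values have already been loaded into a Python list; the sample `gender` and `email` columns are made up for illustration.

```python
def selectivity(values):
    """Return (number of distinct values) / (total number of values)."""
    values = list(values)
    if not values:
        raise ValueError("selectivity is undefined for an empty column")
    return len(set(values)) / len(values)


# Hypothetical sample columns with 1000 rows each.
gender = ["M", "F"] * 500
email = [f"user{i}@example.com" for i in range(1000)]

print(selectivity(gender))  # very low: two distinct values across 1000 rows
print(selectivity(email))   # every value is distinct
```

A column like `gender` lands near the bottom of the range, which is why indexing it rarely helps, while a column of unique values reaches the maximum.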

Primary key

InnoDB arranges table data to make lookups by primary key fast, so primary key lookups are the fastest. The logical order of the key values in the clustered index determines the order in which the corresponding rows are stored. Even if the table has no column that is a natural primary key, it is recommended to use an auto-increment integer primary key (surrogate key), so that rows are stored sequentially as data is added and later queries from other tables that reference it through foreign keys are also faster.

If the primary key (Primary Key) is not explicitly defined when the table is created, the InnoDB storage engine selects or creates one as follows:

First, it checks whether the table has a unique index on non-null columns (Unique NOT NULL). If so, the first such index is used as the primary key.

If no such index exists, the InnoDB storage engine automatically creates a hidden 6-byte row ID that the user cannot view or access.
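The selection order above can be written as a small decision function. This is a sketch of the rule as described, not InnoDB code: `IndexDef`, `choose_clustered_key`, and the `HIDDEN_ROW_ID` label are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List

HIDDEN_ROW_ID = "<hidden 6-byte row id>"  # illustrative label, not a real identifier


@dataclass
class IndexDef:
    name: str
    columns: List[str]
    unique: bool = False
    primary: bool = False


def choose_clustered_key(indexes, not_null_columns):
    """Pick the clustered key in the order described in the text."""
    # 1. An explicitly declared primary key always wins.
    for idx in indexes:
        if idx.primary:
            return idx.name
    # 2. Otherwise, the first unique index whose columns are all NOT NULL.
    for idx in indexes:
        if idx.unique and all(col in not_null_columns for col in idx.columns):
            return idx.name
    # 3. Otherwise, fall back to the hidden row ID.
    return HIDDEN_ROW_ID
```

For a table with only `IndexDef("uk_email", ["email"], unique=True)`, the function returns `uk_email` when `email` is NOT NULL and the hidden row ID label when it is nullable.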

Selection of primary key

In the article "distributed ID https://url.wx-coder.cn/tQ5eH", we discussed strategies for choosing distributed IDs in distributed deployment scenarios; the same considerations apply inside the database. First, MySQL officially recommends that the primary key be as short as possible, and a 36-character UUID does not meet that requirement. If the primary key is a very long string and many ordinary indexes are built, those indexes take up a lot of physical space, because each one stores the primary key value. Second, the primary key should preferably increase sequentially; otherwise, under the InnoDB engine, the random order of UUIDs can cause rows to be relocated frequently and seriously hurt performance.

With an auto-increment ID, two adjacent inserted records are likely to fall in the same data block. A business-related key such as an order number may have weaker continuity than an auto-increment ID, so consecutive inserts land in multiple data blocks and the number of disk reads and writes increases.

Uniqueness: auto-increment IDs are easy to enumerate by brute force, and conflicts are hard to avoid during data migration, especially when tables are merged. UUIDs are unique for practical purposes and avoid such conflicts.

Key length: an auto-increment field is much shorter than a UUID, which has a large effect on retrieval performance. When InnoDB retrieves data through a secondary index, it first finds the primary key in the index and then finds the record by primary key, so a shorter primary key gives better read performance.

Concurrency: under high concurrency, contention for the auto-increment lock reduces database throughput. A UUID, by contrast, can be generated in the application layer, which improves database throughput.

Database index: InnoDB stores table data in primary key order, so writes with randomly ordered keys cause random IO and frequent movement of disk blocks. With a large amount of data, the drop in write performance becomes very obvious. New rows with an auto-increment ID arrive already in order, which greatly improves performance, while UUID primary keys have no order among them.
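The effect of key order on inserts can be illustrated with a rough simulation. The sketch below assumes a sorted Python list as a stand-in for primary-key order and simply counts how many inserts cannot be appended at the end; it does not model pages, page splits, or real IO, and `count_mid_inserts` is a hypothetical helper.

```python
import bisect
import random
import uuid


def count_mid_inserts(keys):
    """Count inserts that land before the current largest key (not a pure append)."""
    ordered = []
    mid_inserts = 0
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        if pos != len(ordered):
            mid_inserts += 1  # existing entries would have to make room
        ordered.insert(pos, key)
    return mid_inserts


rng = random.Random(42)  # fixed seed so the run is repeatable
auto_increment_keys = list(range(1, 1001))
uuid_keys = [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(1000)]

print(count_mid_inserts(auto_increment_keys))  # sequential keys are always appended
print(count_mid_inserts(uuid_keys))            # random keys mostly land in the middle
```

Sequential keys never disturb existing entries, while nearly every random UUID has to be placed somewhere in the middle of the existing order, which is the behavior the text attributes to UUID primary keys.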

Primary key and unique index

A primary key is a unique index, but a unique index is not necessarily a primary key. The columns of a unique index may be nullable, and MySQL allows multiple NULL values in a unique index, whereas primary key columns cannot be NULL. A single-column unique index requires all non-NULL values in the column to differ; a joint unique index over multiple columns requires the combination of those columns to be unique. Besides serving as an index, a unique index also acts as a data constraint that prevents inserting or updating duplicate data, thereby protecting data integrity.

For string types, an index prefix length can be specified (and it is required for BLOB/TEXT columns). The maximum prefix length in an InnoDB table is 767 bytes for the REDUNDANT and COMPACT row formats (3072 bytes for DYNAMIC and COMPRESSED); the limit is measured in bytes, while the prefix length itself is given in characters for non-binary string types. If the string is very long, building a B+Tree index on it is wasteful, so one alternative is to simulate a HASH index manually by indexing a hash of the string in a separate column. This approach, however, loses the flexibility of prefix queries on the string (such as LIKE operations).
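The manually simulated hash index can be sketched as follows. The example assumes CRC32 as the hash (a common choice for this trick, used here via Python's `zlib.crc32`) and a hypothetical `HashedStringTable` class; the important detail is that a lookup must compare the full string as well, because different strings can share a hash.

```python
import zlib


def crc(value):
    """Short integer hash of a long string (CRC32 is an assumed choice)."""
    return zlib.crc32(value.encode("utf-8"))


class HashedStringTable:
    """Rows of long strings, indexed by a short hash instead of the full value."""

    def __init__(self):
        self.rows = []
        self.hash_index = {}  # hash value -> positions of rows with that hash

    def insert(self, value):
        self.rows.append(value)
        self.hash_index.setdefault(crc(value), []).append(len(self.rows) - 1)

    def find(self, value):
        # Narrow down by hash first, then confirm against the full string,
        # since two different strings may collide on the same hash.
        candidates = self.hash_index.get(crc(value), [])
        return [pos for pos in candidates if self.rows[pos] == value]
```

Only exact-match lookups work this way; a prefix search such as `LIKE 'http://example%'` cannot use the hash column, which is the flexibility trade-off mentioned above.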

Joint index

A single-column index is an index built on one field of a table. In general, choosing an integer or a short fixed-length string as the indexed column is better for efficiency. A joint index is an index in which multiple fields are organized in a fixed order. Taking the index (name, city, gender) as an example, entries are first ordered by name; when name values are equal (such as Bush), they are ordered by city; and when city values are equal, they are ordered by gender. Because a joint index is built from multiple columns, fields that are queried frequently can be added to it. For example, if we often look up age by name, we can build a joint index on (name, age).

Joint indexes are commonly matched against WHERE conditions and ORDER BY conditions. Matching WHERE conditions means that the fields used in the equality conditions of the WHERE clause are the same as the fields of the joint index (their order in the query may differ).

Matching ORDER BY means that the ORDER BY fields are the joint index fields that follow those already matched by the WHERE conditions. Because the index is already ordered, MySQL reads the entries in order directly from the index and then fetches the rows from disk in that order, which avoids sorting the disk data. For queries whose ORDER BY is not covered by the index, a Creating sort index stage appears and time is spent sorting the data; for queries whose ORDER BY is covered, no sort is needed and most of the time goes to pulling the data from disk.

Prefix index

Prefix indexing in MySQL falls into three categories: joint index prefixes, LIKE prefixes, and string prefixes.

Joint index prefix: leftmost prefix matching (Leftmost Prefix)

A joint index prefix means that, for a multi-column index, the query must use all or part of the index columns from left to right in order to make full use of the index. For example, with (col1, col2, col3), using (col1), (col1, col2), or (col1, col2, col3) is valid. Matching proceeds to the right until a range condition (such as > or <) is encountered. Take a condition such as name = 'Bush' AND city > 'Chicago' AND interest = 'baseball' on a joint index over name, city, and interest. MySQL first uses the name field to filter out the non-Bush entries of the first field in the index slice, and then uses the second field of the joint index to locate the position of Chicago in that slice. Because this is a non-equality condition, MySQL scans sequentially from the located Chicago position. The interest values may be scattered anywhere in the third field of the scanned range, so the third field cannot take part in filtering the index slice.

Therefore, the column order of a B-Tree index is very important, and the usage rules above all depend on it. In practice, it is usually necessary to create indexes with different columns and different column orders according to specific requirements. Suppose there is an index Index(A, B, C):

# Can use the index
A > 5 AND A < 6 - leftmost prefix match
A = 5 AND B = 6 AND C = 7 - full column match
A = 5 AND B IN (2, 3) AND C > 5 - leftmost prefix match; IN fills the gap on B
# Cannot use the index
B > 5 - does not contain the leftmost prefix
B = 6 AND C = 7 - does not contain the leftmost prefix
# Can use part of the index
A > 5 AND B = 2 - uses index column A
A = 5 AND B > 6 AND C = 2 - uses index columns A and B
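The leftmost-prefix rule lends itself to a short sketch. The function below is a simplified model rather than the optimizer's real logic: it assumes each column carries at most one kind of condition, labeled `"eq"`, `"in"`, or `"range"`, and `usable_index_columns` is a hypothetical name.

```python
def usable_index_columns(index_columns, conditions):
    """Return the index columns a WHERE clause can use for filtering.

    `conditions` maps a column name to "eq", "in", or "range".
    Matching walks the index left to right, stops at the first column with
    no condition, and stops right after the first range condition.
    """
    used = []
    for column in index_columns:
        kind = conditions.get(column)
        if kind is None:
            break  # gap in the leftmost prefix
        used.append(column)
        if kind == "range":
            break  # columns to the right of a range cannot filter
    return used


index = ["A", "B", "C"]
print(usable_index_columns(index, {"A": "eq", "B": "range", "C": "eq"}))
print(usable_index_columns(index, {"B": "eq", "C": "eq"}))
```

Treating `IN` like equality mirrors the case in the list where `B IN (...)` lets the condition on C still take part.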

Using an index to sort results requires that the column order of the index match that of the ORDER BY clause and that all columns sort in the same direction (ASC/DESC). If the query joins multiple tables, index sorting works only when all ORDER BY columns refer to the first table (the tables need to be joined in that order).

# Can use the index to sort
ORDER BY A - leftmost prefix match
WHERE A = 5 ORDER BY B, C - leftmost prefix match
WHERE A = 5 ORDER BY B DESC - leftmost prefix match
WHERE A > 5 ORDER BY A, B - leftmost prefix match
# Cannot use the index to sort
WHERE A = 5 ORDER BY B DESC, C ASC - inconsistent ascending and descending order
WHERE A = 5 ORDER BY B, D - D is not in the index
WHERE A = 5 ORDER BY C - does not contain the leftmost prefix
WHERE A > 5 ORDER BY B, C - the first column is a range condition, so B and C cannot be used to sort
WHERE A = 5 AND B IN (1, 2) ORDER BY C - B is also a range condition, so C cannot be used to sort
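The sorting rules can be checked mechanically with a small model. The sketch below is a simplification under stated assumptions: only columns fixed by an equality condition may be skipped at the front of the index, all ORDER BY directions must agree, and `can_sort_with_index` is a hypothetical helper rather than anything MySQL provides.

```python
def can_sort_with_index(index_columns, equality_columns, order_by):
    """Decide whether ORDER BY can be served by the index order.

    `equality_columns` holds the columns fixed by `col = constant` in WHERE.
    `order_by` is a list of (column, direction) pairs.
    """
    index_columns = list(index_columns)
    if not order_by:
        return True  # nothing to sort
    # All columns must sort in the same direction.
    if len({direction for _, direction in order_by}) > 1:
        return False
    order_columns = [column for column, _ in order_by]
    # Leading index columns pinned by equality can be skipped.
    start = 0
    while (start < len(index_columns)
           and index_columns[start] in equality_columns
           and index_columns[start] != order_columns[0]):
        start += 1
    # The ORDER BY columns must continue the index from that point, in order.
    return index_columns[start:start + len(order_columns)] == order_columns


index = ["A", "B", "C"]
print(can_sort_with_index(index, {"A"}, [("B", "ASC"), ("C", "ASC")]))
print(can_sort_with_index(index, set(), [("B", "ASC"), ("C", "ASC")]))
```

A range or IN condition on a column is deliberately not counted as equality, which is why the last two cases in the list cannot use the index to sort.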

Like prefix

A LIKE prefix means that a LIKE query can use the index on the first_name field when the expression has the form first_name LIKE 'rMq%'. The index cannot be used for first_name LIKE '%Chu%'. For a LIKE prefix, MySQL in effect uses a padding strategy to make use of the index. For first_name LIKE 'rMq%', it pads the prefix into two values, rMqAAAAA and rMqzzzzz, where the padded length is the maximum length of the field. MySQL then uses these two values to position itself in the index, and the final result set is the data between the two anchor points.

[Figure: schematic diagram of using the like prefix]
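The idea of turning a prefix match into a range between two anchor points can be sketched over a sorted list. Note the swapped-in technique: instead of padding the prefix to the field length as described above, this sketch uses the half-open range from the prefix up to the "next" prefix (last character incremented), which selects the same kind of contiguous slice. The function name and sample keys are hypothetical.

```python
import bisect


def like_prefix_scan(sorted_keys, prefix):
    """Return every key starting with `prefix`, using two binary searches."""
    if not prefix:
        return list(sorted_keys)
    # Smallest string that sorts after every string starting with `prefix`.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    low = bisect.bisect_left(sorted_keys, prefix)
    high = bisect.bisect_left(sorted_keys, upper)
    return sorted_keys[low:high]  # the contiguous slice between the two anchors


first_names = sorted(["abc", "rMq", "rMqAAAAA", "rMqx", "rMqzzzzz", "rMr", "zzz"])
print(like_prefix_scan(first_names, "rMq"))
```

A pattern such as `'%Chu%'` has no fixed leading characters, so no pair of anchor points exists and the whole list would have to be scanned.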

String prefix

A string prefix index is an index built on only the first few characters of a string. When a field holds long values, indexing the whole value is expensive and queries are relatively inefficient; string prefix indexes exist to solve this problem. They are mainly suitable in two situations:

The selectivity of the field prefix is relatively high.

The overall selectivity of the field is not too high (if the overall selectivity of the field is high, a hash index can be used instead).

For example, suppose the first_name field is indexed with a prefix of length 4 and the query uses WHERE first_name = 'qWhNIZqxcbD'. MySQL first takes the first four characters of the equality value and compares them with the string prefix index to locate the index slice with the prefix "qWhN". It then fetches the disk data for that index slice and compares the first_name field of each row with the full value in the query condition to produce the result set.

The most important question in string prefix indexing is how to choose the prefix length. With a suitable length, the filtering power of the prefix index is almost equal to that of an index on the whole field. This relies on the concept of selectivity explained earlier, viewed here as the share of the total rows held by the largest groups after grouping by the field. For a prefix, the same measure is taken after grouping by the prefix: the closer the largest prefix groups are in size to the largest full-value groups, the better the prefix length. The SQL for comparing prefix lengths is as follows:

SELECT COUNT(*) AS cnt, first_name AS perf FROM actor GROUP BY perf ORDER BY cnt DESC LIMIT 10; -- 0
SELECT COUNT(*) AS cnt, LEFT(first_name, 2) AS perf FROM actor GROUP BY perf ORDER BY cnt DESC LIMIT 10; -- 2
SELECT COUNT(*) AS cnt, LEFT(first_name, 3) AS perf FROM actor GROUP BY perf ORDER BY cnt DESC LIMIT 10; -- 3
SELECT COUNT(*) AS cnt, LEFT(first_name, 4) AS perf FROM actor GROUP BY perf ORDER BY cnt DESC LIMIT 10; -- 4
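A related way to compare prefix lengths is to compute the distinct-value ratio for each length and stop once it comes close to the ratio for the full value. The sketch below uses that distinct-ratio definition of selectivity from earlier in the article rather than the top-10 group counts of the queries above; the function names, the `tolerance` parameter, and the sample names are assumptions for illustration, and the input list must be non-empty.

```python
def prefix_selectivity(values, length):
    """Distinct prefixes of the given length divided by the number of rows."""
    values = list(values)
    return len({value[:length] for value in values}) / len(values)


def shortest_good_prefix(values, tolerance=0.95):
    """Smallest prefix length whose selectivity reaches `tolerance` times
    the selectivity of the full values."""
    values = list(values)
    full = len(set(values)) / len(values)
    longest = max(len(value) for value in values)
    for length in range(1, longest + 1):
        if prefix_selectivity(values, length) >= tolerance * full:
            return length
    return longest


names = ["anna", "anne", "andy", "bob", "bill", "carl"]
for n in range(1, 5):
    print(n, prefix_selectivity(names, n))
print(shortest_good_prefix(names, tolerance=0.8))
```

On real data the same loop would run over the column values (or a sample of them), and the chosen length is a trade-off between index size and how many rows must be rechecked against the full value.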

Other indexes

Covering index

A covering index is an index that, in addition to the fields used for index filtering, also contains every other field the query needs, appended to the end of the index. The advantage of a covering index scan is that, because all the fields used by the query are in the same index, the relevant data can be read from the index alone without going back to disk for the corresponding rows, which avoids the most time-consuming part of the query, the disk I/O reads. Consider the following query:

SELECT a, b, c FROM t WHERE a = '...' AND b = '...'

If a joint index (a, b, c) exists, this query uses a covering index scan: the first two fields of the index, a and b, filter the index slice according to the WHERE condition, and the values of the three fields a, b, and c are read directly from the filtered index slice without going back to the table.

Three-star index

A three-star index is an index that satisfies three general conditions for a given query. Each condition satisfied for that query earns the index one star, and an index that earns all three stars is a three-star index for the query. A three-star index is the ideal index for a specific query. The conditions for building one are as follows:

Take the columns of all equality predicates (WHERE COL = ...) and put them at the beginning of the index.

Add the columns from ORDER BY to the index.

Add the remaining columns in the query statement to the index, placing volatile columns last to reduce the update cost.

For example, for the following query, the index (first_name, last_name, email) is a three-star index:

SELECT first_name, last_name, email FROM user WHERE first_name = 'aa' ORDER BY last_name

The following rules emerge from the process of building a three-star index:

Covering the equality predicates, such as first_name, filters out most of the index slice data.

Covering the ORDER BY fields, such as last_name, avoids sorting the result set.

Covering the remaining fields, such as email, avoids going back to disk to read data, that is, the query uses a covering index scan.

Index storage structure

When MySQL runs a query, it first locates the corresponding data page through the index and then checks whether that page is in the buffer pool. If it is, the page is returned directly. If not, MySQL goes to the clustered index, reads the corresponding data page through disk IO, and puts it into the buffer pool. A data page can contain multiple rows. The buffer pool manages data pages with an LRU algorithm: the most frequently used pages are at the front of the list, infrequently used pages are toward the end, and pages at the tail are evicted when the buffer pool is full. Pages newly read from disk are placed not at the head of the list but at a midpoint, whose position can be adjusted with a parameter (innodb_old_blocks_pct). The buffer pool can also be split into multiple instances, with a hashing algorithm deciding which instance a data page goes to.
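The list behavior described above (hits move to the head, new pages enter partway down, the tail is evicted) can be modeled in a few lines. This is a deliberately simplified sketch: the `MidpointLRU` class is hypothetical, the 3/8 "old" share is an assumed default, and details of the real buffer pool such as timing rules for promoting pages are left out.

```python
class MidpointLRU:
    """Simplified LRU list with midpoint insertion for newly loaded pages."""

    def __init__(self, capacity, old_fraction=3 / 8):
        self.capacity = capacity
        self.old_fraction = old_fraction  # assumed share of the list treated as "old"
        self.pages = []  # index 0 is the head (most recently used), last is the tail

    def access(self, page_id):
        """Return True on a buffer-pool hit, False if the page had to be loaded."""
        if page_id in self.pages:
            # Hit: move the page to the head of the list.
            self.pages.remove(page_id)
            self.pages.insert(0, page_id)
            return True
        if len(self.pages) >= self.capacity:
            self.pages.pop()  # evict the page at the tail
        # Miss: insert at the midpoint, ahead of the old part of the list.
        midpoint = len(self.pages) - int(len(self.pages) * self.old_fraction)
        self.pages.insert(midpoint, page_id)
        return False
```

Because a freshly loaded page starts in the middle, a one-off scan that touches many pages pushes out the old tail first instead of displacing the frequently used pages at the head.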

In the article MySQL storage structure, we discussed the storage structure of MySQL data pages.

Memory Architecture

InnoDB memory mainly consists of the following parts: the buffer pool (buffer pool), the redo log buffer (redo log buffer), and the additional memory pool (additional memory pool).

[Figure: InnoDB memory architecture]

Of these, the buffer pool takes up the largest block of memory and caches several kinds of data. Data files are read into the buffer pool by page (16K per page), and cached pages are retained according to the least recently used (LRU) algorithm. The buffer pool caches data pages, index pages, the insert buffer, the adaptive hash index, lock information, data dictionary information, and so on; data pages and index pages occupy most of it. The redo log buffer receives redo log records first and flushes them to the redo log file at a regular interval (once per second by default).

InnoDB handles related operations asynchronously through a series of background threads, and uses the buffer pool to bridge the gap between CPU and disk speed.

Storage Architecture

The logical storage structure of the InnoDB storage engine is roughly the same as Oracle's: all data is logically stored in one space, called a tablespace. A tablespace consists of segments (segment), extents (extent), and pages (page). Pages are called blocks in some documents, and 1 extent = 64 pages.

[Figure: logical storage structure of the InnoDB storage engine]

The tablespace is the top layer of the storage structure, and all data is stored in it. Without innodb_file_per_table, a shared tablespace, ibdata1, is used. If innodb_file_per_table is enabled (the default since MySQL 5.6.6), the data of each table is stored in a separate tablespace, that is, each table has its own file.

A tablespace is composed of segments. InnoDB tables are index-organized: the leaf nodes of the index hold the row data and are stored in the data segment, while the non-leaf nodes form the index structure and are stored in the index segment. An extent is composed of consecutive pages. With page sizes up to the 16KB default, an extent is always 1MB, so by default an extent contains 64 consecutive pages. The page size can be set through innodb_page_size, and the actual row records are stored in pages. A row is ultimately stored in the file in binary form.
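The extent arithmetic is easy to verify. The sketch below assumes the extent sizes documented for each innodb_page_size value (1MB for 4KB, 8KB, and 16KB pages, 2MB for 32KB pages, 4MB for 64KB pages); the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB

# Assumed mapping of page size to extent size, per the InnoDB documentation.
EXTENT_SIZE_BY_PAGE_SIZE = {
    4 * KB: 1 * MB,
    8 * KB: 1 * MB,
    16 * KB: 1 * MB,
    32 * KB: 2 * MB,
    64 * KB: 4 * MB,
}


def pages_per_extent(page_size=16 * KB):
    """Number of consecutive pages that make up one extent."""
    if page_size not in EXTENT_SIZE_BY_PAGE_SIZE:
        raise ValueError(f"unsupported page size: {page_size}")
    return EXTENT_SIZE_BY_PAGE_SIZE[page_size] // page_size


print(pages_per_extent())        # default 16KB pages
print(pages_per_extent(4 * KB))  # smaller pages, more pages per extent
```

With the default page size, 1MB divided by 16KB gives the 64 pages per extent stated in the text.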

Physically, an InnoDB table consists of the shared tablespace, the log file group (more precisely, the redo log file group), and the table structure definition file. If innodb_file_per_table is set to on, each table gets its own tablespace file ending in .ibd, which holds the table's data, indexes, and internal data dictionary information. The table structure definition file ends in .frm; it is independent of the storage engine, so every storage engine uses the same kind of .frm file for table definitions (MySQL 8.0 and later no longer use .frm files).

Process Architecture

By default, InnoDB has seven background threads: four IO threads, one Master thread, one Lock monitor thread, and one Error monitor thread. Most of InnoDB's work is done in the separate Master thread, which has the highest priority. Internally it consists of several loops: the main loop (loop), the background loop (background loop), the flush loop (flush loop), and the suspend loop (suspend loop).

The pseudo code of the main loop is as follows:

void master_thread() {
loop:
    for (int i = 0; i ...
