How to understand and realize the principle and optimization of index

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how indexes work and how to optimize them. The explanation is kept simple and clear so that it is easy to follow.

Part1 Why don't we need to care about indexes in Kafka, while we do in MySQL?

Although both Kafka and MySQL ultimately store their data on disk, they differ greatly in how that data is used and queried, and this determines the difference in their storage structures and in the complexity of their indexes.

Let's first take a look at the storage structure of Kafka:

Kafka is positioned for stable, high-performance reads and writes, so it reads and writes the disk sequentially. Messages land in .log files, each named after its base offset (starting from 0).

To make lookups fast, Kafka builds a sparse index, the .index file: an index entry is created only at intervals rather than for every message. Each entry maps a message offset to its physical position in the .log file. A lookup binary-searches the index for the nearest preceding entry, then scans the log sequentially from that position to find the target.
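The lookup described above can be sketched roughly as follows. This is a minimal in-memory illustration, not Kafka's implementation: the `log` list, `INDEX_INTERVAL`, and `find_message` are hypothetical names, and spacing index entries by message count is an assumption made for simplicity (Kafka itself spaces entries by bytes written).

```python
from bisect import bisect_right

# Hypothetical stand-in for one log segment: (offset, payload) in append order.
log = [(offset, f"message-{offset}") for offset in range(100)]

INDEX_INTERVAL = 10  # assumption: one index entry per 10 messages

# Sparse index: (message offset, position in the log) for every 10th message only.
sparse_index = [
    (offset, pos) for pos, (offset, _) in enumerate(log) if pos % INDEX_INTERVAL == 0
]
index_offsets = [offset for offset, _ in sparse_index]


def find_message(target_offset):
    """Binary-search the sparse index, then scan the log sequentially."""
    # Last index entry whose offset is <= target_offset.
    slot = bisect_right(index_offsets, target_offset) - 1
    if slot < 0:
        return None  # target lies before the first indexed offset
    _, pos = sparse_index[slot]
    # Sequential scan starting from the indexed position.
    for offset, payload in log[pos:]:
        if offset == target_offset:
            return payload
        if offset > target_offset:
            break
    return None
```

The index holds only one entry per interval, so it stays small, and the sequential scan after the binary search never covers more than one interval.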

Kafka's index organization is therefore relatively simple and its scheme relatively fixed. MySQL's is not. MySQL is a relational database built to support complex business queries, with many different query patterns and data access requirements, so it needs a more sophisticated index mechanism to accelerate them.

Part2 How MySQL data is organized [1] [2]

Taking the InnoDB storage engine as the example, MySQL stores its data as follows:

After consulting three books, I have summarized the most important parts.

The data is divided into multiple logical layers: row -> page -> extent -> segment -> tablespace.

InnoDB tables are index-organized (the data is the index, and the index is the data). Everything is maintained on a B+ tree: the data segment holds the leaf nodes and the index segment holds the non-leaf nodes.

Segments and extents exist to make better use of operating system resources (for example, the amount of data loaded from disk into memory at a time can be specified in extents), so that reads and writes are more efficient.

The page is the smallest unit of interaction between MySQL and the disk. The questions that follow are how rows are found within a page, and how pages aggregate into extents, then segments, and finally the tablespace.

1 The smallest unit of data: the row

The structure of a single record, extracted from the overall picture above, is as follows:

In addition to the row's own data, the record header contains next_record, which identifies the next record, so records are chained through next_record into a singly linked list. This means that looking for a record along the chain can only be done by sequential traversal, which in turn means a single chain must not be too long.

But a page is 16 KB by default and, taking row overflow and other handling into account, can store up to 7992 rows. With that many records, must they all be traversed sequentially? Of course not. Let's see how the page organizes its records.

2 The smallest unit of interaction with the disk: the page

As the smallest unit of interaction with the disk, the page stores the actual data (pages of type B-tree Node store the real data, while other types, such as index directory pages, are used to speed up queries). The overall structure of a page can be seen roughly in the larger picture above:

Let's look at a few key field parameters:

Page Directory determines the query efficiency of record items on the page.

To speed up queries, the Page Directory stores the page's directory entries (slots). The records on the page, including the minimum and maximum records, are divided into groups, and each slot holds the offset of the largest record in its group. This makes it possible to find data quickly by binary search instead of traversing from the minimum record, as shown below:

The picture is from "Understanding MySQL from the Root".
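As a rough sketch of the slot idea, the following treats a page as a sorted list of keys and a slot as the list index of the largest record in its group. The names `records`, `GROUP_SIZE`, `slots`, and `find_in_page` are hypothetical, the fixed group size is an assumption, and the special minimum and maximum records are left out to keep the example short.

```python
from bisect import bisect_left

# Hypothetical page content: keys in primary key order, i.e. the order of the record chain.
records = [3, 5, 8, 13, 21, 34, 55, 89, 144, 233]

GROUP_SIZE = 4  # assumption for this sketch

# Each slot stores the position ("offset") of the largest record in its group.
slots = [
    min(start + GROUP_SIZE, len(records)) - 1
    for start in range(0, len(records), GROUP_SIZE)
]
slot_max_keys = [records[pos] for pos in slots]


def find_in_page(key):
    """Binary-search the slots, then walk the chain inside a single group."""
    s = bisect_left(slot_max_keys, key)  # first group whose largest key is >= key
    if s == len(slots):
        return None  # key is larger than every record on the page
    start = slots[s - 1] + 1 if s > 0 else 0
    for pos in range(start, slots[s] + 1):  # at most GROUP_SIZE steps
        if records[pos] == key:
            return pos
    return None
```

Only the slots are binary-searched. The records themselves are still visited one after another, but only within one small group.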

File Header determines how pages are associated with each other

It records general information about the page, mainly:

<page number, previous page, next page, page type, owning tablespace, and so on>

A page is located by its page number, pages are chained into a doubly linked list through the previous-page and next-page pointers, and the page type tells whether a page is an index page or a data page, and so on.

The picture is from "Understanding MySQL from the Root".

Through these properties, the File Header makes it easy to associate pages with one another.

Page Header determines the level of the page

The data information stored on this page mainly contains *.

With this picture of how data is organized within a page, how can these structures be used to query data quickly?

Part3 The evolution of the index

As the data organization above shows, row records are chained into a singly linked list and grouped between the minimum and maximum records of each page.

Pages are chained into a doubly linked list through their previous-page and next-page pointers and stored on disk, as shown below:

So, how do we query a record?

3 Original: sequential mode

As the figure above shows, the way the data is chained together naturally provides one way to query: traverse each page, and the records within each page, in primary key order.
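A minimal sketch of this sequential mode, assuming a hypothetical `Page` class whose `next` field plays the role of the next-page pointer. The page numbers and rows are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    page_no: int
    records: list  # (primary_key, row) pairs in primary key order
    prev: Optional["Page"] = None
    next: Optional["Page"] = None


def link(pages):
    """Chain pages into a doubly linked list and return the first page."""
    for left, right in zip(pages, pages[1:]):
        left.next, right.prev = right, left
    return pages[0]


def sequential_find(first_page, key):
    """Walk every page, and every record in it, until the key is found."""
    page, pages_visited = first_page, 0
    while page is not None:
        pages_visited += 1
        for pk, row in page.records:
            if pk == key:
                return row, pages_visited
        page = page.next
    return None, pages_visited


first = link([
    Page(10, [(1, "a"), (3, "b"), (5, "c")]),
    Page(28, [(7, "d"), (9, "e"), (12, "f")]),
    Page(9, [(15, "g"), (20, "h"), (31, "i")]),
])
```

Here `sequential_find(first, 20)` has to visit all three pages before it reaches the row, so the cost grows linearly with the number of pages.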

However, apart from the binary search within a page, this query method is inefficient. What can be done?

Looking for improvement: since row records within a page can be grouped into slots, why not do the same between data pages?

4 Improvement: directory mode

We gather the pages into a page directory: first look up the directory, then look in the corresponding page, which is much faster than a sequential search.
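The directory mode can be sketched as a sorted array with one entry per page. The `pages` dictionary, the page numbers, and the function names below are hypothetical, and keeping the smallest key of each page as the directory key is an assumption of this sketch.

```python
from bisect import bisect_right

# Hypothetical pages: page number -> sorted primary keys stored on that page.
pages = {10: [1, 3, 5], 28: [7, 9, 12], 9: [15, 20, 31]}

# Directory: one (smallest key on the page, page number) entry per page, sorted by key.
directory = sorted((keys[0], page_no) for page_no, keys in pages.items())
directory_keys = [key for key, _ in directory]


def find_page(key):
    """Binary-search the directory for the only page that could hold the key."""
    i = bisect_right(directory_keys, key) - 1  # last entry with smallest key <= key
    if i < 0:
        return None  # key is smaller than every key in the directory
    return directory[i][1]


def contains(key):
    """Look in the directory first, then in the single candidate page."""
    page_no = find_page(key)
    return page_no is not None and key in pages[page_no]
```

The directory is a contiguous sorted array, so inserting a page in the middle means shifting every entry after it.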

Looking for improvement: this approach requires a lot of contiguous space, and the directory changes frequently as the data changes. What can be done?

5 Evolution: primary key B+ tree mode

When describing the row record structure, we saw that besides the actual business data, a data row carries a good deal of extra information.

For example, record_type indicates whether a record is a data record or an index (directory) record. It is these extra fields that let InnoDB organize indexes in a more suitable way:

The picture is from "understanding MySQL from the Root."

This is a B+ tree. The page nodes are arranged in levels, and the row records within a page carry a type.

Business data is held in the leaf nodes, and directory data is held in the non-leaf nodes.

The advantage of this organization is that a small number of levels is enough to hold a large number of data items (this can be estimated by simply assuming the size of each page and of each entry).
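One way to make that estimate concrete is the back-of-the-envelope calculation below. Apart from the 16 KB page size mentioned earlier, every number is an assumption chosen for illustration: a 1 KB average row, and a directory entry made of an 8-byte key plus a 6-byte page pointer. Page headers and other per-page overhead are ignored.

```python
PAGE_SIZE = 16 * 1024          # default page size of 16 KB
ROW_SIZE = 1024                # assumption: average size of one user row
DIRECTORY_ENTRY_SIZE = 8 + 6   # assumption: 8-byte key + 6-byte page pointer

rows_per_leaf = PAGE_SIZE // ROW_SIZE         # user rows per leaf page
fanout = PAGE_SIZE // DIRECTORY_ENTRY_SIZE    # child pages per non-leaf page


def max_rows(levels):
    """Rows a B+ tree with the given number of levels can address."""
    return fanout ** (levels - 1) * rows_per_leaf


for levels in (1, 2, 3):
    print(levels, max_rows(levels))
```

Under these assumptions a three-level tree already addresses about 21.9 million rows, so a primary key lookup touches only three pages.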

This indexing method is what we usually call a clustered index: records and pages are sorted by primary key value, and the leaf nodes contain all of the user data.

Looking for improvement: what if I want to query with other columns?

6 Expansion: secondary index, joint index

Secondary index

For example, if the user needs to query by the value of some column (column a), InnoDB builds another B+ tree. This tree differs from the clustered index tree in that its non-leaf nodes use the value of column a as the directory, and its leaf nodes contain only the value of column a and the primary key.

If the user needs more than that (for example, the value of another column c), the primary key must be used to search the clustered index once more, which is known as going back to the table.
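The following sketch imitates a secondary index lookup and the extra trip back to the table. A dictionary stands in for the clustered index and a sorted list of (a, id) pairs stands in for the secondary index leaves. The table contents and the names `clustered`, `secondary_a`, and `query_by_a` are hypothetical.

```python
from bisect import bisect_left

# Stand-in for the clustered index: primary key -> full row.
clustered = {
    1: {"id": 1, "a": 30, "c": "x"},
    2: {"id": 2, "a": 20, "c": "y"},
    3: {"id": 3, "a": 20, "c": "z"},
    4: {"id": 4, "a": 10, "c": "w"},
}

# Stand-in for the secondary index leaves: only (a, primary key), sorted by a.
secondary_a = sorted((row["a"], pk) for pk, row in clustered.items())


def query_by_a(value, columns):
    """Return the requested columns for rows where a == value, plus the
    number of extra clustered index lookups (trips back to the table)."""
    rows, lookups = [], 0
    covered = set(columns) <= {"a", "id"}  # answerable from the secondary index alone
    i = bisect_left(secondary_a, (value,))
    while i < len(secondary_a) and secondary_a[i][0] == value:
        a, pk = secondary_a[i]
        if covered:
            source = {"a": a, "id": pk}
        else:
            source = clustered[pk]  # back to the table via the primary key
            lookups += 1
        rows.append({col: source[col] for col in columns})
        i += 1
    return rows, lookups
```

Asking only for `id` needs no extra lookup, while asking for `c` costs one clustered index lookup per matching row.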

Joint index

A secondary index is an index on a single non-primary-key column, while a joint index sorts on several columns together. Suppose the user needs an ordered query on two columns, a and b: records are compared by a first, and by b only when their values of a are equal.

As with the secondary index, InnoDB creates another B+ tree. The directory entries are sorted by a first and then by b, and the leaf-node entries contain only three values: a, b, and the primary key.
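A small sketch of that sort order, using a sorted list of (a, b, primary key) tuples in place of the leaf entries. The data and function names are hypothetical. It also illustrates a consequence of the ordering: a search on a, or on a and b together, can use binary search, while a search on b alone cannot, because b is only ordered within each value of a.

```python
from bisect import bisect_left

# Hypothetical rows as (primary key, a, b).
table = [(1, 2, 9), (2, 1, 5), (3, 2, 3), (4, 1, 7), (5, 2, 3)]

# Leaf entries of the joint index: (a, b, primary key), sorted by a, then b.
joint_index = sorted((a, b, pk) for pk, a, b in table)


def find_by_a(a):
    """Binary search on the leading column a."""
    i = bisect_left(joint_index, (a,))
    out = []
    while i < len(joint_index) and joint_index[i][0] == a:
        out.append(joint_index[i])
        i += 1
    return out


def find_by_a_b(a, b):
    """Binary search on both columns together."""
    i = bisect_left(joint_index, (a, b))
    out = []
    while i < len(joint_index) and joint_index[i][:2] == (a, b):
        out.append(joint_index[i])
        i += 1
    return out


def find_by_b(b):
    """b alone is not globally sorted, so every entry must be checked."""
    return [entry for entry in joint_index if entry[1] == b]
```

In this data the b values read 5, 7, 3, 3, 9 in index order, which shows why `find_by_b` falls back to a full scan.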

Part4 Production practice by analogy

7 Meituan timed task index optimization [3]

The system needs to regularly process tasks that are in a specific state, of a specific type, and assigned to a specific operator within a specific period of time.

Select * from task where status=x and operator_id=xxxx and operate_time > xxxxxxxx01 and operate_time
