Shulou (Shulou.com), SLTechnology News & Howtos — 2025-01-17 Update, 05/31 Report
This article explains why paging queries that use LIMIT and OFFSET are slow in MySQL. It goes into some detail and should be a useful reference; if the topic interests you, read on.
Start with a question.
Five years ago, when I was at Tencent, I noticed that MySQL requests were very slow in paging scenarios. With only about 100k rows of data, a select xx from ... paging query took around 2 minutes 3 seconds on a single machine.
I asked my mentor why, and he answered with a question: "With an index on the column, what is the time complexity of fetching the n-th largest value in MySQL?"
The pursuit of the answer
Confirming the scenario
Suppose the status column has a secondary index. A query like select * from table where status = xx limit 10 offset 10000
turns out to be very slow: even with a modest amount of data, it takes several seconds.
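The scenario can be reproduced in a few lines. This is a minimal sketch using an in-memory SQLite database as a stand-in for MySQL (the table name t and the generated data are my own illustration, not from the article):

```python
import sqlite3

# Build a table with an index on status, as in the article's scenario.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, status INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_status ON t (status)")
conn.executemany(
    "INSERT INTO t (id, status, payload) VALUES (?, ?, ?)",
    [(i, i % 2, f"row-{i}") for i in range(1, 100001)],
)

# The slow pattern: the database must produce and then discard the first
# 10000 matching rows before it can return the 10 we actually want.
rows = conn.execute(
    "SELECT * FROM t WHERE status = 1 ORDER BY id LIMIT 10 OFFSET 10000"
).fetchall()
print(len(rows), rows[0][0])
```

SQLite will not show the same absolute timings as InnoDB, but the shape of the query and the discarded-rows behavior are the same.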
A beginner's answer
Back then I felt quite safe: my mentor took care of everything, and I was the least skilled person in the team anyway. So I guessed log(N), reasoning that finding a node in a tree is log(N). Naturally, my mentor told me to go study it myself.
This stage took 10 minutes.
A second attempt
If you think it through carefully, you find the index is awkward to use here. You don't know how the first 10000 matching values are distributed between the left and right subtrees, so you cannot exploit the search property of the tree.
Through some study, I learned that a MySQL index is a B+ tree.
(Diagram: the B+ tree structure, with leaf nodes linked in a list.)
Once I saw that structure, it became clear: the n-th largest value can be found directly by following the linked list of leaf nodes, with O(n) complexity. But O(n) alone should not be this painfully slow, so there had to be another reason.
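The leaf-chain walk can be modeled with a toy structure. This is not InnoDB's real page format, just a sketch of why reaching the n-th entry through a linked list of leaf pages costs O(n) steps:

```python
class Leaf:
    """A toy B+ tree leaf page: a few keys plus a pointer to the next leaf."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None

def nth_via_leaf_chain(first_leaf, n):
    """Return the n-th key (0-based) by walking the leaf chain: O(n) steps."""
    steps = 0
    leaf = first_leaf
    while leaf is not None:
        for key in leaf.keys:
            if steps == n:
                return key, steps
            steps += 1
        leaf = leaf.next
    return None, steps

# Build a chain of leaves holding keys 0..99 in pages of 10.
leaves = [Leaf(list(range(i, i + 10))) for i in range(0, 100, 10)]
for a, b in zip(leaves, leaves[1:]):
    a.next = b

key, steps = nth_via_leaf_chain(leaves[0], 42)
print(key, steps)  # reaching key 42 takes 42 steps along the chain
```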
This stage, mostly spent searching for material online, took about 10 days on and off.
Systematic learning
Two books are recommended here. The first, MySQL Technical Insider: InnoDB Storage Engine, gives a deeper understanding of InnoDB's implementation, such as MVCC, index implementation, and file storage.
The second, High Performance MySQL, starts from the usage level but goes deep, and covers many design ideas.
Working through both books and digesting them repeatedly, you can just about claim a working command of MySQL.
Here are two key concepts:
Clustered index: contains the primary key and the actual row data; its leaf nodes are the data pages themselves.
Secondary index: an auxiliary structure whose leaf nodes store the indexed column values together with the corresponding primary key id, not the full row.
So even though the first 10000 rows will be thrown away, MySQL still looks up each of them on the clustered index via the primary key id found in the secondary index. That is on the order of 10000 random I/Os, which naturally makes the query crawl.
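As a back-of-the-envelope model of the cost above (my own simplification, not InnoDB code): every secondary-index entry walked, whether kept or discarded, triggers one clustered-index lookup.

```python
def lookups_for_page(limit, offset):
    """Toy cost model: a secondary-index scan with LIMIT/OFFSET.

    Each entry walked on the secondary index costs one extra random
    lookup on the clustered index to fetch the full row, even for the
    offset rows that are thrown away afterwards.
    """
    secondary_entries_read = limit + offset
    clustered_lookups = limit + offset
    return secondary_entries_read, clustered_lookups

reads, ios = lookups_for_page(limit=10, offset=10000)
print(reads, ios)  # 10010 entries read, 10010 random clustered-index lookups
```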
You may ask why MySQL behaves this way. It comes down to MySQL's layered architecture: LIMIT and OFFSET are applied by the server layer to the result set returned by the engine layer. In other words, the engine layer is blameless; it has no idea that those 10000 rows are about to be discarded.
(Diagram: MySQL's layered architecture, showing that the server layer and the engine layer are separate.)
At that point I more or less understood the reason for the slowness. This stage took a year.
Grasp one typical example and you grasp the whole category
By this time I had been working for three years and had started reading some source code. After etcd, I read parts of TiDB's source. Whichever database you look at, the execution of a query statement is built out of logical operators.
Introduction to logical operators
Before getting into specific optimization rules, here is a brief introduction to the logical operators that appear in a query plan.
DataSource: the data source, i.e. a table; the t in select * from t.
Selection: filtering, such as the where xx = 5 condition in select xxx from t where xx = 5.
Projection: taking column c in select c from t is a projection operation.
Join: select xx from t1, t2 where t1.c = t2.c joins the two tables t1 and t2.
Selection, Projection, and Join (SPJ for short) are the most basic operators. Join itself has many variants, such as inner join, left outer join, right outer join, and so on.
When select b from t1, t2 where t1.c = t2.c and t1.a > 5 becomes a logical query plan, the DataSource operators for t1 and t2 are responsible for pulling up the data.
Above them sits a Join operator that combines rows from the two tables on t1.c = t2.c; a Selection then filters on t1.a > 5, and finally the b column is projected.
(Diagram: the unoptimized logical plan for this query.)
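The plan just described can be sketched as a handful of functions, one per operator, composed bottom-up (a toy interpreter of my own, not TiDB or MySQL code; the sample data is invented):

```python
# Two tiny in-memory "tables" for the query
#   select b from t1, t2 where t1.c = t2.c and t1.a > 5
t1 = [{"a": 3, "b": "x", "c": 1}, {"a": 7, "b": "y", "c": 2}, {"a": 9, "b": "z", "c": 3}]
t2 = [{"c": 2}, {"c": 3}]

def datasource(table):                 # DataSource: pull rows from a table
    return list(table)

def join(left, right, key):            # Join: nested-loop join on t1.c = t2.c
    return [l for l in left for r in right if l[key] == r[key]]

def selection(rows, pred):             # Selection: the where filter
    return [row for row in rows if pred(row)]

def projection(rows, col):             # Projection: keep a single column
    return [row[col] for row in rows]

# Compose the operators exactly as in the unoptimized plan: DataSource ->
# Join -> Selection -> Projection.
plan = projection(
    selection(
        join(datasource(t1), datasource(t2), "c"),
        lambda row: row["a"] > 5,
    ),
    "b",
)
print(plan)  # ['y', 'z']
```

Notice that the Projection at the top has no way to know how many rows the Join below it will produce, which is the structural reason discussed next.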
So it is not that MySQL refuses to push LIMIT and OFFSET down to the engine layer; rather, because the query is split into logical operators, no individual operator can know in advance how many rows the others will produce.
How to solve the problem?
High Performance MySQL mentions two schemes.
Option one
Based on actual business needs, see whether full paging can be replaced with "next page" / "previous page" navigation; on iOS and Android in particular, classic jump-to-any-page paging is no longer common.
Here, replace LIMIT/OFFSET with the secondary-index condition (that is, the search criterion) plus id > last_id. The id of the last row returned must be sent back to the front end so it can be passed in on the next request.
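A runnable sketch of this "next page" keyset approach, again with SQLite standing in for MySQL (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, 1) for i in range(1, 101)])

def next_page(conn, last_id, page_size=10):
    """Fetch the page after last_id; an index seek replaces the offset scan."""
    return conn.execute(
        "SELECT id FROM t WHERE status = 1 AND id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    ).fetchall()

page1 = next_page(conn, last_id=0)
page2 = next_page(conn, last_id=page1[-1][0])  # front end echoes back the last id
print(page1[0][0], page2[0][0])  # 1 11
```

Because the query seeks directly past last_id, the cost no longer grows with how deep into the result set the user has paged.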
Option two
Sometimes the paging requirement is rigid. A key concept here is the covering index: when the data a query needs is only the primary key id and the secondary index columns themselves, there is no need to look up the clustered index at all.
The idea is this: use a subquery such as select id from table where second_index = xxx limit 10 offset 10000 to find just the primary key ids of the matching rows. Since the primary key is already stored in the secondary index, this step never has to go back to the clustered index on disk. Then join the 10 ids that survive the LIMIT back against the table to fetch the full rows from the clustered index. This costs only ten random I/Os.
In cases where the business really does need paging, this scheme can greatly improve performance, and it usually meets the requirements.
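A runnable sketch of this deferred-join scheme, with SQLite standing in for MySQL (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, status INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_status ON t (status)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(i, 1, f"row-{i}") for i in range(1, 100001)],
)

# Step 1 (inner query) touches only the secondary index, which already
# carries the primary key; step 2 (outer join) fetches full rows from the
# clustered index for just the 10 surviving ids.
rows = conn.execute(
    """
    SELECT t.id, t.payload
    FROM t
    JOIN (SELECT id FROM t WHERE status = 1
          ORDER BY id LIMIT 10 OFFSET 10000) page
      ON t.id = page.id
    ORDER BY t.id
    """
).fetchall()
print(len(rows), rows[0][0])  # 10 10001
```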
That is the whole of "Why are LIMIT/OFFSET paging scenarios slow?" Thank you for reading, and I hope the content is helpful.