How to solve the problem of classification and ranking in MySQL 04/29 Update SLTechnology News&Howtos

How to solve the problem of classification and ranking in MySQL

2025-04-29 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article shows you how to solve the problem of classification and ranking in MySQL. The content is concise and easy to understand, which will definitely brighten your eyes. I hope you can get something through the detailed introduction of this article.

01 requirements introduction

Consider a classic application in MySQL: given a student's examination score table, it is necessary to rank students according to their grades. For simplicity, only the score table is given, regardless of the possible associated student information table, course information table, teacher information table, and so on, and only three key fields are created in the score sheet:

Cid: course id, int, a total of 5 courses

Sid: student id, int, a total of 8872 students

Score: score, int type, a total of 22366 score information, distributed between 10-100

For step-by-step analysis, no primary key is added or any indexes are built in the initial state.

02 subquery

The most direct idea to achieve this requirement is to count each score through a subquery: how many scores are higher in the statistics, then the ranking of that score is a higher score count + 1. If you want to distinguish the ranking of courses, you only need to add a constraint that limits the equality of course id in the statistical table.

1SELECT 2 a.scores, (SELECT COUNT (score) + 1 FROM scores WHERE cid = a.cid AND score > a.score) AS 'rank' 3FROM 4 scores a 5ORDER BY 6 a.cid, a.score DESC

It should be noted that score > a.score and COUNT () + 1 are required in the subquery constraints, indicating that the count + 1 is higher than the score, for example, for 90, 80, 80, 70... The ranking result for such a score is 1, 2, 2, 2, 4. If score > = a.score and COUNT () are selected as the ranking conditions, then the result is 1mem3pence3pr 4.

Without adding any indexes, this query is quite slow and takes 120s.

Subquery execution plan when no index is added

The first idea of optimizing a query is of course to add an index: although no where constraints are used in the outer query, two field judgments, cid and score, are used in the subquery, so consider adding an index:

1CREATE INDEX idc ON scores (cid); 2CREATE INDEX ids ON scores (score)

After adding the index, the query takes 96s, although it has been improved, it is still an embarrassing task. Explain the query plan and find that although the speed is still slow, both indexes are indeed applied:

Subquery execution plan after adding independent index

Since independent indexes cannot significantly improve efficiency, considering that the where condition in a subquery is not a constant constraint of independent fields, but a joint constraint that depends on the value of an outer loop, then consider adding a federated index:

CREATE INDEX idcs ON scores (cid, score)

The query speed is indeed faster, and it actually takes 24 seconds. Explain the query plan and find that both independent indexes and federated indexes are used. But I have to say that the response time of 24s is still not enough for immediate tasks that require 0.5s to resolve the battle.

Subquery execution plan after adding federated index

We have to find another way.

03 self-connection

Generally speaking, for slow subquery tasks, the join query (join) can be significantly improved.

Specific to the specific requirement of sub-course ranking, we consider self-joining the scores table, where the join condition is that the course is equal and the score value of table an is less than that of table b, so that the ranking information can be obtained by counting the number of records that meet the connection condition:

1SELECT 2 a.b.score, COUNT (b.score) + 1 AS 'rank' 3FROM 4 scores a LEFT JOIN scores b ON (a.cid = b.cid AND a.score < b.score) 5GROUP BY 6 a.cid, a.sid 7ORDER BY 8 a.cid, COUNT (b.score)

It should be noted that the connection method should be left join so that all the score information in table an is displayed; if join is used, the highest score is missed because there is no record that satisfies the connection. The relationship between the score value and count () in the join condition is similar to that in the subquery.

With the application of self-join, the query speed is similar to that of subquery without creating any index, which takes 73s; after adding a valid index, the query time is 27s, which is improved, but it is as efficient as the query scheme.

Self-join execution plan when no index is added

Self-join execution plan after adding a valid index

Obviously, the application of self-join instead of subquery does not significantly improve the query efficiency, even on the basis of adding a valid index.

Further analysis of the data table shows that, in fact, the slow speed does not deny the ability of the index to improve query efficiency, but only because there are fewer values for the fields added to the index: the cid field has only 5 values-when the number of field values is small, it is difficult to add an index.

For example, if you change the demand and rank the scores of each course by student (there are a large number of sid values), the query efficiency can be effectively improved by using the index. Rank by student query score SQL statement:

1SELECT 2 a.b.score, count (b.score) + 1 AS 'rank' 3FROM 4 scores a LEFT JOIN scores b ON (a.sid = b.sid AND a.score < b.score) 5GROUP BY 6 a.sid, a.cid 7ORDER BY 8 a.sid, count (b.score)

For the above query, when no index is added, the query time is 34s; after adding a valid index, the time is only 0.184s, and the improvement effect of adding index is very obvious.

Although this argument defends the status and role of the index, what if our need is to rank by curriculum? Obviously, both sub-query and self-join scheme are difficult to meet our real-time query needs.

We have to find another way.

04 Custom variable

In fact, the above two schemes are slow because they both act on queries on two tables, and if you consider the outer order by, the execution time complexity is roughly estimated to be on the order of O (N3). At this point, we consider applying custom variables to achieve a lower complexity query implementation.

By using custom variables, we can not only improve speed, but also achieve "various" rankings: for example, for a group of scores such as 90, 80, 80, 70, 60, there may be three kinds of ranking requirements, one is continuous ranking, and the ranking continues to increase at the same time: 1, 2, 3, 4, 5; the second is the same score with the same name, and the next ranking is not skipped, that is, 1, 2, 2, 3, 4. The third is the same score of the same name, the next ranking jump, that is, 1, 2, 2, 4, 5. These three requirements can be easily sorted with custom variables (the meaning and ideas of specific variables are given later):

Consecutive rankings:

1SELECT 2 sid, cid, @ curRank:=@curRank+1 AS 'rank' 3FROM 4 scores, (SELECT @ curRank:=0) tmp 5ORDER BY 6 score DESC

Same score and same name, do not skip the grade:

1SELECT 2 sid, cid, @ curRank:=IF (score=@preScore, @ curRank, @ curRank+1) AS 'rank', 3 @ preScore:=score 4FROM 5 scores, (SELECT @ curRank:=0, @ preScore:=NULL) tmp 6ORDER BY 7 score

Same score, same name, skipping grade:

1SELECT

2 sid, cid, @ curRank:= IF (score=@preScore, @ curRank, @ totalRank) AS 'rank', 3 @ preScore:= score, 4 @ totalRank:= @ totalRank+1 5FROM 6 scores, (SELECT @ curRank:=1, @ totalRank:=1, @ preScore:=NULL) tmp 7ORDER BY 8 score

The above SQL statement is ranked without any classification: three requirements are achieved by recording the previous ranking, the previous score value, and the current total ranking by custom variables (MySQL defines variables using @ as a leader and: = for assignment).

So, if you want to achieve a classified ranking, for example, to rank different courses? Then you just need to add another custom variable to record the previous course cid:

If the current classification information is the same as the previous course cid, continue with the current ranking process (choose one of the three rankings according to specific needs)

If the current category is different from the previous course cid, the ranking information is initialized and starts again from 1.

Taking the relatively complex "same division with the same name, skipping" as an example, the SQL statement is:

1SELECT sid, cid, 2 @ totalRank:= IF (cid=@preCid, @ totalRank+1, 1), 3 @ curRank:= IF (cid=@preCid, IF (score=@preScore, @ curRank, @ totalRank), 1) AS 'rank', 4 @ preScore:= score, 5 @ preCid:= cid 6FROM 7 scores, (SELECT @ curRank:=0, @ totalRank:=0, @ preScore:=NULL, @ preCid:=NULL) tmp 8ORDER BY 9 cid, score DESC 8 score

The meaning of each variable is explained as follows:

@ totalRank is used to record the total ranking in the current category, initialized to 0

@ curRank is used to record the current ranking in the current category, initialized to 0

@ preScore is used to record the last score, initialized to NULL

@ preCid is used to record the previous course cid, initialized to NULL

The execution process and conditions are judged as follows:

If the current cid is the same as the previous cid, it means that it is the same category, and the ranking is increased on the basis of the previous ranking. Specifically:

Total ranking + 1 each time

If the current score is the same as the previous score, the current ranking will remain the same; otherwise, skip to the overall ranking

If the current cid is different from the previous cid, it means to start a new course ranking. Both the total ranking and the current ranking are initialized to 1.

Based on the above SQL statement, it takes only 0.09s to execute the same task, which is 266 times of the fastest speed of 24s of subquery and 300times of the fastest speed of 27s of self-join.

In addition, since there are no where constraints in the above SQL statement, it has nothing to do with indexing.

05 MySQL8.0 window function

An important update to the MySQL8.0 version is the addition of window functions that make the previous classification and ranking requirements extremely simple.

Similar to the above, different ranking requirements have different window functions, and the naming of the three functions is very intuitive:

Continuous ranking: row_number (), ranking is the line number

Same score, same name, no skipping: dense_rank (), dense ranking, similar to 1, 2, 2, 3... This kind, because it does not skip grades, it is relatively "dense".

Same score, same name, skipping: rank (), general ranking, similar to 1, 2, 2, 4... This

Among them, each window function must be used with the over () function, and the parameters in the over () function mainly include partion by and order by:

Order by: consistent with order by in a regular SQL statement, indicating that it is sorted by a certain field and distinguishes between ASC and DESC

Partion by: used as the basis for classification. By default, all records are sorted without classification. If a field is specified, it is sorted independently between the fields and starts again across fields.

Taking the previous sub-course ranking requirements as an example, the SQL statement is as follows:

1SELECT 2 *, RANK () OVER (PARTITION BY cid ORDER BY score DESC) AS 'rank' 3FROM 4 scores

The query takes 0.066s, which is slightly faster than the ranking achieved by custom variables. At the same time, the ranking method has nothing to do with the index.

Replace RANK () with the other two window functions to achieve other corresponding requirements.

The above content is how to solve the problem of classification and ranking in MySQL. Have you learned any knowledge or skills? If you want to learn more skills or enrich your knowledge reserve, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.