In-depth analysis: a peek into the MySQL optimizer from the source code

2025-03-26 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Author | Tang Aizhong, developer of Yunhe Enmo SQM and an important contributor to its SQL parsing, SQL auditing, and intelligent optimization engines for Oracle/MySQL/DB2; the products are widely used by customers in finance, telecommunications, and other industries.

Abstract

The optimizer is the translator from logical SQL to physical storage access. It is a complex yet "naive" mathematical model: its inputs are typically the SQL statement, statistics, and optimizer parameters, and its output is an executable query plan, so the quality of the optimizer depends on the stability and robustness of this mathematical model. Understanding the model is the key to understanding how the database processes SQL.

01 Execution flow of the optimizer

(note: this picture is from Li Haixiang)

The figure above shows the general execution process of the optimizer, which can be simply described as follows:

1. Construct the initial table-access array (init_plan_arrays) from the syntax tree and statistics.

2. For each table in the access array, compute the best access path (find_best_ref) and save the current optimal execution plan (the one with minimum COST).

3. If a better execution plan is found, update the optimal execution plan; otherwise, end the optimization.

As this process shows, execution-plan generation is a "dynamic programming / greedy" search. The dynamic-programming recurrence can be written as: Min(Cost(T1, ..., Tn)) = Min(Cost(T1)) + Min(Cost(T2)) + ... + Min(Cost(Tn-1)) + Min(Cost(Tn)), where Cost(Tk) represents the incremental cost of accessing table Tk after tables T1 through Tk-1. If the optimizer had no prior knowledge, it would need n! iterations, a full permutation of the join order. In practice the optimizer does have prior knowledge, such as table sizes, outer joins, and subqueries, which partially orders the table accesses; this can be understood as a "reduced" dynamic programming. The core function of this search is Optimize_table_order::best_extension_by_limited_search. The code structure in the source is as follows:

bool Optimize_table_order::best_extension_by_limited_search(
    table_map remaining_tables, uint idx, uint current_search_depth) {
  for (JOIN_TAB **pos = join->best_ref + idx; *pos; pos++) {
    ...
    best_access_path(s, remaining_tables, idx, false,
                     idx ? (position - 1)->prefix_rowcount : 1.0, position);
    ...
    if (best_extension_by_limited_search(remaining_tables_after, idx + 1,
                                         current_search_depth - 1))
      ...
    backout_nj_state(remaining_tables, s);
    ...
  }
}

The above code is a recursive search in a for loop, which is a typical full-permutation algorithm.
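The recursion can be sketched in a few lines of Python. This is a toy model, not MySQL's actual code: the two tables and the ACCESS_COST map are hypothetical stand-ins for what best_access_path computes from statistics.

```python
# Hypothetical incremental cost of accessing a table, given the set of
# tables already joined (in MySQL this comes from best_access_path()).
ACCESS_COST = {
    ("t1", frozenset()): 10.0,
    ("t2", frozenset()): 50.0,
    ("t1", frozenset({"t2"})): 5.0,
    ("t2", frozenset({"t1"})): 2.0,
}

def best_extension(joined, remaining, cost_so_far, best):
    """Depth-first search over join orders: extend the current prefix by one
    table (the for loop in the source), then recurse - a full permutation."""
    if not remaining:
        if best[0] is None or cost_so_far < best[0]:
            best[0], best[1] = cost_so_far, list(joined)
        return
    for t in sorted(remaining):
        step = ACCESS_COST[(t, frozenset(joined))]
        best_extension(joined + [t], remaining - {t}, cost_so_far + step, best)

best = [None, None]
best_extension([], frozenset({"t1", "t2"}), 0.0, best)
print(best)  # [12.0, ['t1', 't2']]: scanning t1 first, then probing t2, wins
```

With these costs the order (t1, t2) totals 10 + 2 = 12 while (t2, t1) totals 50 + 5 = 55, so the search keeps the first plan, just as the optimizer keeps the minimum-COST plan.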

02 Optimizer parameters

Compared with Oracle, MySQL's optimizer is still naive. Oracle maintains a rich variety of statistics, such as system statistics, table statistics, and index statistics, while MySQL relies more heavily on constants. MySQL 5.7 provides the tables mysql.server_cost and mysql.engine_cost, which users can configure to tune the optimizer's cost model. Here are a few common and very important constants:

1. #define ROW_EVALUATE_COST 0.2f

The cost of evaluating one row against a condition; the more rows, the greater the cost.

2. #define IO_BLOCK_READ_COST 1.0f

The cost of reading a page from disk.

3. #define MEMORY_BLOCK_READ_COST 1.0f

The cost of reading a page from memory; for InnoDB this is the cost of reading a page from the buffer pool, so by default reading a memory page and reading a disk page cost the same.

4. #define COND_FILTER_EQUALITY 0.1f

The default selectivity of an equality condition is 0.1. For example, with name = 'lily' and a table of 100 rows, 10 rows are estimated to be returned.

5. #define COND_FILTER_INEQUALITY 0.3333f

The default selectivity of an inequality condition is 0.3333, for example col1 > col2.

6. #define COND_FILTER_BETWEEN 0.1111f

The default selectivity of a BETWEEN condition is 0.1111, for example col1 between a and b.

.

There are many such constants, covering filter conditions, join caching, temporary tables, and other costs. Once you know these constants, the Cost figures in an execution plan suddenly make sense!
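The row estimates these filter constants imply can be reproduced with a short Python sketch. The constants are the defaults quoted above; the 100-row table is an arbitrary example.

```python
# Default cost-model constants from the MySQL source quoted above.
ROW_EVALUATE_COST = 0.2
COND_FILTER_EQUALITY = 0.1
COND_FILTER_INEQUALITY = 0.3333
COND_FILTER_BETWEEN = 0.1111

table_rows = 100  # example table size

# Estimated rows surviving each kind of predicate
eq_rows = table_rows * COND_FILTER_EQUALITY       # name = 'lily'        -> ~10 rows
ineq_rows = table_rows * COND_FILTER_INEQUALITY   # col1 > col2          -> ~33 rows
between_rows = table_rows * COND_FILTER_BETWEEN   # col1 between a and b -> ~11 rows

# Evaluating a predicate against every row also has a cost of its own
scan_eval_cost = table_rows * ROW_EVALUATE_COST   # 100 * 0.2 = 20
```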

03 Optimizer options

In MySQL, executing select @@optimizer_switch returns the following options:

index_merge=on, index_merge_union=off, index_merge_sort_union=off, index_merge_intersection=on, engine_condition_pushdown=on, index_condition_pushdown=on, mrr=on, mrr_cost_based=on, block_nested_loop=on, batched_key_access=off, materialization=on, semijoin=on, loosescan=on, firstmatch=on, subquery_materialization_cost_based=on, use_index_extensions=on, condition_fanout_filter=on

04 How is the optimizer trace generated?

In the functions in the flowchart, there is a lot of code like this:

Opt_trace_object trace_ls(trace, "searching_loose_scan_index");

Therefore, as the optimizer runs, its execution path is recorded in Opt_trace_object objects and then saved to information_schema.optimizer_trace, where users can query and trace it.

05 Typical usage scenarios of the optimizer

5.1 full table scan

select * from sakila.actor;

Table actor statistics are as follows:

Db_name | Table_name | Last_update         | N_rows | Cluster_index_size | Other_index_size
sakila  | actor      | 2018-11-20 16:20:12 | 200    | 1                  | 0

The primary key actor_id statistics are as follows:

Index_name | Last_update         | Stat_name    | Stat_value | Sample_size | Stat_description
PRIMARY    | 2018-11-14 14:25:49 | n_diff_pfx01 | 200        | 1           | actor_id
PRIMARY    | 2018-11-14 14:25:49 | n_leaf_pages | 1          | NULL        | Number of leaf pages in the index
PRIMARY    | 2018-11-14 14:25:49 | size         | 1          | NULL        | Number of pages in the index

The execution plan:

{
  "query_block": {
    "select_id": 1,
    "cost_info": { "query_cost": "41.00" },
    "table": {
      "table_name": "actor",
      "access_type": "ALL",
      "rows_examined_per_scan": 200,
      "rows_produced_per_join": 200,
      "filtered": "100.00",
      "cost_info": {
        "read_cost": "1.00",
        "eval_cost": "40.00",
        "prefix_cost": "41.00",
        "data_read_per_join": "56K"
      },
      "used_columns": ["actor_id", "first_name", "last_name", "last_update", "id"]
    }
  }
}

IO_COST = CLUSTER_INDEX_SIZE * PAGE_READ_TIME = 1 * 1 = 1
EVAL_COST = TABLE_ROWS * ROW_EVALUATE_COST = 200 * 0.2 = 40
PREFIX_COST = IO_COST + EVAL_COST = 41

Note that the above process ignores the difference in access costs between memory pages and disk pages.
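Under that assumption (memory and disk page reads both cost 1.0), the 41.00 query_cost can be reproduced directly. This is a sketch of the arithmetic only, not MySQL's actual code path; the variable names follow the #defines and statistics quoted above.

```python
IO_BLOCK_READ_COST = 1.0   # cost of reading one page
ROW_EVALUATE_COST = 0.2    # cost of evaluating one row

cluster_index_size = 1     # pages in actor's clustered index (statistics above)
table_rows = 200           # n_rows for sakila.actor

io_cost = cluster_index_size * IO_BLOCK_READ_COST   # read_cost   "1.00"
eval_cost = table_rows * ROW_EVALUATE_COST          # eval_cost   "40.00"
prefix_cost = io_cost + eval_cost                   # prefix_cost "41.00"
```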

5.2 Table join using a full table scan

SELECT * FROM sakila.actor a, sakila.film_actor b WHERE a.actor_id = b.actor_id;

Table film_actor statistics are as follows:

Db_name | Table_name | Last_update         | N_rows | Cluster_index_size | Other_index_size
sakila  | film_actor | 2018-11-20 16:55:31 | 5462   | 12                 | 5

The primary key (actor_id, film_id) statistics in table film_actor are as follows:

Index_name | Last_update         | Stat_name    | Stat_value | Sample_size | Stat_description
PRIMARY    | 2018-11-14 14:25:49 | n_diff_pfx01 | 200        | 1           | actor_id
PRIMARY    | 2018-11-14 14:25:49 | n_diff_pfx02 | 5462       | 1           | actor_id,film_id
PRIMARY    | 2018-11-14 14:25:49 | n_leaf_pages | 11         | NULL        | Number of leaf pages in the index
PRIMARY    | 2018-11-14 14:25:49 | size         | 12         | NULL        | Number of pages in the index

{
  "query_block": {
    "select_id": 1,
    "cost_info": { "query_cost": "1338.07" },
    "nested_loop": [
      {
        "table": {
          "table_name": "a",
          "access_type": "ALL",
          "possible_keys": ["PRIMARY"],
          "rows_examined_per_scan": 200,
          "rows_produced_per_join": 200,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "1.00",
            "eval_cost": "40.00",
            "prefix_cost": "41.00",
            "data_read_per_join": "54K"
          },
          "used_columns": ["actor_id", "first_name", "last_name", "last_update"]
        }
      },
      {
        "table": {
          "table_name": "b",
          "access_type": "ref",
          "possible_keys": ["PRIMARY"],
          "key": "PRIMARY",
          "used_key_parts": ["actor_id"],
          "key_length": "2",
          "ref": ["sakila.a.actor_id"],
          "rows_examined_per_scan": 27,
          "rows_produced_per_join": 5461,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "204.67",
            "eval_cost": "1092.40",
            "prefix_cost": "1338.07",
            "data_read_per_join": "85K"
          },
          "used_columns": ["actor_id", "film_id", "last_update"]
        }
      }
    ]
  }
}

The full-table-scan cost of the first table, actor, is 41; see example 5.1.

The nested-loop cost of the second table is:

read_cost (204.67) = prefix_rowcount * (1 + keys_per_value / table_rows * cluster_index_size) =

200 * (1 + 27 / 13863 * 12) = 204.67

Note: 27 is the estimated number of film_actor index records per actor_id value: on average 5462 / 200 ≈ 27 records per actor_id.

How is table_rows (13863) calculated?

The actual number of records in the film_actor table is 5462, spread over 12 pages, 11 of them leaf pages, for a total size of 11 * 16K (the default page size) = 180224 bytes. The minimum record length is 26 bytes (obtained by summing the field lengths), so table_rows = 180224 / 26 * 2 ≈ 13863, where 2 is a safety factor that yields a worst-case cost estimate.

The number of rows returned by the join = 200 * 5462 / 200 = 5462, so the estimated row-evaluation cost is 5462 * 0.2 = 1092.4.
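Putting the pieces together, the figures in the plan above can be reproduced numerically. This is a sketch of the arithmetic only; the inputs are the statistics and constants already quoted, and the rounding matches the plan to two decimals.

```python
ROW_EVALUATE_COST = 0.2

# Outer table (actor): full scan, as computed in example 5.1
outer_prefix_cost = 41.0
prefix_rowcount = 200          # rows flowing out of the outer table

# Inner table (film_actor), accessed by ref on the primary-key prefix
keys_per_value = 27            # ~5462 / 200 film_actor rows per actor_id
worst_case_table_rows = 13863  # 11 leaf pages * 16K / 26 bytes * 2 (safety factor)
cluster_index_size = 12        # pages in film_actor's clustered index

read_cost = prefix_rowcount * (
    1 + keys_per_value / worst_case_table_rows * cluster_index_size)  # "204.67"
eval_cost = 5462 * ROW_EVALUATE_COST                                  # "1092.40"
prefix_cost = outer_prefix_cost + read_cost + eval_cost               # "1338.07"
```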

5.3 IN query

The index idx_id (film_id) statistics in table film_actor are as follows:

Index_name | Last_update         | Stat_name    | Stat_value | Sample_size | Stat_description
idx_id     | 2018-11-14 14:25:49 | n_diff_pfx01 | 997        | 4           | film_id
idx_id     | 2018-11-14 14:25:49 | n_diff_pfx02 | 5462       | 4           | film_id,actor_id
idx_id     | 2018-11-14 14:25:49 | n_leaf_pages | 4          | NULL        | Number of leaf pages in the index
idx_id     | 2018-11-14 14:25:49 | size         | 5          | NULL        | Number of pages in the index

EXPLAIN SELECT * FROM ACTOR WHERE actor_id IN (SELECT film_id FROM film_actor);

{
  "query_block": {
    "select_id": 1,
    "cost_info": { "query_cost": "460.79" },
    "nested_loop": [
      {
        "table": {
          "table_name": "ACTOR",
          "access_type": "ALL",
          "possible_keys": ["PRIMARY"],
          "rows_examined_per_scan": 200,
          "rows_produced_per_join": 200,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "1.00",
            "eval_cost": "40.00",
            "prefix_cost": "41.00",
            "data_read_per_join": "56K"
          },
          "used_columns": ["actor_id", "first_name", "last_name", "last_update", "id"]
        }
      },
      {
        "table": {
          "table_name": "film_actor",
          "access_type": "ref",
          "possible_keys": ["idx_id"],
          "key": "idx_id",
          "used_key_parts": ["film_id"],
          "key_length": "2",
          "ref": ["sakila.ACTOR.actor_id"],
          "rows_examined_per_scan": 5,
          "rows_produced_per_join": 200,
          "filtered": "100.00",
          "using_index": true,
          "first_match": "ACTOR",
          "cost_info": {
            "read_cost": "200.66",
            "eval_cost": "40.00",
            "prefix_cost": "460.79",
            "data_read_per_join": "3K"
          },
          "used_columns": ["film_id"],
          "attached_condition": "(`sakila`.`actor`.`actor_id` = `sakila`.`film_actor`.`film_id`)"
        }
      }
    ]
  }
}

id | select_type | table      | partitions | type | possible_keys | key    | key_len | ref                   | rows | filtered | Extra
1  | SIMPLE      | ACTOR      | NULL       | ALL  | PRIMARY       | NULL   | NULL    | NULL                  | 200  | 100.00   | NULL
1  | SIMPLE      | film_actor | NULL       | ref  | idx_id        | idx_id | 2       | sakila.ACTOR.actor_id | 5    | 100.00   | Using where; Using index; FirstMatch(ACTOR)

As can be seen from the execution plan, MySQL adopts the FirstMatch approach. MySQL's semi-join optimization strategies are Materialization, LooseScan, FirstMatch, and DuplicateWeedout. All four are enabled by default, and the one with the minimum COST is chosen. Now let's take FirstMatch as an example to explain the execution flow of the optimizer.

SQL is as follows:

select * from Country where Country.code IN (select City.Country from City where City.Population > 1*1000*1000) and Country.continent = 'Europe'

FirstMatch reduces the scan-and-match work by checking whether a match for the current outer record already exists in the result set, stopping at the first matching inner record.
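FirstMatch semantics can be sketched in miniature in Python: for each outer row, stop probing the inner table at the first match instead of collecting all matches and de-duplicating afterwards. The Country/City rows below are illustrative stand-ins, not real data.

```python
# Illustrative stand-ins for the Country and City tables in the SQL above.
country = [("DEU", "Europe"), ("FRA", "Europe"), ("JPN", "Asia")]
city = [("Berlin", "DEU", 3_400_000), ("Paris", "FRA", 2_100_000),
        ("Lyon", "FRA", 500_000), ("Tokyo", "JPN", 13_000_000)]

result = []
for code, continent in country:
    if continent != "Europe":
        continue
    for _name, c_code, pop in city:
        if c_code == code and pop > 1 * 1000 * 1000:
            result.append(code)   # emit the outer row exactly once...
            break                 # ...and skip the remaining inner rows (FirstMatch)
print(result)  # ['DEU', 'FRA']
```

Without the break, a country with several large cities would be emitted once per matching city and would need a later de-duplication pass, which is exactly the work FirstMatch avoids.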

For the access cost of table actor, please refer to example 1.

How is the access cost of 200.66 for the table film_actor calculated?

Because only the index column film_id is referenced, MySQL accesses table film_actor via a covering index scan (INDEX_ONLY_SCAN). How is the cost of such an index access calculated?

Reference function: double handler::index_only_read_time(uint keynr, double records)

The index block size is 16K, and MySQL assumes all blocks are half full, so the number of index records a block can hold is:

16K / 2 / (index length + primary key length) = 16K / 2 / (2 + 4) + 1 = 1366

The primary key is (actor_id, film_id); both fields are smallint, so the primary key occupies 4 bytes and the idx_id (film_id) key 2 bytes. Each lookup fetches on average 5462 / 997 ≈ 5.47 records, so the cost of each index access is (5.47 + 1366 - 1) / 1366 = 1.0032. The film_actor table is probed 200 times in total, giving a total access cost of 200 * 1.0032 = 200.66.

Total cost 460.79 = table actor access cost + table film_actor access cost + row evaluation cost = 41 + 200.66 + 200 * 1 * 5.47 * 1 * 0.2 ≈ 460.79

The two 1s are filter factors; since neither table has a filter condition, the filter factor is 1.
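The figures 1366, 200.66, and 460.79 can be reproduced with a short Python sketch of the arithmetic behind handler::index_only_read_time, assuming a 16K index block size as stated above.

```python
block_size = 16 * 1024
key_length = 2   # idx_id(film_id): one smallint
ref_length = 4   # primary key (actor_id, film_id): two smallints

# MySQL assumes index blocks are half full
keys_per_block = block_size // 2 // (key_length + ref_length) + 1   # 1366

records_per_lookup = 5462 / 997   # ~5.47 film_actor rows per film_id value
cost_per_lookup = (records_per_lookup + keys_per_block - 1) / keys_per_block

read_cost = 200 * cost_per_lookup                                   # "200.66"
# total = actor scan + film_actor probes + row evaluation (filter factors = 1)
total = 41.0 + read_cost + 200 * 1 * records_per_lookup * 1 * 0.2   # "460.79"
```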
