Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

Discussion on MYSQL MRR NLJ BNL BKA from the principle of Sequential Random Imax O

2025-02-25 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)06/01 Report--

Discussion on MYSQL MRR NLJ BNL BKA from the principle of Sequential Random Imax O

This article only discusses the innodb storage engine, and some of the views are from the author's point of view, please point out if there is any mistake.

I. the principle of mechanical disk

The mechanical disk is composed of a moving arm, a disk, a read-write head and a spindle. The head is fixed and cannot be moved, and the corresponding sector can only be read through the disk.

Spin.

Each disk is double-sided, each side is distributed with concentric circles of the track, the track is divided into sectors generally 512 BYTES, modern disk

Generally speaking, the outer edge track has more sectors than the inner track, so the speed of reading and writing the outer edge track is faster because the rotational speed is fixed.

At the same time, the tracks with the same radius on different disks form a cylinder.

The following figure shows a typical disk organization (extracted data structure (C language version))

If we count ts (seek time) as the seek time, and tl (latency time) as the time to wait for the disk to rotate to a specific sector after the seek is completed.

Tw (transmission time) is the transfer time, so the time to read a sector is:

T (Imax 0) = ts+tl+tw

Obviously, when reading data is certain, the time of ts and tl becomes the decisive factor, but in fact, ts seek time takes longer than others.

The seek time is in the order of 10 milliseconds, and the tl time of 7200 rpm is 1000Universe 7200. it is about the order of 100 microseconds, and the transmission time is even shorter.

A large number of random Imax O will cause frequent track changes, resulting in too long a time, and it is likely that after reading a few sectors, you will soon jump to another track.

On the other hand, the sequential I _ map O can read more sectors at a time, thus minimizing the reading time.

2. Simulation of random and sequential Icantho.

The simulation is completed by calling LINUX API in C language, mainly in the following ways:

Reading a large file program is limited to 900m, while the program sequence and random reading of 20000 4096-size data, and CPY to other files

The file of cpy is 81920000 bytes

In order to reduce the impact of write operations, and magnify the impact of read operations, use the

O_CREAT | O_WRONLY | O_EXCL opens the write file. Enable OS BUFFER,write operation to write to OS kernel buffer, but cannot be opened at the same time

O_SYNC, start O_SYNC every time wirte calls fsync (), the impact of writing will be magnified.

O_RDONLY | O_DIRECT opens the read file and opens it with O_DIRECT. The purpose of opening it with O_DIRECT is to disable OS CACHE and, of course, disable pre-reading of OS to read files directly.

A picture taken from this aspect is easy to understand. In fact, I read this file after O_DIRECT, but the kernel caches it.

Of course, this program is a little complementary, I should use the sorting algorithm to sort the data in the random array and then read it, instead of taking a continuous array.

This is more telling, but it doesn't matter because random reading is ridiculously slow. The following is the result of my program.

. / a.out p10404530_112030_Linux-x86-64_1of7.zip

Fisrt sca array: 134709

Fisrt sca array: 198155

Fisrt sca array: 25305

Fisrt sca array: 46515

Fisrt sca array: 91550

Fisrt sca array: 137262

Fisrt sca array: 46134

Fisrt sca array: 10208

Fisrt sca array: 142115

.

Sequential cpy begin Time: Fri Dec 201: 36:55 2016

Begin cpy use sequential read buffer is 4k:

Per 25%, Time:Fri Dec 201: 36:56 2016

Per 50%, Time:Fri Dec 201: 36:57 2016

Per 75%, Time:Fri Dec 201: 36:57 2016

Per 100%, Time:Fri Dec 201: 36:58 2016

Scattered cpy begin Time: Fri Dec 201: 36:58 2016

Begin cpy use scattered read read buffer is 4k:

Per 25%, Time:Fri Dec 201: 37:51 2016

Per 50%, Time:Fri Dec 201: 38:40 2016

Per 75%, Time:Fri Dec 201: 39:29 2016

Per 100%, Time:Fri Dec 201: 40:20 2016

First output the random values in the partial array, and you can see that the reading position is random. To simulate random reading.

Then the output sequence reads and writes, and then random reads and writes are performed. You can see that there is a great difference in the use of

Iostat vmstat can see that the read speed is very slow. A comparison is given below:

-- order

Device: rrqm/s wrqm/s rUnip s wdeband s rkB/s wkB/s avgrq-sz avgqu-sz await svctm% util

Sda 0.00 0.00 4979.38 2.06 19967.01 32.99 8.03 0.76 0.15 0.14 70.21

Device: rrqm/s wrqm/s rUnip s wdeband s rkB/s wkB/s avgrq-sz avgqu-sz await svctm% util

Sda 0.00 0.00 7204.12 0.00 28816.49 0.00 8.00 0.98 0.14 0.14 98.04

Device: rrqm/s wrqm/s rUnip s wdeband s rkB/s wkB/s avgrq-sz avgqu-sz await svctm% util

Sda 0.00 9.09 7114.14 9.09 28456.57 96.97 8.02 1.04 0.15 0.13 95.86

-Random

Device: rrqm/s wrqm/s rUnip s wdeband s rkB/s wkB/s avgrq-sz avgqu-sz await svctm% util

Sda 0.00 0.00 107.14 0.00 428.57 0.00 8.00 1.01 9.49 9.42 100.92

Device: rrqm/s wrqm/s rUnip s wdeband s rkB/s wkB/s avgrq-sz avgqu-sz await svctm% util

Sda 0.00 0.00 104.17 1.04 416.67 0.52 7.93 1.04 9.79 9.81 103.23

Device: rrqm/s wrqm/s rUnip s wdeband s rkB/s wkB/s avgrq-sz avgqu-sz await svctm% util

Sda 0.00 0.00 104.12 2.06 465.98 32.99 9.40 1.17 11.02 9.68 102.78

The problem is clearly seen here, and the program is given at the end.

III. MRR

With the above foundation, we understand the importance of sequential reads, so let's look at Multi-Range Read (MRR), because this feature is also BKA.

Important pillar

If you know ORACLE, you will not forget the index cluster factor. Here is the description of the cluster factor. On the index analysis data, clustering_factor is a

A very important parameter that represents the relationship between the index and the table, because the indexes are arranged in a certain order, but for tables, they are arranged in a

Stored in the form of heap, each row may be distributed on any block on the segment, so if the data row is found through the index, there can be an index block.

Corresponding to many or even all the blocks of the table, the parameter clustering_factor is introduced to represent the corresponding relationship between the data storage and the index on the table. Like this

CBO can use this parameter to determine how much cost is generated by using this index.

Specific reference

(http://blog.itpub.net/7728585/viewspace-612691/)

1. Description

The secondary index of MYSQL also has the same problem as ORACLE clustering_factor. Returning to the table too randomly will cause the immediate reading to be too serious.

Range scan MYSQL in range access stores the scanned data in read_rnd_buffer_size and sorts them by primary key (rowid)

And then use the sorted data to return to the table in order because we know that the leaf node data in INNODB is arranged according to PRIMARY KEY (ROWID), so

In this way, random reads are converted to sequential reads.

In BKA, when the joined table is used in the ref,eq_ref index scan mode, the key value scanned in the first table is put into join_buffer_size.

Then the MRR interface is called for sorting and sequential access, and the data is obtained through the join condition, so that the connection conditions become sequential comparisons.

Source code interface handler::multi_range_read_init handler::multi_range_read_next

As shown in the picture (the picture is extracted from the official mariadb document):

2. Scope of application:

-range access: read data through one or more range restrictions.

such as

Mysql > set optimizer_switch='mrr_cost_based=off'

Query OK, 0 rows affected (0.00 sec)

Mysql > explain select * from testzh force index (id2) where id2=100 and id3 set optimizer_switch='mrr_cost_based=off,batched_key_access=on'

Query OK, 0 rows affected (0.00 sec)

Mysql > explain select * from testzh,testzh20 where testzh.id2=testzh20.id2

+- -+

| | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+- -+

| | 1 | SIMPLE | testzh | NULL | ALL | id2,id2_2 | NULL | NULL | NULL | 100031 | 100.00 | Using where |

| | 1 | SIMPLE | testzh20 | NULL | ref | id2 | id2 | 5 | test.testzh.id2 | 1 | 100.00 | Using join buffer (Batched Key Access) |

+- -+

2 rows in set, 1 warning (0.00 sec)

As shown in the picture (the picture is extracted from the official mariadb document):

3. Advantages:

Sequential reading no longer requires the constant movement of the disk arm to seek and wait for the disk to rotate.

Sequential reading has the advantage of pre-reading.

Each block value needs to be read once rather than multiple times, which is taken for granted, because multiple rows of data are generally stored in a block, and not using MRR may cause the same block to be read multiple times.

4. Disadvantages:

Sorting takes extra time. If you use order limit n, it will result in slower

4. NLJ, BNL, BKA

1. A simple nested-loop join (NLJ)

Use the description of the pseudo code in the MYSQL document:

For each row in t1 matching range {

For each row in t2 matching reference key {

For each row in t3 {

If row satisfies join conditions

Send to client

}

}

}

This approach uses a layer-by-layer call loop, which is obviously suitable for any scenario, but it is extremely inefficient when the joined table does not have an index.

Mysql > set optimizer_switch = 'block_nested_loop=off'

Query OK, 0 rows affected (0.00 sec)

-ref index scan

Mysql > explain select * from testzh force index (id2), testzh20 where testzh.id2=testzh20.id2

+- -- +

| | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+- -- +

| | 1 | SIMPLE | testzh20 | NULL | ALL | id2 | NULL | NULL | NULL | 99900 | 100.00 | Using where |

| | 1 | SIMPLE | testzh | NULL | ref | id2 | id2 | 5 | test.testzh20.id2 | 1 | 100.00 | NULL |

+- -- +

2 rows in set, 1 warning (0.00 sec)

-ALL full table scan

Mysql > explain select * from testzh force index (id2), testzh20 where testzh.name=testzh20.name

+-- +

| | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+-- +

| | 1 | SIMPLE | testzh20 | NULL | ALL | NULL | NULL | NULL | NULL | 99900 | 100.00 | NULL |

| | 1 | SIMPLE | testzh | NULL | ALL | NULL | NULL | NULL | NULL | 100031 | 10.00 | Using where |

+-- +

2 rows in set, 1 warning (0.00 sec)

Summary: generally used for linked table indexed way ref or eq_ref, and the cost of BKA is high, the connected table without index generally uses BNL.

2. A Block Nested-Loop (BNL)

Obviously, this method is generally used to optimize that the joined table has no index, and the join table mode is ALL or index, because this kind of join uses NLJ too slowly, so it needs to optimize its inner table.

The number of times the driver is driven, then add a cache called join_buffer_size

Let's consider the way two tables are joined, as follows:

Mysql > explain select * from testzh, testzh20 where testzh.name=testzh20.name

+- -- +

| | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+- -- +

| | 1 | SIMPLE | testzh20 | NULL | ALL | NULL | NULL | NULL | NULL | 99900 | 100.00 | NULL |

| | 1 | SIMPLE | testzh | NULL | ALL | NULL | NULL | NULL | NULL | 100031 | Using where; Using join buffer (Block Nested Loop) |

+- -- +

2 rows in set, 1 warning (0.00 sec)

Scan the relevant fields of the testzh20 table, at least testzh20.name contains them, and store them in join_buffer_size

When the join_buffer_size is full, yes

Testzh performs a scan, and the scanned value and the data in join_buffer_size are conditionally testzh.name=testzh20.name

Make a comparison, if it matches, put it back, if it does not match, discard it. After the comparison is completed, clear the buffer and continue to scan the rest of the testzh20 data.

Then read and compare the data in testzh and join_buffer_size according to the conditional testzh.name=testzh20.name again and again until

Finally, we consider that this optimization does reduce the number of testzh table reads compared to BNL, and is greatly reduced, so why not use the

What about the comparison after sorting by MRR? Because testzh doesn't use indexes at all, the data in your join_buffer_size is sorted, but the testzh table is closed.

The data on testzh20.name is still unordered and meaningless. The principle of using MRR is similar to merge sorting, and both data sets must be.

It's sorted, so you don't need MRR here.

Summary: generally used to join tables in the form of ALL or index (index override scan)

3. Batched Key Access Joins (BKA)

This approach has been talked about a lot in MRR, and its purpose is to optimize the way ref or eq_ref uses NLJ connections.

Mysql > explain select * from testzh, testzh20 where testzh.id2=testzh20.id2

+- -+

| | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |

+- -+

| | 1 | SIMPLE | testzh | NULL | ALL | id2 | NULL | NULL | NULL | 100031 | 100.00 | Using where |

| | 1 | SIMPLE | testzh20 | NULL | ref | id2 | id2 | 5 | test.testzh.id2 | 1 | 100.00 | Using join buffer (Batched Key Access) |

+- -+

2 rows in set, 1 warning (0.00 sec)

The principle of this method is to reduce the possibility of random reading through the comparison of sequential connection conditions. Can refer to

Https://mariadb.com/kb/en/mariadb/block-based-join-algorithms/

Batch Key Access Join

5. A brief description of the connection mode in ORACLE

1. NEST LOOP JOIN: when the result set of the driven table is few, and the driven table has an index, the principle of adding VECTOR Imaco after 11G is similar to MYSQL BKA, which reads multiple single blocks.

Merge to get better NEST LOOP performance.

2. SORT MERGE JOIN: this connection type uses merging, since merging is generally used for sorting between join conditions, and when there is an index.

3. Hash join: it is only used for CBO. Since HASH algorithm can only be used for equivalent join, it is suitable for joining between small and large tables, and the nest loop effect is not good.

The HASH algorithm is used for the small table according to the join field, and then the distribution uniformity of the hash of the HASH algorithm is determined by the different values of the join field of the small table using the hash algorithm.

It also directly affects the performance of HASH JOIN. For example, the self-increasing sequence is a good HASH connection field. The hash of large tables will take up too much temporary table space, which needs to be noted.

Of.

Reference: UNIX system programming manual

Block-Based Join Algorithms (mariadb)

Multi Range Read Optimization (mariadb)

Data structure (C language version)

MySQL 5.7 Reference Manual

Write code randomly and sequentially

Click (here) to collapse or open

/ *

> File Name: seqsca.c

> Author: gaopeng

> Mail: gaopp_200217@163.com

> Created Time: Wed 21 Dec 2016 02:24:04 PM CST

* * /

# include

# include

# include

# include

# include

# include

# include

# include

# include

# define _ _ USE_GNU 1

Static int iTunes 1

Void eprintf (const char* outp)

{

Write (STDOUT_FILENO,outp,strlen (outp))

ITunes +

Exit (I)

}

Void ffprintf (const char* outp)

{

Write (STDOUT_FILENO,outp,strlen (outp))

}

Void eperror (void)

{

Perror ("error")

ITunes +

Exit (I)

}

Int main (int argc,char* argv [])

{

Int sca [20001]; / / sca [0] not used

Char * buffer=NULL

Char * ct=NULL

Int i

Int openwflags = O_CREAT | O_WRONLY | O_EXCL; / / | O_SYNC no sync used os buffer max performance for wirte

Int openrflags = O_RDONLY | OneDIRECT; / / open use o_direct not use os cache disable preread

Int fdr

Int fdwse,fdwsc

Off_t file_size

Time_t t1

Time_t T2 Splash time used return

Int m

Time (& T1)

Srand (T1)

If (argc! = 2 | | strcmp (argv [1], "--help") = = 0)

{

Eprintf (". / tool readfile\ n")

}

Buffer=valloc (4096); / / one block 4K allocate aligned memory

/ / init data array

For (iTunes 1)

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report