In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-17 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)05/31 Report--
This article mainly explains "how to understand the join function of MySQL". The content in the article is simple and clear, and it is easy to learn and understand. Please follow the editor's train of thought to study and learn "how to understand the join function of MySQL".
Text
In the daily database query, we often have to join multiple tables to get the merged data of multiple tables at one time, which is the join syntax of the database. Join is a very common operation to merge two datasets in the data field. If you know more about it, you will find that both MySQL,Oracle,PostgreSQL and Spark support this operation. The protagonist of this article is MySQL, which, if not specified below, takes the join of MySQL as the subject. On the other hand, Oracle, PostgreSQL and Spark can be regarded as large boss, and their algorithm optimization and implementation of join is better than that of MySQL.
MySQL's join has many rules, which may be a little careless. a bad join statement may not only lead to a full table query against a table, but also affect the database cache, resulting in most of the hot data being replaced, dragging down the performance of the whole database.
Therefore, the industry has summed up many specifications or principles for MySQL join, such as small tables driving large tables and prohibiting join operations with more than three tables. Next we will introduce the algorithm of MySQL join in turn, compare it with the join implementation of Oracle and Spark, and interspersed it with answers to why the above specifications or principles are formed.
For the implementation of join operation, there are probably three common algorithms: Nested Loop Join (cyclic nested join), Hash Join (Hash join) and Sort Merge Join (sort merge join), each of which has its own advantages and disadvantages and applicable conditions, which we will introduce in turn.
Nested Loop Join implementation in MySQL
Nested Loop Join is a scan-driven table. Every time a record is read, the corresponding data is queried in the driven table according to the index on the associated field of the join. It is suitable for scenarios with a small subset of connected data, and it is the only algorithm implementation of MySQL join, which we will explain in detail next.
There are two variants of Nested Loop Join algorithm in MySQL, which are Index Nested-Loop Join and Block Nested-Loop Join.
Index Nested-Loop Join algorithm
Next, let's initialize the relevant table structure and data.
CREATE TABLE `t1` (`id` int (11) NOT NULL, `a` int (11) DEFAULT NULL, `b` int (11) DEFAULT NULL, PRIMARY KEY (`id`), KEY `a` (`a`) ENGINE=InnoDB; delimiter;; # define stored procedures to initialize T1 create procedure init_data () begin declare i int; set iSecret1; while (I = 100)
Join_buffer is not infinite, it is controlled by join_buffer_size, and the default value is 256K. When the data to be stored is too large, it can only be stored in segments, and the whole execution process becomes:
Scan table T2 and deposit qualified rows in join_buffer. Because of its limited size, when it is full at 100 rows, perform the second step.
Scan table T1 and compare each row of data with the data in join_buffer. If the join condition is met, it will be put into the result set.
Clear join_buffer
Perform the first step again until all the data is scanned, and because there are 500 rows of data in the T2 table, it is repeated a total of 5 times.
This process reflects the origin of Block in the name of the algorithm, which is divided into blocks to perform join operations. Because the data of table T2 is divided into five times and stored in join_buffer, table T1 has to be scanned five times.
All saved in 5 times in memory operation 10000 * 50010000 * (100 * 5) scan lines 10000 + 50010000 * 5 + 500
As shown above, compared with table data that can all be stored in join_buffer, the number of memory judgments does not change, which is the product of the number of rows of two tables, that is, 10000 * 500. however, the driven table will be scanned many times, and each time it is saved, the driven table will be scanned again, affecting the final execution efficiency.
Based on the above two algorithms, we can draw the following conclusion, which is also the specification of most MySQL join statements on the Internet.
The join operation can be used when there is an index on the driven table, that is, when the Index Nested-Loop Join algorithm can be used.
Both the Index Nested-Loop Join algorithm and Block Nested-Loop Join use a small table as the driver table.
Because the time complexity of the above two join algorithms is at least first-order with the number of rows involved in the table, and costs a lot of memory space, it is understandable that the Ali developer specification strictly forbids join operations with more than three tables.
But these two algorithms are only one of the join algorithms, and there are more efficient join algorithms, such as Hash Join and Sorted Merged join. Unfortunately, neither of these two algorithms is currently available in the mainstream version of MySQL, while Oracle, PostgreSQL and Spark are supported, which is the reason for online complaints about the weakness of MySQL (MySQL version 8.0 supports Hash join, but 8.0 is not the mainstream version yet).
In fact, the Ali developer specification also forbids join operations with more than three tables when migrating from Oracle to MySQL because the join operation performance of MySQL is too poor.
Hash Join algorithm
Hash Join is a scan-driven table, which uses the associated fields of join to establish a hash table in memory, then scans the driven table, reads each row of data, and finds the corresponding data from the hash table. It is a common way for big data set join operation, which is suitable for scenarios where the amount of data driving tables is small and can be put into memory. It can provide the best performance for large tables without indexes and parallel queries. Unfortunately, it only applies to scenarios with equivalent connections, such as on a.id = where b.a_id.
It is also the statement of the above two tables join, and its execution process is as follows
Take out the qualified data from the driver table T2, hash the join field values of each row, and then store them in a hash table in memory.
Traverse the driven table T1, fetch a row of eligible data, hash its join field value, and find a match in the hash table in memory. If found, it becomes part of the result set.
It can be seen that this algorithm is similar to Block Nested-Loop Join, except that the unordered Join Buffer is changed into hash table hash table, so that data matching no longer needs to traverse all the data in join buffer, but directly through hash to obtain matching rows with a time complexity close to O (1), which greatly improves the join speed of the two tables.
However, due to the characteristics of hash, the algorithm can only be applied to equivalent connection scenarios, and other connection scenarios can not use this algorithm.
Sorted Merge Join algorithm
Sort Merge Join first sorts the two tables according to the associated fields of the join (if they have already been sorted, for example, there is no need to sort them again if there is an index on the field), and then merges the two tables once. If the two tables are already sorted, there is no need to sort when performing a sort merge join, and the performance of Merge Join is better than Hash Join. Merge Join can be adapted to non-equivalent Join (>, =
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.